Claude Fable 5 Tops Harvey's Legal Agent Benchmark

Anthropic's newest model just took the top spot on Harvey's own legal benchmark, and the honest number underneath the headline is worth reading closely.

The Leveraged Years AI Workflows

Claude Fable 5, released by Anthropic on June 9, 2026, scored 13.3% on Harvey's open-source Legal Agent Benchmark, the highest result ever recorded on that test and up from Opus 4.8's prior top score of 10.4%. The benchmark ranks foundation models, not legal software products, and uses an all-pass standard: missing one required criterion fails the entire task. The record 13.3% therefore still means the model completes fewer than one in seven complex legal workflows flawlessly.

What actually happened

On June 9, 2026, Anthropic released Claude Fable 5, the most capable model in the Claude family. Within hours, Harvey updated its open, publicly documented Legal Agent Benchmark to show Fable 5 at the top of the leaderboard with a score of 13.3%, an all-time high on that test. The previous best belonged to Claude Opus 4.8, which had recently become the first model to clear the 10% mark. Harvey reported Opus 4.8 at 10.4% (Harvey), and Artificial Lawyer described Opus 4.8 as the first model to crack 10% before Fable 5 pushed the record to 13.3% (Artificial Lawyer).

That is a real jump of roughly three percentage points, from 10.4% to 13.3%, on a benchmark that is brutally hard to move. It is also a result that gets misread constantly, so it is worth being precise about what the number does and does not say.
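For readers who want to check the arithmetic behind "roughly three points" and "fewer than one in seven," here is a minimal Python sketch. It uses only the two scores reported above. The variable names are illustrative, and the "one in N" figure is simply the reciprocal of the score, not a number Harvey published.

```python
# Scores as reported in this article, in percent.
opus_score = 10.4   # Claude Opus 4.8, prior leader
fable_score = 13.3  # Claude Fable 5, new leader

# Absolute gain in percentage points.
point_gain = fable_score - opus_score

# Gain relative to the prior best score.
relative_gain = point_gain / opus_score

# Reciprocal of the score: "flawless on about one task in N".
one_in_n = 100 / fable_score

print(f"gain: {point_gain:.1f} percentage points")
print(f"relative gain over prior best: {relative_gain:.0%}")
print(f"flawless on about 1 in {one_in_n:.1f} tasks")
```

The absolute gain is small while the relative gain is sizable, which is why the same result can be described as both a modest score and a meaningful step.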

Read the headline carefully: models, not products

A quick glance might suggest Fable 5 outperforms Harvey the product. Here is the accurate version. Harvey built and publicly documented the Legal Agent Benchmark. The leaderboard ranks foundation models, the underlying engines like Claude, that power legal AI tools. It is not a head-to-head between Anthropic's model and Harvey's own agent product.

So the honest reading is this. Claude Fable 5 topped Harvey's own benchmark, beating every other frontier model Harvey tested, including the prior leader, Opus 4.8. It did not beat "Harvey," the platform. In fact, Harvey runs on models like this one and immediately made Fable 5 available to its customers. If anything, the story is a vendor publishing a test on which its own preferred supplier just set a record. That is not a knock on the result. It is a reason to read the framing with a lawyer's eye.

If you want the practical version of this distinction, we walk through it in [Harvey or Claude: a routing rule for your firm's AI](/ai-workflows/harvey-or-claude-law-firm-ai-routing-rule-2026).

What the benchmark actually measures

The Legal Agent Benchmark, or LAB, is not a multiple-choice quiz. Per Mark Pike, who leads Claude for Legal, and Harvey's own description, it runs more than 1,200 realistic tasks across 24 practice areas, graded against expert rubrics under what Harvey calls an all-pass standard. Missing a single required criterion means the whole task is scored as a failure (Harvey).

That design is the key to reading the score. LAB is not asking whether the model was mostly helpful. It is asking whether the model completed an entire, complex legal workflow to a standard where a partner would not need to fix anything material. Under that bar, 13.3% is the fraction of full tasks the model got completely right.

Harvey also ran Fable 5 on BigLaw Bench, its internal proprietary test, where the model scored 93.4% overall, a new high for the Anthropic family. The two numbers are not in conflict. BigLaw Bench credits partial, criterion-by-criterion performance, so it lands in the 90s. LAB demands a clean sweep of every criterion on every task, so it lands in the low teens. Same model, two very different questions.
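To see how one set of results can produce both a high and a low number, here is a minimal Python sketch. Everything in it is an assumption for illustration: the task names and pass/fail values are invented, and the pooled partial-credit formula is a simplification, not Harvey's actual scoring method for either benchmark.

```python
from typing import Dict, List

# Hypothetical rubric results: task name -> one boolean per required criterion.
# These values are invented for illustration and are not Harvey's data.
results: Dict[str, List[bool]] = {
    "draft_nda":        [True, True, True, True, True],
    "review_redline":   [True, True, False, True, True],
    "fund_waterfall":   [True, False, True, True, True],
    "term_sheet_check": [True, True, True, True, False],
}

def partial_credit_score(results: Dict[str, List[bool]]) -> float:
    """Share of individual criteria met, pooled across all tasks."""
    met = sum(sum(criteria) for criteria in results.values())
    total = sum(len(criteria) for criteria in results.values())
    return met / total

def all_pass_score(results: Dict[str, List[bool]]) -> float:
    """Share of tasks in which every required criterion was met."""
    passed = sum(all(criteria) for criteria in results.values())
    return passed / len(results)

print(f"partial credit: {partial_credit_score(results):.1%}")  # 17 of 20 criteria
print(f"all-pass:       {all_pass_score(results):.1%}")        # 1 of 4 tasks
```

In this toy data the model meets 17 of 20 criteria but fully completes only 1 of 4 tasks, because a single miss anywhere in a task zeroes out that task under the all-pass rule.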

What 13.3% really tells a working lawyer

Strip away the leaderboard drama and here is the useful takeaway. On complete, multi-step legal matters graded to a no-mistakes standard, the best model available finishes fewer than one in seven tasks flawlessly. Harvey's own Head of Applied Research, Niko Grupen, framed it as "a meaningful step up" for the complex, multi-step matters customers run every day, with particular strength in drafting and long-horizon agent work.

Both of those things are true at once. The model is genuinely better at first-draft contracts, markup analysis, and tracing defined terms across large document sets. Harvey's evaluators singled out Fable 5's ability to catch off-market provisions, term-sheet deviations, and internal inconsistencies. And it still stumbles on complex quantitative work like multi-step tax calculations and fund waterfall modeling, which Harvey notes is consistent with what it sees across frontier models.

For adoption, that points to a clear posture. Use these agents where a strong, structured first draft saves you an hour and you were always going to review the output anyway. Do not hand them a full workflow and assume the last mile is covered. The whole point of the all-pass standard is that the last mile is exactly where these models still miss.

There is a genuinely optimistic reading buried in the trend, though. If scores on benchmarks like this keep rising, and especially if the gains start to accelerate rather than arrive in steady increments, the low-teens score that looks unimpressive today becomes the early part of a steep curve. That is worth watching, not because the steep part has arrived, but because the direction has been consistent.

How to use this in your practice this week

You do not need to enable Fable 5 to act on this. Use it as a prompt to pressure-test how your team already relies on AI. A simple framework:

If you want the underlying skill of writing those all-pass criteria and reviewing agent output like an editor, that is the core of what we teach in [The Leveraged Attorney](/leveraged-attorney). And if you are not sure where you sit on the AI curve yet, our [two-minute quiz](/quiz) points you to the right starting point.

The fine print worth knowing before you enable it

One practical caveat that got less attention than the score. Harvey is offering Fable 5 as opt-in rather than on by default, and flagged data terms that differ from its standard customer agreements. Per Harvey, Anthropic will retain inputs, outputs, and documents and may review that data for safety reasons, generally deleting it after 30 days unless a longer period is legally required. There is no regional processing option for Fable 5, so all data is processed in the United States (Harvey).

For a lot of matters that is a non-issue. For privileged, cross-border, or highly sensitive work, it is a conversation to have with your firm's risk and data-governance people before anyone flips the switch. The best model on the leaderboard is not automatically the right model for a given client's data.
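One way to make that conversation routine is a simple pre-enablement check. The sketch below is a hypothetical illustration in Python, not a Harvey or Anthropic feature: the `Matter` class, its fields, and `needs_risk_review` are assumed names, and the three flags simply mirror the categories named above.

```python
from dataclasses import dataclass

@dataclass
class Matter:
    """Hypothetical record for a client matter; fields are illustrative."""
    name: str
    privileged: bool = False
    cross_border: bool = False
    highly_sensitive: bool = False

def needs_risk_review(matter: Matter) -> bool:
    """Return True if any of the flagged categories applies to the matter."""
    return matter.privileged or matter.cross_border or matter.highly_sensitive

# Example: a cross-border deal is flagged, a routine matter is not.
print(needs_risk_review(Matter("EU asset purchase", cross_border=True)))
print(needs_risk_review(Matter("Standard vendor NDA")))
```

A firm's real policy would likely weigh more factors, such as client consent terms and applicable data-transfer rules. The point of the sketch is only that the check happens before the model is enabled, not after.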

Where this fits in the bigger picture

Fable 5 topping LAB is one data point in a fast-moving race among legal AI agents. It rhymes with the broader push toward agents that execute end to end, which we cover in [the next generation of legal AI co-counsel](/ai-workflows/next-generation-cocounsel-legal-ai-agent-2026). The pattern to internalize is not "this tool won." It is that the ceiling on autonomous legal work is rising steadily, the honest scores are still low, and the firms that benefit are the ones building disciplined review habits now, while the models are clearly fallible and the stakes of over-trusting them are obvious.

Frequently Asked Questions

Did Claude Fable 5 actually beat Harvey's own product?

No. The Legal Agent Benchmark ranks foundation models, and Fable 5 topped that leaderboard by beating other models like Opus 4.8. Harvey built the benchmark and runs on models like Fable 5, so this is not a contest between Anthropic and Harvey's platform.

Is a 13.3% score good or bad?

It is the highest ever recorded on this test, and it is still low in absolute terms. The benchmark fails a task if the model misses even one required criterion, so 13.3% means the model completes fewer than one in seven full workflows flawlessly.

Why does Harvey report 93.4% on one test and 13.3% on another?

They measure different things. BigLaw Bench credits partial, criterion-by-criterion performance, which produces a high percentage. The Legal Agent Benchmark uses an all-pass standard where one miss fails the whole task, which produces a much lower number for the same model.

Should my firm switch to Fable 5 right away?

Not automatically. It tested strongest on drafting, redline review, and multi-document consistency, and weaker on quantitative modeling. It also carries different data-retention terms and US-only processing inside Harvey, so loop in your risk team before enabling it for sensitive matters.

Does this mean AI can replace associates now?

No. A clean-sweep rate in the low teens on complex tasks is strong evidence for keeping a lawyer in the loop, not removing one. The realistic use is faster first drafts and review support, with human sign-off on anything that matters.

Where can I verify these numbers myself?

Harvey published the score and methodology on its blog, and Artificial Lawyer reported the launch and the underlying benchmark details.

Informational tool analysis for working professionals, not legal, medical, or financial advice. AI tools do not replace your professional judgment.