XO Docs

Fable 5 vs Opus 4.8: A Coding-Agent Evaluation

Comparing Claude Fable 5 and Claude Opus 4.8 on real engineering tasks with a reproducible harness — gated success, run observability, captured diffs, and blind judging — cut short when Fable was suspended before the harder tasks ran.

A head-to-head between Claude Fable 5 and Claude Opus 4.8 on real engineering work, run through a reproducible harness built to make the comparison actually mean something. The hard part was never running the models — it was removing the four things that quietly make most model comparisons wrong.

Incomplete by force. Of 10 planned tasks, only 4 (easy-to-medium) ran before Claude Fable 5 was suspended on June 12, 2026 under a US export-control directive. The harder, long-horizon tasks — the ones built to separate the two models — never ran. This is not a final verdict. It is a snapshot of where Fable and Opus stood when one of them vanished mid-experiment, with the early signal pointing toward Fable.

Why "run both and eyeball it" fails

Handing two models the same task and picking the nicer output gives a confident answer that is usually wrong. Four problems sink that approach, and the harness is built to remove each one:

  • Contamination — if a model trained on the fix, you're testing memory, not skill.
  • No objective "done" — "I prefer this output" is a mood, not a measurement.
  • Inconsistent help — nudging a struggling model contaminates what you're measuring.
  • Bias — knowing which model wrote what tilts the scoring.

The core idea

A task is a base commit + a prompt + a gate — a test that is red before the work and green after it.

The model gets a real codebase at a known starting point and a description of the symptom, never the fix. It is "done" only when the gate test, which failed at the start, now passes — and the harness re-runs that gate itself rather than trusting the model's claim. Everything else is machinery around that one idea, and the harness is reusable: the same engine can be pointed at a different project or a different pair of models without changing how it works.
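The red-before, green-after rule can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the `Task` fields, `gate_passes`, `run_task`, and the `agent` callable are all hypothetical names, and the gate is assumed to be any command that exits 0 only when the gate test passes.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: a starting point, a symptom, and a gate."""
    base_commit: str          # known starting point in the repository
    prompt: str               # describes the symptom, never the fix
    gate_cmd: Sequence[str]   # assumed to exit 0 only when the gate test passes


def gate_passes(task: Task, workdir: str) -> bool:
    """Run the gate command in workdir; exit code 0 counts as green."""
    result = subprocess.run(list(task.gate_cmd), cwd=workdir, capture_output=True)
    return result.returncode == 0


def run_task(task: Task, workdir: str, agent: Callable[[Task, str], None]) -> bool:
    """Return True only if the gate was red before the agent ran and green after."""
    if gate_passes(task, workdir):
        # A gate that is already green proves nothing about the model's work.
        raise ValueError("gate is already green at the base commit; task is invalid")
    agent(task, workdir)               # the model edits files in workdir
    return gate_passes(task, workdir)  # re-checked here, not self-reported
```

The point of the design is that `run_task` never reads the agent's own claim of success; the only thing that counts is the gate's exit code after the agent has finished.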

What the harness measures

Gates (the objective floor). Before each run, the harness confirms the task's test fails — proof the problem is real. After the run, it re-runs the gate and the full test suite to catch anything the fix broke elsewhere. Efficiency is only ever compared across runs that passed, so a model can't look cheap by failing fast.
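Conditioning efficiency on success might look like the sketch below. The `RunResult` record and its field names are assumptions made for illustration; the only idea taken from the text is that failed runs are excluded before any average is computed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class RunResult:
    """Hypothetical per-run record."""
    model: str
    gate_passed: bool    # the task's gate test is green after the run
    suite_passed: bool   # the full test suite is still green
    tool_calls: int


def mean_tool_calls_on_success(runs: List[RunResult], model: str) -> Optional[float]:
    """Average tool calls over runs that passed both the gate and the full suite."""
    passed = [
        r.tool_calls
        for r in runs
        if r.model == model and r.gate_passed and r.suite_passed
    ]
    # No passing runs means there is no efficiency figure to report.
    return mean(passed) if passed else None
```

A run that gives up after two tool calls never enters the average, so quitting early cannot make a model look efficient.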

Run observability. Every run executes through Claude Code automatically, with no human in the loop and everything held identical except for the model. The harness records tokens used, how many tool calls the model made and which tools, the number of back-and-forth turns, retries, and time taken. Each task runs 5× per model, because these models are non-deterministic — one run is an anecdote, five gives you a real average and spread.
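Turning five runs into an average and a spread can be sketched as follows. The metric names mirror the list above, but the dictionary layout and the `summarise` helper are hypothetical, and the sample standard deviation is just one reasonable choice of spread.

```python
from statistics import mean, stdev
from typing import Dict, List

# Metric names follow the prose above; the exact keys are an assumption.
METRICS = ("output_tokens", "tool_calls", "turns", "retries", "seconds")


def summarise(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return the mean and sample standard deviation of each metric across runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    summary = {}
    for name in METRICS:
        values = [run[name] for run in runs]
        summary[name] = {"mean": mean(values), "stdev": stdev(values)}
    return summary
```

Reporting the spread next to the mean shows whether a gap between two models is larger than the run-to-run noise of either one.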

Captured changes. Every solution is saved as a diff — the exact code change the model made. That gives two independent ways to judge quality: a human can apply the change and read it line by line, and an automated reviewer can score it. When the two disagree, that gap is a signal worth chasing.

Blind, independent, verified judging. A separate model (GPT-5.5, on neither side of the contest) scores every solution against a fixed rubric, with all identities stripped and the order shuffled so it can't tell which model wrote what. And the judge is verified: whenever its verdict hinges on a factual claim about the code, we open the actual change and check. A judge you don't verify is just a more confident guess.
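The blinding step can be illustrated with a short sketch. The `blind` function and the "Solution A/B" labels are hypothetical; the sketch only covers replacing model names with neutral labels and shuffling the order, and assumes the diffs themselves contain no identifying text.

```python
import random
from typing import Dict, List, Tuple


def blind(
    solutions: Dict[str, str], rng: random.Random
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Return (label, diff) pairs in shuffled order plus a private label-to-model key."""
    items = list(solutions.items())   # (model name, diff text)
    rng.shuffle(items)                # order no longer hints at authorship
    labels = [f"Solution {chr(ord('A') + i)}" for i in range(len(items))]
    entries = [(label, diff) for label, (_, diff) in zip(labels, items)]
    key = {label: model for label, (model, _) in zip(labels, items)}
    return entries, key
```

Only `entries` would be shown to the judge; `key` stays with the harness and is used to map scores back to models once judging is complete.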

What we tested it on

We ran the comparison on Click (pallets/click), a widely used open-source Python library for building command-line tools. It was a deliberate choice: it's real, production-grade code that thousands of projects depend on, it sets up quickly, and it has a fast, thorough test suite — which gives us an objective way to check whether a fix actually worked.

Rather than inventing problems, we pulled real, recently merged changes from Click's own history: actual bug fixes and features that had shipped. For each one, we reverted the change to recreate the original problem, kept that change's test as the gate, and withheld the real fix. Recency was the point: a recent change is far less likely to be in either model's training data, so we were testing problem-solving, not memorization. This mirrors how standard coding benchmarks like SWE-bench are built, but run as a controlled head-to-head between two specific models rather than a public leaderboard. (The harness also supports tasks written from scratch or drawn from private code, which is safer still against memorization.)

We described only the symptom to each model — "this crashes on empty input," never "change line 40." Then we picked four tasks, each a different kind of engineering work:

  • a small, localized bug fix
  • a multi-file feature addition
  • a debug-from-a-traceback task (the model got only the failing test output, no description)
  • a deliberately open-ended design task with no single right answer

What we found

Result (4 completed tasks, conditioned on success) | Claude Fable 5              | Claude Opus 4.8
Gated tasks passed                                 | all                         | all
Avg tool calls / task                              | ~18.4                       | ~19.6
Avg output tokens / task                           | ~8.3k                       | ~8.4k
Multi-file feature task                            | ~23 calls · ~8.9k tok       | ~28 calls · ~11.6k tok
Solution style (change review)                     | smaller, scope-disciplined  | broader rewrites
Figure: Tool calls per task (lower is more efficient; average of 5 runs)

Task                 | Fable 5 | Opus 4.8
Localized bug fix    | 5.6     | 6.2
Multi-file feature   | 23      | 28
Debug from traceback | 11.8    | 10.2
Open-ended design    | 33.4    | 34.2

Figure: Output tokens per task (lower is more economical; average of 5 runs)

Task                 | Fable 5 | Opus 4.8
Localized bug fix    | 1.4k    | 1.4k
Multi-file feature   | 8.9k    | 11.6k
Debug from traceback | 5.2k    | 4.4k
Open-ended design    | 17.6k   | 16.2k

The efficiency edge is real but not uniform. Fable's advantage is decisive on the multi-file feature and slight on the localized fix; the two are effectively even on the open-ended task; and on the debug task Opus actually did the leaner job. Averaged across the four, Fable comes out modestly ahead on both tool calls and tokens — a directional lead, not a rout.
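The four-task averages can be checked directly against the per-task chart values. The sketch below uses only those published figures; the variable names are ours, and because the chart values are themselves rounded, the results match the summary numbers only to within rounding.

```python
from statistics import mean

# Per-task averages read from the charts, in this order:
# localized bug fix, multi-file feature, debug from traceback, open-ended design.
tool_calls = {
    "Fable 5": [5.6, 23, 11.8, 33.4],
    "Opus 4.8": [6.2, 28, 10.2, 34.2],
}
output_tokens_k = {
    "Fable 5": [1.4, 8.9, 5.2, 17.6],
    "Opus 4.8": [1.4, 11.6, 4.4, 16.2],
}

# Unweighted mean across the four tasks, per model.
avg_calls = {model: mean(values) for model, values in tool_calls.items()}
avg_tokens_k = {model: mean(values) for model, values in output_tokens_k.items()}

for model in tool_calls:
    print(f"{model}: {avg_calls[model]:.2f} tool calls, {avg_tokens_k[model]:.2f}k output tokens")
```

This gives roughly 18.45 versus 19.65 tool calls and 8.28k versus 8.40k output tokens, consistent with the ~18.4 / ~19.6 and ~8.3k / ~8.4k reported in the summary: a lead of about one tool call and about a hundred tokens per task.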

On easy-to-medium tasks, both models clear the gates, so pass/fail alone doesn't separate them, and the rubric scores on this set were close. The differentiating signal lives in efficiency, code discipline, and long-horizon behavior. On the first two, the measured results leaned toward Fable; on the third, the early read from hands-on use also favored Fable, though the completed tasks were too short to test it.

Fable did more with less. Averaged over the four tasks, it reached the same passing results with fewer tool calls and fewer output tokens. The gap was clearest on the multi-file feature, where it finished in roughly 23 tool calls and 8.9k output tokens against Opus's 28 and 11.6k.

Fable's solutions were cleaner. Reviewing the actual code changes, Fable consistently made the smallest change that solved the problem, with surgical, in-scope edits, while Opus reached for broader rewrites that sometimes did more than the task required. Smaller changes mean less code to review and less surface area to break.

On the open-ended task, two independent judges split — expected once a task has no checkable answer, and a reminder that subjective work doesn't resolve to a single number.

Where Fable fit best. Our strongest read — from hands-on use, and the hypothesis the harder tasks were built to confirm — is that Fable 5 suited long-running, low-supervision work: continuous, multi-step tasks with little or no human in the loop, where staying coherent over a long horizon matters more than any single edit. The completed tasks were too short to stress this directly; the second wave of harder, longer tasks was designed to prove it. Fable was suspended before we could run a single one.

Read this as directional, not decided. 4 tasks, 1 codebase, 5 runs each. The long-horizon tasks, where the gap should widen, never ran. Dollar costs were computed only as list-price equivalents, and at these task sizes Opus was the cheaper model per task because Fable carries a higher per-token price. The comparison here is therefore reported in raw tokens, tool calls, and code quality rather than spend.

Built to reuse

The harness was designed so the experiment isn't a one-off. Swapping the two models or pointing it at a different codebase is a configuration change, not a rewrite, and adding a new task just means supplying a prompt and a test. Whenever a capable long-horizon model is available, the harder tasks we never reached can run unchanged.
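What "a configuration change, not a rewrite" could mean in practice is sketched below. The `ExperimentConfig` class, its fields, and the model strings are hypothetical placeholders, not the harness's real configuration format or real API model identifiers.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment description; every field name is an assumption."""
    repo: str                   # codebase under test
    models: Tuple[str, str]     # the two contestants (placeholder names)
    task_ids: Tuple[str, ...]   # each task supplies a prompt and a gate test
    runs_per_task: int = 5      # repeated because the models are non-deterministic


base = ExperimentConfig(
    repo="pallets/click",
    models=("fable-5", "opus-4.8"),
    task_ids=("localized-bug-fix", "multi-file-feature"),
)

# Swapping in a different pair of models leaves the repo, tasks, and run count untouched.
rerun = replace(base, models=("model-a", "model-b"))
```

Keeping the models, codebase, and task list as plain data is one way to let the same engine rerun the unfinished tasks without touching the measurement logic.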

The comparison was cut short, but the method wasn't. The runs are saved and the harness is reusable — the deliverable was always the rigor of the measurement, not a single verdict.
