The Incurious Agent
We gave two coding agents a steadily richer home — a README, an AGENTS.md contract, a project brief, a full scaffold, even a curated memory of how the codebase works — and measured what they actually opened. They read the surface and skip the substance. 84 controlled runs on environmental curiosity.
We spend enormous effort building agents a better home — contracts, project briefs, a structured memory of how a codebase works. So we ran a blunt test: hand a coding agent more and more of that context and watch what it actually reads. Across 84 controlled runs and two different agents, the answer is uncomfortable. They skim what sits on the surface and almost never open the deeper, hand-curated knowledge we prepared for them. The environment we carefully designed performs about the same as an empty folder — not because the design is wrong, but because nothing reads it.
Krish Bhimani, Ankit Dwivedi, Rohini Pedamkar, Suraj Sharma · XO Labs Inc. · June 2026
TL;DR: We ran two coding agents — OpenAI Codex and Claude (Sonnet) — against the same two tasks on a real FastAPI codebase, under seven environments that add progressively more project context, from an empty workspace up to a full scaffold with a seeded memory/ of how the project works. Three findings. First, more context did not buy better outcomes: every one of the 84 runs passed, so the only thing context moved was cost — and for Codex it moved cost up, by as much as 150%. Second, the two agents have opposite reading habits: Codex opens more files the more you give it; Claude reads the same four or five files no matter what's in the workspace. Third, and most telling, both agents fall off a depth cliff — they open top-level docs sometimes, but the curated memory/ we seeded specifically to help was read in 1 of 12 runs of the condition built around it. We call the missing capability environmental curiosity, and we think it, not raw model strength, is the binding constraint on structured agent workspaces.
1. A ladder of context
The test is deliberately boring, because boring is what makes it clean. We took a small, real FastAPI service (notes-api), pinned it to a single commit, and wrote two engineering tasks against it: an easy one (add a GET /notes/{id}/history endpoint) and a medium one (replace naive search with SQLite FTS5). Each task is gated — a test that is red before the work and green after — and the harness re-runs the gate itself rather than trusting the agent. A do-nothing submission scores 0.33; a real solution scores 1.0.
Then we built a ladder of seven environments, each adding one more layer of standing context to the same workspace:
- E0 — empty repo, code only.
- E1 —
+ README. - E2 —
+ AGENTS.md(a contract telling the agent how the project works). - E3 —
+ PROJECT.md(a project brief). - E4 — the full old scaffold (contract, brief, objectives, plan, progress, an empty
memory/tree). - E5 — the full scaffold plus a seeded
memory/: hand-written episodic and procedural notes about exactly this codebase and these kinds of tasks. - E6 — a leaner "DOX" scaffold (a small
AGENTS.mdrail, a childsrc/AGENTS.md, a split.xo/, on-demand memory).
We ran every environment with both agents, on both tasks, three times each: 2 agents × 2 tasks × 7 environments × 3 replications = 84 runs. The model, the prompt, the task, and the codebase are held fixed within each agent. The only thing that changes across a task's runs is how much project context is sitting in the folder when the agent wakes up.
2. More context, more spending — same result
Start with cost. Here is the average input-token bill per run, by environment, for each agent. Every bar below also passed its gate — success was pinned at 100% across all 84 runs, so nothing here is bought with a worse outcome.
Two very different shapes, one shared punchline. Codex climbs steeply: from 228k tokens in the empty workspace to 570k in the full scaffold (E4) — a +150% increase for context the task did not need. Claude is almost flat, and flat at a much higher floor: ~610k–680k everywhere, only +12% from E0 to E4. Adding structure barely moves Claude's bill in either direction.
The shared punchline is the success row that isn't drawn because it's a straight line: all 84 runs passed. So whatever the context did, it did not change the outcome — it changed only how many tokens were burned getting to the same green test. (For the curious: this is robust to caching. ~86% of Codex's input was served from cache, and the effect survives cache-adjustment; Claude's is cached even more heavily. Cheap tokens are still spent tokens, and they bought nothing.)
3. Two agents, two reading habits
Cost is a symptom. The cause is reading — how many of the files in the workspace the agent actually opens. Same ladder, but now counting distinct files read per run:
This is the same story told as behavior. Codex's reading scales with the workspace — 5 files when it's empty, 11–12 when it's full (a 2.2× swing). It is responsive to context: put a document in front of it and there's a good chance it opens it. Claude's reading is a flat line at four-and-a-half files — 4.3 to 4.7, a 1.08× swing, statistically indistinguishable across all seven conditions. Claude reads its handful of code files and gets to work; the scaffold may as well not exist.
Look closer and these two policies turn out to be two different cost engines. Plot every one of the 84 runs by how many files it opened (across) against how many tokens it spent (up):
The two clouds barely touch. Claude's runs collapse into a vertical stack at four or five files — it reads the same handful every time, yet its bill swings from 390k to over a million tokens. That cost is not coming from the workspace; it is intrinsic, the price of how much the model deliberates. Codex's runs fan out along a diagonal — its bill rides on how many files it opens, climbing rightward as it reads its way through whatever you put in the folder. One agent pays to read the room; the other pays to think in it. And — back to the first chart — neither way of spending bought a better result. Every dot, low or high, left or right, ended at the same passing test.
But both policies share a blind spot, and it only shows up when you stop counting files and ask which files get read.
4. The depth cliff
Group every readable file by how deep it sits in the environment's intended hierarchy — from the source code the agent must touch, out to the surface docs, out to the curated memory/ that was the whole point of the rich conditions — and measure how often each layer gets opened at least once:
This is the result the whole study is about. Both agents always read the code — they have to. From there it falls away. Codex opens the surface docs most of the time (AGENTS.md 67%, PROJECT.md 78%) and then drops to 28% for the curated memory/. Claude opens AGENTS.md in 20% of runs, PROJECT.md in 0%, and memory/ in 0% — a clean cliff straight off the edge.
The single sharpest number: E5 is the condition we built entirely around the seeded memory — episodic and procedural notes written specifically to brief the agent on this codebase. Across the twelve E5 runs of the two agents, the memory/ directory was opened in one of them. We wrote the agent a letter explaining the job; eleven times out of twelve, it never opened the envelope.
5. Environmental curiosity
Put the three charts together and a single trait explains all of them.
Environmental curiosity — an agent's tendency to seek out the parts of its environment that would help it, in proportion to how much they would help, rather than reading whatever happens to be in front of it (or nothing at all).
Our two agents fail this in mirror-image ways:
- Claude doesn't read the map. Its reading is invariant to the environment (a flat 4.5 files), so the contract, the brief, and the memory all go unopened. Context is free for Claude — and useless, because it's ignored. It pays a high flat token bill for its own reasoning and treats the workspace as decoration.
- Codex reads the signposts but never walks the path. It is curious about the surface — it opens the docs sitting at the root and pays for them in tokens — but its curiosity is shallow and undirected. It explores more without exploring deeper, so it pays the +150% bill of a rich environment while still skipping the curated knowledge that the richness was for.
Neither agent does the thing the environment was designed to reward: notice that a folder called memory/ full of notes about this exact task is the highest-value thing in the room, and go read it first. And so the punchline of §2 stops being a surprise. The carefully structured environment performs almost exactly like the empty one — because, functionally, nothing reads the structure. You cannot get credit for context the agent never consumes. The treatment was placed in the room; it was never delivered to the model.
This reframes a lot of "context engineering." Most of the effort goes into authoring better environments — sharper contracts, richer memory, cleaner scaffolds. Our data says the authoring is not the bottleneck. Consumption is. A perfect memory/ that is opened 8% of the time is, in expectation, 92% wasted, no matter how good the 8% is. The leverage is in getting the agent to navigate toward relevance — or, until agents do that on their own, in the environment pulling its own highest-value context into the agent's hands instead of waiting to be asked.
6. How to read this honestly
This is a small, controlled study and we report it as one. The caveats are not footnotes — they shape what you're allowed to conclude.
- Success was at the ceiling. Every run passed, which is exactly why we can talk about cost but not about benefit. These two tasks are well-specified enough that a bare agent already aces them, so there was never any outcome headroom for context to improve. The fair reading is "on tasks the agent can already do, context it doesn't read is pure overhead" — not "context never helps." The interesting experiment is the harder task where the bare agent fails and the memory would have saved it; that is the next one to run.
- A single session, a five-file repo. Episodic memory earns its keep across sessions ("I've worked here before"); every run here started cold and ended in one shot, so the design can't show memory's main payoff. And a tiny codebase gives curiosity little to be rewarded for — there is no deep module to discover. A larger, deeper repo is where a read-everything habit should start to really hurt, and where directed curiosity should start to really pay.
- Two agents, two tasks, three reps each. Enough to see a 2.5× cost spread and a clean depth cliff; not enough to rank the middle rungs of the ladder against each other. We're reporting the large, robust effects, not the fine ones.
- One blind spot is structural: Codex's transcripts carry no timing, so the original "does context help the agent orient faster" question is unanswerable on that side. We measure what gets read and what it costs, not how quickly.
This complicates our own earlier pilot. Our first, single-task pilot (How the Environment Affects Agent Performance and Token Cost) found richer context coming out cheaper — a hopeful early read. The fuller matrix here, with three replications and a functional gate, does not reproduce that headline: across both agents and both tasks, more context trends costlier, and the cause is now visible. We'd rather correct ourselves in public than keep a tidy story. The method held; the early number didn't generalize.
7. Why we think this is the real frontier
It is tempting to read these charts as a contest — Codex spends less, Claude reads less — and miss the thing they agree on. Both of the strongest coding agents we have, handed a workspace engineered to help them, left the most useful part of it unopened. The bottleneck on structured agent environments isn't the quality of the structure we can author; we can already author more than the agents will read. It's whether the agent is curious in the right direction — and today, neither reading-everything nor reading-nothing is.
That points the work in two directions, and we're pursuing both: building agents that triage their environment by expected value before they start (curiosity as a skill), and building environments that don't wait to be explored — that surface their own highest-relevance context into the agent's hands at the moment of need. The harness, the tasks, and all 84 runs are saved and reusable; the next round raises the task difficulty until the bare agent breaks, so that for the first time there's something for curiosity to win.
Fable 5 vs Opus 4.8: A Coding-Agent Evaluation
Comparing Claude Fable 5 and Claude Opus 4.8 on real engineering tasks with a reproducible harness — gated success, run observability, captured diffs, and blind judging — cut short when Fable was suspended before the harder tasks ran.
Others
Perspectives we keep returning to: the XO view, a view from Claude, and a view from GPT.