The Self-Sufficient Agent
We promised to raise the difficulty until the bare agent broke. We did, with harder tasks on a real 170-file service, and on these tasks it didn't break. With no documentation at all, two coding agents satisfied nine of nine non-obvious functional requirements, identically. On this kind of work, the bottleneck was never curiosity.
Last time we suspected the agent of not reading its environment. So we did what we said we'd do: we made the tasks hard enough that a bare agent should fail, on a real codebase big enough to get lost in. On these tasks the bare agent didn't fail. Handed no documentation at all, two coding agents (OpenAI's Codex and Anthropic's Claude) satisfied nine out of nine non-obvious functional requirements, and they did it identically, task for task. On this kind of work, the thing we kept trying to give them they already had.
Krish Bhimani, Ankit Dwivedi, Rohini Pedamkar, Suraj Sharma · XO Labs Inc. · June 2026
TL;DR: Our last study (The Incurious Agent) ended on a promise: raise the task difficulty until a bare agent breaks, so there's finally something for context to win. We did exactly that, with a real ~170-file FastAPI service and a set of landmine tasks, each carrying a hidden requirement that a naive solution silently violates (a do-nothing answer scores zero). The result reframes the program. On nine functional landmines (idempotency, replay-safety, atomic writes, deduplication, defensive parsing, registration invariants), a bare agent with no documentation scored a clean 9 / 9, and Codex and Claude were identical on every task. On these tasks the agents weren't failing for lack of context; there was no capability gap to fill. The only place a bare agent slips is a narrow band of project-specific format conventions, and that, not curiosity, is the one thing context actually moves here. A caveat we keep throughout: this is a result about these functional invariants on these repositories, not a verdict on coding agents in general.
1. The promise we made
The Incurious Agent left a loose end on purpose. Every task in it was easy enough that a bare agent already passed, so we could only ever measure cost, never benefit. We said the next step was to "raise the task difficulty until the bare agent breaks, so that for the first time there's something for curiosity to win."
So we changed three things. We moved off the toy notes-api and onto xo-cowork-api, a real ~170-file FastAPI control plane, deep enough to get lost in. We wrote landmine tasks: the prompt asks for something ordinary ("add an endpoint that lists events with paging"), but there is a hidden requirement underneath it that the obvious solution gets wrong. And we gated each one with a check that is red on a do-nothing submission and green only when the hidden requirement is actually met, validated both ways before we ran a single agent. These are not tasks you pass by showing up.
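The "validated both ways" step can be sketched as a small self-test of the gate itself. This is a minimal illustration under stated assumptions, not the harness used in the study: a submission is modelled as a plain dict of handlers, and every name here (`check_idempotent_create`, `gate_is_valid`, the three sample submissions) is hypothetical.

```python
from typing import Callable, Dict

# Hypothetical model: a submission is just a dict of named handlers.
Handlers = Dict[str, Callable]


def do_nothing_submission() -> Handlers:
    """The agent changed nothing, so no handlers exist."""
    return {}


def naive_submission() -> Handlers:
    """Looks right, but creates a new project on every call."""
    projects = []

    def create_project(name: str) -> int:
        projects.append(name)
        return len(projects)

    return {"create_project": create_project,
            "count_projects": lambda: len(projects)}


def reference_submission() -> Handlers:
    """Creating the same project twice returns the same id."""
    projects: Dict[str, int] = {}

    def create_project(name: str) -> int:
        if name not in projects:
            projects[name] = len(projects) + 1
        return projects[name]

    return {"create_project": create_project,
            "count_projects": lambda: len(projects)}


def check_idempotent_create(handlers: Handlers) -> bool:
    """Green only when the hidden requirement (idempotent create) holds."""
    create = handlers.get("create_project")
    count = handlers.get("count_projects")
    if create is None or count is None:
        return False  # nothing implemented: red
    first = create("alpha")
    second = create("alpha")
    return first == second and count() == 1


def gate_is_valid(check: Callable[[Handlers], bool],
                  do_nothing: Handlers, reference: Handlers) -> bool:
    """A usable gate is red on do-nothing and green on the reference."""
    return (not check(do_nothing)) and check(reference)
```

In this sketch the naive submission also fails the check, which is the property that makes a task a landmine rather than a participation test.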
The landmines come in two flavours, and the distinction is the whole story:
- Functional landmines (9). A behavioral invariant the agent has to infer: make project creation idempotent; make an event log replay-safe so a redelivered event isn't double-recorded; write files atomically; don't double-count a running total; parse an auth header defensively. Nothing in the prompt names these. A naive solution looks right and is wrong.
- Format landmines (5). A project's own house wire-format: the exact shape of an error body, whether a model rejects unknown fields, the envelope a list endpoint returns. These are conventions, not logic, and unknowable from the code alone.
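To make one functional landmine concrete, here is a minimal sketch of an atomic file write, using the common write-to-a-temp-file-then-rename approach. It is an illustration of the invariant, not the reference answer from the study; the function name `write_atomic` is an assumption.

```python
import os
import tempfile


def write_atomic(path: str, data: str) -> None:
    """Replace `path` with `data` so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The naive version, `open(path, "w").write(data)`, passes any test that only reads the file afterwards, which is why an invariant like this has to be checked deliberately.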
2. The tasks have teeth, and the agent has the logic
Here is what a bare agent (empty workspace, no README, no contract, no memory) solves on each kind of task. A do-nothing answer scores 0% on every one of them, so anything above the floor is real work.
Two bars at the ceiling, two on the floor, and the two agents stacked exactly on top of each other. On the functional work the bare agent is already perfect; on the format conventions it is almost helpless. The gap between them is not difficulty (both kinds of task were built to the same do-nothing-fails standard); it is knowability. Logic you can reason out from the code. A house convention you cannot.
The functional result is not a fluke of a forgiving grader. On the replay-safety task, Codex, unprompted, reached for a database uniqueness constraint to make the ingest idempotent, reasoning out loud that "the upstream source is at-least-once," a solution more robust than the reference answer we had written. That is not an agent that needs to be told how to write a replay-safe log. That is an agent that already knows, and improves on the brief.
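The uniqueness-constraint approach can be sketched in a few lines. This is not Codex's actual code or the study's schema; it is a minimal illustration using SQLite from the standard library, and the table name, column names, and the `ingest` function are assumptions.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create an in-memory event log with a uniqueness constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " event_id TEXT NOT NULL UNIQUE,"  # the constraint enforces replay-safety
        " payload TEXT NOT NULL)"
    )
    return conn


def ingest(conn: sqlite3.Connection, event_id: str, payload: str) -> bool:
    """Record an event; return False if it was a redelivery."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the UNIQUE constraint
    # caused the insert to be ignored.
    return cur.rowcount == 1
```

Putting the guarantee in the database rather than in a check-then-insert in application code means two concurrent deliveries of the same event cannot both be recorded.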
3. The same picture, task by task
Pool the nine functional tasks into one bar and you might wonder if a couple of easy ones are carrying it. They aren't. Here is every landmine, one square each, for both agents: green if the bare agent satisfied the hidden requirement on its own, orange if it missed.
Nine green, five orange, and the Codex row is a carbon copy of the Claude row. Two coding agents from two different labs, built by different teams on different model families, given a hard problem with no instructions, agree task for task on what they can and can't do. That convergence is the result worth dwelling on. It says the functional competence here is not a quirk of one model you could engineer around: it held identically for two agents from two labs. On tasks like these, you are not going to document your way past it, because there is nothing past it to reach. (Whether it holds beyond this kind of task is a separate question we take up at the end.)
4. The one thing that's left
Look back at the orange squares. They are not random failures; they are all the same kind of thing: a wire-format convention that lives in this project's head and nowhere in its code. The error body has to be {"detail": {"error": ...}} and not a bare string. The request model has to reject unknown fields rather than ignore them. A list endpoint has to return {total, offset, items} and not a raw array. An agent cannot reason these out, because there is nothing to reason from; they are arbitrary house style. On these tasks this is the entire surface that repository context can move: not capability, just conformance to local convention.
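The three conventions named above can be written out as plain functions. This is a framework-free sketch, not the project's code: the helper names, the `ALLOWED_FIELDS` set, and the use of a message string inside "error" are all assumptions, since the article does not say what that field holds.

```python
from typing import Any, Dict, List

# Hypothetical request fields; the real model's fields are not given.
ALLOWED_FIELDS = {"name", "description"}


def error_body(message: str) -> Dict[str, Any]:
    """House error shape: nested under detail.error, never a bare string."""
    # Assumption: the value under "error" is a message string.
    return {"detail": {"error": message}}


def parse_request(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown fields instead of silently ignoring them."""
    unknown = sorted(set(raw) - ALLOWED_FIELDS)
    if unknown:
        raise ValueError("unknown fields: " + ", ".join(unknown))
    return dict(raw)


def list_envelope(rows: List[Any], offset: int, limit: int) -> Dict[str, Any]:
    """Wrap a page of results in {total, offset, items}, not a raw array."""
    return {
        "total": len(rows),
        "offset": offset,
        "items": rows[offset:offset + limit],
    }
```

None of these choices is more correct than its alternative, which is why they cannot be inferred: a bare-string error or a raw array would be just as defensible in a different project.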
That turns out to be a real, useful lever, and it behaves in ways that surprised us: generic context does nothing, one placement is cheapest, and the popular "hierarchical docs" pattern backfires. But it is a small lever, and naming it correctly matters, because the previous two articles framed it as something larger.
5. From incurious to self-sufficient
In The Incurious Agent we watched both agents skip the curated memory/ we had written for them and concluded the bottleneck was consumption: that the missing skill was environmental curiosity, the agent's willingness to go read the help we had prepared. That reading fit the data we had then. With harder tasks, it turns out to be incomplete.
This sharpens our last claim. The Incurious Agent argued the binding constraint is that agents don't read the context we prepare. In this study we put the missing piece exactly where the agent always looks, the root contract it reads first, and confirmed it read it. Capability still did not move, because there was no capability gap: the agents were already self-sufficient on the functional work. Curiosity is a real limitation, but on this kind of task it was not the constraint we took it for. It bites only on the thin band of project-specific conventions, and there, as the next piece shows, what matters is whether the context is relevant and where it sits, not how curious the agent is.
The name for what we measured is functional self-sufficiency: on work like this, an agent dropped cold into the codebase already brings the engineering judgment the task requires. It does not need to be taught to write idempotent creates or replay-safe logs; it needs, at most, to be told the handful of arbitrary local conventions it has no way to guess. The practice of piling up READMEs, briefs, scaffolds, and memory so the agent "understands the project" is, for these tasks, solving a problem the agent does not have. The problem it does have is narrow, and the fix is small.
6. Scope and limits
The caveats shape the claim, so they sit in the body, not a footnote.
- "Self-sufficient" here means on AC-checkable functional invariants, on small, well-specified tasks. We are not claiming an agent needs no context for a large, ambiguous feature, or for work that spans many sessions. We are claiming that on each of these concrete, verifiable requirements, a bare agent already succeeds, so context has no room to raise capability on them.
- These are landmines, not epics. Each task is a small, sharp problem with one hidden requirement. That is what makes the result clean (a do-nothing answer fails; the bare agent passes), but it is a deliberately narrow slice of engineering.
- Two agents are two agents. That they agree is striking, but it is two model families at one moment in time. We hold the convergence as a strong hypothesis, not a law.
- The orange band is where the open questions are, and it is small. The rest of this series is about that band: what kind of context closes it, what it costs, and which popular ways of delivering it make things worse.
7. What this changes
If agents are already self-sufficient on the logic of tasks like these, then the job of an agent's environment is not to teach it the project, which it can't, and doesn't need to here. The job is narrower and more mechanical: hand the agent the few conventions it genuinely can't infer, as cheaply and as reachably as possible, and get out of the way. That reframes "context engineering" for this kind of work from "write the agent a better manual" to "find the one paragraph that matters and put it where the agent already looks."
Which raises the obvious question: what counts as "the one paragraph that matters," and does it matter how you deliver it? It does, more than we expected. Two documents of identical length, one of which works and one of which doesn't; a popular scaffolding pattern that turns out to be the most expensive way to say a single sentence. That is the next piece. The harness, the tasks, and every run are saved and reusable; for the first time the bare agent had something it could not do, and it turned out to be smaller, and stranger, than "read the manual."