Relevance, Not Volume
We wrote two operating contracts for a coding agent and matched them to the same length. One was generic; the other carried a single project-specific rule. The generic one changed nothing: conformance stayed at its 8% floor. The one with the rule lifted it to 80–100%. On these tasks it was not about how much context you give an agent, but whether the bytes close a gap it actually has.
The last piece left us with a narrow target. On tasks like these a bare agent already has the engineering judgment to do the work; the only thing it can't guess is a project's own arbitrary conventions, such as the exact shape of an error body or whether a model rejects unknown fields. So we tried to hand it those conventions, and we ran a clean test of the thing practitioners argue about: does more context help? We wrote two operating contracts, matched them to the same length, and put the rule in only one of them. The generic one moved nothing. The one with the rule moved almost everything. The active ingredient in "context" here is not volume, it is relevance.
Krish Bhimani, Ankit Dwivedi, Rohini Pedamkar, Suraj Sharma · XO Labs Inc. · June 2026
TL;DR: The Self-Sufficient Agent found that on these tasks repository context moves exactly one thing: an agent's conformance to project-specific wire conventions it has no way to infer. Here we ask what kind of context moves it. We gave the agent a generic AGENTS.md contract (~14 KB) and, separately, the same contract padded to the same length but containing the one rule the task needs (~14 KB). Same size, same structure; only the content differs. The generic contract left conformance at its floor (8%); the rule-bearing one lifted it to 100% for Codex and 80% for Claude. Adding a document did nothing; adding the relevant document did almost everything. It also offers a reading of why two recent studies of AGENTS.md reached opposite conclusions: they were standing at opposite ends of the same lever.
1. The only lever there is
We established last time that the two agents we tested, dropped into a real codebase with no documentation, already nail the functional work (idempotency, replay-safety, deduplication) and do it identically. The single place a bare agent slips is a project's house conventions: the wire formats that are true for this service and written down nowhere in its code. On these tasks, that orange band is the entire surface repository context can act on.
So the practical question is not "should I write the agent more context?" We just saw most of it is spent on capability the agent already has. The question is sharper: to close that one narrow gap, what do you actually need to put in front of the agent? A whole operating contract? A project brief? Or something much smaller?
2. A clean test: same length, one sentence apart
The clean way to ask whether more context helps is to add context that does not address the task and see if it helps anyway. So we built two versions of the same AGENTS.md operating contract:
- Generic. A real, sensible contract: how the project is laid out, how to run it, house style, a boot ritual. Everything except the one rule this task needs. ~14.2 KB.
- Generic + the rule. Byte-for-byte the same contract, padded back to the same length, with one paragraph added: the actual convention the task hinges on (say, "list endpoints return
{total, offset, items}, never a bare array"). ~14.7 KB.
Same size. Same structure. Same everything an agent has to read. The only difference is whether the one relevant sentence is in there.
If "more context helps the agent orient," the generic contract should help: it's a thorough, accurate description of the project, exactly the kind of thing teams spend afternoons writing. It is also exactly the kind of thing the agent does not need.
3. The result: volume is inert, relevance is everything
Here is the conformance rate (how often the agent produces the project's required format) across three conditions: no docs at all, the generic 14 KB contract, and the same-length contract with the rule.
The first two groups are the experiment's whole point. No docs and a 14 KB generic contract produce the same near-zero conformance: 4% and 8%, statistically a wash. The agent reads a thorough, accurate description of the project and is no more likely to match its conventions than if you had handed it an empty folder. Then you add back the same number of bytes, but now one paragraph of it is the rule, and conformance jumps to 100% / 80%. The fourteen kilobytes of generic context were doing nothing. The active ingredient was the one sentence.
This is the cleanest result in the program, because it isolates a single variable. Length is held constant, structure is held constant, and the agent reads the document either way. The only thing that changes is whether the bytes address the task, and that alone accounts for the swing from 8% to ~100%. On these tasks, "context" is not a quantity you add; it is a question of fit.
4. Why two studies disagreed about AGENTS.md
This also speaks to a public disagreement. Within a few months, two careful studies of AGENTS.md files reached opposite conclusions: one found the files cut an agent's effort substantially; the other found they barely help, sometimes hurt, and reliably cost more. Practitioners read both and were left without a clear answer.
They may both be right. They were measuring opposite ends of this lever. A study whose AGENTS.md happened to contain the conventions the tasks needed sits at our right-hand bar: relevant context, large gain. A study whose files were generic or redundant with the code sits at our middle bar: present, read, and inert. On this reading the disagreement is not about whether AGENTS.md "works," but about whether the particular file closed a gap the task actually had. Presence is the wrong question; relevance and headroom are the right ones.
5. What "relevant" actually means here
It is worth being precise, because "relevant context" can sound like a tautology. In this setting it has a sharp definition: context is relevant when it supplies something the agent both needs and cannot derive. That is a small set. It excludes everything the agent already knows (all the functional engineering; see the last piece), and everything the agent could read off the code itself. What is left is the genuinely arbitrary: the house conventions, the wire formats, the "we always do it this way" rules that are decisions, not deductions. A document earns its tokens only to the extent it carries those.
That is a demanding bar, and most of what goes into a typical agent scaffold does not clear it. The project brief restates what the code already says. The architecture overview describes structure the agent can see. The maintained memory/ mostly records things the agent would have done anyway. None of it is wrong; it just is not relevant in the strict sense, so it sits in the window as cost, not help.
6. Scope and limits
- One repository, one convention block, five tasks. The flip is large and clean and holds for both agents, but it is a focused result on project wire-formats, not a general law about all documentation.
- "Relevance" is doing real work in this claim, and we have defined it narrowly (needed and underivable). Broaden the definition and the result will soften, which is the point, not a loophole.
- Codex went to 100%, Claude to 80%. Supplying the rule is not a magic switch; one agent applied it slightly less reliably than the other. The flip is what is robust; the last 20% is a separate, smaller story about the agent, which we come to later.
- Relevance is necessary, not sufficient. Putting the right rule in front of the agent is most of the battle, but not all of it. Where you put it turns out to matter too.
7. Necessary, but not sufficient
So the first half of the practitioner's rule is, on these tasks, settled: do not measure context by the pound. A long, accurate, generic contract is not a smaller version of a useful one; it is a different thing entirely, and it did nothing here. What helps is the one rule that closes a gap the agent actually has. Find that, write that.
But there is a second half. We had three different ways to deliver that one winning rule: drop it in the root contract, place it in a big scaffold, or file it in a hierarchical tree of per-folder docs the way a popular open-source pattern recommends. They all contain the same sentence. They do not cost the same, and one of them fails to deliver the rule at all on a deep-target task. That is the next piece, and it is where the popular advice turns out to be the most expensive way to say something simple.
The Self-Sufficient Agent
We promised to raise the difficulty until the bare agent broke. We did, with harder tasks on a real 170-file service, and on these tasks it didn't break. With no documentation at all, two coding agents satisfied nine of nine non-obvious functional requirements, identically. On this kind of work, the bottleneck was never curiosity.
Others
Perspectives we keep returning to: the XO view, a view from Claude, and a view from GPT.