How the Environment Affects Agent Performance and Token Cost
A pilot measuring how a workspace's project context shapes a coding agent's performance and token cost: richer context is nearly free to run, it does not hurt task success, and it lowers the cost of getting oriented.
We ran a coding agent against the same software task under six escalating levels of project context, from an empty workspace up to a full XO project scaffold, holding the model, the prompt, the task, and the codebase fixed. Changing only the environment changed how much the agent spent. Richer context did not reduce task success, it tended to lower the token cost of the task, and because the context is cached, it added almost nothing to the bill.
Krish Bhimani, Ankit Dwivedi, Rohini Pedamkar, Suraj Sharma · XO Labs Inc. · June 2026
TL;DR: We measured an OpenAI Codex agent solving a fixed coding task against a
pinned snapshot of a real FastAPI codebase, under six environments that add
progressively more project context (empty, README, an AGENTS.md contract, a
project brief, the full XO scaffold, and the scaffold plus seeded memory). Three
findings. First, context did not cost success: every environment completed the
task, and the richer ones did it with fewer tokens, down to 36 percent below the
empty workspace. Second, the scaffold is nearly free: 88 to 96 percent of the
input is served from cache, so the full XO workspace pays about what an empty one
pays even though it carries far more context. Third, the cost driver is
orientation: a run's token bill tracks how much the agent has to probe the
project blindly to find its bearings (r = 0.79), and the single most expensive run
in the pilot was the empty workspace. Structured context is the lever on exactly
that variable. This is a small pilot and we report it as one.
1. Background
A theme runs through our work on agentic systems: the environment an agent works inside matters more than the agent that executes the work. We argued this from first principles in Why the unit of work matters, where the claim is that the environment is the load-bearing component, because it holds the ground truth, captures state, meters cost, and tells the agent how work is done here.
This note is a first empirical probe of that argument. If the environment really is the edge, then changing only the environment, while holding the model, the prompt, the task, and the codebase fixed, should change how an agent works. So we built a measurement rig to test exactly that, and we report what the pilot found, including where it came up short.
2. What we measured
The design is a grid. We hold the task and the codebase constant and vary one thing: how much project context the workspace carries when the agent starts. Each cell is one agent run, and every run produces a telemetry record with the score, the token counts, and a trace of what the agent read and did.
Environments
Each environment starts from the same pinned commit of the xo-cowork-api
repository, with the original documentation stripped out, then adds a defined
overlay of context files. The ladder goes from nothing to the full XO scaffold.
| Environment | What the agent starts with |
|---|---|
| E0 Empty | The bare repository, nothing added |
| E1 +README | A 43-line README.md |
| E2 +AGENTS | An AGENTS.md operating contract (239 lines): how the project is organized and how work is done here |
| E3 +PROJECT | AGENTS.md plus a PROJECT.md brief |
| E4 Full XO | The full XO scaffold: AGENTS, CLAUDE, PROJECT, OBJECTIVES, PLAN, PROGRESS, and a memory directory |
| E5 +XO+Memory | The full XO scaffold plus seeded, task-relevant memory from prior sessions |
E0 and E1 are the bare baselines. E2 through E5 are the XO project conditions, where the workspace carries the kind of scaffolding an XO project ships with.
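The ladder can be written down as a small mapping from environment to overlay files. This is a minimal sketch, not the rig's actual configuration: only README.md, AGENTS.md, and PROJECT.md are named in the text, so the remaining file names and the memory paths are hypothetical placeholders.

```python
# Context overlay per environment, following the table above.
# Only README.md, AGENTS.md, and PROJECT.md are named in the text; every
# other file name and the memory paths are hypothetical placeholders.
OVERLAYS: dict[str, list[str]] = {
    "E0": [],
    "E1": ["README.md"],
    "E2": ["AGENTS.md"],
    "E3": ["AGENTS.md", "PROJECT.md"],
    "E4": [
        "AGENTS.md", "CLAUDE.md", "PROJECT.md",
        "OBJECTIVES.md", "PLAN.md", "PROGRESS.md", "memory/",
    ],
}
# E5 is the full scaffold plus seeded memory (the path is an assumption).
OVERLAYS["E5"] = OVERLAYS["E4"] + ["memory/seeded/"]


def is_xo_condition(env: str) -> bool:
    """E2 through E5 carry the AGENTS.md contract; E0 and E1 are bare baselines."""
    return "AGENTS.md" in OVERLAYS[env]


xo_conditions = [env for env in OVERLAYS if is_xo_condition(env)]
print(xo_conditions)  # ['E2', 'E3', 'E4', 'E5']
```

Keeping the overlays as data makes the one varied factor explicit: two cells differ only in the list of files copied on top of the pinned commit.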
Task and agent
The headline measurements come from T01, an easy feature: add a
GET /health/deep endpoint that checks configured services and returns a status
JSON, with a test. T01 is the clean probe for the cost question, because context
cannot change whether it is solvable. Every configuration can finish it, so any
difference in tokens is a difference in efficiency, not capability. A second task,
T03 (add per-IP rate limiting to the chat endpoints, a medium feature), is
used in the mechanism analysis. Every run used the OpenAI Codex CLI as the
coding agent.
Metrics
- Task success. The share of automated acceptance checks that pass, from 0 to 1.
- Total input tokens. Everything the agent reads and processes: the raw throughput of the run.
- Effective tokens. The cache-adjusted cost, which charges cached input at the cache rate rather than the full rate. Concretely, it is uncached input + 0.1 × cached input + output. This is the number that maps to what you actually pay, because most of the context is served from cache.
- Orientation. How much the agent explores to get its bearings, counted as blind shell probes: directory listings, greps, and file reads run to reconstruct the project's layout.
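The effective-token formula is simple enough to state as code. The sketch below assumes the three counts are available per run; the function and parameter names are illustrative, and the example counts are invented to show the arithmetic, not taken from the pilot.

```python
CACHE_RATE = 0.1  # cached input is charged at one tenth of the full rate


def effective_tokens(uncached_input: int, cached_input: int, output: int) -> float:
    """Cache-adjusted cost: uncached input + 0.1 x cached input + output."""
    return uncached_input + CACHE_RATE * cached_input + output


# Invented counts for illustration only (not pilot data).
paid = effective_tokens(uncached_input=100_000, cached_input=1_000_000, output=20_000)
print(f"{paid:,.0f} effective tokens")  # 220,000 effective tokens
```

In this made-up run the agent processes 1.1 million input tokens but is charged for the equivalent of 220 thousand, because the cached million counts at a tenth of its size.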
3. Context did not cost success, and it lowered the token bill
On the easy task, every one of the six environments finished the job with a perfect score. That is the first thing to settle: adding project context did not confuse the agent or cost it any acceptance checks. Nothing was lost.
What changed was the token bill. Against the empty workspace, a README cut input
tokens by 25 percent, the AGENTS.md contract by 32 percent, and the full XO
scaffold with seeded memory by 36 percent, all while finishing the same task. Two
of the configurations, E3 and E4, landed within a few percent of the empty
baseline, so this is a trend rather than a clean monotonic line. But the direction
is consistent and the cheapest workspaces are among the richest ones.
Figure 1: Total input tokens per environment on the easy task. Every run succeeds (score 1.0). README, the AGENTS contract, and the full XO scaffold with seeded memory each spend fewer tokens than the empty workspace, down to 36 percent fewer.
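The percentages in this section can be reproduced from the rounded input-token figures in the appendix table. A short sketch of that arithmetic, with a hypothetical helper name; because the inputs are already rounded, results are rounded to whole percents.

```python
# Total input tokens on T01, in millions, from the appendix table (rounded).
INPUT_TOKENS_M = {
    "E0": 1.87, "E1": 1.40, "E2": 1.27,
    "E3": 1.95, "E4": 1.80, "E5": 1.20,
}


def reduction_vs_baseline(tokens: dict[str, float], baseline: str = "E0") -> dict[str, int]:
    """Percent fewer tokens than the baseline; negative means more tokens."""
    base = tokens[baseline]
    return {
        env: round(100 * (base - value) / base)
        for env, value in tokens.items()
        if env != baseline
    }


print(reduction_vs_baseline(INPUT_TOKENS_M))
# {'E1': 25, 'E2': 32, 'E3': -4, 'E4': 4, 'E5': 36}
```

The negative value for E3 and the small one for E4 are the two configurations that landed within a few percent of the empty baseline.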
4. The scaffold is nearly free, because it is cached
The obvious worry about scaffolding is that it just stuffs the context window with files the agent has to pay to read. Under prompt caching, that worry does not hold.
Across the T01 runs, 88 to 96 percent of input tokens were served from cache. The project files are read once and then reused at the cache rate, so the effective cost, the part you actually pay for, is a small slice of the raw input. The agent reads between 1.2 and 2.0 million tokens of context, but pays for only 216 to 344 thousand. More to the point, that paid cost does not climb in step with the scaffolding: the full XO workspace with memory pays 272 thousand, essentially the same as the empty workspace's 275 thousand, despite carrying the entire project scaffold. The two full-XO configurations bracket the empty baseline rather than towering over it.
Figure 2: For each environment on T01, the light bar is everything the agent reads (mostly cached) and the green bar is what it actually pays for. The dashed line is the empty workspace's paid cost. Adding the full scaffold does not blow up the bill: the paid cost stays in the same band because the context is cached.
So the cost of running an agent inside a richly scaffolded XO project is, to a close approximation, the cost of running it in an empty one. You get the orientation benefits without paying for the context on every turn.
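To make "a small slice" concrete, the paid share of each run can be computed from the appendix table as effective tokens divided by total input tokens. This is a sketch over the rounded table values, with hypothetical helper names.

```python
# T01 figures from the appendix table, in thousands of tokens (rounded).
INPUT_K = {"E0": 1870, "E1": 1400, "E2": 1270, "E3": 1950, "E4": 1800, "E5": 1200}
PAID_K = {"E0": 275, "E1": 278, "E2": 216, "E3": 323, "E4": 344, "E5": 272}


def paid_share(env: str) -> float:
    """Effective (paid) tokens as a fraction of everything the agent read."""
    return PAID_K[env] / INPUT_K[env]


def paid_delta_vs_empty(env: str) -> int:
    """Paid tokens relative to the empty workspace, in thousands."""
    return PAID_K[env] - PAID_K["E0"]


for env in INPUT_K:
    print(f"{env}: paid share {paid_share(env):.1%}, "
          f"delta vs E0 {paid_delta_vs_empty(env):+d}K")
```

On these figures the paid share stays between roughly 15 and 23 percent of raw input in every environment, and the paid difference from the empty workspace runs from 59 thousand fewer tokens (E2) to 69 thousand more (E4).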
5. Why it works: token cost is an orientation tax
What is the agent actually spending tokens on, if not on reading the scaffold? On finding its way around. With nothing to orient on, the agent has to reconstruct the project's conventions by probing the filesystem: listing directories, grepping for patterns, opening files to infer the structure it was never told. Every one of those probes costs tokens.
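One way to count blind probes from a run's command trace is to classify each shell command by its program name. The sketch below is an assumption about how such a counter could work, not the rig's actual classifier: the set of probe commands is a guess, the trace is invented, and compound commands (pipes, `cd x && ls`) are judged only by their first word.

```python
import shlex

# Assumed set of read-only exploration commands; the rig's real list may differ.
PROBE_COMMANDS = {"ls", "find", "tree", "grep", "rg", "cat", "head", "tail"}


def is_blind_probe(command: str) -> bool:
    """True if the command's first word is a listing, search, or file-read tool."""
    try:
        words = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar malformed input
        return False
    return bool(words) and words[0] in PROBE_COMMANDS


def count_blind_probes(commands: list[str]) -> int:
    return sum(is_blind_probe(c) for c in commands)


# Invented trace for illustration; the paths are hypothetical.
trace = [
    "ls -la",
    "grep -rn 'router' app/",
    "cat app/main.py",
    "pytest -q",
    "git status",
]
print(count_blind_probes(trace))  # 3
```

Counting by program name is crude, but it is cheap to compute from a trace and errs on the side of treating anything that only reads the filesystem as orientation.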
Across both tasks, the number of blind shell probes a run made tracked its total
token cost, with a correlation of r = 0.79. The single most expensive run in the
whole pilot was the empty workspace on T01: it ran 68 shell commands and spent 2.2
million input tokens to deliver a result the leaner, context-rich runs produced for
a third less. The cheapest runs, the AGENTS.md and seeded-memory configurations,
were also among the least exploratory.
Figure 3: Each point is one agent run across T01 and T03. The horizontal axis is how many blind shell probes the run made; the vertical axis is its total input tokens. More probing goes with more cost. The empty workspace sits at the expensive, exploratory end.
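The correlation reported above is a Pearson coefficient between probes per run and input tokens per run. A minimal sketch of that computation follows; the per-run values for T03 are not published in this note, so the example data is synthetic and will not reproduce r = 0.79.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Synthetic values for illustration only (not pilot data).
probes = [12, 25, 31, 44, 68]
input_tokens_m = [1.1, 1.5, 1.3, 1.9, 2.3]
print(f"r = {pearson_r(probes, input_tokens_m):.2f}")
```

On Python 3.10 or later, `statistics.correlation` from the standard library computes the same quantity. With only a handful of runs, a coefficient like this describes the sample and is not a stable estimate.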
This is the mechanism that ties the findings together. The cost of an agent run is dominated by orientation, and structured context is precisely the thing that removes the need to orient. A good scaffold hands the agent the project's map up front, so it spends its budget on the work instead of on reconstructing where things are. That is why richer context is cheaper to run, and why the savings show up without any loss of success.
6. Scope and limitations
This is a pilot, and we would rather report it honestly than oversell it.
- Small samples. Each cell is a single run, except the empty T01 cell, which has two. These are directional findings without confidence intervals, and the exact percentages will move as we add replications.
- One agent, one codebase. We ran OpenAI Codex against a single repository, a snapshot of xo-cowork-api. We have not yet checked how the effect transfers to other agents or other codebases.
- The token savings are task-specific. The reductions in section 3 come from one easy task. Broader runs in our rig show that richer context does not always cut tokens, so we treat the savings as a real but bounded result, not a universal law. What held up robustly across the pilot is the caching result and the orientation correlation.
- A success effect we chose not to claim. An earlier look suggested structured context also lifted task success on the harder task: the bare workspaces scored partial while the XO-context ones scored complete. On inspection, the failing check was an integrity test that timed out under concurrent grading load rather than finding broken code, and the agent's actual feature code passed. That is a grading artifact, not a capability difference, so we do not make a reliability claim here. We mention it because catching it is the kind of rigor we hold the work to.
7. What this means for XO Projects
The practical reading is simple. The leverage is in the environment, not in a cleverer prompt or a bigger model. An XO project hands the agent the contract, the plan, and the memory up front, so when it sits down to work it already knows how the project is organized. That shows up as a smaller token bill on routine work, and because the context is cached, you get it for almost nothing on top of what an empty workspace would cost. The deeper reason is that an agent's cost is mostly the cost of figuring out where it is, and a good environment answers that question before the agent has to ask it.
Teams that invest in environment design get compounding returns. That was the argument. This pilot is a first measurement pointing the same way, and we will keep widening it.
Why the unit of work matters
The conceptual argument behind this measurement: the environment is the load-bearing component of agentic work.
The unit of work thesis
As agents take on whole jobs, the work becomes the unit: an outcome to own, defined, budgeted, verified, and settled.
Appendix: per-cell data
Every number above is recomputed directly from the per-run telemetry records. Tokens are rounded. The headline cost results come from T01, where all runs succeed so cost is comparable across conditions. T01 E0 is the mean of two runs; all other cells are single runs. T03 runs contribute to the orientation correlation in Figure 3 only.
| Task | Env | Score | Input tokens | Effective (paid) tokens |
|---|---|---|---|---|
| T01 | E0 Empty | 1.00 | 1.87M | 275K |
| T01 | E1 +README | 1.00 | 1.40M | 278K |
| T01 | E2 +AGENTS | 1.00 | 1.27M | 216K |
| T01 | E3 +PROJECT | 1.00 | 1.95M | 323K |
| T01 | E4 Full XO | 1.00 | 1.80M | 344K |
| T01 | E5 +XO+Memory | 1.00 | 1.20M | 272K |
How to cite: Bhimani, K., Dwivedi, A., Pedamkar, R., Sharma, S., and XO Labs Inc. (2026). How the Environment Affects Agent Performance and Token Cost: A Pilot on What XO Project Context Does to a Coding Agent. XO Labs Research.