How the Environment Affects Agent Performance and Token Cost

A pilot measuring how a workspace's project context shapes a coding agent's performance and token cost: richer context is nearly free to run, it does not hurt task success, and it lowers the cost of getting oriented.

Series update: a later 84-run study did not reproduce this pilot's headline that richer context lowered the token bill. The caching observation and method remained useful, but the early cost result did not generalize. Read "Agent Context Research: The Evidence So Far" for the full evidence path before treating this pilot as current guidance.

We ran a coding agent on the same software task under six escalating levels of project context, from an empty workspace up to a full XO project scaffold with seeded memory, holding the model, the prompt, the task, and the codebase fixed. Changing only the environment changed how many tokens the agent spent. In this pilot, richer context did not reduce task success, it tended to lower the token cost of the work, and because the context is cached, it added almost nothing to the bill.

Krish Bhimani, Ankit Dwivedi, Rohini Pedamkar, Suraj Sharma · XO Labs Inc. · June 2026

TL;DR: We measured an OpenAI Codex agent solving a fixed coding task against a pinned snapshot of a real FastAPI codebase, under six environments that add progressively more project context (empty, README, an AGENTS.md contract, a project brief, the full XO scaffold, and the scaffold plus seeded memory). Three findings. First, context did not cost success: every environment completed the task, and the cheapest runs were context-rich ones, down to 36 percent fewer input tokens than the empty workspace. Second, the scaffold is nearly free: 88 to 96 percent of the input is served from cache, so the full XO workspace with seeded memory pays about what an empty one pays even though it carries far more context. Third, the cost driver is orientation: a run's token bill tracks how much the agent has to probe the project blindly to find its bearings (r = 0.79), and the single most expensive run in the pilot was in the empty workspace. Structured context is the lever on exactly that variable. This is a small pilot and we report it as one.

1. Background

A theme runs through our work on agentic systems: the environment an agent works inside matters more than the agent that executes the work. We argued this from first principles in "Why the unit of work matters", where the claim is that the environment is the load-bearing component, because it holds the ground truth, captures state, meters cost, and tells the agent how work is done here.

This note is a first empirical probe of that argument. If the environment really is the edge, then changing only the environment, while holding the model, the prompt, the task, and the codebase fixed, should change how an agent works. So we built a measurement rig to test exactly that, and we report what the pilot found, including where it came up short.

2. What we measured

The design is a grid. We hold the task and the codebase constant and vary one thing: how much project context the workspace carries when the agent starts. Each cell is one agent run, and every run produces a telemetry record with the score, the token counts, and a trace of what the agent read and did.
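As a rough illustration of what one cell's telemetry record might hold, here is a minimal sketch. The class and field names are assumptions made for this example, not the rig's actual schema, and the values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One cell of the grid: a single agent run (field names are illustrative)."""
    task: str                    # e.g. "T01"
    environment: str             # e.g. "E0"
    score: float                 # share of acceptance checks passed, 0 to 1
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    # Trace of the shell commands the agent ran, in order.
    shell_commands: list[str] = field(default_factory=list)

    @property
    def input_tokens(self) -> int:
        """Total input tokens: everything the agent read, cached or not."""
        return self.uncached_input_tokens + self.cached_input_tokens


# Hypothetical values for illustration only; this is not a run from the pilot.
example = RunRecord("T01", "E0", 1.0, 100_000, 900_000, 20_000,
                    ["ls", "grep -rn router ."])
print(example.input_tokens)  # 1000000
```

Keeping cached and uncached input as separate fields is what makes the cache-adjusted cost computable later from the same record.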

Environments

Each environment starts from the same pinned commit of the xo-cowork-api repository, with the original documentation stripped out, then adds a defined overlay of context files. The ladder goes from nothing to the full XO scaffold.

Environment	What the agent starts with
E0 Empty	The bare repository, nothing added
E1 +README	A 43-line `README.md`
E2 +AGENTS	An `AGENTS.md` operating contract (239 lines): how the project is organized and how work is done here
E3 +PROJECT	`AGENTS.md` plus a `PROJECT.md` brief
E4 Full XO	The full XO scaffold: `AGENTS`, `CLAUDE`, `PROJECT`, `OBJECTIVES`, `PLAN`, `PROGRESS`, and a memory directory
E5 +XO+Memory	The full XO scaffold plus seeded, task-relevant memory from prior sessions

E0 and E1 are the bare baselines. E2 through E5 are the XO project conditions, where the workspace carries the kind of scaffolding an XO project ships with.

Task and agent

The headline measurements come from T01, an easy feature: add a GET /health/deep endpoint that checks configured services and returns a status JSON, with a test. T01 is the clean probe for the cost question, because context cannot change whether it is solvable. Every configuration can finish it, so any difference in tokens is a difference in efficiency, not capability. A second task, T03 (add per-IP rate limiting to the chat endpoints, a medium feature), is used in the mechanism analysis. Every run used the OpenAI Codex CLI as the coding agent.

Metrics

Task success. The share of automated acceptance checks that pass, from 0 to 1.
Total input tokens. Everything the agent reads and processes: the raw throughput of the run.
Effective tokens. The cache-adjusted cost, which charges cached input at the cache rate rather than the full rate. Concretely it is uncached input + 0.1 × cached input + output. This is the number that maps to what you actually pay, because most of the context is served from cache.
Orientation. How much the agent explores to get its bearings, counted as blind shell probes: directory listings, greps, and file reads run to reconstruct the project's layout.
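The effective-token definition above can be sketched in a few lines. The function names are ours, the 0.1 cache rate is the one stated in the definition, and the example counts are hypothetical rather than taken from any pilot run.

```python
def effective_tokens(uncached_input: int, cached_input: int, output: int,
                     cache_rate: float = 0.1) -> float:
    """Cache-adjusted cost: uncached input + cache_rate * cached input + output."""
    if min(uncached_input, cached_input, output) < 0:
        raise ValueError("token counts must be non-negative")
    return uncached_input + cache_rate * cached_input + output


def cached_share(uncached_input: int, cached_input: int) -> float:
    """Share of total input tokens that was served from cache."""
    total = uncached_input + cached_input
    return cached_input / total if total else 0.0


# Hypothetical counts for illustration; these are not a run from the pilot.
paid = effective_tokens(uncached_input=100_000, cached_input=900_000, output=20_000)
share = cached_share(100_000, 900_000)
print(f"effective tokens: {paid:,.0f}, cached share: {share:.0%}")
# effective tokens: 210,000, cached share: 90%
```

In this made-up case the agent reads one million input tokens but is charged for 210 thousand, because nine tenths of the input is billed at a tenth of the full rate.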

3. Context did not cost success, and it lowered the token bill

On the easy task, every one of the six environments finished the job with a perfect score. That is the first thing to settle: adding project context did not confuse the agent or cost it any acceptance checks. Nothing was lost.

What changed was the token bill. Against the empty workspace, a README cut input tokens by 25 percent, the AGENTS.md contract by 32 percent, and the full XO scaffold with seeded memory by 36 percent, all while finishing the same task. Two of the configurations, E3 and E4, landed within a few percent of the empty baseline, so this is a trend rather than a clean monotonic line. But the direction is consistent and the cheapest workspaces are among the richest ones.

Every workspace finishes the task; richer project context spends fewer tokens doing it. T01 input tokens by environment, all runs score 1.0.

Figure 1: Total input tokens per environment on the easy task. Every run succeeds (score 1.0). README, the AGENTS contract, and the full XO scaffold with seeded memory each spend fewer tokens than the empty workspace, down to 36 percent fewer.
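The percentages in this section can be rechecked from the rounded input-token figures in the per-cell appendix. A minimal sketch, with a helper name of our own choosing:

```python
# Input tokens per environment on T01, in millions, as listed in the appendix.
input_tokens_m = {
    "E0 Empty": 1.87,
    "E1 +README": 1.40,
    "E2 +AGENTS": 1.27,
    "E3 +PROJECT": 1.95,
    "E4 Full XO": 1.80,
    "E5 +XO+Memory": 1.20,
}


def percent_change_vs_baseline(values, baseline_key="E0 Empty"):
    """Percent change in tokens relative to the baseline, rounded to whole percent."""
    base = values[baseline_key]
    return {
        env: round(100 * (tokens - base) / base)
        for env, tokens in values.items()
        if env != baseline_key
    }


changes = percent_change_vs_baseline(input_tokens_m)
for env, pct in changes.items():
    print(f"{env}: {pct:+d}%")
# E1 +README: -25%
# E2 +AGENTS: -32%
# E3 +PROJECT: +4%
# E4 Full XO: -4%
# E5 +XO+Memory: -36%
```

The output shows the non-monotonic shape described above: E3 and E4 sit within a few percent of the empty baseline, while E1, E2, and E5 fall well below it.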

4. The scaffold is nearly free, because it is cached

The obvious worry about scaffolding is that it just stuffs the context window with files the agent has to pay to read. Under prompt caching, that worry does not hold.

Across the T01 runs, 88 to 96 percent of input tokens were served from cache. The project files are read once and then reused at the cache rate, so the effective cost, the part you actually pay for, is a small slice of the raw input. The agent reads between 1.2 and 2.0 million tokens of context, but pays for only 216 to 344 thousand. More to the point, that paid cost barely moves as you add scaffolding: the full XO workspace with memory pays 272 thousand, essentially the same as the empty workspace's 275 thousand, despite carrying the entire project scaffold. The two full-XO configurations bracket the empty baseline rather than towering over it.

Caching makes the scaffold nearly free: the agent reads 1.2 to 2.0M tokens of context but pays for only 216 to 344K, and the full XO workspace pays about what an empty one does.

Figure 2: For each environment on T01, the light bar is everything the agent reads (mostly cached) and the green bar is what it actually pays for. The dashed line is the empty workspace's paid cost. Adding the full scaffold does not blow up the bill: the paid cost stays in the same band because the context is cached.

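So the cost of running an agent inside a richly scaffolded XO project stays in the same band as the cost of running it in an empty one: the two full-XO configurations paid 272 and 344 thousand effective tokens against the empty workspace's 275 thousand. You get the orientation benefits without paying for the context on every turn.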
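To see why a high cache share shrinks the bill, here is a small sketch of the arithmetic. It assumes the 0.1 cache rate from the effective-token definition and looks at input only, so output tokens are left out. The function name is ours, and the paid figures are the rounded appendix values.

```python
def paid_input_fraction(cache_share: float, cache_rate: float = 0.1) -> float:
    """Fraction of the full-price input cost still paid at a given cache share."""
    if not 0.0 <= cache_share <= 1.0:
        raise ValueError("cache_share must be between 0 and 1")
    return (1.0 - cache_share) + cache_rate * cache_share


for share in (0.88, 0.96):
    fraction = paid_input_fraction(share)
    print(f"cache share {share:.0%}: pay about {fraction:.1%} of raw input")

# Effective (paid) tokens on T01, in thousands, from the appendix table.
paid_k = {"E0 Empty": 275, "E4 Full XO": 344, "E5 +XO+Memory": 272}
ratio_vs_empty = {env: paid / paid_k["E0 Empty"] for env, paid in paid_k.items()}
for env, ratio in ratio_vs_empty.items():
    print(f"{env}: {ratio:.2f}x the empty workspace's paid cost")
```

At the observed cache shares the input side costs roughly 14 to 21 percent of its raw size, and the two full-XO cells come out at about 0.99 and 1.25 times the empty workspace's paid cost.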

5. Why it works: token cost is an orientation tax

What is the agent actually spending tokens on, if not on reading the scaffold? On finding its way around. With nothing to orient on, the agent has to reconstruct the project's conventions by probing the filesystem: listing directories, grepping for patterns, opening files to infer the structure it was never told. Every one of those probes costs tokens.
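One simple way to count such probes from a run's command trace is to classify each shell command by its first word. This is a heuristic sketch, not the rig's actual classifier: the command set and the example trace, including its paths, are assumptions made for illustration.

```python
import shlex

# Assumed set of orientation commands: directory listings, searches, file reads.
ORIENTATION_COMMANDS = {"ls", "find", "tree", "grep", "rg", "cat", "head", "tail"}


def is_orientation_probe(command: str) -> bool:
    """True if the command's first word is a listing, search, or read tool."""
    try:
        words = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: treat the command as unclassifiable.
        return False
    return bool(words) and words[0] in ORIENTATION_COMMANDS


def count_probes(commands) -> int:
    """Number of commands in the trace that look like orientation probes."""
    return sum(is_orientation_probe(c) for c in commands)


# Hypothetical trace for illustration only.
trace = ["ls -la", "grep -rn 'APIRouter' app/", "cat app/main.py",
         "pytest -q", "git diff"]
print(count_probes(trace))  # 3
```

A first-word rule will miss probes hidden inside pipelines or wrapper scripts, so a count like this is a lower bound on exploration rather than an exact measure.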

Across both tasks, the number of blind shell probes a run made tracked its total input tokens, with a correlation of r = 0.79. The single most expensive run in the whole pilot was in the empty workspace on T01: it ran 68 shell commands and spent 2.2 million input tokens to deliver a result that the leanest context-rich cells, E2 and E5, produced with 1.27 and 1.20 million. The cheapest runs, the AGENTS.md and seeded-memory configurations, were also among the least exploratory.

Token cost is an orientation tax: across all runs, the more the agent probed the project blindly, the more tokens it spent (r = 0.79).

Figure 3: Each point is one agent run across T01 and T03. The horizontal axis is how many blind shell probes the run made; the vertical axis is its total input tokens. More probing goes with more cost. The empty workspace sits at the expensive, exploratory end.
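For readers who want to reproduce this kind of statistic on their own runs, a Pearson correlation between probe counts and input tokens can be computed as sketched below. The per-run points behind Figure 3 are not listed in this note, so the data here is invented purely to exercise the function and does not reproduce r = 0.79.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (probe count, input tokens in millions) pairs; NOT the pilot's data.
probes = [10, 20, 30, 40, 50]
tokens_m = [1.0, 1.4, 1.3, 1.9, 2.0]
r = pearson_r(probes, tokens_m)
print(f"r = {r:.2f}")
```

With only a handful of runs per condition, a coefficient like this is a description of the sample rather than a tight estimate, which is how the pilot treats it.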

This is the mechanism that ties the findings together. The cost of an agent run is dominated by orientation, and structured context is precisely the thing that removes the need to orient. A good scaffold hands the agent the project's map up front, so it spends its budget on the work instead of on reconstructing where things are. That is why richer context is cheaper to run, and why the savings show up without any loss of success.

6. Scope and limitations

This is a pilot, and we would rather report it honestly than oversell it.

Small samples. Each cell is a single run, except the empty T01 cell, which has two. These are directional findings, not estimates with tight confidence intervals. The exact percentages will move as we add replications.
One agent, one codebase. We ran OpenAI Codex against a single repository, a snapshot of xo-cowork-api. We have not yet checked how the effect transfers to other agents or other codebases.
The token savings are task-specific. The reductions in section 3 come from one easy task. Broader runs in our rig show that richer context does not always cut tokens, so we treat the savings as a real but bounded result, not a universal law. What held up robustly across the pilot is the caching result and the orientation correlation.
A success effect we chose not to claim. An earlier look suggested structured context also lifted task success on the harder task: the bare workspaces scored partial while the XO-context ones scored complete. On inspection, the failing check was an integrity test that timed out under concurrent grading load rather than finding broken code, and the agent's actual feature code passed. That is a grading artifact, not a capability difference, so we do not make a reliability claim here. We mention it because catching it is the kind of rigor we hold the work to.

7. What this means for XO Projects

The practical reading is simple. The leverage is in the environment, not in a cleverer prompt or a bigger model. An XO project hands the agent the contract, the plan, and the memory up front, so when it sits down to work it already knows how the project is organized. That shows up as a smaller token bill on routine work, and because the context is cached, you get it for almost nothing on top of what an empty workspace would cost. The deeper reason is that an agent's cost is mostly the cost of figuring out where it is, and a good environment answers that question before the agent has to ask it.

Teams that invest in environment design get compounding returns. That was the argument. This pilot is a first measurement pointing the same way, and we will keep widening it.

Appendix: per-cell data

Every number above is recomputed directly from the per-run telemetry records. Tokens are rounded. The headline cost results come from T01, where all runs succeed so cost is comparable across conditions. T01 E0 is the mean of two runs; all other cells are single runs. T03 runs contribute to the orientation correlation in Figure 3 only.

Task	Env	Score	Input tokens	Effective (paid) tokens
T01	E0 Empty	1.00	1.87M	275K
T01	E1 +README	1.00	1.40M	278K
T01	E2 +AGENTS	1.00	1.27M	216K
T01	E3 +PROJECT	1.00	1.95M	323K
T01	E4 Full XO	1.00	1.80M	344K
T01	E5 +XO+Memory	1.00	1.20M	272K

How to cite: Bhimani, K., Dwivedi, A., Pedamkar, R., Sharma, S., and XO Labs Inc. (2026). How the Environment Affects Agent Performance and Token Cost: A Pilot on What XO Project Context Does to a Coding Agent. XO Labs Research.