XO Docs
The Future of WorkPhase 1: Agentic WorkforceResearch: Why the Unit of Work

The environment is the key part

The environment owns the ground truth: it is the specification, the scorekeeper, and the binding constraint.

Of the three components (the agent, the task, and the environment), the environment is the one that carries the system.

A unit of work cannot exist in a vacuum. Its definition of done is meaningless unless something captures the before state; its verification is impossible unless something captures the after state; its budget is unenforceable unless something meters spend as it happens; its record is untrustworthy if the agent writes its own history. All four of those somethings are the environment: the workspace with its runtime, memory, files, tools, budget, and record.

Define
The environment
runtime · memory · files · tools · budget · record
Agent
State before
State after
Everything is metered as it happens. Tool calls, tokens, and files are recorded by the environment, not reported by the agent.
Verify + settle

Figure 3: The agent acts inside the environment, but the environment owns the ground truth: it captures state before and after, meters every action as it happens, and produces the record that verification and settlement run against. The agent is replaceable; the environment is what makes the unit of work definable, checkable, and billable.

Three observations support putting the environment first.

The environment is the specification. Whatever the prompt says, the agent's effective objective is whatever its environment rewards and permits. This mirrors a consistent finding in RL and alignment research: models optimize what their training environment actually measures, not what their designers intended, and flaws in the environment surface as flaws in behavior, from reward hacking under imperfect proxies to broad emergent misalignment when production environments are exploitable. The production analogue is direct: if the workspace's definition of done is gameable (tests that can be edited, checks that trust the agent's own report), agents will eventually satisfy the letter of the check rather than the intent of the work. Hardening the environment is how you harden the outcome.

Verification must live outside the agent. Trust at scale requires that the party doing the work is not the party keeping score. The environment is the natural scorekeeper: it observes every tool call, file change, and token spent as a side effect of hosting the work, so its record is produced by construction rather than by testimony. This is why the workspace, not the model, is the trust boundary, and why the boundary travels with the workspace wherever it runs, including inside your own cloud.

Environment quality, not agent quality, is the binding constraint. Recent work on automated research agents found that agents iterating inside a well-instrumented sandbox, with clean metrics, fast feedback, and a gradable outcome, outperformed human researchers on a problem humans had tuned for days, and that the bottleneck has shifted from the agents' capability to the design of the evaluation environment itself. The same shift is underway in production agentic work: models improve on their own schedule and are swappable per unit, but the environment, meaning what state is captured, what done means, and what gets metered, is the part you own. It is also where the compounding happens: every unit of work executed in a workspace leaves behind memory, records, and sharpened definitions of done that make the next unit cheaper and safer.

Investing in the agent
  • Gains arrive with each model release
  • Improvements are not owned by you
  • Same prompt, different model, different result
Investing in the environment
  • State capture makes every outcome checkable
  • Records and memory compound across units
  • Any agent, same environment, verifiable result

In plain terms: the agent supplies the skill, but the environment supplies the truth. Definitions of done, state comparisons, budgets, and records are all environmental facts. A mediocre agent in a well-designed environment produces verifiable, billable, improvable work. A brilliant agent in a poor environment produces plausible text.

Next: how a unit of work is calculated.