How a unit of work is calculated
Snapshot before, snapshot after, compare against weighted checks, with formulas and a micro-benchmark.
We measure a unit of work the same way a manager judges work today: snapshot the world before, snapshot it after, and compare the two against the definition of done.
The calculation needs no new accounting. Scoring, settlement, and efficiency all fall out of one state comparison.
The state-snapshot model
Formally, a unit of work is a triple:
u = (S₀, G, B)
S₀ the snapshot of world state when the unit is created
(files, tickets, records: whatever the workspace captures)
G the definition of done: weighted checks {(g₁,w₁), …, (gₙ,wₙ)},
where each gᵢ is a predicate over a state snapshot
B the budget: what reaching the end state is worthWhen the agent reports done, the environment captures the after-snapshot S₁ and
computes the diff and the completion score:
ΔS = diff(S₀, S₁) what actually changed
Σᵢ wᵢ · gᵢ(S₁)
V(u) = ───────────────── ∈ [0, 1] completion score
Σᵢ wᵢ
done(u) = V(u) ≥ τ acceptance, typically τ = 1Note that every check runs against S₁, the state the environment captured,
never against the agent's account of what it did. The score is a property of the
world, not of the report.
Settlement then compares the budget against the metered cost:
C(u) = Σ tokens × price + tool costs metered by the environment
margin(u) = B − C(u) your efficiency on this unit
value-per-$ = V(u) · B / C(u) outcome value per dollar of AI spendAnd across a portfolio of units, the two numbers that matter:
realized value = Σᵤ V(u) · B(u)
intervention rate = |{u : V(u) < τ}| / N how often a human must step inFigure 4: A worked example. A support ticket carries three weighted checks; the after-snapshot passes two of them, so V = 0.8. The unit does not settle until the remaining check passes: partial credit is visible, but payment is for the outcome.
A micro-benchmark
To make the arithmetic concrete, we simulated 1,000 units each of three common unit types, with per-check pass probabilities and token costs drawn from plausible ranges (seeded, strict acceptance τ = 1):
# per unit: V = Σ wᵢ·gᵢ(S₁) / Σ wᵢ ; done if V ≥ 1.0 ; margin = B − C
types = {
"support_ticket": dict(B=4.00, checks=[("status=closed",.5,.97),("reply_sent",.3,.99),("kb_linked",.2,.85)]),
"contract_review": dict(B=25.00, checks=[("risks_flagged",.4,.92),("redlines",.4,.90),("summary_filed",.2,.98)]),
"weekly_report": dict(B=12.00, checks=[("data_pulled",.3,.99),("analysis",.4,.94),("delivered",.3,.97)]),
}| Unit type | Budget B | Pass rate | E[V] | E[C] | Margin | Value per $ | Intervention |
|---|---|---|---|---|---|---|---|
| Support ticket | $4.00 | 81.6% | 0.954 | $0.42 | $3.58 | 9.2× | 18.4% |
| Contract review | $25.00 | 80.0% | 0.918 | $3.14 | $21.86 | 7.3× | 20.0% |
| Weekly report | $12.00 | 90.1% | 0.963 | $1.50 | $10.50 | 7.7× | 9.9% |
Three things to read off this table. First, margin is dominated by the budget,
not the token bill: even at strict acceptance, cost is roughly a tenth of the
outcome's value, which is why pricing on outcomes rather than tokens is viable.
Second, the intervention rate falls directly out of the scoring rule: it is just
the share of units where some check failed, which makes the Phase 1 health
signal (is the intervention rate falling?) measurable for free. Third, the
weakest check dominates the pass rate: the support ticket's 0.85-probability
kb_linked check causes most of its 18.4% intervention rate. Improving one
environmental check moves the whole economics of the unit.
Why the snapshots must come from the environment
The same simulation, re-run with one change: completion is taken from the agent's self-report instead of the environment's snapshot, with the agent claiming done on 18% of incomplete units (an optimistic self-assessment rate, not an adversarial one):
| Verification source | Units settled | Settled but actually incomplete |
|---|---|---|
| Agent self-report | 84.1% | 4.4% |
Environment snapshot (S₁) | 82.1% | 0.0% |
Self-reporting settles slightly more units, and silently pays for incomplete work, at a rate that compounds across thousands of units and worsens under optimization pressure, since "claim done" is always cheaper than "be done." State-snapshot verification cannot be gamed this way, because the agent does not produce the evidence; the environment does. This is the calculation-level restatement of the environment argument: the environment is what makes the number real.