How a unit of work is calculated

Snapshot before, snapshot after, compare against weighted checks, with formulas and a micro-benchmark.

We measure a unit of work the same way a manager judges work today: snapshot the world before, snapshot it after, and compare the two against the definition of done.

The calculation needs no new accounting. Scoring, settlement, and efficiency all fall out of one state comparison.

The state-snapshot model

Formally, a unit of work is a triple:

u = (S₀, G, B)

S₀  the snapshot of world state when the unit is created
     (files, tickets, records: whatever the workspace captures)
G   the definition of done: weighted checks {(g₁,w₁), …, (gₙ,wₙ)},
     where each gᵢ is a predicate over a state snapshot
B   the budget: what reaching the end state is worth

When the agent reports done, the environment captures the after-snapshot S₁ and computes the diff and the completion score:

ΔS = diff(S₀, S₁)                      what actually changed

         Σᵢ wᵢ · gᵢ(S₁)
V(u) = ─────────────────  ∈ [0, 1]      completion score
            Σᵢ wᵢ

done(u) = V(u) ≥ τ                      acceptance, typically τ = 1

Note that every check runs against S₁, the state the environment captured, never against the agent's account of what it did. The score is a property of the world, not of the report.

Settlement then compares the budget against the metered cost:

C(u) = Σ tokens × price + tool costs    metered by the environment

margin(u)     = B − C(u)                 your efficiency on this unit
value-per-$   = V(u) · B / C(u)          outcome value per dollar of AI spend

And across a portfolio of units, the two numbers that matter:

realized value     = Σᵤ V(u) · B(u)
intervention rate  = |{u : V(u) < τ}| / N     how often a human must step in

Snapshot S₀

Agent executes

Snapshot S₁

Score V = Σwᵢ·gᵢ(S₁)

status = closedw = 0.5

reply sentw = 0.3

KB article linkedw = 0.2

V = 0.5 + 0.3 + 0 = 0.8 → below τ = 1.0, one check left to fixsettle when V ≥ τ : compare budget B against metered cost C

Figure 4: A worked example. A support ticket carries three weighted checks; the after-snapshot passes two of them, so V = 0.8. The unit does not settle until the remaining check passes: partial credit is visible, but payment is for the outcome.

A micro-benchmark

To make the arithmetic concrete, we simulated 1,000 units each of three common unit types, with per-check pass probabilities and token costs drawn from plausible ranges (seeded, strict acceptance τ = 1):

# per unit: V = Σ wᵢ·gᵢ(S₁) / Σ wᵢ ;  done if V ≥ 1.0 ;  margin = B − C
types = {
  "support_ticket":  dict(B=4.00,  checks=[("status=closed",.5,.97),("reply_sent",.3,.99),("kb_linked",.2,.85)]),
  "contract_review": dict(B=25.00, checks=[("risks_flagged",.4,.92),("redlines",.4,.90),("summary_filed",.2,.98)]),
  "weekly_report":   dict(B=12.00, checks=[("data_pulled",.3,.99),("analysis",.4,.94),("delivered",.3,.97)]),
}

Unit type	Budget B	Pass rate	E[V]	E[C]	Margin	Value per $	Intervention
Support ticket	$4.00	81.6%	0.954	$0.42	$3.58	9.2×	18.4%
Contract review	$25.00	80.0%	0.918	$3.14	$21.86	7.3×	20.0%
Weekly report	$12.00	90.1%	0.963	$1.50	$10.50	7.7×	9.9%

Three things to read off this table. First, margin is dominated by the budget, not the token bill: even at strict acceptance, cost is roughly a tenth of the outcome's value, which is why pricing on outcomes rather than tokens is viable. Second, the intervention rate falls directly out of the scoring rule: it is just the share of units where some check failed, which makes the Phase 1 health signal (is the intervention rate falling?) measurable for free. Third, the weakest check dominates the pass rate: the support ticket's 0.85-probability kb_linked check causes most of its 18.4% intervention rate. Improving one environmental check moves the whole economics of the unit.

Why the snapshots must come from the environment

The same simulation, re-run with one change: completion is taken from the agent's self-report instead of the environment's snapshot, with the agent claiming done on 18% of incomplete units (an optimistic self-assessment rate, not an adversarial one):

Verification source	Units settled	Settled but actually incomplete
Agent self-report	84.1%	4.4%
Environment snapshot (`S₁`)	82.1%	0.0%

Self-reporting settles slightly more units, and silently pays for incomplete work, at a rate that compounds across thousands of units and worsens under optimization pressure, since "claim done" is always cheaper than "be done." State-snapshot verification cannot be gamed this way, because the agent does not produce the evidence; the environment does. This is the calculation-level restatement of the environment argument: the environment is what makes the number real.

Next: how environments help businesses.

How a unit of work is calculated

The state-snapshot model

A micro-benchmark

Why the snapshots must come from the environment

On this page