Appendix D. Threshold Calibration

This is a reference appendix. You can skip it on a first read: the training minimum of each chapter is designed for the default thresholds of AgentClinic-production. This file collects all "Low / Default / High" tables, threshold shift exercises, and signals that indicate when a threshold needs to be revisited. Use it when porting the process to your own project, once the standard values stop fitting.

The principle common to all tables: thresholds only make sense in pairs. Shifting one value without recalculating its pair is not calibration; it dismantles the circuit. Each section explicitly lists the risks of such a shift.

D.1 Mutation Testing (Chapter 5)

The numbers from Chapter 5 are the default values for AgentClinic-production with a medium incident flow and a mature SDD process. In your project, thresholds depend on the cost of a P0 miss, the complexity of the route graph, the CI SLA window, and the stability of the incoming flow. Shifting any row must be accompanied by an entry in validation.md with justification.

Project parameter	Low	Default (AgentClinic)	High

Exercise

cd book2/examples/stress-mutator

mkdir -p out
cp expected/expected_failures.json out/expected_failures_depth5.json
sed -i 's/"depth_of_diagnostics_min": 3/"depth_of_diagnostics_min": 5/' out/expected_failures_depth5.json

python3 scripts/immunity_score.py \
  --validator-results out/validator_results.json \
  --expected expected/expected_failures.json \
  --out out/immunity_default.json

python3 scripts/immunity_score.py \
  --validator-results out/validator_results.json \
  --expected out/expected_failures_depth5.json \
  --out out/immunity_depth5.json

The first run should pass: the average depth of diagnostics equals 4 and exceeds the threshold of 3. The second run should exit with code 1: the same validator no longer passes the artificially tightened threshold depth_of_diagnostics_min = 5. The delta shows not a new defect in mutants, but the cost of tightening the threshold.
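The pass/fail behaviour of the two runs can be sketched in a few lines. This is a minimal illustration, not the logic of immunity_score.py: the function name `depth_gate`, the per-mutant depth list, and the assumption that the gate compares the mean depth against depth_of_diagnostics_min are all hypothetical.

```python
from statistics import mean

def depth_gate(depths, depth_of_diagnostics_min):
    """Return a process-style exit code: 0 if the mean depth meets the minimum, else 1."""
    return 0 if mean(depths) >= depth_of_diagnostics_min else 1

# Hypothetical per-mutant diagnostic depths whose mean is 4.
depths = [3, 4, 5]

default_code = depth_gate(depths, 3)  # default threshold
tight_code = depth_gate(depths, 5)    # artificially tightened threshold
print(default_code, tight_code)
```

The validator data is identical in both calls; only the threshold differs, which is the point of the exercise.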

When to revisit the threshold

Over a quarter, no merge is blocked by the threshold — it is excessively low.
More than 10 regressions with the same mutation_id in a week — depth_of_diagnostics is insufficient, increase it.
recovery_time_p95 drops toward zero while strict_reject_rate rises — a Goodhart symptom.
A new class of incidents has appeared — recalculate all three thresholds from scratch.
One seed repeats the same set of mutation_id for five sprints in a row — seed rotation is needed.
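The second signal in the list (more than 10 regressions with the same mutation_id in a week) could be checked roughly as follows. The record layout and the helper name `noisy_mutations` are assumptions made for illustration; the source does not show how regressions are stored.

```python
from collections import Counter

def noisy_mutations(regressions, limit=10):
    """Return mutation_ids that regressed more than `limit` times in the given week."""
    counts = Counter(r["mutation_id"] for r in regressions)
    return sorted(mutation_id for mutation_id, n in counts.items() if n > limit)

# Hypothetical week of regression records: 11 for one id, exactly 10 for another.
week = [{"mutation_id": "M-017"}] * 11 + [{"mutation_id": "M-042"}] * 10
flagged = noisy_mutations(week)
print(flagged)
```

Only the id with 11 regressions is flagged, because the rule is "more than 10", not "10 or more".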

Risk: if strict_reject_rate rises while depth_of_diagnostics simultaneously falls, this is a Goodhart symptom. Both parameters move only as a pair.

D.2 Shadow Specification Selection (Chapter 6)

The weights 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation and the keep/reject thresholds are the default values for AgentClinic-production. In your project, they depend on the cost of a false escalation, the importance of the early signal, the size of the historical base, and the available budget for example hints.

Parameter	Low	Default (AgentClinic)	High
False escalation cost	penalty `false_escalation: 0.2–0.3`	`0.4`	`0.6–0.8` (healthcare, payments)
Importance of early signal	weight `early_signal: 0.2`	`0.3`	`0.4–0.5` (blast radius >5 services)
Historical base size	20–50 cases (smoke check)	50+ cases	200+ cases with window rotation
Example hints budget	`keep-threshold 0.80`, 4 slots	`0.70`, 8 slots / 2000 tokens	`0.60`, 12 slots / 4000 tokens

Exercise

Run the auction with a conservative risk profile (higher penalty for false escalation):

cd book2/examples/shadow-auction
python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights "0.3,0.4,0.2,0.8" --out out/scorebook.json

python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json

Under this profile shadow.p0.voice_handoff moves from winner to disputed, while shadow.alert.red_color_urgency stays in rejected. This is a manifestation of the new profile: the team rewards MTTR reduction less and penalizes false escalation more heavily.
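The way a weight profile moves a candidate between outcomes can be sketched as below. This is an illustration under assumptions, not the code of score.py or decide.py: the order of the four weights (mttr_gain, early_signal, coverage, false_escalation), the decision rule, and the candidate metrics are hypothetical.

```python
def score(c, weights):
    """Weighted score: three rewards minus the false-escalation penalty."""
    w_mttr, w_early, w_cov, w_false = weights
    return (w_mttr * c["mttr_gain"] + w_early * c["early_signal"]
            + w_cov * c["coverage"] - w_false * c["false_escalation"])

def decide(value, keep=0.70, reject=0.40):
    """Assumed rule: >= keep is a winner, < reject is rejected, otherwise disputed."""
    if value >= keep:
        return "winner"
    if value < reject:
        return "rejected"
    return "disputed"

# Hypothetical candidate metrics, each normalised to [0, 1].
candidate = {"mttr_gain": 0.9, "early_signal": 0.8,
             "coverage": 0.9, "false_escalation": 0.2}

default_verdict = decide(score(candidate, (0.5, 0.3, 0.2, 0.4)))
conservative_verdict = decide(score(candidate, (0.3, 0.4, 0.2, 0.8)))
print(default_verdict, conservative_verdict)
```

With these made-up inputs the score drops from about 0.79 to about 0.61, which crosses the keep threshold of 0.70 without reaching the reject threshold of 0.40.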

When to revisit the threshold

Over a month, no winner showed a positive effect in post-mortems — keep-threshold is too low.
The share of disputed is stably above 40% — the formula does not distinguish cases.
In a single phase, more than 8 winners are chosen — budget-tokens was chosen without accounting for the QWEN.md size.
A new class of incidents has appeared outside the historical data.
mttr_gain and false_escalation grow together — a Goodhart symptom.

Risk: the false_escalation penalty and the mttr_gain weight move only as a pair. Shifting one without revisiting the other breaks the link "useful signal ↔ false noise".

D.3 Tiered Budgets (Chapter 9)

The 10M token budget with a 9M/1M split (local/frontier) is the default value for AgentClinic-production with a medium incident flow. In your project, the budget size and proportions depend on the incident flow, the average phase cost, the share of disputed reviews, and sensitivity to local-coder outages.

Project parameter	Low	Default (AgentClinic)	High
Incidents per day	≤50/day → 2–3M tokens, 90/10	200/day → 10M, 9M/1M (90/10)	≥500/day → 25–40M, 80/20
Phase cost (tokens)	~20K	~50K	100K+ (multi-step replay)
Share of disputed reviews	≤5% → frontier 5–7%	~10% → 1M (10%)	15–25% → 15–20% frontier
Sensitivity to `local-coder` outage	≤1/month → 5% reserve	2–4/month → 7%	weekly → 15% + duplicated provider

Exercise

cd book2/examples/budget-keeper

python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json
python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json

python3 scripts/inspect.py --result out/fail_result_5m.json --query "failover_to_frontier==2 && degraded_queue==18 && token_health_min>=0.5"

Check whether token_health_min stayed at or above 0.5 with half the budget. In the ready-made 5M variant, the proportions are preserved: the local tier gets 4.5M, frontier gets 0.5M. If you change only daily_budget_tokens but not the phase quotas, compile.py must fail with a sum error.
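The sum check can be pictured with a small sketch. It is a simplification under assumptions: the real spec format of compile.py is not shown in this appendix, so the `quotas` key, the per-tier (rather than per-phase) layout, and the function name `check_budget` are hypothetical.

```python
def check_budget(spec):
    """Raise ValueError when the quotas do not add up to the daily budget."""
    total = sum(spec["quotas"].values())
    if total != spec["daily_budget_tokens"]:
        raise ValueError(
            f"quotas sum to {total}, expected {spec['daily_budget_tokens']}"
        )
    # Return each quota's share of the budget for inspection.
    return {name: quota / total for name, quota in spec["quotas"].items()}

# 5M variant with the 90/10 proportions preserved.
ok_spec = {"daily_budget_tokens": 5_000_000,
           "quotas": {"local": 4_500_000, "frontier": 500_000}}
shares = check_budget(ok_spec)

# Budget halved, quotas left at the 10M defaults: the check must fail.
stale_spec = {"daily_budget_tokens": 5_000_000,
              "quotas": {"local": 9_000_000, "frontier": 1_000_000}}
try:
    check_budget(stale_spec)
    stale_error = None
except ValueError as exc:
    stale_error = str(exc)
print(shares, stale_error)
```

Failing loudly on a mismatched sum is what keeps the budget size and the split from drifting apart.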

When to revisit the threshold

Over a month, no degraded_mode trigger — the budget is excessive or the actual flow is lower than expected.
token_health_min drops below 0.5 more than once a week — the local tier is insufficient.
failover_to_frontier is stably 0 during local tier failures — the gateway is too strict, frontier is not working as insurance.
The share of manual_queue after manual timeout grows for two months in a row — manual_timeout_sec is too short.
Less than 60% of daily_budget_tokens is spent per day — time to shrink the budget.

Risk: the 9M/1M split is tied to per-phase SLA. It cannot be shifted without updating budget_plan_phases in the spec — frontier will stop fitting "disputed" cases.

D.4 Protecting Metrics from Goodhart (Chapter 10)

The thresholds silent_p0 ≤ 5%, manual_review_rate ≥ 15%, edge_drift ≤ 0.12, audit_trace_coverage = 1.0 are the default values for AgentClinic-production. In your project, they depend on the cost of a missed P0, the availability of manual reviewers, the dynamics of the incoming flow, and regulatory requirements for audit.

Project parameter	Low	Default (AgentClinic)	High
Cost of a missed P0	`silent_p0 ≤ 8%`	`≤ 5%`	`≤ 1–2%` (payments)
Availability of manual reviewers	`manual_review_rate ≥ 8%`	`≥ 15%`	`≥ 25%` (regulatory)
Input dynamics	`edge_drift ≤ 0.20`	`≤ 0.12`	`≤ 0.05` (seasonal peaks)
Audit regulation	`audit_trace_coverage ≥ 0.95`	`= 1.00`	`= 1.00` + signed tracing

Exercise

cd book2/examples/goodhart-validator

mkdir -p out

# Copy spec to local out/ and loosen silent_p0_cap to 0.08
cp specs/validation.yaml out/validation_loose.yaml
sed -i 's/threshold: 0.05/threshold: 0.08/' out/validation_loose.yaml

python3 scripts/run_validation.py \
  --validation out/validation_loose.yaml \
  --metrics fixtures/new_metrics_bad.json

# Dangerous variant: loosen two independent guards at once
cp specs/validation.yaml out/validation_unsafe.yaml
sed -i 's/threshold: 0.15/threshold: 0.10/' out/validation_unsafe.yaml
sed -i 's/threshold: 0.05/threshold: 0.20/' out/validation_unsafe.yaml

python3 scripts/run_validation.py \
  --validation out/validation_unsafe.yaml \
  --metrics fixtures/new_metrics_bad.json

The first run should stay red: a bad release with silent_p0=0.18 still violates silent_p0_cap. The second, dangerous, variant passes only because two independent guards are loosened simultaneously. This shows why guard metrics cannot be calibrated one YAML line at a time.
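The effect of the two variants can be reproduced with a small guard evaluator. This is a sketch, not the logic of run_validation.py: silent_p0=0.18 is taken from the text, while the other metric values and the data layout are hypothetical.

```python
import operator

# Default guard thresholds from this section: (comparison, limit).
DEFAULT_GUARDS = {
    "silent_p0": (operator.le, 0.05),
    "manual_review_rate": (operator.ge, 0.15),
    "edge_drift": (operator.le, 0.12),
    "audit_trace_coverage": (operator.eq, 1.0),
}

def violations(metrics, guards):
    """Return the sorted names of guards that the metrics violate."""
    return sorted(name for name, (compare, limit) in guards.items()
                  if not compare(metrics[name], limit))

# silent_p0=0.18 comes from the text; the other values are hypothetical.
bad_release = {"silent_p0": 0.18, "manual_review_rate": 0.12,
               "edge_drift": 0.10, "audit_trace_coverage": 1.0}

loose = {**DEFAULT_GUARDS, "silent_p0": (operator.le, 0.08)}
unsafe = {**DEFAULT_GUARDS,
          "silent_p0": (operator.le, 0.20),
          "manual_review_rate": (operator.ge, 0.10)}

default_result = violations(bad_release, DEFAULT_GUARDS)
loose_result = violations(bad_release, loose)
unsafe_result = violations(bad_release, unsafe)
print(default_result, loose_result, unsafe_result)
```

Under these assumed metrics the loosened spec reports the same violations as the default one, and only the doubly loosened spec comes back empty.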

When to revisit the threshold

Over a quarter, no release is blocked by silent_p0_cap: either the team is not making risky changes, or the threshold is too soft.
manual_review_rate drops for three sprints in a row as mttr_gain rises: a Goodhart symptom, meaning manual reviewers have stopped acting as insurance.
edge_drift consistently hovers around 0.10–0.11: real input dynamics are close to the threshold.
audit_trace_coverage fell below 1.0 in even a single run: this violates the regulatory invariant and calls for a hot-fix, not calibration.
A new class of incidents has appeared that does not fall under silent_p0: new invariants are needed, not a revision of the old ones.
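The Goodhart signal from this list (manual_review_rate falling for three sprints in a row while mttr_gain rises) can be expressed as a simple trend check. The function name, the per-sprint series, and the reading of "three sprints in a row" as three consecutive sprint-over-sprint changes are assumptions.

```python
def goodhart_symptom(manual_review_rate, mttr_gain, drops=3):
    """True if the last `drops` sprint-over-sprint changes show
    manual_review_rate falling while mttr_gain rises."""
    if min(len(manual_review_rate), len(mttr_gain)) < drops + 1:
        return False  # not enough history to judge
    review = manual_review_rate[-(drops + 1):]
    gain = mttr_gain[-(drops + 1):]
    falling = all(b < a for a, b in zip(review, review[1:]))
    rising = all(b > a for a, b in zip(gain, gain[1:]))
    return falling and rising

# Hypothetical per-sprint series, oldest first.
symptom = goodhart_symptom([0.20, 0.18, 0.16, 0.15], [0.30, 0.35, 0.41, 0.44])
healthy = goodhart_symptom([0.20, 0.18, 0.19, 0.18], [0.30, 0.35, 0.41, 0.44])
print(symptom, healthy)
```

The second series is not flagged because the review rate recovers in one sprint, which breaks the run of consecutive drops.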

Risks: silent_p0 and manual_review_rate move only as a pair. edge_drift only makes sense when audit_trace_coverage=1.0, otherwise drift is computed on a partial sample. All four thresholds form a single risk contract: weakening one in isolation from the others means breaking it, not tuning it.

Full Metric Network

The chapter text uses a simplified mermaid diagram with three metrics and one guard. The full dependency network looks like this:

flowchart LR
    MTTR[MTTR]
    silent_p0[silent_p0]
    manual_review_rate[manual_review_rate]
    escalation_rate[escalation_rate]
    postmortem_regression[postmortem_regression]
    audit_trace_coverage[audit_trace_coverage]
    silent_p0 -->|positive_interdependence| MTTR
    escalation_rate -->|positive_interdependence| MTTR
    manual_review_rate -->|negative_interdependence| MTTR
    manual_review_rate -->|negative_interdependence| escalation_rate
    audit_trace_coverage -->|negative_interdependence| escalation_rate
    audit_trace_coverage -->|negative_interdependence| silent_p0
    postmortem_regression -->|positive_interdependence| audit_trace_coverage
    postmortem_regression -->|negative_interdependence| manual_review_rate

The logic is the same as in the simplified version: the red zone is MTTR and silent_p0; the path to weakening it goes through reducing manual review and losing the audit trail.
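To see which metrics can reach the red zone, the diagram's edges can be loaded into a list and walked backwards. The edges below are copied from the diagram; the traversal helper `upstream` is an illustrative addition, not part of the book's tooling.

```python
# Edges copied from the diagram: (source, target, relation).
EDGES = [
    ("silent_p0", "MTTR", "positive"),
    ("escalation_rate", "MTTR", "positive"),
    ("manual_review_rate", "MTTR", "negative"),
    ("manual_review_rate", "escalation_rate", "negative"),
    ("audit_trace_coverage", "escalation_rate", "negative"),
    ("audit_trace_coverage", "silent_p0", "negative"),
    ("postmortem_regression", "audit_trace_coverage", "positive"),
    ("postmortem_regression", "manual_review_rate", "negative"),
]

def upstream(target):
    """Return every metric with a directed path into `target`."""
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for src, dst, _ in EDGES:
            if dst == node and src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

mttr_inputs = upstream("MTTR")
silent_p0_inputs = upstream("silent_p0")
print(sorted(mttr_inputs), sorted(silent_p0_inputs))
```

Every other metric in the network has a path into MTTR, while silent_p0 is reachable only through audit_trace_coverage and postmortem_regression.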

D.5 Production Readiness (Chapter 11)

The 23/25 threshold is the default value for AgentClinic-production with medium SDD process maturity and mixed action types. In your project, the threshold depends on the cost of a cutover error, process maturity, manual review load, and action type (stateless / stateful).

Project parameter	Low	Default (AgentClinic)	High
Cost of cutover error	internal tool: 21–22/25 only semi-manual	mixed production: auto ≥23/25	payments/healthcare: auto ≥24/25
SDD process maturity	3 months → only semi-manual 20–22	6+ months → semi-manual 20–22, auto 23+	12+ months + 50+ replays → auto 23+, fewer manual stops

Exercise

The check_readiness.py script hardcodes THRESHOLD = 23. Run it with a different value via a copy:

cd book2/examples/real-api && mkdir -p out
cp scripts/check_readiness.py out/check_readiness_t22.py
sed -i 's/THRESHOLD = 23/THRESHOLD = 22/' out/check_readiness_t22.py
python3 out/check_readiness_t22.py --readiness fixtures/readiness_block_audit.json

With THRESHOLD = 22, readiness_block_audit.json is still blocked due to audit_trace_coverage=0.7 < 1.0, even though the 22/25 sum passes. This shows that audit_trace_coverage is an independent blocking invariant, not part of the sum. The exercise is about threshold sensitivity, not a recommendation to lower the auto-tolerance.

When to revisit the threshold

Over a quarter, no readiness check is blocked by the threshold: it is too low for the team's current maturity.
The share of semi-manual incidents grows for three sprints in a row: the 23/25 threshold is not being met because of a systemic gap in Verification or Process.
A class of actions with stateful=true has appeared: require backup_verified and raise the threshold for that class to 24/25.
All readiness failures over a month come from a single axis: this is a gap in the SDD templates, so fix the templates rather than moving the threshold.
Readiness artifact build time exceeds the cutover SLA: reconsider which points can be automated, rather than lowering the threshold.

Risk: a zero Security score blocks merging at any total, including one that meets the 23/25 threshold. Lowering the threshold below 23/25 changes the operating mode: this is no longer auto-tolerance, but a semi-manual or canary mode. Even the "low" setting (21/25) means a stop after every implement step with explicit operator confirmation, not a license for the agent to perform remediation on its own.
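The relationship between the sum and the independent blockers can be sketched as follows. This is not the code of check_readiness.py: the five-axis, 0 to 5 point layout is an assumption, only Verification, Process and Security are named in the text, and treating every total below the threshold as semi-manual is a simplification.

```python
def readiness_mode(scores, audit_trace_coverage, threshold=23):
    """Classify a readiness report as 'blocked', 'auto', or 'semi-manual'."""
    # Independent invariants: checked before, and regardless of, the sum.
    if audit_trace_coverage < 1.0 or scores.get("Security", 0) == 0:
        return "blocked"
    return "auto" if sum(scores.values()) >= threshold else "semi-manual"

# Hypothetical report totalling 22/25; axis_4 and axis_5 are placeholder names.
report = {"Verification": 5, "Process": 4, "Security": 4,
          "axis_4": 5, "axis_5": 4}

# The exercise scenario: threshold lowered to 22, audit coverage at 0.7.
blocked = readiness_mode(report, audit_trace_coverage=0.7, threshold=22)
# Default threshold with full audit coverage: 22/25 is below 23.
default_mode = readiness_mode(report, audit_trace_coverage=1.0)
# Zero Security blocks even when the rest of the report is perfect.
no_security = readiness_mode({**report, "Security": 0}, audit_trace_coverage=1.0, threshold=15)
print(blocked, default_mode, no_security)
```

Checking the invariants before the sum mirrors the exercise result: no threshold value can turn a report with incomplete audit coverage into a pass.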