Appendix D. Threshold Calibration
This is a reference appendix. You do not need it on a first read: the core material of each chapter assumes the default thresholds of AgentClinic-production. This file collects all "Low / Default / High" tables, threshold-shifting exercises, and signals indicating when a threshold needs to be revised. Use it when you transfer the process to your own project and the standard values no longer fit.
The principle common to all tables: thresholds only make sense in pairs. Shifting one value without recalculating the related one is not calibration, but dismantling the control loop. Each section explicitly lists the risks of such a shift.
D.1 Mutation Testing (Chapter 5)
The numbers in Chapter 5 are the default values for AgentClinic-production with a medium incident flow and a mature SDD process. In your project, the thresholds depend on the cost of missing a P0, the complexity of the routing graph, the CI SLA window, and the stability of the incoming flow. Any row shift must be accompanied by an entry in validation.md with justification.
| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of missing P0 | strict_reject_rate ≥ 0.92 | **≥ 0.98** | ≥ 0.995 (payments, healthcare) |
| Routing graph complexity | depth_of_diagnostics ≥ 2 (≤10 edges) | **≥ 3** (10–50 edges) | ≥ 5 (>100 edges, multi-tenant) |
| CI SLA window | recovery_time_p95_ms ≤ 800 | **≤ 1200** | ≤ 1500 (>500 PR/day) |
| Incident flow stability | 1 mutant per class | 2 mutants per class | 5+ mutants per class + seed rotation |
Exercise
cd book2/examples/stress-mutator
mkdir -p out
cp expected/expected_failures.json out/expected_failures_depth5.json
sed -i 's/"depth_of_diagnostics_min": 3/"depth_of_diagnostics_min": 5/' out/expected_failures_depth5.json
python3 scripts/immunity_score.py \
--validator-results out/validator_results.json \
--expected expected/expected_failures.json \
--out out/immunity_default.json
python3 scripts/immunity_score.py \
--validator-results out/validator_results.json \
--expected out/expected_failures_depth5.json \
--out out/immunity_depth5.json
The first run should pass: the average diagnostic depth is 4 and exceeds the threshold of 3. The second run should exit with code 1: the same validator no longer passes the artificially tightened threshold depth_of_diagnostics_min = 5. The delta shows not a new defect in the mutants, but the cost of tightening the threshold.
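The pass/fail behavior of the two runs can be sketched as a minimal gate. This is not the code of immunity_score.py, whose internals are not shown in this appendix: the helper name `passes_depth_gate`, the use of a plain average, and the per-mutant depth values are all assumptions made for illustration.

```python
from statistics import mean


def passes_depth_gate(depths: list[int], depth_min: int) -> bool:
    """Return True when the average diagnostic depth meets the minimum.

    Hypothetical helper: the real script may aggregate depths differently.
    """
    return mean(depths) >= depth_min


# Hypothetical per-mutant depths whose average is 4, as in the exercise.
depths = [3, 4, 5, 4]

default_ok = passes_depth_gate(depths, depth_min=3)    # default threshold
tightened_ok = passes_depth_gate(depths, depth_min=5)  # tightened threshold

# A failing gate maps to exit code 1, mirroring the second run.
exit_code = 0 if tightened_ok else 1
print(default_ok, tightened_ok, exit_code)
```

The validator results are identical in both calls; only `depth_min` changes, which is why the difference measures the cost of the threshold rather than a new defect.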
When to Reconsider the Threshold
- No merge has been blocked by the threshold for a quarter: it is excessively low.
- More than 10 regressions with the same mutation_id in a week: depth_of_diagnostics is insufficient, increase it.
- recovery_time_p95 drops to zero as strict_reject_rate rises: a sign of Goodhart's law.
- A new class of incidents has appeared: recalculate all three thresholds from scratch.
- One seed (seed) repeats the same set of mutation_id for five sprints in a row: seed rotation is needed.
Risk: if strict_reject_rate rises while depth_of_diagnostics simultaneously falls, this is a symptom of Goodhart's law. Both parameters must be moved only as a pair.
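One way to watch for this symptom is a small check over the per-sprint history of both metrics. The sketch below is deliberately crude: it compares only the first and last observation, and the function name `goodhart_symptom` and the sample histories are hypothetical, not part of the book's examples.

```python
def goodhart_symptom(reject_rates: list[float], depths: list[float]) -> bool:
    """Flag the case where strict_reject_rate rises while depth_of_diagnostics falls.

    Assumption: comparing the first and last observation is enough for a
    first warning; a real check would likely use a trend over a window.
    """
    if len(reject_rates) < 2 or len(depths) < 2:
        return False  # not enough history to say anything
    rate_rises = reject_rates[-1] > reject_rates[0]
    depth_falls = depths[-1] < depths[0]
    return rate_rises and depth_falls


# Hypothetical histories, one value per sprint.
print(goodhart_symptom([0.97, 0.98, 0.99], [4, 3, 2]))  # rate up, depth down
print(goodhart_symptom([0.97, 0.98, 0.99], [3, 3, 4]))  # rate up, depth up
```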
D.2 Shadow Specification Selection (Chapter 6)
The weights 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation and the keep/reject thresholds are the default values for AgentClinic-production. In your project, they depend on the cost of false escalation, the importance of early signal, the size of the historical base, and the available budget for prompt examples.
| Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of false escalation | penalty false_escalation: 0.2–0.3 | **0.4** | 0.6–0.8 (healthcare, payments) |
| Importance of early signal | weight early_signal: 0.2 | **0.3** | 0.4–0.5 (blast radius >5 services) |
| Size of historical base | 20–50 cases (smoke test) | 50+ cases | 200+ cases with window rotation |
| Budget for prompt examples | keep-threshold 0.80, 4 slots | **0.70, 8 slots / 2000 tokens** | 0.60, 12 slots / 4000 tokens |
Exercise
Run the auction with a conservative risk profile (higher penalty for false escalation):
cd book2/examples/shadow-auction
python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights "0.3,0.4,0.2,0.8" --out out/scorebook.json
python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json
With this profile, shadow.p0.voice_handoff moves from winner to disputed, while shadow.alert.red_color_urgency remains in rejected. This is a manifestation of the new profile: the team rewards MTTR reduction less and penalizes false escalation more strongly.
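The scoring formula and the effect of the conservative profile can be sketched as follows. This is an illustration, not the code of score.py or decide.py: the candidate metrics are invented, the order of values in `--weights` is assumed to follow the order of terms in the formula, and the mapping of scores to winner / disputed / rejected (at or above keep, below reject, in between) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    mttr_gain: float = 0.5
    early_signal: float = 0.3
    coverage: float = 0.2
    false_escalation: float = 0.4  # applied as a penalty


def score(metrics: dict[str, float], w: Weights = Weights()) -> float:
    """Weighted sum from the formula, with false_escalation subtracted."""
    return (w.mttr_gain * metrics["mttr_gain"]
            + w.early_signal * metrics["early_signal"]
            + w.coverage * metrics["coverage"]
            - w.false_escalation * metrics["false_escalation"])


def decide(s: float, keep: float = 0.70, reject: float = 0.40) -> str:
    """Assumed mapping of a score to an auction outcome."""
    if s >= keep:
        return "winner"
    if s < reject:
        return "rejected"
    return "disputed"


# Hypothetical candidate; the values are not taken from candidates.yaml.
candidate = {"mttr_gain": 0.9, "early_signal": 0.9,
             "coverage": 0.8, "false_escalation": 0.2}

conservative = Weights(0.3, 0.4, 0.2, 0.8)  # assumed order of --weights

print(decide(score(candidate)))                # default profile
print(decide(score(candidate, conservative)))  # conservative profile
```

With these invented numbers the candidate scores about 0.80 under the default weights and about 0.63 under the conservative ones, so it drops from winner to disputed without any change in its own metrics.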
When to Reconsider the Threshold
- No winner has shown a positive effect in post-mortems for a month: keep-threshold is too low.
- The share of disputed is consistently above 40%: the formula does not distinguish cases.
- More than 8 winners are selected in one phase: budget-tokens was chosen without considering the size of QWEN.md.
- A new class of incidents has appeared outside the historical data.
- mttr_gain and false_escalation rise together: a symptom of Goodhart's law.
Risk: the false_escalation penalty and the mttr_gain weight must be moved only as a pair. Shifting one without revisiting the other breaks the "useful signal ↔ false noise" link.
D.3 Tiered Budgets (Chapter 9)
The budget of 10M tokens with a 9M/1M split (local/frontier) is the default value for AgentClinic-production with a medium incident flow. In your project, the budget size and proportions depend on the incident flow, the average phase cost, the share of disputed reviews, and the sensitivity to local-coder downtime.
| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Incident flow per day | ≤50/day → 2–3M tokens, 90/10 | 200/day → 10M, 9M/1M (90/10) | ≥500/day → 25–40M, 80/20 |
| Phase cost (tokens) | ~20K | ~50K | 100K+ (multi-step replay) |
| Share of disputed reviews | ≤5% → frontier 5–7% | ~10% → 1M (10%) | 15–25% → 15–20% frontier |
| Sensitivity to local-coder downtime | ≤1 time/month → reserve 5% | 2–4 times/month → 7% | weekly → 15% + duplicated provider |
Exercise
cd book2/examples/budget-keeper
python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json
python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json
python3 scripts/inspect.py --result out/fail_result_5m.json --query "failover_to_frontier==2 && degraded_queue==18 && token_health_min>=0.5"
Check whether token_health_min stayed above 0.5 with half the budget. In the ready-made 5M variant, the proportions are preserved: the local tier gets 4.5M, frontier gets 0.5M. If only daily_budget_tokens is changed without the phase quotas, compile.py must fail with a sum error.
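The arithmetic behind the 5M variant and the sum check can be sketched as below. This is not the implementation of compile.py: the function names, the phase names `phase_a` and `phase_b`, and the exact error text are hypothetical, and only the 90/10 split and the requirement that quotas add up to daily_budget_tokens come from the text.

```python
def split_budget(daily_budget_tokens: int, local_share: float = 0.9) -> dict[str, int]:
    """Split the daily budget between the local and frontier tiers."""
    local = round(daily_budget_tokens * local_share)
    return {"local": local, "frontier": daily_budget_tokens - local}


def check_phase_quotas(daily_budget_tokens: int, phase_quotas: dict[str, int]) -> None:
    """Raise if the phase quotas do not add up to the daily budget."""
    total = sum(phase_quotas.values())
    if total != daily_budget_tokens:
        raise ValueError(
            f"phase quotas sum to {total}, expected {daily_budget_tokens}"
        )


print(split_budget(10_000_000))  # default budget
print(split_budget(5_000_000))   # halved budget, same proportions

# Hypothetical phase quotas still sized for the 10M budget.
stale_quotas = {"phase_a": 6_000_000, "phase_b": 4_000_000}
try:
    check_phase_quotas(5_000_000, stale_quotas)
except ValueError as err:
    print("compile error:", err)
```

Halving only the total while leaving the quotas untouched trips the sum check, which is the failure the exercise expects from compile.py.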
When to Reconsider the Threshold
- No degraded_mode trigger for a month: the budget is excessive or the actual flow is below expected.
- token_health_min drops below 0.5 more than once a week: the local tier is insufficient.
- failover_to_frontier is consistently 0 during local tier failures: the gateway is too strict, frontier does not work as insurance.
- The share of manual_queue after manual timeout grows for two months in a row: manual_timeout_sec is too short.
- Less than 60% of daily_budget_tokens is spent per day: it is time to shrink the budget.
Risk: the 9M/1M split is tied to the SLA by phases. It cannot be shifted without updating budget_plan_phases in the specification — frontier will no longer accommodate "disputed" cases.
D.4 Goodhart-Proofing Metrics (Chapter 10)
The thresholds silent_p0 ≤ 5%, manual_review_rate ≥ 15%, edge_drift ≤ 0.12, audit_trace_coverage = 1.0 are the default values for AgentClinic-production. In your project, they depend on the cost of a missed P0, the availability of manual reviewers, the dynamics of the incoming flow, and regulatory requirements for audit.
| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of missed P0 | silent_p0 ≤ 8% | **≤ 5%** | ≤ 1–2% (payments) |
| Availability of manual reviewers | manual_review_rate ≥ 8% | **≥ 15%** | ≥ 25% (regulatory) |
| Input dynamics | edge_drift ≤ 0.20 | **≤ 0.12** | ≤ 0.05 (seasonal peaks) |
| Audit regulation | audit_trace_coverage ≥ 0.95 | **= 1.00** | = 1.00 + signed trace |
Exercise
cd book2/examples/goodhart-validator
mkdir -p out
# Copy spec to local out/ and relax silent_p0_cap to 0.08
cp specs/validation.yaml out/validation_loose.yaml
sed -i 's/threshold: 0.05/threshold: 0.08/' out/validation_loose.yaml
python3 scripts/run_validation.py \
--validation out/validation_loose.yaml \
--metrics fixtures/new_metrics_bad.json
# Dangerous variant: relax two independent protections at once
cp specs/validation.yaml out/validation_unsafe.yaml
sed -i 's/threshold: 0.15/threshold: 0.10/' out/validation_unsafe.yaml
sed -i 's/threshold: 0.05/threshold: 0.20/' out/validation_unsafe.yaml
python3 scripts/run_validation.py \
--validation out/validation_unsafe.yaml \
--metrics fixtures/new_metrics_bad.json
The first run should remain red: a bad release with silent_p0=0.18 still violates silent_p0_cap. The second, dangerous variant only passes because two independent protections are simultaneously relaxed. This shows why guard metrics cannot be calibrated one YAML line at a time.
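The interaction of the two guards can be sketched as below. This is not the logic of run_validation.py: the function name, the key `manual_review_rate_min`, and the value manual_review_rate=0.12 are assumptions, since the contents of new_metrics_bad.json are not shown here. Only silent_p0=0.18 and the threshold values come from the exercise.

```python
def violations(metrics: dict[str, float], spec: dict[str, float]) -> list[str]:
    """Return the names of the guards that the metrics violate."""
    failed = []
    if metrics["silent_p0"] > spec["silent_p0_cap"]:
        failed.append("silent_p0_cap")
    if metrics["manual_review_rate"] < spec["manual_review_rate_min"]:
        failed.append("manual_review_rate_min")
    return failed


# silent_p0 is from the exercise; manual_review_rate is a hypothetical value.
bad_release = {"silent_p0": 0.18, "manual_review_rate": 0.12}

default = {"silent_p0_cap": 0.05, "manual_review_rate_min": 0.15}
loose = {**default, "silent_p0_cap": 0.08}          # first run
one_relaxed = {**default, "silent_p0_cap": 0.20}    # only one guard relaxed
unsafe = {"silent_p0_cap": 0.20, "manual_review_rate_min": 0.10}  # second run

for name, spec in [("default", default), ("loose", loose),
                   ("one_relaxed", one_relaxed), ("unsafe", unsafe)]:
    failed = violations(bad_release, spec)
    print(name, "red" if failed else "green", failed)
```

Under these assumptions, relaxing either guard alone keeps the release red; it turns green only when both are relaxed together.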
When to Reconsider the Threshold
- No release has been blocked by silent_p0_cap for a quarter: either the team is not making risky changes, or the threshold is excessively soft.
- manual_review_rate falls for three sprints in a row while mttr_gain rises: a symptom of Goodhart's law, manual reviewers have ceased to be a safety net.
- edge_drift stably fluctuates around 0.10–0.11: the real input dynamics are close to the threshold.
- audit_trace_coverage dropped below 1.0 in even a single run: a regulatory invariant violation, hot-fix, not calibration.
- A new class of incidents has appeared that does not fall into silent_p0: new invariants are needed, not a revision of old ones.
Risks: silent_p0 and manual_review_rate must be moved only as a pair. edge_drift only makes sense at audit_trace_coverage=1.0, otherwise drift is computed from a partial sample. All four thresholds form a single risk contract: weakening one in isolation from the rest means breaking it, not tuning it.
Full Metric Network
The text in the chapter uses a simplified mermaid diagram with three metrics and one guard. The full dependency network looks like this:
flowchart LR
MTTR[MTTR]
silent_p0[silent_p0]
manual_review_rate[manual_review_rate]
escalation_rate[escalation_rate]
postmortem_regression[postmortem_regression]
audit_trace_coverage[audit_trace_coverage]
silent_p0 -->|positive_interdependence| MTTR
escalation_rate -->|positive_interdependence| MTTR
manual_review_rate -->|negative_interdependence| MTTR
manual_review_rate -->|negative_interdependence| escalation_rate
audit_trace_coverage -->|negative_interdependence| escalation_rate
audit_trace_coverage -->|negative_interdependence| silent_p0
postmortem_regression -->|positive_interdependence| audit_trace_coverage
postmortem_regression -->|negative_interdependence| manual_review_rate
The logic is the same as in the simplified version: the red zone is MTTR and silent_p0; the path to its weakening goes through cutting back manual review and losing the audit trace.
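The edges of the full network can also be written down as data and queried. The sketch below copies the eight edges from the diagram and walks them backwards to list every metric that can influence a given node. The names `EDGES` and `upstream` and the "+" / "-" shorthand for positive and negative interdependence are illustrative choices, not part of the book's examples.

```python
from collections import deque

# (source, target, sign) triples copied from the flowchart.
EDGES = [
    ("silent_p0", "MTTR", "+"),
    ("escalation_rate", "MTTR", "+"),
    ("manual_review_rate", "MTTR", "-"),
    ("manual_review_rate", "escalation_rate", "-"),
    ("audit_trace_coverage", "escalation_rate", "-"),
    ("audit_trace_coverage", "silent_p0", "-"),
    ("postmortem_regression", "audit_trace_coverage", "+"),
    ("postmortem_regression", "manual_review_rate", "-"),
]


def upstream(target: str) -> set[str]:
    """Return every metric with a directed path into the target."""
    parents: dict[str, set[str]] = {}
    for src, dst, _sign in EDGES:
        parents.setdefault(dst, set()).add(src)
    seen: set[str] = set()
    queue = deque([target])
    while queue:  # breadth-first walk over reversed edges
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream("silent_p0")))
print(sorted(upstream("MTTR")))
```

Querying the red-zone nodes makes the dependency explicit: silent_p0 is reachable only through audit_trace_coverage and postmortem_regression, while MTTR is reachable from every other metric in the network.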
D.5 Production Readiness (Chapter 11)
The threshold of 23/25 is the default value for AgentClinic-production with medium SDD process maturity and a mixed action type. In your project, the threshold depends on the cost of a cutover error, process maturity, the load on manual review, and the nature of the actions (stateless / stateful).
| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of cutover error | internal tool: 21–22/25 semi-manual only | mixed production: auto ≥23/25 | payments/healthcare: auto ≥24/25 |
| SDD process maturity | 3 months → semi-manual only 20–22 | 6+ months → semi-manual 20–22, auto 23+ | 12+ months + 50+ replays → auto 23+, fewer manual stops |
| Manual review load | every pull request (~5/week) → can maintain 21–22 semi-manual | 20–30% of pull requests → auto 23+ | rarely → auto 24/25 |
| Action type | stateless → 22/25 only canary/semi-manual, auto 23+ | mixed → auto 23+ | stateful → auto 24+ and backup_verified |
Exercise
The script check_readiness.py hardcodes THRESHOLD = 23. Run it with a different value via a copy:
cd book2/examples/real-api && mkdir -p out
cp scripts/check_readiness.py out/check_readiness_t22.py
sed -i 's/THRESHOLD = 23/THRESHOLD = 22/' out/check_readiness_t22.py
python3 out/check_readiness_t22.py --readiness fixtures/readiness_block_audit.json
At THRESHOLD = 22, readiness_block_audit.json is still blocked because of audit_trace_coverage=0.7 < 1.0, even though the sum of 22/25 passes. This shows that audit_trace_coverage is an independent blocking invariant, not part of the sum. The exercise is about threshold sensitivity, not a recommendation to lower the auto-approval threshold.
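The difference between the sum threshold and a blocking invariant can be sketched as below. This is not the code of check_readiness.py: the function name, its signature, and the returned strings are hypothetical, and only the values 22/25, 0.7, and the default of 23 come from the exercise.

```python
def readiness_decision(score: int, audit_trace_coverage: float,
                       threshold: int = 23) -> str:
    """Apply the blocking invariant first, then the sum threshold."""
    if audit_trace_coverage < 1.0:
        # Independent invariant: no score can compensate for it.
        return "blocked: audit_trace_coverage"
    if score < threshold:
        return "blocked: score below threshold"
    return "pass"


# Fixture values from the exercise: 22/25 with audit_trace_coverage=0.7.
print(readiness_decision(22, 0.7, threshold=22))
# The same score with a complete audit trace, at both thresholds.
print(readiness_decision(22, 1.0, threshold=22))
print(readiness_decision(22, 1.0, threshold=23))
```

Checking the invariant before the sum is a design choice of this sketch: it guarantees that lowering `threshold` can never unblock a run with an incomplete audit trace.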
When to Reconsider the Threshold
- No readiness check has been blocked by the threshold for a quarter: it is too low for the current team maturity.
- The share of semi-manual incidents grows for three sprints in a row: the threshold of 23/25 is not reached due to a systematic gap in Verification or Process.
- A class of actions with stateful=true has appeared: require backup_verified and raise the threshold for this class to 24/25.
- All readiness failures within a month are along one axis: this is a gap in SDD templates; fix the templates, not the threshold.
- The time to assemble readiness artifacts exceeds the SLA for cutover: reconsider which points can be automated instead of lowering the threshold.
Risk: the threshold of 23/25 is incompatible with a zero score in Security at any sum — such a failure blocks the merge regardless of the total. Lowering below 23/25 changes the operating mode: it is no longer auto-approval, but semi-manual or canary mode. Even "low" (21/25) is a stop after every implement step and explicit operator confirmation, not the agent's right to perform remediation independently.