Appendix D. Threshold Calibration

This is a reference appendix. You do not need it on a first pass: the core material of each chapter assumes the default thresholds of AgentClinic-production. This file collects all "Low / Default / High" tables, the threshold-shifting exercises, and the signals that indicate when a threshold needs revising. Use it when you transfer the process to your own project and the standard values no longer fit.

The principle common to all tables: thresholds only make sense in pairs. Shifting one value without recalculating the related one is not calibration; it dismantles the control loop. Each section explicitly lists the risks of such a shift.
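The pairing principle can be sketched as a small check that flags a threshold moved without its partner. This is an illustration only: the `PAIRS` map is assembled from the "Risk" notes in D.1, D.2 and D.4, and the key names and the `unpaired_changes` helper are hypothetical, not part of the course scripts.

```python
# Hypothetical pairing map, assembled from the "Risk" notes in D.1, D.2 and D.4.
PAIRS = [
    ("strict_reject_rate", "depth_of_diagnostics"),
    ("false_escalation_penalty", "mttr_gain_weight"),
    ("silent_p0", "manual_review_rate"),
]


def unpaired_changes(old: dict, new: dict) -> list[str]:
    """Return messages for thresholds that moved while their partner stayed put."""
    changed = {k for k in set(old) | set(new) if old.get(k) != new.get(k)}
    problems = []
    for a, b in PAIRS:
        # Exactly one side of the pair changed: that is an isolated shift.
        if (a in changed) != (b in changed):
            moved, fixed = (a, b) if a in changed else (b, a)
            problems.append(f"{moved} changed without {fixed}")
    return problems


old = {"silent_p0": 0.05, "manual_review_rate": 0.15}
new = {"silent_p0": 0.08, "manual_review_rate": 0.15}
print(unpaired_changes(old, new))
```

The check only detects isolated shifts. A paired change still needs its own justification, since moving both values together can also weaken the loop.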

D.1 Mutation Testing (Chapter 5)

The numbers in Chapter 5 are the default values for AgentClinic-production with a medium incident flow and a mature SDD process. In your project, the thresholds depend on the cost of missing a P0, the complexity of the routing graph, the CI SLA window, and the stability of the incoming flow. Any row shift must be accompanied by an entry in validation.md with justification.

| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of missing P0 | strict_reject_rate ≥ 0.92 | **≥ 0.98** | ≥ 0.995 (payments, healthcare) |
| Routing graph complexity | depth_of_diagnostics ≥ 2 (≤10 edges) | **≥ 3** (10–50 edges) | ≥ 5 (>100 edges, multi-tenant) |
| CI SLA window | recovery_time_p95_ms ≤ 800 | **≤ 1200** | ≤ 1500 (>500 PR/day) |
| Incident flow stability | 1 mutant per class | 2 mutants per class | 5+ mutants per class + seed rotation |

Exercise

```bash
cd book2/examples/stress-mutator

mkdir -p out
# Copy the expected-failures spec and tighten the depth threshold from 3 to 5.
cp expected/expected_failures.json out/expected_failures_depth5.json
sed -i 's/"depth_of_diagnostics_min": 3/"depth_of_diagnostics_min": 5/' out/expected_failures_depth5.json

# Run 1: default thresholds.
python3 scripts/immunity_score.py \
  --validator-results out/validator_results.json \
  --expected expected/expected_failures.json \
  --out out/immunity_default.json

# Run 2: tightened depth threshold.
python3 scripts/immunity_score.py \
  --validator-results out/validator_results.json \
  --expected out/expected_failures_depth5.json \
  --out out/immunity_depth5.json
```

The first run should pass: the average diagnostic depth is 4, which exceeds the threshold of 3. The second run should exit with code 1: the same validator no longer meets the artificially tightened threshold depth_of_diagnostics_min = 5. The delta between the runs reflects the cost of tightening the threshold, not a new defect in the mutants.
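The two outcomes come down to one comparison against a changed threshold. The sketch below is not the logic of immunity_score.py, which is not shown here. The `depth_exit_code` helper and the per-mutant depths are hypothetical, chosen only so that they average 4 as in the text.

```python
from statistics import mean


def depth_exit_code(depths, depth_min):
    """Return a process-style exit code: 0 if the average depth meets the threshold, else 1."""
    return 0 if mean(depths) >= depth_min else 1


# Hypothetical per-mutant diagnostic depths; their average is 4.
observed_depths = [3, 4, 5, 4]

default_code = depth_exit_code(observed_depths, 3)    # default threshold
tightened_code = depth_exit_code(observed_depths, 5)  # tightened threshold
print(default_code, tightened_code)
```

The validator results are identical in both calls; only the threshold differs, which is why the second result says nothing new about the mutants.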

When to Reconsider the Threshold

  • No merge has been blocked by the threshold for a quarter — it is excessively low.
  • More than 10 regressions with the same mutation_id in a week — depth_of_diagnostics is insufficient, increase it.
  • recovery_time_p95 drops to zero as strict_reject_rate rises — a sign of Goodhart's law.
  • A new class of incidents has appeared — recalculate all three thresholds from scratch.
  • A single seed reproduces the same set of mutation_id values for five sprints in a row: seed rotation is needed.

Risk: if strict_reject_rate rises while depth_of_diagnostics simultaneously falls, this is a symptom of Goodhart's law. Both parameters must be moved only as a pair.
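One way to watch for this symptom is to compare the trends of the two metrics over recent sprints. The sketch below uses a plain least-squares slope. The function names and sample series are hypothetical, and in practice you would probably also want a minimum slope magnitude before raising an alarm.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def goodhart_suspected(strict_reject_rate, depth_of_diagnostics):
    """True when the reject rate trends up while the diagnostic depth trends down."""
    return slope(strict_reject_rate) > 0 and slope(depth_of_diagnostics) < 0


# Hypothetical per-sprint observations.
print(goodhart_suspected([0.97, 0.98, 0.985, 0.99], [4, 4, 3, 3]))
```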

D.2 Shadow Specification Selection (Chapter 6)

The weights 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation and the keep/reject thresholds are the default values for AgentClinic-production. In your project, they depend on the cost of false escalation, the importance of early signal, the size of the historical base, and the available budget for prompt examples.

| Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of false escalation | penalty false_escalation: 0.2–0.3 | **0.4** | 0.6–0.8 (healthcare, payments) |
| Importance of early signal | weight early_signal: 0.2 | **0.3** | 0.4–0.5 (blast radius >5 services) |
| Size of historical base | 20–50 cases (smoke test) | 50+ cases | 200+ cases with window rotation |
| Budget for prompt examples | keep-threshold 0.80, 4 slots | **0.70, 8 slots / 2000 tokens** | 0.60, 12 slots / 4000 tokens |

Exercise

Run the auction with a conservative risk profile (higher penalty for false escalation):

```bash
cd book2/examples/shadow-auction

# Score candidates with the conservative weights.
python3 scripts/score.py \
  --candidates candidates/candidates.yaml \
  --incidents data/incidents.jsonl \
  --weights "0.3,0.4,0.2,0.8" \
  --out out/scorebook.json

# Apply the default keep/reject thresholds and token budget.
python3 scripts/decide.py \
  --scorebook out/scorebook.json \
  --budget-tokens 2000 \
  --keep-threshold 0.70 \
  --reject-threshold 0.40 \
  --out-auction out/auction.json \
  --out-quarantine out/quarantine.json
```

With this profile, shadow.p0.voice_handoff moves from winner to disputed, while shadow.alert.red_color_urgency remains in rejected. This is a manifestation of the new profile: the team rewards MTTR reduction less and penalizes false escalation more strongly.
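The sketch below shows how a candidate can move from winner to disputed under the conservative profile. It rests on several assumptions: the `--weights` string is taken to follow the order of the formula (mttr_gain, early_signal, coverage, false_escalation penalty), all inputs are normalized to [0, 1], and the boundary handling in `decide` is a guess. The candidate ids and their metric values are hypothetical and are not taken from candidates.yaml.

```python
DEFAULT = (0.5, 0.3, 0.2, 0.4)       # weights from the formula in this section
CONSERVATIVE = (0.3, 0.4, 0.2, 0.8)  # weights from the exercise command


def score(c, weights):
    """Weighted sum of the three gains minus the false-escalation penalty."""
    w_mttr, w_early, w_cov, w_penalty = weights
    return (w_mttr * c["mttr_gain"] + w_early * c["early_signal"]
            + w_cov * c["coverage"] - w_penalty * c["false_escalation"])


def decide(value, keep=0.70, reject=0.40):
    """Assumed rule: at or above keep wins, below reject is rejected, else disputed."""
    if value >= keep:
        return "winner"
    if value < reject:
        return "rejected"
    return "disputed"


# Hypothetical candidates with hypothetical metric values.
candidates = {
    "candidate_a": {"mttr_gain": 0.9, "early_signal": 0.7, "coverage": 0.8, "false_escalation": 0.2},
    "candidate_b": {"mttr_gain": 0.3, "early_signal": 0.3, "coverage": 0.4, "false_escalation": 0.6},
}

for name, c in candidates.items():
    print(name, decide(score(c, DEFAULT)), "->", decide(score(c, CONSERVATIVE)))
```

With these made-up numbers, candidate_a scores about 0.74 under the default weights and about 0.55 under the conservative ones, so it falls between the two thresholds. Candidate_b stays below the reject threshold in both profiles.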

When to Reconsider the Threshold

  • No winner has shown a positive effect in post-mortems for a month — keep-threshold is too low.
  • The share of disputed is consistently above 40% — the formula does not distinguish cases.
  • More than 8 winners are selected in one phase — budget-tokens was chosen without considering the size of QWEN.md.
  • A new class of incidents has appeared outside the historical data.
  • mttr_gain and false_escalation rise together — a symptom of Goodhart's law.

Risk: the false_escalation penalty and the mttr_gain weight must be moved only as a pair. Shifting one without revisiting the other breaks the "useful signal ↔ false noise" link.

D.3 Tiered Budgets (Chapter 9)

The budget of 10M tokens with a 9M/1M split (local/frontier) is the default value for AgentClinic-production with a medium incident flow. In your project, the budget size and proportions depend on the incident flow, the average phase cost, the share of disputed reviews, and the sensitivity to local-coder downtime.

| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Incident flow per day | ≤50/day → 2–3M tokens, 90/10 | 200/day → 10M, 9M/1M (90/10) | ≥500/day → 25–40M, 80/20 |
| Phase cost (tokens) | ~20K | ~50K | 100K+ (multi-step replay) |
| Share of disputed reviews | ≤5% → frontier 5–7% | ~10% → 1M (10%) | 15–25% → 15–20% frontier |
| Sensitivity to local-coder downtime | ≤1 time/month → reserve 5% | 2–4 times/month → 7% | weekly → 15% + duplicated provider |

Exercise

```bash
cd book2/examples/budget-keeper

# Compile the 5M budget spec and simulate a local-tier failure.
python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json
python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json

# Inspect the result against the expected failover behavior.
python3 scripts/inspect.py --result out/fail_result_5m.json --query "failover_to_frontier==2 && degraded_queue==18 && token_health_min>=0.5"
```

Check whether token_health_min stayed at or above 0.5 with half the budget. The ready-made 5M variant preserves the proportions: the local tier gets 4.5M and frontier gets 0.5M. If only daily_budget_tokens is changed and the phase quotas are left as they were, compile.py must fail with a sum error.
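The arithmetic behind the split and the sum check can be sketched as follows. This is not the implementation of compile.py. The `scale_split` and `validate_plan` helpers and the two-tier quota dictionary are hypothetical simplifications; the actual spec breaks quotas down by phase.

```python
def scale_split(total_tokens: int, local_pct: int = 90) -> dict:
    """Split a daily budget into local and frontier tiers using integer arithmetic."""
    local = total_tokens * local_pct // 100
    return {"local": local, "frontier": total_tokens - local}


def validate_plan(daily_budget_tokens: int, quotas: dict) -> None:
    """Raise ValueError when the quotas do not add up to the daily budget."""
    allocated = sum(quotas.values())
    if allocated != daily_budget_tokens:
        raise ValueError(
            f"quotas sum to {allocated}, expected {daily_budget_tokens}"
        )


plan_5m = scale_split(5_000_000)
print(plan_5m)
validate_plan(5_000_000, plan_5m)  # consistent plan: no error

# Halving the budget while keeping the old 9M/1M quotas must fail.
try:
    validate_plan(5_000_000, {"local": 9_000_000, "frontier": 1_000_000})
except ValueError as err:
    print("sum error:", err)
```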

When to Reconsider the Threshold

  • No degraded_mode trigger for a month — the budget is excessive or the actual flow is below expected.
  • token_health_min drops below 0.5 more than once a week — the local tier is insufficient.
  • failover_to_frontier is consistently 0 during local tier failures — the gateway is too strict, frontier does not work as insurance.
  • The share of manual_queue after manual timeout grows for two months in a row — manual_timeout_sec is too short.
  • Less than 60% of daily_budget_tokens is spent per day — it is time to shrink the budget.

Risk: the 9M/1M split is tied to the SLA by phases. It cannot be shifted without updating budget_plan_phases in the specification — frontier will no longer accommodate "disputed" cases.

D.4 Goodhart-Proofing Metrics (Chapter 10)

The thresholds silent_p0 ≤ 5%, manual_review_rate ≥ 15%, edge_drift ≤ 0.12, audit_trace_coverage = 1.0 are the default values for AgentClinic-production. In your project, they depend on the cost of a missed P0, the availability of manual reviewers, the dynamics of the incoming flow, and regulatory requirements for audit.

| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of missed P0 | silent_p0 ≤ 8% | **≤ 5%** | ≤ 1–2% (payments) |
| Availability of manual reviewers | manual_review_rate ≥ 8% | **≥ 15%** | ≥ 25% (regulatory) |
| Input dynamics | edge_drift ≤ 0.20 | **≤ 0.12** | ≤ 0.05 (seasonal peaks) |
| Audit regulation | audit_trace_coverage ≥ 0.95 | **= 1.00** | = 1.00 + signed trace |

Exercise

```bash
cd book2/examples/goodhart-validator

mkdir -p out

# Copy spec to local out/ and relax silent_p0_cap to 0.08
cp specs/validation.yaml out/validation_loose.yaml
sed -i 's/threshold: 0.05/threshold: 0.08/' out/validation_loose.yaml

python3 scripts/run_validation.py \
  --validation out/validation_loose.yaml \
  --metrics fixtures/new_metrics_bad.json

# Dangerous variant: relax two independent protections at once
cp specs/validation.yaml out/validation_unsafe.yaml
sed -i 's/threshold: 0.15/threshold: 0.10/' out/validation_unsafe.yaml
sed -i 's/threshold: 0.05/threshold: 0.20/' out/validation_unsafe.yaml

python3 scripts/run_validation.py \
  --validation out/validation_unsafe.yaml \
  --metrics fixtures/new_metrics_bad.json
```

The first run should remain red: a bad release with silent_p0=0.18 still violates silent_p0_cap. The second, dangerous variant only passes because two independent protections are simultaneously relaxed. This shows why guard metrics cannot be calibrated one YAML line at a time.
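The three configurations can be compared side by side in a few lines. This is a sketch, not the logic of run_validation.py. The guard layout and the `violations` helper are hypothetical, and only silent_p0=0.18 comes from the text. The manual_review_rate value of 0.12 is an assumed fixture value, picked so that it fails the default floor of 0.15 and passes the relaxed floor of 0.10.

```python
def violations(metrics, guards):
    """Return the names of guards that the metrics violate."""
    failed = []
    for name, (kind, limit) in guards.items():
        value = metrics[name]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failed.append(name)
    return failed


# silent_p0 is quoted in the text; manual_review_rate is an assumed value.
bad_release = {"silent_p0": 0.18, "manual_review_rate": 0.12}

default = {"silent_p0": ("max", 0.05), "manual_review_rate": ("min", 0.15)}
loose = {"silent_p0": ("max", 0.08), "manual_review_rate": ("min", 0.15)}
unsafe = {"silent_p0": ("max", 0.20), "manual_review_rate": ("min", 0.10)}

for label, guards in [("default", default), ("loose", loose), ("unsafe", unsafe)]:
    print(label, violations(bad_release, guards))
```

Under these assumptions the default and loose specs both report violations, and the unsafe spec reports none even though the release itself has not changed.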

When to Reconsider the Threshold

  • No release has been blocked by silent_p0_cap for a quarter — either the team is not making risky changes, or the threshold is excessively soft.
  • manual_review_rate falls for three sprints in a row while mttr_gain rises — a symptom of Goodhart's law, manual reviewers have ceased to be a safety net.
  • edge_drift consistently hovers around 0.10–0.11: the real input dynamics are close to the threshold.
  • audit_trace_coverage dropped below 1.0 in even a single run: this is a violation of a regulatory invariant and calls for a hotfix, not calibration.
  • A new class of incidents has appeared that does not fall into silent_p0 — new invariants are needed, not a revision of old ones.

Risks: silent_p0 and manual_review_rate must be moved only as a pair. edge_drift only makes sense at audit_trace_coverage=1.0, otherwise drift is computed from a partial sample. All four thresholds form a single risk contract: weakening one in isolation from the rest means breaking it, not tuning it.

Full Metric Network

The text in the chapter uses a simplified mermaid diagram with three metrics and one guard. The full dependency network looks like this:

```mermaid
flowchart LR
    MTTR[MTTR]
    silent_p0[silent_p0]
    manual_review_rate[manual_review_rate]
    escalation_rate[escalation_rate]
    postmortem_regression[postmortem_regression]
    audit_trace_coverage[audit_trace_coverage]
    silent_p0 -->|positive_interdependence| MTTR
    escalation_rate -->|positive_interdependence| MTTR
    manual_review_rate -->|negative_interdependence| MTTR
    manual_review_rate -->|negative_interdependence| escalation_rate
    audit_trace_coverage -->|negative_interdependence| escalation_rate
    audit_trace_coverage -->|negative_interdependence| silent_p0
    postmortem_regression -->|positive_interdependence| audit_trace_coverage
    postmortem_regression -->|negative_interdependence| manual_review_rate
```

The logic is the same as in the simplified version: the red zone is MTTR and silent_p0, and the path to weakening it runs through cutting back manual review and losing the audit trace.
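To see which metrics can reach the red zone, the diagram's edges can be loaded into a small script that walks them backwards. This is a sketch: the edge list is copied from the diagram, the interdependence labels are dropped for brevity, and the `upstream` helper is hypothetical.

```python
# Edges copied from the diagram as (source, target); labels are omitted.
EDGES = [
    ("silent_p0", "MTTR"),
    ("escalation_rate", "MTTR"),
    ("manual_review_rate", "MTTR"),
    ("manual_review_rate", "escalation_rate"),
    ("audit_trace_coverage", "escalation_rate"),
    ("audit_trace_coverage", "silent_p0"),
    ("postmortem_regression", "audit_trace_coverage"),
    ("postmortem_regression", "manual_review_rate"),
]


def upstream(target):
    """Return every metric with a directed path to target."""
    parents = {}
    for src, dst in EDGES:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("silent_p0")))
print(sorted(upstream("MTTR")))
```

Every other metric in the network has a path to MTTR, and silent_p0 is reachable from audit_trace_coverage and postmortem_regression.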

D.5 Production Readiness (Chapter 11)

The threshold of 23/25 is the default value for AgentClinic-production with medium SDD process maturity and a mixed action type. In your project, the threshold depends on the cost of a cutover error, process maturity, the load on manual review, and the nature of the actions (stateless / stateful).

| Project Parameter | Low | Default (AgentClinic) | High |
|---|---|---|---|
| Cost of cutover error | internal tool: 21–22/25 semi-manual only | mixed production: auto ≥23/25 | payments/healthcare: auto ≥24/25 |
| SDD process maturity | 3 months → semi-manual only 20–22 | 6+ months → semi-manual 20–22, auto 23+ | 12+ months + 50+ replays → auto 23+, fewer manual stops |
| Manual review load | every pull request (~5/week) → can maintain 21–22 semi-manual | 20–30% of pull requests → auto 23+ | rarely → auto 24/25 |
| Action type | stateless → 22/25 only canary/semi-manual, auto 23+ | mixed → auto 23+ | stateful → auto 24+ and backup_verified |

Exercise

The script check_readiness.py hardcodes THRESHOLD = 23. Run it with a different value via a copy:

```bash
cd book2/examples/real-api && mkdir -p out
# Patch a copy of the script so the original keeps THRESHOLD = 23.
cp scripts/check_readiness.py out/check_readiness_t22.py
sed -i 's/THRESHOLD = 23/THRESHOLD = 22/' out/check_readiness_t22.py
python3 out/check_readiness_t22.py --readiness fixtures/readiness_block_audit.json
```

At THRESHOLD = 22, readiness_block_audit.json is still blocked because of audit_trace_coverage=0.7 < 1.0, even though the sum of 22/25 passes. This shows that audit_trace_coverage is an independent blocking invariant, not part of the sum. The exercise is about threshold sensitivity; it is not a recommendation to lower the auto-approval threshold.

When to Reconsider the Threshold

  • No readiness has been blocked by the threshold for a quarter — it is too low for the current team maturity.
  • The share of semi-manual incidents grows for three sprints in a row — the threshold of 23/25 is not reached due to a systematic gap in Verification or Process.
  • A class of actions with stateful=true has appeared — require backup_verified and raise the threshold for this class to 24/25.
  • All readiness failures within a month are along one axis — this is a gap in SDD templates; fix the templates, not the threshold.
  • The time to assemble readiness artifacts exceeds the SLA for cutover: reconsider which items can be automated instead of lowering the threshold.

Risk: the threshold of 23/25 is incompatible with a zero score in Security at any sum: such a failure blocks the merge regardless of the total. Lowering the threshold below 23/25 changes the operating mode: it is no longer auto-approval but semi-manual or canary mode. Even the "low" setting (21/25) means a stop after every implement step with explicit operator confirmation; it does not give the agent the right to perform remediation independently.
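The relationship between the sum and the blocking invariants can be sketched as an ordered set of checks. This is not the code of check_readiness.py. The `check_readiness` helper is hypothetical, and so is the assumption of five axes worth five points each. Only Security, Verification and Process are named in this section; the "Spec" and "Rollback" axis names and all scores are made up for illustration.

```python
def check_readiness(scores, audit_trace_coverage, threshold=23):
    """Return (passed, reason). Blocking invariants are checked before the sum."""
    if audit_trace_coverage < 1.0:
        return False, "audit_trace_coverage below 1.0"
    if scores.get("Security", 0) == 0:
        return False, "zero Security score"
    total = sum(scores.values())
    if total < threshold:
        return False, f"sum {total}/25 below threshold {threshold}"
    return True, f"sum {total}/25 meets threshold {threshold}"


# Hypothetical axis scores adding up to 22/25.
scores = {"Security": 5, "Verification": 4, "Process": 4, "Spec": 5, "Rollback": 4}

print(check_readiness(scores, audit_trace_coverage=0.7, threshold=22))
print(check_readiness(scores, audit_trace_coverage=1.0, threshold=22))
print(check_readiness(scores, audit_trace_coverage=1.0))
```

Checking the invariants first means that lowering the threshold argument never turns an audit or Security failure into a pass.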

Course

Production SDD for Qwen Code CLI. Part 2