Topic: Appendix D. Threshold Calibration
Difficulty level: Medium
Estimated study time: 6-8 hours (theory: 3 hours, practice: 3-5 hours)
Prerequisites: Completion of AgentClinic course chapters 5, 6, 9, 10, and 11
Basic knowledge of YAML and command-line operations
Understanding of mutation testing, shadow specifications, and tiered budgets concepts
Experience working with SLA metrics and CI/CD processes
Familiarity with Goodhart's law and its manifestations in engineering systems
Learning objectives: Configure and calibrate mutation testing thresholds (strict_reject_rate, depth_of_diagnostics, recovery_time_p95) accounting for project specifics and document rationale in validation.md
Recalculate weights and selection thresholds for shadow specifications (keep-threshold, reject-threshold, budget-tokens) when false escalation cost and prompt sample budget change
Design tiered token budgets with correct local/frontier split and verify integrity through compile.py
Identify symptoms of Goodhart's law in defensive metric networks and correct thresholds only with paired shifts while preserving invariants
Determine production readiness of a system accounting for action type (stateless/stateful) and independent blocking invariants
Overview: Appendix D is a reference for porting the AgentClinic-production process to your own project. It collects the "Low / Default / High" threshold tables for five key areas: mutation testing, shadow specification selection, tiered budgets, Goodhart-proofing metrics, and production readiness. The central principle is that thresholds only make sense in pairs: shifting one value without recalculating its related parameter is not calibration but dismantling of the control loop. The guide includes practical exercises on real scripts, signals that a threshold needs review, and warnings about the risks of incorrect calibration.
Key concepts: Paired calibration: The fundamental principle of Appendix D: every threshold exists in conjunction with other parameters. For example, strict_reject_rate and depth_of_diagnostics only move together. If strict_reject_rate increases while depth_of_diagnostics decreases, this is a symptom of Goodhart's law, not a quality improvement. The same pairing applies to the false_escalation penalty and the mttr_gain weight, to silent_p0 and manual_review_rate, and to edge_drift and audit_trace_coverage.
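A minimal sketch of how the pairing rule could be checked mechanically, assuming thresholds are kept in a plain dictionary. The pairs come from the text above; the function name unpaired_changes and the dictionary layout are illustrative assumptions, not part of the AgentClinic scripts.

```python
# Pairs named in the text; the dictionary layout is an assumption.
PAIRS = [
    ("strict_reject_rate", "depth_of_diagnostics"),
    ("false_escalation", "mttr_gain"),
    ("silent_p0", "manual_review_rate"),
    ("edge_drift", "audit_trace_coverage"),
]


def unpaired_changes(old, new):
    """Return the pairs in which exactly one of the two values changed."""
    flagged = []
    for first, second in PAIRS:
        first_changed = old.get(first) != new.get(first)
        second_changed = old.get(second) != new.get(second)
        if first_changed != second_changed:  # one moved, its partner did not
            flagged.append((first, second))
    return flagged


old = {"strict_reject_rate": 0.98, "depth_of_diagnostics": 3}
lone_shift = {"strict_reject_rate": 0.995, "depth_of_diagnostics": 3}
paired_shift = {"strict_reject_rate": 0.995, "depth_of_diagnostics": 5}

print(unpaired_changes(old, lone_shift))    # [('strict_reject_rate', 'depth_of_diagnostics')]
print(unpaired_changes(old, paired_shift))  # []
```

The sketch only detects one-sided changes. It does not check direction, so a pair that moves in opposite directions would still need a human review.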
AgentClinic default values: The baseline for production systems with medium incident flow (200/day), mature SDD process (6+ months), and mixed action types. Any deviation from this baseline requires rationale in validation.md.
Threshold review signals: Specific triggers that indicate calibration is needed: no blocks for a full quarter (the threshold is set too low to catch anything), regressions concentrated on a single mutation_id (the threshold is insufficient), recovery_time_p95 dropping to zero while strict_reject_rate rises (Goodhart), and manual_review_rate rising alongside mttr_gain (goal substitution).
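The four signals above can be sketched as simple checks. The function name, the argument names, and the two cutoffs for "concentration" (concentration_share, min_regressions) are hypothetical; the text does not define how concentrated regressions must be before the signal fires.

```python
from collections import Counter


def review_signals(blocks_last_quarter, regression_mutation_ids,
                   recovery_time_p95_ms, strict_reject_rate_delta,
                   manual_review_rate_delta, mttr_gain_delta,
                   concentration_share=0.5, min_regressions=3):
    """Return the names of the review signals that fire.

    concentration_share and min_regressions are hypothetical cutoffs for
    deciding when regressions count as concentrated on one mutation_id.
    """
    signals = []
    if blocks_last_quarter == 0:
        signals.append("no_blocks_for_a_quarter")
    if len(regression_mutation_ids) >= min_regressions:
        _, top_count = Counter(regression_mutation_ids).most_common(1)[0]
        if top_count / len(regression_mutation_ids) >= concentration_share:
            signals.append("regressions_on_single_mutation_id")
    if recovery_time_p95_ms == 0 and strict_reject_rate_delta > 0:
        signals.append("goodhart_recovery_time_zero")
    if manual_review_rate_delta > 0 and mttr_gain_delta > 0:
        signals.append("goal_substitution")
    return signals


print(review_signals(0, ["m1", "m1", "m1", "m2"], 0, 0.01, 0.02, 0.05))
```

With the inputs shown, all four signals fire; a healthy quarter with blocks, no regressions, and a non-zero recovery_time_p95 returns an empty list.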
Goodhart's law in metrics: When a metric becomes a target, it ceases to be a good metric. In the AgentClinic context: rising strict_reject_rate without real diagnostic improvement, MTTR reduction by ignoring complex cases, rising mttr_gain alongside rising false_escalation — all these patterns require paired correction.
Independent blocking invariants: Metrics that are not part of the aggregate threshold but block merge independently. For example, audit_trace_coverage = 1.0 is a mandatory condition for production readiness; with audit_trace_coverage = 0.7, readiness of 22/25 or even 25/25 does not permit auto-switchover.
Mutation seed rotation: The practice of periodically changing the seed in mutation testing to prevent validator overfitting on a fixed set of mutation_ids. One seed repeating for five consecutive sprints is a signal to rotate.
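A minimal sketch of the rotation signal, assuming the seed used in each sprint is recorded in a list ordered from oldest to newest. The function name and the list format are illustrative assumptions; only the five-sprint limit comes from the text.

```python
def needs_seed_rotation(seed_history, max_consecutive=5):
    """True if the same seed was used in the last max_consecutive sprints."""
    if len(seed_history) < max_consecutive:
        return False
    recent = seed_history[-max_consecutive:]
    return len(set(recent)) == 1


print(needs_seed_rotation([7, 42, 42, 42, 42, 42]))  # True
print(needs_seed_rotation([42, 42, 42, 42, 7]))      # False
```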
Tiered budget split: Architecture for distributing tokens between local tier (9M out of 10M) and frontier tier (1M). Proportions are tied to phase SLAs: changing daily_budget_tokens without updating budget_plan_phases violates integrity and causes compile.py to error.
Risk profiles for shadow specifications: Weight configuration in the selection formula: conservative profile (0.3, 0.4, 0.2, 0.8) penalizes false escalation more heavily and rewards MTTR less than standard (0.5, 0.3, 0.2, -0.4).
Operating modes: Permission hierarchy: auto ≥23/25 — fully automatic mode; semi-manual 20–22 — stop after implement with explicit confirmation; canary — gradual rollout. Dropping below 23/25 is a mode change, not a "softer threshold".
Metric dependency network: Complete model of interrelationships: silent_p0 and escalation_rate positively affect MTTR; manual_review_rate and audit_trace_coverage negatively affect MTTR and escalation_rate; postmortem_regression is positively associated with audit_trace_coverage and negatively with manual_review_rate.
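The network described above can be written down as a list of signed edges, which makes it easy to look up what else to review before touching a metric. This is only a sketch: the lowercase metric names, the tuple layout, and the helper linked_metrics are assumptions, and the two postmortem_regression links are stored as plain associations without a causal direction.

```python
# Signed edges from the description above: +1 = positive, -1 = negative.
EDGES = [
    ("silent_p0", "mttr", +1),
    ("escalation_rate", "mttr", +1),
    ("manual_review_rate", "mttr", -1),
    ("manual_review_rate", "escalation_rate", -1),
    ("audit_trace_coverage", "mttr", -1),
    ("audit_trace_coverage", "escalation_rate", -1),
    ("postmortem_regression", "audit_trace_coverage", +1),
    ("postmortem_regression", "manual_review_rate", -1),
]


def linked_metrics(metric):
    """Return {other_metric: sign} for every edge that touches `metric`."""
    linked = {}
    for source, target, sign in EDGES:
        if source == metric:
            linked[target] = sign
        elif target == metric:
            linked[source] = sign
    return linked


print(linked_metrics("manual_review_rate"))
# {'mttr': -1, 'escalation_rate': -1, 'postmortem_regression': -1}
```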
Practice exercises: Name: Exercise D.1: Calibrating depth_of_diagnostics
Problem: In a project with a route graph of 80 edges (multi-tenant), verify how changing the depth_of_diagnostics_min threshold from 3 to 5 affects immunity_score. Use scripts from book2/examples/stress-mutator.
Solution: 1. Navigate to the directory: cd book2/examples/stress-mutator
2. Create the output directory: mkdir -p out
3. Copy and modify the expected result: cp expected/expected_failures.json out/expected_failures_depth5.json; sed -i 's/"depth_of_diagnostics_min": 3/"depth_of_diagnostics_min": 5/' out/expected_failures_depth5.json
4. Run the baseline calculation (this assumes out/validator_results.json already exists from an earlier validator run): python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected expected/expected_failures.json --out out/immunity_default.json. This should pass (average depth 4 > 3).
5. Run the tightened calculation: python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected out/expected_failures_depth5.json --out out/immunity_depth5.json. This should exit with code 1.
6. Compare the delta: the failure is not a new defect but the cost of tightening the threshold. Document the rationale in validation.md: "Graph expanded to 80 edges, multi-tenant; depth_of_diagnostics_min raised to 5, strict_reject_rate recalculated to ≥0.995".
Complexity: intermediate
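The sed substitution in Exercise D.1 leaves the file unchanged, without any error, if the spacing around the key differs from the pattern. As a hedged alternative, the sketch below sets the key by name in parsed JSON and reports how many occurrences were updated. The helper set_key is hypothetical, and the sample document is a stand-in: the real layout of expected_failures.json is not assumed here.

```python
import json


def set_key(node, key, value):
    """Set `key` to `value` wherever it occurs in nested dicts and lists.

    Returns the number of occurrences that were updated.
    """
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += set_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_key(item, key, value)
    return count


# Stand-in document; replace with json.load() on the copied file.
doc = json.loads('{"thresholds": {"depth_of_diagnostics_min": 3}}')
changed = set_key(doc, "depth_of_diagnostics_min", 5)
if changed == 0:
    raise SystemExit("key not found; nothing was tightened")
print(json.dumps(doc))  # {"thresholds": {"depth_of_diagnostics_min": 5}}
```

Failing loudly when the key is missing avoids running the "tightened" calculation against an unmodified file.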
Name: Exercise D.2: Conservative profile for shadow auction
Problem: A healthcare team requires reducing false escalations. Form a conservative weight profile and analyze which candidates will change status.
Solution: 1. Navigate to the directory: cd book2/examples/shadow-auction
2. Run the auction with conservative weights: python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights "0.3,0.4,0.2,0.8" --out out/scorebook.json
3. Apply the thresholds with an increased budget: python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json
4. Analyze the changes: shadow.p0.voice_handoff moves from winner to disputed (high false escalation risk), while shadow.alert.red_color_urgency remains rejected.
5. Document in validation.md: "False escalation cost raised to 0.8, keep-threshold lowered to 0.70, budget-tokens increased to 2000 to compensate for conservatism".
Complexity: intermediate
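To reason about why a candidate moves between winner, disputed, and rejected, it can help to picture the two thresholds as a three-way split. The sketch below is an assumption about how such a split might work, not a description of decide.py: the boundary handling (>= keep, < reject) is a guess, the scores are invented, and the --budget-tokens constraint is ignored entirely.

```python
def classify(score, keep_threshold=0.70, reject_threshold=0.40):
    """Map a candidate score to a status using two thresholds.

    Boundary handling (>= keep, < reject) is an assumption.
    """
    if reject_threshold >= keep_threshold:
        raise ValueError("reject_threshold must be below keep_threshold")
    if score >= keep_threshold:
        return "winner"
    if score < reject_threshold:
        return "rejected"
    return "disputed"


# Invented scores, for illustration only.
for score in (0.75, 0.55, 0.20):
    print(score, classify(score))
```

Under these assumptions, a candidate whose score falls from above 0.70 into the 0.40 to 0.70 band when the false escalation weight rises would change from winner to disputed.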
Name: Exercise D.3: Verifying tiered budget integrity
Problem: A project with a flow of 50 incidents/day requires a 5M token budget. Verify that compile.py rejects an incorrect specification in which daily_budget_tokens was changed without recalculating the phase quotas.
Solution: 1. Navigate to the directory: cd book2/examples/budget-keeper
2. Compile the correct plan: python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json
3. Verify that the proportions are preserved: local = 4.5M, frontier = 0.5M (90/10)
4. Simulate a local tier failure: python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json
5. Check health: python3 scripts/inspect.py --result out/fail_result_5m.json --query "failover_to_frontier==2 && degraded_queue==18 && token_health_min>=0.5"
6. Create an incorrect specification: copy budget_network_5m.yaml and change only daily_budget_tokens to 3M, leaving the phase quotas unchanged
7. Verify that compile.py fails with a sum error. This check is what protects against dismantling the control loop.
Complexity: intermediate
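The arithmetic behind Exercise D.3 can be sketched in a few lines. The phase names and quota values below are hypothetical, and check_phase_sum only illustrates the kind of sum check the text attributes to compile.py; the real script's logic and error message may differ.

```python
def split_budget(daily_budget_tokens, local_percent=90):
    """Split a daily budget into local and frontier tiers (integer math)."""
    local = daily_budget_tokens * local_percent // 100
    return {"local": local, "frontier": daily_budget_tokens - local}


def check_phase_sum(daily_budget_tokens, phase_quotas):
    """Raise ValueError if the phase quotas do not add up to the daily budget."""
    total = sum(phase_quotas.values())
    if total != daily_budget_tokens:
        raise ValueError(
            f"phase quotas sum to {total}, expected {daily_budget_tokens}"
        )


print(split_budget(5_000_000))  # {'local': 4500000, 'frontier': 500000}

# Hypothetical phase names; quotas still sized for the 5M plan.
phases = {"plan": 1_000_000, "implement": 3_000_000, "validate": 1_000_000}
check_phase_sum(5_000_000, phases)      # passes silently
try:
    check_phase_sum(3_000_000, phases)  # budget changed, quotas not
except ValueError as err:
    print(err)  # phase quotas sum to 5000000, expected 3000000
```

Integer arithmetic is used for the split so that the two tiers always add up exactly to the daily budget.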
Name: Exercise D.4: Demonstrating the danger of separate guard-metric calibration
Problem: Show why weakening two independent guards simultaneously passes a bad release, while weakening one does not.
Solution: 1. Navigate to the directory: cd book2/examples/goodhart-validator; mkdir -p out
2. First scenario, weakening only silent_p0_cap: cp specs/validation.yaml out/validation_loose.yaml; sed -i 's/threshold: 0.05/threshold: 0.08/' out/validation_loose.yaml; python3 scripts/run_validation.py --validation out/validation_loose.yaml --metrics fixtures/new_metrics_bad.json. Result: FAILED (silent_p0=0.18 > 0.08)
3. Second scenario, the dangerous double weakening: cp specs/validation.yaml out/validation_unsafe.yaml; sed -i 's/threshold: 0.15/threshold: 0.10/' out/validation_unsafe.yaml; sed -i 's/threshold: 0.05/threshold: 0.20/' out/validation_unsafe.yaml; python3 scripts/run_validation.py --validation out/validation_unsafe.yaml --metrics fixtures/new_metrics_bad.json. Result: PASSED (false positive)
4. Conclusion: guard-metrics form a unified risk contract; weakening them independently of one another is not allowed. Document in validation.md: prohibition on single-line YAML changes.
Complexity: advanced
Name: Exercise D.5: Verifying independence of audit_trace_coverage
Problem: Show that audit_trace_coverage is an independent blocking invariant not included in the 25/25 sum for production readiness.
Solution: 1. Navigate to the directory: cd book2/examples/real-api; mkdir -p out
2. Copy and modify the script: cp scripts/check_readiness.py out/check_readiness_t22.py; sed -i 's/THRESHOLD = 23/THRESHOLD = 22/' out/check_readiness_t22.py
3. Run it with readiness_block_audit.json: python3 out/check_readiness_t22.py --readiness fixtures/readiness_block_audit.json
4. Result: BLOCKED due to audit_trace_coverage=0.7 < 1.0, despite a sum of 22/25 (the same would hold even at 25/25)
5. Analyze: this demonstrates that Security=0 or audit_trace_coverage < 1.0 is an absolute blocker. Lowering THRESHOLD to 22 is a transition to semi-manual mode, not a "softer admission". Document in validation.md: "audit_trace_coverage=1.0 is a regulatory invariant, not subject to calibration".
Complexity: intermediate
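The behaviour shown in Exercise D.5 can be summarized as "invariants first, score second". The sketch below is an illustration of that ordering, not the logic of check_readiness.py: the function name, the security_score parameter, and the "not-ready" label for sums below 20 are assumptions, and canary mode is not modeled.

```python
def readiness_decision(score, audit_trace_coverage, security_score):
    """Return 'blocked', 'auto', 'semi-manual', or 'not-ready'.

    score is the readiness sum out of 25. The 'not-ready' label for sums
    below 20 is an assumption; the text only defines the 20-22 and >=23 bands.
    """
    # Independent invariants are checked first and ignore the sum entirely.
    if audit_trace_coverage < 1.0 or security_score == 0:
        return "blocked"
    if score >= 23:
        return "auto"
    if score >= 20:
        return "semi-manual"
    return "not-ready"


print(readiness_decision(22, 0.7, 3))  # blocked
print(readiness_decision(25, 0.7, 3))  # blocked
print(readiness_decision(22, 1.0, 3))  # semi-manual
print(readiness_decision(23, 1.0, 3))  # auto
```

Because the invariant check returns before the sum is examined, no value of the sum can compensate for audit_trace_coverage below 1.0.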
Case studies: Name: Case: Threshold recalibration when entering the healthcare market
Scenario: The AgentClinic-production team, working with internal tools (strict_reject_rate ≥ 0.98, depth_of_diagnostics ≥ 3), received a contract for incident processing in a medical information system. Regulatory requirements: missing a P0-critical incident incurs a fine of up to 2% of annual revenue and criminal liability for management.
Challenge: Standard AgentClinic thresholds are insufficient for healthcare. The team had to simultaneously raise strict_reject_rate to ≥0.995, raise depth_of_diagnostics to ≥5 (the route graph had expanded to 120 edges, multi-tenant with patient isolation), and reduce recovery_time_p95 to ≤1500 ms at >500 PR/day. Risk: raising strict_reject_rate in isolation, without depth_of_diagnostics, triggers the Goodhart effect: the validator rejects everything indiscriminately, including correct fixes.
Solution: The team conducted paired calibration using Appendix D methodology: 1) Documented rationale in validation.md: "Regulatory contract §12.3, cost of missing P0: 2% fine + criminal liability"; 2) Raised strict_reject_rate to 0.995 and simultaneously depth_of_diagnostics to 5; 3) Increased mutants per class to 5+ with seed rotation every 2 sprints; 4) Conducted stress testing through immunity_score.py with an artificially tightened depth_of_diagnostics_min=5 threshold; 5) Verified that recovery_time_p95 does not drop to zero as strict_reject_rate rises, which confirms that the Goodhart symptom is absent.
Result: The system passed regulatory audit with zero missed P0 incidents over 6 months of operation. MTTR increased by 12% due to deeper diagnostics, but this was accepted as a necessary cost of compliance. Seed rotation detected 3 validator overfitting cases before they reached production.
Lessons learned: Regulatory requirements demand not "maximum" thresholds but justified paired shifts with documented cost of decision
Seed rotation is not an optional practice but mandatory when increasing mutant count; otherwise the validator overfits on fixed patterns
12% MTTR increase with higher depth_of_diagnostics is an expected trade-off that should be built into customer SLA in advance
Absence of Goodhart's law symptom (recovery_time_p95 → 0) is a more important health indicator than absolute strict_reject_rate value
Related concepts: Paired calibration
Threshold review signals
Goodhart's law in metrics
Mutation seed rotation
Name: Case: Control loop dismantling through incorrect budget compression
Scenario: A fintech startup experiencing a liquidity crisis decided to cut LLM infrastructure costs by 50%. The CTO changed only daily_budget_tokens in the specification from 10M to 5M without recalculating phase quotas.
Challenge: The compile.py script failed with a sum error, which is the protective mechanism described in Appendix D. However, the CTO, not understanding its purpose, commented out the integrity check in a local copy and assembled the budget manually. Result: the local tier received 4M instead of 4.5M, the frontier tier received 0.2M instead of 0.5M, and the remaining 0.8M was "distributed at discretion" with no connection to phase SLAs.
Solution: The problem was discovered 3 weeks later when, after a local-coder failure, the system could not switch to frontier: the frontier tier held too few tokens for the gateway to accept the failover. The team restored the original compile.py mechanism, compiled the plan correctly with 90/10 proportions (4.5M/0.5M), simulated the failure through simulate.py, and verified that failover_to_frontier worked.
Result: System downtime was 47 minutes, and 12 incidents went to manual_queue with an SLA breach. Financial losses exceeded the savings from budget compression eightfold. The CTO was removed from infrastructure decision-making.
Lessons learned: compile.py is not bureaucratic obstruction but control loop protection; bypassing checks is dismantling, not optimization
90/10 proportions are tied to phase SLAs; changing them requires recalculating budget_plan_phases, not only daily_budget_tokens
Saving on infrastructure guard-metrics has negative ROI: losses from one downtime exceed annual savings
Appendix D methodology requires any budget change to go through full cycle: compile → simulate → inspect → documentation in validation.md
Related concepts: Tiered budget split
Paired calibration
Threshold review signals
Name: Case: False positive pass with separate guard-metric calibration
Scenario: A payment system team, working under business pressure to "accelerate releases", decided to "optimize" guard-metrics. Instead of paired calibration of silent_p0 and manual_review_rate, they independently weakened both thresholds: silent_p0_cap from 0.05 to 0.20 and manual_review_rate from 0.15 to 0.10.
Challenge: Individually, each change seemed justified: "we have rare P0s" and "few manual reviewers". Simultaneous weakening created a false positive pass: a bad release with silent_p0=0.18 and manual_review_rate=0.08 passed validation, though both indicators were far outside their original limits (0.05 and 0.15). This is a direct manifestation of the Appendix D principle that guard-metrics form a unified risk contract.
Solution: A payment data leak incident (consequences: 340K affected transactions, $2.4M regulatory fine) led to an audit. Auditors reproduced Exercise D.4: weakening only silent_p0_cap to 0.08 still blocked the bad release, while the double weakening passed it. The team introduced a hard rule: any validation.yaml change requires a review that checks for "independent weakenings" and an automatic run through goodhart-validator.
Result: The system was restored and regulatory requirements were met through an emergency manual_review_rate increase to 0.25 and the introduction of signed audit_trace_coverage traceability. Release time increased by 40%, but the zero-missed-P0 level was restored.
Lessons learned: Guard-metrics are not a set of independent switches but a unified risk contract; weakening one in isolation from others is dismantling protection
False positive pass is worse than false negative block: a missed defect in production costs orders of magnitude more than release delay
Automation of "independent weakening" check through goodhart-validator must be mandatory, not optional
Business pressure to "accelerate" does not justify methodology bypass; an escalation mechanism is needed, not silent compromise
Related concepts: Independent blocking invariants
Goodhart's law in metrics
Metric dependency network
Paired calibration
Study tips: Progress through material sequentially by sections D.1–D.5, skipping no exercises — each exercise demonstrates a specific risk of incorrect calibration
Maintain your own validation.md in parallel with study: document hypothetical rationales for your project, even if this is a training simulation
Use the "predict, then verify" approach: before running a script, write down expected result, then compare with actual — this develops intuition about threshold behavior
Create a correspondence table "my project → AgentClinic parameters → deviations → rationale" — this is a template for real porting
For visual style: redraw the mermaid metric dependency network diagram on paper, marking positive and negative connections with different colors — this helps remember which metrics move in pairs
Practice identifying Goodhart's law symptoms: invent "what if" scenarios and check if Appendix D has a corresponding review signal
Group study by "one day — one risk system" principle: D.1+D.4 (mutations and Goodhart), D.2+D.5 (shadow specifications and readiness), D.3 (budgets) — this better traces connections between sections
For auditory style: dictate calibration rationales aloud in format "If [project parameter], then [threshold], because [risk]" — this prepares for real stakeholder discussions
Test edge cases: what happens if all thresholds are set to "Low" or "High" simultaneously? Why is this unworkable?
Additional resources: AgentClinic course source materials: Chapters 5, 6, 9, 10, 11, the theoretical foundation that Appendix D builds upon
book2/examples repository: Practical scripts for all exercises: stress-mutator, shadow-auction, budget-keeper, goodhart-validator, real-api
validation.md template: Format for documenting calibration rationales; create your own copy for your project
mermaid documentation: For independent editing and extending metric dependency network diagrams
Article "Goodhart's Law and Machine Learning" (Varoquaux): Theoretical justification for paired calibration and network structure of guard-metrics
Google SRE book, Chapter 4 (Service Level Objectives): Parallels between error budgets and tiered token budgets in AgentClinic
AgentClinic production porting checklist: Self-compile based on tables D.1–D.5 with columns "Project parameter", "Our value", "Deviation from default", "Rationale", "Review date"
Summary: Appendix D is a practical reference for calibrating AgentClinic-production thresholds for porting to your own projects. Key principles: (1) thresholds only make sense in pairs, shifting one without recalculating its related parameter is dismantling the control loop; (2) any deviation from default values requires rationale in validation.md; (3) guard-metrics form a unified risk contract, not a set of independent switches; (4) Goodhart's law symptoms (rising target metric without real improvement, falling related indicators) are the main signal for paired correction; (5) independent blocking invariants (audit_trace_coverage = 1.0, Security = 0) are not subject to calibration at all. Five sections cover mutation testing, shadow specifications, tiered budgets, Goodhart-proofing, and production readiness — each with threshold tables, practical exercises on real scripts, review signals, and risk warnings. The methodology requires documentation, automatic integrity verification, and regular parameter rotation to prevent overfitting.