Study guide: Appendix D. Threshold Calibration

Topic: Appendix D. Threshold Calibration

Difficulty level: Medium

Estimated study time: 3-4 hours

Prerequisites: Familiarity with basic SDD (Software-Defined Diagnostics/Delivery) concepts

Understanding of MTTR, SLA, CI/CD metrics, and incident management

Experience with the Linux command line (bash) and Python

Basic understanding of mutation testing and how LLMs work (tokens, specifications)

Learning objectives: Understand and apply the principle of paired threshold calibration, avoiding the "dismantling of the safety loop" that occurs when only one metric is shifted.

Adapt mutation testing thresholds (chapter 5) depending on the cost of a P0 miss and the complexity of the route graph.

Tune the weights and thresholds of the shadow specifications auction (chapter 6) taking into account the cost of false escalation and the importance of early signal.

Optimize tiered token budgets (chapter 9) for different load profiles and incident flows.

Recognize and prevent manifestations of Goodhart's Law (chapter 10) when configuring guard metrics.

Overview: This study guide covers Appendix D, threshold calibration in the AgentClinic-production process. Threshold calibration is not merely changing numbers in configuration files; it is fine-tuning the balance between risks, the cost of errors, and available resources. The material brings together tier tables (Low / Default / High), practical exercises on shifting thresholds, and indicators for when to review them. The key point is that thresholds only make sense in pairs: changing one parameter must be accompanied by a recalculation of the related one, otherwise the system loses its stability.

Key concepts: Paired calibration: The principle that thresholds cannot be changed in isolation. Shifting one value without recalculating the related one breaks the safety system (for example, an increase in strict_reject_rate alongside a decrease in depth_of_diagnostics is a symptom of Goodhart's Law).

Mutation testing (D.1): Evaluation of the quality of the diagnostics process based on artificial failures. The thresholds depend on the cost of a P0 miss (a missed critical incident) and the complexity of the route graph.

Selection of shadow specifications (D.2): An auction process where weights (mttr_gain, early_signal, false_escalation) determine which specifications become active. It requires balancing response speed against the number of false positives.

Tiered budgets (D.3): Distribution of computing resources (in tokens) between the local and frontier tiers. Changing the proportions directly affects phase SLAs.

Goodhart protection (D.4): A mechanism that protects metrics from gaming, where optimizing for a number worsens the actual outcome. It is controlled via invariants: silent_p0, manual_review_rate, audit_trace_coverage.

Production readiness (D.5): Evaluation of an artifact's readiness for release (default 23/25). It includes hard blocking invariants, such as audit_trace_coverage = 1.0, that a high total score cannot override.
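The production readiness rule above combines a score threshold with invariants that block a release regardless of the score. A minimal sketch of such a gate follows. The helper name `is_production_ready` and the dictionary of invariant values are illustrative assumptions, not part of the course repository.

```python
# Hypothetical readiness gate: names and data shapes are illustrative assumptions.
MIN_SCORE = 23          # default readiness threshold from D.5 (23 out of 25)
HARD_INVARIANTS = {     # invariants that block release regardless of the score
    "audit_trace_coverage": 1.0,
}


def is_production_ready(score, invariants, min_score=MIN_SCORE):
    """Return (ready, reasons). A missing or low invariant always blocks."""
    reasons = []
    for name, required in HARD_INVARIANTS.items():
        actual = invariants.get(name)
        if actual is None or actual < required:
            reasons.append(f"{name}={actual} is below required {required}")
    if score < min_score:
        reasons.append(f"score {score} is below {min_score}")
    return (not reasons, reasons)


# A perfect score does not compensate for incomplete audit coverage.
print(is_production_ready(25, {"audit_trace_coverage": 0.97}))
print(is_production_ready(23, {"audit_trace_coverage": 1.0}))
```

Keeping the invariant check separate from the score comparison means a high total cannot mask a failed invariant.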

Practice exercises: Name: Calibration of diagnostic depth (D.1)

Problem: You need to check how tightening the diagnostic depth threshold (depth_of_diagnostics_min) will affect the existing validator. The task is to compare the default run with a run under stricter requirements (threshold 5 instead of 3).

Solution:
1. Change to the example directory: cd book2/examples/stress-mutator.
2. Create an out folder and copy the expected failures file there: mkdir -p out && cp expected/expected_failures.json out/expected_failures_depth5.json.
3. Replace the threshold in the file: sed -i 's/"depth_of_diagnostics_min": 3/"depth_of_diagnostics_min": 5/' out/expected_failures_depth5.json.
4. Run the calculation with default values (it will pass, since the average depth of 4 is above the threshold of 3).
5. Run the calculation with the new expectations (it will fail, since 4 < 5). The difference shows the cost of tightening the threshold.

Complexity: intermediate
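The comparison in exercise D.1 can be sketched in Python. This version assumes that depth_of_diagnostics_min is a top-level key in the expectations file and that a run passes when the average depth meets the minimum. Both are assumptions, and the helper names are hypothetical. Unlike sed, which leaves the file unchanged without warning when the pattern does not match (for example, because of different whitespace), this version raises an error if the key is missing.

```python
import copy


def tighten_threshold(expectations, key, new_value):
    """Return a copy of the expectations with one threshold replaced.

    Raises KeyError if the key is absent, so a typo cannot pass silently.
    """
    if key not in expectations:
        raise KeyError(key)
    updated = copy.deepcopy(expectations)
    updated[key] = new_value
    return updated


def depth_passes(average_depth, expectations):
    """Assumed rule: the run passes when the average depth meets the minimum."""
    return average_depth >= expectations["depth_of_diagnostics_min"]


# In the exercise these values would come from expected_failures.json.
default = {"depth_of_diagnostics_min": 3}
strict = tighten_threshold(default, "depth_of_diagnostics_min", 5)

print(depth_passes(4, default))  # True: 4 meets the default minimum of 3
print(depth_passes(4, strict))   # False: 4 falls short of the minimum of 5
```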

Name: Shadow specifications auction with a conservative profile (D.2)

Problem: The team decided that false escalations cost too much. An auction needs to be run with a new weights profile, where the penalty for false escalation is increased to 0.8 and the weight of the early signal is reduced to 0.2.

Solution:
1. Change to the example directory: cd book2/examples/shadow-auction.
2. Run the scoring script with the new weights: python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights "0.3,0.4,0.2,0.8" --out out/scorebook.json.
3. Run the decision-making with a budget of 2000 tokens: python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json.
4. Analyze the result: shadow.p0.voice_handoff should move to disputed, since the formula has become stricter in assessing risks.

Complexity: advanced
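The effect seen in exercise D.2 can be illustrated with a toy model. The linear formula below is a hypothetical stand-in, not the formula used by score.py. The candidate values and the baseline weights are also invented for illustration. Only the keep and reject thresholds (0.70 and 0.40) and the conservative weights for early signal (0.2) and false escalation (0.8) come from the exercise.

```python
def weighted_score(candidate, weights):
    """Hypothetical linear score: gains add, false escalations subtract."""
    return (
        weights["mttr_gain"] * candidate["mttr_gain"]
        + weights["early_signal"] * candidate["early_signal"]
        - weights["false_escalation"] * candidate["false_escalation"]
    )


def classify(score, keep_threshold=0.70, reject_threshold=0.40):
    """Assumed boundaries: keep at or above 0.70, reject below 0.40."""
    if score >= keep_threshold:
        return "keep"
    if score < reject_threshold:
        return "reject"
    return "disputed"


# Illustrative candidate, not taken from candidates.yaml.
candidate = {"mttr_gain": 0.9, "early_signal": 0.9, "false_escalation": 0.2}

# Illustrative baseline weights; the conservative profile follows the exercise.
baseline = {"mttr_gain": 0.6, "early_signal": 0.4, "false_escalation": 0.3}
conservative = {"mttr_gain": 0.6, "early_signal": 0.2, "false_escalation": 0.8}

print(classify(weighted_score(candidate, baseline)))
print(classify(weighted_score(candidate, conservative)))
```

Under these assumed numbers the same candidate moves from keep to disputed once the false escalation penalty rises and the early signal weight falls.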

Name: Testing the tiered budget under failure (D.3)

Problem: Simulate a reduced budget of 5M tokens (4.5M local / 0.5M frontier) and check that the system survives a 45-minute outage of the local tier with token_health_min staying at or above 0.5.

Solution:
1. Change to the example directory: cd book2/examples/budget-keeper.
2. Compile the plan: python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json.
3. Run the failure simulation: python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json.
4. Check the invariants: python3 scripts/inspect.py --result out/fail_result_5m.json --query "failover_to_frontier==2 && degraded_queue==18 && token_health_min>=0.5".
5. Make sure that budget changes without updating phase quotas cause an error (compile.py fails).

Complexity: intermediate
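The arithmetic behind exercise D.3, and the reason the last step fails, can be sketched as follows. The phase names, the quota values, and the rule that phase quotas must add up exactly to the tier budget are assumptions made for illustration. The actual checks in compile.py may differ.

```python
def split_budget(total_tokens, local_percent):
    """Split a total budget into (local, frontier) using integer arithmetic."""
    if not 0 <= local_percent <= 100:
        raise ValueError("local_percent must be between 0 and 100")
    local = total_tokens * local_percent // 100
    return local, total_tokens - local


def check_phase_quotas(tier_budget, phase_quotas):
    """Assumed rule: phase quotas must add up exactly to the tier budget."""
    allocated = sum(phase_quotas.values())
    if allocated != tier_budget:
        raise ValueError(
            f"phase quotas total {allocated}, tier budget is {tier_budget}"
        )


local, frontier = split_budget(5_000_000, 90)
print(local, frontier)  # 4500000 500000

# Hypothetical phase names; these quotas are still sized for a 9M local tier.
stale_quotas = {"triage": 2_000_000, "diagnose": 4_000_000, "repair": 3_000_000}
try:
    check_phase_quotas(local, stale_quotas)
except ValueError as err:
    print("plan rejected:", err)
```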

Name: Circumventing Goodhart's metric protections (D.4)

Problem: Verify in practice how weakening even two independent safeguards (for example, manual_review_rate and silent_p0) allows one to "push through" a bad release that should have been blocked.

Solution:
1. Change to the example directory: cd book2/examples/goodhart-validator.
2. Weaken the silent_p0 threshold from 0.05 to 0.08 in a local copy of the specification.
3. Run validation (the script should remain 'red', since the metric 0.18 is still higher than 0.08).
4. Create a 'dangerous' config by weakening two thresholds at once (for example, edge_drift to 0.10 and silent_p0 to 0.20).
5. Run validation with bad metrics (fixtures/new_metrics_bad.json): the system will pass the check, which proves that pointwise calibration is unacceptable.

Complexity: advanced
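The mechanics of exercise D.4 can be reproduced with a small upper-bound check. The silent_p0 values (metric 0.18, thresholds 0.05, 0.08 and 0.20) and the weakened edge_drift threshold (0.10) come from the exercise. The edge_drift metric value and its default threshold are illustrative assumptions, and `violations` is a hypothetical helper rather than the validator's real interface.

```python
def violations(metrics, max_thresholds):
    """Return the sorted names of metrics that exceed their upper bound."""
    return sorted(
        name for name, limit in max_thresholds.items()
        if metrics[name] > limit
    )


# silent_p0 values come from the exercise; edge_drift values are illustrative.
bad_metrics = {"silent_p0": 0.18, "edge_drift": 0.09}
default = {"silent_p0": 0.05, "edge_drift": 0.05}

one_weakened = {**default, "silent_p0": 0.08}
two_weakened = {"silent_p0": 0.20, "edge_drift": 0.10}

print(violations(bad_metrics, default))       # both guards fire
print(violations(bad_metrics, one_weakened))  # still red
print(violations(bad_metrics, two_weakened))  # empty list: bad release passes
```

Each guard on its own still blocks the release. The release passes only when both thresholds are loosened, which is why every threshold shift needs a recorded justification.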

Case studies: Name: Anomalous growth of missed critical incidents (P0)

Scenario: In a large financial project (high risk profile), the team implemented an automated SDD process. Initially, the default AgentClinic thresholds were used: silent_p0 ≤ 5% and manual_review_rate ≥ 15%. Over time, developers began to complain about CI pipeline slowdowns due to manual checks.

Challenge: To speed up the process, the manual_review_rate threshold was lowered to 8% (the 'Low' setting), while silent_p0 remained at 5%. As a result, the system began to miss new classes of incidents not present in the historical database (shadow specifications). The MTTR metric formally decreased, but the number of catastrophic P0 misses grew (a symptom of Goodhart's Law).

Solution: SRE engineers returned to the principle of paired calibration. The manual_review_rate threshold was returned to 15%. At the same time, the budget of hint samples was increased to 12 slots (the 'High' level) to compensate for the load. All changes were documented in validation.md.

Result: The number of missed P0s returned to acceptable values (< 1-2%). The process stabilized, and the system once again began correctly classifying disputed cases thanks to the restoration of balance between automation and manual control.

Lessons learned: Reducing the share of manual checks without accounting for the dynamics of the incoming flow leads to a growth in P0 misses.

MTTR and manual_review_rate trade off against each other: lowering the share of manual review reduces MTTR on paper but raises the risk of P0 misses, so they should only be changed as a pair.

Process acceleration must not be achieved by weakening guard metrics.

Related concepts: Protecting metrics from Goodhart's Law (D.4)

Paired threshold calibration

Cost of a P0 miss

Name: Service degradation during a local LLM-tier outage

Scenario: An e-commerce platform with a flow of 600 incidents per day used a budget of 10M tokens with a 90/10 split between local-coder and frontier. This corresponded to the "Default" profile, but the actual flow belonged to the "High" category.

Challenge: During a seasonal sale, the local LLM provider began to fail regularly (weekly). The 1M token reserve on the frontier tier was exhausted within minutes. The triggered failover switched the system to degraded_queue mode, which led to hours of delays in restoring critical services.

Solution: The tiered budgets were recalculated (Section D.3). The total budget was increased to 25M tokens, and the proportion was changed to 80/20. A duplicate provider was added for the local tier. The budget_plan_phases specifications were updated so that the frontier tier could accommodate all "disputed" cases during a failure of the primary one.

Result: At the next local cluster failure, the system switched to frontier smoothly. token_health_min did not drop below 0.5, and user service was not interrupted.

Lessons learned: The token budget and the local/frontier proportions must correspond to the actual peak incident flow.

The 9M/1M split is tightly coupled to phase SLAs; changing the proportions requires updating the specifications.

With weekly local-coder failures, the reserve must be at least 15-20%.
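The reserve rule from this case study reduces to a one-line check. The sketch below uses the lower bound of the 15-20% range as the default floor. The function name is hypothetical, and integer arithmetic is used to avoid rounding at the boundary.

```python
def reserve_is_sufficient(total_tokens, frontier_tokens, min_percent=15):
    """True when the frontier tier holds at least min_percent of the budget."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return frontier_tokens * 100 >= min_percent * total_tokens


# Before: 10M total with a 90/10 split leaves 1M on frontier.
print(reserve_is_sufficient(10_000_000, 1_000_000))   # False: 10% reserve
# After: 25M total with an 80/20 split leaves 5M on frontier.
print(reserve_is_sufficient(25_000_000, 5_000_000))   # True: 20% reserve
```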

Related concepts: Tiered budgets (D.3)

CI SLA window

Sensitivity to local-coder failures

Study tips: Do not change thresholds on first read: The study minimum of each chapter is designed for the default thresholds. Start calibration only when the standard values stop fitting your flow.

Look for symptoms of Goodhart's Law: If one metric (for example, strict_reject_rate or MTTR) is steadily improving while the related one (depth_of_diagnostics or manual_review_rate) is falling, you are not optimizing the process; you are breaking the safety system.

Document every change: Any shift of a row in the tables must be accompanied by an entry in validation.md with a clear justification (the cost of a miss has changed, the flow has grown, etc.).

Visualize dependencies: Use mermaid diagrams (as in Section D.4) to understand exactly how changing a node (for example, audit_trace_coverage) will affect the entire metrics graph.
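The Goodhart symptom described in these tips can be turned into a simple trend check. In the sketch below, "steadily" is interpreted as a strict change at every step over at least three observations. That interpretation, the helper names, and the sample series are all assumptions.

```python
def strictly_rising(series):
    """True when every value is greater than the one before it."""
    return all(later > earlier for earlier, later in zip(series, series[1:]))


def strictly_falling(series):
    """True when every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(series, series[1:]))


def goodhart_symptom(headline, guard, higher_is_better=True, min_points=3):
    """Flag a headline metric that improves at every step while its guard falls."""
    if len(headline) < min_points or len(guard) < min_points:
        return False  # too little history to call anything a trend
    if higher_is_better:
        improving = strictly_rising(headline)
    else:
        improving = strictly_falling(headline)
    return improving and strictly_falling(guard)


# Illustrative series: strict_reject_rate climbs while diagnostic depth drops.
print(goodhart_symptom([0.60, 0.66, 0.71, 0.75], [4.0, 3.6, 3.1, 2.8]))
# Illustrative series: MTTR (lower is better) falls along with manual_review_rate.
print(goodhart_symptom([42, 38, 31], [0.15, 0.12, 0.08], higher_is_better=False))
```

A flag from this check is a prompt to inspect the pair, not proof of gaming. Both metrics can move this way for legitimate reasons.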

Additional resources: Chapters 5, 6, 9, 10, 11 (base course): The basic context for understanding the processes for which thresholds are configured in Appendix D.

The validation.md file: A template for recording justifications for threshold shifts. Required when porting the process to your own project.

The book2/examples/ repository: Contains the source scripts immunity_score.py, score.py, compile.py, and JSON/YAML configuration files for completing the exercises.

Summary: Appendix D is a deep dive into the fine-tuning of AgentClinic-production thresholds. The main takeaway: thresholds never exist in a vacuum. Any calibration is a balancing of risks. You cannot simply weaken the manual review threshold to speed up the pipeline without simultaneously strengthening protection against P0 misses. You cannot change the token budget without revising phase SLAs. Successful operation of the system requires continuous monitoring of indicators (for example, the share of disputed reviews or the cost of false escalation) and timely, paired revision of configurations.
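The pairing rule in this summary can be enforced as a lightweight check on configuration changes. The sketch below flags any pair in which exactly one threshold changed. The pairs follow the relationships named in this guide, but the configuration key names token_budget_total and phase_sla are illustrative assumptions. The check only detects isolated changes and does not judge whether a paired change is the right one.

```python
# Hypothetical key names; the pairs mirror relationships named in the guide.
PAIRS = [
    ("manual_review_rate", "silent_p0"),
    ("token_budget_total", "phase_sla"),
]


def unpaired_changes(old, new, pairs=PAIRS):
    """Return the pairs in which exactly one of the two values changed."""
    flagged = []
    for first, second in pairs:
        first_changed = old.get(first) != new.get(first)
        second_changed = old.get(second) != new.get(second)
        if first_changed != second_changed:
            flagged.append((first, second))
    return flagged


old_config = {"manual_review_rate": 0.15, "silent_p0": 0.05}
new_config = {"manual_review_rate": 0.08, "silent_p0": 0.05}

# Mirrors the first case study: manual review lowered, silent_p0 untouched.
print(unpaired_changes(old_config, new_config))
```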