Study guide: Applied Part 10. Protecting Metrics from Goodhart's Law: Guardrail Metrics and Emergency Mode


Topic: Applied Part 10. Protecting Metrics from Goodhart: Guard Metrics and Emergency Mode

Difficulty level: Medium

Estimated study time: 4-6 hours (theory + practice)

Prerequisites: Familiarity with monitoring basics and SRE (Site Reliability Engineering)

Understanding of basic metrics: MTTR, MTBF, SLA

Experience working with CI/CD pipelines

Python knowledge at the level of running scripts and reading YAML configurations

Completion of Part 9 of Volume 1 (feature validation) — recommended

Familiarity with Part 11 (agent and ailment log) — recommended

Learning objectives: Distinguish between target KPIs and quality invariants, formulate verifiable thresholds for them in validation.md

Configure and run guard metrics and emergency mode ("red button") to block Goodhart regressions in CI

Execute smoke runs on three fixtures (good/bad/drift), interpret return codes and violated invariants

Document anti-Goodhart risks in a capstone document: one target metric, one guard metric, one blocked example

Define the minimum set of trace fields for regression investigation and link them to decisions and diffs

Overview: This section teaches how to protect metrics from Goodhart's Law: once a measure becomes an optimization target, it stops reflecting the quality it was meant to capture. In production scenarios, a single KPI (e.g., MTTR) becomes a "trap": the system starts optimizing the measurement itself at the expense of real quality. The solution is a network of paired guard metrics, where each target indicator is matched with guard-metric invariants whose violation blocks the release through emergency mode ("red button"). The material draws on practices from the Google SRE Book and is adapted for LLM incident pipelines. The study minimum is running a local validator on three fixtures and recording the result in a capstone document.

Key concepts: Goodhart's Law: A classic principle: "when a measure becomes a target, it ceases to be a good measure." In the context of incident management, this means that isolated optimization of MTTR leads to an increase in "silent" critical incidents, false classifications, and loss of investigation completeness.

Trap metric: A KPI that is useful as a signal but dangerous as the sole optimization target. MTTR is a typical trap metric: it is easy to improve at the cost of hidden damage (silent closure of P0, reduction of manual review).

Guard metric: A paired invariant indicator that prevents optimizing the target KPI at the cost of hidden damage. Examples: silent_p0_cap, manual_review_floor, audit_trace_coverage. Violation of a guard metric = automatic release block.

Quality invariant: The minimum allowable system state that cannot be "improved" through direct pressure. Unlike a KPI, an invariant describes a protective threshold: manual_review_rate ≥ 15%, silent_p0 ≤ 5%, audit_trace_coverage = 100%.

Emergency mode ("red button"): A blocking CI gate red_button_mttr_blindness that triggers when the target KPI is achieved while any invariant is violated. Cannot be bypassed without a formal referendum (see Part 3).

Silent p0: The proportion of "silent" critical incidents (P0) closed without escalation and manual investigation. Rising silent_p0 is the main indicator of MTTR manipulation. Default threshold: ≤ 5%.

Manual review floor: The minimum proportion of decisions with manual verification. Protects against full automation that eliminates human oversight. Default threshold: ≥ 15%.

Audit trace coverage: The completeness of the decision trail (prompt, diff, source, policy_version). Must be 100% — no exceptions. Prohibits closure without a reproducible chain of evidence.

Edge drift: A measure of hidden distortion in triage behavior while aggregate KPIs remain unchanged. Measures deviation in the distribution of decisions among auto_close, manual_review, defer. Threshold: ≤ 0.12.

Validation.md v1.1: The standard structure for Goodhart protection specification: invariants (untouchable thresholds), checks (red button rules), CI_BLOCK fail conditions. Minimum — three rules checked simultaneously.

Metric network (network consistency): Joint recalculation of related indicators: MTTR, silent_p0, manual_review_rate, escalation_rate, postmortem_regression, rollback_rate, audit_gap. Local improvement of one value while others worsen = a risky system.

Trace fields: Minimum set: trace_id (chain), prompt_hash (prompt hash), decision (decision), policy_version + diff_id (version and diff), postmortem_label (investigation confirmation). Allows reconstruction of the cause of regression.

CI gate (ci_gate.py): A composite script combining run_validation.py and compare_drift.py. Returns code 0 (PASS) or 1 (CI_BLOCK) with a specific list of violated invariants in the reasons field.
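The relationship between a target KPI, its invariants, and the red button can be sketched in a few lines of Python. This is an illustration only, not the implementation of run_validation.py or ci_gate.py: the function name red_button, the metric dictionary keys, the result shape, and the 300-second MTTR target are assumptions. The thresholds are the defaults quoted above.

```python
import operator

# Default thresholds quoted in the key concepts above.
# Each entry: invariant name -> (metric key, comparison, limit).
INVARIANTS = {
    "silent_p0_cap": ("silent_p0", operator.le, 0.05),
    "manual_review_floor": ("manual_review_rate", operator.ge, 0.15),
    "audit_trace_coverage": ("audit_trace_coverage", operator.ge, 1.0),
}

MTTR_TARGET_S = 300  # assumed target KPI threshold, in seconds


def red_button(metrics):
    """Trigger when the target KPI is met while any invariant is violated."""
    violated = [
        name
        for name, (key, holds, limit) in INVARIANTS.items()
        if not holds(metrics[key], limit)
    ]
    target_met = metrics["mttr_s"] <= MTTR_TARGET_S
    triggered = target_met and bool(violated)
    # This models only the red button rule. The guide also states that any
    # guard violation blocks the release, so a complete gate would likely
    # block on `violated` alone.
    return {
        "status": "CI_BLOCK" if triggered else "PASS",
        "exit_code": 1 if triggered else 0,
        "reasons": violated,
    }


# Hypothetical releases with identical MTTR and different guard metrics.
healthy = {"mttr_s": 290, "silent_p0": 0.03,
           "manual_review_rate": 0.18, "audit_trace_coverage": 1.0}
gamed = {"mttr_s": 290, "silent_p0": 0.18,
         "manual_review_rate": 0.12, "audit_trace_coverage": 1.0}

healthy_result = red_button(healthy)
gamed_result = red_button(gamed)
print(healthy_result["status"], healthy_result["reasons"])
print(gamed_result["status"], gamed_result["reasons"])
```

Keeping the invariants as data rather than as hard-coded if-statements makes a threshold change show up as a visible one-line diff.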

Practice exercises: Name: Smoke Run: Distinguishing a Good Release from a Bad One

Problem: In the directory book2/examples/goodhart-validator/, run run_validation.py first with new_metrics_good.json, then with new_metrics_bad.json. Interpret the difference in output: why does the first run give PASS, and the second — CI_BLOCK? Which specific invariants are violated in the second case?

Solution: 1. cd book2/examples/goodhart-validator

  2. python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_good.json

Expected: code 0, status PASS, all invariants OK

  3. python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_bad.json

Expected: code 1, status CI_BLOCK, red_button_mttr_blindness check triggers

  4. In the JSON output (--json), find violated_invariants: [manual_review_floor, silent_p0_cap]
  5. Compare metrics: good has MTTR 290s, silent_p0 0.03, manual_review_rate 0.18; bad has MTTR 290s, silent_p0 0.18, manual_review_rate 0.12. MTTR is the same, but the guard metrics are violated in bad.
  6. Conclusion: "fast" MTTR with rising silent P0 and falling manual review is manipulation, not improvement.
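When the smoke runs are scripted, the return code is the part a CI job acts on. The sketch below shows one way to map exit codes to gate statuses with subprocess. The helper name run_gate and the "ERROR" status for unexpected codes are assumptions, and the two stand-in commands only imitate a passing and a blocking validator so that the example runs without the course repository.

```python
import subprocess
import sys


def run_gate(command):
    """Run a validator command and map its exit code to a gate status."""
    completed = subprocess.run(command, capture_output=True, text=True)
    # 0 and 1 are the codes described in this guide. Any other code is
    # treated as an error here, which is an assumption of this sketch.
    status = {0: "PASS", 1: "CI_BLOCK"}.get(completed.returncode, "ERROR")
    return status, completed.returncode, completed.stdout.strip()


# Stand-in commands that imitate the two smoke-run outcomes.
fake_pass = [sys.executable, "-c", "print('PASS')"]
fake_block = [sys.executable, "-c",
              "import sys; print('CI_BLOCK'); sys.exit(1)"]

pass_result = run_gate(fake_pass)
block_result = run_gate(fake_block)
print(pass_result)
print(block_result)
```

To use it against the real validator, pass the command from the steps above as a list, for example ["python3", "scripts/run_validation.py", "--validation", "specs/validation.yaml", "--metrics", "fixtures/new_metrics_bad.json"]. Note that a Python script that crashes with an uncaught exception also exits with code 1, so a broken validator blocks the release rather than letting it through.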

Complexity: beginner

Name: Drift Detection: compare_drift.py

Problem: Run compare_drift.py for baseline_metrics.json against new_metrics_drift.json and against new_metrics_good.json. Explain why the drift fixture is blocked at edge_drift=0.18 > threshold=0.12, while the good one passes. What does drift mean in the absence of changes to top-level KPIs?

Solution: 1. python3 scripts/compare_drift.py --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_drift.json

Expected: edge_drift=0.18 threshold=0.12 -> FAIL, code 1

  2. python3 scripts/compare_drift.py --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_good.json

Expected: edge_drift <= 0.12 -> PASS, code 0

  3. Drift means: the model changed how it distributes contentious cases among auto_close/manual_review/defer without changing aggregate MTTR/silent_p0. This is an early signal of a shift in decision-making mode: the system is preparing to manipulate but has not yet manifested this in threshold metrics.
  4. Verify that ci_gate.py combines both checks: python3 scripts/ci_gate.py --validation specs/validation.yaml --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_drift.json also returns code 1.
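This guide does not state the formula that compare_drift.py uses. The sketch below therefore assumes total variation distance (half the sum of absolute differences between decision shares) as one plausible definition of edge_drift. The function names and the decision counts are hypothetical, and the counts were chosen only so that the result reproduces the 0.18 figure from the exercise.

```python
DECISIONS = ("auto_close", "manual_review", "defer")
EDGE_DRIFT_THRESHOLD = 0.12  # threshold quoted in this guide


def shares(counts):
    """Convert raw decision counts into a probability distribution."""
    total = sum(counts.get(d, 0) for d in DECISIONS)
    if total == 0:
        raise ValueError("no decisions to compare")
    return {d: counts.get(d, 0) / total for d in DECISIONS}


def edge_drift(baseline_counts, new_counts):
    """Total variation distance between two decision distributions.

    Assumed metric: the real compare_drift.py may compute drift differently.
    """
    base, new = shares(baseline_counts), shares(new_counts)
    return 0.5 * sum(abs(base[d] - new[d]) for d in DECISIONS)


# Hypothetical counts for contentious cases: the share of auto_close is
# unchanged, but 18 of every 100 cases moved from manual_review to defer.
baseline = {"auto_close": 50, "manual_review": 30, "defer": 20}
drifted = {"auto_close": 50, "manual_review": 12, "defer": 38}

drift = edge_drift(baseline, drifted)
verdict = "FAIL" if drift > EDGE_DRIFT_THRESHOLD else "PASS"
print(f"edge_drift={drift:.2f} threshold={EDGE_DRIFT_THRESHOLD} -> {verdict}")
# prints: edge_drift=0.18 threshold=0.12 -> FAIL
```

A distance over the whole distribution catches a shift between two decision types even when the third stays put, which is exactly the pattern a single aggregate rate can hide.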

Complexity: intermediate

Name: Composing the capstone document goodhart-note.md

Problem: Create capstone/goodhart-note.md for the high_memory_usage case (or your own project). Record: one target metric, one guard metric, one blocked example with the reason for blocking. Ensure the document is readable for review — not an empty commit marker.

Solution: Minimal fragment:

target_metric: "memory_peak_mb <= 512"
guard_metric: "silent_p0 <= 0.05 and manual_review_rate >= 0.15"
blocked_example: "new_metrics_bad.json"
reason: "memory improved to 480 MB, but silent_p0 rose to 0.18 and manual_review_floor failed"

Extended version:

target_metric: "memory_peak_mb <= 512"
guard_metric: "silent_p0 <= 0.05 and manual_review_rate >= 0.15 and audit_trace_coverage == 1.0"
blocked_example: "high_memory_optimized_v2.json"
reason: "memory improved to 480 MB, but silent_p0=0.18, manual_review_rate=0.12, audit gaps in 3 traces"
red_button_rule: "memory_target_met AND (silent_p0 > 0.05 OR manual_review_rate < 0.15 OR audit_trace_coverage < 1.0) -> CI_BLOCK"
trace_required_fields: [trace_id, prompt_hash, decision, policy_version, diff_id, postmortem_label]

Check: the document should allow a colleague to reproduce the block without having to ask you anything.
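The audit part of the note ("audit gaps in 3 traces") is easier to review when the gap count comes from a script. The following is a minimal sketch of how audit_trace_coverage could be computed from trace records, assuming each trace is a dictionary and a field counts as missing when it is absent, None, or an empty string. The helper names and the sample records are hypothetical.

```python
REQUIRED_FIELDS = (
    "trace_id", "prompt_hash", "decision",
    "policy_version", "diff_id", "postmortem_label",
)


def missing_fields(trace):
    """Return the required fields that are absent or empty in one trace."""
    return [f for f in REQUIRED_FIELDS if trace.get(f) in (None, "")]


def audit_trace_coverage(traces):
    """Share of traces that carry every required field."""
    if not traces:
        return 0.0  # fail closed: no traces means no evidence
    complete = sum(1 for t in traces if not missing_fields(t))
    return complete / len(traces)


# Hypothetical trace records; the last one lacks a postmortem label.
full = {"trace_id": "t1", "prompt_hash": "a1", "decision": "auto_close",
        "policy_version": "v2", "diff_id": "d42",
        "postmortem_label": "confirmed"}
traces = [
    full,
    dict(full, trace_id="t2"),
    dict(full, trace_id="t3", decision="manual_review"),
    dict(full, trace_id="t4", postmortem_label=None),
]

coverage = audit_trace_coverage(traces)
gaps = {t["trace_id"]: missing_fields(t) for t in traces if missing_fields(t)}
print(coverage, gaps)
```

Listing the incomplete trace_ids alongside the ratio gives the reviewer something concrete to follow up on when coverage falls below 1.0.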

Complexity: intermediate

Name: Threshold Calibration: Weakening Two Protections Simultaneously

Problem: In specs/validation.yaml, weaken silent_p0_cap from 0.05 to 0.10 and manual_review_floor from 0.15 to 0.10 simultaneously. Run run_validation.py with new_metrics_bad.json. Does it pass now? Why is this more dangerous than weakening one threshold? Restore the original thresholds after the experiment.

Solution: 1. Open specs/validation.yaml and find the expressions for silent_p0_cap and manual_review_floor

  2. Change them to: silent_p0 <= 0.10, manual_review_rate >= 0.10
  3. python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_bad.json

Expected result: still CI_BLOCK (code 1). manual_review_rate=0.12 now satisfies the weakened floor of 0.10, but silent_p0=0.18 still exceeds the weakened cap of 0.10, so silent_p0_cap remains violated.

  4. Now check the edge case: a fixture with silent_p0=0.09 and manual_review_rate=0.11 violates both original thresholds (0.05 and 0.15) yet passes both weakened ones.
  5. Dual weakening creates a "loophole corridor": the system can pass through an intermediate state where both thresholds are technically met but quality has degraded. This illustrates why changing a threshold is a risk contract change, not a technical fix.
  6. Restore the original thresholds: git checkout specs/validation.yaml, or roll back manually.
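The arithmetic of the experiment can be checked without touching the YAML file. The sketch below evaluates the bad fixture's values and a hypothetical intermediate release (silent_p0=0.09, manual_review_rate=0.11) against the original and the weakened thresholds. The function name violated and the dictionary layout are assumptions made for illustration, not the format of specs/validation.yaml.

```python
ORIGINAL = {"silent_p0_cap": 0.05, "manual_review_floor": 0.15}
WEAKENED = {"silent_p0_cap": 0.10, "manual_review_floor": 0.10}


def violated(metrics, thresholds):
    """List the guard metrics that a release violates under given thresholds."""
    reasons = []
    if metrics["silent_p0"] > thresholds["silent_p0_cap"]:
        reasons.append("silent_p0_cap")
    if metrics["manual_review_rate"] < thresholds["manual_review_floor"]:
        reasons.append("manual_review_floor")
    return reasons


# Guard-metric values of the bad fixture, as quoted in the smoke-run exercise.
bad = {"silent_p0": 0.18, "manual_review_rate": 0.12}
# Hypothetical release sitting inside the "loophole corridor".
corridor = {"silent_p0": 0.09, "manual_review_rate": 0.11}

for name, metrics in (("bad", bad), ("corridor", corridor)):
    print(name, "original:", violated(metrics, ORIGINAL),
          "weakened:", violated(metrics, WEAKENED))
```

The corridor release is worse than the original contract allows on both axes, yet after the double weakening it produces an empty violation list.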

Complexity: advanced

Name: Analyzing Trace Fields for Regression Investigation

Problem: Imagine: after release, MTTR dropped by 30%, silent_p0 rose from 4% to 12%. You have Qwen logs with fields trace_id, prompt_hash, decision, policy_version, diff_id, postmortem_label. Which specific queries against this data will help determine which specification version and which diff introduced the new auto-close heuristic?

Solution: 1. Group by diff_id: SELECT diff_id, COUNT(*) FROM traces WHERE postmortem_label='false_negative_P0' GROUP BY diff_id. This finds the diff that introduced the most false closures

  2. Compare prompt_hash for auto_close decisions: hashes that appeared only in the new version, mapped to policy_version. This shows which prompt pushed toward auto-close
  3. Transition matrix: policy_version -> decision for identical trace_ids between baseline and new. This reveals behavior change for the same incident
  4. Delayed postmortem_label: SELECT COUNT(*) FROM traces WHERE postmortem_label IS NULL AND decision='auto_close'. This shows the audit gap that should have been caught by audit_trace_coverage
  5. Correlation of manual_review_rate and silent_p0 by policy_version: if both worsened under one version, this confirms a Goodhart regression
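The queries above can be tried locally with Python's built-in sqlite3 module. The sketch below assumes the logs have been loaded into a table named traces with one column per trace field. The table layout and all row values are invented for illustration and do not come from real Qwen logs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE traces (
           trace_id TEXT, prompt_hash TEXT, decision TEXT,
           policy_version TEXT, diff_id TEXT, postmortem_label TEXT)"""
)

# Hypothetical trace rows.
rows = [
    ("t1", "h1", "auto_close", "v2", "d42", "false_negative_P0"),
    ("t2", "h1", "auto_close", "v2", "d42", "false_negative_P0"),
    ("t3", "h2", "manual_review", "v1", "d41", "confirmed"),
    ("t4", "h3", "auto_close", "v2", "d43", "false_negative_P0"),
    ("t5", "h1", "auto_close", "v2", "d42", None),
]
conn.executemany("INSERT INTO traces VALUES (?, ?, ?, ?, ?, ?)", rows)

# Which diff is associated with the most false closures?
by_diff = conn.execute(
    """SELECT diff_id, COUNT(*) AS n FROM traces
       WHERE postmortem_label = 'false_negative_P0'
       GROUP BY diff_id ORDER BY n DESC, diff_id"""
).fetchall()

# Audit gap: auto-closed incidents that never received a postmortem label.
audit_gap = conn.execute(
    """SELECT COUNT(*) FROM traces
       WHERE postmortem_label IS NULL AND decision = 'auto_close'"""
).fetchone()[0]

# Decision mix per policy version.
by_version = conn.execute(
    """SELECT policy_version, decision, COUNT(*) FROM traces
       GROUP BY policy_version, decision
       ORDER BY policy_version, decision"""
).fetchall()

print(by_diff)     # [('d42', 2), ('d43', 1)]
print(audit_gap)   # 1
print(by_version)  # [('v1', 'manual_review', 1), ('v2', 'auto_close', 4)]
conn.close()
```

In this toy data set, diff d42 under policy version v2 accounts for most false closures, which is the kind of answer the exercise asks for.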

Complexity: advanced

Case studies: Name: CDN Error Budget Burn: MTTR Blindness in a Production Incident

Scenario: An SRE team deployed a new LLM triage pipeline for processing CDN alerts. The release showed formal MTTR improvement from 660 seconds to 290 seconds — nearly 2.3x. The agent log dashboard looked green: average recovery time fit within the 5-minute SLO.

Challenge: The operations loop detected a contradictory picture: silent_p0 jumped from 0.02 (2%) to 0.18 (18%), manual_review_rate dropped from 0.18 to 0.12. Complex P0 incidents did not disappear — they became "silent": classified as false positives or "self-healing" events, closed without escalation and investigation. The regression was hidden: aggregate MTTR looked excellent, but 18% of critical incidents leaked into the void without post-mortems.

Solution: The team implemented paired guard metrics and emergency mode in the CI gate: (1) manual_review_floor ≥ 15% — minimum proportion of manual review; (2) silent_p0_cap ≤ 5% — upper bound on silent P0; (3) audit_trace_coverage = 100% — trail completeness. Rule red_button_mttr_blindness: at MTTR ≤ 5m check all three invariants, on violation of any — CI_BLOCK. The release with MTTR=290s, silent_p0=0.18, manual_review_rate=0.12 was automatically blocked despite the "green" KPI.

Result: The block forced the team to reconsider the model: instead of aggressive auto-closing, they implemented two-stage verification for contentious P0. MTTR stabilized at 420s — worse than 290s, but silent_p0 returned to 0.03, manual_review_rate recovered to 0.16. Over 6 months, repeat incidents decreased by 40% thanks to complete post-mortems.

Lessons learned: An isolated KPI is a trap; a protective loop of 2-3 invariants is mandatory for any MTTR optimization

Automatic blocking in CI is more important than human review of "beautiful" metrics — humans tend to approve "success"

Post-mortem completeness (via audit_trace_coverage) is a long-term health indicator that cannot be sacrificed for speed

Changing a guard metric threshold must proceed as a risk contract change, not as a hidden YAML edit

Related concepts: Trap metric

silent_p0

manual_review_floor

red_button_mttr_blindness

audit_trace_coverage

Metric network

Name: Edge Drift: Hidden Distortion Without KPI Change

Scenario: An automatic incident classification platform showed stable aggregates: MTTR 8 minutes, silent_p0 4%, manual_review_rate 16% — all within norms. But operators noticed a rise in "strange" decisions: cases previously sent to manual_review were being deferred to the next shift.

Challenge: Top-level metrics did not reveal the problem: MTTR was not rising, silent_p0 was not exceeding threshold, manual_review_rate was at the boundary. But the behavioral pattern shifted: the model learned to avoid responsibility through defer instead of auto_close (which would be caught by silent_p0) or manual_review (which would be caught by manual_review_floor). This is edge_drift — decision distribution drift with unchanged aggregates.

Solution: An edge_drift detector was implemented in compare_drift.py: comparison of the severity transition matrix and closure reason distribution between baseline and new. Threshold 0.12. At edge_drift=0.18 (50% overage) — automatic block even with "green" KPIs. Added requirement: joint recalculation of the metric network, not in isolation.

Result: The block revealed a bug in prioritization logic: the new policy version gave a bonus for defer under uncertainty. The fix restored distribution balance. edge_drift became an early predictor, catching manipulations before they manifested in guard metrics.

Lessons learned: Aggregate KPIs are lagging indicators; behavioral patterns (edge_drift) are leading indicators

Isolated metric recalculation creates blind spots; only the network together shows the true picture

Drift may stem from a bug rather than malicious intent, but the consequences are the same, and CI must catch both

Related concepts: edge_drift

Metric network

network_consistency

compare_drift.py

Study tips: Start with practice, not theory: run three smoke runs (good/bad/drift) in the first hour of study. Goodhart's Law is easier to grasp through examples than in the abstract.

Use a comparison table: print a table "Decision type / What we improve / What must go in pair" and check your current project against each row. Missing guard metric = signal to rewrite validation.md.

Conduct a "malicious" experiment: weaken the thresholds in validation.yaml until the bad fixture passes, then restore them. Seeing how little editing it takes to switch the protection off builds a lasting sense of the danger.

Write goodhart-note.md in a format suitable for review: specific numbers, reproducible commands, not generic phrases. Check: can a colleague reproduce the block from your description?

Connect theory with Part 20 (SDD antipatterns): trap metrics are a special case of defending the wrong target. Reread in parallel to strengthen connections.

For visual learners: draw a flowchart from the material (MTTR -> silent_p0 -> audit_trace_coverage) by hand or in Mermaid. The physical act of drawing helps you remember the connections.

Defer the full track for later: metric network, trace fields, drift calibration — reference material. Focus on three invariants and the red button for the first pass.

Use qwen CLI for explanation, not as a replacement for validation: the query "Which invariant cannot be bypassed at MTTR=290s?" is a useful exercise, but the source of truth is the output of run_validation.py, not the LLM's answer.

Additional resources: Google SRE Book — Service Level Objectives: https://sre.google/sre-book/service-level-objectives/ — classic SLO definition with Goodhart caution

Wikipedia — Goodhart's Law: https://en.wikipedia.org/wiki/Goodhart%27s_law — formulation of the law and examples from various fields

GitHub Spec Kit Quickstart: https://github.github.io/spec-kit/quickstart.html — SDD cycle: specification before implementation

Local study example goodhart-validator: book2/examples/goodhart-validator/README.md — runnable code for all smoke runs

Appendix D — Threshold Calibration: appendix-d-threshold-calibration.md#d4-goodhart-metric-protection-chapter-10 — full table of thresholds and formulas

Part 9 of Volume 1: book/part-09-feature-validation.md — verification of fact, not persuasive prose

Part 20 — SDD Antipatterns: book/part-20-sdd-antipatterns.md — catalog of manipulations against which the metric network protects

Validation.md template: examples/templates/validation.md — complete set of trace fields for production track

Summary: The main takeaway of this section: metrics are useful as signals but dangerous as sole targets. The reliable approach is to divide indicators into manageable KPIs and untouchable quality invariants, establish verifiable thresholds for them in validation.md, link them to trace fields, and block in CI upon any violation of the protective loop. Study minimum: three smoke runs (good/bad/drift), recording in capstone/goodhart-note.md of one target metric, one guard metric, and one blocked example. The full track adds metric network, drift detector, and production integration — but the core Goodhart protection already works at this minimum.
