Reading: Applied Part 10. Protecting Metrics from Goodhart's Law: Guardrail Metrics and Emergency Mode

Applied Part 10. Protecting Metrics from Goodhart: Guard Metrics and Emergency Mode

Status: Recommendation. Protecting KPIs with a paired guard metric and a blocking emergency mode is a well-established practice described in the Google SRE Book. The specific thresholds (silent_p0, manual_review_floor, audit_trace_coverage) and the validation.md v1.1 format are a recommended framework that most teams adapt.

For the educational run, it is enough to launch examples/goodhart-validator/ and see how a good MTTR is blocked by a rise in silent_p0. The metric network, trace fields, and threshold calibration belong to the full production track. If you encounter the name "red button" below, read it as a short alias for the formal emergency mode.

In part 9 of the first volume, one metric per check was sufficient: "feedback is visible after publication," "the sum does not go negative." In a production scenario like cdn_error_budget_burn, the same logic is no longer enough. The agent log and fault dashboard from part 11 show a contradictory picture after release, and a single metric becomes bait. Here we expand it into a network of paired guard metrics: pairs of "KPI + safety net indicator," where the second prevents the first from being optimized at the cost of hidden damage. The catalog of typical manipulations this network protects against is systematized in part 20, "SDD Antipatterns."

Before Reading

Foundation from the first volume: part 9 teaches checking the fact, not persuasive prose; part 20 shows how the process starts defending the wrong goal.
Local educational case: cdn_error_budget_burn, because an improved MTTR can be blocked by a rise in silent_p0.
Trace for capstone/: one target metric, one guard metric, and one blocked example for high_memory_usage.

Main terms of the first pass: guard metric and emergency mode ("red button"). The rest (silent_p0, manual_review_floor, audit_trace_coverage, edge_drift, trace fields, metric network) is reference material; open it only when needed for a single line of capstone/goodhart-note.md.
What to defer: the metric network, trace fields, drift calibration, and a full emergency mode.

Goal

By the end of the section, you will assemble a validation.md that catches Goodhart traps in advance and prevents the LLM incident pipeline from improving reported KPIs at the cost of triage degradation.

The main gain is this: you will separate metrics into managed goals and inviolable quality invariants. Then you will anchor verifiable thresholds, evidence in traces, and CI blocks for them.

"Bait metrics" here means KPIs that are useful as a signal but become dangerous if optimized separately from quality invariants. KPI (key performance indicator) is a key indicator that the team aims to improve with a release.

This approach continues the SDD cycle: specification, verification criteria, and iterations are fixed before changes are introduced, rather than adjusted after obtaining a beautiful result (GitHub Spec Kit Quickstart).

The very effect of "when a measure becomes a target, it ceases to be a good measure" is classically known as Goodhart's law (Wikipedia: Goodhart's law). The SLO definition from Google SRE directly relies on this caution (SRE Book: Service Level Objectives).

Minimal Educational Scenario

Educational Case

The educational case is the production incident cdn_error_budget_burn, projected onto the educational agent log from book/part-11-second-feature-phase.md. The release improved MTTR from 660s to 290s, which formally looks like a success. But silent_p0 jumped from 0.02 to 0.18, and manual_review_rate dropped from 0.18 to 0.12. The goal is to see that the CI gateway catches this shift and blocks the merge, despite the "green" MTTR.
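The shift described above can be made concrete with a few lines of Python. This is a minimal sketch that uses only the numbers quoted in this case; the dictionary layout and the `relative_change` helper are illustrative assumptions, not the actual schema of the fixture files.

```python
# Numbers quoted in the educational case. The dict layout is an assumption,
# not the real schema of baseline_metrics.json or new_metrics_bad.json.
baseline = {"mttr_s": 660, "silent_p0": 0.02, "manual_review_rate": 0.18}
release = {"mttr_s": 290, "silent_p0": 0.18, "manual_review_rate": 0.12}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


changes = {name: relative_change(baseline[name], release[name]) for name in baseline}

for name, change in changes.items():
    print(f"{name}: {baseline[name]} -> {release[name]} ({change:+.0%})")
```

Read side by side, the target metric moves in the desired direction while both guard metrics move in the wrong one: silent_p0 is nine times its baseline value and manual review shrinks by a third.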

Preparation

book2/examples/goodhart-validator/specs/validation.yaml — invariants and red button check.

book2/examples/goodhart-validator/fixtures/baseline_metrics.json — baseline (MTTR 660s, silent_p0 0.02).
book2/examples/goodhart-validator/fixtures/new_metrics_good.json — improvement without blind spots.
book2/examples/goodhart-validator/fixtures/new_metrics_bad.json — "MTTR blindness" (290s, silent_p0 0.18).
book2/examples/goodhart-validator/fixtures/new_metrics_drift.json — drift along edge correlations.
book2/examples/goodhart-validator/scripts/run_validation.py, compare_drift.py, ci_gate.py.

Steps

1. cd book2/examples/goodhart-validator. *Expectation: you are in the example directory, no additional dependencies.*
2. "Good" run: python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_good.json. *Expectation: return code 0, status PASS, all three invariants OK.*
3. "MTTR blindness" run: python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_bad.json. *Expectation: return code 1, red_button_mttr_blindness triggers, manual_review_floor and silent_p0_cap are marked FAIL.*

Bad: looking only at MTTR. The release is faster, so it seems "better."

Good: running validation with invariants. "Faster" with silent_p0=0.18 is blocked automatically.

4. Drift run against the drift fixture: python3 scripts/compare_drift.py --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_drift.json. *Expectation: edge_drift > 0.12, return code 1.*
5. Control: the same compare_drift.py against good metrics. *Expectation: edge_drift <= 0.12, return code 0.*
6. Full CI gateway: python3 scripts/ci_gate.py --validation specs/validation.yaml --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_bad.json. *Expectation: return code 1, in reasons the specific violated invariants are listed, not a generic FAIL.*
7. Fix the run as a short anti-Goodhart conclusion: the target metric improved, but silent_p0_cap and manual_review_floor blocked the release. *Expectation: on the next pull request with MTTR acceleration, the validator compares not "green against old baseline" but against good/bad/drift fixtures.*

If you have Qwen Code installed and need an explanation for review, perform a separate optional step:

qwen -p "Read @fixtures/new_metrics_bad.json and @specs/validation.yaml. Which invariant cannot be bypassed even with MTTR=290s? Don't change the files." --approval-mode plan

Such output is useful as an explanation but does not replace run_validation.py, compare_drift.py, and ci_gate.py.

Control Fact

Step 2 yields return code 0, steps 3 and 4 yield return code 1 with specific indication of violated invariants. Step 6 shows the same behavior in the composite gateway. If the CI gateway passes new_metrics_bad.json, the validator configuration is weakened — the silent_p0_cap or manual_review_floor threshold is shifted.
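The control fact can also be checked mechanically. The sketch below is an assumption about how one might wrap the commands from the steps in Python; the `EXPECTED` table and the helper names are hypothetical and not part of the example repository. The runner is injectable so the logic can be exercised without the scripts present.

```python
import subprocess
from typing import Callable, Sequence

# Commands and expected return codes copied from the steps in this section.
EXPECTED = [
    (["python3", "scripts/run_validation.py", "--validation", "specs/validation.yaml",
      "--metrics", "fixtures/new_metrics_good.json"], 0),
    (["python3", "scripts/run_validation.py", "--validation", "specs/validation.yaml",
      "--metrics", "fixtures/new_metrics_bad.json"], 1),
    (["python3", "scripts/compare_drift.py", "--baseline", "fixtures/baseline_metrics.json",
      "--new", "fixtures/new_metrics_drift.json"], 1),
    (["python3", "scripts/ci_gate.py", "--validation", "specs/validation.yaml",
      "--baseline", "fixtures/baseline_metrics.json",
      "--new", "fixtures/new_metrics_bad.json"], 1),
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def find_mismatches(runner: Callable[[Sequence[str]], int] = run_command) -> list[str]:
    """Describe every command whose exit code differs from the expectation."""
    mismatches = []
    for argv, expected_code in EXPECTED:
        actual = runner(argv)
        if actual != expected_code:
            mismatches.append(f"{' '.join(argv)}: expected {expected_code}, got {actual}")
    return mismatches
```

Calling `find_mismatches()` from the example directory should return an empty list when the validator behaves as described; any entry in the list points at the command whose return code deviates.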

How This Gets Into `capstone/`

Move one target metric, one guard metric, and one blocked example to capstone/goodhart-note.md. If the main assessed case is high_memory_usage, record this run as an anti-Goodhart risk for the same circuit: memory or MTTR cannot be improved at the cost of silent_p0, manual audit, or 5xx. Do not move the entire metric network if it was not recalculated; for the educational minimum, it is enough to show that an improved KPI does not pass without a protective invariant.

Minimum fragment:

target_metric: "MTTR <= 5m"
guard_metric: "silent_p0 <= 0.05 and manual_review_rate >= 0.15"
blocked_example: "new_metrics_bad.json"
reason: "MTTR improved, but silent_p0 and manual_review_floor fail"

Reviewable Trace

The scripts run_validation.py, compare_drift.py, and ci_gate.py write results to stdout; no separate out/ directory is created. For the educational path, move the result to capstone/goodhart-note.md: target metric, guard metric, blocked example, and reason.

If you save outputs/goodhart.last-run.txt in your project, it should be a readable appendix to the review, not an empty marker. In SDD, a fact is a reproducible command or a readable artifact, not the mere presence of a commit.

Key Ideas

First, define which indicators remain quality invariants and which become optimization goals and are therefore subject to manipulation. An invariant cannot be "improved" by direct pressure: it describes the minimum acceptable state of the system. Examples of invariants:

audit completeness;
share of manual review;
upper bound of silent_p0 (this is the share of "silent" critical incidents closed without escalation).

An optimization goal, on the contrary, can be lowered or raised, but only within a protective corridor. MTTR is useful as a recovery speed indicator but dangerous as the sole reward for a model or team.

In validation.md, make this distinction explicit. MTTR<=5m can be a goal. And manual_review_rate>=15%, silent_p0<=5%, and audit_trace_coverage==100% should remain admission conditions.

Bad:

> Achieve MTTR below 5 minutes.

Problem: a bare goal without guard metrics is a direct path to a rise in silent_p0.

Good:

> MTTR <= 5m AND silent_p0 <= 5% AND manual_review_rate >= 15% AND audit_trace_coverage == 100% — violation of any condition = CI_BLOCK.

The Goodhart trap manifests itself when a metric becomes a substitute for reality. The system begins to optimize the way quality is measured, not the quality of triage itself. If MTTR is checked in isolation, the model learns to close incidents faster, reduce the share of escalations, and avoid long investigations, because escalations and long investigations are exactly what raise the average recovery time.

On the graph, this looks like a win: MTTR drops to 5 minutes or below. But in the operational loop, it can mean the opposite. Complex P0s have not disappeared but have become invisible because they were misclassified as false positives, low urgency, or "self-recovered" events.

The "MTTR 5 minutes" trap is especially dangerous for rare severe incidents, where the speed of closure competes with the completeness of investigation. In numbers, it looks like this:

baseline on 300 incidents in replay: MTTR 11:00, escalation share 14%, silent_p0 2%;
new optimized version: MTTR 4:50, escalations 6%, silent_p0 18%.

Formally, the KPI has improved. But the system has started missing critical events more often without manual review and escalation. Block such a release: it shifts the risk from the visible report to future recurring incidents, post-mortem regressions, and lost responsibility chains.
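To see what these percentages mean in absolute terms, the shares can be converted to incident counts. This is a worked-arithmetic sketch; it assumes, for illustration, that each share is taken over all 300 replay incidents, which the text does not state explicitly.

```python
REPLAY_SIZE = 300  # incidents in the replay, from the text

# Shares quoted above for the baseline and the optimized version.
# Assumption: both shares use the full replay as the denominator.
versions = {
    "baseline": {"escalation_share": 0.14, "silent_p0": 0.02},
    "optimized": {"escalation_share": 0.06, "silent_p0": 0.18},
}

counts = {
    name: {metric: round(share * REPLAY_SIZE) for metric, share in shares.items()}
    for name, shares in versions.items()
}

extra_silent_p0 = counts["optimized"]["silent_p0"] - counts["baseline"]["silent_p0"]
lost_escalations = (
    counts["baseline"]["escalation_share"] - counts["optimized"]["escalation_share"]
)

print(counts)
print(f"additional silent P0s: {extra_silent_p0}, fewer escalations: {lost_escalations}")
```

Under that assumption the optimized version has 48 more silent P0s and 24 fewer escalations on the same replay, which is the hidden cost behind the improved MTTR.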

Antibodies in validation.md are formal conditions that prevent optimization from redefining the meaning of quality. The minimum set is three rules, and they must be checked simultaneously:

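| Rule | What It Protects | Boundary |
| --- | --- | --- |
| `manual_review_floor` | share of decisions with manual verification | not below 15% |
| `silent_p0_cap` | share of "silent" critical incidents closed without escalation | not above 5% |
| `audit_trace_required` | completeness of the audit trace (`audit_trace_coverage`) | exactly 100% |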

Individually, these rules leave loopholes. High traceability does not compensate for the growth of silent_p0. Manual review is useless if it is impossible to reconstruct the prompt, diff, and source of the decision. Configure the "red button" so that it triggers not on one bad number, but on a violation of the protective contour.

What to Choose as a Goal and What as Protection

Not all KPIs require the same protection. Manual triage operations and automatic remediation have different risk levels, so the minimum set of invariants is also different. The main rule is: the more dangerous the action, the more guard metrics go paired with the target KPI.

| Decision Type | What We Improve | What Must Go in Pair |
| --- | --- | --- |
| Manual triage operation | `MTTR` | decision trace fully saved |
| Auto-classification without action | classification speed and accuracy | no silent P0s; decision trace saved |
| Auto-escalation | escalation delay | no silent P0s; no false escalations |

The full metric names (silent_p0, manual_review_floor, audit_trace_coverage, false_escalation_rate, edge_drift, postmortem_gap, backup_verified) with thresholds and formulas are in Appendix D. The important thing here is the rule, not the table of names: each line of "what we improve" must have one or two guards from the same risk area.

For dangerous actions (the last three rows), additionally include the "red button" — a blocking gateway that cannot be bypassed without a referendum from chapter 3. For manual and observational operations (the first three rows), a soft warning is enough.

The goal of the table is not to turn it into a dogma, but to help you see what you missed. If in the line "auto-remediation with state" you have no backup check — this is a signal to rewrite validation.md, not to "optimize MTTR."

> [conceptual interface] — the structure of validation.md that you adapt to your tracing files.

#### Minimum structure of validation.md v1.1
version: 1.1
invariants:
  - name: manual_review_floor
    expression: "manual_review_rate >= 0.15"

  - name: silent_p0_cap
    expression: "silent_p0 <= 0.05"

  - name: audit_trace_required
    expression: "audit_trace_coverage == 1.0"

checks:
  - name: red_button_mttr_blindness
    when: "MTTR <= 5m"
    assert: "manual_review_rate >= 0.15 and silent_p0 <= 0.05 and audit_trace_coverage == 1.0"
    fail: "CI_BLOCK"

The full form with artifact_inputs, network_consistency, and the exact expression for audit_trace_required through COUNT(events_with(...)) is in [examples/goodhart-validator/specs/validation.yaml](examples/goodhart-validator/specs/validation.yaml).
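The invariant expressions in the fragment above are simple enough to evaluate without `eval`. The following is a minimal sketch under stated assumptions: it handles only clauses of the form `name operator number`, it does not model the `when: "MTTR <= 5m"` condition, and the `audit_trace_coverage` value for the bad release is assumed to be 1.0 because the text reports only the other two invariants as failing. The helper names are hypothetical and do not describe how run_validation.py is implemented.

```python
import operator

# Comparison operators that appear in the fragment above.
OPERATORS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def holds(expression: str, metrics: dict[str, float]) -> bool:
    """Evaluate one clause like 'silent_p0 <= 0.05' against measured values."""
    name, symbol, threshold = expression.split()
    return OPERATORS[symbol](metrics[name], float(threshold))


def violated(invariants: dict[str, str], metrics: dict[str, float]) -> list[str]:
    """Return the names of invariants that do not hold."""
    return [name for name, expr in invariants.items() if not holds(expr, metrics)]


INVARIANTS = {
    "manual_review_floor": "manual_review_rate >= 0.15",
    "silent_p0_cap": "silent_p0 <= 0.05",
    "audit_trace_required": "audit_trace_coverage == 1.0",
}

# audit_trace_coverage=1.0 is an assumption; the other values come from the text.
bad_release = {"manual_review_rate": 0.12, "silent_p0": 0.18, "audit_trace_coverage": 1.0}
failed = violated(INVARIANTS, bad_release)
print("CI_BLOCK" if failed else "PASS", failed)
```

Returning the names of the violated invariants, rather than a bare boolean, keeps the block reason readable in review.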

The next layer of protection is a detector of hidden distortions right in the specification. Treat a change in triage behavior as a regression even when the KPIs are unchanged. Here, a regression is a shift in the distribution of decisions that is not visible in aggregates.

Reason: harm is not always visible in top-level numbers. MTTR may remain the same, the share of escalations may look normal, but the model will start distributing controversial cases differently between auto_close, manual_review, and defer.

Therefore, in validation.md compare not only aggregates but also behavioral patterns:

severity transition matrix;
distribution of close reasons;
share of reopened incidents;
delay to post-mortem label;
change in the relationship between manual_review_rate and silent_p0.

If drift_budget (the allowable deviation corridor from the baseline) is exceeded, block the build even with "green" KPIs. This means the system has already changed its decision mode.
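One possible way to compare behavioral patterns is to measure the distance between two distributions of decisions. The sketch below uses total variation distance, which is an assumption chosen for illustration; the text does not define how edge_drift or drift_budget is computed, and the decision shares are hypothetical values, not fixture data.

```python
def total_variation(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Half the sum of absolute differences between two share distributions."""
    keys = set(baseline) | set(candidate)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - candidate.get(k, 0.0)) for k in keys)


DRIFT_BUDGET = 0.12  # reuses the edge_drift threshold from the text as an illustration

# Hypothetical shares of triage decisions; not taken from the fixtures.
baseline_decisions = {"auto_close": 0.60, "manual_review": 0.30, "defer": 0.10}
candidate_decisions = {"auto_close": 0.75, "manual_review": 0.15, "defer": 0.10}

drift = total_variation(baseline_decisions, candidate_decisions)
verdict = "CI_BLOCK" if drift > DRIFT_BUDGET else "PASS"
print(f"drift={drift:.2f} budget={DRIFT_BUDGET} -> {verdict}")
```

In this hypothetical case 15 percentage points of decisions moved from manual_review to auto_close, so the build is blocked even if MTTR and the escalation share look unchanged.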

To see the main trap, three metrics and one guard are enough:

flowchart LR
    MTTR[MTTR]
    silent_p0[silent_p0]
    manual_review_rate[manual_review_rate]
    audit_trace_coverage[audit_trace_coverage]
    silent_p0 -->|pulls MTTR down dishonestly| MTTR
    manual_review_rate -->|pulls MTTR up honestly| MTTR
    audit_trace_coverage -->|limits silent_p0| silent_p0

Read this way: you can artificially improve MTTR if you allow "silent" P0s to close without escalation. The guard audit_trace_coverage prohibits closing without a trace, and manual_review_rate keeps the share of manual checks. The full picture with additional metrics (escalation_rate, postmortem_regression) is in Appendix D; there you will also find formal thresholds and connections.

Bind the checks to Qwen logs, decisions, and diff chains — otherwise they cannot be moved to production without losing context. Minimum trace composition for one event: trace_id (chain), prompt_hash (prompt hash), decision (what was chosen), policy_version + diff_id (which version and which change introduced it), and postmortem_label (what the post-mortem confirmed). The full set of fields with agent, raw_alert_excerpt, reasoning_delta, and review_outcome belongs to the full track and is collected in [examples/templates/validation.md](examples/templates/validation.md).

These fields make it possible to answer engineering questions after a block: which version of the specification changed the behavior, which prompt pushed the model toward auto-close, which diff introduced a new heuristic. Without this bundle, validation.md remains a declaration; with it, it becomes a reproducible audit artifact.
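Trace completeness can be measured directly from the event log. The sketch below is an assumption about one way to compute audit_trace_coverage from the minimum trace composition; the exact expression used by the example validator is in its validation.yaml and may differ. The events are hypothetical, and requiring a non-empty postmortem_label is a simplification, since in practice that label arrives later than the decision.

```python
# Minimum trace composition named in the text.
REQUIRED_FIELDS = (
    "trace_id", "prompt_hash", "decision",
    "policy_version", "diff_id", "postmortem_label",
)


def is_complete(event: dict) -> bool:
    """An event counts as traced only if every required field is present and non-empty."""
    return all(event.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def audit_trace_coverage(events: list[dict]) -> float:
    """Share of events with a complete trace; an empty log is treated as uncovered."""
    if not events:
        return 0.0
    return sum(is_complete(e) for e in events) / len(events)


# Hypothetical events for illustration only.
events = [
    {"trace_id": "t-1", "prompt_hash": "ab12", "decision": "auto_close",
     "policy_version": "1.1", "diff_id": "d-7", "postmortem_label": "false_positive"},
    {"trace_id": "t-2", "prompt_hash": "cd34", "decision": "auto_close",
     "policy_version": "1.1", "diff_id": "d-7", "postmortem_label": ""},
]
print(audit_trace_coverage(events))
```

Treating an empty log as zero coverage is a deliberate choice: a run that produced no traces should not pass an invariant that demands full traceability.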

Design metrics as a network of dependencies, not as a set of independent counters. This is network_consistency: a change in one metric should not contradict related ones. Recalculate MTTR, silent_p0, manual_review_rate, escalation_rate, postmortem_regression, rollback_rate, and audit_gap together (see the diagram above). Local improvement of one quantity often creates debt in another. A practical criterion is edge consistency: if MTTR falls, but at the same time manual review decreases and the share of late-confirmed P0s grows, mark the system as risky. This turns CI from a "KPI pass/fail" check into a check of triage behavior stability.
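The edge-consistency criterion from the paragraph above can be written as a single predicate. This is a minimal sketch: the `Snapshot` type and the field name `late_confirmed_p0_share` are hypothetical, and the silent_p0 values from the educational case are reused as a stand-in for the share of late-confirmed P0s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    mttr_s: float
    manual_review_rate: float
    late_confirmed_p0_share: float  # hypothetical name for "late-confirmed P0s"


def is_risky(before: Snapshot, after: Snapshot) -> bool:
    """Flag the combination described above: MTTR falls while manual review
    shrinks and the share of late-confirmed P0s grows."""
    return (
        after.mttr_s < before.mttr_s
        and after.manual_review_rate < before.manual_review_rate
        and after.late_confirmed_p0_share > before.late_confirmed_p0_share
    )


before = Snapshot(mttr_s=660, manual_review_rate=0.18, late_confirmed_p0_share=0.02)
after = Snapshot(mttr_s=290, manual_review_rate=0.12, late_confirmed_p0_share=0.18)
print(is_risky(before, after))
```

The predicate looks at the direction of three related metrics at once, which is the difference between a network check and three independent threshold checks.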

> [conceptual interface]: scripts/metrics/network_recompute.py shows the form of local recalculation of the metric network; there is no ready-made CLI in the textbook repository. A runnable analog of the CI gateway itself with anti-Goodhart checks is python3 examples/goodhart-validator/scripts/run_validation.py and ci_gate.py (see "Control Fact" above).

#### Updating the metric network and validation CI-gate after spec change
python3 scripts/metrics/network_recompute.py \
  --spec specs/incident-spec.md \
  --replay data/replay_*.jsonl \
  --out .artifacts/metric_network.json

python3 scripts/metrics/ci_gate.py \
  --artifact validation.md \
  --metric-network .artifacts/metric_network.json \
  --traces .artifacts/qwen_trace.ndjson

CONTROL: CI_GATE = PASS if (edge_drift <= 0.12 && silent_p0 <= 0.05 && manual_review_rate >= 0.15 && audit_trace_coverage == 1.0) else CI_BLOCK
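The CONTROL rule above translates almost literally into Python. The sketch below is an illustration of that rule only, not the implementation of ci_gate.py; the function name and the edge_drift and audit_trace_coverage inputs are hypothetical. It also returns the violated conditions, in line with the expectation that the gateway lists specific reasons rather than a generic FAIL.

```python
def ci_gate(metrics: dict[str, float]) -> tuple[str, list[str]]:
    """Apply the CONTROL rule and return the verdict with the violated conditions."""
    conditions = {
        "edge_drift <= 0.12": metrics["edge_drift"] <= 0.12,
        "silent_p0 <= 0.05": metrics["silent_p0"] <= 0.05,
        "manual_review_rate >= 0.15": metrics["manual_review_rate"] >= 0.15,
        "audit_trace_coverage == 1.0": metrics["audit_trace_coverage"] == 1.0,
    }
    reasons = [text for text, ok in conditions.items() if not ok]
    return ("CI_BLOCK" if reasons else "PASS"), reasons


# edge_drift and audit_trace_coverage are hypothetical values here;
# silent_p0 and manual_review_rate come from the educational case.
verdict, reasons = ci_gate({
    "edge_drift": 0.08, "silent_p0": 0.18,
    "manual_review_rate": 0.12, "audit_trace_coverage": 1.0,
})
print(verdict, reasons)
```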

Full Track: Threshold Calibration

The "Low / Default / High" table for silent_p0, manual_review_rate, edge_drift, audit_trace_coverage, the exercise on "dangerous" weakening of two defenses at once, and the full metric dependency network are in Appendix D, section D.4. On the first pass, it is enough to see that a bad release is blocked by a guard metric.

Examples and Application

Example: a team wants to prove the readiness of a new triage pipeline for more aggressive auto-closing. First, a replay run is launched with the target optimization MTTR<=5m. Then the same set of incidents is checked through red_button_mttr_blindness.

If the result looks like MTTR=4:50, silent_p0=18%, manual_review_rate=12%, block the release. The reason is not bad speed, but a violation of protective invariants. This is an important distinction: the goal is achieved, but the quality contract is broken.

> [conceptual interface] — scripts/metrics/simulate.py and validate_red_button.py show the form of the emergency mode check; there is no ready-made CLI in the textbook repository. A runnable analog of the same invariants check on educational fixtures — python3 examples/goodhart-validator/scripts/run_validation.py (see examples/goodhart-validator/README.md).

#### Example of running the red button on a replay
python3 scripts/metrics/simulate.py \
  --scenario data/replay_300.jsonl \
  --goal "MTTR<=5m" \
  --spec specs/incident-spec.md

python3 scripts/metrics/validate_red_button.py \
  --artifact validation.md \
  --mode red_button \
  --assert "silent_p0<=5% && manual_review_rate>=15% && audit_trace_coverage==1.0"

CONTROL: red_button = BLOCKED (MTTR=4:50, silent_p0=18%, manual_review_rate=12%)

The second example is erroneous auto-closing of 40 P0s as "false positives." Before the post-mortem, the metric looks clean: incidents are closed quickly, escalations are fewer, the queue is not growing.

After checking against labels, it turns out differently. Five events were real critical failures. They are exactly what should increase silent_p0, escalation_regret, and postmortem_regression.

In validation.md, record such a case as a predictive triage failure. Don't wait for user damage in production. Use replay and post-mortem truth as an early blocking signal.
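The second example reduces to a join between auto-closed incidents and post-mortem labels. The sketch below is an illustration with hypothetical record shapes and arbitrarily chosen incident ids; only the counts, 40 auto-closed incidents and 5 confirmed failures, come from the text.

```python
# Hypothetical replay: 40 incidents auto-closed as "false positive";
# post-mortem labels later confirm 5 of them as real critical failures.
auto_closed = [{"id": i, "close_reason": "false_positive"} for i in range(40)]
postmortem_truth = {i: "real_p0" for i in range(5)}  # ids 0..4, chosen arbitrarily


def silent_p0_ids(closed: list[dict], truth: dict[int, str]) -> list[int]:
    """Return ids of auto-closed incidents that the post-mortem marks as real P0s."""
    return [inc["id"] for inc in closed if truth.get(inc["id"]) == "real_p0"]


missed = silent_p0_ids(auto_closed, postmortem_truth)
share_of_auto_closed = len(missed) / len(auto_closed)
print(f"{len(missed)} of {len(auto_closed)} auto-closed incidents were real P0s "
      f"({share_of_auto_closed:.1%})")
```

Running this join on replay data before release is what turns post-mortem truth into an early blocking signal instead of a retrospective finding.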

In practice, keep validation.md next to the specification and update it only through the same review process as triage rules. CI on every change rebuilds the metric network, runs the replay, checks trace completeness, and compares behavior to the baseline. A threshold change — for example, raising the allowable silent_p0 from 5% to 7% — should be carried out as a risk contract change through a mutable rule from part 3, not as a technical YAML edit. This barrier protects the system from the gradual erosion of invariants under the pressure of convenient reports.
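A review step can detect when a change loosens an invariant so that it is routed through the risk-contract process rather than merged as a plain edit. The sketch below is an assumption about how such a check might look; the threshold names and the `WEAKER_WHEN` mapping are hypothetical and are derived only from the direction of the operators used in this section.

```python
# Direction in which each threshold becomes weaker. This mapping is an
# assumption: caps (<=) weaken when raised, floors (>=) weaken when lowered.
WEAKER_WHEN = {
    "silent_p0_cap": "higher",
    "manual_review_floor": "lower",
    "audit_trace_coverage_min": "lower",
}


def weakened(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """List thresholds whose new value is looser than the old one."""
    result = []
    for name, direction in WEAKER_WHEN.items():
        if direction == "higher" and new[name] > old[name]:
            result.append(name)
        elif direction == "lower" and new[name] < old[name]:
            result.append(name)
    return result


old = {"silent_p0_cap": 0.05, "manual_review_floor": 0.15, "audit_trace_coverage_min": 1.0}
new = {"silent_p0_cap": 0.07, "manual_review_floor": 0.15, "audit_trace_coverage_min": 1.0}
print(weakened(old, new))
```

With the 5% to 7% change from the text, the check reports silent_p0_cap as weakened, which is the signal to require the formal review instead of a quiet YAML edit.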

Summary

Bait metrics are dangerous not because they are false. They are useful until they become the sole optimization goal.

A reliable validation.md solves five tasks:

separates goals from invariants;
blocks MTTR improvements that come with a rise in silent_p0;
requires minimum manual verification;
checks triage behavior drift;
preserves the evidence chain of Qwen logs, decisions, and diffs.

> [runnable] — the minimum smoke run for this chapter is in [examples/goodhart-validator/](examples/goodhart-validator/README.md).

cd book2/examples/goodhart-validator
python3 scripts/run_validation.py \
  --validation specs/validation.yaml \
  --metrics fixtures/new_metrics_good.json

python3 scripts/ci_gate.py \
  --validation specs/validation.yaml \
  --baseline fixtures/baseline_metrics.json \
  --new fixtures/new_metrics_good.json

The example with the expected failure for the red button scenario uses fixtures/new_metrics_bad.json: run_validation.py and ci_gate.py should exit with code 1 because manual_review_floor and silent_p0_cap are violated.

In the next chapter, this protective contour will be connected to real monitoring and deployment APIs.

Artifacts and Readiness Criteria

| Artifact | Ready When |
| --- | --- |
| Local run of `book2/examples/goodhart-validator` | optimization goal is separated from the inviolable invariant |
| Three fixtures: good / bad / drift | new_metrics_good.json passes, new_metrics_bad.json is blocked with a specific reason, new_metrics_drift.json is blocked by compare_drift.py |
| One example of blocking due to a guard metric | MTTR improved, but the release is blocked due to silent_p0 or manual_review_floor |
| Record in capstone/goodhart-note.md | three lines: target metric, paired guard metric, blocking condition |

The full track adds validation.md with a target metric and quality invariants, .artifacts/metric_network.json or a runnable analog of the metric network, a replay set, and a CI gateway report with edge_drift, silent_p0, and audit_trace_coverage. Consider it ready if the emergency mode blocks a release that reaches MTTR<=5m with a rise in silent_p0, the trace fields link the prompt, diff, decision, and post-mortem label, and a threshold change is formalized as a risk contract change, not YAML cosmetics.

Practice

cd book2/examples/goodhart-validator && python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_good.json --json — *expectation: code 0, in JSON "status": "PASS", all three invariants manual_review_floor, silent_p0_cap, audit_trace_required have "ok": true.*
python3 scripts/run_validation.py --validation specs/validation.yaml --metrics fixtures/new_metrics_bad.json --json — *expectation: code 1, in JSON "status": "CI_BLOCK", check red_button_mttr_blindness triggers with "fail": "CI_BLOCK" and violated_invariants: [manual_review_floor, silent_p0_cap]. Likewise python3 scripts/compare_drift.py --baseline fixtures/baseline_metrics.json --new fixtures/new_metrics_drift.json --threshold 0.12 returns code 1 and prints edge_drift=0.18 threshold=0.12 -> FAIL.*
Move three lines to capstone/goodhart-note.md: target metric (MTTR), guard metric (silent_p0 or audit_trace_coverage), red button rule. *Expectation: the next merge attempt with improved MTTR but falling audit_trace_coverage is blocked automatically.*

Control Questions

Why is MTTR dangerous as the sole optimization goal?
How does a quality invariant differ from a KPI?
What trace fields are needed to investigate a Goodhart regression?
After a release, MTTR fell by 30%, but silent_p0 rose from 4% to 12%. What will you do — accept the release, roll it back, or require additional diagnostics? What exactly will you check?