Reading: Applied Part 5. Mutation Testing of Specifications

Status: Frontier. Mutation testing of specifications and the immunity score (a metric vector) are not yet a standardized practice. The principle "one mutant, one expected failure" is a recommendation, not a requirement. Operator sets and thresholds must be configured per project.

For the educational walkthrough, it is enough to run examples/stress-mutator/ and see that one mutant yields one expected failure. Selecting operators, thresholds, and a CI gate is the full production track.

First, the basic concepts. Mutation testing is a technique in which a reference artifact is deliberately corrupted in a controlled way and the test loop is required to catch the defect. The immunity metric is a vector that measures validator robustness and has three components:

strict_reject_rate — the fraction of cases rejected strictly at the expected step;
depth_of_diagnostics — the useful depth of diagnostics before failure;
recovery_time — the time until a stable verdict is returned.
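The three components can be sketched as a small computation over validator results. This is an illustration under assumptions, not the logic of scripts/immunity_score.py: the record keys (`halt_before`, `expected_halt_before`, `stack_route`, `elapsed_ms`), the reading of depth as mean stack-route length, and the nearest-rank percentile are all hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmunityVector:
    strict_reject_rate: float
    depth_of_diagnostics: float
    recovery_time_p95_ms: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def immunity(results):
    """Compute the vector from a list of result records (hypothetical keys)."""
    strict = sum(
        1 for r in results if r["halt_before"] == r["expected_halt_before"]
    )
    # Assumption: "depth" is approximated by the length of the stack route.
    depth = sum(len(r["stack_route"]) for r in results) / len(results)
    return ImmunityVector(
        strict_reject_rate=strict / len(results),
        depth_of_diagnostics=depth,
        recovery_time_p95_ms=p95([r["elapsed_ms"] for r in results]),
    )


# Illustrative data only; the third mutant is rejected too late.
sample = [
    {"halt_before": "When:evaluate_sla_window",
     "expected_halt_before": "When:evaluate_sla_window",
     "stack_route": ["schema.normalize", "step.given.check", "halt"],
     "elapsed_ms": 120},
    {"halt_before": "When:evaluate_sla_window",
     "expected_halt_before": "When:evaluate_sla_window",
     "stack_route": ["schema.normalize", "step.when.prepare", "time.check", "halt"],
     "elapsed_ms": 300},
    {"halt_before": "Then:notify_owner",
     "expected_halt_before": "When:route_escalation",
     "stack_route": ["schema.normalize", "step.when.prepare", "graph.build",
                     "graph.detect_cycle", "halt"],
     "elapsed_ms": 900},
    {"halt_before": "When:route_escalation",
     "expected_halt_before": "When:route_escalation",
     "stack_route": ["schema.normalize", "step.when.prepare", "rules.check", "halt"],
     "elapsed_ms": 200},
]
vector = immunity(sample)
```

Keeping the result as a frozen record with three named fields, rather than a single number, makes it harder to hide a drop in one component behind a gain in another.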

The figurative name "vaccination of validators" refers to ordinary mutation testing of specifications. The validator receives deliberately corrupted inputs and is required to reject them at the expected step.

The boundaries with neighboring mechanisms are as follows. In Chapter 2 you create a single manual defect to learn how to read a symptom. In this chapter you create a series of machine-generated mutants to measure the validator's robustness. In Chapter 4 the Verifier searches for a minimal counterexample to a rule rather than enumerating a catalog of mutation operators. In Chapter 8 the result of such checks can become evidence for a verdict, but file arbitration itself does not replace a mutant generator.

The chapter relies on the discipline of facts from part 9 of the first volume; without it, mutations make no sense. A mutant tests exactly one thing: the fact of failure at the expected Given/When/Then step. The simplest example of this discipline already appeared in the training AgentClinic: an empty review text from part 12 must be rejected. Here the same logic is generalized to a set of mutation operators tied to the catalog of classical errors from part 20, "SDD Antipatterns".

Before Reading

Foundation from the first volume: part 9 introduces verification facts, part 20 — classes of process errors.
Local training case: appointment_latency_spike (minimal incident-payload, used to build base/base_spec.json in the runnable example).
Trace for capstone/: seed, operator list, three immunity metrics, and the verdict as a string in validation.md for high_memory_usage.
Key terms of the first pass: mutation testing (entry to the chapter) and the immunity metric (exit — three vector components). The rest — mutation operators, mutation factory, "vaccination of validators" — are reference terms, opened only when configuring the CI gate.
What to defer: selection of operators, calibration of thresholds, and the mutation CI gate.

Goal

After this chapter, the reader will assemble a generator of degenerate specifications for the auto-incident-management project and configure a validator loop that does three things: rejects absurd cases with precise diagnostics, preserves a chain of evidence in the SDD, and computes the immunity metric before merge. The validator stops being a syntax guard and becomes a tool of anatomical diagnostics: it shows the fact of failure, the field, the Given/When/Then step, the JSON Schema rule, the failure route, and the regression risk. This aligns with the spec-first approach — the contract precedes planning and code implementation (GitHub Spec Kit).

Minimal Training Scenario

Training Case

The production incident appointment_latency_spike (derived from the training feature /agents from book/part-11-second-feature-phase.md): SLA 10 minutes, escalation from appointments_oncall to sre_lead. The Nullify mutation zeroes out severity. The expectation — the validator halts before When:evaluate_sla_window with the code EMPTY_REQUIRED_FIELD, before the SLA is calculated and before the owner is selected.
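The expected behavior in this case can be sketched as a guard that runs before any SLA logic. The payload shape, the `REQUIRED_FIELDS` list, and the `service_id` and `runbook_ref` values are assumptions for illustration; only the diagnostic code and the halt point come from the training case.

```python
# Hypothetical list of fields that must be non-empty before SLA evaluation.
REQUIRED_FIELDS = ("service_id", "severity", "runbook_ref")


def is_empty(value):
    """Treat None, empty strings, and empty lists as 'empty'."""
    return value is None or value == "" or value == []


def check_given(payload):
    """Return a failure record for the first empty required field, else None."""
    for field in REQUIRED_FIELDS:
        if is_empty(payload.get(field)):
            return {
                "diagnostic_code": "EMPTY_REQUIRED_FIELD",
                "field": field,
                "halt_before": "When:evaluate_sla_window",
            }
    return None


# Illustrative base payload; values other than the incident name, SLA,
# and escalation route are invented for the example.
base = {
    "incident": "appointment_latency_spike",
    "service_id": "appointments",
    "severity": "P1",
    "runbook_ref": "runbooks/appointments.md",
    "sla_minutes": 10,
    "escalation": ["appointments_oncall", "sre_lead"],
}

# The Nullify mutant changes exactly one field.
mutant = dict(base, severity=None)
failure = check_given(mutant)
```

Because the guard returns before the SLA window is evaluated, a later step never sees the empty `severity`.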

Preparation

book2/examples/stress-mutator/base/base_spec.json — the correct source.
book2/examples/stress-mutator/expected/expected_failures.json — the expected (diagnostic_code, halt_before) under the by_operator key and the immunity thresholds in thresholds.
book2/examples/stress-mutator/scripts/mutate_specs.py, fake_validator.py, immunity_score.py.
book2/examples/stress-mutator/manifest.example.json — the determinism reference.

Steps

1. cd book2/examples/stress-mutator. *Expectation: you are in the example directory; no additional dependencies are needed.*
2. python3 scripts/mutate_specs.py --base base/base_spec.json --seed 20260517 --operators Nullify,FutureTime,EscalationCycle,PriorityContradiction --out out/mutations. *Expectation: out/mutations/manifest.json is created, along with one JSON file per mutant.*
3. Determinism check: repeat step 2. *Expectation: the list of mutation_id values and their order match the previous run.*

Bad: a single run with no re-run, which makes it impossible to tell a deterministic generator from random noise. Good: two consecutive runs with an identical mutation_id order, so the regression base is reproducible.

4. Compare out/mutations/manifest.json with manifest.example.json via diff. *Expectation: 0 lines of difference.*
5. python3 scripts/fake_validator.py --mutations out/mutations --out out/validator_results.json. *Expectation: for each mutation_id the result contains a pair diagnostic_code + halt_before.*
6. python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected expected/expected_failures.json. *Expectation: strict_reject_rate >= 0.98, depth_of_diagnostics >= 3, recovery_time_p95_ms <= 1200.*
7. For the educational minimum, stop here: the runnable example has proved mutant determinism, expected failures, and immunity computation.

If you have Qwen Code installed and want additional explanation, perform a separate optional step:

qwen -p "Read @out/validator_results.json and @expected/expected_failures.json. Which mutants are rejected not at the expected step? Do not modify the files." --approval-mode plan

This query does not replace the runnable check. Its result can be used as a review comment, but not as the sole fact of readiness.

The full production track adds a separate CI gate. In your project this is usually python3 scripts/ci_gate.py --strict-reject-min 0.98 --diag-depth-min 3 --recover-ms-p95 1200 --fail-on-regression: three thresholds, and any violation blocks the merge. The tutorial has no runnable equivalent for stress-mutator; the closest example in spirit, examples/goodhart-validator/scripts/ci_gate.py, is shown in part 10.

Control Fact

All three metrics from step 6 simultaneously satisfy the thresholds. manifest.json matches manifest.example.json byte-for-byte. If you performed the optional Qwen query, its output must not contradict the runnable facts. Without determinism, expected failures, and a green immunity metric, the educational pipeline is not considered green.

How This Ends Up in `capstone/`

Transfer to capstone/validation.md or a short capstone/README.md only the smoke-run summary: seed, operators, three immunity metrics, and the verdict. Do not transfer the out/mutations directory: it should remain a reproducible local trace, not a reviewable artifact.

Minimal fragment:

stress_run:
  seed: 20260517
  operators: [Nullify, FutureTime, EscalationCycle, PriorityContradiction]
  strict_reject_rate: "1.0 >= 0.98"
  depth_of_diagnostics: "4.0 >= 3"
  recovery_time_p95_ms: "850 <= 1200"
  verdict: PASS

Reviewable Trace

The out/ directory is the result of a local run and is ignored in book2/examples/.gitignore. Do not commit it as an educational artifact and don't make a commit for the sake of a checkmark. For the first pass, a line in capstone/validation.md is enough: seed, operators, three metrics, and verdict.

In your production repository you can keep a short report outputs/immunity.last-run.json if it is created by CI and participates in review. In the educational route, the source of truth remains the reproducible command and the minimal capstone fragment above.

Key Ideas

Divide degenerate incident-process scenarios into four classes: empty fields, temporal anomalies, reversible escalation cycles, and recursive dependencies. Empty fields are not just null: they also include empty strings, empty owner arrays, and a missing severity, service_id, or runbook_ref, that is, any emptiness that makes it impossible to choose a safe action. Temporal anomalies look formally correct: the ISO timestamp exists, but response_timestamp turns out to be earlier than event_received_at or later than the agreed-upon now. Reversible escalation cycles and recursive dependencies are more dangerous than ordinary omissions, because they can send the execution loop into endless redefinition of the owner, priority, or next action.

Next concept: the mutation factory. It is not a random noise generator but a deterministic mutator built on top of the correct base_spec.json. The base specification is parsed into an abstract syntax tree (AST) with explicit Given/When/Then nodes, an SLA matrix, escalation rules, and JSON Schema fragments. The following operators are then applied to it:

Nullify — zeroing out a field;
FutureTime — shifting a timestamp into the future;
EscalationCycle — adding a reverse edge to the escalation graph;
PriorityContradiction — introducing mutually contradictory priority rules.

Future extensions will add RecursiveDependency for indirect recursion between computed fields.

The principle "one mutant — one expected failure" is the main rule of the factory. Let's show the contrast.

Bad:

> one mutant simultaneously zeros out service_id, reverses the escalation graph, and inverts priorities; expected_failure is not specified.

Problem: on failure you cannot localize the cause. The validator may stop on any of the three defects; the regression is bound to a composite artifact.

Good:

> a single Nullify mutator zeroes out only severity; expected_failure.code = EMPTY_REQUIRED_FIELD, halt_before = When:evaluate_sla_window.

Each run receives a fixed seed. The same input produces the same list of mutation_id in a stable order. This is critical for the Verifier/Implementer duel: a disputed case can be reproduced, given to both roles, and checked for who actually violated the contract.
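The determinism requirement above can be sketched with a factory that derives each mutation_id from the seed, the operator, and the target, and applies exactly one change per mutant. The hashing scheme is an assumption made to imitate the shape of the IDs in this chapter; it is not how scripts/mutate_specs.py derives them, so the digests will not match the real manifest.

```python
import copy
import hashlib


def nullify(spec, field):
    """Return a copy of spec with a single field set to None."""
    mutant = copy.deepcopy(spec)
    mutant[field] = None
    return mutant


def mutation_id(seed, operator, target):
    """Stable ID: same (seed, operator, target) always gives the same string."""
    digest = hashlib.sha256(f"{seed}:{operator}:{target}".encode()).hexdigest()
    return f"m_{seed}_{operator.lower()}_{digest[:10]}"


def build_mutants(base_spec, seed, plan):
    """plan is an ordered list of (operator_name, target_field, function).

    Order of the manifest follows the order of the plan, so it is stable.
    """
    manifest = []
    for name, target, fn in plan:
        manifest.append({
            "mutation_id": mutation_id(seed, name, target),
            "operator": name,
            "target": target,
            "spec": fn(base_spec, target),  # one mutant, one change
        })
    return manifest


# Illustrative base fragment and plan (hypothetical values).
BASE = {"service_id": "appointments", "severity": "P1"}
PLAN = [
    ("Nullify", "severity", nullify),
    ("Nullify", "service_id", nullify),
]

first = build_mutants(BASE, 20260517, PLAN)
second = build_mutants(BASE, 20260517, PLAN)
```

No random number generator is involved here: with a fixed plan, the seed only namespaces the IDs, which is enough to make two runs comparable line by line.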

> [runnable] — a minimal implementation of this interface is in examples/stress-mutator/README.md.

cd book2/examples/stress-mutator

python3 scripts/mutate_specs.py \
  --base base/base_spec.json \
  --seed 20260517 \
  --operators Nullify,FutureTime,EscalationCycle,PriorityContradiction \
  --out out/mutations

python3 scripts/fake_validator.py \
  --mutations out/mutations \
  --out out/validator_results.json
# CONTROL: re-running with the same seed must produce the same mutation_id list in the same order

Combinatorial explosion appears already at depth 2–3. Give the generator a selection policy, not a full enumeration: at least one mutant per class (required field, time window, escalation graph, recursive dependency, priority conflict). Tie the priority of operators to incident history: if post-mortems more often show erroneous time windows, give FutureTime and NegativeLag greater weight in the queue. Directed fuzzing tests historically fragile contract points, instead of spending token budget on uniform chaos.
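One possible shape for such a selection policy is sketched below: it first guarantees one mutant per class, then fills the remaining budget with weighted draws from a seeded generator. The class names, candidate labels, and weights are hypothetical, and the budget is treated as a soft limit that the per-class minimum may exceed.

```python
import random


def select_mutants(candidates, weights, budget, seed):
    """Pick mutants deterministically.

    candidates: dict class_name -> list of mutant labels
    weights:    dict class_name -> positive weight (default 1.0)
    """
    rng = random.Random(seed)
    # Sorted copies keep the result independent of input ordering.
    pools = {cls: sorted(ids) for cls, ids in sorted(candidates.items()) if ids}
    chosen = []

    # Coverage first: at least one mutant per class.
    for ids in pools.values():
        pick = rng.choice(ids)
        ids.remove(pick)
        chosen.append(pick)

    # Then spend the remaining budget according to incident-history weights.
    while len(chosen) < budget:
        open_classes = [cls for cls, ids in pools.items() if ids]
        if not open_classes:
            break
        cls = rng.choices(
            open_classes, weights=[weights.get(c, 1.0) for c in open_classes]
        )[0]
        pick = rng.choice(pools[cls])
        pools[cls].remove(pick)
        chosen.append(pick)
    return chosen


# Hypothetical candidate catalog; time-window mutants get extra weight.
CANDIDATES = {
    "required_field": ["nullify_severity", "nullify_service_id"],
    "time_window": ["futuretime_response", "negativelag_response",
                    "futuretime_detected"],
    "escalation_graph": ["cycle_reverse_edge"],
    "recursive_dependency": ["recursive_owner"],
    "priority_conflict": ["priority_p1_p2"],
}
WEIGHTS = {"time_window": 3.0}

picked = select_mutants(CANDIDATES, WEIGHTS, budget=6, seed=20260517)
```

Using a private `random.Random(seed)` instance instead of the module-level functions keeps the selection reproducible even if other code in the process also draws random numbers.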

flowchart TD
A[File base_spec.json] --> B[AST normalizer]
B --> C[Mutation factory]
C --> C1[Nullify]
C --> C2[FutureTime]
C --> C3[EscalationCycle]
C --> C4[PriorityContradiction]

C1 --> D[Verifier/Implementer duel with Given/When/Then step binding]
C2 --> D
C3 --> D
C4 --> D
D --> E[Diagnostics and stack route]
E --> F[mutation_id and validation.md]
F --> G[CI gate]

Bind each mutant to a specific Given/When/Then step and a specific JSON Schema rule. Otherwise the diagnostics will remain too general to fix. The bindings must be explicit: the Nullify(service_id) mutation belongs to Given:incident_received and the rule required.service_id, while the FutureTime(response_timestamp) mutation belongs to When:evaluate_sla_window and the constraint format + maximum(now).
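Explicit bindings can be kept as a lookup table that fails loudly when a mutant has no entry. The table layout and the `bind` helper are hypothetical; the two entries repeat the bindings named in the paragraph above.

```python
# (operator, target field) -> (Given/When/Then step, JSON Schema rule)
BINDINGS = {
    ("Nullify", "service_id"): (
        "Given:incident_received", "required.service_id"),
    ("FutureTime", "response_timestamp"): (
        "When:evaluate_sla_window", "format + maximum(now)"),
}


def bind(mutant):
    """Attach step and rule to a mutant; refuse mutants without a binding."""
    key = (mutant["operator"], mutant["target"])
    if key not in BINDINGS:
        raise KeyError(f"unbound mutant: {key}")
    step, rule = BINDINGS[key]
    return {**mutant, "target_step": step, "json_schema_rule": rule}


bound = bind({"operator": "Nullify", "target": "service_id"})
```

Raising on a missing binding, instead of falling back to a generic step, keeps an unbound mutant from silently producing diagnostics that are too general to act on.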

If a mutant breaks Then:notify_primary_owner, the report should show the essence of the problem. The issue is not the notification as an action. The issue is the impossibility of computing a valid owner after the route has been corrupted. This tracing shortens manual debugging: the engineer sees the stuck point, not just the final VALIDATION_FAILED.

{
  "mutation_id": "m_20260517_0009",
  "operator": "EscalationCycle",
  "target_step": "When:route_escalation",
  "json_schema_rule": "$defs.escalation_graph.no_cycles",
  "failed_step": "Verifier::GraphCheck::Escalation",
  "stack_route": [
    "schema.normalize",
    "step.when.prepare",
    "graph.build",
    "graph.detect_cycle",
    "halt"
  ]
}

Cycle diagnostics require a separate graph pass, because JSON Schema validates the form of data well but does not always express the topological behavior of the route. For EscalationCycle the validator builds a directed graph of owners or queues and runs depth-first search (DFS) with white/gray/black states. Reaching a gray node again signals a back edge, and the path between the two visits is the cycle to report, e.g. primary_oncall → sre_lead → primary_oncall.
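A compact sketch of the three-color DFS follows. The graph representation (a dict of successor lists) and the function name are assumptions; the sketch reports the first cycle the traversal encounters, which is not guaranteed to be the shortest one in a larger graph.

```python
WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycle(graph):
    """Return a cycle as a list of nodes (first == last), or None.

    graph: dict node -> list of successor nodes.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    color = {node: WHITE for node in nodes}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge: slice the current path from the first visit.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    # Sorted start order keeps the reported cycle stable between runs.
    for node in sorted(nodes):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cyclic = {"primary_oncall": ["sre_lead"], "sre_lead": ["primary_oncall"]}
acyclic = {"edge_oncall": ["traffic_sre"]}
```

For very deep escalation graphs an iterative version with an explicit stack would avoid Python's recursion limit; the recursive form is used here only because it is shorter to read.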

A similar check is used for reversible priority transitions. If one rule downgrades P1 to P2 and another rule returns P2 to P1 without a tie-breaker, the validator must stop before the execution phase. The diagnostic code must distinguish CYCLE_ESCALATION from PRIORITY_REVERSAL: the first is fixed in the route graph, the second by a conflict-resolution policy.

Check temporal anomalies before routing. Incorrect time distorts the SLA, severity, and choice of response channel. Give the validator at least three anchors — event_detected_at, event_received_at, the agreed-upon now from a controlled time source — and a max_reaction_lag policy. Accordingly, a failure receives one of three codes: INVALID_TIME_ANCHOR (if response_timestamp is in the future — a problem in the input payload), NEGATIVE_RESPONSE_LAG (negative response lag — a problem in time normalization), or STALE_INCIDENT_WINDOW (the event is older than the allowed window — a problem in the SLA rule). Different codes matter for the SDD log: they show exactly where the contract is weakened.
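The three codes can be sketched as an ordered series of checks against a controlled `now`. The check order, the use of `event_received_at` as the staleness reference, and the payload keys are assumptions; `event_detected_at` is left out to keep the example short.

```python
from datetime import datetime, timedelta, timezone


def classify_time(payload, now, max_reaction_lag):
    """Return a diagnostic code for a temporal anomaly, or None."""
    received = datetime.fromisoformat(payload["event_received_at"])
    response = datetime.fromisoformat(payload["response_timestamp"])

    if response > now:
        return "INVALID_TIME_ANCHOR"      # timestamp from the future
    if response < received:
        return "NEGATIVE_RESPONSE_LAG"    # response precedes the event
    if now - received > max_reaction_lag:
        return "STALE_INCIDENT_WINDOW"    # event older than allowed window
    return None


# `now` comes from a controlled source, never from the payload itself.
NOW = datetime(2026, 5, 17, 10, 5, tzinfo=timezone.utc)
LAG = timedelta(minutes=10)

ok = {"event_received_at": "2026-05-17T10:00:00+00:00",
      "response_timestamp": "2026-05-17T10:03:00+00:00"}
future = dict(ok, response_timestamp="2026-05-17T10:30:00+00:00")
negative = dict(ok, response_timestamp="2026-05-17T09:59:00+00:00")
stale = {"event_received_at": "2026-05-17T09:00:00+00:00",
         "response_timestamp": "2026-05-17T09:05:00+00:00"}
```

All timestamps carry an explicit UTC offset so that comparisons are between timezone-aware values; mixing aware and naive datetimes would raise a TypeError instead of producing a diagnostic code.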

Recursive dependencies differ from cycles in that they may not look like a short loop in the graph. A typical chain: owner is computed from priority, priority depends on blast_radius, blast_radius requests owner_group, and owner_group in turn requires the already-computed owner.

For such cases, set an expansion limit, e.g. max_resolution_depth = 8. Save a trace of dependency-resolution attempts. If the limit is exceeded, the validator returns RECURSION_LIMIT together with the chain of fields, rather than masking the problem as a timeout. This protects the LLM executor from endless clarification of conditions and makes cascades of failure observable.
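A minimal sketch of the expansion limit, assuming each computed field depends on at most one other field (real dependency graphs may branch). The `deps` mapping and the result shape are hypothetical; the chain is the example from the previous paragraph.

```python
def resolve(field, deps, max_resolution_depth=8):
    """Follow dependencies from `field`, recording every step.

    deps: dict field -> the field it depends on; a missing key means
    the field can be computed directly.
    """
    trace = []
    current = field
    while current is not None:
        trace.append(current)
        if len(trace) > max_resolution_depth:
            # Report the chain itself instead of letting it look like a timeout.
            return {"diagnostic_code": "RECURSION_LIMIT", "chain": trace}
        current = deps.get(current)
    return {"diagnostic_code": None, "chain": trace}


recursive_deps = {
    "owner": "priority",
    "priority": "blast_radius",
    "blast_radius": "owner_group",
    "owner_group": "owner",
}
flat_deps = {"owner": "priority"}

looped = resolve("owner", recursive_deps)
resolved = resolve("owner", flat_deps)
```

The returned chain shows the repetition directly (`owner` appears again at the fifth position), which gives a reviewer the fields to untangle rather than a bare limit error.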

Now about the immunity metric (the vector components are at the beginning of the chapter). Introduce it as a vector, not as a single final score. If strict_reject_rate grows but depth_of_diagnostics drops to one, the loop has become stricter but blinder. If recovery_time_p95_ms goes over the limit, even a correct validator starts slowing down CI and provokes workarounds.

Build the CI blocking on immunity thresholds and regression comparison with the previous pass. For the educational loop, start with the following values:

strict_reject_rate >= 0.98,
depth_of_diagnostics >= 3,
recovery_time_p95_ms <= 1200.

Then calibrate the values against the actual load and number of mutants.

Merge is blocked if the new change does at least one of the following:

misses an old mutation_id,
degrades diagnostic depth,
exceeds the recovery-time limit.

Such a gate protects not only the JSON Schema, but the entire validator loop: the normalizer, graph checks, Given/When/Then rules, and the report format.
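The blocking rules above can be sketched as a function that returns a list of violations, where any non-empty list blocks the merge. The report shape (`rejected_ids` as a set of correctly rejected mutants) and the placeholder IDs are assumptions; the threshold values are the educational defaults from this chapter.

```python
THRESHOLDS = {
    "strict_reject_min": 0.98,
    "diag_depth_min": 3,
    "recover_ms_p95_max": 1200,
}


def gate(current, previous, thresholds=THRESHOLDS):
    """Compare the current run against thresholds and the previous run."""
    violations = []
    if current["strict_reject_rate"] < thresholds["strict_reject_min"]:
        violations.append("strict_reject_rate below threshold")
    if current["depth_of_diagnostics"] < thresholds["diag_depth_min"]:
        violations.append("depth_of_diagnostics below threshold")
    if current["recovery_time_p95_ms"] > thresholds["recover_ms_p95_max"]:
        violations.append("recovery_time_p95_ms above limit")

    # Regression checks against the previous pass.
    missed = sorted(previous["rejected_ids"] - current["rejected_ids"])
    if missed:
        violations.append("missed old mutation_id: " + ", ".join(missed))
    if current["depth_of_diagnostics"] < previous["depth_of_diagnostics"]:
        violations.append("diagnostic depth degraded")
    return violations


# Placeholder IDs, not real manifest entries.
previous_run = {
    "strict_reject_rate": 1.0, "depth_of_diagnostics": 4.0,
    "recovery_time_p95_ms": 850, "rejected_ids": {"m_a", "m_b", "m_c"},
}
good_run = dict(previous_run)
bad_run = {
    "strict_reject_rate": 0.99, "depth_of_diagnostics": 3.5,
    "recovery_time_p95_ms": 1300, "rejected_ids": {"m_a", "m_c"},
}
```

Note that `bad_run` still clears the strictness and depth thresholds; it is blocked only by the recovery-time limit and by the two regression checks, which is the case a thresholds-only gate would miss.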

> [runnable] — the command below corresponds to book2/examples/stress-mutator.

cd book2/examples/stress-mutator

python3 scripts/immunity_score.py \
  --validator-results out/validator_results.json \
  --expected expected/expected_failures.json

In your project this gate usually looks like python3 scripts/ci_gate.py --strict-reject-min 0.98 --diag-depth-min 3 --recover-ms-p95 1200 --fail-on-regression. The tutorial has no ready-made script for stress-mutator; the idea "one failed threshold blocks the merge" is preserved in the example closest in form, examples/goodhart-validator/scripts/ci_gate.py (part 10).

Record the run results in the SDD as a chain of evidence, not as a one-off test log: mutation_id, spec diff, original and mutated fragments, rejection log, diagnostic code, stack_route, link to the JSON Schema rule, and the final entry in validation.md. For review it is especially useful to keep expected_failure and actual_failure: if they diverge, the validator may be rejecting the case randomly or too late. This structure turns the mutation catalog into a catalog of precedents, where each new rule is tied to a specific blind spot and verifiable grounds.
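Comparing expected_failure with actual_failure can be sketched as a per-mutant diff that produces one record per divergence. The dictionary layout (keyed by mutation_id, with `code` and `halt_before` fields) is an assumption modeled on the examples in this chapter, not the schema of expected/expected_failures.json.

```python
def compare_failures(expected, actual):
    """Return a list of divergence records, empty if everything matches.

    expected / actual: dict mutation_id -> {"code": ..., "halt_before": ...}
    """
    divergences = []
    for mutation_id, exp in sorted(expected.items()):
        act = actual.get(mutation_id)
        if act is None:
            divergences.append(
                {"mutation_id": mutation_id, "reason": "no validator result"})
            continue
        for key in ("code", "halt_before"):
            if act.get(key) != exp[key]:
                divergences.append({
                    "mutation_id": mutation_id,
                    "field": key,
                    "expected": exp[key],
                    "actual": act.get(key),
                })
    return divergences


expected = {
    "m_20260517_nullify_855e4297f7": {
        "code": "EMPTY_REQUIRED_FIELD",
        "halt_before": "When:evaluate_sla_window",
    },
}
# Hypothetical late rejection: right code, wrong step.
late = {
    "m_20260517_nullify_855e4297f7": {
        "code": "EMPTY_REQUIRED_FIELD",
        "halt_before": "Then:notify_owner",
    },
}
report = compare_failures(expected, late)
```

A record with the right code but the wrong `halt_before` is exactly the "rejected too late" case: the mutant counts as rejected, yet the defect travelled further than the contract allows.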

Full Track: Threshold Calibration

The "Low / Default / High" table for strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms, and the number of mutants per class, the threshold-shift exercise, and signals for revision are in Appendix D, section D.1. On the first pass, the section is not needed.

Examples and Application

Example: a correct specification describes the appointment_latency_spike incident. The SLA requires response within 10 minutes. The escalation route goes from appointments_oncall to sre_lead.

The mutator creates m_20260517_nullify_855e4297f7. In it, the severity field is replaced with an empty string. The mutant is bound to Given:incident_received and the rule severity.minLength. The expected failure is EMPTY_REQUIRED_FIELD. The pipeline must halt before When:evaluate_sla_window, before the SLA is calculated and before the owner is selected.

If instead the validator reaches Then:notify_owner, the empty severity field has leaked too deep and may produce a false notification about an unclassified incident.

{
  "mutation_id": "m_20260517_nullify_855e4297f7",
  "base_case": "appointment_latency_spike",
  "operator": "Nullify",
  "target_step": "Given:incident_received",
  "json_schema_rule": "$.properties.severity.minLength",
  "diff_spec": {
    "before": { "severity": "P1" },
    "after": { "severity": "" }
  },
  "expected_failure": {
    "code": "EMPTY_REQUIRED_FIELD",
    "halt_before": "When:evaluate_sla_window"
  }
}

The second example checks the escalation graph for the cdn_error_budget_burn incident. The owner edge_oncall passes P1 to traffic_sre. The mutator adds a reverse edge traffic_sre → edge_oncall.

The Verifier must return CYCLE_ESCALATION, show the minimal cycle, and bind the failure to When:route_escalation. The Implementer, in turn, must not propose a workaround like "pick the first owner from the list." After the fix lands in JSON Schema or in an additional graph rule, the same mutation_id is run again to prove that the patch closes precisely the defect that was found.

The entry in validation.md must include the diff, the verdict, the recovery time, and a link to the CI run. Otherwise the decision will be impossible to verify on the next route change.

Summary

The stress-specification generator turns validator checking into a managed engineering cycle: it classifies degenerate scenarios, creates reproducible mutations, binds each failure to a Given/When/Then step and a JSON Schema rule, measures immunity through the three vector components, and preserves evidence in the SDD via mutation_id, spec diffs, rejection log, and validation.md. Such a loop turns absurd cases into a regression set against future toxic requirements and hidden failure cascades. The next chapter moves on to the auction of shadow specifications.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| `base/base_spec.json` | describes a correct incident scenario from which mutations will be built |
| Local `out/mutations/` (4 mutants) | re-running with the same `seed` produces the same `mutation_id` order; the directory is not committed |
| `out/validator_results.json` | each mutant has a bound Given/When/Then step and JSON Schema rule; has `diagnostic_code`, `halt_before`, depth |
| Minimal immunity report | the three vector components are filled in: `strict_reject_rate`, `depth_of_diagnostics`, `recovery_time_p95_ms`; the runnable example passes the smoke run |

The full track adds expected/expected_failures.json as a regression base for CI, a short reviewable report or entry in validation.md, and a CI gate that compares the new run against the old mutation_id. Consider it ready if the validator stops cycles and temporal anomalies before the execution phase, and CI blocks regression against at least one old mutation_id.

Practice

cd book2/examples/stress-mutator && python3 scripts/mutate_specs.py --base base/base_spec.json --seed 20260517 --out out/mutations — *expectation: in out/mutations/ exactly 4 files with mutation_id m_20260517_nullify_855e4297f7, m_20260517_futuretime_…, m_20260517_escalationcycle_…, m_20260517_prioritycontradiction_…; diff out/mutations/manifest.json manifest.example.json gives 0 lines of difference.*
python3 scripts/fake_validator.py --mutations out/mutations --out out/validator_results.json && python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected expected/expected_failures.json --out out/immunity.json — *expectation: strict_reject_rate >= 0.98, depth_of_diagnostics >= 3, recovery_time_p95_ms <= 1200.*
Transfer to capstone/validation.md a single line: "immunity (seed=20260517): rejected <n>/4 mutants at the expected step; failure — <mutation_id>, needs an additional guard". *Expectation: on the next regression the comparison is against the fixed seed, not against "everything green".*

Review Questions

Why is JSON Schema insufficient for checking cycles and recursive dependencies?
What does strict_reject_rate show, and what does it hide?
When does the validator's growing strictness become harmful?
The validator passed a smoke run with 50 mutants and showed strict_reject_rate=0.95, depth_of_diagnostics=2.4, recovery_time_p95_ms=900. Suppose all three scalars fall within the thresholds configured for that project (which are looser than the educational values in this chapter). Name at least one scenario in which this run should still be considered a failure, and state which additional manifest.json fields need to be checked to make such a failure visible to the next reviewer.