Applied Part 5. Mutation Testing of Specifications
Status: Frontier. Mutation testing of specifications and the immunity metric vector are not yet a standardized practice. The "one mutant, one expected failure" idea is a recommendation. Operator sets and thresholds need to be tuned per project.
For the educational walkthrough, it is sufficient to run examples/stress-mutator/ and see that one mutant yields one expected failure. Selecting operators, thresholds, and the CI gate is the full production track.
Let us introduce basic concepts. Mutation testing is a technique where a reference artifact is deliberately "corrupted" in a controlled way, and the test harness is required to catch that defect. The immunity metric is a vector measure of validator resilience with three components:
- `strict_reject_rate`: proportion of cases rejected strictly at the expected step;
- `depth_of_diagnostics`: useful diagnostic depth before failure;
- `recovery_time`: time until a stable verdict is returned.
The figurative name "validator vaccination" means ordinary mutation testing of specifications. The validator receives deliberately corrupted inputs and must reject them at the expected step.
The boundary with neighboring mechanisms is as follows. In Chapter 2 you create one manual defect to learn to read symptoms. In this chapter you create a series of machine mutants to measure validator resilience. In Chapter 4 the Verifier searches for a minimal counterexample to a rule, rather than iterating through a catalog of mutation operators. In Chapter 8 the result of such checks may become evidence for a verdict, but the file arbitration itself does not replace the mutant generator.
This chapter relies on the discipline of facts from Part 9 of the first volume. Without it, mutations have no meaning. A mutant checks precisely the fact of failure at the expected Given/When/Then step. The simplest example of this discipline was already encountered in the educational AgentClinic: an empty review text from Part 12 must be rejected. Here the same logic is generalized to a set of mutation operators tied to the catalog of classic errors from Part 20, "SDD Antipatterns".
Before Reading
- Foundation from the first volume: Part 9 introduces verification facts, Part 20 introduces process error classes.
- Local educational case: `appointment_latency_spike` (the minimal incident payload on which `base/base_spec.json` in the runnable example is built).
- Trace for `capstone/`: seed, list of operators, three immunity metrics, and verdict as a line in `validation.md` for `high_memory_usage`.
- Key terms for the first pass: mutation testing (entry to the chapter) and immunity metric (exit: three vector components). The rest (mutation operators, mutation factory, "validator vaccination") is reference material, opened only when configuring the CI gate.
- What to defer: operator selection, threshold calibration, and mutation CI gate.
Goal
After this chapter the reader will assemble a generator of degenerate specifications for an auto-remediation incident management project and configure a validator harness that does three things: discards absurd cases with precise diagnostics, preserves the evidence chain in SDD, and computes the immunity metric before merge. The validator ceases to be a syntax guard and becomes a tool of anatomical diagnostics: it shows the failure fact, field, Given/When/Then step, JSON Schema rule, failure route, and regression risk. This aligns with the "spec-first" approach — the contract precedes code planning and implementation (GitHub Spec Kit).
Minimal Educational Scenario
Educational Case
Production incident appointment_latency_spike (derived from the educational feature /agents from book/part-11-second-feature-phase.md): SLA 10 minutes, escalation from appointments_oncall to sre_lead. The Nullify mutation zeroes out severity. Expectation — the validator stops before When:evaluate_sla_window with code EMPTY_REQUIRED_FIELD, before SLA calculation and before owner selection.
Preparation
- `book2/examples/stress-mutator/base/base_spec.json`: correct source.
- `book2/examples/stress-mutator/expected/expected_failures.json`: expected `(diagnostic_code, halt_before)` under key `by_operator` and immunity thresholds in `thresholds`.
- `book2/examples/stress-mutator/scripts/mutate_specs.py`, `fake_validator.py`, `immunity_score.py`.
- `book2/examples/stress-mutator/manifest.example.json`: determinism reference.
Steps
1. `cd book2/examples/stress-mutator`. Expectation: you are in the example directory, no additional dependencies.
2. `python3 scripts/mutate_specs.py --base base/base_spec.json --seed 20260517 --operators Nullify,FutureTime,EscalationCycle,PriorityContradiction --out out/mutations`. *Expectation: `out/mutations/manifest.json` is created, plus one JSON file per mutant.*
3. Determinism control: repeat step 2. *Expectation: the list of `mutation_id` and its order match the previous run.*

   Bad: one run without a repeat; it is impossible to distinguish a deterministic generator from random noise. Good: two consecutive runs, same `mutation_id` order, the regression base is reproducible.
4. Compare `out/mutations/manifest.json` with `manifest.example.json` via `diff`. Expectation: 0 lines of difference.
5. `python3 scripts/fake_validator.py --mutations out/mutations --out out/validator_results.json`. *Expectation: for each `mutation_id` the result contains a `diagnostic_code` + `halt_before` pair.*
6. `python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected expected/expected_failures.json`. *Expectation: `strict_reject_rate >= 0.98`, `depth_of_diagnostics >= 3`, `recovery_time_p95_ms <= 1200`.*
7. For the educational minimum, stop here: the runnable example proved mutant determinism, expected failures, and immunity calculation.
If you have Qwen Code installed and want to get additional explanation, perform a separate optional step:
qwen -p "Read @out/validator_results.json and @expected/expected_failures.json. Which mutants are rejected at a non-expected step? Do not modify the files." --approval-mode plan
This query does not replace the runnable check. Its result can be used as a review comment, but not as the sole fact of readiness.
The full production track adds a separate CI gate. In your own project this is usually python3 scripts/ci_gate.py --strict-reject-min 0.98 --diag-depth-min 3 --recover-ms-p95 1200 --fail-on-regression: three thresholds, and any violation blocks merge. The textbook has no runnable equivalent specifically for stress-mutator; the closest one in concept, examples/goodhart-validator/scripts/ci_gate.py, is shown in Part 10.
Control Fact
The three metrics from step 6 simultaneously satisfy thresholds. manifest.json matches manifest.example.json bit-for-bit. If you performed the optional Qwen query, its output must not contradict runnable facts. Without determinism, expected failures, and a green immunity metric the educational pipeline is not considered green.
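The determinism half of this control fact can be sketched as a comparison of ordered `mutation_id` lists. The chapter does not show the layout of `manifest.json`, so the top-level `mutations` list and the helper names `mutation_ids` and `same_mutation_order` are assumptions made for illustration only.

```python
import json
from pathlib import Path


def mutation_ids(manifest: dict) -> list:
    """Return mutation_id values in manifest order."""
    # Assumption: the manifest has a top-level "mutations" list
    # whose entries each carry a "mutation_id" key.
    return [entry["mutation_id"] for entry in manifest["mutations"]]


def same_mutation_order(first: dict, second: dict) -> bool:
    """True only if both runs list the same ids in the same order."""
    return mutation_ids(first) == mutation_ids(second)


def load_manifest(path: str) -> dict:
    """Read a manifest file from disk (not called in this sketch)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# In-memory stand-ins for manifests of consecutive runs.
run_a = {"mutations": [{"mutation_id": "m_a"}, {"mutation_id": "m_b"}]}
run_b = {"mutations": [{"mutation_id": "m_a"}, {"mutation_id": "m_b"}]}
run_c = {"mutations": [{"mutation_id": "m_b"}, {"mutation_id": "m_a"}]}

print(same_mutation_order(run_a, run_b))  # same ids, same order
print(same_mutation_order(run_a, run_c))  # same ids, different order
```

A set comparison would accept `run_c`; comparing lists is what makes a changed order count as a determinism failure.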
How This Gets into capstone/
Transfer to capstone/validation.md or a short capstone/README.md only the smoke run summary: seed, operators, three immunity metrics, and verdict. Do not transfer the out/mutations directory: it must remain a reproducible local trace, not a reviewable artifact.
Minimal fragment:
stress_run:
  seed: 20260517
  operators: [Nullify, FutureTime, EscalationCycle, PriorityContradiction]
  strict_reject_rate: "1.0 >= 0.98"
  depth_of_diagnostics: "4.0 >= 3"
  recovery_time_p95_ms: "850 <= 1200"
  verdict: PASS
Reviewable Trace
The out/ directory is a local run result and is ignored in book2/examples/.gitignore. Do not commit it as an educational artifact and do not commit for the sake of a checkmark. For the first pass a line in capstone/validation.md is sufficient: seed, operators, three metrics, and verdict.
In your own production repository you may store a short report outputs/immunity.last-run.json if it is created by CI and participates in review. In the educational route the source of truth remains the reproducible command and the minimal capstone fragment above.
Key Ideas
Divide degenerate incident process scenarios into four classes: empty fields, temporal anomalies, reversible escalation cycles, and recursive dependencies. Empty fields are not just null: they also include empty strings, empty owner arrays, and a missing severity, service_id, or runbook_ref, that is, any emptiness that makes it impossible to choose a safe action. Temporal anomalies look formally correct: there is an ISO timestamp, but response_timestamp turns out to be earlier than event_received_at or later than the agreed now. Reversible escalation cycles and recursive dependencies are more dangerous than ordinary omissions, because they can send the execution harness into endless redefinition of owner, priority, or next action.
Let us introduce one more concept. A mutation factory is not a random noise generator but a deterministic mutator over a correct base_spec.json. The base specification is parsed into an abstract syntax tree (AST) with explicit Given/When/Then nodes, an SLA matrix, escalation rules, and JSON Schema fragments. Operators are then applied to it:
- `Nullify`: zeroing out a field;
- `FutureTime`: shifting a timestamp into the future;
- `EscalationCycle`: adding a reverse edge to the escalation graph;
- `PriorityContradiction`: introducing mutually contradictory priority rules.
Future extensions will add RecursiveDependency for indirect recursion between computed fields.
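As a rough illustration of what an operator does, two of them can be sketched as pure functions that copy the base specification and change exactly one field. The flat dictionary layout, the function names, and the sample values are assumptions; the real `base_spec.json` and `mutate_specs.py` may be organized differently.

```python
import copy
from datetime import datetime, timedelta


def nullify(spec: dict, field: str) -> dict:
    """Return a copy of spec with exactly one field emptied."""
    mutant = copy.deepcopy(spec)
    mutant[field] = ""
    return mutant


def future_time(spec: dict, field: str, shift: timedelta) -> dict:
    """Return a copy of spec with one ISO timestamp shifted forward."""
    mutant = copy.deepcopy(spec)
    moved = datetime.fromisoformat(spec[field]) + shift
    mutant[field] = moved.isoformat()
    return mutant


def changed_fields(base: dict, mutant: dict) -> list:
    """List the fields whose values differ between base and mutant."""
    return sorted(key for key in base if base[key] != mutant.get(key))


# Hypothetical flat base specification.
base = {"severity": "P1", "response_timestamp": "2026-05-17T10:05:00+00:00"}
m1 = nullify(base, "severity")
m2 = future_time(base, "response_timestamp", timedelta(days=1))

print(changed_fields(base, m1))
print(changed_fields(base, m2))
```

Because each function touches one field and never edits the base in place, a failure can be traced to a single change.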
The principle "one mutant — one expected failure" is the main rule of the factory. Let us show the contrast.
Bad:
> one mutant simultaneously zeroes out service_id, reverses the escalation graph, and inverts priorities; expected_failure is not set.
Problem: on failure the cause cannot be localized. The validator may stop on any of the three defects, regression is tied to a composite artifact.
Good:
> one mutator Nullify zeroes out only severity; expected_failure.code = EMPTY_REQUIRED_FIELD, halt_before = When:evaluate_sla_window.
Each run gets a fixed seed. The same input produces the same list of mutation_id values in a stable order. This is critical for the Verifier/Implementor duel: a disputed case can be reproduced, handed to both roles, and used to determine which one violated the contract.
> [runnable] — the minimal implementation of this interface is in examples/stress-mutator/README.md.
cd book2/examples/stress-mutator
python3 scripts/mutate_specs.py \
--base base/base_spec.json \
--seed 20260517 \
--operators Nullify,FutureTime,EscalationCycle,PriorityContradiction \
--out out/mutations
python3 scripts/fake_validator.py \
--mutations out/mutations \
--out out/validator_results.json
#### CONTROL: repeated run with the same seed must produce the same mutation_id list and same order
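One way to satisfy the control above is to derive each `mutation_id` purely from the run inputs. The sketch below hashes the seed, operator, and target field. This scheme is an assumption for illustration: it is not claimed to be how `mutate_specs.py` builds its identifiers, and it will not reproduce the ids shown in this chapter.

```python
import hashlib
import json


def mutation_id(seed: int, operator: str, target: str) -> str:
    """Derive an id from the inputs only, so reruns reproduce it."""
    payload = json.dumps(
        {"seed": seed, "operator": operator, "target": target},
        sort_keys=True,  # canonical key order keeps the hash stable
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]
    return f"m_{seed}_{operator.lower()}_{digest}"


def plan(seed: int, requested: list) -> list:
    """Return ids in the order the operators were requested."""
    return [mutation_id(seed, op, target) for op, target in requested]


# Hypothetical (operator, target field) pairs.
requested = [("Nullify", "severity"), ("FutureTime", "response_timestamp")]
first = plan(20260517, requested)
second = plan(20260517, requested)

print(first == second)  # identical inputs give an identical list
```

No clock reading, process id, or unordered container takes part in the id, which is what keeps the list and its order reproducible.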
Combinatorial explosion appears already at depth 2–3. Give the generator a selection policy, not full enumeration: at minimum one mutant per class (required field, time window, escalation graph, recursive dependency, priority conflict). Tie operator priority to incident history: if post-mortems more often show erroneous time windows, give FutureTime and NegativeLag higher weight in the queue. Directed fuzzing tests historically fragile places in the contract, rather than spending token budget on uniform chaos.
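The selection policy described above can be sketched in two passes: first guarantee one mutant per class, then spend the remaining budget in order of operator weight. The class labels, the weight values, and the function name `select_mutants` are hypothetical; real weights would come from post-mortem statistics.

```python
def select_mutants(candidates: list, weights: dict, budget: int) -> list:
    """Pick at least one mutant per class, then fill by operator weight.

    candidates: (mutant_class, operator) pairs
    weights: operator -> weight from incident history (default 1.0)
    """
    # Higher weight first; the operator name breaks ties deterministically.
    ranked = sorted(candidates, key=lambda c: (-weights.get(c[1], 1.0), c[1]))
    chosen, seen = [], set()
    for item in ranked:  # pass 1: coverage wins over budget
        if item[0] not in seen:
            chosen.append(item)
            seen.add(item[0])
    for item in ranked:  # pass 2: spend what is left of the budget
        if len(chosen) >= budget:
            break
        if item not in chosen:
            chosen.append(item)
    return chosen


candidates = [
    ("required_field", "Nullify"),
    ("time_window", "FutureTime"),
    ("time_window", "NegativeLag"),
    ("escalation_graph", "EscalationCycle"),
    ("priority_conflict", "PriorityContradiction"),
]
# Hypothetical history: time-window errors dominate post-mortems.
weights = {"FutureTime": 3.0, "NegativeLag": 2.0}

picked = [op for _, op in select_mutants(candidates, weights, budget=5)]
print(picked)
```

Sorting instead of random sampling keeps the queue reproducible without a seed. A seeded sampler would be the alternative if more variety between runs is wanted.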
flowchart TD
    A[File base_spec.json] --> B[AST Normalizer]
    B --> C[Mutation Factory]
    C --> C1[Nullify]
    C --> C2[FutureTime]
    C --> C3[EscalationCycle]
    C --> C4[PriorityContradiction]
    C1 --> D[Verifier/Implementor Duel with Given/When/Then Step Binding]
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E[Diagnostics and Stack Route]
    E --> F[mutation_id and validation.md]
    F --> G[CI Gate]
Bind each mutant to a specific Given/When/Then step and a specific JSON Schema rule. Otherwise diagnostics will remain too general to fix. Bindings must be explicit: mutation Nullify(service_id) relates to Given:incident_received and rule required.service_id, while mutation FutureTime(response_timestamp) — to When:evaluate_sla_window and constraint format + maximum(now).
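The two bindings named above can be kept in an explicit table, so that a mutant without a binding fails at generation time rather than producing vague diagnostics later. The table layout and the `bind` helper are assumptions made for this sketch.

```python
# Explicit binding table: (operator, field) -> step and schema rule.
# The two entries are the ones named in the text; the layout is hypothetical.
BINDINGS = {
    ("Nullify", "service_id"): {
        "target_step": "Given:incident_received",
        "json_schema_rule": "required.service_id",
    },
    ("FutureTime", "response_timestamp"): {
        "target_step": "When:evaluate_sla_window",
        "json_schema_rule": "format + maximum(now)",
    },
}


def bind(operator: str, field: str) -> dict:
    """Return the step and rule for a mutant, or fail loudly."""
    try:
        return BINDINGS[(operator, field)]
    except KeyError:
        raise ValueError(f"no binding for {operator}({field})") from None


print(bind("Nullify", "service_id")["target_step"])
```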
If a mutant breaks Then:notify_primary_owner, the report must show the essence of the problem. The issue is not the notification as an action. The issue is the impossibility of computing a valid owner after the route is corrupted. Such tracing reduces manual debugging: the engineer sees the sticking point, not just the final VALIDATION_FAILED.
{
"mutation_id": "m_20260517_0009",
"operator": "EscalationCycle",
"target_step": "When:route_escalation",
"json_schema_rule": "$defs.escalation_graph.no_cycles",
"failed_step": "Verifier::GraphCheck::Escalation",
"stack_route": [
"schema.normalize",
"step.when.prepare",
"graph.build",
"graph.detect_cycle",
"halt"
]
}
Cycle diagnostics require a separate graph pass, because JSON Schema checks data shape well but does not always express the topological behavior of a route. For EscalationCycle the validator builds a directed graph of owners or queues and runs depth-first search (DFS) with white/gray/black states. Reaching a gray node means the current path has closed on itself, and the validator returns that cycle, for example primary_oncall → sre_lead → primary_oncall.
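The three-color DFS can be sketched as follows. The adjacency-list representation and the function name `find_cycle` are assumptions; the owner names are the ones used in the text.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a path that ends where it starts, or None."""
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cyclic = {"primary_oncall": ["sre_lead"], "sre_lead": ["primary_oncall"]}
acyclic = {"edge_oncall": ["traffic_sre"], "traffic_sre": []}

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

DFS reports the first cycle it closes, which is not guaranteed to be the shortest one in a larger graph. If the report must show a shortest cycle, an additional breadth-first pass from the nodes of the found cycle is one option.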
A similar control is used for reversible priority transitions. If P1 is downgraded to P2 by one rule, and then another rule returns P2 to P1 without a tie-breaker rule, the validator must stop before the execution phase. The diagnostic code must distinguish CYCLE_ESCALATION from PRIORITY_REVERSAL. The first is fixed by the route graph. The second — by conflict resolution policy.
Check temporal anomalies before routing. Incorrect time distorts SLA, severity, and reaction channel choice. Give the validator at least three anchors — event_detected_at, event_received_at, agreed now from a controlled time source — and a max_reaction_lag policy. Accordingly, failure gets one of three codes: INVALID_TIME_ANCHOR (if response_timestamp is in the future — problem in input payload), NEGATIVE_RESPONSE_LAG (negative reaction delay — problem in time normalization), or STALE_INCIDENT_WINDOW (event older than allowed window — problem in SLA rule). Different codes matter for the SDD log: they show where exactly the contract is weakened.
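A minimal sketch of this classification is shown below. The order of the checks and the use of `max_reaction_lag` as the allowed incident window are assumptions; a real policy may keep a separate staleness limit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_time(
    event_detected_at: datetime,
    event_received_at: datetime,
    response_timestamp: datetime,
    now: datetime,
    max_reaction_lag: timedelta,
) -> Optional[str]:
    """Return a diagnostic code for the first anomaly found, else None."""
    if response_timestamp > now:
        return "INVALID_TIME_ANCHOR"  # response lies in the future
    if response_timestamp < event_received_at:
        return "NEGATIVE_RESPONSE_LAG"  # response precedes receipt
    if now - event_detected_at > max_reaction_lag:
        return "STALE_INCIDENT_WINDOW"  # event older than the window
    return None


UTC = timezone.utc
now = datetime(2026, 5, 17, 10, 10, tzinfo=UTC)  # agreed "now"
detected = datetime(2026, 5, 17, 10, 0, tzinfo=UTC)
received = datetime(2026, 5, 17, 10, 1, tzinfo=UTC)
lag = timedelta(minutes=15)

ok = classify_time(
    detected, received, datetime(2026, 5, 17, 10, 5, tzinfo=UTC), now, lag
)
future = classify_time(
    detected, received, datetime(2026, 5, 17, 10, 20, tzinfo=UTC), now, lag
)
negative = classify_time(
    detected, received, datetime(2026, 5, 17, 9, 59, tzinfo=UTC), now, lag
)
print(ok, future, negative)
```

Passing `now` as an argument, instead of reading the system clock inside the function, is what keeps the check reproducible across runs.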
Recursive dependencies differ from cycles in that they may not look like a short loop in a graph. A typical chain: owner is computed from priority, priority depends on blast_radius, blast_radius queries owner_group, and owner_group again requires the already computed owner.
For such cases set an unfolding limit, for example max_resolution_depth = 8. Preserve the resolution attempt trace. If the limit is exceeded, the validator returns RECURSION_LIMIT together with the field chain, rather than masking the problem as a timeout. This protects the LLM executor from infinite condition refinement and makes the failure cascade observable.
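The unfolding limit can be sketched as a resolver that follows dependencies and keeps the chain of visited fields. For brevity the sketch assumes each field depends on at most one other field. The `RESOLVED` code and the function name `resolve` are hypothetical; `RECURSION_LIMIT` and `max_resolution_depth = 8` come from the text.

```python
MAX_RESOLUTION_DEPTH = 8


def resolve(field: str, depends_on: dict, limit: int = MAX_RESOLUTION_DEPTH):
    """Follow dependencies from field; return (code, chain of fields)."""
    chain = [field]
    current = field
    while current in depends_on:
        current = depends_on[current]
        chain.append(current)
        if len(chain) - 1 > limit:  # number of hops exceeds the limit
            return "RECURSION_LIMIT", chain
    return "RESOLVED", chain


# The chain from the text: owner eventually depends on itself.
recursive = {
    "owner": "priority",
    "priority": "blast_radius",
    "blast_radius": "owner_group",
    "owner_group": "owner",
}
code, chain = resolve("owner", recursive)
print(code, " -> ".join(chain[:5]))
```

Returning the chain together with the code is what lets the report show where the resolution loops instead of reporting a generic timeout.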
Now about the immunity metric (vector components — at the beginning of the chapter). Introduce it as a vector, not as a single final score. If strict_reject_rate rises but depth_of_diagnostics falls to one, the harness became stricter but blinder. If recovery_time_p95_ms exceeds the limit, even a correct validator starts slowing CI and provoking workarounds.
Build blocking in CI on immunity thresholds and regression comparison with the previous run. For the educational harness start with the following values:
- `strict_reject_rate >= 0.98`,
- `depth_of_diagnostics >= 3`,
- `recovery_time_p95_ms <= 1200`.
Then calibrate values to actual load and number of mutants.
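To make the vector concrete, here is a sketch of how its three components could be computed from validator results. Several details are assumptions: the `recovery_ms` field, averaging `depth` across mutants, the nearest-rank definition of p95, and the sample records themselves. The actual `immunity_score.py` may define them differently.

```python
import math


def p95(values: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def immunity(results: list, by_operator: dict) -> dict:
    """Compute the three vector components for one run."""
    strict = 0
    for r in results:
        exp = by_operator[r["operator"]]
        got = (r["diagnostic_code"], r["halt_before"])
        if got == (exp["diagnostic_code"], exp["halt_before"]):
            strict += 1  # rejected with the expected code at the expected step
    return {
        "strict_reject_rate": strict / len(results),
        "depth_of_diagnostics": sum(r["depth"] for r in results) / len(results),
        "recovery_time_p95_ms": p95([r["recovery_ms"] for r in results]),
    }


# Illustrative data only; the FutureTime values are invented for the sketch.
by_operator = {
    "Nullify": {"diagnostic_code": "EMPTY_REQUIRED_FIELD",
                "halt_before": "When:evaluate_sla_window"},
    "FutureTime": {"diagnostic_code": "INVALID_TIME_ANCHOR",
                   "halt_before": "When:evaluate_sla_window"},
}
results = [
    {"operator": "Nullify", "diagnostic_code": "EMPTY_REQUIRED_FIELD",
     "halt_before": "When:evaluate_sla_window", "depth": 4, "recovery_ms": 300},
    {"operator": "FutureTime", "diagnostic_code": "VALIDATION_FAILED",
     "halt_before": "When:evaluate_sla_window", "depth": 2, "recovery_ms": 900},
]
vector = immunity(results, by_operator)
print(vector)
```

In this sample the second mutant is rejected, but with a generic code, so it does not count toward `strict_reject_rate`. Keeping the components separate makes that visible instead of hiding it in one averaged score.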
Merge is blocked if a new change does at least one of three things:
- misses an old `mutation_id`,
- worsens diagnostic depth,
- exceeds the recovery time limit.
Such a gate protects not only JSON Schema, but the entire validator harness: normalizer, graph checks, Given/When/Then rules, and report format.
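The three blocking conditions can be sketched as a comparison of the current run against the previous one. The report layout (`rejected_ids`, `depth_of_diagnostics`, `recovery_time_p95_ms` in one dictionary) and the function name `gate` are assumptions; this is not the `ci_gate.py` script mentioned in the text.

```python
def gate(previous: dict, current: dict, recover_ms_p95_max: float) -> list:
    """Return blocking reasons; an empty list means merge is allowed."""
    reasons = []
    missed = sorted(set(previous["rejected_ids"]) - set(current["rejected_ids"]))
    if missed:
        reasons.append("missed old mutation_id: " + ", ".join(missed))
    if current["depth_of_diagnostics"] < previous["depth_of_diagnostics"]:
        reasons.append("diagnostic depth worsened")
    if current["recovery_time_p95_ms"] > recover_ms_p95_max:
        reasons.append("recovery time limit exceeded")
    return reasons


# Hypothetical run reports.
previous = {"rejected_ids": ["m_1", "m_2"],
            "depth_of_diagnostics": 4.0, "recovery_time_p95_ms": 850}
current_ok = {"rejected_ids": ["m_1", "m_2", "m_3"],
              "depth_of_diagnostics": 4.0, "recovery_time_p95_ms": 900}
current_bad = {"rejected_ids": ["m_1"],
               "depth_of_diagnostics": 3.5, "recovery_time_p95_ms": 1300}

print(gate(previous, current_ok, 1200))
print(gate(previous, current_bad, 1200))
```

Returning every reason, not only the first, gives the reviewer the full picture of a regression in one CI run.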
> [runnable] — the command below corresponds to book2/examples/stress-mutator.
cd book2/examples/stress-mutator
python3 scripts/immunity_score.py \
--validator-results out/validator_results.json \
--expected expected/expected_failures.json
In your own project this gate usually looks like python3 scripts/ci_gate.py --strict-reject-min 0.98 --diag-depth-min 3 --recover-ms-p95 1200 --fail-on-regression. The textbook has no ready-made script specifically for stress-mutator; the idea "one missed threshold = block" is preserved in examples/goodhart-validator/scripts/ci_gate.py (Part 10), which is similar in form.
Record run results in SDD as an evidence chain, not as a one-off test log: mutation_id, specification diff, original and mutated fragments, rejection log, diagnostic code, stack_route, reference to JSON Schema rule, and final entry in validation.md. For review it is especially useful to store expected_failure and actual_failure: if they diverge, the validator may be rejecting the case randomly or too late. Such structure turns the mutation catalog into a precedent catalog, where each new rule is tied to a specific blind spot and verifiable basis.
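The comparison of `expected_failure` and `actual_failure` can be sketched as a small classifier. The step order, the verdict labels, and the function name `compare` are assumptions made for illustration; a real harness would take the step order from the specification itself.

```python
from typing import Optional

# Assumed execution order of steps, used to tell "too late" from "too early".
STEP_ORDER = [
    "Given:incident_received",
    "When:evaluate_sla_window",
    "When:route_escalation",
    "Then:notify_owner",
]


def compare(expected: dict, actual: Optional[dict]) -> str:
    """Classify how the actual failure relates to the expected one."""
    if actual is None:
        return "NOT_REJECTED"
    if actual["code"] != expected["code"]:
        return "WRONG_CODE"
    exp_pos = STEP_ORDER.index(expected["halt_before"])
    act_pos = STEP_ORDER.index(actual["halt_before"])
    if act_pos > exp_pos:
        return "REJECTED_TOO_LATE"
    if act_pos < exp_pos:
        return "REJECTED_TOO_EARLY"
    return "MATCH"


expected = {"code": "EMPTY_REQUIRED_FIELD",
            "halt_before": "When:evaluate_sla_window"}
late = {"code": "EMPTY_REQUIRED_FIELD", "halt_before": "Then:notify_owner"}

print(compare(expected, dict(expected)))
print(compare(expected, late))
```

Storing the verdict next to both records in the evidence chain makes a late or accidental rejection visible during review.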
Full Track: Threshold Calibration
The "Low / Default / High" table for strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms and number of mutants per class, the threshold shift exercise, and signals for review are placed in Appendix D, Section D.1. Not needed on the first pass.
Examples and Application
Example: a correct specification describes incident appointment_latency_spike. SLA requires reaction within 10 minutes. Escalation route goes from appointments_oncall to sre_lead.
The mutator creates m_20260517_nullify_855e4297f7. In it the severity field is replaced with an empty string. The mutant is bound to Given:incident_received and rule severity.minLength. Expected failure — EMPTY_REQUIRED_FIELD. The pipeline must stop before When:evaluate_sla_window, before SLA calculation and before owner selection.
If instead the validator reaches Then:notify_owner, it means the empty severity field leaked too deep and may cause a false notification about an unclassified incident.
{
"mutation_id": "m_20260517_nullify_855e4297f7",
"base_case": "appointment_latency_spike",
"operator": "Nullify",
"target_step": "Given:incident_received",
"json_schema_rule": "$.properties.severity.minLength",
"diff_spec": {
"before": { "severity": "P1" },
"after": { "severity": "" }
},
"expected_failure": {
"code": "EMPTY_REQUIRED_FIELD",
"halt_before": "When:evaluate_sla_window"
}
}
A second example checks the escalation graph for incident cdn_error_budget_burn. Owner edge_oncall hands P1 to traffic_sre. The mutator adds a reverse edge traffic_sre → edge_oncall.
The Verifier must return CYCLE_ESCALATION, show the minimal cycle, and bind the failure to When:route_escalation. The Implementor must not propose a workaround like "pick the first owner from the list". After a fix in JSON Schema or in an additional graph rule, the same mutation_id is re-run to prove that the patch closes exactly the defect that was found.
The validation.md entry must include the diff, verdict, recovery time, and reference to the CI run. Otherwise the decision will be impossible to verify at the next route change.
Summary
The stress specification generator turns validator checking into a managed engineering cycle: it classifies degenerate scenarios, creates reproducible mutations, binds each breakage to a Given/When/Then step and JSON Schema rule, measures immunity through three vector components, and preserves evidence in SDD via mutation_id, specification diffs, rejection log, and validation.md. Such a harness turns absurd cases into a regression set against future toxic requirements and hidden failure cascades. The next chapter moves to the shadow specification auction.
Artifacts and Readiness Criteria
| Artifact | Ready when |
|---|---|
| `base/base_spec.json` | describes a correct incident scenario on which mutations will be built |
| Local `out/mutations/` (4 mutants) | repeated run with the same seed produces the same `mutation_id` order; directory is not committed |
| `out/validator_results.json` | each mutant has a bound Given/When/Then step and JSON Schema rule; has `diagnostic_code`, `halt_before`, depth (`depth`) |
| Minimal immunity report | three vector components are filled — strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms; runnable example passes smoke-pass |
The full track adds expected/expected_failures.json as a regression base for CI, a short reviewable report or entry in validation.md, and a CI gate that compares the new run against the old mutation_id. Consider it ready if the validator stops cycles and temporal anomalies before the execution phase, and CI blocks regression on at least one old mutation_id.
Practice
- `cd book2/examples/stress-mutator && python3 scripts/mutate_specs.py --base base/base_spec.json --seed 20260517 --out out/mutations`. *Expectation: `out/mutations/` contains exactly 4 files with `mutation_id` `m_20260517_nullify_855e4297f7`, `m_20260517_futuretime_…`, `m_20260517_escalationcycle_…`, `m_20260517_prioritycontradiction_…`; `diff out/mutations/manifest.json manifest.example.json` yields 0 lines of difference.*
- `python3 scripts/fake_validator.py --mutations out/mutations --out out/validator_results.json && python3 scripts/immunity_score.py --validator-results out/validator_results.json --expected expected/expected_failures.json --out out/immunity.json`. *Expectation: `strict_reject_rate >= 0.98`, `depth_of_diagnostics >= 3`, `recovery_time_p95_ms <= 1200`.*
- Transfer one line to `capstone/validation.md`: "immunity (seed=20260517): rejected `<n>/4` mutants at expected step; failure: `<mutation_id>`, additional guard needed". *Expectation: at the next regression, the comparison is made against the fixed `seed`, not against "all green".*
Review Questions
- Why is JSON Schema insufficient for checking cycles and recursive dependencies?
- What does `strict_reject_rate` show, and what does it hide?
- When does validator strictness growth become harmful?
- The validator passed a smoke run with 50 mutants and showed `strict_reject_rate=0.95`, `depth_of_diagnostics=2.4`, `recovery_time_p95_ms=900`. All three scalars are within default thresholds. Name at least one scenario in which this run should be considered a failure, and what additional manifest.json fields need to be checked so that such a failure is visible to the next reviewer.