Topic: Applied Part 5. Mutation Testing of Specifications
Difficulty level: Medium
Estimated study time: 6-8 hours (theory: 2-3 hours, practice: 4-5 hours)
Prerequisites: Familiarity with JSON Schema and the structure of Given/When/Then specifications
Completion of Part 9 of Volume 1 (validation facts and validation)
Completion of Part 20 of Volume 1 (SDD antipatterns)
Basic command-line and Python 3 skills
Understanding of graph structure fundamentals (directed graphs, DFS)
Experience with version control systems (Git)
Learning objectives: Run a deterministic mutant generator with a fixed seed and reproduce an identical set of mutation_ids upon re-execution
Configure a validation pipeline that rejects mutants at the expected Given/When/Then step with an exact diagnostic code
Calculate and interpret the vector immunity metric (strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms) and apply threshold values for the CI gate
Link each mutant to a specific JSON Schema rule and specification step, ensuring traceability in SDD
Form a minimal reviewable trace in capstone/validation.md for integrating mutation testing into the merge process
Overview: Mutation testing of specifications is an engineering practice at the frontier of standardization that transforms the validator from a syntax guard into an anatomical diagnostic tool. The essence of the approach: we controllably "corrupt" a correct specification (base_spec.json), create degenerate incident process scenarios, and require that the validator catch each defect at a strictly defined step with a predictable diagnostic code. The key principle is "one mutant — one expected rejection". The result is a vector immunity metric of three components: strict_reject_rate (share of strict rejections at the expected step), depth_of_diagnostics (useful diagnostic depth until stop), and recovery_time_p95_ms (time to stable verdict). The practice is closely tied to the "specification-first" (spec-first) approach and turns absurd cases into a regression suite against future toxic requirements. The study material relies on the runnable example in examples/stress-mutator/ with the appointment_latency_spike incident and four mutation operators: Nullify, FutureTime, EscalationCycle, PriorityContradiction.
Key concepts: Mutation testing of specifications: A technique in which a reference artifact (base_spec.json) is controllably distorted by mutation operators, and the test pipeline must catch each defect. It differs from manual creation of individual defects (chapter 2) in scale and reproducibility, from counterexample verification (chapter 4) in being bound to a catalog of operators, and from file arbitration (chapter 8) in that it generates artifacts rather than checking ready-made ones.
Immunity metric: A vector metric of validator resilience consisting of three components: strict_reject_rate (share of cases strictly rejected at the expected step), depth_of_diagnostics (useful diagnostic depth until rejection, measured as the number of meaningful steps in stack_route), recovery_time_p95_ms (time to return a stable verdict). Introduced as a vector rather than a scalar to avoid situations where increased strictness is accompanied by diagnostic blindness or performance degradation.
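The three components can be sketched as a small aggregation over per-mutant results. This is a minimal illustration, not the course's immunity_score.py: the MutantResult record, its field names, the use of a mean for depth, and the nearest-rank p95 are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MutantResult:
    # Hypothetical record shape; the real pipeline may store results differently.
    mutation_id: str
    expected_code: str
    expected_halt: str
    actual_code: str
    actual_halt: str
    stack_route: list = field(default_factory=list)
    elapsed_ms: float = 0.0


def immunity_vector(results):
    """Compute the three immunity components from a non-empty list of results."""
    if not results:
        raise ValueError("at least one mutant result is required")
    # A rejection is strict only if both the code and the stop point match.
    strict = sum(
        1 for r in results
        if r.actual_code == r.expected_code and r.actual_halt == r.expected_halt
    )
    # Assumption: every stack_route entry counts as one meaningful step.
    mean_depth = sum(len(r.stack_route) for r in results) / len(results)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    times = sorted(r.elapsed_ms for r in results)
    p95 = times[math.ceil(0.95 * len(times)) - 1]
    return {
        "strict_reject_rate": strict / len(results),
        "depth_of_diagnostics": mean_depth,
        "recovery_time_p95_ms": p95,
    }
```

Returning a dictionary rather than a single score keeps the components separate, so a gain in one cannot hide a loss in another.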
"One mutant — one expected rejection" principle: The main rule of the mutation factory. Each mutant contains exactly one change and has a fixed expected_failure with a diagnostic code and a halt_before stop point. Prohibits compound mutations that make cause localization impossible. Bad: one mutant simultaneously nullifies service_id, reverses the escalation graph, and inverts priorities. Good: the Nullify mutator nullifies only severity, expected_failure.code = EMPTY_REQUIRED_FIELD, halt_before = When:evaluate_sla_window.
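A minimal sketch of the principle, assuming a flat dictionary as the specification and a hypothetical mutant record shape. The helper names nullify and changed_fields and the sample service_id value are illustrative only; the diagnostic code and stop point are the ones named above.

```python
import copy


def nullify(base_spec, target, halt_before):
    """Return a mutant that differs from base_spec in exactly one field."""
    if target not in base_spec:
        raise KeyError(f"{target!r} is not in the base specification")
    mutated = copy.deepcopy(base_spec)  # never touch the reference artifact
    mutated[target] = ""  # the single change: an empty string
    return {
        "operator": "Nullify",
        "target": target,
        "spec": mutated,
        "expected_failure": {
            "code": "EMPTY_REQUIRED_FIELD",
            "halt_before": halt_before,
        },
    }


def changed_fields(base_spec, mutated_spec):
    """List the top-level fields whose values differ between the two specs."""
    keys = sorted(set(base_spec) | set(mutated_spec))
    return [k for k in keys if base_spec.get(k) != mutated_spec.get(k)]


# Hypothetical base specification for illustration.
base = {"service_id": "appointments", "severity": "high", "owners": ["appointments_oncall"]}
mutant = nullify(base, "severity", "When:evaluate_sla_window")
assert changed_fields(base, mutant["spec"]) == ["severity"]
```

A check like changed_fields can run inside the factory itself to refuse any mutant that touches more than one field.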
Mutation factory: A deterministic mutator on top of a correct base_spec.json. Parses the specification into an AST with Given/When/Then nodes, SLA matrix, escalation rules, and JSON Schema fragments. Applies operators with a fixed seed for reproducibility. Critical for the Verifier and Implementor duel: a disputed case can be reproduced, given to both roles, and checked for contract violation.
Mutation operators: Nullify (field nullification — empty string, null, empty array), FutureTime (shifting a timestamp into the future relative to the agreed now), EscalationCycle (adding a back edge to the escalation graph), PriorityContradiction (introducing mutually contradictory priority rules). Future extensions: RecursiveDependency (indirect recursion between computed fields).
Validator vaccination: A figurative name for ordinary mutation testing of specifications. The validator receives controllably corrupted inputs and must reject them at the expected step. Turns absurd cases into a "vaccine" against regressions.
Degenerate scenario classes: Empty fields (any emptiness without which a safe action cannot be chosen: empty strings, empty owner arrays, missing severity, service_id, runbook_ref), temporal anomalies (formally correct ISO timestamps with violated causality), reversible escalation cycles (infinite owner redefinition), recursive dependencies (indirect computation chains with potential infinite unfolding).
Determinism and seed: A fixed generator seed (e.g., 20260517) guarantees a byte-for-byte identical manifest.json across repeated runs. Two consecutive runs that yield the same mutation_ids in the same order are the minimal quality control. Without determinism, neither a regression base nor a role duel is possible.
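One way to obtain such reproducibility is sketched below. It assumes a local random.Random instance for choices and a SHA-256 digest for identifiers. The function build_manifest, the ID derivation, and the manifest layout are hypothetical; the course's mutate_specs.py may derive mutation_ids differently.

```python
import hashlib
import json
import random


def build_manifest(seed, operators, fields):
    """Build a manifest string that depends only on the arguments."""
    rng = random.Random(seed)  # local generator; global random state is not used
    mutations = []
    for operator in operators:
        target = rng.choice(fields)
        # hashlib is stable across processes, unlike the built-in hash() for strings.
        digest = hashlib.sha256(f"{seed}:{operator}:{target}".encode()).hexdigest()[:10]
        mutations.append({
            "mutation_id": f"m_{seed}_{operator.lower()}_{digest}",
            "operator": operator,
            "target": target,
        })
    # sort_keys and a fixed indent keep the serialized bytes stable.
    return json.dumps({"seed": seed, "mutations": mutations}, sort_keys=True, indent=2)


operators = ["Nullify", "FutureTime", "EscalationCycle", "PriorityContradiction"]
fields = ["severity", "service_id", "runbook_ref"]
assert build_manifest(20260517, operators, fields) == build_manifest(20260517, operators, fields)
```

Typical sources of nondeterminism to avoid are wall-clock timestamps in the manifest, unordered set iteration, and the global random module state.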
Graph checks outside JSON Schema: JSON Schema checks data shape well but cannot express the topological behavior of routes. For EscalationCycle, the validator builds a directed graph and runs DFS with white/gray/black states. Reaching a gray node means a back edge was found, and the cycle can be read off the current DFS path (it is a real cycle, though not necessarily the shortest one in the graph). PriorityContradiction gets a similar check for reversible transitions. The two cases use distinct codes: CYCLE_ESCALATION and PRIORITY_REVERSAL.
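The three-color DFS can be sketched as follows. The adjacency-dictionary input and the function name find_cycle are assumptions for illustration; the recursive form is fine for small escalation graphs but would need an explicit stack for very deep ones.

```python
from typing import Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored


def find_cycle(graph: dict) -> Optional[list]:
    """Return one cycle as a node list (first node repeated at the end), or None."""
    color = {node: WHITE for node in graph}
    for targets in graph.values():
        for target in targets:
            color.setdefault(target, WHITE)  # nodes that only appear as targets
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:  # back edge: nxt is on the current path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


escalation = {"appointments_oncall": ["sre_lead"], "sre_lead": []}
assert find_cycle(escalation) is None
escalation["sre_lead"].append("appointments_oncall")  # the EscalationCycle back edge
assert find_cycle(escalation) == ["appointments_oncall", "sre_lead", "appointments_oncall"]
```

A validator would wrap a non-empty result in a rejection with code CYCLE_ESCALATION and include the node list in the diagnostics.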
Temporal anchors and max reaction lag policy: Three mandatory anchors: event_detected_at, event_received_at, agreed now from a controlled time source. Three rejection codes: INVALID_TIME_ANCHOR (response_timestamp in the future — input load problem), NEGATIVE_RESPONSE_LAG (negative lag — normalization problem), STALE_INCIDENT_WINDOW (event older than window — SLA rule problem). Different codes are critical for the SDD log.
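A sketch of how the three codes might be told apart. The exact conditions are assumptions: here any anchor later than the agreed now yields INVALID_TIME_ANCHOR, a receive time earlier than the detect time yields NEGATIVE_RESPONSE_LAG, and an event older than the window yields STALE_INCIDENT_WINDOW. The function name and the 10-minute window in the usage lines are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_time_anchors(
    event_detected_at: datetime,
    event_received_at: datetime,
    now: datetime,
    window: timedelta,
) -> Optional[str]:
    """Return a rejection code for the first violated rule, or None if all hold."""
    # Rule 1: no anchor may lie in the future relative to the agreed now.
    if event_detected_at > now or event_received_at > now:
        return "INVALID_TIME_ANCHOR"
    # Rule 2: an event cannot be received before it was detected.
    if event_received_at < event_detected_at:
        return "NEGATIVE_RESPONSE_LAG"
    # Rule 3: the event must still fall inside the allowed window.
    if now - event_detected_at > window:
        return "STALE_INCIDENT_WINDOW"
    return None


now = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)  # from a controlled time source
window = timedelta(minutes=10)
detected = now - timedelta(minutes=5)
assert classify_time_anchors(detected, now - timedelta(minutes=4), now, window) is None
assert classify_time_anchors(detected, now + timedelta(minutes=1), now, window) == "INVALID_TIME_ANCHOR"
```

Passing now in as an argument, instead of reading the system clock inside the function, is what keeps FutureTime mutants reproducible.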
Recursive dependencies and max resolution depth: These differ from cycles in that there is no short loop. A typical chain: owner ← priority ← blast_radius ← owner_group ← owner. The resolver has an unfolding limit (e.g., 8) and keeps a trace of attempts. Exceeding the limit yields RECURSION_LIMIT with the field chain instead of being masked as a timeout. This protects the LLM executor from infinite refinement.
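A minimal sketch of a bounded resolver, under the simplifying assumption that each computed field depends on at most one other field. The function name resolve_chain and the result shape are hypothetical; RECURSION_LIMIT and the field chain come from the text above.

```python
def resolve_chain(start, depends_on, max_resolution_depth=8):
    """Follow dependencies from `start`, recording every resolution attempt."""
    attempts = []
    current = start
    while current is not None:
        attempts.append(current)
        revisited = attempts.count(current) > 1
        too_deep = len(attempts) > max_resolution_depth
        if revisited or too_deep:
            # Return the chain itself so the contract can be repaired.
            return {"code": "RECURSION_LIMIT", "resolution_attempts": attempts}
        current = depends_on.get(current)  # None means the field is a plain value
    return {"code": "OK", "resolution_attempts": attempts}


depends_on = {
    "owner": "priority",
    "priority": "blast_radius",
    "blast_radius": "owner_group",
    "owner_group": "owner",
}
result = resolve_chain("owner", depends_on)
assert result["code"] == "RECURSION_LIMIT"
assert result["resolution_attempts"] == ["owner", "priority", "blast_radius", "owner_group", "owner"]
```

The depth limit still matters when no field repeats: a long but finite chain is cut off at the limit with the same code and a full trace, so the verdict never depends on a wall-clock timeout.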
Mutation CI gate: Merge blocking upon violation of any of three thresholds: strict_reject_rate < 0.98, depth_of_diagnostics < 3, recovery_time_p95_ms > 1200. Also blocks skipping of old mutation_id, degradation of diagnostic depth, time limit exceeded. Example: python3 scripts/ci_gate.py --strict-reject-min 0.98 --diag-depth-min 3 --recover-ms-p95 1200 --fail-on-regression.
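The gate logic can be sketched as a pure function that returns a list of blocking reasons. This is an illustration under assumptions, not the course's ci_gate.py: the function name, argument names, and message texts are hypothetical, and only the three thresholds and the check for skipped mutation_ids are taken from the text.

```python
def ci_gate(metrics, previous_ids, current_ids,
            strict_reject_min=0.98, diag_depth_min=3, recover_ms_p95=1200):
    """Return a list of blocking reasons; an empty list means the merge may proceed."""
    failures = []
    if metrics["strict_reject_rate"] < strict_reject_min:
        failures.append("strict_reject_rate below threshold")
    if metrics["depth_of_diagnostics"] < diag_depth_min:
        failures.append("depth_of_diagnostics below threshold")
    if metrics["recovery_time_p95_ms"] > recover_ms_p95:
        failures.append("recovery_time_p95_ms above threshold")
    # Regression check: every previously known mutant must still be exercised.
    missing = sorted(set(previous_ids) - set(current_ids))
    if missing:
        failures.append("missing mutation_ids: " + ", ".join(missing))
    return failures


metrics = {"strict_reject_rate": 1.0, "depth_of_diagnostics": 1.2, "recovery_time_p95_ms": 400}
assert ci_gate(metrics, ["m_1"], ["m_1"]) == ["depth_of_diagnostics below threshold"]
```

In a CI job the caller would print the reasons and exit with a non-zero status whenever the list is not empty.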
Reviewable trace and SDD evidence: Minimal fragment in capstone/validation.md: seed, operator list, three immunity metrics, verdict. Full trace: mutation_id, specification diff, original and mutated fragments, rejection log, diagnostic code, stack_route, link to JSON Schema rule. The out/mutations directory is local and is not committed; the source of truth is the reproducible command.
Practice exercises: Name: Verifying mutant generator determinism
Problem: You are given a correct base_spec.json and a mutate_specs.py script with seed=20260517. You must prove that the generator is deterministic: two consecutive runs must produce an identical manifest.json. One student ran the script once and saw 4 mutant files. The reviewer cannot check reproducibility. Fix the procedure and explain why a single run is insufficient.
Solution: Step 1: Run the first pass: python3 scripts/mutate_specs.py --base base/base_spec.json --seed 20260517 --operators Nullify,FutureTime,EscalationCycle,PriorityContradiction --out out/mutations_run1. Save manifest.json. Step 2: Keep out/mutations_run1 in place and run the exact same command with --out out/mutations_run2; deleting the first output would leave nothing to compare. Step 3: Compare via diff out/mutations_run1/manifest.json out/mutations_run2/manifest.json. Expectation: 0 lines of difference. Step 4: Compare with the reference: diff out/mutations_run1/manifest.json manifest.example.json. Expectation: 0 lines. Step 5: Check mutation_id order stability: the output of jq '.mutations | map(.mutation_id)' for out/mutations_run1/manifest.json must match the output for out/mutations_run2/manifest.json. Explanation: a single run cannot distinguish a deterministic generator from random noise. Only a repeated run creates a regression base. Without this, the Verifier/Implementor duel and CI check are impossible.
Complexity: beginner
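The comparison in Steps 3 and 5 can also be scripted so the reviewer gets one reproducible verdict. The sketch below assumes the manifest has a top-level mutations array whose items carry mutation_id, as the jq expression in the solution implies; the function name compare_manifests and the result keys are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def compare_manifests(path_a, path_b):
    """Compare two manifest files by raw bytes and by mutation_id order."""
    raw_a = Path(path_a).read_bytes()
    raw_b = Path(path_b).read_bytes()
    ids_a = [m["mutation_id"] for m in json.loads(raw_a)["mutations"]]
    ids_b = [m["mutation_id"] for m in json.loads(raw_b)["mutations"]]
    return {
        "bytes_identical": raw_a == raw_b,
        "same_id_order": ids_a == ids_b,
        # Digests are convenient evidence to paste into a review comment.
        "sha256_a": hashlib.sha256(raw_a).hexdigest(),
        "sha256_b": hashlib.sha256(raw_b).hexdigest(),
    }
```

For the exercise it would be called as compare_manifests("out/mutations_run1/manifest.json", "out/mutations_run2/manifest.json"), and both boolean fields are expected to be true.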
Name: Localizing mutant leakage through diagnostic depth
Problem: The Nullify mutant nullifies severity. The validator stopped at step Then:notify_primary_owner with code MISSING_OWNER instead of the expected EMPTY_REQUIRED_FIELD at When:evaluate_sla_window. strict_reject_rate = 1.0 (all rejected), but depth_of_diagnostics = 1.2 (below threshold of 3). Analyze why this is a regression, and describe minimal changes to the validator and expected_failures.json to restore the threshold.
Solution: Step 1: Diagnose the problem. Empty severity leaked too deep: the validator did not check that required fields were present at input (Given:incident_received) and tried to compute the owner from invalid data. This creates a risk of a false notification about an unclassified incident. Step 2: Add a guard in the normalizer: check severity.minLength and severity ∈ enum before the evaluate_sla_window stage. Step 3: Update fake_validator.py: when severity is empty, return EMPTY_REQUIRED_FIELD with halt_before = When:evaluate_sla_window, stack_route = ['schema.normalize', 'Given:incident_received', 'field.severity.check', 'halt']. Step 4: Check expected_failures.json: for mutation_id m_20260517_nullify_855e4297f7, expect code=EMPTY_REQUIRED_FIELD, halt_before=When:evaluate_sla_window, depth=4. Step 5: Re-run immunity_score.py. Expectation: strict_reject_rate remains 1.0, depth_of_diagnostics recovers to ≥3, and recovery_time_p95_ms is checked separately. Lesson: a high strict_reject_rate can hide blind diagnostics. The vector immunity metric catches this kind of regression.
Complexity: intermediate
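The guard from Steps 2 and 3 might look like the sketch below. It covers only the emptiness check; the enum check is left out because the allowed severity values are not listed in this chapter. The function name check_severity and the verdict keys are hypothetical, while the code, halt_before, and stack_route values are the ones given in the solution.

```python
def check_severity(spec):
    """Reject an empty severity before any SLA or owner logic runs."""
    route = ["schema.normalize", "Given:incident_received", "field.severity.check"]
    severity = spec.get("severity")
    # Treat missing, null, empty list, and blank strings as empty.
    is_empty = severity is None or severity == [] or (
        isinstance(severity, str) and not severity.strip()
    )
    if is_empty:
        return {
            "verdict": "REJECT",
            "code": "EMPTY_REQUIRED_FIELD",
            "halt_before": "When:evaluate_sla_window",
            "stack_route": route + ["halt"],
        }
    return {"verdict": "CONTINUE", "stack_route": route}


rejection = check_severity({"service_id": "appointments", "severity": ""})
assert rejection["code"] == "EMPTY_REQUIRED_FIELD"
assert len(rejection["stack_route"]) == 4  # matches depth=4 in expected_failures.json
```

Because the guard runs during normalization, the owner computation never sees the invalid record, which is what removes the MISSING_OWNER symptom.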
Name: Configuring the CI gate with operator priority considerations
Problem: Post-mortem history shows that 60% of production incidents are related to erroneous time windows, 25% to escalation cycles, 10% to empty fields, 5% to priority conflicts. You have a token budget of 100 mutants per run. The standard uniform set (25 per class) does not reflect historical vulnerability. Rebuild the operator selection policy, justify the choice, and show how this affects expected_failures.json and immunity thresholds.
Solution: Step 1: Replace uniform distribution with history-weighted: FutureTime 40 mutants, EscalationCycle 25, Nullify 20, PriorityContradiction 10, reserve 5 for new patterns. Step 2: In mutate_specs.py, set --operator-weights FutureTime=0.40,EscalationCycle=0.25,Nullify=0.20,PriorityContradiction=0.10 (the weights sum to 0.95; the remaining 0.05 is the reserve from Step 1). Step 3: Update expected_failures.json: increase temporal code detail (INVALID_TIME_ANCHOR, NEGATIVE_RESPONSE_LAG, STALE_INCIDENT_WINDOW) with different halt_before for each subtype. Step 4: Reconsider thresholds: with more temporal mutants in the set, strict_reject_rate can be raised to 0.99, depth_of_diagnostics kept ≥3 (but require ≥4 for temporal anomalies), recovery_time_p95_ms reduced to 1000 through graph check optimization. Step 5: In ci_gate.py, add --weighted-fail-on-regression, which compares not only global metrics but also per-operator strict_reject_rate. Step 6: Document justification in validation.md: "FutureTime priority increased per Q1-Q3 2024 post-mortems, see incident references #INC-2047, #INC-2089". Lesson: directed fuzzing tests historically fragile areas rather than wasting budget on uniform chaos.
Complexity: advanced
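The arithmetic behind Steps 1 and 2 can be sketched as a small allocation helper. The function name allocate_mutants is hypothetical, and rounding down with the remainder going to the reserve is an assumption; the weights and the budget of 100 are the ones from the solution.

```python
import math
from fractions import Fraction


def allocate_mutants(budget, weights):
    """Split a mutant budget by operator weight; the remainder becomes the reserve."""
    counts = {
        # Fraction(str(w)) avoids binary floating-point rounding surprises.
        operator: math.floor(budget * Fraction(str(weight)))
        for operator, weight in weights.items()
    }
    reserve = budget - sum(counts.values())
    if reserve < 0:
        raise ValueError("operator weights sum to more than 1")
    return counts, reserve


weights = {
    "FutureTime": 0.40,
    "EscalationCycle": 0.25,
    "Nullify": 0.20,
    "PriorityContradiction": 0.10,
}
counts, reserve = allocate_mutants(100, weights)
assert counts == {"FutureTime": 40, "EscalationCycle": 25, "Nullify": 20, "PriorityContradiction": 10}
assert reserve == 5
```

Keeping the reserve explicit makes it visible in review how much of the budget is left for new patterns.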
Name: Tracing recursive dependency and configuring max_resolution_depth
Problem: The RecursiveDependency mutant creates a chain: owner is computed from priority, priority from blast_radius, blast_radius requests owner_group, owner_group requires owner. The validator crashes with timeout after 30 seconds. stack_route is absent. The developer proposes to "simply increase timeout to 60 seconds". Refute this solution, implement max_resolution_depth with a reproducible trace, and show the entry in validation.md.
Solution: Step 1: Reject timeout increase: this masks the problem, makes CI unpredictable, provides no diagnostics for contract repair. Step 2: In fake_validator.py, add a dependency resolver with resolution stack tracking: resolution_stack = [], push on field entry, pop on exit, repeated push of same field → cycle detection. Step 3: Set max_resolution_depth = 8. When exceeded, return RECURSION_LIMIT with full chain: ['owner', 'priority', 'blast_radius', 'owner_group', 'owner']. Step 4: Add attempt trace to the log: resolution_attempts with timestamp of each step. This distinguishes recursion from merely complex dependency. Step 5: In immunity_score.py, verify that for RecursiveDependency mutants, depth_of_diagnostics includes the resolution trace (counted as len(resolution_attempts) until detection). Step 6: Entry in validation.md:

```yaml
stress_run:
  seed: 20260517
  operators: [Nullify, FutureTime, EscalationCycle, PriorityContradiction, RecursiveDependency]
  recursive_guard:
    max_resolution_depth: 8
  verdict: PASS
  strict_reject_rate: "1.0 >= 0.98"
  depth_of_diagnostics: "5 >= 3"
  recovery_time_p95_ms: "400 <= 1200"
```

Lesson: infinite unfolding must be observable and controllable, not replaced by timeout.
Complexity: advanced
Case studies: Name: Implementing mutation testing in an auto-incident-management platform
Scenario: An SRE platform team (150 engineers, 40 microservices) managed incidents through an automated pipeline with a JSON Schema validator at input. After a series of production incidents where empty severity fields leaked to the notification stage and caused false escalations to the VP on-call, the team decided to implement mutation testing of specifications. The base scenario is appointment_latency_spike (SLA 10 minutes, escalation appointments_oncall → sre_lead).
Challenge: Three key problems: (1) The validator checked JSON syntax but did not catch semantic defects: an empty severity string passed because the schema mistakenly set minLength=0; (2) No reproducibility: manual test cases were created ad hoc and lost when specifications were updated; (3) CI missed regressions because the single metric "all tests green" hid diagnostic degradation: the validator began returning GENERIC_VALIDATION_FAILED instead of precise codes, and debug time grew from 15 minutes to 4 hours.
Solution: The team implemented a mutation factory based on stress-mutator from the course. Step 1: Fixed seed=20240715 and 4 operators (Nullify, FutureTime, EscalationCycle, PriorityContradiction) with FutureTime=0.45 priority per incident history. Step 2: Linked each mutant to a Given/When/Then step and JSON Schema rule — e.g., Nullify(severity) → Given:incident_received, $.properties.severity.minLength. Step 3: Vector immunity metric with thresholds: strict_reject_rate ≥ 0.98, depth_of_diagnostics ≥ 3, recovery_time_p95_ms ≤ 1200. Step 4: CI gate blocking merge on any violation or skipping of old mutation_id. Step 5: SDD evidence: mutation_id, diff, stack_route, rule link. Step 6: For graph checks, added DFS with white/gray/black states; for temporal checks, three anchors and three rejection codes.
Result: Over 3 months: strict_reject_rate grew from 0.71 to 0.99, depth_of_diagnostics from 1.8 to 4.2, recovery_time_p95_ms from 3400 to 890 ms. False escalations due to empty fields decreased by 94%. Incident investigation time related to validation dropped from a median of 4 hours to 25 minutes. A key regression was caught at the PR stage: a developer changed check order, shifting halt_before for FutureTime mutants from When:evaluate_sla_window to Then:notify_owner; the CI gate blocked the merge, and the team reverted the change before analyzing the cause.
Lessons learned: The vector immunity metric is critical: a rise in strict_reject_rate to 0.99 accompanied by a drop in depth_of_diagnostics to 1.2 would be a false success without three-component control
Determinism (fixed seed) allows turning mutations into a regression suite: old mutation_ids become a "vaccine" against future changes
SDD evidence with stack_route and JSON Schema links reduces new engineer onboarding time from a week to one day
Operator prioritization by post-mortem history is more effective than uniform distribution: 45% budget on FutureTime gave 78% coverage of historical defects
Related concepts: Immunity metric (strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms)
"One mutant — one expected rejection" principle
Determinism and seed
Mutation CI gate
Graph checks outside JSON Schema
SDD evidence and reviewable trace
Name: Failure due to ignoring recovery_time in a major release
Scenario: An e-commerce platform was preparing for Black Friday sales. The validator team added 50 new mutation operators to check payment contour specifications. The release showed strict_reject_rate = 0.97 and depth_of_diagnostics = 3.5, which the team's gate accepted as within thresholds (0.97 would not pass the 0.98 minimum used elsewhere in this chapter). The CI gate passed the release.
Challenge: recovery_time_p95_ms grew unnoticed from 800 ms to 4500 ms due to complex graph cycle checks on payment routes. During peak load, the validator became a bottleneck: 12% of requests timed out on validation, cascading to retries and downstream service overload. The team did not check recovery_time at release, considering it an "auxiliary metric".
Solution: The release was rolled back within 45 minutes. Analysis showed that DFS over a graph of 200 escalation nodes ran on every request instead of reusing a cached topological ordering. Optimization: graph analysis is done once when the specification is loaded, with incremental checks for mutations. A hard threshold of recovery_time_p95_ms ≤ 1200 with merge blocking was introduced, and an alert was added for when the value exceeds 800 ms in 3 consecutive runs.
Result: After optimization, recovery_time_p95_ms dropped to 600 ms. On Black Friday, the validator handled 340% peak load without timeouts. Key process change: recovery_time became an equal component of the immunity vector, not a "bonus" metric.
Lessons learned: Three-component immunity vector protects against optimizing one metric at the expense of others
recovery_time is critical for production load: a "correct" validator that slows CI provokes workaround practices and production failure
Incremental graph checks are more efficient than full DFS per request with unchanged topology
Related concepts: Immunity metric and recovery_time_p95_ms
Mutation CI gate
Graph checks outside JSON Schema
Incremental validation
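The caching idea from this case can be sketched with the standard library. The example assumes the escalation graph is passed as a frozenset of (source, target) pairs so that it can serve as a cache key; the function name escalation_order is hypothetical. It shows only the "analyze once per topology" part, not the incremental check on mutations.

```python
import graphlib
from functools import lru_cache


@lru_cache(maxsize=128)
def escalation_order(edges: frozenset) -> tuple:
    """Return a topological order of the graph; raise graphlib.CycleError on a cycle."""
    predecessors = {}
    for source, target in edges:
        # TopologicalSorter expects node -> set of nodes that must come first.
        predecessors.setdefault(target, set()).add(source)
        predecessors.setdefault(source, set())
    return tuple(graphlib.TopologicalSorter(predecessors).static_order())


edges = frozenset({("appointments_oncall", "sre_lead")})
first = escalation_order(edges)   # computed once, e.g. at specification load
second = escalation_order(edges)  # served from the cache on later requests
assert first == second == ("appointments_oncall", "sre_lead")
```

Since lru_cache does not store exceptions, a graph with a cycle is re-analyzed on every call; a production version would probably cache the rejection verdict as well.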
Study tips: Progress through the material sequentially: first the runnable example (steps 1-7), then theory of key ideas, then review questions. Attempting to read theory without running scripts creates an illusion of understanding
Mandatory: perform a repeated run with the same seed and compare manifest.json via diff. This is the minimal quality control that cannot be skipped. Capture the result with a screenshot or terminal copy
Create a correspondence table: mutation operator → degenerate scenario class → Given/When/Then step → JSON Schema rule → diagnostic code → expected halt_before. This table is the foundation for configuring your own project
Experiment with threshold violations: temporarily modify fake_validator.py to let a mutant pass or shift halt_before, and observe which immunity vector component catches it. Practical understanding of regression is more important than theoretical knowledge
For visual learning style: draw the block diagram from the course mermaid diagram on paper, noting where your project differs from the course's appointment_latency_spike
For auditory style: explain the "one mutant — one expected rejection" principle to a colleague or in a voice message recording. If you cannot formulate it clearly in 60 seconds — return to the material
For kinesthetic style: modify base_spec.json by adding a new field (e.g., runbook_ref), create a Nullify operator for it, update expected_failures.json, and run the full pipeline. Manual creation of a new operator accelerates learning
Use the optional step with Qwen Code as review practice, not as a substitute for runnable verification. Compare its output with your facts from immunity_score.py — discrepancies signal a problem of understanding or tool
Keep a "mutation journal" in a separate file: for each run, record seed, operators, three metrics, unexpected behaviors. This journal will become the foundation of capstone/validation.md and help track learning progress
When transitioning to the production track (threshold calibration, CI gate), refer to Appendix D.1, but only after confidently completing the minimal scenario. Premature immersion in calibration subtleties distracts from understanding basic principles
Group learning: organize a "duel" — one participant changes base_spec.json or fake_validator.py, the other must find which mutation_id broke. This models real Verifier and Implementor interaction
Additional resources: Source code of the stress-mutator course example: book2/examples/stress-mutator/ — runnable example with base_spec.json, mutation scripts, validation, and immunity calculation. Primary source for all practical steps
GitHub Spec Kit: https://github.com/github/spec-kit - the "specification-first" (spec-first) approach underlying the entire chapter. Contract precedes planning and code implementation
Part 9 of Volume 1 (validation facts): ../book/part-09-feature-validation.md — the Given/When/Then fact discipline without which mutations are meaningless. A mutant checks the fact of rejection at the expected step
Part 20 of Volume 1 (SDD antipatterns): ../book/part-20-sdd-antipatterns.md — catalog of classic process errors to which mutation operators are bound
Part 11 (study feature /agents): ../book/part-11-second-feature-phase.md — origin of the study case appointment_latency_spike
Part 12 (MVP and empty review text): ../book/part-12-mvp.md — simplest example of fact discipline: empty review text must be rejected. Here the same logic is generalized to a set of operators
Part 10 (goodhart-validator and ci_gate.py): examples/goodhart-validator/scripts/ci_gate.py — a closely related example of a CI gate with threshold values
Appendix D.1 (threshold calibration): appendix-d-threshold-calibration.md#d1-mutation-testing-chapter-5 — full production track: "Low/Default/High" table, threshold shift exercise, signals for review
JSON Schema documentation: https://json-schema.org/ — reference for formal constraints that are supplemented by graph and temporal checks outside the schema
Cycle detection algorithm in directed graphs (DFS white/gray/black): the three-color depth-first search in CLRS (Cormen, Leiserson, Rivest, Stein), with Tarjan's strongly connected components algorithm as a related classic. Foundation for EscalationCycle and RecursiveDependency
Summary: Mutation testing of specifications transforms the validator from a passive syntax guard into an active anatomical diagnostic tool. Key principles: deterministic mutation factory with fixed seed, strict "one mutant — one expected rejection" rule, vector immunity metric of three components (strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms), linking each mutant to a Given/When/Then step and JSON Schema rule, CI gate with regression blocking, SDD evidence with mutation_id and stack_route trace. The study minimum is running examples/stress-mutator/ with a reproducible manifest.json, passing all mutants through expected rejections, and a green immunity metric. The reviewable trace is recorded in capstone/validation.md as seed, operators, three metrics, and verdict. The practice aims to create a degenerate specification generator for real auto-incident-management projects, where absurd cases become regression protection against toxic requirements and hidden failure cascades.