Topic: Applied Part 11. Integration with a Real API: from Specification to Deployment
Difficulty level: Medium
Estimated study time: 8-12 hours (theory 3-4 h, practice 5-8 h)
Prerequisites: Completion of parts 7-9 of the first volume (specification-plan-check cycle)
Part 12 of the first volume (SQLite migrations and MVP phase)
Part 16 of the first volume (team code review)
Basic proficiency in Python 3 and command line
Understanding of REST API, webhooks, and JSON format
Experience with Git and basic CI/CD
Learning objectives: Execute the full incident processing chain: from raw webhook to normalized event, readiness check, and dry-run with correct return codes for allowed and forbidden actions
Apply SDD phase separation Specify/Plan/Tasks/Implement/Validate for the high_memory_usage incident, ensuring no specific remediation commands appear in the specification
Evaluate the pipeline using the 25-point readiness model, identify blocking conditions (audit/stateful), and assemble a correct evidence package for the capstone
Produce a reproducible audit trace that links all artifacts to incident_id and is suitable for team review without chat history
Overview: This chapter teaches building a production-ready auto-remediation pipeline from the first incident signal to controlled execution. At the center is the educational case high_memory_usage for the appointments-api service: normalization of Grafana and PagerDuty webhooks, passing the readiness gateway using the 25-point model, dry-run of pre-approved actions, and strict blocking of everything else. The entire path is implemented in examples/real-api/ without external dependencies: standard library Python scripts allow running the pipeline locally and seeing which conditions block action. The full production track (GitOps, Kubernetes API, full executor) is deferred until replay evidence is accumulated; in the educational minimum, it is sufficient to prove that an allowed action passes readiness, while a forbidden one is blocked from changing the system.
Key concepts: SDD cycle (specify/plan/tasks/implement/validate): A structured process where Specify captures WHY/WHAT/constraints without choosing a specific remediation command, Plan selects a strategy, Tasks decomposes it into executable steps, Implement applies changes in a controlled manner, Validate checks the result. Protects against premature implementation and ensures provability of each step.
Webhook normalization: Transforming raw payloads from different sources (Grafana, PagerDuty) into a unified incident_event format with fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. Eliminates duplication and version conflicts of a single incident.
25-point readiness model: Pipeline evaluation across five categories (Spec, Implementation, Verification, Process, Security) on a 0-5 scale each. Threshold of 23/25 for auto-admission; 20-22/25 for semi-manual mode. No category compensates for a gap in another. Zero score in Security is forbidden at any total.
Blocking conditions: Factors independent of the readiness total that block merge: failed validation (Verification ≤ 2), missing rollback (Security ≤ 2), undefined blast radius (Implementation ≤ 2 without explicit field). Additionally: audit_trace_coverage < 1.0 and stateful=true without backup_verified=true.
Dry-run: Checking an action against a list of pre-approved operations without making real system changes. Executed only after passing the readiness gateway. Distinguishes allowed actions (restart_pod, scale_up_replicas_one) from forbidden ones (delete_namespace).
Audit trace: A causal chain linking all artifacts through incident_id: webhook_received → incident_event_normalized → /sdd:specify → spec_diff_created → commit → validate. A minimal fragment is reproducible without chat history.
Blast radius: Explicitly stated intervention level (pod, deployment, namespace). Protects against unintended expansion of operation scope. Must be fixed before implement, not described in prose.
Human-in-the-loop (manual confirmation): A mandatory element of auto-remediation: a human remains in the loop during uncertainty, blast radius expansion, or repeated failure. Fully automated remediation without human review remains a frontier scenario.
Idempotency: A property of tasks where repeated execution does not break system state. A mandatory requirement for maximum Implementation score.
GitOps fixation: Recording all changes in Git with subsequent synchronization (e.g., via ArgoCD). In the educational minimum it is a reference concept; in the full track it is a mandatory step before production switchover.
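The threshold and blocking rules from the key concepts above can be sketched as one small function. This is an illustrative sketch, not the chapter's `check_readiness.py`: the function name `evaluate_readiness`, the input layout (a `scores` dict plus separate flags), and the message texts are assumptions made for the example. "Missing rollback" is modeled here simply as Security ≤ 2, following the wording of the blocking conditions.

```python
CATEGORIES = ("spec", "implementation", "verification", "process", "security")
AUTO_THRESHOLD = 23         # 23-25 -> auto-admission
SEMI_MANUAL_THRESHOLD = 20  # 20-22 -> semi-manual mode


def evaluate_readiness(scores, audit_trace_coverage, stateful=False,
                       backup_verified=False, blast_radius=None):
    """Return (mode, total, reasons); mode is 'auto', 'semi-manual' or 'block'.

    Hypothetical input layout: the real fixtures may be structured differently.
    """
    for name in CATEGORIES:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name} score must be in 0..5")
    total = sum(scores[name] for name in CATEGORIES)

    reasons = []
    # Blocking conditions are evaluated independently of the total.
    if audit_trace_coverage < 1.0:
        reasons.append(f"audit_trace_coverage={audit_trace_coverage} < 1.0")
    if stateful and not backup_verified:
        reasons.append("stateful workload without verified backup")
    if scores["verification"] <= 2:
        reasons.append("failed validation (Verification <= 2)")
    if scores["security"] <= 2:
        reasons.append("missing rollback (Security <= 2)")
    if scores["implementation"] <= 2 and blast_radius is None:
        reasons.append("undefined blast radius (Implementation <= 2)")
    if total < SEMI_MANUAL_THRESHOLD:
        reasons.append(f"score={total}/25 below semi-manual threshold 20")

    if reasons:
        return "block", total, reasons
    mode = "auto" if total >= AUTO_THRESHOLD else "semi-manual"
    return mode, total, reasons


example_scores = {"spec": 5, "implementation": 5, "verification": 5,
                  "process": 5, "security": 4}
print(evaluate_readiness(example_scores, audit_trace_coverage=1.0))
print(evaluate_readiness(example_scores, 1.0, stateful=True))
```

The second call keeps the same 24/25 total but is blocked, which illustrates why no category score or total can compensate for a blocking condition.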
Practice exercises: Name: Webhook Normalization and Verification
Problem: Run the normalization script for Grafana and PagerDuty fixtures. Ensure the output matches the reference field-for-field. Then modify webhook_grafana.json: change memory_percent from 93 to 87 and window from 10m to 5m. Run the normalizer again and explain why the output is still valid but does not match the reference. Which fields are critical for incident identification, and which are for metric context?
Solution: 1. `cd book2/examples/real-api`
2. `python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json` → expect exit code 0.
3. Copy the fixture: `cp fixtures/webhook_grafana.json fixtures/webhook_grafana_modified.json`
4. Change `memory_percent` to 87 and `window` to `5m` in the copy.
5. Run with the modified fixture: the script will return code 1, as the output does not match `incident_event.expected.json`.
6. Analysis: `incident_key`, `service`, `namespace`, `pod` are critical for identification (must match across both sources). `memory_percent`, `window_minutes` are metric context; they influence specify but do not break normalization as a process. Mismatch with the reference is expected: the reference is fixed for a specific incident, while the modification creates a different event.
Complexity: beginner
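The split between identification fields and metric context in the exercise above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the chapter's `normalize_webhook.py`: the flat payload shapes in `GRAFANA` and `PAGERDUTY`, the `severity` value, and the function names are hypothetical, and real Grafana and PagerDuty webhooks are nested differently.

```python
import re

# Hypothetical, simplified payloads; real webhook bodies have a different shape.
GRAFANA = {
    "incident_key": "HM-2026-05-17-01",
    "service": "appointments-api",
    "namespace": "appointments-api",
    "pod": "api-7b4",
    "memory_percent": 93,
    "window": "10m",
}
PAGERDUTY = {
    "incident_key": "HM-2026-05-17-01",
    "service": "appointments-api",
    "severity": "critical",  # assumed value
}


def parse_window_minutes(window):
    """Convert a window string such as '10m' or '1h' to whole minutes."""
    match = re.fullmatch(r"(\d+)([mh])", window.strip())
    if match is None:
        raise ValueError(f"unsupported window format: {window!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "h" else value


def normalize(grafana, pagerduty):
    """Merge both payloads into one incident_event dict."""
    # Identification fields present in both sources must agree.
    for field in ("incident_key", "service"):
        if grafana[field] != pagerduty[field]:
            raise ValueError(f"sources disagree on {field}")
    return {
        "incident_key": grafana["incident_key"],
        "service": grafana["service"],
        "namespace": grafana["namespace"],
        "pod": grafana["pod"],
        "severity": pagerduty["severity"],
        "window_minutes": parse_window_minutes(grafana["window"]),
        "metric_context": {"memory_percent": grafana["memory_percent"]},
        "source_refs": ["grafana", "pagerduty"],
    }


reference = normalize(GRAFANA, PAGERDUTY)
modified = normalize({**GRAFANA, "memory_percent": 87, "window": "5m"}, PAGERDUTY)
# Same incident identity, different metric context: valid, but not the reference.
print(modified["incident_key"] == reference["incident_key"], modified == reference)
```

In this sketch the modified payload still normalizes cleanly because only metric context changed; a comparison against a fixed reference event fails for exactly that reason.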
Name: Readiness Gateway Run with Blocker Analysis
Problem: Sequentially run three readiness fixtures: readiness_pass.json, readiness_block_audit.json, readiness_block_stateful.json. For each, record the return code, final total, and specific blocking reason. Then manually create a fourth fixture readiness_block_implementation.json with a total of 24/25 but explicit radius_of_impact: 'cluster' without the dry_run_verified: true field. Predict and verify the gateway behavior.
Solution: 1. `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0, `PASS incident=HM-2026-05-17-01 score=24/25`.
2. `python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json` → code 1, `BLOCK` with reasons: `score=22/25 below threshold 23` and `audit_trace_coverage=0.7 < 1.0`.
3. `python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json` → code 1, `BLOCK` with reason: `stateful workload without verified backup` (a total of 24/25 does not save it).
4. Create `fixtures/readiness_block_implementation.json` based on `readiness_pass.json`, change `radius_of_impact` to `'cluster'` and remove `dry_run_verified`.
5. Prediction: code 1, blocked by `Implementation ≤ 2` (undefined blast radius without dry-run); this is a blocking condition independent of the total.
6. Verification by running confirms the prediction.
Complexity: intermediate
Name: Full Cycle specify → dry-run with Forbidden Action
Problem: Read specs/high_memory_usage/specify.md. Ensure it contains WHY/WHAT/constraints and pre-approved actions, but no specific kubectl command. Run dry_run.py for restart_pod (PASS) and delete_namespace (BLOCK). Then add a third action scale_up_replicas_one to the spec with condition requires: ['stateless', 'hpa_enabled'] and create a fixture where stateless: false. Verify that the new action is blocked by the condition, not just by absence from the list.
Solution: 1. Read `specs/high_memory_usage/specify.md`: verify the presence of WHY, WHAT, constraints, and the list of pre-approved actions: [restart_pod, scale_up_replicas_one].
2. Ensure there is no line like `kubectl delete pod` or a specific API call.
3. `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0, `PASS`.
4. `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace` → code 1, `BLOCK: action not in pre-approved list`.
5. Edit the spec: add `scale_up_replicas_one` with `requires: ['stateless', 'hpa_enabled']`.
6. Create a test fixture or pass context through script arguments where `stateless=false`.
7. Run `dry_run.py --action scale_up_replicas_one` with context `stateless=false` → expect `BLOCK: precondition 'stateless' not met`.
8. Conclusion: dry-run checks not only list membership but also runtime conditions, making it more robust than a simple whitelist.
Complexity: intermediate
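The two-level check from the exercise above (list membership first, then preconditions) could look roughly like this. It is a sketch under assumptions: the real `dry_run.py` reads the pre-approved list from `specify.md`, whereas here it is a hard-coded dict named `PRE_APPROVED`, and the `context` argument is a hypothetical stand-in for the fixture or script arguments.

```python
# Hypothetical in-memory form of the pre-approved list:
# action name -> list of context flags that must be true.
PRE_APPROVED = {
    "restart_pod": [],
    "scale_up_replicas_one": ["stateless", "hpa_enabled"],
}


def dry_run(action, context):
    """Return (exit_code, message) without changing any system state."""
    if action not in PRE_APPROVED:
        return 1, "BLOCK: action not in pre-approved list"
    for requirement in PRE_APPROVED[action]:
        # A missing flag is treated the same as an explicit False.
        if not context.get(requirement, False):
            return 1, f"BLOCK: precondition '{requirement}' not met"
    return 0, "PASS"


print(dry_run("restart_pod", {}))
print(dry_run("delete_namespace", {}))
print(dry_run("scale_up_replicas_one", {"stateless": False, "hpa_enabled": True}))
```

Treating a missing flag as `False` is a deliberate choice in this sketch: an action with unmet or unknown preconditions is blocked rather than allowed by default.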
Name: Filling the 25-Point Rubric for Your Own Case
Problem: Take an incident from your practice (or use high_memory_usage). Fill out the table from the chapter's "Practice" section: for each of the five categories, state the score, specific artifact-evidence (file, log, diagram), and reason for reduction. Calculate the total, check blocking conditions. If the total is below 23 or there are blockers—formulate specific changes for each cell, not vague "improve process" statements.
Solution: Example fill for high_memory_usage:
| Category | Score | Artifact-evidence | Reason for reduction |
|---|---|---|---|
| Spec | 5 | specs/high_memory_usage/specify.md with WHY/WHAT/constraints and GWT | None |
| Implementation | 4 | tasks.md with idempotent steps, but scale_up without separate dry-run | One task changes state without prior verification |
| Verification | 5 | validation.md with GWT, JSON Schema, stress spec, post-metrics in two windows | None |
| Process | 5 | Log webhook → CLI → diff → commit with incident_id=HM-2026-05-17-01 | None |
| Security | 4 | specify.md with stateful guards, rollback, emergency stop | Escalation described in prose without formal trigger |
| Total | 23/25 | Blocking: none | What to change: add dry_run_verified for the scale-up branch in tasks.md; formalize escalation trigger in security_policy.md |
Check: 23 ≥ threshold, no blockers → production-ready with two semi-manual branches.
Complexity: advanced
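The arithmetic behind the example table can be checked mechanically. The sketch below copies the scores and reduction reasons from the table rows; the `RUBRIC` layout and the `summarize` helper are illustrative names, not part of the chapter's scripts.

```python
# Rows from the example table: (category, score, reason for reduction or None).
RUBRIC = [
    ("Spec", 5, None),
    ("Implementation", 4, "One task changes state without prior verification"),
    ("Verification", 5, None),
    ("Process", 5, None),
    ("Security", 4, "Escalation described in prose without formal trigger"),
]


def summarize(rubric, threshold=23):
    """Return the total, whether it meets the threshold, and the cells to fix."""
    total = sum(score for _, score, _ in rubric)
    gaps = [(name, reason) for name, score, reason in rubric if score < 5]
    return {"total": total, "meets_threshold": total >= threshold, "gaps": gaps}


summary = summarize(RUBRIC)
print(f"{summary['total']}/25, meets threshold: {summary['meets_threshold']}")
for name, reason in summary["gaps"]:
    print(f"- {name}: {reason}")
```

Listing the gaps per category keeps the follow-up concrete: each reduced cell maps to one specific change rather than a general "improve process" note.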
Name: Creating a Reproducible Trace for Capstone
Problem: Execute the minimal runnable cycle (steps 1-8 from the chapter). Record the result in capstone/readiness.md in the YAML format from the chapter. Ensure the file is readable without chat history: each line must point to a specific run, not to "I discussed this with an assistant". Add one actually executed command to capstone/validation.md.
Solution: 1. Execute steps 1-8 from the "Minimal Educational Scenario" section.
2. Create `capstone/readiness.md`:
    readiness:
      pass_fixture: "readiness_pass.json -> 24/25"
      blockers:
        - "audit_trace_coverage=0.7 blocks auto mode (fixture: readiness_block_audit.json)"
        - "stateful=true without backup_verified blocks action (fixture: readiness_block_stateful.json)"
      dry_run: "restart_pod PASS; delete_namespace BLOCK (scripts/dry_run.py --spec specs/high_memory_usage/specify.md)"
3. Create `capstone/validation.md`:
    # Readiness gateway validation
    Actually executed commands:
    - `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0
    - `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0
4. Verify: the file is readable without chat context and contains specific fixtures, return codes, and commands. It has no references to "the assistant said" or "we discussed in chat".
Complexity: intermediate
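A trace that is "reproducible by incident_id" can be checked by counting which steps of the audit chain have a record for that incident. The chapter does not define how `audit_trace_coverage` is computed, so the formula below (share of expected steps that have a matching record) is an assumption, as are the record layout and the function name.

```python
# Steps of the audit chain as listed in the key concepts.
EXPECTED_STEPS = (
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
    "commit",
    "validate",
)


def audit_trace_coverage(records, incident_id):
    """Fraction of expected steps that have a record for this incident_id.

    Assumed definition; the chapter's scripts may compute coverage differently.
    """
    seen = {r["step"] for r in records if r.get("incident_id") == incident_id}
    return sum(step in seen for step in EXPECTED_STEPS) / len(EXPECTED_STEPS)


incident = "HM-2026-05-17-01"
full_trace = [{"step": step, "incident_id": incident} for step in EXPECTED_STEPS]
# Simulate a /sdd:specify step that happened in chat and was never recorded.
partial_trace = [r for r in full_trace if r["step"] != "/sdd:specify"]

print(audit_trace_coverage(full_trace, incident))
print(audit_trace_coverage(partial_trace, incident) < 1.0)
```

Under this definition a single unrecorded step is enough to drop coverage below 1.0, which is the condition the readiness gateway blocks on.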
Case studies: Name: high_memory_usage Incident in the Educational Environment: from Webhook to Blocking Forbidden Action
Scenario: The appointments-api service, built in part 12 of the first volume on SQLite, encountered a memory consumption spike. Grafana recorded memory_percent=93 over a 10-minute window for pod api-7b4 in namespace appointments-api. PagerDuty confirmed criticality and service binding. The team deploys an auto-remediation pipeline based on chapter 11 materials.
Challenge: Three key problems: (1) different alert sources (Grafana and PagerDuty) provide different versions of the same incident, creating a risk of double processing; (2) previous specify attempts immediately stated kubectl delete pod, blocking alternative strategies and interfering with Plan; (3) the team could not distinguish an incident eligible for auto-remediation from one requiring manual intervention, leading to restarting stateful pods without backup.
Solution: The team implemented a four-stage pipeline from chapter 11: (1) normalize_webhook.py consolidated Grafana and PagerDuty into a unified incident_event with incident_id=HM-2026-05-17-01, removing sensitive fields; (2) specify was rewritten in WHY/WHAT/constraints format with pre-approved actions (restart_pod, scale_up_replicas_one), without specific commands; (3) the readiness gateway on the 25-point model with a 23/25 threshold blocked incidents with incomplete audit (audit_trace_coverage=0.7) and stateful without backup; (4) dry_run.py checked actions against the pre-approved list, blocking delete_namespace and others outside the list.
Result: Local run showed: readiness_pass.json (24/25) passes with code 0, restart_pod gets PASS. readiness_block_audit.json (22/25) and readiness_block_stateful.json (24/25 but without backup) are blocked with specific reasons in stderr. delete_namespace gets BLOCK as not in pre-approved. The evidence package was recorded in capstone/readiness.md and passed team review in the spirit of part 16.
Lessons learned: Specify must not choose a specific remediation command: doing so blocks Plan and removes flexibility
The readiness gateway must be checked before dry-run, not after: this order ensures the blast radius is known before the action list is checked
Blocking conditions (audit, stateful) work independently of the total score: a 24/25 total does not lift the block on a stateful workload without backup
Reproducible trace with incident_id binding is more important than chat history: the reviewer must check the file, not take it on trust
Related concepts: Webhook normalization
25-point readiness model
SDD cycle
Dry-run
Audit trace
Blocking conditions
Name: Dangerous Antipattern: Launching Dry-Run Before Readiness Gateway in a Production Team
Scenario: A team with 2 years of SDD experience attempted to accelerate the pipeline by running dry_run.py in parallel with check_readiness.py to "not waste time waiting". In an incident with memory_percent=91, the action restart_pod was formally allowed by the specification, but audit_trace_coverage was 0.8 because the /sdd:specify step had been carried out in chat rather than recorded in a file.
Challenge: Parallel execution created a race condition: dry-run returned PASS in 200 ms, readiness returned BLOCK in 500 ms, but the orchestrator had already begun execution preparation. The incomplete audit trace meant the causal link between the webhook and the specification was not provable—critical for post-incident analysis and regulatory audit.
Solution: An operator who noticed the discrepancy in the logs stopped the incident manually. The team then introduced a strict rule: dry_run.py runs only after check_readiness.py exits with code 0, sequentially, with an explicit check in the wrapper script. They also added a guardrail: if audit_trace_coverage < 1.0, readiness returns BLOCK before any other analysis. Finally, they revised the Process rubric: the maximum score of 5 now requires not just a log but a reproducible replay by incident_id.
Result: The pipeline slowed by 300 ms on average, but there were zero incidents of unjustified auto-remediation over 6 months. The Process rubric became the strictest in evaluations: teams scoring 4 points (replay requires manual variable substitution) were not admitted to the auto-environment. The lesson was integrated into chapter 11's educational material as an explicit "Bad/Good" warning.
Lessons learned: Pipeline performance must never sacrifice provability: 300 ms wait is a trivial price for preventing unjustified intervention
Maximum Process score requires reproducible replay, not just having logs
Human-in-the-loop is not a bug but a feature: even experienced teams keep a human in the loop on the critical path
Guardrails must be independent of the main flow: audit is checked first, before any risk analysis
Related concepts: Audit trace
Blocking conditions
Process
Human-in-the-loop
Guardrails
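The sequential rule from this case study (dry-run starts only after readiness exits with code 0) can be sketched as a small wrapper. This is an illustrative sketch, not the team's actual wrapper script: the function names and the injectable `run` parameter are assumptions, added so the ordering logic can be exercised without the real scripts.

```python
import subprocess
import sys


def run_command(argv):
    """Run a command, wait for it to finish, and return its exit code."""
    return subprocess.run(argv).returncode


def gated_dry_run(readiness_cmd, dry_run_cmd, run=run_command):
    """Run dry-run strictly after readiness, and only if readiness exits 0."""
    readiness_code = run(readiness_cmd)
    if readiness_code != 0:
        # Readiness blocked: dry-run is never started, so no race is possible.
        print("readiness BLOCK: dry-run skipped", file=sys.stderr)
        return readiness_code
    return run(dry_run_cmd)
```

From the `examples/real-api` directory, a call could pass `[sys.executable, "scripts/check_readiness.py", "--readiness", "fixtures/readiness_pass.json"]` as `readiness_cmd` and `[sys.executable, "scripts/dry_run.py", "--spec", "specs/high_memory_usage/specify.md", "--action", "restart_pod"]` as `dry_run_cmd`, then hand the returned code to `sys.exit`. Because the second command is launched only after the first has returned, the orchestrator never sees a dry-run PASS that precedes a readiness BLOCK.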
Study tips: Go through the chapter sequentially, do not jump to the full production track: the educational minimum (steps 1-8) gives 80% understanding for 20% of the time
Run each script manually and record the return code in your notes rather than relying on memory: the capstone reviewer checks specific commands
Create your own "broken" fixtures (as in exercise 2)—this is the best way to understand why blocking conditions work independently of the total score
Read specs/high_memory_usage/specify.md before running scripts: find WHY, WHAT, constraints and ensure there are no specific commands—this is the key distinction of good specify
Use qwen -p with the --approval-mode plan flag for the optional analysis step (from the chapter), but do not mix its output with the mandatory readiness package: plan is for review, scripts are for evidence
In parallel with the chapter, reread parts 7-9 and 16 of the first volume: the chapter 11 pipeline is a wrapper around the already known cycle, not its replacement
For visual learners: draw the chain webhook → normalization → readiness → dry-run on paper and mark where the arrow breaks for each blocker
Additional resources: Github spec kit: https://github.com/github/spec-kit — practical framework for Specify → Plan → Tasks → Implement phases, referenced by the chapter
Github spec kit quickstart: https://github.github.io/spec-kit/quickstart.html — brief guide to implementing the SDD cycle
Chapter examples (examples/real-api/): Local pipeline on stdlib Python without external dependencies; scripts normalize_webhook.py, check_readiness.py, dry_run.py
Part 12 of the first volume (book/part-12-mvp.md): MVP phase and SQLite migrations—context for the high_memory_usage incident
Part 7 of the first volume (book/part-07-feature-specification.md): Specification cycle—foundation for /sdd:specify
Part 16 of the first volume (book/part-16-team-code-review.md): Team review—where the readiness package goes
Appendix D, section D.5 (appendix-d-threshold-calibration.md#d5-production-readiness-chapter-11): Threshold calibration for readiness for different risk profiles
Summary: Chapter 11 teaches building a reproducible auto-remediation pipeline from webhook to controlled execution. The educational minimum is a four-stage chain webhook → normalization → readiness → dry-run for the high_memory_usage case, implemented locally without external dependencies. Key principles: SDD phase separation protects against premature implementation; the 25-point readiness model with a 23/25 threshold and independent blocking conditions (audit, stateful, rollback, verification) determines admission to the auto-environment; dry-run checks actions only after passing the gateway; human review remains a mandatory element, not a speed obstacle. The full production track (GitOps, Kubernetes, full executor) is a frontier requiring accumulated replay evidence. The practical result of the first pass is not an orchestrator but proof: an allowed action passes, a forbidden one is blocked, the trace is reproducible by incident_id.