Topic: Applied Part 11. Integration with a Real API: From Specification to Deployment
Difficulty: Intermediate
Estimated study time: 8-12 hours (theory 3-4 h, practice 5-8 h)
Prerequisites: Completion of parts 7-9 of the first volume (specification-plan-check cycle)
Part 12 of the first volume (SQLite migrations and MVP phase)
Part 16 of the first volume (team code review)
Basic proficiency in Python 3 and the command line
Understanding of REST APIs, webhooks, and the JSON format
Experience with Git and basic CI/CD
Learning objectives: Execute the complete incident processing chain: from raw webhook to normalized event, readiness check, and dry-run with correct return codes for allowed and prohibited actions
Apply SDD phase separation Specify/Plan/Tasks/Implement/Validate for the high_memory_usage incident, ensuring no specific remediation commands in the specification
Evaluate the pipeline using the 25-point readiness model, identify blocking conditions (audit/stateful), and form a correct evidence package for the capstone
Create a reproducible audit trace linking all artifacts to incident_id, suitable for team review without chat history
Overview: This chapter teaches how to build a production-ready auto-remediation pipeline, from the first incident signal to controlled execution. At its center is the educational case high_memory_usage for the appointments-api service: normalizing Grafana and PagerDuty webhooks, passing the readiness gateway under the 25-point model, dry-running pre-approved actions, and strictly blocking everything else. The entire path is implemented in examples/real-api/ without external dependencies: Python scripts that use only the standard library let you run the pipeline locally and see which conditions block an action. The full production track (GitOps, Kubernetes API, full executor) is deferred until replay evidence has accumulated; for the educational minimum, it is enough to prove that an allowed action passes readiness while a prohibited one is blocked before any system change.
Key concepts: SDD cycle (specify/plan/tasks/implement/validate): A structured process in which Specify captures WHY/WHAT/constraints without choosing a specific remediation command, Plan selects the strategy, Tasks breaks it down into executable steps, Implement applies changes in a controlled manner, and Validate checks the result. It protects against premature implementation and makes each step provable.
Webhook normalization: Transforming raw payloads from different sources (Grafana, PagerDuty) into a unified incident_event format with fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. Eliminates duplication and version conflicts of a single incident.
25-point readiness model: Pipeline evaluation across five categories (Spec, Implementation, Verification, Process, Security), each scored 0-5. A total of 23/25 or higher admits the pipeline to auto mode; 20-22/25 means semi-manual mode. No category compensates for a gap in another, and a zero in Security is prohibited at any total.
Blocking conditions: Factors that block merge regardless of the readiness total: failed validation (Verification ≤ 2), missing rollback (Security ≤ 2), and undefined blast radius (Implementation ≤ 2 without an explicit field). Two further conditions block on their own: audit_trace_coverage < 1.0, and stateful=true without backup_verified=true.
Dry-run (trial run): Checking an action against a list of pre-approved operations without real system changes. Only executed after passing the readiness gateway. Separates allowed actions (restart_pod, scale_up_replicas_one) from prohibited ones (delete_namespace).
Audit trace (trace log): A causal chain linking all artifacts through incident_id: webhook_received → incident_event_normalized → /sdd:specify → spec_diff_created → commit → validate. A minimal fragment is reproducible without chat history.
Blast radius: The explicitly stated intervention level (pod, deployment, namespace). Protects against unintended scope expansion. It must be recorded as an explicit field before Implement, not merely described in prose.
Human-in-the-loop (manual confirmation): A mandatory element of auto-remediation: a human remains in the loop during uncertainty, blast radius expansion, or repeated failure. Fully automated remediation without human review remains a frontier scenario.
Idempotency: A property of tasks where repeated execution does not break system state. A mandatory requirement for maximum score in Implementation.
GitOps fixation: Recording all changes in Git with subsequent synchronization (e.g., via ArgoCD). In the educational minimum it is a reference concept; in the full track it is a mandatory step before production switchover.
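The thresholds and category rules of the 25-point readiness model can be sketched as a small function. This is a minimal illustration, not the chapter's check_readiness.py: the function name `admission_mode`, the lowercase category keys, and the choice to treat totals below 20 as blocked are assumptions made for the example.

```python
# Hypothetical sketch of the 25-point admission rule (not the chapter's script).
CATEGORIES = ("spec", "implementation", "verification", "process", "security")
AUTO_THRESHOLD = 23         # from the chapter: 23/25 for auto-admission
SEMI_MANUAL_THRESHOLD = 20  # from the chapter: 20-22/25 is semi-manual


def admission_mode(scores: dict) -> str:
    """Return 'auto', 'semi-manual', or 'blocked' for one score card."""
    for category in CATEGORIES:
        value = scores[category]  # raises KeyError if a category is missing
        if not 0 <= value <= 5:
            raise ValueError(f"{category} must be in 0..5, got {value}")

    # Category-level blockers run before the total is computed, so a strong
    # score elsewhere cannot compensate for them.
    if scores["verification"] <= 2:
        return "blocked"  # failed validation
    if scores["security"] <= 2:
        return "blocked"  # missing rollback; also covers a zero in Security

    total = sum(scores[c] for c in CATEGORIES)
    if total >= AUTO_THRESHOLD:
        return "auto"
    if total >= SEMI_MANUAL_THRESHOLD:
        return "semi-manual"
    return "blocked"  # assumption: the chapter names no mode below 20


strong = {"spec": 5, "implementation": 4, "verification": 5, "process": 5, "security": 5}
weak_verification = dict(strong, implementation=5, verification=2)
print(admission_mode(strong))             # total 24
print(admission_mode(weak_verification))  # total 22, but Verification <= 2
```

The second card totals 22, which would otherwise fall into the semi-manual band; the category check stops it first, which is the "no compensation" rule in practice.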
Exercises: Name: Normalization and Verification of a Webhook
Problem: Run the normalization script for the Grafana and PagerDuty fixtures. Ensure the output matches the reference field-by-field. Then modify webhook_grafana.json: change memory_percent from 93 to 87 and window from 10m to 5m. Run the normalizer again and explain why the output is still valid but does not match the reference. Which fields are critical for incident identification, and which are for metric context?
Solution: 1. `cd book2/examples/real-api`
2. `python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json` → expect code 0.
3. Copy the fixture: `cp fixtures/webhook_grafana.json fixtures/webhook_grafana_modified.json`
4. Change `memory_percent` to 87 and `window` to `5m` in the copy.
5. Run with the modified fixture: the script will return code 1 because the output does not match `incident_event.expected.json`.
6. Analysis: `incident_key`, `service`, `namespace`, `pod` are critical for identification (they must match across both sources). `memory_percent` and `window_minutes` are metric context: they influence specify but do not break normalization as a process. The mismatch with the reference is expected: the reference is fixed for a specific incident, while the modification creates a different event.
Complexity: beginner
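The distinction between identification fields and metric context can be seen in a small normalization sketch. The payload shapes below are simplified assumptions and do not reproduce the real Grafana or PagerDuty webhook formats or the chapter's fixtures; `normalize` and `parse_window_minutes` are hypothetical helpers, not the contents of normalize_webhook.py.

```python
import copy

# Hypothetical, simplified payloads; the real fixtures may be shaped differently.
GRAFANA = {
    "service": "appointments-api",
    "namespace": "appointments-api",
    "pod": "api-7b4",
    "memory_percent": 93,
    "window": "10m",
}
PAGERDUTY = {
    "incident_key": "HM-2026-05-17-01",
    "service": "appointments-api",
    "severity": "critical",
}
IDENTITY_FIELDS = ("incident_key", "service", "namespace", "pod")


def parse_window_minutes(window: str) -> int:
    """Turn a string such as '10m' into the integer 10."""
    if not window.endswith("m"):
        raise ValueError(f"unsupported window format: {window!r}")
    return int(window[:-1])


def normalize(grafana: dict, pagerduty: dict) -> dict:
    """Merge both sources into one incident_event (assumed field layout)."""
    if grafana["service"] != pagerduty["service"]:
        raise ValueError("sources disagree on service")
    return {
        "incident_key": pagerduty["incident_key"],
        "service": grafana["service"],
        "namespace": grafana["namespace"],
        "pod": grafana["pod"],
        "severity": pagerduty["severity"],
        "window_minutes": parse_window_minutes(grafana["window"]),
        "metric_context": {"memory_percent": grafana["memory_percent"]},
        "source_refs": ["grafana", "pagerduty"],
    }


reference = normalize(GRAFANA, PAGERDUTY)

# Repeat the exercise: change only the metric context in a copy of the payload.
modified = copy.deepcopy(GRAFANA)
modified.update(memory_percent=87, window="5m")
event = normalize(modified, PAGERDUTY)

same_identity = all(event[f] == reference[f] for f in IDENTITY_FIELDS)
matches_reference = event == reference
print(same_identity, matches_reference)
```

In this sketch the modified event keeps every identification field but differs in `window_minutes` and `metric_context`, so it is a well-formed event that still fails a field-by-field comparison with the reference.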
Name: Readiness Gateway Run with Blocker Analysis
Problem: Sequentially run three readiness fixtures: readiness_pass.json, readiness_block_audit.json, readiness_block_stateful.json. For each, record the return code, final total, and specific blocking reason. Then manually create a fourth fixture readiness_block_implementation.json with a total of 24/25 but explicit radius_of_impact: 'cluster' without the field dry_run_verified: true. Predict and verify the gateway behavior.
Solution: 1. `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0, `PASS incident=HM-2026-05-17-01 score=24/25`.
2. `python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json` → code 1, `BLOCK` with reasons: `score=22/25 below threshold 23` and `audit_trace_coverage=0.7 < 1.0`.
3. `python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json` → code 1, `BLOCK` with reason: `stateful workload without verified backup` (a total of 24/25 does not save it).
4. Create `fixtures/readiness_block_implementation.json` based on `readiness_pass.json`: change `radius_of_impact` to `'cluster'` and remove `dry_run_verified`.
5. Prediction: code 1, blocked by the undefined-blast-radius condition (cluster-level radius without dry-run verification). This blocking condition is independent of the total. The score alone cannot express it here: a total of 24/25 means every category scores at least 4, so `Implementation ≤ 2` cannot hold, and the block has to come from the explicit field check.
6. Verification by running confirms the prediction.
Complexity: intermediate
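The field-level blockers from this exercise can be sketched separately from the score. The fixture keys follow the names used in the exercise, but the exact rule for `radius_of_impact` (block when the radius is missing or outside pod/deployment/namespace and `dry_run_verified` is not true) is an assumed reading, and `field_blockers` and `lowest_possible_category` are hypothetical helpers, not part of check_readiness.py.

```python
KNOWN_RADII = {"pod", "deployment", "namespace"}  # levels named in the chapter


def lowest_possible_category(total: int, categories: int = 5, top: int = 5) -> int:
    """Smallest score a single category can have if all others are at `top`."""
    return max(0, total - (categories - 1) * top)


def field_blockers(fixture: dict) -> list:
    """Collect blocking reasons that do not depend on the readiness total."""
    reasons = []
    coverage = fixture.get("audit_trace_coverage", 0.0)
    if coverage < 1.0:
        reasons.append(f"audit_trace_coverage={coverage} < 1.0")
    if fixture.get("stateful", False) and not fixture.get("backup_verified", False):
        reasons.append("stateful workload without verified backup")
    radius = fixture.get("radius_of_impact")
    # Assumed rule: an unknown or missing radius needs a verified dry-run.
    if radius not in KNOWN_RADII and not fixture.get("dry_run_verified", False):
        reasons.append(f"blast radius {radius!r} not verified by dry-run")
    return reasons


# Hypothetical fourth fixture: high total, cluster radius, no dry_run_verified.
fixture = {
    "total": 24,
    "audit_trace_coverage": 1.0,
    "stateful": False,
    "radius_of_impact": "cluster",
}
print(lowest_possible_category(fixture["total"]))
print(field_blockers(fixture))
```

With a total of 24 no category can be lower than 4, so in this sketch the block comes from the radius field rather than from the Implementation score.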
Name: Full Cycle specify → dry-run with Prohibited Action
Problem: Read specs/high_memory_usage/specify.md. Ensure it contains WHY/WHAT/constraints and pre-approved actions, but no specific kubectl command. Run dry_run.py for restart_pod (PASS) and delete_namespace (BLOCK). Then add a third action scale_up_replicas_one to the spec with condition requires: ['stateless', 'hpa_enabled'] and create a fixture where stateless: false. Verify that the new action is blocked by condition, not just by absence from the list.
Solution: 1. Read `specs/high_memory_usage/specify.md`: check for WHY, WHAT, constraints, and the list of pre-approved actions: `[restart_pod, scale_up_replicas_one]`.
2. Ensure there is no line like `kubectl delete pod` or a specific API call.
3. `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0, `PASS`.
4. `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace` → code 1, `BLOCK: action not in pre-approved list`.
5. Edit the spec: give `scale_up_replicas_one` the condition `requires: ['stateless', 'hpa_enabled']` (add the action itself if your copy of the spec does not list it yet).
6. Create a test fixture or pass context via script arguments where `stateless=false`.
7. Run `dry_run.py --action scale_up_replicas_one` with context `stateless=false` → expect `BLOCK: precondition 'stateless' not met`.
8. Conclusion: dry-run checks not only list membership but also runtime conditions, making it more reliable than a simple whitelist.
Complexity: intermediate
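The difference between "not on the list" and "on the list but precondition not met" can be sketched as follows. This is an illustration under assumptions: the in-memory `PRE_APPROVED` mapping stands in for whatever dry_run.py parses out of specify.md, and the `dry_run` function and its `(code, message)` return shape are hypothetical.

```python
# Hypothetical in-memory form of the pre-approved list with preconditions.
PRE_APPROVED = {
    "restart_pod": [],
    "scale_up_replicas_one": ["stateless", "hpa_enabled"],
}


def dry_run(action: str, context: dict) -> tuple:
    """Return (return_code, message) without touching any system."""
    if action not in PRE_APPROVED:
        return 1, "BLOCK: action not in pre-approved list"
    for condition in PRE_APPROVED[action]:
        # A precondition that is absent from the context counts as not met.
        if not context.get(condition, False):
            return 1, f"BLOCK: precondition '{condition}' not met"
    return 0, "PASS"


context = {"stateless": False, "hpa_enabled": True}
for action in ("restart_pod", "delete_namespace", "scale_up_replicas_one"):
    print(action, dry_run(action, context))
```

The two BLOCK messages differ on purpose: a reviewer reading the log can tell whether the action was never approved or was approved but attempted in the wrong context.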
Name: Filling the 25-Point Rubric for Your Own Case
Problem: Take an incident from your practice (or use high_memory_usage). Fill out the table from the chapter's "Practice" section: for each of the five categories, indicate the score, specific artifact-evidence (file, log, diagram), and reason for reduction. Calculate the total, check blocking conditions. If the total is below 23 or there are blockers — formulate specific changes for each cell, not vague "improve the process".
Solution: Example rubric for high_memory_usage:
| Category | Score | Artifact-evidence | Reason for reduction |
|---|---|---|---|
| Spec | 5 | specs/high_memory_usage/specify.md with WHY/WHAT/constraints and GWT | None |
| Implementation | 4 | tasks.md with idempotent steps, but scale_up without separate dry-run | One task changes state without prior check |
| Verification | 5 | validation.md with GWT, JSON Schema, stress spec, post-metrics in two windows | None |
| Process | 5 | Log webhook → CLI → diff → commit with incident_id=HM-2026-05-17-01 | None |
| Security | 4 | specify.md with stateful guards, rollback, emergency stop | Escalation described in text without formal trigger |
| Total | 23/25 | Blocking: none | What to change: add dry_run_verified for the scale-up branch in tasks.md; formalize escalation trigger in security_policy.md |
Check: 23 ≥ threshold, no blockers → production-ready with two semi-manual branches.
Complexity: advanced
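The arithmetic behind the rubric can be checked mechanically. The sketch below parses a reduced copy of the example table (category, score, reason) and recomputes the total; the three-column layout and the `parse_rubric` helper are assumptions for illustration, not a format the chapter prescribes.

```python
# Reduced copy of the example rubric: category, score, reason for reduction.
RUBRIC = """\
| Category | Score | Reason for reduction |
|---|---|---|
| Spec | 5 | None |
| Implementation | 4 | One task changes state without prior check |
| Verification | 5 | None |
| Process | 5 | None |
| Security | 4 | Escalation described in text without formal trigger |
"""


def parse_rubric(table: str) -> dict:
    """Map each category to (score, reason); skips header and separator rows."""
    rows = {}
    for line in table.strip().splitlines()[2:]:
        category, score, reason = (cell.strip() for cell in line.strip().strip("|").split("|"))
        rows[category] = (int(score), reason)
    return rows


rubric = parse_rubric(RUBRIC)
total = sum(score for score, _ in rubric.values())
reductions = {name: reason for name, (score, reason) in rubric.items() if score < 5}

print(f"total={total}/25")
for name, reason in reductions.items():
    print(f"{name}: {reason}")
```

Listing only the categories below 5 gives the cells that need a concrete change, which is what the exercise asks for instead of a general "improve the process".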
Name: Creating a Reproducible Trace for Capstone
Problem: Execute the minimal runnable cycle (steps 1-8 from the chapter). Record the result in capstone/readiness.md in the YAML format from the chapter. Ensure the file is readable without chat history: each line must point to a specific run, not "I discussed this with an assistant". Add one actually executed command to capstone/validation.md.
Solution: 1. Execute steps 1-8 from the "Minimal Educational Scenario" section.
2. Create `capstone/readiness.md`:
    readiness:
      pass_fixture: "readiness_pass.json -> 24/25"
      blockers:
        - "audit_trace_coverage=0.7 blocks auto mode (fixture: readiness_block_audit.json)"
        - "stateful=true without backup_verified blocks action (fixture: readiness_block_stateful.json)"
      dry_run: "restart_pod PASS; delete_namespace BLOCK (scripts/dry_run.py --spec specs/high_memory_usage/specify.md)"
3. Create `capstone/validation.md`:
    # Readiness Gateway Validation
    Actually executed commands:
    - `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0
    - `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0
4. Check: the file is readable without chat context and contains specific fixtures, return codes, and commands. It has no references to "assistant said" or "discussed in chat".
Complexity: intermediate
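One way to make "reproducible by incident_id" measurable is to compute how much of the expected chain is present in the trace. The chapter does not define how audit_trace_coverage is calculated, so the formula below (share of expected steps recorded for the incident) is an assumption, as are the entry layout and the `audit_trace_coverage` helper; only the step names come from the audit trace chain described in the key concepts.

```python
# Step names from the chapter's audit trace chain.
EXPECTED_STEPS = (
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
    "commit",
    "validate",
)


def audit_trace_coverage(entries: list, incident_id: str) -> float:
    """Assumed definition: fraction of expected steps logged for this incident."""
    seen = {e["step"] for e in entries if e.get("incident_id") == incident_id}
    return sum(step in seen for step in EXPECTED_STEPS) / len(EXPECTED_STEPS)


INCIDENT = "HM-2026-05-17-01"
full_trace = [{"incident_id": INCIDENT, "step": s} for s in EXPECTED_STEPS]

# Same trace, but the specify step happened in chat and was never recorded.
gapped_trace = [e for e in full_trace if e["step"] != "/sdd:specify"]

print(audit_trace_coverage(full_trace, INCIDENT))
print(round(audit_trace_coverage(gapped_trace, INCIDENT), 2))
```

Entries that carry a different incident_id do not count, so a trace assembled from another incident's artifacts cannot raise the coverage.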
Case studies: Name: high_memory_usage Incident in the Educational Environment: From Webhook to Blocking of a Prohibited Action
Scenario: The appointments-api service, built on SQLite in part 12 of the first volume, hit a memory consumption spike. Grafana recorded memory_percent=93 over a 10-minute window for pod api-7b4 in namespace appointments-api. PagerDuty confirmed the severity and the service binding. The team deploys an auto-remediation pipeline based on the chapter 11 materials.
Challenge: Three key problems: (1) different alert sources (Grafana and PagerDuty) provide different versions of the same incident, creating a risk of double processing; (2) earlier specify attempts named kubectl delete pod outright, which ruled out alternative strategies and hindered Plan; (3) the team could not distinguish an incident suitable for auto-remediation from one requiring manual intervention, which led to stateful pods being restarted without a backup.
Solution: The team implemented the four-stage pipeline from chapter 11: (1) normalize_webhook.py consolidated Grafana and PagerDuty into a unified incident_event with incident_id=HM-2026-05-17-01, removing sensitive fields; (2) specify was rewritten in WHY/WHAT/constraints format with pre-approved actions (restart_pod, scale_up_replicas_one) and without specific commands; (3) the readiness gateway, using the 25-point model with threshold 23/25, blocked incidents with an incomplete audit (audit_trace_coverage=0.7) and stateful workloads without backup; (4) dry_run.py checked actions against the pre-approved list, blocking delete_namespace and anything else outside the list.
Result: The local run showed: readiness_pass.json (24/25) passes with code 0, and restart_pod gets PASS. readiness_block_audit.json (22/25) and readiness_block_stateful.json (24/25, but without backup) are blocked with specific reasons in stderr. delete_namespace gets BLOCK because it is not pre-approved. The evidence package was recorded in capstone/readiness.md and passed team review in the spirit of part 16.
Lessons learned: Specify must not choose a specific remediation command: doing so blocks Plan and removes flexibility
The readiness gateway must be checked before dry-run, not after: the sequence ensures a known blast radius before checking the action list
Blocking conditions (audit, stateful) work independently of the total score — 24/25 does not save from blocking stateful without backup
A reproducible trace linked to incident_id is more important than chat history: the reviewer must check the file, not take it on trust
Related concepts: Webhook normalization
25-point readiness model
SDD cycle
Dry-run
Audit trace
Blocking conditions
Name: Dangerous Antipattern: Running Dry-Run Before Readiness Gateway in a Production Team
Scenario: A team with 2 years of SDD experience tried to speed up the pipeline by running dry_run.py in parallel with check_readiness.py to "not waste time waiting". In an incident with memory_percent=91, the restart_pod action was formally allowed by the specification, but audit_trace_coverage was 0.8 because the /sdd:specify step had been done in chat rather than recorded in a file.
Challenge: Parallel execution created a race condition: dry-run returned PASS after 200 ms and readiness returned BLOCK after 500 ms, but by then the orchestrator had already started preparing for execution. Without a complete audit trace, the causal link between the webhook and the specification could not be proven, which is critical for post-incident analysis and regulatory inspection.
Solution: An operator who noticed the discrepancy in the logs stopped the incident manually. The team introduced a strict rule: dry_run.py is launched only after check_readiness.py returns code 0, sequentially, with an explicit check in the wrapper script. They added a guardrail: if audit_trace_coverage < 1.0, readiness returns BLOCK before any other analysis. They also revised the Process rubric: the maximum score of 5 now requires not just the presence of a log but a reproducible replay by incident_id.
Result: The pipeline slowed by 300 ms on average, but there were zero incidents of unjustified auto-remediation over 6 months. The Process rubric became the strictest in evaluations: teams with 4 points (requiring manual variable substitution for replay) were not admitted to the auto-environment. The lesson was integrated into the chapter 11 educational material as an explicit "Bad/Good" warning.
Lessons learned: Pipeline performance must never come at the cost of provability: a 300 ms wait is a negligible price for preventing unjustified intervention
Maximum score in Process requires a reproducible replay, not just the presence of logs
Human-in-the-loop is not a bug but a feature: even experienced teams keep a human in the loop on the critical path
Guardrails must be independent of the main flow: audit is checked first, before any risk analysis
Related concepts: Audit trace
Blocking conditions
Process
Human-in-the-loop
Guardrails
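The sequential rule from this case study (dry-run starts only after readiness returns code 0) can be sketched as a thin wrapper. This is a minimal illustration, not the team's actual wrapper script: `run_gated` is a hypothetical name, and the commands are passed in as argument lists so the sketch does not depend on where the chapter's scripts live.

```python
import subprocess
import sys


def run_gated(readiness_cmd: list, dry_run_cmd: list) -> int:
    """Run readiness first; start dry-run only if readiness returned code 0."""
    readiness = subprocess.run(readiness_cmd)
    if readiness.returncode != 0:
        # The dry-run process is never started, so there is no race to lose.
        return readiness.returncode
    return subprocess.run(dry_run_cmd).returncode


if __name__ == "__main__":
    # Commands as used in the chapter's exercises; run from examples/real-api.
    code = run_gated(
        ["python3", "scripts/check_readiness.py",
         "--readiness", "fixtures/readiness_pass.json"],
        ["python3", "scripts/dry_run.py",
         "--spec", "specs/high_memory_usage/specify.md",
         "--action", "restart_pod"],
    )
    sys.exit(code)
```

Because the second process is created only inside the success branch, a BLOCK from readiness cannot be overtaken by a faster PASS from dry-run, which is the failure the parallel setup allowed.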
Study tips: Go through the chapter sequentially and do not jump ahead to the full production track: the educational minimum (steps 1-8) gives about 80% of the understanding for 20% of the time
Run each script manually and record the return code in your notes rather than relying on memory: the capstone reviewer checks specific commands
Create your own "broken" fixtures (as in exercise 2) — this is the best way to understand why blocking conditions work independently of the total score
Read specs/high_memory_usage/specify.md before running scripts: find WHY, WHAT, constraints and ensure there are no specific commands — this is the key distinction of a good specify
Use qwen -p with the flag --approval-mode plan for the optional analysis step described in the chapter, but do not mix its output with the mandatory readiness package: the plan is for review, the scripts are for evidence
Alongside the chapter, re-read parts 7-9 and 16 of the first volume: the chapter 11 pipeline is a wrapper around the already known cycle, not its replacement
For visual learners: draw the chain webhook → normalization → readiness → dry-run on paper and mark where the arrow breaks for each blocker
Additional resources: GitHub Spec Kit: https://github.com/github/spec-kit (practical framework for the Specify → Plan → Tasks → Implement phases referenced by the chapter)
GitHub Spec Kit quickstart: https://github.github.io/spec-kit/quickstart.html (brief guide to implementing the SDD cycle)
Examples for the chapter (examples/real-api/): Local pipeline on Python stdlib without external dependencies; scripts normalize_webhook.py, check_readiness.py, dry_run.py
Part 12 of the first volume (book/part-12-mvp.md): MVP phase and SQLite migrations — context for the high_memory_usage incident
Part 7 of the first volume (book/part-07-feature-specification.md): Specification cycle — foundation for /sdd:specify
Part 16 of the first volume (book/part-16-team-code-review.md): Team review — where the readiness package goes
Appendix D, section D.5 (appendix-d-threshold-calibration.md#d5-production-readiness-chapter-11): Threshold calibration for readiness for different risk profiles
Summary: Chapter 11 teaches how to build a reproducible auto-remediation pipeline from webhook to controlled execution. The educational minimum is a four-stage chain webhook → normalization → readiness → dry-run for the high_memory_usage case, implemented locally without external dependencies. Key principles: SDD phase separation protects against premature implementation; the 25-point readiness model with threshold 23/25 and independent blocking conditions (audit, stateful, rollback, verification) determines admission to the auto-environment; dry-run checks actions only after the gateway has been passed; human review remains a mandatory element, not a speed obstacle. The full production track (GitOps, Kubernetes, full executor) is a frontier that requires accumulated replay evidence. The practical result of the first pass is not an orchestrator but proof: an allowed action passes, a prohibited one is blocked, and the trace is reproducible by incident_id.