Study Guide: Applied Part 11. Integration with a Real API: From Specification to Deployment

Lesson 3 of 5 in the module "Applied Part 11. Integration with a Real API: From Specification to Deployment"
Topic: Applied Part 11. Integration with a Real API: From Specification to Deployment

Difficulty: Intermediate

Estimated study time: 8-12 hours (theory 3-4 h, practice 5-8 h)

Prerequisites: Completion of parts 7-9 of the first volume (specification-plan-check cycle)

Part 12 of the first volume (SQLite migrations and MVP phase)

Part 16 of the first volume (team code review)

Basic proficiency in Python 3 and command line

Understanding of REST API, webhooks, and JSON format

Experience with Git and basic CI/CD

Learning objectives: Execute the complete incident processing chain: from raw webhook to normalized event, readiness check, and dry-run with correct return codes for allowed and prohibited actions

Apply SDD phase separation Specify/Plan/Tasks/Implement/Validate for the high_memory_usage incident, ensuring no specific remediation commands in the specification

Evaluate the pipeline using the 25-point readiness model, identify blocking conditions (audit/stateful), and form a correct evidence package for the capstone

Create a reproducible audit trace linking all artifacts to incident_id, suitable for team review without chat history

Overview: This chapter teaches how to build a production-ready auto-remediation pipeline, from the first incident signal to controlled execution. At the center is the educational case high_memory_usage for the appointments-api service: normalization of Grafana and PagerDuty webhooks, passing the readiness gateway using the 25-point model, dry-run of pre-approved actions, and strict blocking of everything else. The entire path is implemented in examples/real-api/ without external dependencies: Python scripts built on the standard library let you run the pipeline locally and see which conditions block an action. The full production track (GitOps, Kubernetes API, full executor) is deferred until replay evidence is accumulated; in the educational minimum, it is sufficient to prove that an allowed action passes readiness, while a prohibited one is blocked before any system change.

Key concepts: SDD cycle (specify/plan/tasks/implement/validate): A structured process where Specify captures WHY/WHAT/constraints without choosing a specific remediation command, Plan selects the strategy, Tasks breaks it down into executable steps, Implement applies changes in a controlled manner, and Validate checks the result. It protects against premature implementation and makes each step provable.

Webhook normalization: Transforming raw payloads from different sources (Grafana, PagerDuty) into a unified incident_event format with fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. Eliminates duplication and version conflicts of a single incident.
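The merge described above can be sketched as follows. This is a minimal illustration, not the chapter's normalize_webhook.py: the raw payload shapes, the alert_id and incident_ref keys, and the helper names are assumptions, and only the incident_event field names come from the text.

```python
import re


def parse_window_minutes(window: str) -> int:
    """Convert a window string such as '10m' or '1h' into minutes."""
    match = re.fullmatch(r"(\d+)([mh])", window)
    if match is None:
        raise ValueError(f"unsupported window format: {window!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "h" else value


def normalize(grafana: dict, pagerduty: dict) -> dict:
    """Merge two raw payloads (hypothetical shapes) into one incident_event."""
    labels = grafana["labels"]
    # Both sources must describe the same service; otherwise they are
    # two different incidents and must not be merged.
    if labels["service"] != pagerduty["service"]:
        raise ValueError("sources disagree on service; refusing to merge")
    return {
        "service": labels["service"],
        "namespace": labels["namespace"],
        "pod": labels["pod"],
        "severity": pagerduty["severity"],
        "window_minutes": parse_window_minutes(grafana["window"]),
        "metric_context": {"memory_percent": grafana["memory_percent"]},
        "source_refs": [grafana["alert_id"], pagerduty["incident_ref"]],
    }


# Hypothetical raw payloads; the real fixtures may be laid out differently.
grafana_raw = {
    "alert_id": "grafana-alert-1",
    "labels": {
        "service": "appointments-api",
        "namespace": "appointments-api",
        "pod": "api-7b4",
    },
    "memory_percent": 93,
    "window": "10m",
}
pagerduty_raw = {
    "incident_ref": "pd-incident-1",
    "service": "appointments-api",
    "severity": "critical",
}

event = normalize(grafana_raw, pagerduty_raw)
print(event)
```

Refusing to merge when the sources disagree keeps two unrelated alerts from collapsing into one event.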

25-point readiness model: Pipeline evaluation across five categories (Spec, Implementation, Verification, Process, Security), each on a 0-5 scale. Threshold 23/25 for auto-admission; 20-22/25 means semi-manual mode. No category compensates for a gap in another. A zero score in Security is prohibited at any total.

Blocking conditions: Factors independent of the readiness total that block merge: failed validation (Verification ≤ 2), missing rollback (Security ≤ 2), undefined blast radius (Implementation ≤ 2 without explicit field). Additionally: audit_trace_coverage < 1.0 and stateful=true without backup_verified=true.
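To make the interplay between the total and the blocking conditions concrete, here is a minimal sketch of a gate function. It is not the chapter's check_readiness.py: the dictionary layout, the blast_radius key, and the treatment of totals below 20 as a block are assumptions; the thresholds and conditions are the ones listed above.

```python
AUTO_THRESHOLD = 23         # auto-admission threshold from the model
SEMI_MANUAL_THRESHOLD = 20  # 20-22 means semi-manual mode
CATEGORIES = ("spec", "implementation", "verification", "process", "security")


def evaluate(readiness: dict) -> tuple:
    """Return (decision, total, reasons) for a readiness record."""
    scores = readiness["scores"]
    total = sum(scores[name] for name in CATEGORIES)
    reasons = []

    # Blocking conditions are checked independently of the total.
    if scores["verification"] <= 2:
        reasons.append("failed validation (Verification <= 2)")
    if scores["security"] <= 2:  # also covers the prohibited zero score
        reasons.append("missing rollback (Security <= 2)")
    if scores["implementation"] <= 2 and "blast_radius" not in readiness:
        reasons.append("undefined blast radius (Implementation <= 2)")
    if readiness.get("audit_trace_coverage", 0.0) < 1.0:
        reasons.append("audit_trace_coverage < 1.0")
    if readiness.get("stateful", False) and not readiness.get("backup_verified", False):
        reasons.append("stateful workload without verified backup")
    # Assumption: a total below the semi-manual band is treated as a block.
    if total < SEMI_MANUAL_THRESHOLD:
        reasons.append(f"score={total}/25 below {SEMI_MANUAL_THRESHOLD}")

    if reasons:
        return "BLOCK", total, reasons
    decision = "PASS" if total >= AUTO_THRESHOLD else "SEMI_MANUAL"
    return decision, total, reasons


passing = {
    "scores": {"spec": 5, "implementation": 5, "verification": 5,
               "process": 5, "security": 4},
    "audit_trace_coverage": 1.0,
    "stateful": False,
}
stateful_without_backup = dict(passing, stateful=True)

print(evaluate(passing))
print(evaluate(stateful_without_backup))
```

The second record has the same 24/25 total as the first, yet it is blocked, which is the point of keeping blockers separate from the sum.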

Dry-run (trial run): Checking an action against a list of pre-approved operations without real system changes. Only executed after passing the readiness gateway. Separates allowed actions (restart_pod, scale_up_replicas_one) from prohibited ones (delete_namespace).

Audit trace (trace log): A causal chain linking all artifacts through incident_id: webhook_received → incident_event_normalized → /sdd:specify → spec_diff_created → commit → validate. A minimal fragment is reproducible without chat history.
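One way to turn this chain into a number is sketched below. The text does not define how audit_trace_coverage is computed, so the formula here (share of expected stages that have a record with the matching incident_id) and the record layout are assumptions made for illustration.

```python
EXPECTED_STAGES = (
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
    "commit",
    "validate",
)


def audit_trace_coverage(records: list, incident_id: str) -> float:
    """Share of expected stages recorded for the given incident_id."""
    seen = {r["stage"] for r in records if r.get("incident_id") == incident_id}
    covered = sum(1 for stage in EXPECTED_STAGES if stage in seen)
    return covered / len(EXPECTED_STAGES)


incident = "HM-2026-05-17-01"
full_trace = [{"stage": s, "incident_id": incident} for s in EXPECTED_STAGES]
# A trace where /sdd:specify happened only in chat and left no record.
gap_trace = [r for r in full_trace if r["stage"] != "/sdd:specify"]

print(audit_trace_coverage(full_trace, incident))
print(audit_trace_coverage(gap_trace, incident))
```

Under this definition a single unrecorded stage is enough to drop coverage below 1.0.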

Blast radius: Explicitly stated intervention level (pod, deployment, namespace). Protects against unintended scope expansion. It must be recorded as an explicit field before Implement, not merely described in prose.

Human-in-the-loop (manual confirmation): A mandatory element of auto-remediation: a human remains in the loop during uncertainty, blast radius expansion, or repeated failure. Fully automated remediation without human review remains a frontier scenario.

Idempotency: A property of tasks where repeated execution does not break system state. A mandatory requirement for maximum score in Implementation.
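A small sketch of what idempotency can look like for a scale-up task, assuming a hypothetical in-memory state and an applied_incidents marker (neither comes from the chapter): the task records which incident it already handled, so a retry does not add a second replica.

```python
def scale_up_once(state: dict, incident_id: str) -> dict:
    """Add one replica per incident; repeated calls for the same incident are no-ops."""
    applied = state.setdefault("applied_incidents", set())
    if incident_id in applied:
        return state  # already applied for this incident: nothing to do
    state["replicas"] += 1
    applied.add(incident_id)
    return state


state = {"replicas": 2}
scale_up_once(state, "HM-2026-05-17-01")
scale_up_once(state, "HM-2026-05-17-01")  # retry of the same task
print(state["replicas"])
```

A naive `replicas += 1` without the marker would scale twice on retry, which is exactly the state drift the Implementation score penalizes.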

GitOps fixation: Recording every change in Git with subsequent synchronization (e.g., via ArgoCD). In the educational minimum it is a reference concept; in the full track it is a mandatory step before production switchover.

Exercises: Name: Normalization and Verification of a Webhook

Problem: Run the normalization script for the Grafana and PagerDuty fixtures. Ensure the output matches the reference field-by-field. Then modify webhook_grafana.json: change memory_percent from 93 to 87 and window from 10m to 5m. Run the normalizer again and explain why the output is still valid but does not match the reference. Which fields are critical for incident identification, and which are for metric context?

Solution:

  1. cd book2/examples/real-api
  2. python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json → expect code 0.
  3. Copy the fixture: cp fixtures/webhook_grafana.json fixtures/webhook_grafana_modified.json
  4. Change memory_percent to 87 and window to 5m in the copy.
  5. Run with the modified fixture: the script returns code 1 because the output does not match incident_event.expected.json.
  6. Analysis: incident_key, service, namespace, pod are critical for identification (they must match across both sources). memory_percent and window_minutes are metric context: they influence specify but do not break normalization as a process. The mismatch with the reference is expected: the reference is fixed for a specific incident, while the modification creates a different event.
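The analysis step can be supported by a field-by-field diff. The sketch below is an illustration rather than part of normalize_webhook.py; the event layout and the diff_fields helper are assumptions. It shows that the modified fixture changes only metric-context fields while the identification fields stay intact.

```python
IDENTITY_FIELDS = ("service", "namespace", "pod")


def flatten(data: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths, e.g. metric_context.memory_percent."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_fields(actual: dict, expected: dict) -> list:
    """Return the sorted list of dotted field paths whose values differ."""
    a, e = flatten(actual), flatten(expected)
    return sorted(k for k in a.keys() | e.keys() if a.get(k) != e.get(k))


expected = {
    "service": "appointments-api",
    "namespace": "appointments-api",
    "pod": "api-7b4",
    "window_minutes": 10,
    "metric_context": {"memory_percent": 93},
}
# Output for the modified fixture: memory_percent 87, window 5m.
actual = dict(expected, window_minutes=5, metric_context={"memory_percent": 87})

changed = diff_fields(actual, expected)
identity_changed = [f for f in changed if f in IDENTITY_FIELDS]
print(changed)
print(identity_changed)
```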

Complexity: beginner

Name: Readiness Gateway Run with Blocker Analysis

Problem: Sequentially run three readiness fixtures: readiness_pass.json, readiness_block_audit.json, readiness_block_stateful.json. For each, record the return code, final total, and specific blocking reason. Then manually create a fourth fixture readiness_block_implementation.json with a total of 24/25 but explicit radius_of_impact: 'cluster' without the field dry_run_verified: true. Predict and verify the gateway behavior.

Solution:

  1. python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json → code 0, PASS incident=HM-2026-05-17-01 score=24/25.
  2. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json → code 1, BLOCK with reasons: score=22/25 below threshold 23 and audit_trace_coverage=0.7 < 1.0.
  3. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json → code 1, BLOCK with reason: stateful workload without verified backup (a total of 24/25 does not save it).
  4. Create fixtures/readiness_block_implementation.json based on readiness_pass.json, change radius_of_impact to 'cluster' and remove dry_run_verified.
  5. Prediction: code 1, blocked due to Implementation ≤ 2 (undefined blast radius without dry-run). This is a blocking condition independent of the total.
  6. Running the gateway confirms the prediction.

Complexity: intermediate

Name: Full Cycle specify → dry-run with Prohibited Action

Problem: Read specs/high_memory_usage/specify.md. Ensure it contains WHY/WHAT/constraints and pre-approved actions, but no specific kubectl command. Run dry_run.py for restart_pod (PASS) and delete_namespace (BLOCK). Then add a third action scale_up_replicas_one to the spec with condition requires: ['stateless', 'hpa_enabled'] and create a fixture where stateless: false. Verify that the new action is blocked by condition, not just by absence from the list.

Solution:

  1. Read specs/high_memory_usage/specify.md: check for WHY, WHAT, constraints, and the list of pre-approved actions: [restart_pod, scale_up_replicas_one].
  2. Ensure there is no line like kubectl delete pod or a specific API call.
  3. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod → code 0, PASS.
  4. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace → code 1, BLOCK: action not in pre-approved list.
  5. Edit the spec: add scale_up_replicas_one with requires: ['stateless', 'hpa_enabled'].
  6. Create a test fixture or pass context via script arguments where stateless=false.
  7. Run dry_run.py --action scale_up_replicas_one with context stateless=false → expect BLOCK: precondition 'stateless' not met.
  8. Conclusion: dry-run checks not only list membership but also runtime conditions, which makes it more reliable than a simple whitelist.
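The difference between list membership and precondition checks can be sketched as below. This is not the chapter's dry_run.py: the in-memory SPEC_ACTIONS mapping stands in for the parsed spec, and the function signature and context dictionary are assumptions.

```python
# Hypothetical in-memory form of the pre-approved actions and their
# `requires` conditions, as they might be parsed from specify.md.
SPEC_ACTIONS = {
    "restart_pod": [],
    "scale_up_replicas_one": ["stateless", "hpa_enabled"],
}


def dry_run(action: str, context: dict) -> tuple:
    """Return (exit_code, message) without changing any system state."""
    if action not in SPEC_ACTIONS:
        return 1, f"BLOCK: action {action!r} not in pre-approved list"
    for condition in SPEC_ACTIONS[action]:
        # A missing context key counts as an unmet precondition.
        if not context.get(condition, False):
            return 1, f"BLOCK: precondition {condition!r} not met"
    return 0, "PASS"


context = {"stateless": False, "hpa_enabled": True}
print(dry_run("restart_pod", context))
print(dry_run("delete_namespace", context))
print(dry_run("scale_up_replicas_one", context))
```

The last two calls both return code 1, but the messages differ: one action is unknown to the spec, the other is known and fails its runtime condition.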

Complexity: intermediate

Name: Filling the 25-Point Rubric for Your Own Case

Problem: Take an incident from your practice (or use high_memory_usage). Fill out the table from the chapter's "Practice" section: for each of the five categories, indicate the score, the specific artifact-evidence (file, log, diagram), and the reason for reduction. Calculate the total and check the blocking conditions. If the total is below 23 or there are blockers, formulate specific changes for each cell rather than a vague "improve the process".

Solution: Example fill for high_memory_usage:

| Category | Score | Artifact-evidence | Reason for reduction |
|---|---|---|---|
| Spec | 5 | specs/high_memory_usage/specify.md with WHY/WHAT/constraints and GWT | None |
| Implementation | 4 | tasks.md with idempotent steps, but scale_up without separate dry-run | One task changes state without prior check |
| Verification | 5 | validation.md with GWT, JSON Schema, stress spec, post-metrics in two windows | None |
| Process | 5 | Log webhook → CLI → diff → commit with incident_id=HM-2026-05-17-01 | None |
| Security | 4 | specify.md with stateful guards, rollback, emergency stop | Escalation described in text without formal trigger |
| Total | 23/25 | Blocking: none | What to change: add dry_run_verified for the scale-up branch in tasks.md; formalize escalation trigger in security_policy.md |

Check: 23 ≥ threshold, no blockers → production-ready with two semi-manual branches.
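The arithmetic behind the check can be reproduced in a few lines. The scores are the ones from the example fill; the variable names are illustrative.

```python
AUTO_THRESHOLD = 23

rubric = {
    "Spec": 5,
    "Implementation": 4,
    "Verification": 5,
    "Process": 5,
    "Security": 4,
}

total = sum(rubric.values())
# Points lost per category: these are the cells that need a concrete change.
reduced = {name: 5 - score for name, score in rubric.items() if score < 5}

print(f"total={total}/25, meets auto threshold: {total >= AUTO_THRESHOLD}")
print(f"categories with reductions: {reduced}")
```

Listing the reduced categories ties each lost point back to a specific cell of the table, which is where the "what to change" entries belong.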

Complexity: advanced

Name: Creating a Reproducible Trace for Capstone

Problem: Execute the minimal runnable cycle (steps 1-8 from the chapter). Record the result in capstone/readiness.md in the YAML format from the chapter. Ensure the file is readable without chat history: each line must point to a specific run, not "I discussed this with an assistant". Add one actually executed command to capstone/validation.md.

Solution:

  1. Execute steps 1-8 from the "Minimal Educational Scenario" section.
  2. Create capstone/readiness.md:
readiness:
  pass_fixture: "readiness_pass.json -> 24/25"
  blockers:
    - "audit_trace_coverage=0.7 blocks auto mode (fixture: readiness_block_audit.json)"
    - "stateful=true without backup_verified blocks action (fixture: readiness_block_stateful.json)"
  dry_run: "restart_pod PASS; delete_namespace BLOCK (scripts/dry_run.py --spec specs/high_memory_usage/specify.md)"
  3. Create capstone/validation.md:
# Readiness Gateway Validation

Actually executed commands:
- `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0
- `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0
  4. Check: the file is readable without chat context and contains specific fixtures, return codes, and commands. It has no references such as "assistant said" or "discussed in chat".

Complexity: intermediate

Case studies: Name: high_memory_usage Incident in the Educational Environment: From Webhook to Blocking of a Prohibited Action

Scenario: The appointments-api service, built in part 12 of the first volume on SQLite, encountered a memory consumption spike. Grafana recorded memory_percent=93 over a 10-minute window for pod api-7b4 in namespace appointments-api. PagerDuty confirmed criticality and service binding. The team deploys an auto-remediation pipeline based on chapter 11 materials.

Challenge: The team faced three key problems: (1) different alert sources (Grafana and PagerDuty) provided different versions of the same incident, creating a risk of double processing; (2) previous specify attempts immediately named kubectl delete pod, which ruled out alternative strategies and pre-empted the Plan phase; (3) the team could not distinguish an incident suitable for auto-remediation from one requiring manual intervention, which led to stateful pods being restarted without backup.

Solution: The team implemented the four-stage pipeline from chapter 11: (1) normalize_webhook.py consolidated Grafana and PagerDuty into a unified incident_event with incident_id=HM-2026-05-17-01, removing sensitive fields; (2) specify was rewritten in WHY/WHAT/constraints format with pre-approved actions (restart_pod, scale_up_replicas_one), without specific commands; (3) the readiness gateway on the 25-point model with threshold 23/25 blocked incidents with incomplete audit (audit_trace_coverage=0.7) and stateful workloads without backup; (4) dry_run.py checked actions against the pre-approved list, blocking delete_namespace and anything else outside the list.

Result: A local run showed that readiness_pass.json (24/25) passes with code 0 and restart_pod gets PASS. readiness_block_audit.json (22/25) and readiness_block_stateful.json (24/25, but without backup) are blocked with specific reasons in stderr. delete_namespace gets BLOCK because it is not in the pre-approved list. The evidence package was recorded in capstone/readiness.md and passed team review in the spirit of part 16.

Lessons learned: Specify must not choose a specific remediation command: doing so pre-empts Plan and removes flexibility

The readiness gateway must be checked before dry-run, not after: the sequence ensures a known blast radius before checking the action list

Blocking conditions (audit, stateful) work independently of the total score: 24/25 does not prevent a stateful workload without backup from being blocked

A reproducible trace linked to incident_id is more important than chat history: the reviewer must check the file, not take it on trust

Related concepts: Webhook normalization

25-point readiness model

SDD cycle

Dry-run

Audit trace

Blocking conditions

Name: Dangerous Antipattern: Running Dry-Run Before Readiness Gateway in a Production Team

Scenario: A team with 2 years of SDD experience tried to accelerate the pipeline by running dry_run.py in parallel with check_readiness.py to "not waste time waiting". In an incident with memory_percent=91, the restart_pod action was formally allowed by the specification, but audit_trace_coverage was 0.8 because the /sdd:specify step had been carried out in chat rather than recorded in a file.

Challenge: Parallel execution created a race condition: dry-run returned PASS in 200 ms, readiness returned BLOCK in 500 ms, but the orchestrator had already started preparing for execution. Without a complete audit trace, the causal link between the webhook and the specification could not be proven, which is critical for post-incident analysis and regulatory inspection.

Solution: An operator who noticed the discrepancy in the logs stopped the incident manually. The team then introduced a strict rule: dry_run.py is launched only after check_readiness.py returns code 0, sequentially, with an explicit check in the wrapper script. It added a guardrail: if audit_trace_coverage < 1.0, readiness returns BLOCK before any other analysis. It also revised the Process rubric: the maximum score of 5 now requires not just the presence of a log but a reproducible replay by incident_id.
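The sequential rule can be sketched as a small wrapper. This is an illustration, not the team's actual wrapper script: run_gated and the injectable run parameter are assumptions, while the two command lines are the ones used in the exercises. A stand-in runner is used in the demo so the sketch can be tried without the scripts present.

```python
import subprocess
from types import SimpleNamespace

READINESS_CMD = ["python3", "scripts/check_readiness.py",
                 "--readiness", "fixtures/readiness_pass.json"]
DRY_RUN_CMD = ["python3", "scripts/dry_run.py",
               "--spec", "specs/high_memory_usage/specify.md",
               "--action", "restart_pod"]


def run_gated(readiness_cmd, dry_run_cmd, run=subprocess.run) -> int:
    """Run the readiness gate first; start dry-run only if the gate returned 0."""
    gate = run(readiness_cmd)
    if gate.returncode != 0:
        return gate.returncode  # blocked: dry-run is never started
    return run(dry_run_cmd).returncode


# Stand-in runner that simulates a readiness BLOCK and records what was launched.
calls = []


def fake_run(cmd):
    calls.append(cmd)
    blocked = "scripts/check_readiness.py" in cmd
    return SimpleNamespace(returncode=1 if blocked else 0)


exit_code = run_gated(READINESS_CMD, DRY_RUN_CMD, run=fake_run)
print(exit_code, len(calls))
```

With the default `run=subprocess.run`, the same function executes the real scripts one after the other; because the second call only happens after the first has returned, the race described in the case cannot occur.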

Result: The pipeline slowed by 300 ms on average, but there were zero incidents of unjustified auto-remediation over 6 months. The Process rubric became the strictest in evaluations: teams with 4 points (requiring manual variable substitution for replay) were not admitted to the auto-environment. The lesson was integrated into the chapter 11 educational material as an explicit "Bad/Good" warning.

Lessons learned: Pipeline performance must never come at the cost of provability: a 300 ms wait is a negligible price for preventing unjustified intervention

Maximum score in Process requires a reproducible replay, not just the presence of logs

Human-in-the-loop is not a bug but a feature: even experienced teams keep a human in the loop on the critical path

Guardrails must be independent of the main flow: audit is checked first, before any risk analysis

Related concepts: Audit trace

Blocking conditions

Process

Human-in-the-loop

Guardrails

Study tips: Go through the chapter sequentially and do not jump to the full production track: the educational minimum (steps 1-8) gives 80% understanding for 20% of the time

Run each script manually and record the return code in notes, do not rely on memory: the capstone reviewer checks specific commands

Create your own "broken" fixtures (as in exercise 2): this is the best way to understand why blocking conditions work independently of the total score

Read specs/high_memory_usage/specify.md before running scripts: find WHY, WHAT, constraints and ensure there are no specific commands, since this is the key mark of a good specify

Use qwen -p with flag --approval-mode plan for the optional analysis step (step from the chapter), but do not mix its output with the mandatory readiness package: plan is for review, scripts are for evidence

Alongside the chapter, re-read parts 7-9 and 16 of the first volume: the chapter 11 pipeline is a wrapper around the already known cycle, not its replacement

For visual learners: draw the chain webhook → normalization → readiness → dry-run on paper and mark where the arrow breaks for each blocker

Additional resources: GitHub Spec Kit (https://github.com/github/spec-kit): practical framework for the Specify → Plan → Tasks → Implement phases referenced by the chapter

GitHub Spec Kit quickstart (https://github.github.io/spec-kit/quickstart.html): brief guide to implementing the SDD cycle

Examples for the chapter (examples/real-api/): Local pipeline on Python stdlib without external dependencies; scripts normalize_webhook.py, check_readiness.py, dry_run.py

Part 12 of the first volume (book/part-12-mvp.md): MVP phase and SQLite migrations — context for the high_memory_usage incident

Part 7 of the first volume (book/part-07-feature-specification.md): Specification cycle — foundation for /sdd:specify

Part 16 of the first volume (book/part-16-team-code-review.md): Team review — where the readiness package goes

Appendix D, section D.5 (appendix-d-threshold-calibration.md#d5-production-readiness-chapter-11): Threshold calibration for readiness for different risk profiles

Summary: Chapter 11 teaches how to build a reproducible auto-remediation pipeline from webhook to controlled execution. The educational minimum is a four-stage chain webhook → normalization → readiness → dry-run for the high_memory_usage case, implemented locally without external dependencies. Key principles: SDD phase separation protects against premature implementation; the 25-point readiness model with threshold 23/25 and independent blocking conditions (audit, stateful, rollback, verification) determines admission to the auto-environment; dry-run checks actions only after the gateway has been passed; human review remains a mandatory element, not a speed obstacle. The full production track (GitOps, Kubernetes, full executor) is a frontier requiring accumulated replay evidence. The practical result of the first pass is not an orchestrator but proof: an allowed action passes, a prohibited one is blocked, and the trace is reproducible by incident_id.

Course: Production SDD for Qwen Code CLI. Part 2