Study guide: Applied Part 11. Integration with a Real API: From Specification to Deployment

Lesson 3 of 5 in module «Applied Part 11. Integration with a Real API: From Specification to Deployment»

Topic: Applied Part 11. Integration with a Real API: from Specification to Deployment

Difficulty level: Medium

Estimated study time: 8-12 hours (theory 3-4 h, practice 5-8 h)

Prerequisites: Completion of parts 7-9 of the first volume (specification-plan-check cycle)

Part 12 of the first volume (SQLite migrations and MVP phase)

Part 16 of the first volume (team code review)

Basic proficiency in Python 3 and command line

Understanding of REST API, webhooks, and JSON format

Experience with Git and basic CI/CD

Learning objectives: Execute the full incident processing chain: from raw webhook to normalized event, readiness check, and dry-run with correct return codes for allowed and forbidden actions

Apply SDD phase separation Specify/Plan/Tasks/Implement/Validate for the high_memory_usage incident, ensuring no specific remediation commands appear in the specification

Evaluate the pipeline using the 25-point readiness model, identify blocking conditions (audit/stateful), and form a correct evidence package for capstone

Form a reproducible audit trace linking all artifacts to incident_id, suitable for team review without chat history

Overview: This chapter teaches how to build a production-ready auto-remediation pipeline, from the first incident signal to controlled execution. At its center is the educational case high_memory_usage for the appointments-api service: normalizing Grafana and PagerDuty webhooks, passing the readiness gateway under the 25-point model, dry-running pre-approved actions, and strictly blocking everything else. The entire path is implemented in examples/real-api/ without external dependencies: Python scripts that use only the standard library let you run the pipeline locally and see which conditions block an action. The full production track (GitOps, Kubernetes API, full executor) is deferred until replay evidence has accumulated; for the educational minimum, it is sufficient to prove that an allowed action passes readiness, while a forbidden one is blocked from changing the system.

Key concepts: SDD cycle (specify/plan/tasks/implement/validate): A structured process where Specify captures WHY/WHAT/constraints without choosing a specific remediation command, Plan selects a strategy, Tasks decomposes it into executable steps, Implement applies changes in a controlled manner, Validate checks the result. Protects against premature implementation and ensures provability of each step.

Webhook normalization: Transforming raw payloads from different sources (Grafana, PagerDuty) into a unified incident_event format with fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. Eliminates duplication and version conflicts of a single incident.
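
The merge described above can be sketched roughly as follows. This is a minimal illustration, not the chapter's normalize_webhook.py: the raw payload layouts (GRAFANA, PAGERDUTY), the helper names, and the mapping from PagerDuty urgency to severity are all assumptions made for the example; only the output field names come from the definition above.

```python
import json

# Hypothetical raw payloads; the real fixture layouts may differ.
GRAFANA = {
    "labels": {"service": "appointments-api", "namespace": "appointments-api", "pod": "api-7b4"},
    "values": {"memory_percent": 93},
    "window": "10m",
    "alert_url": "https://grafana.example/alert/1",
}
PAGERDUTY = {
    "incident": {"id": "PD-1", "urgency": "high", "service": "appointments-api"},
}


def parse_window_minutes(window: str) -> int:
    """Convert a window such as '10m' into whole minutes (assumed format)."""
    if not window.endswith("m"):
        raise ValueError(f"unsupported window format: {window!r}")
    return int(window[:-1])


def normalize(grafana: dict, pagerduty: dict) -> dict:
    """Merge both payloads into one incident_event with the fields named above."""
    labels = grafana["labels"]
    pd_incident = pagerduty["incident"]
    # Both sources must describe the same service, otherwise they are two incidents.
    if labels["service"] != pd_incident["service"]:
        raise ValueError("sources disagree on service; refusing to merge")
    return {
        "service": labels["service"],
        "namespace": labels["namespace"],
        "pod": labels["pod"],
        # Assumption: 'high' urgency maps to 'critical', anything else to 'warning'.
        "severity": "critical" if pd_incident["urgency"] == "high" else "warning",
        "window_minutes": parse_window_minutes(grafana["window"]),
        "metric_context": {"memory_percent": grafana["values"]["memory_percent"]},
        "source_refs": {"grafana": grafana["alert_url"], "pagerduty": pd_incident["id"]},
    }


event = normalize(GRAFANA, PAGERDUTY)
print(json.dumps(event, indent=2, sort_keys=True))
```

Refusing to merge when the sources disagree on the service is one possible design choice: it keeps two unrelated alerts from being collapsed into a single incident.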

25-point readiness model: Pipeline evaluation across five categories (Spec, Implementation, Verification, Process, Security) on a 0-5 scale each. Threshold of 23/25 for auto-admission; 20-22/25 for semi-manual mode. No category compensates for a gap in another. Zero score in Security is forbidden at any total.

Blocking conditions: Factors independent of the readiness total that block merge: failed validation (Verification ≤ 2), missing rollback (Security ≤ 2), undefined blast radius (Implementation ≤ 2 without explicit field). Additionally: audit_trace_coverage < 1.0 and stateful=true without backup_verified=true.
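
A rough sketch of how the threshold and the independent blocking conditions could combine is shown below. It is not the chapter's check_readiness.py: the dictionary layout (scores, blast_radius, audit_trace_coverage, stateful, backup_verified) and the function name are assumptions for illustration, while the numeric rules follow the two definitions above.

```python
CATEGORIES = ("spec", "implementation", "verification", "process", "security")
AUTO_THRESHOLD = 23  # auto-admission threshold from the readiness model


def blocking_reasons(readiness: dict) -> list:
    """Return every reason to block; an empty list means PASS."""
    scores = readiness["scores"]
    reasons = []
    total = sum(scores[c] for c in CATEGORIES)
    if total < AUTO_THRESHOLD:
        reasons.append(f"score={total}/25 below threshold {AUTO_THRESHOLD}")
    # The checks below apply regardless of the total.
    if scores["verification"] <= 2:
        reasons.append("failed validation (verification <= 2)")
    if scores["security"] <= 2:  # also covers the forbidden zero score
        reasons.append("missing rollback (security <= 2)")
    if scores["implementation"] <= 2 and not readiness.get("blast_radius"):
        reasons.append("undefined blast radius (implementation <= 2)")
    coverage = readiness.get("audit_trace_coverage", 0.0)
    if coverage < 1.0:
        reasons.append(f"audit_trace_coverage={coverage} < 1.0")
    if readiness.get("stateful") and not readiness.get("backup_verified"):
        reasons.append("stateful workload without verified backup")
    return reasons


# Hypothetical fixtures shaped after the three cases used in this guide.
PASS = {
    "scores": {"spec": 5, "implementation": 5, "verification": 5, "process": 5, "security": 4},
    "blast_radius": "pod",
    "audit_trace_coverage": 1.0,
    "stateful": False,
    "backup_verified": False,
}
BLOCK_AUDIT = dict(
    PASS,
    scores={"spec": 5, "implementation": 4, "verification": 5, "process": 4, "security": 4},
    audit_trace_coverage=0.7,
)
BLOCK_STATEFUL = dict(PASS, stateful=True)

for name, fixture in (("pass", PASS), ("audit", BLOCK_AUDIT), ("stateful", BLOCK_STATEFUL)):
    reasons = blocking_reasons(fixture)
    print(name, "PASS" if not reasons else f"BLOCK: {reasons}")
```

Collecting all reasons instead of stopping at the first one makes the output usable as review evidence: the stateful fixture scores 24/25 and is still blocked, with exactly one reason listed.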

Dry-run: Checking an action against a list of pre-approved operations without making real system changes. Executed only after passing the readiness gateway. Distinguishes allowed actions (restart_pod, scale_up_replicas_one) from forbidden ones (delete_namespace).

Audit trace: A causal chain linking all artifacts through incident_id: webhook_received → incident_event_normalized → /sdd:specify → spec_diff_created → commit → validate. A minimal fragment is reproducible without chat history.
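
One way to picture audit_trace_coverage is as the share of chain steps that have a record bound to the incident. The sketch below assumes exactly that definition, together with a hypothetical record shape ({"step": ..., "incident_id": ...}); the guide does not state how the real pipeline computes the value.

```python
# The six steps of the chain named in the definition above.
REQUIRED_STEPS = (
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
    "commit",
    "validate",
)


def audit_trace_coverage(records: list, incident_id: str) -> float:
    """Fraction of required steps that have a record bound to incident_id."""
    seen = {r["step"] for r in records if r.get("incident_id") == incident_id}
    return sum(step in seen for step in REQUIRED_STEPS) / len(REQUIRED_STEPS)


INCIDENT = "HM-2026-05-17-01"
full_trace = [{"step": s, "incident_id": INCIDENT} for s in REQUIRED_STEPS]
# Same trace, but the specify step lives only in chat and was never recorded.
partial_trace = [r for r in full_trace if r["step"] != "/sdd:specify"]

print(audit_trace_coverage(full_trace, INCIDENT))
print(round(audit_trace_coverage(partial_trace, INCIDENT), 2))
```

Under this assumed definition, a single unrecorded step is enough to push coverage below 1.0 and therefore to trigger the blocking condition.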

Blast radius: Explicitly stated intervention level (pod, deployment, namespace). Protects against unintended expansion of operation scope. Must be fixed before implement, not described in prose.
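
Because the levels are ordered, a declared blast radius can be checked mechanically rather than read from prose. The sketch below is an illustration under assumptions: the ACTION_SCOPE mapping from action to level is hypothetical and is not taken from the chapter's scripts.

```python
from enum import IntEnum


class BlastRadius(IntEnum):
    """Intervention levels, ordered from narrowest to widest."""
    POD = 1
    DEPLOYMENT = 2
    NAMESPACE = 3


# Hypothetical mapping from an action to the widest level it can touch.
ACTION_SCOPE = {
    "restart_pod": BlastRadius.POD,
    "scale_up_replicas_one": BlastRadius.DEPLOYMENT,
    "delete_namespace": BlastRadius.NAMESPACE,
}


def within_radius(action: str, declared: BlastRadius) -> bool:
    """True if the action stays inside the declared intervention level."""
    return ACTION_SCOPE[action] <= declared


declared = BlastRadius["POD"]  # e.g. parsed from an explicit 'pod' field
for action in ACTION_SCOPE:
    print(action, within_radius(action, declared))
```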

Human-in-the-loop (manual confirmation): A mandatory element of auto-remediation: a human remains in the loop during uncertainty, blast radius expansion, or repeated failure. Fully automated remediation without human review remains a frontier scenario.

Idempotency: A property of tasks where repeated execution does not break system state. A mandatory requirement for maximum Implementation score.
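
As an illustration, a task such as scale_up_replicas_one is not idempotent if it blindly adds a replica on every run. A minimal sketch of one way to make it safe to repeat, assuming an in-memory state dictionary and a hypothetical applied_incidents ledger keyed by incident_id:

```python
def scale_up_once(state: dict, incident_id: str) -> dict:
    """Add one replica at most once per incident_id."""
    applied = state.setdefault("applied_incidents", [])
    if incident_id in applied:
        return state  # already applied for this incident: repeating is a no-op
    state["replicas"] += 1
    applied.append(incident_id)
    return state


state = {"replicas": 2}
scale_up_once(state, "HM-2026-05-17-01")
scale_up_once(state, "HM-2026-05-17-01")  # retry after a timeout, for example
print(state["replicas"])
```

The second call leaves the replica count unchanged, which is the property the Implementation rubric asks for.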

GitOps fixation: Fixing all changes through Git with subsequent synchronization (e.g., via ArgoCD). In the educational minimum—a reference concept; in the full track—a mandatory step before production switchover.

Practice exercises: Name: Webhook Normalization and Verification

Problem: Run the normalization script for Grafana and PagerDuty fixtures. Ensure the output matches the reference field-for-field. Then modify webhook_grafana.json: change memory_percent from 93 to 87 and window from 10m to 5m. Run the normalizer again and explain why the output is still valid but does not match the reference. Which fields are critical for incident identification, and which are for metric context?

Solution:

  1. cd book2/examples/real-api
  2. python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json → expect exit code 0.
  3. Copy the fixture: cp fixtures/webhook_grafana.json fixtures/webhook_grafana_modified.json
  4. Change memory_percent to 87 and window to 5m in the copy.
  5. Run with the modified fixture: the script will return code 1, as the output does not match incident_event.expected.json.
  6. Analysis: incident_key, service, namespace, pod are critical for identification (must match across both sources). memory_percent, window_minutes are metric context; they influence specify but do not break normalization as a process. Mismatch with the reference is expected: the reference is fixed for a specific incident, while the modification creates a different event.
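
The distinction drawn in the analysis can be made explicit with a small comparison helper. This is a sketch under assumptions: the event dictionaries are hypothetical stand-ins for the normalizer output and the reference file, and classify_diff is not part of the chapter's scripts.

```python
# Fields the analysis above treats as identifying the incident.
IDENTITY_FIELDS = ("incident_key", "service", "namespace", "pod")


def classify_diff(actual: dict, expected: dict) -> dict:
    """Split the differing fields into identification and metric context."""
    changed = sorted(
        key for key in actual.keys() | expected.keys()
        if actual.get(key) != expected.get(key)
    )
    return {
        "identity": [k for k in changed if k in IDENTITY_FIELDS],
        "context": [k for k in changed if k not in IDENTITY_FIELDS],
    }


EXPECTED = {
    "service": "appointments-api",
    "namespace": "appointments-api",
    "pod": "api-7b4",
    "severity": "critical",
    "window_minutes": 10,
    "metric_context": {"memory_percent": 93},
}
# The modification from the exercise: 93 -> 87 and 10m -> 5m.
MODIFIED = dict(EXPECTED, window_minutes=5, metric_context={"memory_percent": 87})

result = classify_diff(MODIFIED, EXPECTED)
print(result)
```

An empty identity list with a non-empty context list is the situation the exercise describes: the event is still well-formed for the same service and pod, but it no longer equals the fixed reference.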

Complexity: beginner

Name: Readiness Gateway Run with Blocker Analysis

Problem: Sequentially run three readiness fixtures: readiness_pass.json, readiness_block_audit.json, readiness_block_stateful.json. For each, record the return code, final total, and specific blocking reason. Then manually create a fourth fixture readiness_block_implementation.json with a total of 24/25 but explicit radius_of_impact: 'cluster' without the dry_run_verified: true field. Predict and verify the gateway behavior.

Solution:

  1. python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json → code 0, PASS incident=HM-2026-05-17-01 score=24/25.
  2. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json → code 1, BLOCK with reasons: score=22/25 below threshold 23 and audit_trace_coverage=0.7 < 1.0.
  3. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json → code 1, BLOCK with reason: stateful workload without verified backup (a total of 24/25 does not override the block).
  4. Create fixtures/readiness_block_implementation.json based on readiness_pass.json, change radius_of_impact to 'cluster' and remove dry_run_verified.
  5. Prediction: code 1, blocked by Implementation ≤ 2 (undefined blast radius without dry-run): this is a blocking condition independent of the total.
  6. Verification by running confirms the prediction.
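
Deriving the fourth fixture by script rather than by hand keeps the change reproducible. The sketch below assumes only the two field names mentioned in the exercise (radius_of_impact, dry_run_verified); the stand-in base fixture written to a temporary directory is hypothetical, and in the chapter's repository you would point the function at fixtures/readiness_pass.json instead.

```python
import json
import tempfile
from pathlib import Path


def derive_block_fixture(base_path: Path, out_path: Path) -> dict:
    """Copy a readiness fixture, widen the radius, drop the dry-run flag."""
    fixture = json.loads(base_path.read_text(encoding="utf-8"))
    fixture["radius_of_impact"] = "cluster"
    fixture.pop("dry_run_verified", None)  # absent is fine, no error
    out_path.write_text(json.dumps(fixture, indent=2) + "\n", encoding="utf-8")
    return fixture


# Demonstration against a stand-in base fixture in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    base_path = Path(tmp) / "readiness_pass.json"
    base_path.write_text(
        json.dumps({
            "incident_id": "HM-2026-05-17-01",
            "radius_of_impact": "pod",
            "dry_run_verified": True,
        }),
        encoding="utf-8",
    )
    out_path = Path(tmp) / "readiness_block_implementation.json"
    derived = derive_block_fixture(base_path, out_path)
    written = json.loads(out_path.read_text(encoding="utf-8"))

print(derived)
```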

Complexity: intermediate

Name: Full Cycle specify → dry-run with Forbidden Action

Problem: Read specs/high_memory_usage/specify.md. Ensure it contains WHY/WHAT/constraints and pre-approved actions, but no specific kubectl command. Run dry_run.py for restart_pod (PASS) and delete_namespace (BLOCK). Then add a third action scale_up_replicas_one to the spec with condition requires: ['stateless', 'hpa_enabled'] and create a fixture where stateless: false. Verify that the new action is blocked by the condition, not just by absence from the list.

Solution:

  1. Read specs/high_memory_usage/specify.md: verify presence of WHY, WHAT, constraints, list of pre-approved actions: [restart_pod, scale_up_replicas_one].
  2. Ensure there is no line like kubectl delete pod or a specific API call.
  3. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod → code 0, PASS.
  4. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace → code 1, BLOCK: action not in pre-approved list.
  5. Edit the spec: add scale_up_replicas_one with requires: ['stateless', 'hpa_enabled'].
  6. Create a test fixture or pass context through script arguments where stateless=false.
  7. Run dry_run.py --action scale_up_replicas_one with context stateless=false → expect BLOCK: precondition 'stateless' not met.
  8. Conclusion: dry-run checks not only list membership but also runtime conditions, making it more robust than a simple whitelist.
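
The two-level check from the conclusion (list membership first, then preconditions) can be sketched as below. This is an illustration only: the real dry_run.py reads the list from specify.md, whereas here the pre-approved actions are hard-coded in a hypothetical PRE_APPROVED dictionary and the context is a plain dict.

```python
# Hypothetical in-memory form of the pre-approved list and its conditions.
PRE_APPROVED = {
    "restart_pod": [],
    "scale_up_replicas_one": ["stateless", "hpa_enabled"],
}


def dry_run(action: str, context: dict) -> tuple:
    """Return (exit_code, message) without changing any system state."""
    if action not in PRE_APPROVED:
        return 1, f"BLOCK: action '{action}' not in pre-approved list"
    for condition in PRE_APPROVED[action]:
        if not context.get(condition, False):
            return 1, f"BLOCK: precondition '{condition}' not met"
    return 0, f"PASS: {action}"


stateful_context = {"stateless": False, "hpa_enabled": True}
for action in ("restart_pod", "delete_namespace", "scale_up_replicas_one"):
    print(dry_run(action, stateful_context))
```

A missing context key is treated as an unmet condition, so an incomplete fixture blocks rather than passes.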

Complexity: intermediate

Name: Filling the 25-Point Rubric for Your Own Case

Problem: Take an incident from your practice (or use high_memory_usage). Fill out the table from the chapter's "Practice" section: for each of the five categories, state the score, specific artifact-evidence (file, log, diagram), and reason for reduction. Calculate the total, check blocking conditions. If the total is below 23 or there are blockers—formulate specific changes for each cell, not vague "improve process" statements.

Solution: Example fill for high_memory_usage:

| Category | Score | Artifact-evidence | Reason for reduction |
|---|---|---|---|
| Spec | 5 | specs/high_memory_usage/specify.md with WHY/WHAT/constraints and GWT | None |
| Implementation | 4 | tasks.md with idempotent steps, but scale_up without separate dry-run | One task changes state without prior verification |
| Verification | 5 | validation.md with GWT, JSON Schema, stress spec, post-metrics in two windows | None |
| Process | 5 | Log webhook → CLI → diff → commit with incident_id=HM-2026-05-17-01 | None |
| Security | 4 | specify.md with stateful guards, rollback, emergency stop | Escalation described in prose without formal trigger |
| Total | 23/25 | Blocking: none | What to change: add dry_run_verified for the scale-up branch in tasks.md; formalize escalation trigger in security_policy.md |

Check: 23 ≥ threshold, no blockers → production-ready with two semi-manual branches.
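
The arithmetic and the completeness of the table can be checked mechanically. The sketch below is a hypothetical helper, not part of the chapter's scripts; the row tuples abbreviate the example above, and the "manual" label for totals below 20 is an assumption, since the guide names only the auto (23 and above) and semi-manual (20-22) bands.

```python
RUBRIC = [
    # (category, score, artifact-evidence, reason for reduction)
    ("Spec", 5, "specs/high_memory_usage/specify.md", None),
    ("Implementation", 4, "tasks.md", "One task changes state without prior verification"),
    ("Verification", 5, "validation.md", None),
    ("Process", 5, "log bound to incident_id=HM-2026-05-17-01", None),
    ("Security", 4, "specify.md", "Escalation described in prose without formal trigger"),
]


def review_rubric(rows: list) -> tuple:
    """Return (total, mode, problems) for a filled rubric."""
    problems = []
    for category, score, evidence, reason in rows:
        if not 0 <= score <= 5:
            problems.append(f"{category}: score {score} outside 0-5")
        if not evidence:
            problems.append(f"{category}: no artifact-evidence")
        if score < 5 and not reason:
            problems.append(f"{category}: reduced score without a reason")
    total = sum(score for _, score, _, _ in rows)
    if total >= 23:
        mode = "auto"
    elif total >= 20:
        mode = "semi-manual"
    else:
        mode = "manual"  # assumption: the guide does not name this band
    return total, mode, problems


total, mode, problems = review_rubric(RUBRIC)
print(f"{total}/25 -> {mode}; problems: {problems}")
```

The helper covers only the total and the completeness of each cell; blocking conditions still have to be checked separately, because they apply regardless of the total.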

Complexity: advanced

Name: Creating a Reproducible Trace for Capstone

Problem: Execute the minimal runnable cycle (steps 1-8 from the chapter). Record the result in capstone/readiness.md in the YAML format from the chapter. Ensure the file is readable without chat history: each line must point to a specific run, not to "I discussed this with an assistant". Add one actually executed command to capstone/validation.md.

Solution:

  1. Execute steps 1-8 from the "Minimal Educational Scenario" section.
  2. Create capstone/readiness.md:

     readiness:
       pass_fixture: "readiness_pass.json -> 24/25"
       blockers:
         - "audit_trace_coverage=0.7 blocks auto mode (fixture: readiness_block_audit.json)"
         - "stateful=true without backup_verified blocks action (fixture: readiness_block_stateful.json)"
       dry_run: "restart_pod PASS; delete_namespace BLOCK (scripts/dry_run.py --spec specs/high_memory_usage/specify.md)"

  3. Create capstone/validation.md:

     # Readiness gateway validation

     Actually executed commands:
     - `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` → code 0
     - `python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod` → code 0

  4. Verify: the file is readable without chat context, contains specific fixtures, return codes, commands. No references to "the assistant said" or "we discussed in chat".
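
The final verification step can be partly automated with a simple text check. This is a heuristic sketch, not a tool from the chapter: the patterns in CHAT_PATTERNS are assumptions about what a chat reference looks like, and a clean result does not replace reading the file.

```python
import re

# Heuristic markers of "trust me" evidence; extend for your own wording.
CHAT_PATTERNS = (r"\bassistant\b", r"\bchat\b")


def chat_references(text: str) -> list:
    """Return lines that lean on chat history instead of a recorded run."""
    return [
        line.strip()
        for line in text.splitlines()
        if any(re.search(p, line, re.IGNORECASE) for p in CHAT_PATTERNS)
    ]


GOOD = "- `python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json` -> code 0"
BAD = "- readiness passed, as we discussed in chat"

print(chat_references(GOOD))
print(chat_references(GOOD + "\n" + BAD))
```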

Complexity: intermediate

Case studies: Name: high_memory_usage Incident in the Educational Environment: from Webhook to Blocking Forbidden Action

Scenario: The appointments-api service, built in part 12 of the first volume on SQLite, encountered a memory consumption spike. Grafana recorded memory_percent=93 over a 10-minute window for pod api-7b4 in namespace appointments-api. PagerDuty confirmed criticality and service binding. The team deploys an auto-remediation pipeline based on chapter 11 materials.

Challenge: Three key problems: (1) different alert sources (Grafana and PagerDuty) provide different versions of the same incident, creating a risk of double processing; (2) earlier specify drafts hard-coded kubectl delete pod, which ruled out alternative strategies and pre-empted the Plan phase; (3) the team could not distinguish an incident eligible for auto-remediation from one requiring manual intervention, which led to stateful pods being restarted without a backup.

Solution: The team implemented a four-stage pipeline from chapter 11: (1) normalize_webhook.py consolidated Grafana and PagerDuty into a unified incident_event with incident_id=HM-2026-05-17-01, removing sensitive fields; (2) specify was rewritten in WHY/WHAT/constraints format with pre-approved actions (restart_pod, scale_up_replicas_one), without specific commands; (3) the readiness gateway on the 25-point model with a 23/25 threshold blocked incidents with incomplete audit (audit_trace_coverage=0.7) and stateful without backup; (4) dry_run.py checked actions against the pre-approved list, blocking delete_namespace and others outside the list.

Result: The local run showed that readiness_pass.json (24/25) passes with code 0 and restart_pod gets PASS. readiness_block_audit.json (22/25) and readiness_block_stateful.json (24/25 but without backup) are blocked, with specific reasons written to stderr. delete_namespace gets BLOCK because it is not in the pre-approved list. The evidence package was recorded in capstone/readiness.md and passed team review in the spirit of part 16.

Lessons learned: Specify must not choose a specific remediation command—this blocks Plan and removes flexibility

The readiness gateway must be checked before dry-run, not after: this order ensures the blast radius is known before the action list is checked

Blocking conditions (audit, stateful) work independently of the total score: a total of 24/25 does not prevent a stateful workload without a verified backup from being blocked

Reproducible trace with incident_id binding is more important than chat history: the reviewer must check the file, not take it on trust

Related concepts: Webhook normalization

25-point readiness model

SDD cycle

Dry-run

Audit trace

Blocking conditions

Name: Dangerous Antipattern: Launching Dry-Run Before Readiness Gateway in a Production Team

Scenario: A team with 2 years of SDD experience attempted to accelerate the pipeline by running dry_run.py in parallel with check_readiness.py to "not waste time waiting". In an incident with memory_percent=91, the action restart_pod was formally allowed by the specification, but audit_trace_coverage was 0.8 because the /sdd:specify step had been carried out in chat rather than recorded in a file.

Challenge: Parallel execution created a race condition: dry-run returned PASS in 200 ms, readiness returned BLOCK in 500 ms, but the orchestrator had already begun execution preparation. The incomplete audit trace meant the causal link between the webhook and the specification was not provable—critical for post-incident analysis and regulatory audit.

Solution: The incident was stopped manually by an operator who noticed the discrepancy in the logs. The team introduced a strict rule: dry_run.py runs only after check_readiness.py returns exit code 0, sequentially, with an explicit check in the wrapper script. The team also added a guardrail: if audit_trace_coverage < 1.0, readiness returns BLOCK before any other analysis. Finally, the Process rubric was revised: the maximum score of 5 now requires not just a log but a reproducible replay by incident_id.
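
The sequential rule can be sketched as a small wrapper. This is a minimal illustration, not the team's actual wrapper script: gate_then_dry_run and its injectable run parameter are hypothetical, and the demonstration uses stand-in commands so that it runs without the chapter's scripts.

```python
import subprocess
import sys


def gate_then_dry_run(readiness_cmd, dry_run_cmd, run=subprocess.run):
    """Run the readiness check first; start dry-run only on exit code 0."""
    gate = run(readiness_cmd)
    if gate.returncode != 0:
        return gate.returncode  # BLOCK: the dry-run process is never started
    return run(dry_run_cmd).returncode


# Stand-in commands that only set an exit code, so the sketch runs anywhere.
blocked_gate = [sys.executable, "-c", "raise SystemExit(1)"]
passing_gate = [sys.executable, "-c", "raise SystemExit(0)"]
dry_run_ok = [sys.executable, "-c", "raise SystemExit(0)"]

print(gate_then_dry_run(blocked_gate, dry_run_ok))  # readiness blocks
print(gate_then_dry_run(passing_gate, dry_run_ok))  # readiness passes
```

In the chapter's repository the two commands would be the check_readiness.py and dry_run.py invocations shown in the exercises. Because the second process is started only after the first one has exited, the race described in this case cannot occur.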

Result: The pipeline slowed by 300 ms on average, but the team recorded zero incidents of unjustified auto-remediation over the following 6 months. The Process rubric became the strictest in evaluations: teams with 4 points (requiring manual variable substitution for replay) were not admitted to the auto-environment. The lesson was integrated into chapter 11's educational material as an explicit "Bad/Good" warning.

Lessons learned: Provability must never be sacrificed for pipeline performance: a 300 ms wait is a trivial price for preventing unjustified intervention

Maximum Process score requires reproducible replay, not just having logs

Human-in-the-loop is not a bug but a feature: even experienced teams keep a human in the loop on the critical path

Guardrails must be independent of the main flow: audit is checked first, before any risk analysis

Related concepts: Audit trace

Blocking conditions

Process

Human-in-the-loop

Guardrails

Study tips: Go through the chapter sequentially and do not jump ahead to the full production track: the educational minimum (steps 1-8) delivers most of the understanding for a fraction of the time

Run each script manually and record the return code in your notes rather than relying on memory: the capstone reviewer checks specific commands

Create your own "broken" fixtures (as in exercise 2)—this is the best way to understand why blocking conditions work independently of the total score

Read specs/high_memory_usage/specify.md before running scripts: find WHY, WHAT, constraints and ensure there are no specific commands—this is the key distinction of good specify

Use qwen -p with the --approval-mode plan flag for the optional analysis step (from the chapter), but do not mix its output with the mandatory readiness package: plan is for review, scripts are for evidence

In parallel with the chapter, reread parts 7-9 and 16 of the first volume: the chapter 11 pipeline is a wrapper around the already known cycle, not its replacement

For visual learners: draw the chain webhook → normalization → readiness → dry-run on paper and mark where the arrow breaks for each blocker

Additional resources: GitHub Spec Kit: https://github.com/github/spec-kit (practical framework for the Specify → Plan → Tasks → Implement phases, referenced by the chapter)

GitHub Spec Kit quickstart: https://github.github.io/spec-kit/quickstart.html (brief guide to implementing the SDD cycle)

Chapter examples (examples/real-api/): Local pipeline on stdlib Python without external dependencies; scripts normalize_webhook.py, check_readiness.py, dry_run.py

Part 12 of the first volume (book/part-12-mvp.md): MVP phase and SQLite migrations—context for the high_memory_usage incident

Part 7 of the first volume (book/part-07-feature-specification.md): Specification cycle—foundation for /sdd:specify

Part 16 of the first volume (book/part-16-team-code-review.md): Team review—where the readiness package goes

Appendix D, section D.5 (appendix-d-threshold-calibration.md#d5-production-readiness-chapter-11): Threshold calibration for readiness for different risk profiles

Summary: Chapter 11 teaches building a reproducible auto-remediation pipeline from webhook to controlled execution. The educational minimum is a four-stage chain webhook → normalization → readiness → dry-run for the high_memory_usage case, implemented locally without external dependencies. Key principles: SDD phase separation protects against premature implementation; the 25-point readiness model with a 23/25 threshold and independent blocking conditions (audit, stateful, rollback, verification) determines admission to the auto-environment; dry-run checks actions only after passing the gateway; human review remains a mandatory element, not a speed obstacle. The full production track (GitOps, Kubernetes, full executor) is a frontier requiring accumulated replay evidence. The practical result of the first pass is not an orchestrator but proof: an allowed action passes, a forbidden one is blocked, the trace is reproducible by incident_id.
