Applied Part 11. Integration with a Real API: From Specification to Deployment
Status: Recommendation. The SDD phase separation (Specify/Plan/Tasks/Implement/Validate) and the 25-point readiness model are a recommended framework. The educational pass does not require actual Kubernetes, GitOps, or an external executor.
Frontier. Fully automated auto-remediation without human review on the critical path remains frontier: even teams with extensive SDD experience keep a human in the loop. Of the built-in Qwen Code commands, only /plan is available here; the other steps are user commands or direct qwen -p calls from project scripts.
For the educational pass, the local pipeline examples/real-api/ is sufficient: normalize webhooks, pass the readiness gate, and block a forbidden action. GitOps, Kubernetes API, and full auto-remediation belong to the full production track.
> [runnable] — a runnable equivalent of the "webhook → normalization → readiness gate → dry run" pipeline is in [examples/real-api/](examples/real-api/README.md). Scripts work on stdlib without external dependencies; they do not replace production infrastructure, but allow you to run the gate locally and see which conditions block the action.
The high_memory_usage scenario is the peak of readings on the same SQLite we built in Part 12 of the first volume, and the same idempotent migration technique. Only now it is viewed from the operations side. The Specify → Plan → Tasks → Implement cycle, practiced in Part 7, Part 8, and Part 9 of the first volume, is not cancelled or replaced here. It is wrapped in a production gate and completed with team review of the evidence package in the spirit of Part 16.
Before Reading
- Foundation from the first volume: Parts 7–9 establish the specification-plan-validation cycle; Part 16 covers team review.
- Local educational case: high_memory_usage, the canonical case of the entire first pass.
- Trace for capstone/: readiness verdict, two blocking conditions, and dry-run of the permitted action.
- Key terms of the first pass: readiness and dry-run. The 25-point rubric, audit_trace, GitOps, and executor are reference material.
- What to defer: GitOps, Kubernetes API, full executor, and auto-remediation without manual confirmation.
Objective
In the educational minimum, this chapter checks the short chain webhook -> normalization -> readiness -> dry-run for high_memory_usage. The full production track expands it to GitOps deployment, rollback of changes, and readiness assessment before limited auto-remediation. Every action must be linked to specify/plan/tasks/implement/validate artifacts, not lost in manual commands.
The practical result of the first pass is not a production orchestrator, but proof that a permitted action passes readiness, while a forbidden one is blocked until the system is changed.
readiness here is a formal pipeline assessment on a 25-point scale with a threshold of 23/25. Auto-remediation in this chapter means a limited playbook with pre-approved actions, rollback conditions, and human review; it does not give an agent the right to change production arbitrarily.
Of the built-in Qwen Code commands in this pipeline, only /plan is available. The other steps — /sdd:specify, /sdd:tasks, /sdd:validate — should be implemented as user commands in .qwen/commands/sdd/ or replaced with regular prompts via qwen -p and project scripts.
Minimal Educational Scenario
Educational Case
The production incident high_memory_usage for appointments-api is derived from the MVP phase and the SQLite migrations in book/part-12-mvp.md. Pipeline: webhook from Grafana+PagerDuty → normalize_webhook.py → readiness gate by the 25-point model → dry run against the list of pre-approved actions. The goal is to walk the full path from raw payload to a controlled restart_pod and to confirm that the blocking conditions (audit, stateful) catch the failure exactly where they should.
Preparation
- book2/examples/real-api/fixtures/webhook_grafana.json, webhook_pagerduty.json: raw payloads with the same incident_key.
- book2/examples/real-api/fixtures/incident_event.expected.json: reference for the normalized event.
- book2/examples/real-api/fixtures/readiness_pass.json (24/25), readiness_block_audit.json (22/25 + audit below 1.0), readiness_block_stateful.json (24/25, but stateful without confirmed backup).
- book2/examples/real-api/specs/high_memory_usage/specify.md: pre-approved restart_pod and scale_up_replicas_one.
- book2/examples/real-api/scripts/normalize_webhook.py, check_readiness.py, dry_run.py.
Steps
1. cd book2/examples/real-api. Expected: you are in the example directory, no additional dependencies.
2. python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json. *Expected: exit code 0, normalized incident_event matches reference.*
3. python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json. *Expected: exit code 0, PASS incident=HM-2026-05-17-01 score=24/25.*
4. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json. *Expected: exit code 1, reason: audit_trace_coverage=0.7 < 1.0, plus failure by total score (22/25).*
5. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json. *Expected: exit code 1, reason: stateful workload without confirmed backup, even though total is 24/25.*
Bad: run dry_run.py before the readiness gate. The action is formally permitted by the specification, but audit_trace_coverage or backup_verified may be missing. Good: gate first, dry run only on exit code 0 from the gate. This sequence ensures the blast radius is known before the action list is checked.
6. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod. *Expected: exit code 0, PASS: action=restart_pod permitted (2 actions in spec).*
7. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace. *Expected: exit code 1, BLOCK: action="delete_namespace" not found among pre-approved.*
- For the educational minimum, stop here: the runnable chain has demonstrated normalization, PASS for the permitted path, and BLOCK for audit/stateful/delete-namespace.
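The ordering rule from the steps (gate first, dry run only on exit code 0) can be sketched as a small guard. This is a minimal illustration and not the actual dry_run.py: the function name `dry_run_allowed`, the hard-coded `PRE_APPROVED` tuple, and the wording of the first BLOCK message are assumptions made for the example; the real script reads the pre-approved actions from specify.md.

```python
# Hypothetical allow-list; in the example project it lives in
# specs/high_memory_usage/specify.md.
PRE_APPROVED = ("restart_pod", "scale_up_replicas_one")


def dry_run_allowed(gate_exit_code: int, action: str,
                    pre_approved: tuple = PRE_APPROVED) -> tuple:
    """Return (exit_code, message) for a dry run that respects the gate order."""
    # The readiness gate must have passed before the action list is consulted.
    if gate_exit_code != 0:
        return 1, f"BLOCK: readiness gate returned {gate_exit_code}, dry run skipped"
    if action not in pre_approved:
        return 1, f'BLOCK: action="{action}" not found among pre-approved'
    return 0, f"PASS: action={action} permitted ({len(pre_approved)} actions in spec)"
```

Passing the gate's exit code explicitly keeps the dependency visible: a caller cannot reach the allow-list check without stating what the gate returned.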
If you have Qwen Code installed and need an explanation for review, perform a separate optional step:
qwen -p "Read @fixtures/readiness_block_audit.json and @specs/high_memory_usage/specify.md. What needs to be added to reach readiness 23/25 and audit_trace_coverage=1.0? Do not modify files." --approval-mode plan
This request is not part of the runnable minimum. Its output can be attached to review, but readiness clearance must rely on check_readiness.py and dry_run.py.
Control Fact
Steps 3, 6 — PASS. Steps 4, 5, 7 — BLOCK with specific reason in stderr. If step 5 passes with stateful=true, backup_verified=false, the readiness gate is broken: the hard block for stateful cannot be bypassed.
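The three gate outcomes can be reproduced with a short sketch. It is not the code of check_readiness.py: the `score` key and the exact reason strings are assumptions, and the three dictionaries are hypothetical stand-ins for the fixtures, built only from the values quoted in this chapter.

```python
THRESHOLD = 23  # auto-clearance threshold used in this chapter


def check_readiness(report: dict) -> list:
    """Return the list of blocking reasons; an empty list means PASS."""
    reasons = []
    score = report["score"]
    if score < THRESHOLD:
        reasons.append(f"score {score}/25 below threshold {THRESHOLD}")
    coverage = report.get("audit_trace_coverage", 0.0)
    if coverage < 1.0:
        reasons.append(f"audit_trace_coverage={coverage} < 1.0")
    # Hard block: a stateful workload needs a confirmed backup at any score.
    if report.get("stateful", False) and not report.get("backup_verified", False):
        reasons.append("stateful workload without confirmed backup")
    return reasons


# Hypothetical stand-ins for the three readiness fixtures.
readiness_pass = {"score": 24, "audit_trace_coverage": 1.0, "stateful": False}
block_audit = {"score": 22, "audit_trace_coverage": 0.7, "stateful": False}
block_stateful = {"score": 24, "audit_trace_coverage": 1.0,
                  "stateful": True, "backup_verified": False}

for name, report in [("pass", readiness_pass), ("audit", block_audit),
                     ("stateful", block_stateful)]:
    reasons = check_readiness(report)
    print(name, "PASS" if not reasons else "BLOCK: " + "; ".join(reasons))
```

The stateful check is deliberately independent of the score, which is why the third report is blocked at 24/25.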
How This Goes Into capstone/
Transfer to capstone/readiness.md the readiness result, two blocking conditions, and the dry_run.py result for the permitted action. In capstone/validation.md list the commands that were actually run. GitOps, Kubernetes API, and full executor are not part of the educational minimum unless they were implemented.
Read this fragment as follows: one positive fixture shows the permitted path, two blockers record specific rejection reasons, and dry_run marks the boundary between a permitted and a blocked action. If any line is missing, the readiness package is incomplete.
readiness:
  pass_fixture: "readiness_pass.json -> 24/25"
  blockers:
    - "audit_trace_coverage=0.7 blocks auto mode"
    - "stateful=true without backup_verified blocks action"
  dry_run: "restart_pod PASS; delete_namespace BLOCK"
Reviewable Trace
Scripts write to stdout/stderr and do not create out/. Record the run as a readable artifact: a short capstone/readiness.md or a CI report if your project has one. Minimum content: the same four entries from the YAML block above (pass_fixture, two blockers, dry_run); the full 25-point report is only needed in the full track.
Do not create a commit marker for the sake of the commit itself. For the textbook, a reproducible trace that can be read without chat history is what matters.
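If you prefer to generate the record instead of typing it, a helper along these lines would do. It is a sketch: `render_readiness` is a hypothetical name, and it assumes the values contain no double quotes, so no YAML escaping is performed.

```python
def render_readiness(pass_fixture: str, blockers: list, dry_run: str) -> str:
    """Render the minimal readiness record as YAML-style text."""
    # Values are assumed to contain no double quotes; nothing is escaped.
    lines = ["readiness:", f'  pass_fixture: "{pass_fixture}"', "  blockers:"]
    lines += [f'    - "{item}"' for item in blockers]
    lines.append(f'  dry_run: "{dry_run}"')
    return "\n".join(lines) + "\n"


record = render_readiness(
    "readiness_pass.json -> 24/25",
    ["audit_trace_coverage=0.7 blocks auto mode",
     "stateful=true without backup_verified blocks action"],
    "restart_pod PASS; delete_namespace BLOCK",
)
print(record)
```

Writing `record` to capstone/readiness.md is then a single `pathlib.Path(...).write_text(record)` call in your own project.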
Key Ideas
The starting point of traceability is audit_trace (live log of Qwen Code), in which the incoming webhook and specification diffs are recorded as a single causal chain. For incident HM-2026-05-17-01, the first record links incident_event.json, user command /sdd:specify, and created file specs/high_memory_usage/specify.md. If any element is missing, the pipeline has already lost provability. Minimal log fragment: webhook_received -> incident_event_normalized -> /sdd:specify -> spec_diff_created; each subsequent diff references the same incident_id. /sdd:specify is a project extension; implement it as a user command in .qwen/commands/sdd/specify.md or replace with direct qwen -p.
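The "single causal chain" requirement can be checked mechanically. The sketch below assumes a hypothetical record shape, a dict with `incident_id` and `step` keys; the actual audit_trace format of your project may differ.

```python
EXPECTED_ORDER = [
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
]


def trace_gaps(records: list, incident_id: str) -> list:
    """Return problems found in the trace; an empty list means the chain holds."""
    problems = []
    # Every record must point at the same incident.
    for index, record in enumerate(records):
        if record.get("incident_id") != incident_id:
            problems.append(f"record {index} is not linked to {incident_id}")
    # The minimal chain must appear in order; other records may sit in between.
    steps = iter([record.get("step") for record in records])
    for expected in EXPECTED_ORDER:
        if not any(step == expected for step in steps):
            problems.append(f"missing or out-of-order step: {expected}")
            break
    return problems
```

Because the iterator is shared across the loop, each expected step is searched for only after the previous one, so a chain with the right steps in the wrong order is reported as broken.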
Normalize alerts from Grafana and PagerDuty into a single incident-event. Otherwise different sources will dictate different versions of the same incident. Grafana provides metrics and observation window, e.g. memory_percent=93 over 10m. PagerDuty adds priority, service binding, and escalation status. The normalizer reduces them to fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. After this, the specify step describes only WHY and WHAT: why intervention is needed and what result counts as success. It does not choose a library, SDK, or specific API endpoint.
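A minimal sketch of such a normalizer follows. The input payload shapes are hypothetical (real Grafana and PagerDuty webhooks are nested and more verbose), and mapping `incident_key` to `incident_id` is an assumption; only the output field names come from the list above. This is not the code of normalize_webhook.py.

```python
def normalize(grafana: dict, pagerduty: dict) -> dict:
    """Merge two alert payloads describing one incident into one incident_event."""
    if grafana["incident_key"] != pagerduty["incident_key"]:
        raise ValueError("payloads describe different incidents")
    # Only whitelisted fields are copied, so anything else in the raw
    # payloads (including sensitive fields) is dropped.
    return {
        "incident_id": grafana["incident_key"],
        "service": pagerduty["service"],
        "namespace": grafana["namespace"],
        "pod": grafana["pod"],
        "severity": pagerduty["severity"],
        "window_minutes": grafana["window_minutes"],
        "metric_context": {"memory_percent": grafana["memory_percent"]},
        "source_refs": ["grafana", "pagerduty"],
    }


# Hypothetical, flattened payloads using the values quoted in this chapter.
grafana_payload = {"incident_key": "HM-2026-05-17-01", "namespace": "appointments-api",
                   "pod": "api-7b4", "memory_percent": 93, "window_minutes": 10,
                   "api_token": "secret"}
pagerduty_payload = {"incident_key": "HM-2026-05-17-01", "service": "appointments-api",
                     "severity": "critical"}

event = normalize(grafana_payload, pagerduty_payload)
print(event)
```

Building the event from an explicit field list, instead of merging the two dicts, is what keeps one source from silently overriding the other.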
What this means in practice. Let's compare two specify variants for the same incident:
Bad:
> Specify for high_memory_usage: Restart pod via kubectl delete pod ...
Problem: specify immediately chooses the implementation command and blocks Plan.
Good:
> Specify for high_memory_usage: Keep memory_percent < 80% for 5 minutes after action. Pre-approved actions: restart_pod, scale_up_replicas_one. Audit trace required.
SDD phase separation protects the pipeline from premature implementation. Each phase is responsible for its own:
- Specify captures user story, success criteria, functional and non-functional constraints;
- Plan chooses strategy;
- Tasks turns it into executable steps;
- Implement applies changes through a controlled mechanism.
This structure corresponds to the practical phase framework Specify → Plan → Tasks → Implement from GitHub Spec Kit (see also GitHub Spec Kit Quickstart). In production this matters because the model does not get the right to "immediately fix" an incident until cause, intervention boundaries, and result verification method are proven.
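Premature implementation in a specify text can be caught with a simple lint. The sketch below is a heuristic and not part of the example project: the marker list is hypothetical and deliberately short, so treat a clean result as "nothing obvious found", not as proof.

```python
import re

# Hypothetical markers of implementation detail; extend them per project.
IMPLEMENTATION_MARKERS = [r"\bkubectl\b", r"\bhelm\b", r"\bcurl\b", r"https?://"]


def implementation_leaks(spec_text: str) -> list:
    """Return the marker patterns that appear in a specify text."""
    return [pattern for pattern in IMPLEMENTATION_MARKERS
            if re.search(pattern, spec_text, flags=re.IGNORECASE)]


bad_spec = "Restart pod via kubectl delete pod ..."
good_spec = ("Keep memory_percent < 80% for 5 minutes after action. "
             "Pre-approved actions: restart_pod, scale_up_replicas_one. "
             "Audit trace required.")

print(implementation_leaks(bad_spec))   # the kubectl marker is reported
print(implementation_leaks(good_spec))  # no markers are reported
```

Note that action names such as restart_pod are not flagged: naming a pre-approved action is part of WHAT, while the command that performs it belongs to Plan and Tasks.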
Do not expand the core of the chapter to the entire production orchestrator. In the first pass, only the chain webhook -> normalization -> readiness -> dry-run is checked here. Other mechanisms from previous chapters serve as control points:
- Verifier from Parts 4 and 8 is needed if dry-run produces a disputed counterexample.
- Tiered budgets from Part 9 are needed if frontier-reviewer starts serving not only high-risk branches.
- Anti-Goodhart from Part 10 is needed if memory drops at the cost of 5xx, latency, or manual audit.
If these mechanisms are not yet assembled, do not try to model them inside Chapter 11. Record them as blockers or links to corresponding chapters, and complete the minimum run with a readiness verdict and dry-run result.
For high_memory_usage, start planning with minimal impact. The base /plan first chooses restarting a specific pod, prioritizing a small blast radius, then checks whether scale-up is necessary, and only after that permits expanding the action while keeping the rollback path intact.
The tasks step breaks this down into operations: confirm stateless nature of workload, perform dry-run without real changes, delete only the target pod, observe RSS, CPU, 5xx; if no improvement within the set window — activate rollback and create human_review.
Validation completes the auto-remediation loop only after checking real metrics, security gate, and GitOps commit (this is part of the frontier scenario, see chapter header). In validation.md check four conditions:
- memory stays below threshold in two consecutive windows;
- 5xx does not increase;
- latency does not degrade;
- rollback is described and executable.
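The four conditions can be expressed as one check over observed windows. This is a sketch under stated assumptions: the keys `memory_percent`, `rate_5xx`, and `latency_ms` are hypothetical, each window is compared against a pre-action baseline with no tolerance, and whether the rollback is described and executable is supplied by the reviewer as a flag because code cannot establish it.

```python
def validate(windows: list, baseline: dict, memory_threshold: float = 80.0,
             rollback_documented: bool = True) -> list:
    """Return failed conditions; an empty list means validation passed."""
    failures = []
    last_two = windows[-2:]
    # Memory must be below the threshold in the two most recent windows.
    if len(last_two) < 2 or any(w["memory_percent"] >= memory_threshold
                                for w in last_two):
        failures.append("memory not below threshold in two consecutive windows")
    if any(w["rate_5xx"] > baseline["rate_5xx"] for w in windows):
        failures.append("5xx increased")
    if any(w["latency_ms"] > baseline["latency_ms"] for w in windows):
        failures.append("latency degraded")
    if not rollback_documented:
        failures.append("rollback not described or not executable")
    return failures


baseline = {"rate_5xx": 0.01, "latency_ms": 120}
windows = [
    {"memory_percent": 71, "rate_5xx": 0.01, "latency_ms": 118},
    {"memory_percent": 68, "rate_5xx": 0.0, "latency_ms": 110},
]
print(validate(windows, baseline))
```

A single window is reported as a failure on purpose: one good reading after a restart says little about whether the memory growth has actually stopped.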
After successful verification, six basic artifacts go into GitOps: specification, plan, tasks, diff, decision log, and 25-point report. A constitution update is added when necessary. Without this, the incident may be technically mitigated but cannot be considered closed in a managed, traceable way.
Full Track: 25-Point Readiness Model
On the first pass, it is sufficient to understand two facts: readiness_pass.json passes, and audit/stateful fixtures are blocked. The full rubric below is needed when you transfer this gate to a real production process and must explain why the threshold was chosen this way.
The model assesses five categories on a 0–5 scale and gives a total sum. Points are assigned by artifacts, not by impression. If a criterion cannot be confirmed by file, log, or schema, a lower score is given. Below are rubrics for each category.
The 23/25 threshold is a "strict but not paralyzing" compromise for the educational AgentClinic production model. It tolerates at most two lost points: two "fixable" complaints scored "4" in different categories (4+4+5+5+5 = 23), one "4" with the rest at "5" (24/25), or a single "3" with every other category at "5" (3+5+5+5+5 = 23). A "3" combined with any other deduction, or a "2" or below in any category, drops the total to 22 or less and removes auto-clearance. A total of 20–22/25 shifts the pipeline to semi-manual mode with human confirmation after each implement step. A higher threshold of 24/25 makes auto mode fall back to semi-manual as soon as a second minor complaint appears, and teams begin ignoring the model. Calibrate to the risk profile: payments and healthcare require auto ≥24/25; internal tools permit 21–22/25, but only as semi-manual or canary, not as production-ready auto-remediation.
Spec — Completeness of WHY/WHAT/constraints
| Points | Spec |
|---|---|
| 5 | WHY/WHAT/constraints explicit, acceptance criteria present, no out-of-scope in plan, Given/When/Then present |
| 4 | WHY/WHAT explicit, constraints present, but one plan item lacks implements: |
| 3 | WHY present, WHAT vague, constraints partial |
| 2 | One of three blocks (WHY/WHAT/constraints) missing |
| 1 | Only symptom description, no WHY, WHAT, or constraints |
| 0 | No specification |
Implementation — Idempotency and Controlled Changes
| Points | Implementation |
|---|---|
| 5 | All tasks idempotent, dry-run present, blast radius explicitly stated at pod/deployment level, changes go through GitOps |
| 4 | Idempotency and dry-run present, but one task changes state without prior check |
| 3 | Dry-run only for some steps, blast radius described in text without explicit field |
| 2 | No dry-run, changes applied directly to cluster bypassing GitOps |
| 1 | Tasks not idempotent, rerun breaks state |
| 0 | Actions performed manually, not recorded in tasks |
Verification — Given/When/Then, Schemas, Stress, Monitoring
| Points | Verification |
|---|---|
| 5 | Given/When/Then covers happy and negative path, JSON Schema validates inputs and outputs, stress specification and post-metrics in two windows |
| 4 | All elements present, but stress specification covers only one violation class |
| 3 | Given/When/Then and schema present, but monitoring checked in one window |
| 2 | Only Given/When/Then without schema and without post-metrics |
| 1 | Validation reduced to exit code check or single screenshot |
| 0 | validation.md missing or not running |
Process — Traceability "webhook → CLI → diff → replay"
| Points | Process |
|---|---|
| 5 | Every step (webhook, normalization, CLI command, diff, commit, validate) linked via incident_id, log reproducible, replay gives same diff |
| 4 | Traceability complete, but replay requires manual substitution of one variable |
| 3 | Webhook and CLI linked, but diff not tied to incident_id |
| 2 | Log exists, but step order recoverable only by timestamp |
| 1 | Actions recorded in chat, not in files |
| 0 | No traceability, incident source unknown |
Security — Guardrails, Emergency Stop, Rollback, Escalation
| Points | Security |
|---|---|
| 5 | Guardrails prevent blast radius expansion, emergency stop present, rollback condition written before execution, escalation to manual confirmation on uncertainty |
| 4 | All elements present, but escalation described only in text without formal trigger |
| 3 | Rollback and guardrails present, emergency stop missing |
| 2 | Only rollback, without guardrails and escalation |
| 1 | Declared "manual rollback" but no executable path described |
| 0 | Security gate undefined, actions proceed without restrictions |
How It Is Calculated and What Blocks Merge
The sum of points gives a total from 0 to 25. The passing threshold for auto-clearance is 23/25: below this boundary the pipeline does not get production-ready status, even if three categories are at maximum. A zero score in Security is forbidden at any total: it means there is no protective layer at all and blocks even semi-manual mode until minimum rollback, guardrails, and escalation appear.
Blocking conditions do not depend on total. Each of these cases blocks merge separately:
- failed validation (Verification ≤ 2);
- missing rollback (Security ≤ 2);
- undefined blast radius (Implementation ≤ 2 without explicit field).
At a total of 20–22, the pipeline is permitted only in semi-manual mode and only if none of the blocking conditions above apply. Semi-manual mode means a stop after each implement step, explicit human confirmation, a mandatory specification update, and re-assessment before returning to the auto loop.
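The calculation rules of this section fit into a short function. It is a sketch, not the project's gate: the lowercase category keys are hypothetical, and treating totals below 20 as "blocked" is an assumption, since the text defines only the auto and semi-manual bands.

```python
CATEGORIES = ("spec", "implementation", "verification", "process", "security")
AUTO_THRESHOLD = 23
SEMI_MANUAL_MIN = 20


def assess(scores: dict) -> dict:
    """Sum the five category scores and derive blockers and the permitted mode."""
    for name in CATEGORIES:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name} must be in 0..5")
    total = sum(scores[name] for name in CATEGORIES)
    # Blocking conditions apply regardless of the total.
    blockers = []
    if scores["verification"] <= 2:
        blockers.append("failed validation")
    if scores["security"] <= 2:
        blockers.append("missing rollback")
    if scores["implementation"] <= 2:
        blockers.append("undefined blast radius")
    if blockers or total < SEMI_MANUAL_MIN:  # below 20: assumed blocked
        mode = "blocked"
    elif total < AUTO_THRESHOLD:
        mode = "semi-manual"
    else:
        mode = "auto"
    return {"total": total, "blockers": blockers, "mode": mode}


example = {"spec": 5, "implementation": 4, "verification": 5,
           "process": 5, "security": 4}
print(assess(example))
```

With five categories capped at 5, any score of 2 or below already limits the total to 22, so the blocking conditions matter in practice inside the 20–22 band, where they decide between semi-manual and blocked.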
Checklist Before Production Cutover
Use this checklist when transferring the gate to a real process. Each item is tied to a rubric where a hidden score drop is possible:
- [ ] Spec contains WHY/WHAT/constraints and is tied to incident_id; each task has implements: pointing to a REQ identifier.
- [ ] Dry-run is logged before real changes; blast radius fixed at pod or deployment level, not in words.
- [ ] JSON Schema validates incident_event and final validation_report; Given/When/Then cover happy and negative path.
- [ ] Rollback condition written before execution and tested on staging; emergency stop available to operator without cluster access.
- [ ] Trace webhook → CLI → diff → commit → validate reproducible by incident_id; manual confirmation triggered automatically on repeated failure or blast radius expansion.
Example Filled Rubric for high_memory_usage
| Category | Score | Justification |
|---|---|---|
| Spec | 5 | WHY (prevent OOMKill), WHAT (RSS below 80% for 10 minutes), constraints (don't touch stateful, rollback after 6 minutes) explicit, Given/When/Then assembled |
| Implementation | 4 | Tasks idempotent, dry-run present, but scale-up branch lacks separate dry-run step |
| Verification | 5 | Given/When/Then, JSON Schema on incident_event and validation_report, stress specification on hidden leak, post-metrics in two windows |
| Process | 5 | incident_id=HM-2026-05-17-01 links webhook, /sdd:specify, diff, commit, and replay |
| Security | 4 | Guardrails on stateful workloads, rollback and emergency stop present, escalation described in text without formal trigger |
| Total | 23/25 | Production-ready by threshold, but scale-up branch remains semi-manual until separate dry-run |
Full Track: Threshold Calibration
Table "Low / Default / High" for readiness threshold, exercise with THRESHOLD override, and signals for review — in Appendix D, Section D.5. On the first pass, the chapter minimum is already proven if readiness_pass.json passes, audit/stateful fixtures are blocked, and delete_namespace does not get into the pre-approved action list.
Examples and Application
Practical input log for Qwen Code may start like this: POST /hooks/grafana reports memory_percent=93, pod=api-7b4, namespace=appointments-api, window=10m. Then POST /hooks/pagerduty confirms severity=critical and links the event to service appointments-api. The normalizer creates incident_event with incident_id=HM-2026-05-17-01, removes sensitive fields, attaches source references, and triggers user command /sdd:specify --event incident_event.json --preset high_memory_usage or equivalent qwen -p prompt — both variants belong to the recommended framework from the chapter header and are implemented as project commands around Qwen Code.
The first diff in specify.md captures three blocks: WHAT (reduce RSS below 80% for 10 minutes), WHY (prevent OOMKill and latency growth), constraints (don't touch stateful workloads, don't change HPA, have rollback after 6 minutes without improvement). On /plan the system compares two strategies: (A) restart target pod and observe; (B) restart plus temporary scale-up to four replicas. The verifier runs Given/When/Then: given pod is stateless and memory above 90% for 10 minutes; when only target pod is restarted; then memory must drop below 80%, and 5xx must not exceed acceptable threshold. If stress specification shows that scale-up requires changing rollout policy or hides memory leak by growing replica count, variant B remains a backup branch with manual confirmation, not automatic action.
At implement step, dry-run is performed first. Then commit goes through GitOps and syncs to ArgoCD only on green validator status. The executor does not close the PagerDuty incident immediately after restart. It waits two monitoring windows, checks validation.md, verifies the security gate, and adds spec, tasks, commit, and validation result links to the comment. If after 6 minutes memory does not decrease or 5xx grows, rollback path activates, human_review is created, and readiness score is recalculated with the failed verification criterion.
Summary
Production pipeline readiness is fixed by the 25-point model: five categories (Spec, Implementation, Verification, Process, Security) with an equal weight of 5 points each mirror the SDD cycle stages. Equal weight is a principle: no category compensates for a gap in another, so the threshold of 23 leaves room for no more than two lost points in total. Production-ready means a total of at least 23/25 with no critical validation or security gate violations. Dropping below the threshold shifts auto-remediation to semi-manual mode until the specification, policy, or execution path is fixed. Fully automated remediation remains the frontier scenario from the chapter header: permit it only after accumulating replay evidence and operator dry-runs. Such a loop turns every future incident into a verifiable system improvement.
Artifacts and Readiness Criteria
| Artifact | Ready when |
|---|---|
| Normalized incident_event | matches examples/real-api/fixtures/incident_event.expected.json field-for-field; Specify captures WHY/WHAT/constraints and does not choose remediation command |
| Local readiness gate run | readiness_pass.json passes; audit/stateful fixtures blocked with specific reason |
| dry_run.py on permitted and forbidden action | restart_pod PASS, delete_namespace BLOCK |
| Record in capstone/readiness.md | score, blocking conditions, one actually run command |
Full track adds specs/high_memory_usage/specify.md, plan.md, tasks.md, and validation.md, GitOps diff or commit linked to incident_id, decision log webhook → CLI → diff → commit → validate, and filled 25-point readiness table with evidence. Consider it ready if plan and tasks have blast radius, dry-run, rollback condition, and manual confirmation trigger; validation checks two metric windows, 5xx, latency, and security gate; user commands are either implemented as project commands or replaced with qwen -p prompts or project scripts; readiness total not below 23/25 without blocking conditions on rollback, verification, or blast radius.
Practice
- cd book2/examples/real-api && python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json. *Expected: code 0, normalized incident_event matches reference field-for-field.*
- Run four checks separately (each returns its own code, so && between them is not suitable):
python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json
python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json
python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod
python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace
*Expected: readiness_pass → code 0, PASS incident=HM-… score=24/25; readiness_block_audit → code 1, BLOCK … score=22/25 with reasons "score 22/25 below threshold 23" and "audit_trace_coverage=0.7 < 1.0 — full coverage mandatory"; restart_pod PASS, delete_namespace BLOCK.*
- Assess your case by the 25-point model and fill the table below. For each category indicate score, evidence artifact, and reason for reduction if score is below 5. Calculate total, check blocking conditions, and formulate what needs to change for the pipeline to pass threshold 23/25. *Expected: all three fields filled in each table row; "Evidence artifact" points to a specific file or run, not a general statement; the total cell contains a number in the format N/25 and a list of blocking conditions, or an explicit "no blockers".*
| Category | Score (0–5) | Evidence artifact | Reason for reduction |
|---|---|---|---|
| Spec | | | |
| Implementation | | | |
| Verification | | | |
| Process | | | |
| Security | | | |
| Total | | Blocking conditions: | What to change before cutover: |
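The second practice item asks for four separate runs because two of them are expected to fail. A small runner, sketched below, records every exit code without stopping at the first non-zero one. The helper name `run_separately` and the `EXPECTED_CODES` list are illustrative; the commands are the four from the practice item and only work from book2/examples/real-api.

```python
import subprocess


def run_separately(commands: list) -> list:
    """Run each command on its own and collect (command, exit_code) pairs.

    Unlike chaining with &&, a non-zero exit code does not stop the
    sequence, so expected BLOCK results are still recorded.
    """
    results = []
    for command in commands:
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((" ".join(command), completed.returncode))
    return results


CHECKS = [
    ["python3", "scripts/check_readiness.py", "--readiness", "fixtures/readiness_pass.json"],
    ["python3", "scripts/check_readiness.py", "--readiness", "fixtures/readiness_block_audit.json"],
    ["python3", "scripts/dry_run.py", "--spec", "specs/high_memory_usage/specify.md",
     "--action", "restart_pod"],
    ["python3", "scripts/dry_run.py", "--spec", "specs/high_memory_usage/specify.md",
     "--action", "delete_namespace"],
]
# Exit codes expected by the practice item: PASS, BLOCK, PASS, BLOCK.
EXPECTED_CODES = [0, 1, 0, 1]
```

Calling `run_separately(CHECKS)` from the example directory and comparing the collected codes with `EXPECTED_CODES` gives a one-line summary suitable for capstone/validation.md.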
Control Questions
- Why should specify not choose a specific remediation command?
- What conditions make auto-remediation unacceptable?
- What blocks clearance if readiness is below 23/25?
- A webhook about high_memory_usage arrives off-hours, and automatic remediation is ready to restart the pod. The readiness model gives 22/25 (minus 3 for incomplete audit). What will you do: restart, wait until morning, or call the on-call?