Reading: Practical Part 11. Integration with a Real API: From Specification to Deployment

Applied Part 11. Integration with a Real API: From Specification to Deployment

Status: Recommendation. The SDD phase separation Specify/Plan/Tasks/Implement/Validate and the 25-point readiness model are a recommended framework. The training pass does not require a real Kubernetes cluster, GitOps, or an external executor.

Frontier. Fully automatic remediation without manual confirmation (human review) on the critical path remains a frontier: even teams with extensive SDD experience keep a human in the loop. Of the built-in Qwen Code commands, only /plan is used here; the remaining steps are custom commands or direct qwen -p invocations through project scripts.

For the training walkthrough, a local pipeline in examples/real-api/ is enough: normalize webhooks, pass the readiness gateway, and block a forbidden action. GitOps, the Kubernetes API, and full auto-remediation belong to the full production track.

> [runnable] — a runnable analogue of the "webhook → normalization → readiness gateway → dry run" pipeline lives in [examples/real-api/](examples/real-api/README.md). The scripts run on stdlib without external dependencies; they do not replace production infrastructure, but let you run the gateway locally and see which conditions block an action.

The high_memory_usage scenario reuses the read-traffic peak against the same SQLite database that we built in part 12 of volume one, together with the same idempotent migration technique. Only now it is viewed from the operations side. The Specify → Plan → Tasks → Implement cycle, worked through in parts 7, 8, and 9 of volume one, is neither cancelled nor replaced here. It is wrapped by a production gateway and ends with a team review of the evidence package in the spirit of part 16.

Before Reading

Foundation from volume one: parts 7–9 establish the spec-plan-validate cycle, part 16 establishes team review.
Local training case: high_memory_usage, the canonical case of the entire first pass.

Trace for capstone/: the readiness verdict, two blocking conditions, and a dry-run of the allowed action.
Key terms of the first pass: readiness and dry-run. The 25-point rubric, audit_trace, GitOps, and executor are for reference.
What to defer: GitOps, the Kubernetes API, the full executor, and auto-remediation without manual confirmation.

Goal

In the training minimum, this chapter checks the short chain webhook -> normalization -> readiness -> dry-run for high_memory_usage. The full production track extends it to GitOps deployment, rollback of changes, and a readiness assessment before limited auto-remediation. Every action must be linked to specify/plan/tasks/implement/validate artifacts, not lost in manual commands.

The practical outcome of the first pass is not a production orchestrator, but proof that an allowed action passes readiness and a forbidden one is blocked before the system is changed.

readiness here is a formal pipeline assessment on a 25-point scale with a threshold of 23/25. Auto-remediation in this chapter means a limited playbook with pre-approved actions, rollback conditions, and human review. It is not a license for the agent to arbitrarily change production.

Of the built-in Qwen Code commands in this pipeline, only /plan is used. Shape the other steps — /sdd:specify, /sdd:tasks, /sdd:validate — as custom commands in .qwen/commands/sdd/ or replace them with regular prompts via qwen -p and project scripts.

Minimal Training Scenario

Training Case

The high_memory_usage production incident for appointments-api is derived from the MVP phase and SQLite migrations from book/part-12-mvp.md. Pipeline: Grafana+PagerDuty webhook → normalize_webhook.py → readiness gateway on the 25-point model → dry run against a list of pre-approved actions. The goal is to walk the full path from a raw payload to a controlled restart_pod and verify that the blocking conditions (audit, stateful) catch a failure exactly where they should.

Preparation

book2/examples/real-api/fixtures/webhook_grafana.json, webhook_pagerduty.json — raw payloads with the same incident_key.
book2/examples/real-api/fixtures/incident_event.expected.json — the reference for the normalized event.
book2/examples/real-api/fixtures/readiness_pass.json (24/25), readiness_block_audit.json (22/25 + audit below 1.0), readiness_block_stateful.json (24/25, but stateful without a backup).
book2/examples/real-api/specs/high_memory_usage/specify.md — the pre-approved restart_pod and scale_up_replicas_one.
book2/examples/real-api/scripts/normalize_webhook.py, check_readiness.py, dry_run.py.

Steps

1. cd book2/examples/real-api. *Expectation: you are in the example directory, no extra dependencies.*
2. python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json. *Expectation: return code 0, the normalized incident_event matches the reference.*
3. python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json. *Expectation: return code 0, PASS incident=HM-2026-05-17-01 score=24/25.*

4. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json. *Expectation: return code 1, the reason is audit_trace_coverage=0.7 < 1.0, plus a drop in the score sum (22/25).*
5. python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json. *Expectation: return code 1, the reason is stateful workload without a confirmed backup, even though the sum is 24/25.*

Bad: run dry_run.py before the readiness gateway. The action is formally allowed by the spec, but audit_trace_coverage or backup_verified may be missing. Good: readiness gateway first, dry run only when the gateway returns 0. The sequence ensures the blast radius is known before the action list is checked.

6. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod. *Expectation: return code 0, PASS: action=restart_pod is allowed (2 actions in spec).*
7. python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace. *Expectation: return code 1, BLOCK: action="delete_namespace" not found among pre-approved.*

For the training minimum, stop here: the runnable chain has shown normalization, PASS for the allowed path, and BLOCK for audit/stateful/delete-namespace.
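The dry-run check from the last two steps can be sketched as follows. This is an illustration, not the source of dry_run.py: the layout of specify.md (a "Pre-approved actions" heading followed by backticked bullet items) and the helper names approved_actions and dry_run are assumptions made for the example; only the PASS and BLOCK message formats are taken from the expectations above.

```python
import re

# Hypothetical spec layout; the real specify.md may be structured differently.
SPEC_TEXT = """\
# Specify: high_memory_usage

## Pre-approved actions
- `restart_pod`
- `scale_up_replicas_one`

## Constraints
- do not touch stateful workloads
"""


def approved_actions(spec_text: str) -> list[str]:
    """Collect backticked bullet items under the 'Pre-approved actions' heading."""
    actions: list[str] = []
    in_section = False
    for line in spec_text.splitlines():
        if line.startswith("## "):
            # Entering a new section: only one of them lists the actions.
            in_section = line.strip().lower() == "## pre-approved actions"
            continue
        match = re.match(r"^\s*-\s*`([a-z0-9_]+)`", line)
        if in_section and match:
            actions.append(match.group(1))
    return actions


def dry_run(spec_text: str, action: str) -> tuple[int, str]:
    """Return (exit code, message) without changing anything in the system."""
    actions = approved_actions(spec_text)
    if action in actions:
        return 0, f"PASS: action={action} is allowed ({len(actions)} actions in spec)"
    return 1, f'BLOCK: action="{action}" not found among pre-approved'


for candidate in ("restart_pod", "delete_namespace"):
    code, message = dry_run(SPEC_TEXT, candidate)
    print(code, message)
```

The check is an allow-list: anything not named in the spec is blocked by default, so a new action has to enter through a spec change rather than through the command line.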

If Qwen Code is installed and you need an explanation for the review, perform an additional optional step:

qwen -p "Read @fixtures/readiness_block_audit.json and @specs/high_memory_usage/specify.md. What needs to be added so that readiness reaches 23/25 and audit_trace_coverage=1.0? Do not change the files." --approval-mode plan

This request is not part of the runnable minimum. Its output can be attached to the review, but the readiness acceptance must rest on check_readiness.py and dry_run.py.

Verification Fact

Steps 3, 6 — PASS. Steps 4, 5, 7 — BLOCK with a specific reason in stderr. If step 5 passes with stateful=true, backup_verified=false, the readiness gateway is broken: the hard block for stateful cannot be bypassed.
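The gateway behavior these steps verify can be sketched as follows. This is a minimal illustration under stated assumptions, not the source of check_readiness.py: the fixture shape (a scores mapping plus the fields audit_trace_coverage, stateful, and backup_verified) and the sample values are hypothetical, chosen only to reproduce the 24/25, 22/25, and stateful cases described above.

```python
THRESHOLD = 23  # pass threshold on the 25-point scale used in this chapter


def check_readiness(readiness: dict) -> list[str]:
    """Return the blocking reasons; an empty list means PASS."""
    reasons: list[str] = []
    total = sum(readiness["scores"].values())
    if total < THRESHOLD:
        reasons.append(f"score {total}/25 below threshold {THRESHOLD}")
    coverage = readiness.get("audit_trace_coverage", 0.0)
    if coverage < 1.0:
        reasons.append(f"audit_trace_coverage={coverage} < 1.0")
    # Hard block: it applies regardless of the score sum.
    if readiness.get("stateful", False) and not readiness.get("backup_verified", False):
        reasons.append("stateful workload without a confirmed backup")
    return reasons


# Hypothetical stand-ins for the three fixtures.
passing = {
    "scores": {"spec": 5, "implementation": 5, "verification": 5, "process": 5, "security": 4},
    "audit_trace_coverage": 1.0,
    "stateful": False,
}
audit_block = {
    "scores": {"spec": 5, "implementation": 4, "verification": 5, "process": 3, "security": 5},
    "audit_trace_coverage": 0.7,
    "stateful": False,
}
stateful_block = {
    "scores": {"spec": 5, "implementation": 5, "verification": 5, "process": 5, "security": 4},
    "audit_trace_coverage": 1.0,
    "stateful": True,
    "backup_verified": False,
}

for name, fixture in [("pass", passing), ("audit", audit_block), ("stateful", stateful_block)]:
    reasons = check_readiness(fixture)
    print(name, "PASS" if not reasons else f"BLOCK: {'; '.join(reasons)}")
```

The stateful check is written as a separate condition on purpose: a fixture with a sum of 24/25 still returns a reason, which is the property step 5 is meant to prove.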

How This Gets into `capstone/`

Move the readiness outcome, the two blocking conditions, and the dry_run.py result for the allowed action into capstone/readiness.md. In capstone/validation.md, list the commands that actually ran. GitOps, the Kubernetes API, and the full executor are not part of the training minimum unless they were actually implemented.

Read this fragment as follows: one positive fixture shows the allowed path, two blockers record specific failure reasons, and dry_run marks the boundary between an allowed and a blocked action. If even one line is missing, the readiness package is incomplete.

readiness:
  pass_fixture: "readiness_pass.json -> 24/25"
  blockers:
    - "audit_trace_coverage=0.7 blocks auto mode"
    - "stateful=true without backup_verified blocks action"
  dry_run: "restart_pod PASS; delete_namespace BLOCK"

Reviewable Trace

The scripts write to stdout/stderr and do not create out/. Capture the run as a readable artifact: a short capstone/readiness.md or a CI report if one exists in your project. The minimum content is the same four YAML lines from the block above (pass_fixture, two blockers, dry_run); the full 25-point report is only needed on the full track.

Do not create a commit marker for its own sake. For the textbook, what matters is a reproducible trace that can be read without the chat history.
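One way to turn a run into such a trace is a small wrapper that executes each command and records its exit code. The sketch below is an assumption-level illustration: run_check and render are hypothetical helpers, and the demo commands are stand-ins so the code runs anywhere; in the example directory they would be replaced by the python3 scripts/... invocations from the steps.

```python
import subprocess
import sys


def run_check(label: str, argv: list[str]) -> dict:
    """Run one command and record its exit code plus the first output line."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # The scripts write to stdout/stderr, so both streams are inspected.
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    return {
        "label": label,
        "code": proc.returncode,
        "verdict": "PASS" if proc.returncode == 0 else "BLOCK",
        "detail": lines[0] if lines else "",
    }


def render(results: list[dict]) -> str:
    """Format the results as plain lines that can be pasted into readiness.md."""
    return "\n".join(
        f"- {r['label']}: {r['verdict']} (exit {r['code']}) {r['detail']}".rstrip()
        for r in results
    )


# Stand-in commands (hypothetical): one exits with 0, the other with 1.
DEMO_COMMANDS = [
    ("allowed path", [sys.executable, "-c", "print('PASS demo')"]),
    (
        "blocked path",
        [sys.executable, "-c", "import sys; print('BLOCK demo', file=sys.stderr); sys.exit(1)"],
    ),
]

results = [run_check(label, argv) for label, argv in DEMO_COMMANDS]
print(render(results))
```

Each command is run on its own and its code is stored, so a BLOCK in one check does not prevent the remaining checks from being recorded.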

Key Ideas

The starting point of the trace is audit_trace (the live Qwen Code log), in which the incoming webhook and specification diffs are captured as a single causal trace. For the HM-2026-05-17-01 incident, the first record links incident_event.json, the /sdd:specify custom command, and the created specs/high_memory_usage/specify.md file. If any of those elements is missing, the pipeline has already lost provability. The minimum log fragment is: webhook_received -> incident_event_normalized -> /sdd:specify -> spec_diff_created; each subsequent diff references the same incident_id. /sdd:specify is a project extension; shape it as a custom command in .qwen/commands/sdd/specify.md or replace it with a direct qwen -p invocation.
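The minimum log fragment can be checked mechanically. The sketch below assumes a hypothetical record shape (a dict with step and incident_id keys); the actual audit_trace format is not specified in this chapter, so treat the structure as an illustration of the rule, not as the log schema.

```python
REQUIRED_STEPS = [
    "webhook_received",
    "incident_event_normalized",
    "/sdd:specify",
    "spec_diff_created",
]


def trace_gaps(records: list[dict], incident_id: str) -> list[str]:
    """Return the problems that break the causal trace; empty means complete."""
    problems: list[str] = []
    for record in records:
        if record.get("incident_id") != incident_id:
            problems.append(f"record not linked to {incident_id}: {record.get('step')}")
    steps = [r.get("step") for r in records if r.get("incident_id") == incident_id]
    position = 0
    for required in REQUIRED_STEPS:
        try:
            # Each required step must appear after the previous one.
            position = steps.index(required, position) + 1
        except ValueError:
            problems.append(f"missing or out of order: {required}")
    return problems


complete = [{"step": s, "incident_id": "HM-2026-05-17-01"} for s in REQUIRED_STEPS]
broken = complete[:3]  # the spec diff was never recorded

print(trace_gaps(complete, "HM-2026-05-17-01"))
print(trace_gaps(broken, "HM-2026-05-17-01"))
```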

Normalize Grafana and PagerDuty alerts into a single incident-event. Otherwise different sources will dictate different versions of the same incident. Grafana provides metrics and an observation window, for example memory_percent=93 over 10m. PagerDuty adds priority, service binding, and escalation status. The normalizer reduces them to the fields service, namespace, pod, severity, window_minutes, metric_context, source_refs. After that, the specify step describes only WHY and WHAT: why intervention is needed and what outcome counts as success. It does not pick a library, SDK, or specific API handle.

What this means in practice. Compare two variants of specify for the same incident:

Bad:

> Specify for high_memory_usage: Restart the pod via kubectl delete pod ...

Problem: the specify immediately picks an implementation command and blocks the Plan.

Good:

> Specify for high_memory_usage: Hold memory_percent < 80% for 5 minutes after the action. Pre-approved actions: restart_pod, scale_up_replicas_one. The audit trace is mandatory.

The SDD phase separation protects the pipeline from premature implementation. Each phase has its own responsibility:

Specify captures the user story, success criteria, and functional and non-functional constraints;
Plan picks the strategy;
Tasks turns it into executable steps;
Implement applies changes through a controlled mechanism.

This structure matches the practical framework of the Specify → Plan → Tasks → Implement phases from GitHub Spec Kit (see also GitHub Spec Kit Quickstart). In production this matters because the model gets no right to "treat" the incident until the cause, the boundaries of intervention, and the way to verify the outcome are established.

Do not expand the core of the chapter into the entire production orchestrator. In the first pass, only the webhook -> normalization -> readiness -> dry-run chain is checked here. The other mechanisms from previous chapters act as checkpoints:

The verifier from parts 4 and 8 is needed if the dry run triggers a controversial counterexample.
The tiered budgets from part 9 are needed if frontier-reviewer starts serving branches other than high-risk ones.
The Anti-Goodhart from part 10 is needed if memory drops at the cost of 5xx, latency, or manual audit.

If those mechanisms are not yet assembled, do not try to simulate them inside chapter 11. Record them as blockers or references to the corresponding chapters, and finish the minimum run with a readiness verdict and a dry-run result.

For high_memory_usage, start the plan with the smallest impact. The baseline /plan prioritizes a small blast radius and picks a restart of one specific pod. Then it checks whether a scale-up is needed. Only after that does it allow extending the action while keeping a rollback path.

The tasks step breaks this down into operations: confirm the stateless nature of the workload, perform a dry run without real changes, delete only the target pod, watch RSS, CPU, and 5xx; if there is no improvement within the specified window, activate rollback and create a human_review.

Validation closes the auto-remediation loop only after checking real metrics, the safety gateway, and the GitOps commit (this is part of the frontier scenario, see the chapter header). In validation.md check four conditions:

memory stays below the threshold in two consecutive windows;
5xx does not grow;
latency does not degrade;
the rollback is described and executable.
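The four conditions above can be expressed as a single check. This is a sketch under assumptions: the Window dataclass, the field names, and the strict comparison against a pre-action baseline (with no tolerance band) are hypothetical choices for the example; a real validation.md may define its own thresholds.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Metrics aggregated over one monitoring window (hypothetical shape)."""
    memory_percent: float
    rate_5xx: float
    latency_ms: float


def validate(
    baseline: Window,
    windows: list[Window],
    memory_threshold: float,
    rollback_executable: bool,
) -> dict[str, bool]:
    """Evaluate the four validation conditions on the last two windows."""
    last_two = windows[-2:]
    return {
        "memory_below_threshold_two_windows": len(last_two) == 2
        and all(w.memory_percent < memory_threshold for w in last_two),
        "5xx_not_growing": all(w.rate_5xx <= baseline.rate_5xx for w in last_two),
        "latency_not_degraded": all(w.latency_ms <= baseline.latency_ms for w in last_two),
        "rollback_described_and_executable": rollback_executable,
    }


baseline = Window(memory_percent=93.0, rate_5xx=0.5, latency_ms=120.0)
after_restart = [Window(78.0, 0.4, 118.0), Window(75.0, 0.5, 119.0)]

report = validate(baseline, after_restart, memory_threshold=80.0, rollback_executable=True)
print(report, "closed" if all(report.values()) else "rollback or human_review")
```

Returning one flag per condition, instead of a single boolean, keeps the failed criterion visible when the readiness score has to be recalculated.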

After a successful check, six base artifacts land in GitOps: the specification, the plan, the tasks, the diff, the decision log, and the 25-point readiness report. A Constitution update is added when needed. Without this, the incident may be technically mitigated but is not considered managed-closed.

Full Track: The 25-Point Readiness Model

On the first pass, it is enough to understand two facts: readiness_pass.json passes, and the audit/stateful fixtures are blocked. The full rubric below is needed when you move this gateway into a real production process and have to explain why the threshold was chosen exactly that way.

The model rates five categories on a 0–5 scale and gives a final sum. Scores are assigned by artifacts, not by impression. If a criterion cannot be backed by a file, a log, or a schema, the lower score is given. Below are the rubrics for each category.

The 23/25 threshold is a "strict but not paralyzing" compromise for the AgentClinic-production training model. It admits up to two "remediable" claims of "4" in different categories (4+4+5+5+5 = 23) or one "4" with the rest "5" (24/25). A single "3" reaches 23 only when the other four categories are all at "5" (3+5+5+5+5 = 23); a "3" next to any other deduction, or a "2" or lower anywhere, drops the sum to 22 or less and removes the auto-allowance. A total of 20–22/25 moves the pipeline to semi-manual mode with human confirmation after every implement step. A stricter threshold of 24/25 would push auto into semi-manual at the slightest claim, and the team would start ignoring the model. Calibrate to the risk profile: payments and healthcare require auto at ≥24/25; internal tools may allow 21–22/25, but only as semi-manual or canary, not as production-ready auto-remediation.

Spec — completeness of WHY/WHAT/constraints

Score	Spec
5	WHY/WHAT/constraints are explicit, acceptance criteria are present, there is no out-of-scope in the plan, Given/When/Then is present
4	WHY/WHAT are explicit, constraints are present, but one item in the plan lacks `implements:`
3	WHY is present, WHAT is unclear, constraints are partial
2	One of the three blocks (WHY/WHAT/constraints) is missing
1	Only a symptom description, neither WHY, nor WHAT, nor constraints
0	No specification

Implementation — idempotency and controlled changes

Score	Implementation
5	All tasks are idempotent, a dry run is present, the blast radius is explicit at the pod/deployment level, changes go through GitOps
4	Idempotency and a dry run are present, but one task changes state without a precheck
3	A dry run exists only for some steps, the blast radius is described in prose without an explicit field
2	No dry run, changes are applied directly to the cluster, bypassing GitOps
1	Tasks are not idempotent, a re-run breaks state
0	Actions are performed manually, without being recorded in tasks

Verification — Given/When/Then, schemas, stress, monitoring

Score	Verification
5	Given/When/Then covers happy and negative paths, JSON Schema validates inputs and outputs, a stress spec and post-metrics over two windows are present
4	All elements are present, but the stress spec covers only one class of violations
3	Given/When/Then and the schema are present, but monitoring is checked in one window
2	Only Given/When/Then, without a schema and without post-metrics
1	Validation reduces to checking the exit code or a single screenshot
0	`validation.md` is missing or is not run

Process — "webhook → CLI → diff → replay" tracing

Score	Process
5	Every step (webhook, normalization, CLI command, diff, commit, validate) is linked through `incident_id`, the log is reproducible, replay yields the same diff
4	Tracing is complete, but replay requires manually substituting one variable
3	The webhook and CLI are linked, but the diff is not tied to incident_id
2	A log exists, but the order of steps can only be recovered by time
1	Actions are recorded in chat, not in files
0	There is no trace, the source of the incident is unknown

Security — guardrails, emergency stop, rollback, escalation

Score	Security
5	Guardrails forbid expanding the blast radius, an emergency stop is present, the rollback condition is recorded before execution, escalation to manual confirmation on uncertainty
4	All elements are present, but escalation is described only in prose, without a formal trigger
3	Rollback and guardrails are present, emergency stop is missing
2	Only rollback is present, without guardrails and without escalation
1	A "manual rollback" is claimed, but no executable path is described
0	The safety gateway is not defined, actions proceed without restrictions

How the Sum Is Calculated and What Blocks Merge

The sum of scores gives a total from 0 to 25. The pass threshold for auto-allowance is 23/25: below this line the pipeline does not receive production-ready status, even if three categories are at the maximum. A zero in Security is forbidden at any total. A 0 in this column means the absence of a protective ring and blocks even semi-manual mode until a minimum of rollback, guardrails, and escalation appear.

Blocking conditions do not depend on the sum. Each of these cases blocks merge on its own:

failed validation (Verification ≤ 2);
missing rollback (Security ≤ 2);
undefined blast radius (Implementation ≤ 2 without an explicit field).

At a total of 20–22 the pipeline is allowed only in semi-manual mode and only if there are no blocking conditions above: a stop after every implement step, explicit human confirmation, a mandatory spec update, and a re-assessment before returning to the auto loop.
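The rules of this section can be combined into one classification function. The sketch below is an illustration under assumptions: the function name classify, the blast_radius_explicit flag, and the treatment of totals below 20 as blocked are hypothetical, since the chapter defines no automated mode for that range.

```python
CATEGORIES = ("spec", "implementation", "verification", "process", "security")
AUTO_THRESHOLD = 23
SEMI_MANUAL_FLOOR = 20


def classify(scores: dict[str, int], blast_radius_explicit: bool = True) -> tuple[str, int, list[str]]:
    """Return (mode, total, blockers) for a filled rubric."""
    total = sum(scores[name] for name in CATEGORIES)
    blockers: list[str] = []
    # Blocking conditions are evaluated independently of the sum.
    if scores["verification"] <= 2:
        blockers.append("failed validation (Verification <= 2)")
    if scores["security"] <= 2:
        # Also covers a zero in Security, which is forbidden at any total.
        blockers.append("missing rollback (Security <= 2)")
    if scores["implementation"] <= 2 and not blast_radius_explicit:
        blockers.append("undefined blast radius (Implementation <= 2 without an explicit field)")

    if blockers:
        return "blocked", total, blockers
    if total >= AUTO_THRESHOLD:
        return "auto", total, blockers
    if total >= SEMI_MANUAL_FLOOR:
        return "semi-manual", total, blockers
    # Assumption: below 20 no automated mode is defined, so treat it as blocked.
    return "blocked", total, blockers


example = {"spec": 5, "implementation": 4, "verification": 5, "process": 5, "security": 4}
print(classify(example))
```

Checking the blockers before the sum mirrors the rule that a blocking condition stops merge on its own, even when the total would otherwise qualify for semi-manual mode.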

Checklist Before Production Cutover

Used when moving the gateway into a real process — each item is tied to the rubric where a hidden score drop is possible:

[ ] The spec contains WHY/WHAT/constraints and is tied to incident_id; every task has implements: pointing to a REQ identifier.
[ ] The dry run is logged before real changes; the blast radius is fixed at the pod or deployment level, not in words.
[ ] JSON Schema validates incident_event and the final validation_report; Given/When/Then cover happy and negative paths.
[ ] The rollback condition is recorded before execution and tested on a stand; an emergency stop is available to the operator without entering the cluster.
[ ] The webhook → CLI → diff → commit → validate trace is reproducible by incident_id; manual confirmation is triggered automatically on a repeated failure or a blast-radius expansion.

Example of a Filled Rubric for `high_memory_usage`

Category	Score	Justification
Spec	5	WHY (prevent OOMKill), WHAT (RSS below 80% for 10 minutes), constraints (do not touch stateful, rollback in 6 minutes) are explicit, Given/When/Then is assembled
Implementation	4	Tasks are idempotent, a dry run is present, but the scale-up branch lacks a separate dry-run step
Verification	5	Given/When/Then, JSON Schema on `incident_event` and `validation_report`, stress spec for hidden leak, post-metrics in two windows
Process	5	`incident_id=HM-2026-05-17-01` links the webhook, `/sdd:specify`, the diff, the commit, and the replay
Security	4	Guardrails on stateful workloads, rollback and emergency stop are present, escalation is described in prose without a formal trigger
Total	23/25	Production-ready by the threshold, but the scale-up branch remains semi-manual until a separate dry run

Full Track: Threshold Calibration

The "Low / Default / High" table for the readiness threshold, the exercise with overriding THRESHOLD, and the signals for a revisit are in Appendix D, section D.5. On the first pass the minimum of the chapter is already proven if readiness_pass.json passes, the audit/stateful fixtures are blocked, and delete_namespace is not in the pre-approved action list.

Examples and Application

A practical input log for Qwen Code can start like this: POST /hooks/grafana reports memory_percent=93, pod=api-7b4, namespace=appointments-api, window=10m. Then POST /hooks/pagerduty confirms severity=critical and links the event to the service appointments-api. The normalizer creates an incident_event with incident_id=HM-2026-05-17-01, strips sensitive fields, attaches source references, and runs the custom command /sdd:specify --event incident_event.json --preset high_memory_usage or the equivalent qwen -p prompt — both options are part of the recommended framework from the chapter header and are implemented through project commands around Qwen Code.
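The normalization step described above can be sketched as follows. The payload shapes here are hypothetical and deliberately simplified; they are not the real Grafana or PagerDuty webhook schemas, and the source_refs format is an assumption. Only the target field names and the sample values come from this chapter.

```python
def normalize(grafana: dict, pagerduty: dict) -> dict:
    """Merge two source payloads into a single incident_event."""
    if grafana["incident_key"] != pagerduty["incident_key"]:
        raise ValueError("payloads describe different incidents")
    key = grafana["incident_key"]
    # Building a new dict from named fields drops everything else,
    # including any sensitive fields present in the raw payloads.
    return {
        "incident_id": key,
        "service": pagerduty["service"],
        "namespace": grafana["namespace"],
        "pod": grafana["pod"],
        "severity": pagerduty["severity"],
        "window_minutes": int(grafana["window"].rstrip("m")),
        "metric_context": {"memory_percent": grafana["memory_percent"]},
        "source_refs": [f"grafana:{key}", f"pagerduty:{key}"],
    }


# Hypothetical simplified payloads with the values used in this chapter.
grafana_payload = {
    "incident_key": "HM-2026-05-17-01",
    "memory_percent": 93,
    "pod": "api-7b4",
    "namespace": "appointments-api",
    "window": "10m",
    "contact_email": "oncall@example.com",  # stand-in for a sensitive field
}
pagerduty_payload = {
    "incident_key": "HM-2026-05-17-01",
    "severity": "critical",
    "service": "appointments-api",
}

event = normalize(grafana_payload, pagerduty_payload)
print(event)
```

Comparing the result with a reference file field by field, as the training step does, then reduces to a plain equality check between two dictionaries.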

The first diff in specify.md records three blocks: WHAT (lower RSS below 80% in 10 minutes), WHY (prevent OOMKill and latency growth), constraints (do not touch stateful workloads, do not change the HPA, have a rollback in 6 minutes without improvement). On /plan the system compares two strategies: (A) restart the target pod and observe; (B) restart plus a temporary scale-up to four replicas. The verifier runs Given/When/Then: given a stateless pod and memory above 90% for 10 minutes; when only the target pod is restarted; then memory must drop below 80% and 5xx must not exceed the allowed threshold. If the stress spec shows that scale-up requires a change in rollout policy or hides a memory leak by growing the replica count, option B remains a backup branch with manual confirmation, not an automatic action.

At the implement step, a dry run runs first. Then the commit goes through GitOps and is synchronized into ArgoCD only when the validator status is green. The executor does not close the PagerDuty incident immediately after the restart. It waits for two monitoring windows, reconciles validation.md, checks the safety gateway, and adds links to the spec, tasks, commit, and validation result in the comment. If after 6 minutes memory does not drop or 5xx grows, the rollback path is activated, a human_review is created, and the readiness score is recalculated against the failed verification criterion.

Summary

The readiness of the production pipeline is captured by the 25-point model: five categories (Spec, Implementation, Verification, Process, Security), each with an equal weight of 5 points, mirror the stages of the SDD cycle. Equal weight is the principle: no category compensates for a gap in another, so the threshold of 23 allows at most two points of deduction in total. Production-ready means at least 23/25 with no critical violations in validation and the safety gateway. A drop below the threshold moves auto-remediation into semi-manual mode until the specification, the policy, or the execution path is fixed. Fully automated remediation remains a frontier scenario from the chapter header: allow it only after accumulating replay evidence and an operator-led dry run. Such a loop turns every future incident into a verifiable improvement of the system.

Errors as Part of the Contract

A production API should return not only PASS or BLOCK, but also an error type by which the orchestrator chooses recovery. Mixing all failures into failed is dangerous: a missing field in a webhook, an LLM timeout, an unavailable Kubernetes API, and a Safety prohibition require different actions.

Minimum taxonomy for this chapter:

Code	Where it occurs	Action
`VALIDATION_ERROR`	`incident_event` failed the schema	stop, return a fixable reason
`LLM_CALL_FAILED`	the model did not produce a spec or a plan	retry with a limit, then degraded mode without auto actions
`TOOL_EXECUTION_FAILED`	`check_readiness.py`, `dry_run.py`, or an external API returned a failure	if retryable, retry; otherwise escalate
`AGENT_WORKFLOW_FAILED`	the `webhook → specify → readiness → dry_run` chain lost a required step	block auto mode and record the `correlation_id`

For high_memory_usage degraded mode means: normalize the event, write readiness.md, show the operator the recommended next step, but do not run restart_pod automatically. This is honest degradation: the system keeps the evidence and does not expand the blast radius when the model or the tool is unavailable.
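The taxonomy can be encoded so that the orchestrator picks a recovery path from the code alone. The sketch below is a hedged illustration: the action labels, the MAX_RETRIES value, and the error_record shape are assumptions, since the chapter only says "retry with a limit" and requires a stable code, retryability, and a correlation_id.

```python
from enum import Enum


class ErrorCode(str, Enum):
    VALIDATION_ERROR = "VALIDATION_ERROR"
    LLM_CALL_FAILED = "LLM_CALL_FAILED"
    TOOL_EXECUTION_FAILED = "TOOL_EXECUTION_FAILED"
    AGENT_WORKFLOW_FAILED = "AGENT_WORKFLOW_FAILED"


MAX_RETRIES = 2  # assumption: the limit itself is not given in the chapter


def next_action(code: ErrorCode, attempt: int, retryable: bool = False) -> str:
    """Map an error code to the recovery action from the taxonomy table."""
    if code is ErrorCode.VALIDATION_ERROR:
        return "stop"  # return a fixable reason to the caller
    if code is ErrorCode.LLM_CALL_FAILED:
        # Retry with a limit, then degraded mode without auto actions.
        return "retry" if attempt < MAX_RETRIES else "degraded"
    if code is ErrorCode.TOOL_EXECUTION_FAILED:
        # Capping tool retries with the same limit is an assumption.
        return "retry" if retryable and attempt < MAX_RETRIES else "escalate"
    return "block_auto"  # AGENT_WORKFLOW_FAILED: a required step was lost


def error_record(code: ErrorCode, correlation_id: str, attempt: int, retryable: bool = False) -> dict:
    """Build the record that accompanies every BLOCK."""
    return {
        "code": code.value,
        "correlation_id": correlation_id,
        "retryable": retryable,
        "action": next_action(code, attempt, retryable),
    }


print(error_record(ErrorCode.LLM_CALL_FAILED, "HM-2026-05-17-01", attempt=2))
```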

Artifacts and Readiness Criteria

Artifact	Ready when
Normalized `incident_event`	matches `examples/real-api/fixtures/incident_event.expected.json` field by field; Specify records WHY/WHAT/constraints and does not pick a remediation command
Local run of the readiness gateway	`readiness_pass.json` passes; `audit/stateful` fixtures are blocked with a specific reason
`dry_run.py` on an allowed and a forbidden action	`restart_pod` PASS, `delete_namespace` BLOCK
Error taxonomy	every BLOCK has a stable code, retryability, and `correlation_id`
Record in `capstone/readiness.md`	score, blocking conditions, one actually run command

The full track adds specs/high_memory_usage/specify.md, plan.md, tasks.md, and validation.md, a GitOps diff or commit linked to incident_id, the decision log webhook → CLI → diff → commit → validate, and a filled 25-point readiness table with evidence. Consider it ready if the plan and tasks have a blast radius, a dry run, a rollback condition, and a human-confirmation trigger; validation checks two metric windows, 5xx, latency, and the safety gateway; the custom commands are either shaped as project commands or replaced with qwen -p prompts or project scripts; the readiness total is at least 23/25 with no blocking conditions on rollback, verification, or blast radius.

Practice

cd book2/examples/real-api && python3 scripts/normalize_webhook.py --grafana fixtures/webhook_grafana.json --pagerduty fixtures/webhook_pagerduty.json --expected fixtures/incident_event.expected.json — *expectation: code 0, the normalized incident_event matches the reference field by field.*
Run the four checks separately (each returns its own code, so && between them is not appropriate):

   python3 scripts/check_readiness.py --readiness fixtures/readiness_pass.json
   python3 scripts/check_readiness.py --readiness fixtures/readiness_block_audit.json
   python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action restart_pod
   python3 scripts/dry_run.py --spec specs/high_memory_usage/specify.md --action delete_namespace

*Expectation: readiness_pass → code 0, PASS incident=HM-… score=24/25; readiness_block_audit → code 1, BLOCK … score=22/25 with the reasons "score 22/25 below threshold 23" and "audit_trace_coverage=0.7 < 1.0 — full coverage is required"; restart_pod PASS, delete_namespace BLOCK.*

Score your case on the 25-point model and fill in the table below. For each category, indicate the score, the artifact-evidence, and the reason for a reduction if the score is below 5. Sum the total, check the blocking conditions, and articulate what needs to change for the pipeline to pass the 23/25 threshold. *Expectation: all three fields are filled in each row; "Artifact-evidence" points to a specific file or run, not a generic wording; the total cell contains a number of the form N/25 and a list of blocking conditions, or an explicit "no blockers".*

Category	Score	Artifact-evidence	Reason for reduction
Spec
Implementation
Verification
Process
Security
Total	N/25	Blocking conditions:	What to change before cutover:

Review Questions

Why should specify not pick a specific remediation command?
What conditions make auto-remediation unacceptable?
What blocks acceptance if readiness is below 23/25?

A high_memory_usage webhook arrived after hours, auto-remediation is ready to restart the pod. The readiness model gives 22/25 (minus 3 for an incomplete audit). What will you do — restart, wait until morning, or call the on-call engineer?