Topic: Applied Volume. Production SDD for Qwen Code CLI
Difficulty level: Medium
Estimated study time: 40-60 hours (first pass), 80-120 hours (with full track and practical exam)
Prerequisites: Completion of the first volume of the textbook (book/) — basic SDD cycle on AgentClinic
Understanding of requirements.md, plan.md, validation.md, QWEN.md structure
Working with feature boundaries, negative requirements, and fact verification
Basic command line and bash scripting skills
Experience with LLM tools (preferably Qwen Code CLI)
Understanding of CI/CD basics and production infrastructure
Learning objectives: Upon completion, the student will be able to reconstruct a specification from legacy code and document provenance in genealogy.md for one production case
The student will be able to diagnose specification defects, create a poisoned/fixed pair, and document constitution.md with immutable and mutable rules
The student will be able to run and interpret LLM duel results, mutation testing, and Spec CI, recording control facts in validation.md
The student will be able to accept or reject a shadow specification, assess model tier budget risk, and document judgment.md with evidence_ref
The student will be able to assemble the final capstone/README.md package, check it for antipatterns, and prepare three risks in blocker/owner/next_check format
Overview: The applied volume of the textbook transfers the basic SDD (Specification-Driven Development) cycle from the first volume into production scenarios. Where the first volume taught constitution, feature specification, plan, verifiable facts, implementation, review, and replanning on the AgentClinic platform, the second volume works with legacy traces, validators, multi-agent checks, Spec CI, metrics, model budgets, and limited auto-remediation. The main rule: the first pass must leave one small verifiable trace in capstone/; it does not need to introduce all the terminology. The main exam case is high_memory_usage. The other cases (autoscale_200pct, node_not_ready, appointment_latency, cdn_error_budget_burn) demonstrate specific mechanisms, and each is followed by a one-line transfer of the principle to the main case.
Key concepts: AgentClinic-production: Extension of the basic AgentClinic platform to production scenarios: working with existing code rather than greenfield development. Requirements are recovered from legacy traces, not written from scratch.
genealogy.md: Provenance artifact: where a requirement came from. Records source (post-mortem, ticket, interview), uncertainty level, and transformation chain from raw signal to specification.
poisoned-spec.md / fixed-spec.md: Diagnostic artifact pair for specification defects. Poisoned: a specification with a hidden defect (contradiction, implicit assumption, priority conflict). Fixed: the corrected version with explicit rationale.
constitution.md: Project constitution: rules constraining agent actions. Immutable rules cannot be changed without a referendum (e.g., 'do not touch production configuration during business hours'). Mutable rules carry a ttl and a rollback_condition and can adapt as precedents accumulate.
LLM duel (verifier vs implementor): Multi-agent verification of formal claims. The Verifier searches for counterexamples to Then-claims in the specification; the Implementor defends. Result: a counterexample or a next_guard (condition under which specification is violated).
Stress-mutator / mutation testing of specifications: Automatic generation of specification mutants to check validator robustness. Smoke result shows which mutants passed undetected — this is the validator's immunity vector.
Shadow specifications (shadow specs): Candidate specifications generated by agent in parallel to main. Selected based on criteria: do not contradict constitution, cover new facts, do not duplicate existing. Accepted — into capstone/README.md, rejected — with reason.
Spec CI: Specification CI: specification as executable artifact. Checks requirement coverage, contract schema, validation.md presence. Outputs PASS/BLOCK on every commit.
File arbitration (multiagent tribunal): Role-based review of disputed changes: Safety (security), Liveness (functionality), Economy (resources). Result — judgment.md with verdict, evidence_ref and dissenting opinion if present.
Tier budgets (tier budgeting): Task routing across models of different levels and costs. Cheap tier — quick checks, expensive — arbitration. token_health tracks spend; upon cheap tier failure — fallback or manual escalation.
Goodhart protection: Paired metric: KPI (target) + guard metric (watchdog, preventing system gaming). For example: KPI 'remediation time' + guard 'number of false remediation triggers'.
Readiness / dry-run: Production API integration: readiness metric shows environment readiness for deploy (e.g., 23/25 checks passed). Dry-run — execution without side effects for validation before real application.
antipattern-audit.md: Three diagnostic risks in blocker / owner / next_check format, recorded after the antipattern pass. Converting every antipattern into a CI policy is not part of this artifact; that belongs to the full track.
capstone/README.md: Final assembly of all artifacts for one case. Must be readable without chat history as a self-contained evidence package from legacy trace to production-ready solution.
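The Readiness / dry-run entry above can be made concrete with a small sketch. It is an illustration only: the Readiness class, the required_ratio threshold, and the apply_change helper are hypothetical names, not part of Qwen Code CLI or the course scripts. Only the 23/25 figure comes from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Readiness:
    """Readiness metric: how many environment checks passed."""
    passed: int
    total: int

    @property
    def ratio(self) -> float:
        return self.passed / self.total

    def is_ready(self, required_ratio: float = 1.0) -> bool:
        # required_ratio is an assumed policy knob; the text does not fix a threshold.
        return self.ratio >= required_ratio


def apply_change(action: Callable[[], str], dry_run: bool = True) -> str:
    """Run the action only when dry_run is False; otherwise report and skip it."""
    if dry_run:
        return "DRY-RUN: action skipped, no side effects"
    return action()


readiness = Readiness(passed=23, total=25)
summary = f"readiness {readiness.passed}/{readiness.total} ({readiness.ratio:.0%})"
print(summary)
# With two checks failing, the change stays in dry-run mode.
print(apply_change(lambda: "applied", dry_run=not readiness.is_ready()))
```

Defaulting dry_run to True is a deliberate choice in this sketch: a real side effect has to be requested explicitly.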
Important dates: 2026-05-20: Version v1.0 of applied volume — verified and fixed. Cutoff date for textbook materials.
Chapter 0: Getting started — selecting high_memory_usage case, creating empty capstone/
Chapters 1-3: Forming basic artifacts: genealogy.md, poisoned/fixed pair, constitution.md
Chapters 4-5: Obtaining counterexample and stress-mutator smoke result
Chapters 6-7: Shadow candidate selection, launching Spec CI
Chapters 8-9: Building judgment.md, playing out cheap tier failure
Chapters 10-11: Checking guard metrics, readiness and dry-run for high_memory_usage
Chapter 12: Recording three risks blocker/owner/next_check
Chapter 13: Building final capstone/README.md and practical exam
Practice exercises: Name: Recovering requirements from post-mortem
Problem: Given a fragment of post-mortem for node_not_ready incident: 'At 03:15 node-7 went NotReady, pods did not evacuate within 5 minutes, clients received timeouts. Manual drain + reboot helped in 12 minutes. Auto-evacuation failed due to taint custom-scheduler=protected'. Recover one verifiable requirement and document genealogy.md. Specify uncertainty level (high/medium/low) and justify.
Solution: 1. Extract raw signal: auto-evacuation failed on node with taint custom-scheduler=protected. 2. Transform into verifiable requirement: 'If a node has taint custom-scheduler=protected, then automatic pod evacuation must be disabled or replaced with manual workflow with on-call notification'. 3. In genealogy.md record: source='post-mortem-2024-03-15-node7', raw_signal='auto-evacuation skipped due to protected taint', uncertainty='medium' (unclear whether intentional or bug), requirement='protected-taint-evacuation-policy', derivation_steps=['incident report', 'taint analysis', 'policy decision']. 4. Justification for medium: single incident, no confirmation that behavior is intentional, needs replay or additional case for low.
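The record from step 3 can be sketched as a small data structure. The field names come from the solution above; the GenealogyEntry class, its validation rules, and the rendered key: value layout are assumptions, since the text does not fix a file format for genealogy.md.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_UNCERTAINTY = ("high", "medium", "low")


@dataclass
class GenealogyEntry:
    source: str
    raw_signal: str
    uncertainty: str
    requirement: str
    derivation_steps: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject entries that cannot be traced or graded.
        if self.uncertainty not in ALLOWED_UNCERTAINTY:
            raise ValueError(f"uncertainty must be one of {ALLOWED_UNCERTAINTY}")
        if not self.derivation_steps:
            raise ValueError("derivation_steps must not be empty")

    def render(self) -> str:
        """Render the entry as plain 'key: value' lines (assumed layout)."""
        return "\n".join([
            f"source: {self.source}",
            f"raw_signal: {self.raw_signal}",
            f"uncertainty: {self.uncertainty}",
            f"requirement: {self.requirement}",
            "derivation_steps: " + " -> ".join(self.derivation_steps),
        ])


entry = GenealogyEntry(
    source="post-mortem-2024-03-15-node7",
    raw_signal="auto-evacuation skipped due to protected taint",
    uncertainty="medium",
    requirement="protected-taint-evacuation-policy",
    derivation_steps=["incident report", "taint analysis", "policy decision"],
)
print(entry.render())
```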
Complexity: intermediate
Name: Creating poisoned/fixed pair for priority conflict
Problem: Given specification: 'Given appointment booking system; When number of simultaneous requests exceeds 100; Then all requests are processed in FIFO order with maximum delay 2 seconds'. Find defect, create poisoned-spec.md and fixed-spec.md. Show backwards walk from defect to root.
Solution: 1. Defect: implicit assumption that 'all requests' are equal. VIP patients and emergency requests should not wait in FIFO. 2. Poisoned-spec.md: preserve original, add label DEFECT='priority-blindness', trigger='emergency request during peak load', expected_failure='VIP patient waits 2+ seconds while routine appointment processed'. 3. Fixed-spec.md: 'Given appointment booking system with prioritized queues (routine, urgent, emergency); When number of simultaneous requests exceeds 100; Then emergency requests processed with delay ≤ 500ms, urgent ≤ 1s, routine — FIFO with delay ≤ 2s when capacity available'. 4. Backwards walk: expected failure (VIP timeout) → cause (no priorities in Then) → cause (no classification in Given) → root (domain assumption 'all patients equal', incorrect for healthcare).
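A minimal sketch of the fixed specification, assuming a single in-memory queue. The delay limits are the ones from fixed-spec.md above; the PriorityBookingQueue class, the violates_then helper, and the request identifiers are hypothetical. Within one priority class the order stays FIFO, which is what the sequence counter is for.

```python
import heapq
from itertools import count
from typing import List, Tuple

# Delay limits from the fixed spec, in milliseconds.
MAX_DELAY_MS = {"emergency": 500, "urgent": 1000, "routine": 2000}
# Lower rank is served first.
RANK = {"emergency": 0, "urgent": 1, "routine": 2}


class PriorityBookingQueue:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = count()  # preserves arrival order inside a priority class

    def push(self, request_id: str, priority: str) -> None:
        heapq.heappush(self._heap, (RANK[priority], next(self._seq), request_id))

    def pop(self) -> str:
        _, _, request_id = heapq.heappop(self._heap)
        return request_id


def violates_then(priority: str, delay_ms: int) -> bool:
    """True when an observed delay breaks the Then-clause for that class."""
    return delay_ms > MAX_DELAY_MS[priority]


queue = PriorityBookingQueue()
for request_id, priority in [("r1", "routine"), ("e1", "emergency"),
                             ("r2", "routine"), ("u1", "urgent")]:
    queue.push(request_id, priority)

order = [queue.pop() for _ in range(4)]
print(order)
```

Under the poisoned specification the same arrivals would be served strictly as r1, e1, r2, u1, which is the priority-blindness defect.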
Complexity: intermediate
Name: Running stress-mutator and interpreting immunity
Problem: In directory book2/examples/ run bash smoke_all.sh. Find stress-mutator output for payment_latency_spike case. How many mutants passed undetected? What validator immunity vector does this indicate? Record result in validation.md.
Solution: 1. Navigate to book2/examples/ and execute bash smoke_all.sh. 2. Find block [stress-mutator] payment_latency_spike. Example expected output: 'MUTANTS_GENERATED=12, CAUGHT=9, ESCAPED=3, ESCAPE_VECTOR=timing-assertion-weakness'. 3. ESCAPED=3 indicates validator does not check time boundary conditions strictly enough. Immunity vector: 'add strict P99 < 200ms check when spike > 500% baseline, not just average'. 4. Add to validation.md: 'Mutation immunity: stress-mutator payment_latency_spike, 3/12 escaped, vector timing-assertion-weakness. Strengthening: strict P99-guard under spike-condition'.
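The interpretation in steps 2-4 can be sketched as a small parser. It assumes the stress-mutator block prints exactly the example line quoted above; the real output of smoke_all.sh may be formatted differently, and parse_smoke_line is a hypothetical helper, not part of the course scripts.

```python
from typing import Dict, Union


def parse_smoke_line(line: str) -> Dict[str, Union[int, str]]:
    """Split 'KEY=value, KEY=value' pairs; numeric values become ints."""
    fields: Dict[str, Union[int, str]] = {}
    for part in line.split(","):
        key, _, value = part.strip().partition("=")
        fields[key] = int(value) if value.isdigit() else value
    return fields


line = ("MUTANTS_GENERATED=12, CAUGHT=9, ESCAPED=3, "
        "ESCAPE_VECTOR=timing-assertion-weakness")
result = parse_smoke_line(line)

# Sanity check: every generated mutant is either caught or escaped.
assert result["CAUGHT"] + result["ESCAPED"] == result["MUTANTS_GENERATED"]

escape_rate = result["ESCAPED"] / result["MUTANTS_GENERATED"]
note = (
    f"Mutation immunity: stress-mutator payment_latency_spike, "
    f"{result['ESCAPED']}/{result['MUTANTS_GENERATED']} escaped "
    f"({escape_rate:.0%}), vector {result['ESCAPE_VECTOR']}."
)
print(note)
```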
Complexity: intermediate
Name: Shadow specification selection: accept or reject
Problem: Given two shadow candidates for voice_handoff case: A) 'When transferring voice call between operators, full audio recording of conversation is preserved' and B) 'When transferring voice call between operators, audio recording is preserved only from moment of acceptance by receiving operator'. Project constitution contains immutable rule: 'Full audio recording of all client interactions stored 7 years for compliance'. Accept or reject each candidate? Justify and record in capstone/README.md Shadow notes block.
Solution: 1. Candidate A: check against constitution — does not contradict immutable rule (full recording). Covers new fact (handoff as part of interaction). Does not duplicate existing (handoff not previously explicitly specified). DECISION: accept. 2. Candidate B: contradicts immutable rule ('only from moment of acceptance' ≠ 'full audio recording of all interactions'). DECISION: reject, reason 'violates constitution.md §3.1 immutable rule full-recording-compliance'. 3. In capstone/README.md Shadow notes block: 'Accepted: shadow.p0.voice_handoff.full-recording — compliant with constitution, covers handoff-gap. Rejected: shadow.p0.voice_handoff.partial-recording — violates immutable full-recording-compliance. Precedent: compliance priority over storage optimization'.
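The three selection criteria can be sketched as a rule check. This is a simplification under stated assumptions: each candidate is reduced to a hypothetical recording_scope attribute and a set of covered facts, and the immutable rule is modelled as a predicate. The ShadowCandidate class, the select function, and the reason strings are illustrative; the spec identifiers and the rule name come from the solution above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class ShadowCandidate:
    spec_id: str
    recording_scope: str          # "full" or "partial" (assumed attribute)
    covers: FrozenSet[str]        # facts the candidate claims to cover


# Immutable rules as predicates; only the rule name comes from the text.
IMMUTABLE_RULES: Dict[str, Callable[[ShadowCandidate], bool]] = {
    "full-recording-compliance": lambda c: c.recording_scope == "full",
}


def select(candidate: ShadowCandidate, known_facts: Set[str],
           existing_ids: Set[str]) -> Tuple[str, str]:
    for rule_name, satisfied in IMMUTABLE_RULES.items():
        if not satisfied(candidate):
            return "reject", f"violates immutable {rule_name}"
    if candidate.spec_id in existing_ids:
        return "reject", "duplicates existing specification"
    if not candidate.covers - known_facts:
        return "reject", "covers no new facts"
    return "accept", "compliant with constitution, covers new facts"


candidate_a = ShadowCandidate("shadow.p0.voice_handoff.full-recording",
                              "full", frozenset({"handoff-gap"}))
candidate_b = ShadowCandidate("shadow.p0.voice_handoff.partial-recording",
                              "partial", frozenset({"handoff-gap"}))

decision_a = select(candidate_a, known_facts=set(), existing_ids=set())
decision_b = select(candidate_b, known_facts=set(), existing_ids=set())
print(decision_a)
print(decision_b)
```

The constitution check runs first on purpose: a candidate that breaks an immutable rule is rejected regardless of how many new facts it covers.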
Complexity: intermediate
Name: Assessing budget risk and documenting budget-note.md
Problem: Scenario: cheap tier (Qwen2.5-7B-instruct, $0.10/1K tokens) failed when checking complex autoscale_200pct contract — gave false PASS on violated Then-condition. Expensive tier (Qwen2.5-72B-instruct, $1.20/1K tokens) caught the error. Current token_health: 3400 tokens remaining in cheap tier daily budget, need to check 15 specifications. Document budget-note.md with budget risk and failure scenario.
Solution: 1. Calculate risk: 15 specifications × average 400 tokens consumption = 6000 tokens needed, 3400 available. Deficit: 2600 tokens (43% short). 2. Cheap tier failure probability: high — one false PASS already showed 7B model cannot handle scale contracts. 3. Failure scenario: upon cheap budget exhaustion, fallback to expensive tier increases check cost from $0.60 to $7.20 (12×). If expensive also insufficient — manual escalation with 4-24 hour delay. 4. budget-note.md: 'tier: cheap, model: Qwen2.5-7B-instruct, token_health: 3400/6000 needed, risk_level: critical, failure_mode: false_PASS_on_scale_contracts, fallback: expensive_tier_with_cost_spike, mitigation: pre-filter_scale_contracts_to_expensive_tier, owner: on-call-SRE, next_check: 2026-05-21T09:00Z'.
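The arithmetic in steps 1 and 3 can be checked with a few lines. All inputs are taken from the problem and solution above; the variable names are illustrative, and the 400-token average is the solution's own working assumption.

```python
SPEC_COUNT = 15
AVG_TOKENS_PER_SPEC = 400          # average assumed in the solution
CHEAP_TOKENS_AVAILABLE = 3400
CHEAP_PRICE_PER_1K = 0.10          # USD per 1K tokens
EXPENSIVE_PRICE_PER_1K = 1.20      # USD per 1K tokens

tokens_needed = SPEC_COUNT * AVG_TOKENS_PER_SPEC
deficit = max(0, tokens_needed - CHEAP_TOKENS_AVAILABLE)
deficit_pct = round(100 * deficit / tokens_needed)

# Cost of checking all specifications on each tier, rounded to cents.
cheap_cost = round(tokens_needed / 1000 * CHEAP_PRICE_PER_1K, 2)
expensive_cost = round(tokens_needed / 1000 * EXPENSIVE_PRICE_PER_1K, 2)
cost_multiplier = round(expensive_cost / cheap_cost)

print(f"needed={tokens_needed}, deficit={deficit} ({deficit_pct}% short)")
print(f"cheap=${cheap_cost:.2f}, expensive=${expensive_cost:.2f}, x{cost_multiplier}")
```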
Complexity: advanced
Name: Paired anti-Goodhart metric for remediation KPI
Problem: SRE team KPI: 'Mean time to remediate incident (MTTR) ≤ 15 minutes'. What guard metric prevents gaming this metric? Document goodhart-note.md for cdn_error_budget_burn case.
Solution: 1. Gaming strategy: automatically 'remediate' incidents by setting a resolved marker without an actual fix, which makes MTTR look good. 2. Guard metric: 'Percentage of remediations with confirmed fix within 24 hours' (confirmed_fix_rate) + 'Percentage of repeat incidents for same cause within 7 days' (recurrence_rate). 3. Alert condition: if MTTR ≤ 15min (KPI green) but confirmed_fix_rate < 80% or recurrence_rate > 15%, enter emergency mode and prohibit automatic remediation without human confirmation. 4. goodhart-note.md: 'KPI: MTTR ≤ 15min, guard: confirmed_fix_rate ≥ 80% AND recurrence_rate ≤ 15%, emergency_mode_trigger: MTTR_green_but_guard_red, action: disable_auto_remediation_require_human_approval, case: cdn_error_budget_burn_2024-11'.
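The paired check can be sketched as one function. The thresholds and the action name come from goodhart-note.md above; the evaluate function and the "ok" and "kpi_red" labels are assumptions added for illustration.

```python
def evaluate(mttr_min: float, confirmed_fix_rate: float,
             recurrence_rate: float) -> str:
    """Combine the KPI with its guard metrics and return a status label."""
    kpi_green = mttr_min <= 15
    guard_green = confirmed_fix_rate >= 0.80 and recurrence_rate <= 0.15

    if kpi_green and not guard_green:
        # MTTR_green_but_guard_red: the KPI looks good while fixes do not hold.
        return "emergency: disable_auto_remediation_require_human_approval"
    if kpi_green:
        return "ok"
    # Handling of a red KPI is outside the note; this label is an assumption.
    return "kpi_red"


print(evaluate(mttr_min=9, confirmed_fix_rate=0.62, recurrence_rate=0.10))
print(evaluate(mttr_min=12, confirmed_fix_rate=0.91, recurrence_rate=0.05))
```

The first call models the gaming scenario: fast closure with a low confirmed fix rate switches the system into emergency mode even though MTTR is green.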
Complexity: intermediate
Case studies: Name: Implementing Spec CI in fintech startup payment environment
Scenario: Fintech startup with 50 microservices processing 10M transactions/day. Team of 8 developers, 2 SRE. The team previously kept ad-hoc specifications in Confluence, which went out of date within 2-3 weeks. After a double-charging incident ($340K lost to compensation), management demanded 'verifiable requirements for every change'.
Challenge: Legacy payment gateway code, 8 years old — no current specifications, only code comments and verbal agreements. Developers resisted 'extra bureaucracy'. CI pipeline took 23 minutes, adding Spec CI risked increasing to 40+. Needed to show value without blocking delivery.
Solution: Applied the applied-volume approach, chapters 1-7, in compressed form. 1) Selected one critical case: payment_latency_spike (chapter 5). 2) Recovered genealogy.md from the double-charging post-mortem (chapter 1). 3) Found the poisoned-spec: 'when latency > 2s retry automatically' without an idempotency-key, which creates double-charging (chapter 2). 4) Created constitution.md with the immutable rule 'all payment retries require idempotency-key' (chapter 3). 5) LLM duel found a counterexample: retry on network_timeout vs retry on insufficient_funds behave differently (chapter 4). 6) Stress-mutator checked the validator: 2/8 mutants passed undetected, so the idempotency check was strengthened (chapter 5). 7) Spec CI implemented only on payment services, not all 50, with one line in CI: 'spec-ci --scope=payment-* --block-on-uncovered' (chapter 7). Time: +4 minutes to pipeline.
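The scoping idea in step 7 can be sketched as follows. This is not the spec-ci tool itself: the spec_ci function, the service names, and the requirement identifiers are hypothetical, and the sketch only mirrors the behaviour the two flags suggest (glob-style scope, BLOCK when a scoped service has uncovered requirements).

```python
from fnmatch import fnmatch
from typing import Dict, List, Set, Tuple


def spec_ci(services: Dict[str, Dict[str, Set[str]]], scope: str,
            block_on_uncovered: bool = True) -> Tuple[str, Dict[str, List[str]]]:
    """Return (verdict, uncovered requirements per in-scope service)."""
    uncovered: Dict[str, List[str]] = {}
    for name, info in services.items():
        if not fnmatch(name, scope):
            continue  # service is outside the selected scope
        missing = info["requirements"] - info["covered"]
        if missing:
            uncovered[name] = sorted(missing)
    verdict = "BLOCK" if uncovered and block_on_uncovered else "PASS"
    return verdict, uncovered


# Hypothetical inventory: two payment services and one unrelated service.
services = {
    "payment-gateway": {
        "requirements": {"idempotency-key-on-retry", "latency-retry-policy"},
        "covered": {"latency-retry-policy"},
    },
    "payment-refunds": {
        "requirements": {"idempotency-key-on-retry"},
        "covered": {"idempotency-key-on-retry"},
    },
    "search-api": {"requirements": {"query-timeout"}, "covered": set()},
}

verdict, uncovered = spec_ci(services, scope="payment-*")
print(verdict, uncovered)
```

The uncovered requirement in search-api is ignored because the service is out of scope, which is how a subset rollout avoids blocking the other teams.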
Result: After 3 months: 0 double-charging incidents (4 in the previous 6 months). Spec CI caught 12 uncovered requirements before production. The developers' own assessment: 'we see why this is needed, not just paperwork'. Expansion to other services was initiated by developers, not management. Average time to add a specification fell from 45 minutes to 12 (templates from capstone/).
Lessons learned: Start with one critical case, not 'implement everywhere' — the 'one small verifiable trace' principle works for organizational change too
Poisoned-spec with real financial damage is more convincing than any training — 'here's where we already failed'
Spec CI on subset of services delivers value faster than waiting for full coverage
Developers accept process when they see it catches real bugs before production, not when 'dictated from above'
Related concepts: genealogy.md
poisoned-spec.md / fixed-spec.md
constitution.md
LLM duel
stress-mutator
Spec CI
capstone/README.md
Name: Arbitration of disputed autoscale change in cloud platform
Scenario: Cloud platform with 2000+ Kubernetes clusters. Platform-SRE team proposed change: 'When CPU > 80% for 5 minutes — autoscaling +200% nodes'. Change disputed: Liveness team believes this will save from cascade failures; Safety team — that 200% sharp jump creates overprovisioning and cost-spike risk for short bursts.
Challenge: Classic priority conflict: availability vs economics. Previously such decisions were made by senior SRE unilaterally — created perception of 'dictatorship', decisions not reproducible. Needed formal process with evidence, suitable for audit and training new engineers.
Solution: Applied file arbitration (chapter 8) with role model. 1) Safety prepared counterexample: burst at 4 minutes 50 seconds — should not trigger +200%, but under current specification it does trigger. 2) Liveness prepared evidence: incident history where 80% for 6+ minutes led to cascade failure in 73% of cases. 3) Economy calculated: cost of false-positive (extra nodes for 10 minutes) vs cost of cascade failure (downtime minutes × SLA penalties). 4) judgment.md: verdict='CONDITIONAL_APPROVE', condition 'add guard metric burst_duration < 5min AND rate_of_change > 50%/min to exclude short spikes', evidence_ref=['safety-counterexample-burst-290s', 'liveness-historical-73pct', 'economy-cost-model-v2']. Dissenting opinion Safety: 'would prefer 70% threshold, but accept 80% with guard condition'.
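The judgment.md record from step 4 can be sketched as a validated structure. The verdict, condition, evidence references, and dissenting opinion are quoted from the solution above; the Judgment class, the set of allowed verdicts, and the rendered layout are assumptions, since the text shows only the CONDITIONAL_APPROVE case.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed verdict vocabulary; only CONDITIONAL_APPROVE appears in the text.
ALLOWED_VERDICTS = ("APPROVE", "CONDITIONAL_APPROVE", "REJECT")


@dataclass
class Judgment:
    verdict: str
    evidence_ref: List[str]
    condition: str = ""
    dissent: Dict[str, str] = field(default_factory=dict)  # role -> opinion

    def __post_init__(self) -> None:
        if self.verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict}")
        if not self.evidence_ref:
            raise ValueError("a judgment needs at least one evidence_ref")
        if self.verdict == "CONDITIONAL_APPROVE" and not self.condition:
            raise ValueError("CONDITIONAL_APPROVE requires a condition")

    def render(self) -> str:
        lines = [f"verdict: {self.verdict}"]
        if self.condition:
            lines.append(f"condition: {self.condition}")
        lines.append("evidence_ref: " + ", ".join(self.evidence_ref))
        for role, opinion in self.dissent.items():
            lines.append(f"dissent ({role}): {opinion}")
        return "\n".join(lines)


judgment = Judgment(
    verdict="CONDITIONAL_APPROVE",
    condition=("add guard metric burst_duration < 5min AND "
               "rate_of_change > 50%/min to exclude short spikes"),
    evidence_ref=["safety-counterexample-burst-290s",
                  "liveness-historical-73pct",
                  "economy-cost-model-v2"],
    dissent={"Safety": "would prefer 70% threshold, "
                       "but accept 80% with guard condition"},
)
print(judgment.render())
```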
Result: Change implemented with guard condition. After 2 months: 3 cases of guard condition triggering (no scaling), all confirmed as short bursts. 1 case where 80% held for 7 minutes — scaling worked, prevented potential cascade failure. False-positive costs reduced 67% compared to original proposal. Arbitration process adopted as standard for all disputed changes > $10K potential impact.
Lessons learned: Role-based arbitration with dissenting opinion creates reproducible precedents — new engineers learn from history, not authority
Counterexample + historical data + economic model = three evidence types, together more convincing than any single one
CONDITIONAL_APPROVE better than binary approve/reject — preserves decision-making speed while adding guard conditions
Dissenting opinion in judgment.md does not weaken decision, but strengthens its legitimacy — shows alternatives were considered
Related concepts: file arbitration
judgment.md
evidence_ref
Safety/Liveness/Economy roles
guard metric
dissenting opinion
Name: Goodhart protection in CDN incident automation
Scenario: Major media streaming service with global CDN. KPI: 'CDN incident remediation time ≤ 5 minutes'. Achieved through automatic remediation: when error_rate > 5% — automatic cache purge + origin fallback. MTTR dropped from 12 minutes to 4.
Challenge: 3 months after implementation: MTTR = 3.2 minutes, but customers complain about buffering. Investigation: automatic remediation at error_rate > 5% triggered on false spikes (measurement error, not real 5xx). Cache purge on false trigger created thundering herd on origin, worsening real UX. Team 'gamed' KPI: MTTR excellent, actual service degraded.
Solution: Applied chapter 10 approach. 1) Diagnosis: KPI (MTTR) 'lies' — optimizing it worsened actual outcome. 2) Guard metric: 'Percentage of remediations with confirmed real 5xx (not measurement error)' + 'Origin load spike after remediation < 150% baseline'. 3) Emergency mode: when guard_red + KPI_green — disable auto-remediation, require human confirmation, alert on-call. 4) goodhart-note.md: 'KPI: MTTR ≤ 5min, guard: confirmed_5xx_rate ≥ 90% AND origin_post_remediation_load ≤ 150%, emergency_mode: 3 consecutive guard_red in 24h'. 5) Additionally: added dry-run mode (chapter 11) — for new incident types remediation runs without side effects, result checked by human.
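The emergency-mode trigger from step 4 can be sketched as follows. The thresholds come from goodhart-note.md above. The function names and sample timestamps are illustrative, and the reading of '3 consecutive guard_red in 24h' as 'the three most recent evaluations inside the window are all red' is an assumption; the text does not say how an intervening green result resets the count.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def guard_red(confirmed_5xx_rate: float, origin_load_ratio: float) -> bool:
    """Guard is red when either guard threshold from the note is violated."""
    return confirmed_5xx_rate < 0.90 or origin_load_ratio > 1.50


def emergency_mode(events: List[Tuple[datetime, bool]], now: datetime,
                   window: timedelta = timedelta(hours=24),
                   threshold: int = 3) -> bool:
    """events: (timestamp, is_guard_red) pairs, oldest first."""
    recent = [red for ts, red in events if timedelta(0) <= now - ts <= window]
    return len(recent) >= threshold and all(recent[-threshold:])


now = datetime(2025, 1, 1, 12, 0)  # illustrative timestamp
events = [
    (now - timedelta(hours=30), guard_red(0.50, 2.0)),  # outside the window
    (now - timedelta(hours=5), guard_red(0.61, 1.8)),   # red: both guards fail
    (now - timedelta(hours=3), guard_red(0.85, 1.2)),   # red: low confirmed rate
    (now - timedelta(hours=1), guard_red(0.95, 1.7)),   # red: origin overload
]
triggered = emergency_mode(events, now)
print("emergency mode:", triggered)

# A healthy remediation after the streak breaks it under this reading.
recovered = emergency_mode(events + [(now, guard_red(0.95, 1.1))], now)
print("after a green result:", recovered)
```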
Result: After guard metric implementation: MTTR rose to 6.5 minutes (worse than the KPI), but the share of confirmed real remediations reached 94% (vs 61% before). Buffering complaints fell by 78%. The team stopped 'gaming' the metric because doing so no longer paid off. Management revised the bonus structure: now 50% weight on guard metrics, 50% on KPI. Dry-run for new incident types prevented 2 potential false remediations in the first month.
Lessons learned: KPI without guard metric actively harms — engineers rationally optimize measurable, even when it doesn't match actual outcome
Emergency mode must be automatic — 'require human confirmation' on guard_red must trigger without additional decisions
Dry-run for new incident types — indispensable when expanding automation, otherwise every new case is potential Goodhart trap
Bonus structure must include guard metrics or KPI will win — organizational engineering more important than technical
Related concepts: Goodhart protection
KPI and guard metric
goodhart-note.md
emergency mode
dry-run
readiness.md
Study tips: Never read a chapter linearly from beginning to end. First find the 'Before reading' block, then complete the minimal learning scenario, then return to 'Key ideas'. Only after that — calibrations, [project script] and [conceptual interface].
Keep five questions while reading each chapter: Foundation from first volume? Minimal learning scenario? Control fact? How does it get into capstone/? What belongs to full track?
One new term rule: if chapter introduces five names, but only one needed for current capstone/ file — remember one, defer rest to second pass.
Create physical or digital 'throughput map' on wall/screen: which chapter → which capstone/ file → what exactly gets written there. Check against it after each chapter.
For chapters with other cases (not high_memory_usage) immediately write transfer line: 'Principle X from case Y protects high_memory_usage, because...'. If you don't write — chapter not closed.
Run bash book2/examples/smoke_all.sh after every chapter with runnable examples. Expected blocks are part of learning, not error. If example doesn't block where it should — check version.
Use capstone-dossier.md template as minimalism standard. If your package is 3× longer — likely you captured full track in first pass.
For visual learners: draw artifact pipeline genealogy → poisoned/fixed → constitution → validation → judgment → budget → goodhart → readiness → antipattern. Mark where you are.
For auditory learners: voice control facts aloud before writing to file. If sounds vague — fact insufficiently verifiable.
For kinesthetic learners: physically move 'cards' of artifacts across board from 'raw idea' to 'verified in capstone/'. Tactile progress motivates.
Form a study group of 2-3 people: one reads the chapter and explains the minimal scenario to the others, and another checks capstone/. Learning through teaching can accelerate retention 2-3×.
Don't try to achieve 'production-ready implementation' on first pass. First pass goal — reproducible contour for one case. Full track — separate project after exam.
Additional resources: book2/examples/readme.md: Local smoke runs and templates for all chapters. Mandatory resource for practice.
book2/examples/templates/capstone-dossier.md: Completed standard of minimal package for high_memory_usage. Shows how short a good first pass can be.
book2/glossary.md: Definitions of all second volume terms. Use as reference, not memorization text.
book2/appendix-a-bridges-to-book.md: Bridges to first volume: prerequisites and full AgentClinic domain map. Useful when questions arise 'where was this explained before'.
book2/appendix-b-qwen-code-compatibility.md: Built-in commands, custom commands and project scripts for Qwen Code. Necessary for adapting runnable examples to your CLI version.
book2/appendix-c-checklists.md: Checklists for Spec CI, arbitration, metrics and production readiness. Use for self-check before exam.
book2/appendix-d-threshold-calibration.md: 'Low / Default / High' tables, threshold shift exercises. Defer to second pass or real implementation.
book2/instructor.md: Workshop formats and typical errors. Useful if training team or studying in group.
book2/changelog.md: Text revision history. Check currency when using materials.
Source course document (applied volume readme): Base document on which this guide is built. Return to it to check your progress against plan.
Summary: The Applied Volume Production SDD for Qwen Code CLI teaches transferring the basic specification-driven development cycle into real production conditions. Key principle — one small verifiable trace per pass: each chapter adds exactly one line, file, or blocker to capstone/. Main case high_memory_usage goes through all stages: requirement recovery from legacy (genealogy.md), defect diagnosis (poisoned/fixed pair), project constitution, LLM duel and mutation testing (validation.md), shadow specification selection, Spec CI, file arbitration (judgment.md), tier budgets (budget-note.md), metric Goodhart protection (goodhart-note.md), readiness and dry-run (readiness.md), antipatterns (antipattern-audit.md). Result — reproducible contour from trace to production-ready solution, understandable without chat history. First pass requires discipline of cutting: full track — after exam, not instead of it.