Study guide: Appendix C. Applied SDD Checklists

Lesson 3 of 5 in module «Appendix C. Applied SDD Checklists»

Topic: Appendix C. Applied SDD Checklists

Difficulty level: Medium

Estimated study time: 8-12 hours (theory + practice)

Prerequisites: Completed study of the first volume of the SDD course (basic checklists before specification, implementation, and merge)

Understanding of artifact structure: requirements.md, plan.md, validation.md, QWEN.md

Experience with CI/CD and basic understanding of Git workflow

Familiarity with concepts: domain model, API contract, JSON Schema

Part 0 of the applied volume has been read (selecting a training incident-case)

Learning objectives: Apply Appendix C checklists as an operational reference at every stage of the applied SDD cycle, from preparation to final production acceptance

Implement quality control mechanisms: specification gates (Spec CI), file arbitration, poisoned specifications, and automatic remediation with clear rollback conditions

Identify and eliminate applied cycle antipatterns using 12-point diagnostics, making the decision to stop automation at ≥3 failures

Form a completed capstone artifact package with traceability: genealogy.md, poisoned/fixed pairs, judgment.md, and evidentiary base

Overview: Appendix C of the applied volume is an operational reference, a layer above the basic checklists of the first volume. It covers eight critical control points of the applied cycle: preparation for work, specification recovery from legacy systems, introduction of poisoned specifications for process validation, enabling Spec CI, file arbitration of disputed changes, production metric optimization with Goodhart effect protection, automatic incident remediation, and final production acceptance, with a 12-point antipattern audit run before final acceptance. Each checklist is designed as a gate: passing its checks is mandatory to proceed to the next stage. Process templates (merge request, retrospective, requests for /clear and replanning) are placed in the examples/templates/ directory. The distinguishing feature of the applied level is the transition from "having artifacts" to "evidentiary operation of artifacts in production conditions": every requirement is tracked, every change is justified, every risk is controlled by an invariant.

Key concepts: Specification gate (Spec CI): An automated check that blocks promotion of a change when structural rules are violated. Requirements: stable REQ-* identifiers, backward traceability from plan to requirements, validation of JSON examples against a schema, and informative error messages (file, line, rule, cause, action). Spec CI operates as a gate before execution, not as a review after it.
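A minimal sketch of one such gate rule, assuming each requirement in requirements.md starts a line with an identifier such as REQ-PAY-07 followed by a colon. The line format, the function name check_requirements, and the Finding record are hypothetical illustrations, not part of the course materials. The sketch detects duplicate identifiers and reports each one with the five message fields named above.

```python
import re
from dataclasses import dataclass

# Assumed format: each requirement starts a line with an ID like "REQ-PAY-07:".
REQ_ID = re.compile(r"^(REQ-[A-Z]+-\d+)\b")


@dataclass(frozen=True)
class Finding:
    """One gate violation with the five fields a Spec CI message should carry."""
    file: str
    line: int
    rule: str
    cause: str
    action: str

    def render(self) -> str:
        return (f"{self.file}:{self.line}: [{self.rule}] {self.cause}. "
                f"Action: {self.action}")


def check_requirements(text: str, file: str = "requirements.md") -> list[Finding]:
    """Report every REQ-* identifier that is declared more than once."""
    first_seen: dict[str, int] = {}
    findings: list[Finding] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = REQ_ID.match(line.strip())
        if not match:
            continue
        req_id = match.group(1)
        if req_id in first_seen:
            findings.append(Finding(
                file=file,
                line=lineno,
                rule="unique-req-id",
                cause=f"{req_id} already declared on line {first_seen[req_id]}",
                action="assign a new stable identifier to one of the requirements",
            ))
        else:
            first_seen[req_id] = lineno
    return findings


sample = """REQ-PAY-07: Operation timeout is 30 seconds.
REQ-DATA-03: Data freshness within 1 hour.
REQ-PAY-07: Retries are limited to 3 attempts.
"""
findings = check_requirements(sample)
for finding in findings:
    print(finding.render())
# A real gate would exit non-zero when findings is non-empty, blocking the change.
```

Running the check before the change is merged, and failing the pipeline on any finding, is what makes it a gate rather than a review aid.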

Poisoned specification (poisoned spec): A process validation method: exactly one controlled defect is introduced, the expected symptom is described, and the recovery criterion is set. The fix must change spec/plan/validation, not just a textual explanation. The method verifies that the quality control system can detect and localize errors.

File arbitration: Dispute resolution mechanism through diff in files, not chat discussions. Roles: Coordinator (process), Implementor (code), Verifier (verifiable evidence). Recurring decisions are recorded in precedents.md. There is a stop condition for transitioning to manual review.

Goodhart effect and anti-Goodhart metrics: Principle: when a metric becomes a target, it ceases to be a good metric. Protection: each target metric has a paired invariant metric, and a "red button" rule stops optimization when behavior becomes distorted. Changing a threshold is a risk change and must follow the risk-change procedure.

Blast radius: Boundary of affected systems during automatic remediation. Control questions: is the radius known, is there a dry-run, is the rollback condition recorded before execution, is manual confirmation required when expanding the radius or upon repeated failure. High-risk incidents require two independent recovery confirmations.

Trace and evidence ref: Reproducible identifier of a decision: source, policy version, prompt hash. Each entry in QWEN.md contains author, evidence, and TTL. The evidence_ref field is mandatory for all trace entries. Ensures auditability and system drift debugging.
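As an illustration, a trace entry could be modeled as below. This is a sketch under assumptions: only evidence_ref is a field name taken from the text; the other field names, the choice of SHA-256 for the prompt hash, and the sample values are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    """Hypothetical trace record; only evidence_ref is a field name from the text."""
    source: str
    policy_version: str
    prompt_hash: str
    evidence_ref: str

    def __post_init__(self) -> None:
        # evidence_ref is mandatory, so an empty value is rejected at creation.
        if not self.evidence_ref.strip():
            raise ValueError("evidence_ref is mandatory for every trace entry")


def make_trace(source: str, policy_version: str, prompt: str,
               evidence_ref: str) -> TraceEntry:
    """Build a trace entry, hashing the prompt so the decision is reproducible."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return TraceEntry(source, policy_version, prompt_hash, evidence_ref)


# Sample values are invented for the demonstration.
entry = make_trace(
    source="plan.md#step-4",
    policy_version="2024.1",
    prompt="Raise the ticket target to 150%",
    evidence_ref="ci/run-1042/report.json",
)
print(entry.prompt_hash[:12], entry.evidence_ref)
```

Hashing the prompt rather than storing it keeps the entry short while still letting an auditor confirm that a given prompt produced the decision.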

Mutable rules and TTL: Rules with a limited validity period. Prohibition: rules without a TTL or with a TTL > 90 days. Forced review prevents stagnation of outdated restrictions and accumulation of technical debt in governance.
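The TTL prohibition can be checked mechanically. The sketch below assumes each mutable rule carries a creation date and an optional expiry date; the rule names and dates are invented for the example, and only the 90-day limit comes from the text.

```python
from datetime import date, timedelta
from typing import Optional

MAX_TTL = timedelta(days=90)  # limit stated in the text


def ttl_violation(created: date, expires: Optional[date]) -> Optional[str]:
    """Return a reason string if the rule breaks the TTL policy, else None."""
    if expires is None:
        return "rule has no TTL"
    if expires - created > MAX_TTL:
        return f"TTL of {(expires - created).days} days exceeds 90"
    return None


def is_expired(expires: date, today: date) -> bool:
    """A rule past its expiry date is due for forced review."""
    return today > expires


# Hypothetical mutable rules: (name, created, expires)
rules = [
    ("no-friday-deploys", date(2024, 1, 10), date(2024, 3, 1)),
    ("freeze-billing-schema", date(2024, 1, 10), None),
    ("limit-batch-size", date(2024, 1, 10), date(2024, 6, 1)),
]
violations = {name: ttl_violation(created, expires)
              for name, created, expires in rules
              if ttl_violation(created, expires)}
print(violations)
```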

Budget ceiling and tiers: Financial and operational control when switching between system levels. Each switch has a budget ceiling and emergency mode. Prevents uncontrolled cost growth during scaling.

Manual review floor: Immutable minimum of manual review, independent of KPI value. Protection against full automation of critical decisions. Preserved even when high automation metrics are achieved.

genealogy.md: Final package artifact recording, for each requirement, its sources, confidence level, and open question. Ensures transparency of origin for every specification element and honesty about uncertainty.

Practice exercises: Name: Readiness Audit for Applied Cycle

Problem: You received a repository from a team claiming readiness to start the applied volume. Check the following facts: (1) capstone/ directory is missing, (2) README does not distinguish [runnable] and [project script], (3) requirements.md exists but plan.md refers to it implicitly through text rather than REQ-* identifiers, (4) validation.md contains JSON examples without schema. Compile a list of blockers and a remediation plan with priorities.

Solution: Step 1: Blocker — missing capstone/. Action: create directory structure, plan final package. Step 2: Blocker — no command type distinction. Action: review all chapters, mark explicitly, update README. Step 3: Blocker — missing stable identifiers. Action: implement REQ-* numbering, update plan.md with direct links. Step 4: Blocker — non-validatable examples. Action: create JSON Schema, add validation to CI. Priority: 3 and 4 — critical for Spec CI, 1 and 2 — critical for final acceptance. Timeline: 2 business days.
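Facts (1) to (3) of this exercise can be detected by a small script. The sketch below is an assumption-laden illustration: the file name README.md, the function name readiness_blockers, and the plain substring checks are hypothetical simplifications, and fact (4), schema validation, is left out.

```python
import tempfile
from pathlib import Path


def readiness_blockers(repo: Path) -> list[str]:
    """List blockers for starting the applied cycle; an empty list means ready."""
    def read(name: str) -> str:
        path = repo / name
        return path.read_text(encoding="utf-8") if path.is_file() else ""

    blockers = []
    if not (repo / "capstone").is_dir():
        blockers.append("capstone/ directory is missing")
    readme = read("README.md")
    if "[runnable]" not in readme or "[project script]" not in readme:
        blockers.append("README does not distinguish [runnable] and [project script]")
    if "REQ-" not in read("plan.md"):
        blockers.append("plan.md does not reference REQ-* identifiers")
    return blockers


# Demo on a throwaway repository that reproduces facts (1)-(3) of the exercise.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("Run the examples.\n", encoding="utf-8")
    (repo / "plan.md").write_text("Implement the timeout.\n", encoding="utf-8")
    blockers = readiness_blockers(repo)

for blocker in blockers:
    print("BLOCKER:", blocker)
```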

Complexity: intermediate

Name: Designing a Poisoned Specification

Problem: Your payment processing system has requirement REQ-PAY-07: "Operation timeout — 30 seconds." Create a poisoned specification: introduce exactly one defect, describe expected symptom, set recovery criterion. Ensure the fix affects at least two artifacts from spec/plan/validation.

Solution: Step 1: Defect — change timeout to 300 seconds in plan.md (in service configuration section), leaving requirements.md unchanged. Step 2: Symptom — p99 latency metric rises above 45 seconds, users receive timeout on client before server completes operation, leading to payment duplication. Step 3: Recovery criterion — p99 latency < 10 seconds, no duplicates in validation.md (check by trace_id). Step 4: Fix affects: plan.md (revert to 30 seconds), validation.md (add trace_id + timeout consistency check). requirements.md requires no change, demonstrating separation of concerns.
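The consistency check added to validation in Step 4 could look like the sketch below. It assumes that both artifacts mention the requirement identifier on the same line as a value written as "N seconds"; the sample lines and function names are hypothetical.

```python
import re
from typing import Optional


def extract_timeout(text: str, req_id: str) -> Optional[int]:
    """Find the timeout in seconds on the first line that mentions req_id."""
    for line in text.splitlines():
        if req_id in line:
            match = re.search(r"(\d+)\s*seconds", line)
            if match:
                return int(match.group(1))
    return None


def timeout_consistent(requirements: str, plan: str, req_id: str) -> bool:
    """True only when both artifacts state the same timeout for req_id."""
    required = extract_timeout(requirements, req_id)
    planned = extract_timeout(plan, req_id)
    return required is not None and required == planned


requirements_md = "REQ-PAY-07: Operation timeout is 30 seconds."
poisoned_plan = "[REQ-PAY-07] Configure service timeout: 300 seconds."
fixed_plan = "[REQ-PAY-07] Configure service timeout: 30 seconds."

print(timeout_consistent(requirements_md, poisoned_plan, "REQ-PAY-07"))  # False
print(timeout_consistent(requirements_md, fixed_plan, "REQ-PAY-07"))     # True
```

The poisoned variant fails and the fixed variant passes, which is the behavior a poisoned/fixed pair is meant to demonstrate.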

Complexity: intermediate

Name: Incident Analysis via Anti-Goodhart Protocol

Problem: A team optimized the metric "number of tickets closed per day." After one month, tickets close in 2 minutes, but recurrence (repeat tickets) rose 400% and customer satisfaction dropped. The metric reached 150% of target. The team wants to "raise the bar to 200%." Apply the pre-metric-optimization checklist.

Solution: Step 1: Stop: apply the red button rule, since recurrence growth above 50% is a stop condition. Step 2: Separation: separate the target metric (tickets/day) from the invariants (recurrence < 10%, NPS > 30). Step 3: Replay: check not only aggregates (averages) but also behavior drift, such as the resolution time distribution and the correlation between quick closure and reopening. Step 4: Trace: record the policy version and the hash of the prompt behind the 150% decision. Step 5: Threshold change: process it as a risk change, which requires anti-Goodhart metric analysis, Safety approval, and a governance_protocol update. Conclusion: the increase to 200% is rejected; return to the 100% target with an added recurrence invariant.
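The pairing of a target metric with invariants can be sketched as a single stop check. The thresholds below are the ones quoted in the solution; the Snapshot type, the function name red_button, and the sample baseline and current values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    tickets_per_day: float   # target metric
    recurrence_rate: float   # invariant: fraction of tickets that recur
    nps: float               # invariant: customer satisfaction score


def red_button(baseline: Snapshot, current: Snapshot) -> list[str]:
    """Return the reasons to stop optimization; an empty list means continue."""
    reasons = []
    # Stop condition from the exercise: recurrence grew by more than 50%.
    if baseline.recurrence_rate > 0:
        growth = current.recurrence_rate / baseline.recurrence_rate - 1
        if growth > 0.5:
            reasons.append(f"recurrence grew {growth:.0%} (> 50%)")
    # Invariants paired with the target metric.
    if current.recurrence_rate >= 0.10:
        reasons.append("recurrence invariant (< 10%) violated")
    if current.nps <= 30:
        reasons.append("NPS invariant (> 30) violated")
    return reasons


# Invented sample data: the target metric improved while both invariants broke.
baseline = Snapshot(tickets_per_day=100, recurrence_rate=0.04, nps=45)
current = Snapshot(tickets_per_day=150, recurrence_rate=0.20, nps=22)
for reason in red_button(baseline, current):
    print("STOP:", reason)
```

Note that the target metric never appears in the stop logic: the decision to halt depends only on the invariants, which is what keeps the check from being gamed by the metric it protects.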

Complexity: advanced

Name: File Arbitration Simulation

Problem: Dispute: Implementor added to plan.md step "cache API responses for 24 hours," Verifier rejects — requirement REQ-DATA-03 requires "data freshness within 1 hour." Coordinator received 15 messages in chat with arguments. Apply file arbitration protocol.

Solution: Step 1: Stop chat discussion, declare transition to file arbitration. Step 2: Implementor creates branch with two plan.md variants: (A) original with 24h cache, (B) modified with 1h cache + fallback to stale data. Step 3: Verifier provides verifiable evidence: diff between (A) and (B), JSON test with time-travel freshness check, CI logs for both variants. Step 4: Decision: accept (B), record in precedents.md: "When cache conflicts with freshness — fallback to stale with TTL=required freshness." Step 5: Stop condition: if dispute affects storage architecture — escalation to architecture committee. Resolution time: 2 hours instead of 2 days of chat.
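The time-travel freshness check from Step 3 can be sketched with a cache whose clock is injected, so the test moves time forward instead of waiting. TtlCache and max_staleness_ok are hypothetical names; the sketch covers only the TTL comparison and does not model the fallback to stale data in variant (B).

```python
from typing import Callable, Dict, Tuple


class TtlCache:
    """Tiny cache with an injectable clock so tests can move time forward."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, load: Callable[[], str]) -> str:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]          # still within TTL: serve the cached value
        value = load()                # expired or missing: reload from source
        self._store[key] = (now, value)
        return value


def max_staleness_ok(ttl_seconds: float, required_seconds: float) -> bool:
    """Change the source right after caching, then read at the freshness deadline."""
    now = [0.0]
    source = ["v1"]
    cache = TtlCache(ttl_seconds, clock=lambda: now[0])
    cache.get("rates", lambda: source[0])   # caches "v1" at t=0
    source[0] = "v2"                        # source changes immediately afterwards
    now[0] = required_seconds               # time-travel to the freshness deadline
    return cache.get("rates", lambda: source[0]) == "v2"


HOUR = 3600
print("A (24h cache):", max_staleness_ok(24 * HOUR, HOUR))  # False
print("B (1h cache):", max_staleness_ok(1 * HOUR, HOUR))    # True
```

Because the result is a pass or fail per variant, it can be attached to the arbitration branch as evidence instead of being argued in chat.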

Complexity: intermediate

Name: Antipattern Audit: Decision Making

Problem: Conduct an audit of a hypothetical team using the 12-point checklist. Facts: (1) constitution.md checked only at sprint end, (2) 3 rules in mutable_rules without TTL, (3) after poisoned specification failure only changed explanation in README, (4) CI fails — weakened validation.md, (5) metric "deploys/day" has no anti-Goodhart pair. Remaining points — positive. What decision does Appendix C require?

Solution: Step 1: Count negative answers: 5 points (1, 2, 3, 4, 5). Step 2: Apply rule: ≥3 negative answers → prohibition on adding new automation layers. Step 3: Prioritization: point 4 (weakening validation when CI fails) — critical, requires immediate revert and root cause analysis. Point 3 — second most critical, requires repeating poisoned specification cycle with artifact changes. Point 2 — operational, assign owner, set TTL 90 days. Points 1 and 5 — structural, include in sprint plan. Step 4: Control: repeat audit in 2 weeks, target — ≤2 negative answers.
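The counting rule in Steps 1 and 2 is simple enough to automate. In this sketch the mapping from checklist point to answer and the function name audit_decision are assumptions; the threshold of 3 and the 12 points come from the text.

```python
STOP_THRESHOLD = 3  # rule from the text: >= 3 negative answers


def audit_decision(answers: dict[int, bool]) -> tuple[int, bool]:
    """Return (negative_count, automation_frozen) for a 12-point audit.

    answers maps checklist point number to True (positive) or False (negative).
    """
    if len(answers) != 12:
        raise ValueError("the audit expects answers for all 12 points")
    negatives = sum(1 for ok in answers.values() if not ok)
    return negatives, negatives >= STOP_THRESHOLD


# Facts from the exercise: points 1-5 are negative, the remaining seven positive.
answers = {point: point > 5 for point in range(1, 13)}
negatives, frozen = audit_decision(answers)
print(negatives, frozen)  # 5 True
```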

Complexity: advanced

Case studies: Name: Spec CI Implementation in Fintech Startup Payment Platform

Scenario: Startup with 50 microservices, 8 teams, 200+ PRs daily. Problem: 40% of production rollbacks caused by implementation not meeting requirements, while requirements.md existed but was not linked to code. Management decision: "implement automation in a month."

Challenge: Teams perceived the requirement as "just another CI check." They implemented a check that requirements.md exists, but not a check of its structure. After 2 months: the files exist, but REQ-* identifiers are duplicated, plan.md references non-existent requirements, and JSON examples in validation.md are not validated. Rollbacks did not decrease. Management demanded to "strengthen automation" by adding ML semantic checking.

Solution: Applied the "Before enabling specification gate" checklist from Appendix C. Stop: 7 violations found across the 5 checklist points. Decision: do not add ML; close the basic failures first. Actions: (1) REQ-* standardization with a domain prefix (PAY-, AUTH-, LED-), (2) backward traceability: every plan.md item starts with [REQ-...], (3) JSON Schema for all API examples, (4) structured CI messages: file, line, rule, cause, action. Key point: Spec CI is configured as a mandatory status check, so a PR is blocked on failure.

Result: After 6 weeks: rollbacks due to requirements mismatch decreased from 40% to 8%. Review time reduced 30%, as Verifiers received machine-checkable artifacts. After 4 months: added anti-Goodhart metric "time from mismatch detection to fix" — prevented distortion toward "beautiful but useless" requirements.md.

Lessons learned: Automation without basic gates amplifies chaos, not order — ≥3 failures rule works

CI messages must be actionable: developer spends < 2 minutes understanding failure cause

Backward traceability plan→requirements allows automatic finding of affected tasks when requirements change
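The third lesson can be made concrete with a reverse index from requirement to plan items. The sketch assumes the convention from this case, where every plan.md item starts with [REQ-...]; the sample plan text and the function name build_trace_index are hypothetical.

```python
import re
from collections import defaultdict

# Assumed convention from the case: every plan.md item starts with "[REQ-...]".
PLAN_ITEM = re.compile(r"^\s*(?:[-*]\s+)?\[(REQ-[A-Z]+-\d+)\]\s*(.+)$")


def build_trace_index(plan_text: str) -> tuple[dict[str, list[str]], list[str]]:
    """Map each REQ-* id to its plan items; also return untraced lines."""
    index: dict[str, list[str]] = defaultdict(list)
    untraced: list[str] = []
    for line in plan_text.splitlines():
        if not line.strip():
            continue
        match = PLAN_ITEM.match(line)
        if match:
            index[match.group(1)].append(match.group(2).strip())
        else:
            untraced.append(line.strip())
    return dict(index), untraced


plan_md = """- [REQ-PAY-07] Set gateway timeout in service config
- [REQ-PAY-07] Add client-side timeout test
- [REQ-AUTH-02] Rotate signing keys
- Refactor logging
"""
index, untraced = build_trace_index(plan_md)
print(index["REQ-PAY-07"])   # tasks affected if REQ-PAY-07 changes
print(untraced)              # items a gate could reject for missing traceability
```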

Related concepts: Specification gate (Spec CI)

Backward requirements traceability

Automation stop rule at ≥3 failures

Name: Auto-Remediation Incident with Controlled Blast Radius

Scenario: Cloud provider, Kubernetes cluster auto-scaling system. Night peak incident: API latency rose 300%, trigger fired, system began scaling — created 500 new pods instead of 50, exhausting cluster budget and causing cascading failure of neighboring services.

Challenge: Original auto-remediation system lacked: (1) blast radius limitation, (2) dry-run for scaling, (3) recorded rollback condition, (4) manual confirmation when expanding. "Intelligent" system made decision based on aggregates without behavior drift check.

Solution: Applied "Before optimizing production metrics" and "Before auto-remediation" checklists. Introduced: (1) blast radius — maximum 2x current replicas for single service, cascading scaling prohibited without manual confirmation; (2) dry-run: Kubernetes scheduler simulation with resource estimation; (3) rollback condition: if latency does not decrease 5 minutes after scaling — rollback to original state; (4) double recovery confirmation for incidents with radius > 1 service.
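Controls (1), (3), and part of (4) can be sketched as two pure functions evaluated before any scaling call is made. The 2x cap and the 5-minute window come from the solution; the type ScaleDecision, the function names, and the latency figures are assumptions, and no real Kubernetes API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleDecision:
    replicas: int          # replicas the system may run after scaling
    needs_manual: bool     # True when a human must confirm before execution
    reason: str


def plan_scale(current: int, requested: int, services_affected: int,
               max_factor: int = 2) -> ScaleDecision:
    """Cap a scale-up at max_factor x current replicas for a single service."""
    if services_affected > 1:
        # Cascading scaling is prohibited without manual confirmation.
        return ScaleDecision(current, True, "radius exceeds one service")
    ceiling = current * max_factor
    if requested > ceiling:
        return ScaleDecision(ceiling, False, f"capped at {max_factor}x current")
    return ScaleDecision(requested, False, "within blast radius")


def should_rollback(latency_before_ms: float, latency_after_ms: float,
                    minutes_since_scale: float, window_min: float = 5) -> bool:
    """Rollback condition fixed before execution: no latency drop after the window."""
    return minutes_since_scale >= window_min and latency_after_ms >= latency_before_ms


decision = plan_scale(current=25, requested=500, services_affected=1)
print(decision)
print(should_rollback(900, 950, minutes_since_scale=5))  # True
```

Keeping both functions free of side effects means the same logic can serve the dry-run and the live path.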

Result: Next similar peak: system created 45 pods (within 2x limit), latency normalized. Dry-run predicted CPU exhaustion in cluster — trigger switched to emergency mode with traffic redirection to backup region. Recovery time reduced from 47 minutes to 8. Incident cost: from $120K (SLA penalties + resource overuse) to $3K (traffic switching).

Lessons learned: Blast radius without limitation is not auto-remediation, but auto-catastrophe

Dry-run in distributed systems must account for global resources, not only local metrics of target service

Rollback condition recorded before execution prevents "forgetting" in incident panic

Related concepts: Blast Radius

Dry-run and safe preliminary check

Red button rule for Goodhart effect

Double recovery confirmation

Name: Final Production Acceptance: From Chat to Evidentiary Base

Scenario: A 4-person team completed a 6-month billing system migration project. During acceptance preparation the team discovered that capstone/README.md contained a copy of the chat history with the customer, genealogy.md was missing, judgment.md consisted of one sentence ("everything works"), and there was no poisoned/fixed pair.

Challenge: Team perceived applied cycle as "write code and explain in chat." Artifacts created retrospectively, without traceability. Customer required auditable base for regulatory check (PCI DSS).

Solution: Applied "Before final production acceptance" checklist as recovery framework. Actions: (1) README.md rewritten: case described through domain model, no chat mentions; (2) genealogy.md created for each critical requirement (PAN storage, encryption, audit) — with source (PCI DSS item), confidence level (high/medium/low), open question; (3) poisoned/fixed pair recovered: defect introduced in PAN validation, fix recorded in validation.md; (4) judgment.md: verdict "conditionally fit," reason — 2 medium confidence levels in genealogy, evidence_ref on tests, next step — reassessment in quarter; (5) added budget and anti-Goodhart layers with blocking invariants.

Result: Acceptance passed with the mark "requires confidence improvement." Regulatory check: genealogy.md was accepted as proof of requirement origin. After a quarter: the 2 open questions were closed, the confidence level was raised, and judgment.md was updated to "fit." The team now creates genealogy entries in parallel with development, not retrospectively.

Lessons learned: genealogy.md created during the process costs 10% of the time; recovered retrospectively, it costs 300%

Open questions in genealogy are not weakness but honesty; their absence raises suspicion

Judgment.md with "conditionally fit" verdict and plan is more valuable than "fit" without justification

Related concepts: Genealogy.md and confidence level

Poisoned/fixed pair as process proof

Judgment.md with verdict and next step

Readiness conclusion: separation of blockers and improvements

Study tips: Go through checklists not as formality but as gates: on negative answer — stop and fix, not a "will fix later" checkbox

Create a physical or digital "project passport" — table with 8 Appendix C control points, recording passage date, owner, evidence link

For visual learners: build a flow diagram — from "Before start" to "Final acceptance" — and mark which templates from examples/templates/ are used at each stage

For practitioners: take your current project and conduct an audit using the 12-point antipattern checklist; real count of negative answers is stronger than theoretical reading

For auditory learners: discuss cases in pairs — Coordinator, Implementor, Verifier roles in file arbitration are better understood through role-play

Record "precedents" (precedents.md) from first dispute — time savings on recurring situations exceed maintenance costs

Use TTL as "memento mori" for rules: set calendar reminders one week before expiration

Additional resources: Process templates (examples/templates/): pr-template.md — merge request structure with mandatory requirement links; retrospective.md — retrospective format focused on evidence; clear-prompt.md and replan-prompt.md — minimal requests for /clear and replanning

Part 0 — production lab: Selection and preparation of training incident-case, distinguishing [runnable] and [project script], creating capstone/

Part 12 — production antipatterns: Full analysis of 12 antipatterns targeted by Appendix C quick checklist

Appendix C of the first volume: Basic checklists (before specification, implementation, merge) that form the foundation for the applied-level layers

JSON Schema specification: Documentation for creating validation schemas for validation.md (json-schema.org)

PCI DSS requirements checklist: Example external requirement source for practicing genealogy.md creation with regulatory traceability

Summary: Appendix C of the applied volume transforms SDD from methodology into operating system: eight control points with clear gates, 12-point antipattern audit as process health sensor, final acceptance with evidentiary base. Key principles: gates before execution (not review after), file arbitration instead of chat discussions, anti-Goodhart metric protection, controlled auto-remediation radius, automation stop at ≥3 failures. Successful application requires not memorizing points but internalizing the culture: "a negative answer is not a failure, but a prevented failure in production".

Course

Production SDD for Qwen Code CLI. Part 2