Topic: Applied Volume Glossary: a practical guide to production-SDD
Difficulty level: Medium
Estimated study time: 12-16 hours (2 weeks at 1 hour per day)
Prerequisites: Completed first volume of the SDD textbook (core artifacts: QWEN.md, mission.md, tech-stack.md, roadmap.md, requirements.md, plan.md, validation.md)
Understanding of Qwen Code, MCP, ACP, EARS, Given/When/Then skills
Basic experience with Git, YAML/JSON, CI/CD
Familiarity with the study project AgentClinic (TypeScript, Hono, SQLite, Vitest)
Learning objectives: Correctly translate 40+ key terms of the applied volume from English form into Russian prose, preserving technical names in code and YAML/JSON keys
Apply production refinements of basic SDD artifacts (validation.md, QWEN.md, constitution) in real-world scenarios with multi-agent arbitration and drift protection
Design paired metrics (anti-Goodhart) and safety gates (spec gate, red button) for production environments with LLM agents
Reproduce the full file arbitration cycle: from poisoned spec through mutation testing to judgment.md and precedents
Independently complete mandatory first-pass artifacts (capstone dossier) for a specific incident
Overview: The Applied Volume Glossary is not merely a dictionary of terms, but an operational handbook for production-SDD: a specification-driven development system where LLM agents work under human supervision in real production environments. Each term here receives a working definition, a runnable scenario, and clear translation rules: Russian prose for explanations, English technical names for code, YAML, and JSON. The main reading rule: do not memorize the glossary in its entirety, but open a term when it helps fill a specific file or understand a runnable example. Key groups: agent roles (Verifier, Implementor, Safety, Coordinator), risk management artifacts (constitution.md, judgment.md, precedents.md, genealogy.md), immunity metrics and protection against metric distortion (anti-Goodhart), stress testing mechanisms and multi-agent arbitration. The study project AgentClinic serves as the reference point for all production scenarios.
Key concepts: Silent P0: The proportion of P0-level incidents that passed through automation without human confirmation and without being recorded in the audit trail. It serves as an anti-Goodhart metric: if MTTR decreases while silent_p0 increases, auto-remediation is accelerating at the expense of hidden risks. In code: silent_p0, silent_p0_cap, and silent_p0_ratio are technical names and are not translated.
Spec gate (spec ci): A CI check that blocks merge if the specification is not covered by the plan, the plan by tasks, or tasks by facts in validation.md. In prose: "Spec CI" or "spec gate (Spec CI)"; in code: spec_gate — only as a task name in .github/workflows/spec-ci.yml.
File arbitration (tribunal): A collegial decision procedure for a disputed amendment: Verifier, Implementor, and Safety vote according to a fixed protocol, Coordinator prepares judgment.md. In prose: "file arbitration"; in code: tribunal — only as the directory name examples/tribunal/ and its scripts.
Emergency mode (red button): A formal safety gate before a dangerous action: deployment, rollback, migration, or auto-remediation. "Red button" is a short conversational label. In code: red_button, red_button_mttr_blindness — only as invariant names in YAML.
Shadow specification (shadow spec): A specification for unformalizable nuances: intonations, unspoken priorities, historical decisions. It is stored separately, enters the working context only after winning the auction recorded in the scorebook log, and does not replace the main specification. Both spellings are permitted in headings.
Immunity score: A vector of validator assessments with three components: strict_reject_rate (proportion of degenerate cases strictly rejected at the expected step), depth_of_diagnostics (useful depth of explanation before refusal), recovery_time_p95_ms (p95 time to return a stable verdict). Not a single aggregate number, but a gate for the validator loop.
Guard metric: An antibody metric to protect against distortion of the target indicator. Each target metric (MTTR, edge_drift) is paired with a guard metric (silent_p0, manual_review_floor, audit_trace_coverage), and the CI gate passes only when both are simultaneously satisfied.
Project constitution (constitution.md): An extension of the basic constitution from the first volume with three explicit sections: immutable_principles (not automatically disabled, changed only by team referendum), mutable_rules (evolve through accumulation of incidents with fields incident_type, pipeline_phase, permitted_actions, max_scope, ttl, rollback_condition), governance_protocol (roles and voting procedures).
Poisoned spec: A study specification with one controlled defect: escalation cycle, priority conflict, or hidden boundary violation. Used for training the Verifier and validators through mutation testing.
Mutation operator: A function that introduces exactly one defect of a known class into a correct specification. Each mutation is assigned a mutation_id, an expected_failure, and a halt_before step. Examples: Nullify, FutureTime, EscalationCycle, PriorityContradiction.
Dispute resolution (judgment.md): The final artifact of file arbitration: vote log, decision_hash, references to specification, constitution, and incident, active ttl and rollback_condition. Stored in the repository as an immutable record.
Precedent: A record in precedents.md about a recurring conflict type and its adopted resolution. Used as tie-breaker latest_matching_precedent in governance_protocol and reduces the cost of the next arbitration.
Genealogy (genealogy.md): Provenance of a recovered specification: for each requirement — a list of sources, confidence level (confirmed, inferred, hypothesis), and open question. Created when recovering from legacy code and logs.
Model tier (tier): Model level in tiered routing: local-coder (cheap local model for code generation and drafts), frontier-reviewer (expensive frontier model for critical reviews and disputed verdicts). In prose: "low/medium/high tier"; in YAML keys and role names — untranslated.
Drift: Divergence between specification, implementation, and actual agent behavior in production. Three types: spec_drift (specification outdated), code_drift (implementation deviated from plan), edge_drift (validator reacts differently to edge cases).
Replay: Re-running historical incidents through the current validator and current constitution. A gate for Goodhart metrics: the new version must not worsen verdicts on already analyzed cases.
Override rule: A mutable norm in constitution.md permitting bypass of standard behavior in a narrow context: for a specific incident_type, at a specific pipeline_phase, with limited max_scope and mandatory ttl. Without these restrictions, an override rule competes with invariants.
Evidence chain: A structured chain of artifacts attached to an agent's decision: input payload, specification version, active constitution rules, arbitration vote log, change diff, post-condition checks. Minimum requirement for production SDD.
Shadow spec auction: Evaluation and ranking of informal heuristics before inclusion in the working context. The auction winner is added to QWEN.md as a few-shot with a review deadline. Score log — scorebook (shadow-scorebook.json).
Process antipatterns: ask_storm — cycle of clarifying questions instead of stopping; stage_regress — rollback to a previous SDD phase without cause; phase_context_loss — loss of context between phases. Control strings and protections described in the glossary.
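To make the spec gate entry above more concrete, here is a minimal Python sketch of the coverage chain it describes (specification covered by plan, plan by tasks, tasks by facts in validation.md). The function name spec_gate matches the technical name from the glossary, but the data shapes, the identifiers REQ-1, PLAN-1, TASK-1, and the PASSED/BLOCKED return values are assumptions made for illustration, not the format used by the course's spec-ci.yml.

```python
from __future__ import annotations


def spec_gate(
    requirements: set[str],
    plan: dict[str, set[str]],
    tasks: dict[str, set[str]],
    validated_tasks: set[str],
) -> tuple[str, list[str]]:
    """Return ("PASSED" | "BLOCKED", gaps) for the coverage chain.

    plan maps a plan item to the requirement ids it covers,
    tasks maps a task id to the plan items it implements,
    validated_tasks holds task ids that have a fact in validation.md.
    (All shapes are hypothetical.)
    """
    gaps: list[str] = []
    # Link 1: every requirement must be covered by some plan item.
    covered_requirements = set().union(*plan.values())
    for req in sorted(requirements - covered_requirements):
        gaps.append(f"requirement {req} is not covered by the plan")
    # Link 2: every plan item must be implemented by some task.
    items_with_tasks = set().union(*tasks.values())
    for item in sorted(set(plan) - items_with_tasks):
        gaps.append(f"plan item {item} has no task")
    # Link 3: every task must have a recorded fact in validation.md.
    for task in sorted(set(tasks) - validated_tasks):
        gaps.append(f"task {task} has no fact in validation.md")
    return ("PASSED" if not gaps else "BLOCKED", gaps)


status, gaps = spec_gate(
    requirements={"REQ-1", "REQ-2"},
    plan={"PLAN-1": {"REQ-1"}},
    tasks={"TASK-1": {"PLAN-1"}},
    validated_tasks={"TASK-1"},
)
print(status, gaps)
```

In this run REQ-2 has no plan item, so the gate reports one gap and blocks the merge. A real check would read these sets from the repository artifacts rather than from literals.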
Important dates: First pass moment: Do not read the glossary in its entirety. It is sufficient to understand capstone/ and the ten mandatory first-pass artifacts (full list in README)
Terms introduced in parts: Roles (part 4), Constitution artifacts (part 3), Immunity metrics (part 5), Shadow specs (part 6), Spec CI (part 7), File arbitration (part 8), Tiered routing (part 9), Anti-Goodhart (part 10), Deployment (part 11), Antipatterns and capstone (parts 12-13)
Production refinements: Applied on top of basic first volume terms: validation.md is supplemented with a failing case, anti-Goodhart checks, and drift fields; QWEN.md becomes the place for few-shots from the auction; the constitution is expanded with immutable/mutable sections
Practice exercises: Name: Term translation: technical name vs prose
Problem: Three contexts for using the term 'silent_p0' are given. Write the correct form for each: (1) explanation in business documentation, (2) YAML key in CI configuration, (3) CLI command for checking the metric. Check yourself against the translation table.
Solution: (1) Prose: "silent P0" — proportion of P0-level incidents that passed without human confirmation; (2) YAML key: silent_p0, silent_p0_cap, silent_p0_ratio — technical name, not translated; (3) CLI: silent_p0 — metric name in command, e.g. qwen -p metrics check silent_p0_ratio. Rule: English key only in code blocks and at first mention in parentheses.
Complexity: beginner
Name: Designing paired metrics for AgentClinic
Problem: The doctor appointment system is introducing a metric 'average appointment confirmation time' (MTTR-like). What guard metric would protect against distortion? Describe: (1) target metric, (2) guard metric, (3) gate condition, (4) scenario where the target metric improves but the guard worsens.
Solution: (1) Target: booking_confirmation_time_ms, the average time from request to confirmation; (2) Guard: silent_override_rate, the proportion of appointments auto-confirmed without checking doctor schedule conflicts; (3) Gate: both metrics in green zone, otherwise BLOCKED; (4) Scenario: the agent starts confirming appointments while ignoring double-booking, so booking_confirmation_time_ms drops, silent_override_rate rises, and the gate blocks deployment. Anti-Goodhart technique: never optimize one metric without checking its antibody.
Complexity: intermediate
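The gate condition from this exercise can be sketched in a few lines of Python. The metric names booking_confirmation_time_ms and silent_override_rate come from the solution above; the thresholds (2000 ms and 2%) and the MetricPair type are hypothetical values chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPair:
    """A target metric and its guard; both are 'lower is better' here."""
    target_name: str
    target_max: float  # green zone: target <= target_max (assumed threshold)
    guard_name: str
    guard_max: float   # green zone: guard <= guard_max (assumed threshold)


def paired_gate(pair: MetricPair, metrics: dict) -> str:
    """PASSED only when the target AND the guard are both in the green zone."""
    target_ok = metrics[pair.target_name] <= pair.target_max
    guard_ok = metrics[pair.guard_name] <= pair.guard_max
    return "PASSED" if target_ok and guard_ok else "BLOCKED"


pair = MetricPair(
    target_name="booking_confirmation_time_ms",
    target_max=2000.0,
    guard_name="silent_override_rate",
    guard_max=0.02,
)

# Scenario (4): the target improves while the guard worsens.
before = {"booking_confirmation_time_ms": 1800.0, "silent_override_rate": 0.01}
after = {"booking_confirmation_time_ms": 600.0, "silent_override_rate": 0.09}
print(paired_gate(pair, before))
print(paired_gate(pair, after))
```

The second call is blocked even though the target metric is three times better, which is the behaviour the exercise asks for.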
Name: Recovering genealogy.md for a legacy feature
Problem: In AgentClinic, a feature 'automatic appointment reminder' implemented 2 years ago was discovered, with no specification. Recover a minimal genealogy.md: describe 3 information sources, confidence levels, and 2 open questions. Use the format from the glossary.
Solution: Sources: (1) SMS gateway logs, confirmed (records exist with message template and 24h trigger); (2) Code in src/reminders/auto-sms.ts, inferred (logic exists but no comments on business rules); (3) Patient feedback in support, hypothesis (some complain about missing reminders, possible bug or opt-out). Open questions: (a) What lead time triggers the reminder: 24h, 2h, or does it depend on appointment type? (b) Is there a fallback channel (email, push) when SMS is unavailable? Format of genealogy.md: table with columns requirement_id, sources[], confidence, open_questions[].
Complexity: intermediate
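The table format named in the solution (requirement_id, sources[], confidence, open_questions[]) can be modelled as a small record type. The sketch below is an assumption about how such a row might be validated and rendered; the GenealogyRow class, the REM-001 identifier, and the Markdown table layout are illustrative and are not taken from the course templates.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Confidence levels listed in the glossary entry for genealogy.md.
CONFIDENCE_LEVELS = ("confirmed", "inferred", "hypothesis")


@dataclass
class GenealogyRow:
    requirement_id: str
    sources: list[str]
    confidence: str
    open_questions: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence}")
        if not self.sources:
            raise ValueError("a recovered requirement needs at least one source")


def render_table(rows: list[GenealogyRow]) -> str:
    """Render rows as a Markdown table (layout is a hypothetical choice)."""
    lines = [
        "| requirement_id | sources | confidence | open_questions |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        sources = "; ".join(row.sources)
        questions = "; ".join(row.open_questions) or "-"
        lines.append(
            f"| {row.requirement_id} | {sources} | {row.confidence} | {questions} |"
        )
    return "\n".join(lines)


row = GenealogyRow(
    requirement_id="REM-001",  # hypothetical id
    sources=["SMS gateway logs", "src/reminders/auto-sms.ts"],
    confidence="inferred",
    open_questions=["Is there a fallback channel when SMS is unavailable?"],
)
print(render_table([row]))
```

Rejecting unknown confidence levels at construction time keeps a typo such as "guess" from silently entering the provenance record.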
Name: Full file arbitration cycle
Problem: Disputed change in AgentClinic: agent proposes adding automatic appointment cancellation after 3 no-shows, but this affects the SLA refund policy. Conduct file arbitration: determine votes of Verifier, Implementor, Safety, role of Coordinator, final verdict, and contents of judgment.md.
Solution: Verifier: reject, citing a hidden boundary violation (specification is about alert routing, agent modifies SLA policy); Implementor: abstain, since the amendment is technically applicable but exceeds max_scope; Safety: veto (critical_risk), because the blast radius includes financial obligations and there is no rollback_condition for erroneous cancellations; Coordinator: records verdict REJECTED, publishes judgment.md with decision_hash, references to routing specification and constitution (mutable_rules for financial operations), active ttl=0 (change not applied), rollback_condition=N/A. Precedent added to precedents.md: 'auto-cancellation with financial impact requires explicit mutable_rule for billing, not just routing spec'.
Complexity: advanced
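The arbitration in this exercise can be replayed with a short Python sketch. The tallying rule used here (a Safety veto always rejects; otherwise approvals must outnumber rejections) is an assumption, since the actual protocol lives in governance_protocol. The record layout and the use of SHA-256 over canonical JSON for decision_hash are likewise illustrative choices, not the course's judgment.md format.

```python
import hashlib
import json


def run_tribunal(votes, references, ttl, rollback_condition):
    """Build a judgment record from role votes (assumed protocol, see lead-in)."""
    if votes.get("Safety") == "veto":
        verdict = "REJECTED"
    else:
        ballots = [votes.get(role) for role in ("Verifier", "Implementor", "Safety")]
        approved = ballots.count("approve") > ballots.count("reject")
        verdict = "APPROVED" if approved else "REJECTED"
    record = {
        "verdict": verdict,
        "votes": votes,
        "references": references,
        "ttl": ttl,
        "rollback_condition": rollback_condition,
    }
    # Hash the canonical JSON form so the same decision always yields the same hash.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    record["decision_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


judgment = run_tribunal(
    votes={"Verifier": "reject", "Implementor": "abstain", "Safety": "veto"},
    references=["routing specification", "constitution.md: mutable_rules"],
    ttl=0,
    rollback_condition="N/A",
)
print(judgment["verdict"], judgment["decision_hash"][:12])
```

Because the hash is computed before it is added to the record, anyone holding the other fields can recompute it and confirm that the stored judgment was not edited after the vote.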
Name: Diagnosing the ask_storm antipattern
Problem: Agent is in a loop asking 5 clarifying questions about appointment priority without proceeding to planning. Check the control string: cycle_count > 0 && ask_storm >= 4 && escalation_path_resolved=false. Describe: (1) what indicates a poisoned spec, (2) which constitution rule is violated, (3) remediation steps.
Solution: (1) ask_storm >= 4 with cycle_count > 0 — agent stuck in clarifications instead of stopping or escalating; escalation_path_resolved=false — no conflict resolution; (2) Violation of mutable_rule: at ask_storm >= 3 Coordinator must interrupt the cycle and initiate human-in-the-loop (if such rule is recorded in constitution.md); if absent — gap in constitution; (3) Remediation: (a) record incident, (b) update specification with explicit tie_breaker for priority conflicts, (c) add guard metric max_ask_storm=2 in CI, (d) conduct mutation testing with PriorityContradiction operator.
Complexity: advanced
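The control string from this exercise translates directly into a boolean check. In the sketch below, the field names cycle_count, ask_storm, and escalation_path_resolved come from the problem statement; the AgentLoopState type, the coordinator_action helper, and its return values are hypothetical, and the interrupt threshold of 3 applies only if such a rule is actually recorded in constitution.md.

```python
from dataclasses import dataclass


@dataclass
class AgentLoopState:
    cycle_count: int
    ask_storm: int  # number of clarifying questions asked in the loop
    escalation_path_resolved: bool


def control_string_fires(state: AgentLoopState) -> bool:
    """cycle_count > 0 && ask_storm >= 4 && escalation_path_resolved=false"""
    return (
        state.cycle_count > 0
        and state.ask_storm >= 4
        and not state.escalation_path_resolved
    )


def coordinator_action(state: AgentLoopState, interrupt_threshold: int = 3) -> str:
    """Assumed mutable_rule: interrupt the loop once ask_storm hits the threshold."""
    if state.ask_storm >= interrupt_threshold and not state.escalation_path_resolved:
        return "escalate_to_human"
    return "continue"


stuck = AgentLoopState(cycle_count=1, ask_storm=5, escalation_path_resolved=False)
print(control_string_fires(stuck), coordinator_action(stuck))
```

With five questions asked and no resolved escalation path, the control string fires and the Coordinator hands the case to a human instead of letting the agent ask a sixth question.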
Case studies: Name: Implementing tiered routing in AgentClinic: from budget crisis to controlled costs
Scenario: The AgentClinic team used a single frontier model (GPT-4-class) for all tasks: writing code, review, bug fixes, support responses. As load grew, API costs increased 340% in a quarter, while 70% of requests were routine (SMS template generation, schedule updates). It was necessary to preserve quality of critical operations and reduce costs.
Challenge: (1) No task separation by criticality — frontier model used for everything; (2) No blocking mechanism when budget exceeded — costs grew continuously; (3) Fear of quality degradation: team feared that a cheap model would 'break' complex therapy logic; (4) No metrics to verify that task downgrade is safe.
Solution: Tiered routing implemented per the textbook model: (1) Tiers defined: local-coder (Qwen2.5-Coder 7B locally) for drafts and routine, frontier-reviewer (GPT-4) only for critical reviews, disputed verdicts, and red button checks; (2) Budget keeper introduced — external script with daily token quota per tier, blocking frontier when exceeded; (3) Guard metric created for each task: local_coder_acceptance_rate — proportion of tasks that passed frontier model review without edits after local-coder generation; (4) Pilot: 30% of tasks moved to local-coder with automatic frontier review on a 10% sample.
Result: After 6 weeks: cost reduction of 62%, local_coder_acceptance_rate stabilized at 78% (target: 75%), average completion time for routine tasks dropped from 4.2 min to 1.1 min (local-coder requires no network call). Critical incidents (therapy conflicts, complaints) still pass frontier-reviewer. Budget keeper triggered 3 times, correctly deferring non-urgent frontier tasks to next day without incidents.
Lessons learned: Guard metric local_coder_acceptance_rate is critical for trust: without it, the team will sabotage downgrade 'just in case'
Budget keeper must be external to the agent — Qwen Code does not manage budget itself, this is an architectural constraint
Pilot with sample frontier review provides data for scaling justification, not blind trust in the cheap model
Tiered routing requires updating QWEN.md: few-shot examples for local-coder and frontier-reviewer differ in style and depth
Related concepts: Model tier (tier, local-coder / frontier-reviewer)
Budget keeper
Guard metric
Shadow spec auction
Immunity score
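The routing and budget logic of this case study can be sketched as follows. The tier names local-coder and frontier-reviewer are from the source; the task kinds, the quota figures, the BudgetKeeper class, and the "deferred" result are assumptions for illustration. In line with the lessons above, the keeper is a separate object outside the agent, not something the agent manages itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Task kinds that go to the frontier tier (hypothetical labels).
CRITICAL_KINDS = {"critical_review", "disputed_verdict", "red_button_check"}


@dataclass
class BudgetKeeper:
    """External daily token quota per tier."""
    daily_quota: dict[str, int]
    spent: dict[str, int] = field(default_factory=dict)

    def try_spend(self, tier: str, tokens: int) -> bool:
        used = self.spent.get(tier, 0)
        if used + tokens > self.daily_quota[tier]:
            return False  # quota exceeded: block this tier for today
        self.spent[tier] = used + tokens
        return True


def route(task_kind: str, tokens: int, keeper: BudgetKeeper) -> str:
    """Pick a tier by criticality; defer the task when the tier's quota is spent."""
    tier = "frontier-reviewer" if task_kind in CRITICAL_KINDS else "local-coder"
    return tier if keeper.try_spend(tier, tokens) else "deferred"


keeper = BudgetKeeper(daily_quota={"local-coder": 1_000_000, "frontier-reviewer": 50_000})
print(route("sms_template", 2_000, keeper))
print(route("critical_review", 40_000, keeper))
print(route("disputed_verdict", 20_000, keeper))
```

The third task would push the frontier tier past its quota, so it is deferred rather than silently downgraded to the cheap model.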
Name: Spec drift in auto-remediation loop: how silent_p0 revealed a hidden crisis
Scenario: AgentClinic had an auto-remediation pipeline: when SMS gateway failed, the system automatically switched to email channel, recorded the incident, and notified oncall. MTTR (mean time to recovery) steadily decreased: 12 min → 8 min → 5 min. Seemingly a success.
Challenge: Detailed audit revealed: 18% of P0 incidents (complete failure of booking for critical specialists) passed without human confirmation and without full audit trail recording. The agent 'fixed' the problem by channel switching, but did not escalate when email was also unavailable (rare case, but critical). MTTR dropped due to 'silent' omissions. Classic Goodhart distortion: optimizing one metric at the expense of the system.
Solution: Anti-Goodhart loop implemented: (1) Guard metric silent_p0 added to MTTR — proportion of P0 without human confirmation and full audit trail; (2) manual_review_floor = 15% set — minimum decisions mandatory through human; (3) audit_trace_coverage added — proportion of actions with full evidence chain; (4) Emergency mode (red button) blocks auto-deploy when silent_p0 > 10% or manual_review_rate < 15%; (5) Replay of historical incidents through new validator — gate: new version does not worsen verdicts.
Result: First run showed: silent_p0 = 18%, manual_review_rate = 12%, audit_trace_coverage = 73%. Red button = BLOCKED. Team conducted investigation, updated mutable_rules in constitution.md: now when both channels fail (SMS+email) mandatory human-in-the-loop escalation, not auto-remediation. After 3 weeks: silent_p0 = 4%, manual_review_rate = 22%, audit_trace_coverage = 97%, MTTR rose to 9 min (honest time, without silent omissions). Red button = UNLOCKED.
Lessons learned: MTTR without guard metrics is a dangerous 'honey trap' metric: easy to optimize at the expense of safety
silent_p0 reveals exactly what MTTR hides — this is its sole and critical function
manual_review_floor is not bureaucracy, but protection against human displacement from the loop
Replay of historical cases is the only way to prove that 'improvement' did not worsen old scenarios
Emergency mode must be a formal gate with verifiable conditions, not a 'red button' in words only
Related concepts: Silent P0 (silent_p0)
Guard metric (anti-Goodhart)
Emergency mode (red button)
Replay
Manual review floor (manual_review_floor)
Audit trace coverage (audit_trace_coverage)
Spec drift (spec_drift, edge_drift)
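The red button condition from this case study (block when silent_p0 > 10% or manual_review_rate < 15%) can be checked against the two measurement runs reported above. The function name and return shape are assumptions; the thresholds and input values are taken from the case. audit_trace_coverage is left out of the gate because the case study gives no threshold for it.

```python
def red_button(silent_p0, manual_review_rate,
               silent_p0_cap=0.10, manual_review_floor=0.15):
    """Return (state, reasons); BLOCKED if any gate condition is violated."""
    reasons = []
    if silent_p0 > silent_p0_cap:
        reasons.append(f"silent_p0 {silent_p0:.0%} exceeds cap {silent_p0_cap:.0%}")
    if manual_review_rate < manual_review_floor:
        reasons.append(
            f"manual_review_rate {manual_review_rate:.0%} "
            f"is below floor {manual_review_floor:.0%}"
        )
    return ("BLOCKED" if reasons else "UNLOCKED", reasons)


# First run from the case study: 18% silent P0, 12% manual review.
first_run = red_button(silent_p0=0.18, manual_review_rate=0.12)
# After three weeks: 4% silent P0, 22% manual review.
later_run = red_button(silent_p0=0.04, manual_review_rate=0.22)
print(first_run[0], later_run[0])
```

Note that MTTR does not appear in the gate at all: the gate looks only at the guard metrics, so a faster recovery time cannot unlock a deployment on its own.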
Name: Mutation testing of booking specification: from poisoned spec to vaccinated validator
Scenario: AgentClinic updated booking priority logic: previously priority was determined only by urgency (urgent/elective), now a 'chronic disease' factor was added (chronic flag). The new specification appeared correct, but integration testing revealed conflicts: a chronic patient with an elective booking received higher priority than an urgent patient without chronic flag, violating the old invariant 'urgent always first'.
Challenge: (1) Manual testing did not reveal the conflict — testers checked flags separately; (2) The old validator missed the case because the specification formally did not violate Given/When/Then — the conflict was in the implicit priority; (3) The validator needed to be 'vaccinated': taught to find exactly such logical defects.
Solution: Mutation testing applied per the textbook model: (1) Poisoned spec created — copy of new specification with PriorityContradiction operator introduced: one rule demotes urgent to high when chronic=false, another promotes high back to urgent when chronic=true without tie_breaker; (2) Expected failure: PRIORITY_REVERSAL; (3) Validator went through mutation cycle: Nullify (chronic fields), FutureTime (appointment timestamps), EscalationCycle (priority escalation loop), PriorityContradiction; (4) Immunity metric tracked: strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms; (5) Validator failing the gate (recovery_time_p95_ms > 1200ms) sent for rework.
Result: After 3 iterations: strict_reject_rate = 94% (target: 90%), depth_of_diagnostics = 4.2 steps (target: 3+), recovery_time_p95_ms = 890ms (target: <1200ms). Validator stably catches PriorityContradiction at 'When priority is calculated' step. Updated specification with explicit tie_breaker deployed to production: urgent > chronic > elective, with conflict resolution table. Integration conflict eliminated before deployment.
Lessons learned: Mutation testing of specifications — analog of unit tests for logic, but at the requirements level, not code
Poisoned spec is not a 'broken document', but a controlled study tool with a known defect
Immunity score is a vector, not a scalar: high strict_reject_rate with low depth_of_diagnostics means 'blind strictness', useless for diagnosis
Mutation operators must be classified by defect type, not random — this enables targeted validator strengthening
Validator vaccination is not a one-time event: when specification changes, replay of old mutants is needed
Related concepts: Poisoned spec
Mutation operator
Immunity score (strict_reject_rate, depth_of_diagnostics, recovery_time_p95_ms)
Counterexample
Stress spec
Replay
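Two pieces of this case study lend themselves to a short sketch: the PriorityContradiction operator and the immunity gate. The field names mutation_id, expected_failure, halt_before and the three immunity components are from the source, as are the PRIORITY_REVERSAL failure and the numeric targets. The spec dictionary layout, the rule format, and the mutation id are hypothetical, and the gate treats recovery_time_p95_ms > 1200 as the failing condition, following step (5) of the solution.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutant:
    mutation_id: str
    expected_failure: str
    halt_before: str
    spec: dict


def priority_contradiction(spec: dict, mutation_id: str) -> Mutant:
    """Inject one controlled defect: two rules that undo each other, no tie_breaker."""
    poisoned = copy.deepcopy(spec)  # never touch the correct specification
    poisoned["rules"].extend([
        {"when": {"priority": "urgent", "chronic": False}, "set_priority": "high"},
        {"when": {"priority": "high", "chronic": True}, "set_priority": "urgent"},
    ])
    poisoned.pop("tie_breaker", None)  # leave the contradiction unresolved
    return Mutant(mutation_id, "PRIORITY_REVERSAL",
                  "When priority is calculated", poisoned)


def immunity_gate(score: dict) -> list:
    """Return the names of failed components; an empty list means the gate passes."""
    failed = []
    if score["strict_reject_rate"] < 0.90:
        failed.append("strict_reject_rate")
    if score["depth_of_diagnostics"] < 3:
        failed.append("depth_of_diagnostics")
    if score["recovery_time_p95_ms"] > 1200:
        failed.append("recovery_time_p95_ms")
    return failed


base_spec = {"rules": [], "tie_breaker": ["urgent", "chronic", "elective"]}
mutant = priority_contradiction(base_spec, "MUT-001")  # hypothetical id
final_score = {"strict_reject_rate": 0.94, "depth_of_diagnostics": 4.2,
               "recovery_time_p95_ms": 890}
print(mutant.expected_failure, immunity_gate(final_score))
```

Returning the list of failed components rather than a single pass/fail flag keeps the immunity score a vector, so a validator that is strict but shallow is reported as such instead of being averaged into an acceptable number.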
Study tips: Use the glossary as a reference, not a textbook: open a term when it appears in a specific chapter or when filling out a file. Reading rule from README: 'file name or YAML/JSON key may remain English, but in explanation choose one Russian meaning'
Create a personal 'translation map': print the translation table of key terms and mark those you have already used in your artifacts. Goal: automate the correct choice between prose and technical name
Practice 'prose explanation' aloud: take any YAML fragment from examples and read it aloud, replacing each key with its Russian equivalent. For example, silent_p0_ratio: 0.18 → 'silent P0 proportion: eighteen percent'
For visual style: draw connection diagrams between artifacts. For example, chain: poisoned spec → mutation operator → immunity score → spec gate → judgment.md → precedents.md. This helps see how terms work together, not in isolation
For kinesthetic style: physically fill out templates. Take examples/templates/proposal.md and write a real proposal for a change in your project, then conduct a mental referendum
Use the 'three contexts rule': for each new term find its use in (1) prose explanation, (2) YAML/JSON key, (3) file name or CLI command. This reinforces dual spelling
Do an 'audit of your artifacts': once a week check whether an anglicism slipped into prose or whether you translated a technical name. Fix it — this trains discipline critical for team work
For studying anti-Goodhart: find a 'honey trap metric' in your project — an indicator that is easy to optimize at the expense of the system. Design a guard metric for it per the model from part 10
Proceed through capstone dossier in stages: do not try to assemble all 13 parts at once. Start with one real incident from your practice and go through the cycle: genealogy → spec → mutation → constitution → tribunal → metrics
Use the study project AgentClinic as a 'mental sandbox': when a term seems abstract, ask 'how would this apply in the booking clinic?' — the domain is familiar, and mapping to production scenarios is fixed in appendix A
Additional resources: Course source document (applied volume glossary): Main reference, updated by authors; read as you progress through parts, not in full
Applied volume README, section "mandatory first-pass artifacts": Minimum set to start: capstone/ and 10 artifacts
Part 1: spec archaeology (part-01-spec-archaeology.md): Introduction to genealogy.md, recovering specifications from legacy
Part 2: poisoned specs (part-02-poisoned-specs.md): Poisoned specs, ask_storm antipattern
Part 3: project constitution (part-03-project-constitution.md): Immutable/mutable sections, governance_protocol, proposal.md
Part 4: LLM duel (part-04-llm-duel.md): Verifier/Implementor roles, counterexamples, repair.patch
Part 5: stress specs (part-05-stress-specs.md): Mutation testing, mutation operators, immunity metrics
Part 6: shadow specs (part-06-shadow-specs.md): Shadow specs, auction, scorebook, shadow-candidates.yaml
Part 7: specification CI (part-07-specification-ci.md): Spec gate (Spec CI), GitHub integration
Part 8: multi-agent tribunal (part-08-multiagent-tribunal.md): File arbitration, judgment.md, precedents.md, Coordinator role
Part 9: tier budgeting (part-09-tier-budgeting.md): Tiered routing, local-coder/frontier-reviewer, budget keeper
Part 10: goodhart metrics (part-10-goodhart-metrics.md): Anti-Goodhart, guard metrics, red button, silent_p0, replay
Part 11: real API deployment (part-11-real-api-deployment.md): Readiness gate, dry run, evidence_ref, 25-point model
Part 12: production antipatterns (part-12-production-antipatterns.md): Connections between terms, typical implementation errors
Part 13: capstone (part-13-capstone.md): Final package, full production SDD path for an incident
Appendix A: bridges to book (appendix-a-bridges-to-book.md): Mapping of study code and production incidents
Appendix B of first volume: AgentClinic domain: Domain entities for thought experiments
First volume glossary (../book/glossary.md): Basic SDD terms, supplemented with production refinements here
Example templates (examples/templates/): proposal.md, judgment.md and others — for practical filling
Runnable examples (examples/tribunal/, examples/stress-mutator/): Python stdlib scripts for local checks, not the main application stack
External frameworks for comparison: GitHub Spec Kit, AWS Kiro: Reference SDD cycle implementations, comparison in first volume appendix A
Summary: The Applied Volume Glossary is the operating system of production-SDD, where each term has a clear usage rule: Russian prose for explanations, English technical names for code. Key principles: (1) Read as needed, not in full; (2) Maintain dual spelling discipline — it is critical for team communication; (3) All metrics are vector and paired, guard metrics protect against Goodhart distortion; (4) Arbitration, constitution, and precedents are not bureaucracy, but formal protection against unpredictable agent behavior; (5) Mutation testing and poisoned specs are standard tools, not exotic; (6) Capstone dossier connects all parts into a reproducible production path. Successful glossary mastery is measured not by knowledge of definitions, but by the ability to correctly fill out judgment.md for a disputed incident, design a guard metric for a new feature, and conduct arbitration with clear verdicts for all roles.