Reading: Practical Part 13. Practical Assessment: Build a Production SDD Pipeline

Practical Part 13. Practical capstone: assemble a production SDD pipeline

Status: Recommendation. This part introduces no new mechanism. It assembles the second volume into a single verifiable route modeled on the capstone of the first volume. The goal is to prove that you can carry a production SDD scenario from a legacy trail to a decision that is permitted by facts, not by the agent's confidence.

The capstone is best completed after chapters 1–12. If you are reading the volume selectively, use this part as a map of missing artifacts: any gap in the capstone/ package points to the chapter you need to revisit. If it is unclear how to link the files into a single case, return to part 0: it sets the AgentClinic-production lab frame and explains what counts as the teaching minimum.

Goal

By the end of the capstone you should have one coherent package of evidence for AgentClinic-production:

a recovered requirement with provenance;
a deliberately defective specification and its corrected version;
a constitution.md with immutable and mutable rules;
at least one counterexample and one duel record;
a local Spec CI or its runnable analogue;
a judgment.md or a precedent record;
a budget and anti-Goodhart control;
a readiness gate and a list of blockers;
a diagnostic checklist of antipatterns.

The capstone is considered passed not when all the files look complete, but when another person can open the package, repeat the key checks, and understand why the decision is safe to admit or why it must be deferred.

Final case

Work with one production incident. The recommended primary case is high_memory_usage, because it traverses webhook normalization, the readiness gate, and the dry-run from part 11. autoscale_200pct can be chosen instead if you build the capstone around the duel and file arbitration. Do not mix the two cases in a single capstone.

Minimal framing:

AgentClinic-production received an alert from Grafana or PagerDuty;
the legacy trails are incomplete: some rules are known from the post-mortem, some from QWEN.md, some from verbal practice;
automatic remediation looks useful, but it may violate the blast-radius limit, the tier budget, or the anti-Goodhart invariant;
before admission you must prove that the specification, plan, verification, and readiness do not contradict each other.

Package structure

Create a directory:

capstone/
  README.md
  genealogy.md
  poisoned-spec.md
  fixed-spec.md
  constitution.md
  validation.md
  judgment.md
  budget-note.md
  goodhart-note.md
  readiness.md
  antipattern-audit.md

If you work in a real project, the names can be adapted. But the roles of the files must remain the same: provenance, defect, fix, rules, facts, arbitration, budget, metrics, readiness, and process audit.
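
The layout above can be created with a short script. This is a minimal sketch, not part of the book's examples: the function name `scaffold`, the placeholder comment written into each file, and the role label for README.md are assumptions; the other role labels follow the list in the previous paragraph.

```python
from pathlib import Path

# File names come from the package listing; the role labels follow the
# paragraph above, except "case summary" for README.md, which is an assumption.
CAPSTONE_FILES = {
    "README.md": "case summary",
    "genealogy.md": "provenance",
    "poisoned-spec.md": "defect",
    "fixed-spec.md": "fix",
    "constitution.md": "rules",
    "validation.md": "facts",
    "judgment.md": "arbitration",
    "budget-note.md": "budget",
    "goodhart-note.md": "metrics",
    "readiness.md": "readiness",
    "antipattern-audit.md": "process audit",
}


def scaffold(root: Path) -> list[Path]:
    """Create missing capstone files and never overwrite existing ones."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, role in CAPSTONE_FILES.items():
        path = root / name
        if not path.exists():
            # Hypothetical placeholder so an empty file still shows its role.
            path.write_text(f"<!-- role: {role} -->\n", encoding="utf-8")
            created.append(path)
    return created
```

Calling `scaffold(Path("capstone"))` a second time returns an empty list, so the script is safe to rerun after you have started filling in the files.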

Before filling out your package, open [examples/templates/capstone-dossier.md](examples/templates/capstone-dossier.md). This is the reference "golden path" of the first pass through high_memory_usage: it shows how many facts are enough for the capstone, without turning the chapter into a large production document.

Use it as a size governor. If your capstone/README.md or validation.md comes out noticeably longer than the reference, first check whether artifacts from the full track have leaked in: scorebook, metric_network, the full out/duel.json, the entire budget plan, or a detailed chat history.

In chapters 1–12 look for a block titled "How this gets into capstone/". On the first pass it matters more than the chapter's full list of artifacts. If the block says to carry over one line, one accepted candidate, one defensive invariant, or one readiness verdict, do not expand the evidence package to all the files of the full production track.

Before starting, write five template lines into capstone/README.md:

Incident-case:
Main risk:
Key check:
Main blocker:
Next fix:

For the default route, the first line must be Incident-case: high_memory_usage. If autoscale_200pct is chosen, state this right away and do not add high_memory_usage as a second equally valid case.

If these lines cannot be filled in, the package is not yet assembled around a single case.
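
A quick way to test this condition is to parse the five lines and report which ones are still empty. The sketch below is an illustration under assumptions: the function name `check_readme` and the rule that the incident case must be exactly one of the two case names are my reading of the text, not a tool shipped with the course.

```python
REQUIRED_FIELDS = ("Incident-case", "Main risk", "Key check", "Main blocker", "Next fix")
ALLOWED_CASES = {"high_memory_usage", "autoscale_200pct"}


def check_readme(text: str) -> list[str]:
    """Return a list of problems; an empty list means the five lines are filled in."""
    values = {}
    for line in text.splitlines():
        # Split on the first colon only, so commands containing colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() in REQUIRED_FIELDS:
            values[key.strip()] = value.strip()
    problems = [f"missing or empty: {field}" for field in REQUIRED_FIELDS if not values.get(field)]
    case = values.get("Incident-case")
    if case and case not in ALLOWED_CASES:
        # Also catches two cases listed together on one line.
        problems.append(f"unexpected incident case: {case}")
    return problems
```

An unfilled template yields five problems, one per field, which is a direct signal that the package is not yet built around a single case.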

Minimal teaching scenario

Teaching case

Take high_memory_usage from [examples/real-api/](examples/real-api/) as the default route. If you instead choose autoscale_200pct from [examples/tribunal/](examples/tribunal/), write that explicitly in capstone/README.md and do not add high_memory_usage as a second equally valid case. The goal is not to assemble a perfect production process, but a small reproducible package of evidence: one incident, one specification defect, one counterexample or readiness output, one list of blockers.

Preparation

Read the README of the chosen runnable example.
Copy the needed templates from [examples/templates/](examples/templates/).
Create an empty capstone/ directory.
Decide in advance what will count as a blocker: a weak evidence_ref, a priority conflict, a violation of manual_review_floor, a budget overrun, or readiness below the threshold.

Steps

1. Fill in capstone/genealogy.md: one recovered requirement, at least two sources, a confidence level, and an open question.
2. Create capstone/poisoned-spec.md: introduce exactly one defect (a priority conflict, a cycle, or a hidden boundary violation).
3. Create capstone/fixed-spec.md: fix the defect with an exception rule, a schema, or an explicit negative requirement.
4. Fill in capstone/constitution.md: at least two immutable_principles, one mutable_rule with ttl, max_scope, and rollback_condition, and a short governance_protocol.
5. Run one runnable example for the chosen case.

For high_memory_usage, run the commands from the "Minimal teaching scenario" section of part 11: one positive readiness check, one blocking stateful case, one allowed dry-run, and one forbidden dry-run. The commands with readiness_block_stateful.json and delete_namespace are expected to exit with code 1. This is not a broken example but a source of blockers for capstone/validation.md.
For autoscale_200pct, run the three scripts from the "Minimal teaching scenario" section of part 8: run_duel.py, check_invariants.py, write_judgment.py.

The commands are not fully duplicated here, so that the capstone does not turn into copy-paste. If you have both chapters open, follow their steps in the same order.

6. Transfer the result to capstone/validation.md: the command, the expected fact, the actual result, and the admission blocker. For real-api the positive readiness run shows the admissible path, readiness_block_stateful.json gives a stateful blocker, and delete_namespace shows the boundary of pre-agreed actions. If the command came from another runnable directory, explain which principle carries over into the main case.
7. Fill in capstone/judgment.md: a verdict of APPROVE, DENY, or DEFERRED, a reason, an evidence_ref, and the next step. judgment.md is a decision record on a specific dispute; a recurring class of conflict is additionally recorded in capstone/precedents.md with five fields (case_id / verdict / evidence_ref / applies_to / next_check); see part 8.
8. Add capstone/budget-note.md: what happens if local-coder fails, what limit frontier-reviewer guards, when the emergency mode triggers.
9. Add capstone/goodhart-note.md: which target metric may start lying and which guard metric constrains it.
10. Fill in capstone/readiness.md: the final score, blocking conditions, and why 23/25 with evidence is better than 25/25 without it.
11. Walk through the diagnostic checklist from part 12 and record three risks in capstone/antipattern-audit.md.
12. Finish capstone/README.md: one paragraph of context, a list of commands, a final status, and a list of fixes before production.

After step 12, reread capstone/README.md as a fresh reviewer. It should show not all the details, but the verification route: where the requirement came from, what was broken, which command was run, what verdict came out, and what blocks production admission.
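
The validation.md step asks for four things per check: the command, the expected fact, the actual result, and the admission blocker. The sketch below shows one way to produce such an entry from a real run instead of from memory. The function name `record_check` and the four-line entry layout are assumptions made for illustration; only the four fields themselves come from the text.

```python
import shlex
import subprocess


def record_check(command: list[str], expected_exit: int, expected_fact: str, blocker: str) -> str:
    """Run one command and format a validation.md entry from the real outcome."""
    result = subprocess.run(command, capture_output=True, text=True)
    matches = result.returncode == expected_exit
    status = "matches" if matches else "DOES NOT MATCH"
    return "\n".join([
        f"Command: {shlex.join(command)}",
        f"Expected fact: {expected_fact} (exit code {expected_exit})",
        f"Actual result: exit code {result.returncode}, {status} expectation",
        f"Admission blocker: {blocker}",
    ])
```

For the blocking scenario you would pass `["python3", "scripts/check_readiness.py", "--readiness", "fixtures/readiness_block_stateful.json"]` with `expected_exit=1`, since that command is expected to fail. Recording the expected exit code next to the actual one keeps an intentional failure from being read as a broken example.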

A minimal capstone/README.md for the first pass fits in five lines:

Incident-case: high_memory_usage
Main risk: auto-remediation without a full audit_trace or backup evidence
Key check: python3 scripts/check_readiness.py --readiness fixtures/readiness_block_stateful.json
Main blocker: stateful workload without backup_verified blocks the action
Next fix: add evidence_ref for backup and repeat the dry-run

Verification fact

The package is good enough for the capstone if another reader can open capstone/README.md and answer five questions without the history of your chat:

Which requirement was recovered and where did the evidence come from?
What defect was introduced and how was it fixed?
Which check was actually run?
Why is the file-arbitration verdict or the readiness gate the way it is?
What remains a blocker before production?

If even one question requires a verbal comment from the author, the package is not ready.

Reviewable trail

Do not transfer out/ from the runnable examples into the final package. The final trail is a short capstone/ with files that answer the five questions above. If you work in your own repository, commit this evidence package itself, not the local run directories.

Quick questions

Answer in writing, without Qwen Code.

How does genealogy.md differ from validation.md?
Why must a deliberately defective specification contain exactly one defect?
When can a shadow specification end up in QWEN.md but not in requirements.md?
Why does Spec CI not replace the Verifier?
What must judgment.md contain so that the dispute can be repeated?
Why can manual_review_floor not be zeroed out even with good KPIs?
What makes token_health more useful than a simple count of tokens spent?
Why is a readiness score without evidence_ref not an admission?
When is DEFERRED better than a formal APPROVE?
Which antipattern from part 12 most often breaks your package?

Grading criteria

Grade the package out of 30 points. Five categories of 6 points each reflect the five pillars of production SDD: provenance of the fact, its verifiability, dispute resolution, holding the limiters, and package clarity. Equal weight means that one strong category does not compensate for a weak one, and within each category the 6 points cover typical blind spots without over-detail.

Provenance and specification — 6 points

1: genealogy.md links the requirement to at least two sources;
1: disputed facts are not passed off as approved requirements;
1: the poisoned/fixed pair contains one defect and one fix;
1: the fix changes a verifiable artifact, not only the explanation;
1: constitution.md separates the immutable and mutable layers;
1: the mutable rule has ttl, max_scope, rollback_condition.
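
The last two criteria can be checked mechanically once the constitution is available as structured data. The sketch below assumes that constitution.md has been parsed into a plain dictionary. The key `mutable_rules` (plural) and the function name `check_constitution` are hypothetical; `immutable_principles`, `ttl`, `max_scope`, `rollback_condition`, and `governance_protocol` are the names used in this part.

```python
MUTABLE_RULE_FIELDS = ("ttl", "max_scope", "rollback_condition")


def check_constitution(constitution: dict) -> list[str]:
    """Return the structural gaps in a parsed constitution; empty means none found."""
    problems = []
    if len(constitution.get("immutable_principles", [])) < 2:
        problems.append("fewer than two immutable_principles")
    # "mutable_rules" is an assumed key name for the list of mutable rules.
    rules = constitution.get("mutable_rules", [])
    if not rules:
        problems.append("no mutable_rule")
    for index, rule in enumerate(rules):
        for field in MUTABLE_RULE_FIELDS:
            if not rule.get(field):
                problems.append(f"mutable_rule[{index}] lacks {field}")
    if not constitution.get("governance_protocol"):
        problems.append("no governance_protocol")
    return problems
```

The check is purely structural: it confirms that the two layers and the three limiting fields exist, not that their values are sensible.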

Checks and facts — 6 points

1: at least one runnable example from book2/examples/ is launched;
1: the result is transferred to validation.md with the command and the expectation;
1: a negative or blocking scenario is described explicitly;
1: Spec CI or its analogue verifies the link between the requirement and the plan;
1: the readiness or dry-run does not bypass blocking conditions;
1: out/ is not passed off as a reviewable artifact.

Arbitration and roles — 6 points

1: judgment.md contains a verdict, a reason, and an evidence_ref;
1: the Verifier/Implementer/Safety roles are not mixed; the Coordinator only keeps judgment.md;
1: the counterexample is minimal or is explicitly marked as non-minimal;
1: in a dispute there is a DEFERRED or a next verifiable step;
1: the precedent is recorded so that it can be applied again;
1: a Safety veto or its analogue cannot be overridden by majority vote.
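
The first and fifth criteria describe two record shapes: a judgment with a verdict, a reason, and an evidence_ref, and a precedent that can be applied again. A minimal sketch of both is given below. The class `Judgment`, the field name `next_step`, and the helper `to_precedent` are illustrative assumptions; the verdict values and the five precedent fields are the ones named in this part.

```python
from dataclasses import dataclass

VERDICTS = {"APPROVE", "DENY", "DEFERRED"}


@dataclass(frozen=True)
class Judgment:
    verdict: str
    reason: str
    evidence_ref: str
    next_step: str

    def problems(self) -> list[str]:
        """List what prevents this record from being a repeatable decision."""
        found = []
        if self.verdict not in VERDICTS:
            found.append(f"unknown verdict: {self.verdict}")
        for name in ("reason", "evidence_ref", "next_step"):
            if not getattr(self, name).strip():
                found.append(f"empty field: {name}")
        return found


def to_precedent(judgment: Judgment, case_id: str, applies_to: str, next_check: str) -> dict[str, str]:
    """Build the five-field precedent record; refuse incomplete judgments."""
    if judgment.problems():
        raise ValueError("; ".join(judgment.problems()))
    return {
        "case_id": case_id,
        "verdict": judgment.verdict,
        "evidence_ref": judgment.evidence_ref,
        "applies_to": applies_to,
        "next_check": next_check,
    }
```

Refusing to build a precedent from an incomplete judgment is a design choice of this sketch: a verdict without an evidence_ref should not become a rule that later disputes inherit.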

Production limiters — 6 points

1: the budget scenario describes the failure of a cheap tier;
1: frontier-reviewer is limited by risk or quota;
1: the anti-Goodhart pair links a KPI to a guard metric;
1: manual_review_floor is preserved;
1: the readiness score is accompanied by evidence;
1: a rollback or blocker is specified before admission.
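
The rule that a readiness score must be accompanied by evidence can be made concrete: points without an evidence_ref do not count, and any blocker denies admission regardless of the score. The sketch below illustrates that logic only. The class `ReadinessItem`, the function `gate`, the five-item breakdown, and the threshold of 20 in the usage note are hypothetical; the real gate is defined by the part 11 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessItem:
    name: str
    points: int
    evidence_ref: str = ""


def gate(items: list[ReadinessItem], blockers: list[str], threshold: int) -> tuple[bool, int, list[str]]:
    """Return (admitted, evidenced_score, reasons for refusal)."""
    # Only points backed by an evidence_ref contribute to the score.
    evidenced = sum(item.points for item in items if item.evidence_ref.strip())
    reasons = [f"blocker: {blocker}" for blocker in blockers]
    reasons += [
        f"no evidence_ref: {item.name}"
        for item in items
        if item.points and not item.evidence_ref.strip()
    ]
    if evidenced < threshold:
        reasons.append(f"evidenced score {evidenced} is below threshold {threshold}")
    return (not reasons, evidenced, reasons)
```

With an assumed threshold of 20, five items scored 5, 5, 5, 4, 4 and each carrying an evidence_ref pass with 23, while five items scored 5 each with no evidence_ref are refused with an evidenced score of 0. This is the sense in which 23/25 with evidence beats 25/25 without it.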

Package clarity — 6 points

1: capstone/README.md explains the case without an external chat;
1: the list of commands can be repeated locally or replaced with a link to a runnable analogue;
1: blockers are separated from improvements;
1: links to chapters and templates help to return to the source;
1: the diagnostic checklist from part 12 is completed;
1: the package contains no extra mechanisms unrelated to the chosen case.

25–30 points — the production SDD pipeline is ready for team review.

19–24 — the pipeline is suitable for a teaching pass, but you need to strengthen the evidence or the blockers.

Below 19 — return to the minimal scenarios of chapters 1–12 and reduce the size of the case.
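
The arithmetic of the rubric is simple enough to write down: five categories, at most 6 points each, and three bands over the total. The sketch below mirrors the thresholds stated above; the function name `grade` and the shortened band labels are assumptions.

```python
CATEGORIES = (
    "Provenance and specification",
    "Checks and facts",
    "Arbitration and roles",
    "Production limiters",
    "Package clarity",
)
MAX_PER_CATEGORY = 6


def grade(points: dict[str, int]) -> tuple[int, str]:
    """Sum the five category scores and map the total to a band."""
    for name in CATEGORIES:
        value = points[name]
        if not 0 <= value <= MAX_PER_CATEGORY:
            raise ValueError(f"{name}: {value} is outside 0..{MAX_PER_CATEGORY}")
    total = sum(points[name] for name in CATEGORIES)
    if total >= 25:
        band = "ready for team review"
    elif total >= 19:
        band = "teaching pass; strengthen the evidence or the blockers"
    else:
        band = "return to chapters 1-12 and reduce the size of the case"
    return total, band
```

Keeping the per-category scores in the input, instead of only the total, makes a weak category visible even when the sum lands in the top band.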

What to do after the capstone

Do not move the whole package into production as a template. Pick the two or three most useful artifacts and automate them first:

if requirement provenance is most often lost — start with genealogy.md;
if CI lets weak specifications through — start with Spec CI;
if disputes keep recurring — start with judgment.md and precedents.md;
if KPIs start lying — start with the anti-Goodhart validation.md.

The main result of the second volume is not a set of terms, but the habit of demanding a verifiable trail before admitting a dangerous automatic action.