Reading: Applied Part 6. Selection of Shadow Specifications

Applied Part 6. Shadow Specification Selection

Status: Frontier. The technique itself (extracting informal heuristics into a separate layer and constraining them with a few-shot slot budget) is in use, but the scoring formula and acceptance thresholds require calibration for each specific project. The idea of "not replacing the main specification" is a recommendation.

For educational purposes, it is enough to run examples/shadow-auction/ and see why one heuristic ends up in QWEN.md while another goes to quarantine. Weight calibration on 50+ incidents belongs to the full production track.

Let us introduce key concepts. Shadow specifications (shadow specs) are verifiable heuristics from operational practice. They help during the triage phase but are not mandatory system requirements. A few-shot example is a short example in a prompt that shows the agent the desired response format for similar cases. A scorebook is the economics journal of shadow specifications: seed data, scoring formula, thresholds, budget, candidate versions, and decision protocol.

When we assembled mission.md in Part 6 of the first volume, participants had wishes that fell short of requirements. Typical ones were:

  • "respond more briefly at night",
  • "don't scare the patient with the word emergency",
  • "on repeated symptom, immediately request history".

This chapter answers the question deferred back then: what to do with such wishes in production, that is, where they land, how they prove their usefulness, and when they can be removed. The few-shot example that ends up in QWEN.md is the same agent memory as in Part 19 of the first volume, but with an explicit time-to-live (ttl) and an admission auction.

Before Reading

  • Foundation from the first volume: Part 6 shows that wishes are not equal to requirements; Part 19 separates memory from specification.
  • Local educational case: auction of shadow.p0.voice_handoff versus a noisy heuristic about dashboard color.
  • Trail for capstone/: short Shadow notes block with one accepted and one rejected candidate for high_memory_usage.
  • Key terms for the first pass: shadow specification and scorebook. Auction, few-shot example, quarantine—reference terms.
  • What to defer: candidate collection from 50+ incidents, weight calibration, and automatic QWEN.md updates.

Objective

In this chapter, you will convert informal observations from incident management into a verifiable layer of shadow specifications with measurable value. The word "auction" here means ranked selection under a limited context budget, not a separate product or mandatory external service. Observations that land here include:

  • tone of communication,
  • intuitive assumptions,
  • environmental signals,
  • "magical" solutions of experienced engineers.

The objective is not to replace the formal specification. The objective is to separate useful heuristics from operational folklore. As a result, you will be able to:

  • run a shadow specification auction (i.e., evaluation and selection of heuristics under a limited context budget);
  • assign predictive value to each nuance based on historical incidents;
  • keep in QWEN.md only those few-shot examples that actually improve Qwen Code quality.

Minimal Educational Scenario

Educational Case

You need to decide whether the heuristic shadow.p0.voice_handoff gets into QWEN.md, while the noisy heuristic about red dashboard color goes to quarantine. The goal is to see that an informal observation goes through evaluation and budget constraints, rather than becoming a requirement by authority.

Preparation

  • book2/examples/shadow-auction/candidates/candidates.yaml.
  • book2/examples/shadow-auction/data/incidents.jsonl.
  • Scripts score.py, decide.py, write_qwen_block.py.

Steps

  1. cd book2/examples/shadow-auction. *Expected: you are in the runnable example directory.*
  2. python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json. *Expected: a scorebook is created with score components.*
  3. python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json. *Expected: some candidates received winner status, some went to quarantine.*
  4. python3 scripts/write_qwen_block.py --auction out/auction.json --target-anchor "QWEN.md#incident-triage-shadow" --today 2026-05-17 --out out/qwen_block.md. *Expected: the block for QWEN.md contains only winners and a reference to the decision source; with the educational date, it matches outputs/qwen_block.example.md.*
  5. Compare out/auction.json and out/quarantine.json. *Expected: the losing candidate does not disappear, but receives a rejection reason.*

Control Fact

The winner does not become a mandatory requirement. It is formatted as a versioned few-shot example with source_ref, score, and review deadline. The candidate below threshold remains in quarantine with a reason.

How This Goes Into capstone/

Transfer a short Shadow notes section into capstone/README.md: one winner and one rejected candidate, id, score, keep/reject reason, and review deadline. Do not add the winner to requirements.md: in the capstone package, it remains a shadow prompt, not an approved requirement.

Minimal fragment:

shadow_notes:
  keep:
    id: shadow.p0.voice_handoff.v1
    score: 0.727
    ttl: "14d"
    reason: "early signal for manual handoff"
  reject:
    id: shadow.alert.red_color_urgency
    reason: "false escalation risk"

Reviewable Trail

out/ is not needed in the educational package. For credit, it is enough to save a short excerpt in QWEN.md or capstone/README.md with a reference to the auction criterion.

Key Ideas

Start normalization by translating observations into the shadow specification format: context → signal → observed effect. Fields are:

  • Context sets applicability boundaries. For example, "P0 incident with cascade risk in appointments-api".
  • Signal captures an observable detail. For example, "on-call writes short imperative messages and skips the standard template".
  • Effect describes a verifiable consequence. For example, "within 5–10 minutes, a manual bypass or urgent rollback occurs".

This format does not make the nuance a fully formal contract, but it turns it into a slot that can be compared against incident history. The additional fields evidence, scope, risk, and source_ref are needed so that Qwen Code does not have to guess the meaning of the heuristic from free text.
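The slot format above can be sketched as a small data structure. This is a minimal illustration, not the schema used by the course scripts: the class name `ShadowCandidate`, the method `missing_fields`, the field types, and the candidate id are assumptions; only the field names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowCandidate:
    """One candidate in context -> signal -> effect form (hypothetical schema)."""
    id: str
    context: str   # applicability boundary
    signal: str    # observable detail
    effect: str    # verifiable consequence
    # Additional fields named in the text; their types are assumed here.
    evidence: tuple[str, ...] = ()
    scope: str = ""
    risk: str = ""
    source_ref: tuple[str, ...] = ()

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        required = ("context", "signal", "effect", "scope", "risk", "source_ref")
        return [name for name in required if not getattr(self, name)]


# Example texts are taken from the field descriptions above; the id is made up.
candidate = ShadowCandidate(
    id="shadow.example.hypothetical",
    context="P0 incident with cascade risk in appointments-api",
    signal="on-call writes short imperative messages and skips the standard template",
    effect="within 5-10 minutes, a manual bypass or urgent rollback occurs",
)
print(candidate.missing_fields())
```

A candidate with a non-empty `missing_fields()` result is not yet ready for evaluation, because the scorer would have to guess its scope or risk from free text.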

In your own project, candidate collection is done by a pair of scripts, harvest.py and normalize.py. The first gathers excerpts from interviews, postmortems, and incidents into .specify/memory/shadow-candidates.raw.ndjson. The second expands them into the context → signal → effect template in .specify/memory/shadow-candidates.yaml. There is no runnable equivalent for this stage in the textbook, because it depends on where your sources are stored. The runnable equivalent for the evaluation and the auction itself is in examples/shadow-auction/README.md.

After normalization, each candidate slot is evaluated on historical incidents across three metric groups:

  • impact on MTTR,
  • false escalation rate,
  • ability to provide early cascade warning.

Each of these three axes answers its own question.

MTTR shows whether the heuristic helped reach the correct action faster. But this metric alone is dangerous: a rule may accelerate individual cases while simultaneously creating noise during triage.

False escalations capture the cost of incorrect triggering, especially when a shadow specification raises P2 to P1 without sufficient grounds.

Early cascade warning measures whether the signal appeared before the standard alert, not after the formal system had already flagged the problem.

Record the final score as a reproducible formula, not as an expert "seems useful" assessment. For example, for the educational track use score = 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation. Here coverage limits overly narrow rules, while false_escalation penalizes noisy heuristics.
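The formula can be written down as a plain function. This is a sketch under two assumptions: the function name `shadow_score` is hypothetical (it is not the API of scripts/score.py), and every component is taken to be normalized to [0, 1].

```python
def shadow_score(mttr_gain, early_signal, coverage, false_escalation,
                 weights=(0.5, 0.3, 0.2, 0.4)):
    """Educational score: weighted gains minus the false-escalation penalty.

    Assumes all four components are already normalized to [0, 1].
    """
    w_mttr, w_early, w_cov, w_false = weights
    return (w_mttr * mttr_gain
            + w_early * early_signal
            + w_cov * coverage
            - w_false * false_escalation)


# Extreme cases under the normalization assumption.
best = shadow_score(1.0, 1.0, 1.0, 0.0)    # every gain maximal, no false escalations
worst = shadow_score(0.0, 0.0, 0.0, 1.0)   # no gain, maximal false escalation
print(round(best, 3), round(worst, 3))
```

Keeping the weights as an explicit argument makes a later recalculation with different weights a one-line change rather than an edit to the formula.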

The weights in this formula are a starting calibration, not law. The positive weights sum to one (0.5 + 0.3 + 0.2), so, provided each component is normalized to [0, 1], the final score lies in the interval [-0.4; 1] and reads as "useful signal share". Within this unit, the ratio 0.5 / 0.3 / 0.2 reflects the educational priority order for AgentClinic-production: MTTR reduction is the main measurable effect, early signal is valuable only as a reduction of the same MTTR, and coverage is merely insurance against overly narrow rules. The penalty coefficient for false escalation (0.4) is chosen so that one false escalation consumes 80% of the useful effect of one ideal MTTR reduction (0.4 / 0.5 = 0.8). A heuristic that pairs one ideal MTTR reduction (mttr_gain=1) with one false escalation (false_escalation=1) keeps only 0.5 - 0.4 = 0.1 of the score and does not go into final delivery. How to calibrate further:

  • if your team's cost of error is higher—increase the penalty to 0.6–0.8;
  • if early warning is more important—increase early_signal at the expense of mttr_gain.
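The effect of both calibration moves can be checked on the single case discussed above: one ideal MTTR reduction paired with one false escalation. The sketch below is illustrative only; the helper `shadow_score`, the scenario names, and the alternative weight sets are assumptions, not output of the course scripts.

```python
def shadow_score(mttr_gain, early_signal, coverage, false_escalation, weights):
    """Same educational formula, with the weights passed in explicitly."""
    w_mttr, w_early, w_cov, w_false = weights
    return (w_mttr * mttr_gain + w_early * early_signal
            + w_cov * coverage - w_false * false_escalation)


# Hypothetical weight sets: (mttr_gain, early_signal, coverage, false_escalation penalty).
scenarios = {
    "default": (0.5, 0.3, 0.2, 0.4),
    "higher_penalty": (0.5, 0.3, 0.2, 0.8),   # cost of error is higher
    "early_first": (0.4, 0.4, 0.2, 0.4),      # early_signal raised at the expense of mttr_gain
}

# The case from the text: mttr_gain=1 and false_escalation=1, other components zero.
results = {
    name: round(shadow_score(1.0, 0.0, 0.0, 1.0, weights), 3)
    for name, weights in scenarios.items()
}
for name, score in results.items():
    print(f"{name}: {score:+.2f}")
```

In every scenario the positive weights still sum to one, so scores remain comparable across calibrations; only the penalty and the internal ratio change.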

After calibration, run the formula on 50+ historical incidents. Compare winners with how the team currently makes decisions manually. If the discrepancy is too large, the weights are calibrated to someone else's risk profile.

Take enough historical cases so that rare cascades do not disappear from evaluation. For serious decisions, use 50+ incidents: this is the lower bound at which a rare cascade class (with frequency ~1 in 25 incidents) is expected to appear in the sample about twice, so that early_signal can begin to be distinguished from random coincidence. Smaller sets should only be used for smoke testing.
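The sample-size argument can be checked with a simple binomial model. This is a sketch under the assumption that incidents are independent and the rare class occurs with a fixed probability of 1 in 25; the helper name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing the class k or more times."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


p_rare = 1 / 25                      # rare cascade class frequency from the text
expected_in_50 = 50 * p_rare         # expected number of rare cascades in 50 incidents
p_two_in_50 = prob_at_least(2, 50, p_rare)
p_two_in_20 = prob_at_least(2, 20, p_rare)

print(f"expected in 50: {expected_in_50:.1f}")
print(f"P(at least 2 in 50): {p_two_in_50:.2f}")
print(f"P(at least 2 in 20): {p_two_in_20:.2f}")
```

Under this model, 50 incidents give two rare cascades on average but not with certainty, which is why 50 is treated as a lower bound and a 20-incident set is suitable only for smoke testing.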

What "data drift" means in this context. Drift is desynchronization of timelines and identifiers across incident sources. If time axes in sources are not aligned, Qwen Code may mistake a post-hoc observation for an early signal. Therefore, before evaluation, perform three actions: deduplication, timestamp normalization, and event binding to a unified incident identifier.
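The three preparation steps can be sketched as follows. Everything here is illustrative: the event shape (`ref`, `kind`, `ts`), the identifiers `CHAT-17` and `INC-0001`, and both function names are assumptions, and timestamps are assumed to carry an explicit UTC offset.

```python
from datetime import datetime, timezone


def normalize_events(raw_events, id_map):
    """Deduplicate events, convert timestamps to UTC, bind them to one incident id."""
    seen = set()
    normalized = []
    for event in raw_events:
        # Timestamps must include an offset; naive ones would be read as local time.
        ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
        incident = id_map.get(event["ref"], event["ref"])
        key = (incident, event["kind"], ts)
        if key in seen:
            continue  # exact duplicate coming from another source
        seen.add(key)
        normalized.append({"incident": incident, "kind": event["kind"], "ts": ts})
    return sorted(normalized, key=lambda e: e["ts"])


def is_early_signal(events, incident):
    """True only if the first signal precedes the first standard alert."""
    signals = [e["ts"] for e in events if e["incident"] == incident and e["kind"] == "signal"]
    alerts = [e["ts"] for e in events if e["incident"] == incident and e["kind"] == "alert"]
    return bool(signals and alerts) and min(signals) < min(alerts)


raw = [
    # Chat log in local time: 09:05 at UTC-3 is 12:05 UTC, i.e. after the alert.
    {"ref": "CHAT-17", "kind": "signal", "ts": "2026-02-11T09:05:00-03:00"},
    {"ref": "INC-0001", "kind": "alert", "ts": "2026-02-11T09:10:00+00:00"},
    {"ref": "INC-0001", "kind": "alert", "ts": "2026-02-11T09:10:00+00:00"},
]
events = normalize_events(raw, id_map={"CHAT-17": "INC-0001"})
print(len(events), is_early_signal(events, "INC-0001"))
```

Comparing the raw clock readings (09:05 versus 09:10) would label the chat message an early signal; after normalization it is correctly treated as a post-hoc observation.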

In your own project, evaluation is run as python3 scripts/shadow_specs/score.py --candidates .specify/memory/shadow-candidates.yaml --incidents .data/incidents_hist_50plus.jsonl --weights "0.5,0.3,0.2,0.4" --out .specify/memory/shadow-scorebook.json. The runnable equivalent on educational data is in examples/shadow-auction/README.md.

The auction turns evaluation into managed allocation of a limited context budget.

Bad:

> heuristic "on-call ASAP in Slack—raise severity to P1" added directly to requirements.md as a mandatory requirement.

Problem: an unverified observation becomes a contract without proof and generates false P1s on every "ASAP" in chat.

Good:

> the same heuristic formatted as shadow specification shadow.slack.asap_urgency with score 0.55 and status review: the value is above the rejection threshold reject_threshold=0.40, but below the acceptance threshold keep_threshold=0.70, so the candidate goes to manual revision rather than into the formal specification.

How the process works. Qwen Code sorts candidates by value_score, then spends a predetermined budget, for example 8 few-shot slots or 2,000 tokens. Results are classified into three categories:

  • keep—winner, goes into QWEN.md;
  • review—disputed, goes to manual revision;
  • quarantine—rejected, goes to quarantine.

Winners are automatically included only when exceeding the upper threshold. Disputed ones go to manual revision. Rejected ones do not remain in a gray zone. This scheme protects QWEN.md from bloat. Even a plausible nuance loses if its predictive value is below the cost of prompt space.
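The selection logic can be sketched as a single pass over candidates sorted by score. This is not the implementation of decide.py: the function name `decide`, the per-candidate `tokens` costs, and the rule that a high-scoring candidate which does not fit the remaining budget goes to review are assumptions made for illustration.

```python
def decide(candidates, budget_tokens=2000, keep_threshold=0.70, reject_threshold=0.40):
    """Classify candidates into keep / review / quarantine under a token budget."""
    result = {"keep": [], "review": [], "quarantine": []}
    remaining = budget_tokens
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if cand["score"] < reject_threshold:
            result["quarantine"].append(cand["id"])
        elif cand["score"] >= keep_threshold and cand["tokens"] <= remaining:
            result["keep"].append(cand["id"])
            remaining -= cand["tokens"]
        else:
            # Disputed score, or a winner that no longer fits the budget (assumed rule).
            result["review"].append(cand["id"])
    return result


# Scores come from the chapter; token costs are invented for the example.
candidates = [
    {"id": "shadow.slack.asap_urgency", "score": 0.55, "tokens": 300},
    {"id": "shadow.alert.red_color_urgency", "score": -0.3081, "tokens": 250},
    {"id": "shadow.p0.voice_handoff", "score": 0.727, "tokens": 400},
]
decision = decide(candidates)
print(decision)
```

Shrinking the budget below the cost of the top candidate moves it from keep to review in this sketch, which mirrors the point that even a plausible nuance can lose to the cost of prompt space.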

In your own project, the auction decision is run as python3 scripts/shadow_specs/decide.py --scorebook .specify/memory/shadow-scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction .specify/memory/shadow-auction.json --out-quarantine .specify/memory/shadow-quarantine.json. The same step on educational data runs in examples/shadow-auction/README.md.

Turn winners into versioned few-shot blocks in QWEN.md, rather than simply appending them to the end of the file. For each block, set:

  • id,
  • version,
  • source_ref,
  • score,
  • valid_from,
  • next_review (or ttl—acceptable short form for short reviews like "14d"),
  • a short application example.

Why these fields. The next team to maintain QWEN.md must be able to understand why this nuance exists, where it came from, and when it is due for review.
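The relation between the short ttl form and an explicit review date can be sketched as below. The helper names `review_date` and `is_due` are hypothetical, and only the day-based form such as "14d" is assumed to be supported.

```python
import re
from datetime import date, timedelta


def review_date(valid_from: str, ttl: str) -> date:
    """Convert a short ttl like "14d" into an explicit review date."""
    match = re.fullmatch(r"(\d+)d", ttl)
    if match is None:
        raise ValueError(f"unsupported ttl format: {ttl!r}")
    return date.fromisoformat(valid_from) + timedelta(days=int(match.group(1)))


def is_due(next_review: date, today: date) -> bool:
    """A block is due for review on or after its review date."""
    return today >= next_review


deadline = review_date("2026-05-17", "14d")
print(deadline.isoformat())
print(is_due(deadline, date(2026, 5, 17)), is_due(deadline, date(2026, 6, 1)))
```

Resolving ttl to a concrete date at write time keeps the block self-explanatory for readers who do not know when it was added.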

Remove low-value candidates explicitly. Send them to quarantine with a reason, review date, and reference to the calculation. Do not let them disappear from history without a trace. This is important for challenging decisions: if after a month the alerting policy or failover architecture changes, a previously rejected shadow specification can be returned to auction without re-gathering source data.

- id: shadow.p0.voice_handoff.v1
  status: keep
  score: 0.727
  source_ref:
    - postmortem: "appointments-api-2026-02-11"
    - incident: "INC-1842"
  valid_from: "2026-05-17"
  next_review: "2026-08-17"
  few_shot_target: "QWEN.md#incident-triage-shadow"

Where exactly 0.727 comes from: this is the value output by examples/shadow-auction/scripts/score.py on 20 historical incidents from data/incidents.jsonl with default weights 0.5/0.3/0.2 − 0.4. Reference check—examples/shadow-auction/outputs/scorebook.example.json.

The scorebook is the economics journal of shadow specifications. It stores together the seed data, scoring formula, thresholds, budget, candidate versions, and decision protocol.

Without a scorebook, the auction quickly turns into an authority contest. A senior engineer can push through their favorite heuristic, and Qwen Code gets contradictory few-shot examples. Here it is useful to introduce another concept. Anti-Goodhart is protection against optimizing a metric at the expense of meaning. A reproducible journal enables three things: recalculate results after weight changes, verify which incidents influenced the win, and separate real improvement from a Goodhart trap.

In the SDD track, keep this file next to memory and constitutional constraints of the project. In Spec Kit, .specify/memory/constitution.md is convenient for such permanent rules as a protective layer against drift (GitHub Spec Kit).

Full Track: Threshold Calibration

Auction formula weights, keep/reject thresholds, and signals for weight review are moved to Appendix D, Section D.2. Not needed on the first pass: one accepted and one rejected candidate on default weights is enough.

Examples and Application

Example: in an automated triage project for appointments-api, candidate shadow.p0.voice_handoff describes the following situation: on P0, the on-call does not write a long message in chat but immediately initiates a voice handoff between the on-call engineer and the service owner.

On 20 historical incidents from data/incidents.jsonl, this signal yielded score 0.727: high MTTR gain (0.7541), confident early signal (1.0), narrow coverage (0.25), and zero false escalations. In five cases, it reduced time to second-shift involvement. The candidate created no false escalations because it applied only to confirmed P0 with cascade transaction risk.
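Substituting the reported components into the educational formula reproduces the score, which is a quick way to check a scorebook entry by hand:

```latex
\begin{aligned}
\text{score} &= 0.5 \cdot \text{mttr\_gain} + 0.3 \cdot \text{early\_signal}
              + 0.2 \cdot \text{coverage} - 0.4 \cdot \text{false\_escalation} \\
             &= 0.5 \cdot 0.7541 + 0.3 \cdot 1.0 + 0.2 \cdot 0.25 - 0.4 \cdot 0 \\
             &= 0.37705 + 0.3 + 0.05 - 0 \\
             &= 0.72705 \approx 0.727
\end{aligned}
```

The coverage value is consistent with the five cases mentioned above: 0.25 of 20 incidents is 5.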

This candidate becomes a winner. But in QWEN.md, it lands with a narrow applicability condition. Qwen Code should not recommend a voice channel for a normal P2, where asynchronous text trail matters more than call speed. The practical value here is not in the fact of "calling" itself, but in early recognition of a situation where handoff delay is more costly than losing part of the written context.

Another candidate, shadow.alert.red_color_urgency, loses the auction although it looks intuitively convincing. The same runnable auction gives it score -0.3081: weak MTTR gain and a noticeable false escalation rate pull the score negative. Red color was often used in dashboards for visual emphasis, but did not correspond to blast radius, SLO budget burn rate, or actual escalation level.

This shadow specification had a triple negative effect:

  • increased false P1 rate,
  • overloaded triage phase,
  • eroded trust in automated recommendations.

Send it to quarantine with reason high_false_escalation, review date, and return condition. First, the team changes the alerting visualization policy. Then the candidate is rerun through the scorebook.
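A quarantine record for this candidate might look like the sketch below. The field names `reason` and `return_condition` appear in the chapter; the function name, the remaining field names, the review date, and the wording of the return condition are assumptions for illustration.

```python
import json


def quarantine_record(candidate_id, score, reason, review_on,
                      return_condition, scorebook_ref):
    """Build a quarantine entry; refuse entries without a reason or a way back."""
    if not reason or not return_condition:
        raise ValueError("quarantine needs an explicit reason and return condition")
    return {
        "id": candidate_id,
        "status": "quarantine",
        "score": score,
        "reason": reason,
        "review_on": review_on,                # hypothetical field name
        "return_condition": return_condition,
        "scorebook_ref": scorebook_ref,        # hypothetical field name
    }


record = quarantine_record(
    "shadow.alert.red_color_urgency",
    -0.3081,
    "high_false_escalation",
    review_on="2026-08-17",                    # example date, not from the course data
    return_condition="alerting visualization policy changed",
    scorebook_ref="out/scorebook.json",
)
print(json.dumps(record, indent=2))
```

Rejecting records without a reason or return condition at construction time is one way to make sure a candidate never lands in quarantine without a trace.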

A rare physical signal can win if the cost of missing it is significantly higher than the cost of checking. For example, shadow.dc.burn_smell_power_risk applies only to incidents with onsite observation in a data center. Its coverage is low, but early_signal is high: the smell of burning or overheating sometimes appears before power monitoring shows degradation.

Such a candidate cannot be turned into a universal rule. Otherwise it becomes toxic noise for cloud incidents without physical access. The proper inclusion form is a rare few-shot example with three constraints: hard context, explicit risk note, and requirement to confirm the signal through the onsite operator channel.

flowchart TD
A[Part 6. Shadow Specification Selection]
A --> B[Interviews / postmortems / incident history]

B --> C[Shadow candidate extraction]
C --> D[Normalization context / signal / effect]
D --> E[Retro-test on 50+ cases via Qwen Code]
E --> F["score = 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation"]
F --> G[Auction decision keep/quarantine/review]
G --> H[keep]
G --> I[quarantine]
G --> J[review]
H --> K[QWEN.md]
I --> L[quarantine with review date]
J --> L

Summary

The shadow specification auction makes informal nuances manageable. Each candidate gets a context → signal → observed effect structure, passes evaluation on historical incidents, competes for a limited budget—and either becomes a versioned few-shot example in QWEN.md, or goes to quarantine with a verifiable reason.

The main discipline of the process is not to trust vivid stories without a scorebook. Seed data, formula, thresholds, and decision protocol must allow reproducing the result and challenging it when infrastructure changes. The next chapter will translate this logic into a specification gateway (Specification CI), where the specification becomes an executable artifact.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| Local auction run from book2/examples/shadow-auction | smoke-pass; results are reproducible with identical weights and data |
| One winner | has source_ref, score, and review deadline; winner does not expand the formal SDD contract and does not masquerade as a requirement |
| One rejected candidate | in quarantine with explicit reason (e.g., high_false_escalation) |
| Short block for QWEN.md or Shadow notes section in capstone/README.md | few-shot example has a narrow applicability condition |

The full track adds .specify/memory/shadow-candidates.yaml in context → signal → effect format, .specify/memory/shadow-scorebook.json with formula and weights, .specify/memory/shadow-auction.json with winner/disputed/rejected decisions, and a versioned few-shot block or quarantine record. Consider it ready if every shadow specification has source_ref, scope, risk, and next_review, the score is reproducibly calculated (without manual recalculation), and candidates are reviewed when weights, budget, or incident class change.

Practice

  1. Run the auction on educational data: cd book2/examples/shadow-auction && python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json. *Expected: diff -u outputs/scorebook.example.json out/scorebook.json yields 0 lines; among scores there is at least one candidate with score >= 0.70 and at least one with score < 0.40.*
  2. On the same scorebook.json, run python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json. *Expected: out/auction.json and out/quarantine.json match the references in outputs/; in out/quarantine.json at least one record with explicit reason and return_condition.*
  3. Change the false escalation penalty weight from 0.4 to 0.8, recalculate scorebook.json, and record the shift in capstone/README.md. *Expected: in capstone/README.md one line is recorded: "with doubled false escalation penalty, candidate <id> moved from keep to quarantine"; the same line indicates which formula component became dominant under the new weight.*

Review Questions

  1. How does a shadow specification differ from a full requirement, and why can it not replace one?
  2. Why must a few-shot example in QWEN.md have a review deadline?
  3. How can you tell that a heuristic has become operational folklore?
  4. An on-call engineer demands adding a rule to QWEN.md: "if the word ASAP is used in Slack—raise severity". How do you run this through the shadow specification auction without rejecting it immediately?

Course: Production SDD for Qwen Code CLI. Part 2