Reading: Applied Part 6. Selection of Shadow Specifications



Status: Frontier. The technique itself — extracting informal heuristics into a separate layer and constraining them by a few-shot slot budget — is applied. But the scoring formula and acceptance thresholds require calibration for a specific project. The idea of "not replacing the main specification" is a recommendation.

For an educational pass, it is enough to run examples/shadow-auction/ and see why one heuristic lands in QWEN.md while another goes to quarantine. Calibrating weights on 50+ incidents belongs to the full production track.

Let's introduce the key concepts. Shadow specifications (shadow specs) are verifiable heuristics from operational practice: they help at the triage phase but are not mandatory system requirements. A few-shot example is a short example in the prompt that shows the agent the desired response format on similar cases. The scorebook is a journal of the shadow specifications economy: seed data, scoring formula, thresholds, budget, candidate versions, and decision protocol.

When we assembled mission.md in part 6 of the first volume, participants had remaining wishes that did not quite reach the requirement level. Typical examples:

"respond shorter at night",
"don't scare the patient with the word emergency",
"on a recurring symptom, request history immediately".

This chapter answers the question deferred then: what to do with such wishes in production, where they go, how they prove their usefulness, and when they can be removed. The few-shot example that ultimately ends up in QWEN.md is the same agent memory as in part 19 of the first volume, but with an explicit time-to-live (ttl) and an acceptance auction.

Before Reading

Anchor from the first volume: part 6 shows that wishes are not equal to requirements; part 19 separates memory from specification.
Local educational case: the auction shadow.p0.voice_handoff versus a noisy heuristic about dashboard color.
Trace for capstone/: a short Shadow notes block with one accepted and one rejected candidate for high_memory_usage.
Key terms of the first pass: shadow specification and scorebook. Auction, few-shot example, quarantine — for reference.

What to defer: collecting candidates from 50+ incidents, calibrating weights, and automatic update of QWEN.md.

Goal

In this chapter you will turn informal observations from incident management into a verifiable layer of shadow specifications with measurable value. The word "auction" here means a ranked selection under a constrained context budget, not a separate product or mandatory external service. Which observations land here:

tone of communication,
intuitive premises,
environment signals,
"magical" decisions of experienced engineers.

The goal is not to replace the formal specification. The goal is to separate useful heuristics from operational folklore. As a result you will be able to:

run a shadow specifications auction (that is, evaluate and select heuristics under a constrained context budget);
assign predictive value to each nuance based on historical incidents;
keep in QWEN.md only those few-shot examples that actually improve the quality of Qwen Code.

Minimal Educational Scenario

Educational Case

You need to decide whether the heuristic shadow.p0.voice_handoff lands in QWEN.md, and whether a noisy heuristic about the red color of the dashboard goes to quarantine. The goal is to see that an informal observation passes through evaluation and budget, and does not become a requirement by authority.

Preparation

book2/examples/shadow-auction/candidates/candidates.yaml.
book2/examples/shadow-auction/data/incidents.jsonl.
Scripts score.py, decide.py, write_qwen_block.py.

Steps

cd book2/examples/shadow-auction. *Expectation: you are in the runnable example directory.*
python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json. *Expectation: a scorebook has been created with score components.*
python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json. *Expectation: some candidates received the winner status, some went to quarantine.*
python3 scripts/write_qwen_block.py --auction out/auction.json --target-anchor "QWEN.md#incident-triage-shadow" --today 2026-05-17 --out out/qwen_block.md. *Expectation: the block for QWEN.md contains only the winners and a link to the decision source; for the educational date it matches outputs/qwen_block.example.md.*
Compare out/auction.json and out/quarantine.json. *Expectation: the losing candidate has not disappeared, but has received a reason for rejection.*

Check Fact

The winner has not become a mandatory requirement. It is formatted as a versioned few-shot example with source_ref, score, and review period. The candidate below the threshold is in quarantine with a reason.

How This Lands in `capstone/`

Move a short Shadow notes section to capstone/README.md: one winner and one rejected candidate, id, score, reason for keep/reject, and review period. Do not add the winner to requirements.md: in the graded package it remains a shadow hint, not an approved requirement.

Minimal fragment:

shadow_notes:
  keep:
    id: shadow.p0.voice_handoff.v1
    score: 0.727
    ttl: "14d"
    reason: "early signal for manual handoff"
  reject:
    id: shadow.alert.red_color_urgency
    reason: "false escalation risk"

Reviewable Trace

out/ is not needed in the educational package. For grading it is enough to keep a short extract in QWEN.md or capstone/README.md with a link to the auction criterion.

Key Ideas

Start normalization by converting observations to the shadow specification format: context → signal → observed effect. The fields are:

Context sets the boundaries of applicability. For example, "a P0 incident with cascade risk in appointments-api".
Signal captures the observed detail. For example, "the on-call writes short imperative messages and skips the standard template".
Effect describes a verifiable consequence. For example, "within 5–10 minutes a manual bypass or urgent rollback occurs".

This format does not make the nuance a fully formal contract. But it turns it into a slot that can be compared against incident history. The additional fields evidence, scope, risk, and source_ref are needed so that Qwen Code does not guess the meaning of the heuristic from free text.
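A minimal sketch of how such a slot could be represented in code, assuming Python dataclasses. The class name ShadowCandidate, the id shadow.example.terse_oncall, the completeness check, and the sample values for evidence, scope, risk, and source_ref are illustrative assumptions; only the field names and the context/signal/effect wording come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowCandidate:
    """One normalized shadow specification slot (hypothetical structure)."""
    id: str
    context: str   # boundaries of applicability
    signal: str    # observed detail
    effect: str    # verifiable consequence
    evidence: str = ""
    scope: str = ""
    risk: str = ""
    source_ref: tuple[str, ...] = ()

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        required = ("context", "signal", "effect", "evidence", "scope", "risk")
        missing = [name for name in required if not getattr(self, name).strip()]
        if not self.source_ref:
            missing.append("source_ref")
        return missing


candidate = ShadowCandidate(
    id="shadow.example.terse_oncall",  # hypothetical id
    context="a P0 incident with cascade risk in appointments-api",
    signal="the on-call writes short imperative messages and skips the standard template",
    effect="within 5-10 minutes a manual bypass or urgent rollback occurs",
    evidence="postmortem extract",                 # assumed sample value
    scope="P0 only",                               # assumed sample value
    risk="false escalation on an ordinary P2",     # assumed sample value
    source_ref=("postmortem-placeholder",),        # assumed sample value
)
print(candidate.missing_fields())  # an empty list means the slot is complete
```

A check like missing_fields gives a cheap gate before scoring: a candidate with an empty scope or risk is exactly the kind of free text the agent would otherwise have to guess at.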

In your project, candidate collection is done by a pair of scripts harvest.py + normalize.py: the first collects extracts from interviews, postmortems, and incidents into .specify/memory/shadow-candidates.raw.ndjson, the second expands them using the context → signal → effect template into .specify/memory/shadow-candidates.yaml. There is no runnable analog for this stage in the textbook; it depends on where your sources live. The runnable analog of the evaluation and auction itself is in examples/shadow-auction/README.md.

After normalization, each candidate slot is evaluated on historical incidents along three metric groups:

impact on MTTR,
share of false escalations,
ability to provide early warning of a cascade.

The three axes are read as follows.

MTTR shows whether the heuristic helped reach the correct action faster. On its own this metric is dangerous: a rule can speed up individual cases while creating noise at the triage phase.

False escalations capture the cost of an incorrect trigger, especially when a shadow specification escalates P2 to P1 without sufficient grounds.

Early cascade warning measures whether the signal appeared before the standard alert, not after the formal system had already recorded the problem.

Record the final score as a reproducible formula, not as an expert "seems useful" estimate. For example, for the educational loop use score = 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation. Here coverage constrains overly narrow rules, and false_escalation penalizes noisy heuristics.
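The educational formula above can be sketched as a small function. This is not the code of scripts/score.py: the names Weights and shadow_score are hypothetical, and the sketch assumes that every component has already been normalized to the range [0, 1].

```python
from typing import NamedTuple


class Weights(NamedTuple):
    """Educational default weights; false_escalation is applied as a penalty."""
    mttr_gain: float = 0.5
    early_signal: float = 0.3
    coverage: float = 0.2
    false_escalation: float = 0.4


def shadow_score(mttr_gain: float, early_signal: float, coverage: float,
                 false_escalation: float, w: Weights = Weights()) -> float:
    """score = w1*mttr_gain + w2*early_signal + w3*coverage - w4*false_escalation."""
    components = {
        "mttr_gain": mttr_gain,
        "early_signal": early_signal,
        "coverage": coverage,
        "false_escalation": false_escalation,
    }
    for name, value in components.items():
        # Assumption: components are shares in [0, 1]; reject anything else.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (w.mttr_gain * mttr_gain
            + w.early_signal * early_signal
            + w.coverage * coverage
            - w.false_escalation * false_escalation)


# A perfect heuristic and a purely noisy one mark the ends of the score range.
print(round(shadow_score(1, 1, 1, 0), 3))
print(round(shadow_score(0, 0, 0, 1), 3))
```

Keeping the weights in one named structure makes it easy to record them in the scorebook next to the result, so that a later recalculation uses the same numbers.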

The weights in this formula are a starting calibration, not a law. The positive weights sum to one (0.5 + 0.3 + 0.2), so with every component normalized to [0, 1] the final score lies in the range [-0.4, 1] and reads as "share of useful signal". Within this unit the proportion 0.5 / 0.3 / 0.2 reflects the educational priority order of AgentClinic-production: reducing MTTR is the main measurable effect, the early signal is valuable only insofar as it reduces that same MTTR, and coverage is merely insurance against overly narrow rules. The false escalation penalty (0.4) is chosen so that a maximal false escalation share cancels about 80% of the effect of an ideal MTTR reduction (0.4 / 0.5 = 0.8): a heuristic with mttr_gain=1 and false_escalation=1 keeps only 0.5 - 0.4 = 0.1 of the score and does not make it into the final delivery. How to calibrate further:

if your team's cost of error is higher — raise the penalty to 0.6–0.8;
if early warning is more important — increase early_signal at the expense of mttr_gain.

After calibration, run the formula over 50+ historical incidents. Compare the winners with how the team currently makes decisions manually. If the discrepancy is too large, the weights are calibrated for someone else's risk profile.
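One way to put a number on "the discrepancy" is the share of candidates on which the auction and the team disagree. The sketch below is an assumption about how that comparison could be done: the function name, the candidate ids, and the sample decisions are hypothetical, and the text does not define what counts as "too large".

```python
def disagreement_rate(auction: dict[str, str], manual: dict[str, str]) -> float:
    """Share of candidates judged by both sides whose decisions differ."""
    shared = auction.keys() & manual.keys()
    if not shared:
        raise ValueError("no candidates were judged by both the auction and the team")
    differing = sum(1 for cid in shared if auction[cid] != manual[cid])
    return differing / len(shared)


# Hypothetical decisions for four hypothetical candidates.
auction_decisions = {
    "shadow.example.a": "keep",
    "shadow.example.b": "review",
    "shadow.example.c": "quarantine",
    "shadow.example.d": "keep",
}
manual_decisions = {
    "shadow.example.a": "keep",
    "shadow.example.b": "review",
    "shadow.example.c": "quarantine",
    "shadow.example.d": "quarantine",  # the team would have rejected this one
}
print(disagreement_rate(auction_decisions, manual_decisions))
```

The candidates that differ are more informative than the rate itself: each one points either at a weight that does not match the team's risk profile or at a manual habit that the data does not support.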

Take enough historical cases so that rare cascades do not disappear from the evaluation. For a serious decision, use 50+ incidents: this is the lower bound at which a rare cascade class (with a frequency of ~1 in 25 incidents) is expected to occur about twice in the sample (50 × 1/25 = 2), so that early_signal can begin to be distinguished from a random coincidence. Keep a smaller set only for smoke checks.
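The sample-size reasoning can be checked with a short binomial calculation. The sketch assumes that incidents are independent and that the rare class has a fixed rate of 1 in 25; both are simplifications, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


rate = 1 / 25  # frequency of the rare cascade class, taken from the text
for n in (20, 50, 100):
    expected = n * rate
    print(f"n={n}: expected occurrences {expected:.1f}, "
          f"P(at least 2) = {prob_at_least(2, n, rate):.2f}")
```

With 50 incidents the expected count is 2, but the probability of actually seeing the class at least twice is only roughly 0.6 under these assumptions. That is why 50 is a lower bound rather than a comfortable sample size.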

What does "data drift" mean in this context? Drift is a desynchronization of time scales and identifiers in incident sources. If the time axes in the sources are not aligned, Qwen Code can take a post-factum observation for an early signal. Therefore, before evaluation, perform three actions: deduplication, timestamp normalization, and binding of events to a single incident identifier.
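The three preparation steps can be sketched as one pass over raw events. The event fields (incident_id, source, ts, message), the incident id INC-0001, and the sample timestamps are made up for illustration; the sketch also assumes that every timestamp is ISO 8601 with an explicit UTC offset.

```python
from datetime import datetime, timezone


def normalize_events(events: list[dict]) -> dict[str, list[dict]]:
    """Deduplicate events, convert timestamps to UTC, group by incident id."""
    seen = set()
    by_incident: dict[str, list[dict]] = {}
    for event in events:
        ts = datetime.fromisoformat(event["ts"])
        if ts.tzinfo is None:
            # Without an offset the event cannot be placed on a shared time axis.
            raise ValueError(f"timestamp without UTC offset: {event['ts']}")
        ts_utc = ts.astimezone(timezone.utc)
        # Assumption: the same message at the same instant is one event,
        # even when two sources report it.
        key = (event["incident_id"], ts_utc, event["message"])
        if key in seen:
            continue
        seen.add(key)
        by_incident.setdefault(event["incident_id"], []).append({**event, "ts": ts_utc})
    for items in by_incident.values():
        items.sort(key=lambda e: e["ts"])
    return by_incident


raw = [
    {"incident_id": "INC-0001", "source": "chat",
     "ts": "2026-01-10T10:05:00+03:00", "message": "voice handoff started"},
    {"incident_id": "INC-0001", "source": "postmortem",
     "ts": "2026-01-10T07:05:00+00:00", "message": "voice handoff started"},
    {"incident_id": "INC-0001", "source": "monitoring",
     "ts": "2026-01-10T07:12:00+00:00", "message": "standard alert fired"},
]
timeline = normalize_events(raw)
for item in timeline["INC-0001"]:
    print(item["ts"].isoformat(), item["message"])
```

In the sample, the chat and postmortem entries describe the same moment in different time zones. Without normalization the chat entry would appear to come three hours after the alert instead of seven minutes before it.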

In your project, evaluation is formatted as python3 scripts/shadow_specs/score.py --candidates .specify/memory/shadow-candidates.yaml --incidents .data/incidents_hist_50plus.jsonl --weights "0.5,0.3,0.2,0.4" --out .specify/memory/shadow-scorebook.json. The runnable analog on educational data is in examples/shadow-auction/README.md.

The auction turns evaluation into managed allocation of a limited context budget.

Bad:

> the heuristic "on-call ASAP in Slack — raise severity to P1" is added directly to requirements.md as a mandatory requirement.

Problem: an unverified observation becomes a contract without evidence and generates false P1s on any "ASAP" in the chat.

Good:

> the same heuristic is formatted as a shadow specification shadow.slack.asap_urgency with a score of 0.55 and status review: the value is above the rejection threshold reject_threshold=0.40, but below the acceptance threshold keep_threshold=0.70, so the candidate goes to manual review, not to the formal specification.

How the process works. Qwen Code sorts candidates by value_score. Then it spends a predefined budget — for example, 8 few-shot slots or 2,000 tokens. The result is classified into three categories:

keep — winner, goes into QWEN.md;
review — disputed, for manual review;
quarantine — rejected, goes to quarantine.

Winners are automatically included only when the upper threshold is exceeded. Disputed ones go to manual review. Rejected ones do not remain in a gray zone. This scheme protects QWEN.md from bloat. Even a plausible nuance loses if its predictive value is below the cost of its slot in the prompt.
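The selection logic described above can be sketched as a greedy pass over candidates sorted by score. This is not the code of scripts/decide.py: the function name, the per-candidate tokens field, and the token costs are hypothetical. Two rules are assumptions as well: a score equal to the keep threshold counts as a win, and a high-scoring candidate that no longer fits the budget goes to review.

```python
def run_auction(candidates: list[dict], budget_tokens: int = 2000,
                keep_threshold: float = 0.70,
                reject_threshold: float = 0.40) -> dict[str, str]:
    """Classify candidates as keep / review / quarantine under a token budget."""
    decisions: dict[str, str] = {}
    remaining = budget_tokens
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if cand["score"] < reject_threshold:
            status = "quarantine"
        elif cand["score"] >= keep_threshold and cand["tokens"] <= remaining:
            status = "keep"
            remaining -= cand["tokens"]  # the winner pays for its prompt slot
        else:
            status = "review"            # disputed score or no budget left
        decisions[cand["id"]] = status
    return decisions


# Scores are the ones quoted in this chapter; token costs are made up.
sample = [
    {"id": "shadow.p0.voice_handoff", "score": 0.727, "tokens": 300},
    {"id": "shadow.slack.asap_urgency", "score": 0.55, "tokens": 250},
    {"id": "shadow.alert.red_color_urgency", "score": -0.3081, "tokens": 200},
]
for cid, status in run_auction(sample).items():
    print(cid, status)
```

Sorting before spending the budget is the part that makes this an auction: when slots run out, the candidates with the lowest predictive value are the ones left without a place in the prompt.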

In your project, the auction decision is formatted as python3 scripts/shadow_specs/decide.py --scorebook .specify/memory/shadow-scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction .specify/memory/shadow-auction.json --out-quarantine .specify/memory/shadow-quarantine.json. On educational data, the same step is run by examples/shadow-auction/README.md.

Turn winners into versioned few-shot blocks in QWEN.md rather than simply appending them to the end of the file. For each block, set:

id,
version,
source_ref,
score,
valid_from,
next_review (or ttl — an acceptable short form for short reviews like "14d"),
a short application example.

Why these fields? So that the next team can understand why this nuance exists.
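When a block uses the short ttl form, the review date can be derived from valid_from. The sketch below is an assumption about how that conversion might look: the function name is hypothetical and only the "Nd" (days) form shown in this chapter is supported.

```python
import re
from datetime import date, timedelta


def next_review(valid_from: str, ttl: str) -> str:
    """Return the ISO review date for a ttl such as '14d'."""
    match = re.fullmatch(r"(\d+)d", ttl)
    if match is None:
        # Assumption: other units are not part of the format, so fail loudly.
        raise ValueError(f"unsupported ttl format: {ttl!r}")
    start = date.fromisoformat(valid_from)
    return (start + timedelta(days=int(match.group(1)))).isoformat()


print(next_review("2026-05-17", "14d"))  # 2026-05-31
```

Storing the resolved date next to the ttl lets a reviewer see at a glance which few-shot blocks are overdue without redoing the date arithmetic.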

Remove low-value candidates explicitly: send them to quarantine with a reason, a review date, and a link to the calculation, so that they do not disappear from history without a trace. This matters for challenging decisions: if in a month the alert policy or failover architecture has changed, a previously rejected shadow specification can be returned to the auction without re-collecting source data.

- id: shadow.p0.voice_handoff.v1
  status: keep
  score: 0.727
  source_ref:
    - postmortem: "appointments-api-2026-02-11"
    - incident: "INC-1842"
  valid_from: "2026-05-17"
  next_review: "2026-08-17"
  few_shot_target: "QWEN.md#incident-triage-shadow"

Where 0.727 comes from specifically: this is the value produced by examples/shadow-auction/scripts/score.py on 20 historical incidents from data/incidents.jsonl with the default weights 0.5/0.3/0.2 − 0.4. Verification against the reference — examples/shadow-auction/outputs/scorebook.example.json.

The scorebook is a journal of the shadow specifications economy. It jointly stores seed data, scoring formula, thresholds, budget, candidate versions, and decision protocol.

Without a scorebook, the auction quickly turns into a contest of authority: an experienced engineer can push through a favorite heuristic, and Qwen Code will get contradictory few-shot examples. One more concept is useful here: anti-Goodhart, that is, protection against optimizing a metric at the expense of meaning. A reproducible journal provides three capabilities: recalculating results after changing weights, checking which incidents influenced the win, and separating real improvement from a Goodhart trap.
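The first capability, recalculation after a weight change, can be sketched as follows. The scorebook layout, the function names, and the candidate shadow.example.noisy with its component values are hypothetical; the components of shadow.p0.voice_handoff are the ones quoted in this chapter. The sketch ignores the token budget and looks only at thresholds.

```python
def score(components: dict, weights: tuple) -> float:
    """Apply (mttr_gain, early_signal, coverage, penalty) weights to stored components."""
    gain, early, cov, penalty = weights
    return (gain * components["mttr_gain"]
            + early * components["early_signal"]
            + cov * components["coverage"]
            - penalty * components["false_escalation"])


def status(value: float, keep: float = 0.70, reject: float = 0.40) -> str:
    if value >= keep:
        return "keep"
    if value < reject:
        return "quarantine"
    return "review"


def status_changes(scorebook: dict, old_w: tuple, new_w: tuple) -> dict:
    """Return {candidate_id: (old_status, new_status)} for candidates that moved."""
    changes = {}
    for cid, components in scorebook.items():
        before = status(score(components, old_w))
        after = status(score(components, new_w))
        if before != after:
            changes[cid] = (before, after)
    return changes


scorebook = {
    "shadow.p0.voice_handoff": {
        "mttr_gain": 0.7541, "early_signal": 1.0,
        "coverage": 0.25, "false_escalation": 0.0,
    },
    "shadow.example.noisy": {  # hypothetical candidate with made-up components
        "mttr_gain": 0.9, "early_signal": 0.9,
        "coverage": 0.5, "false_escalation": 0.2,
    },
}
print(status_changes(scorebook, (0.5, 0.3, 0.2, 0.4), (0.5, 0.3, 0.2, 0.8)))
```

Because the journal stores components rather than only the final score, doubling the penalty needs no new pass over the incidents, and a candidate with zero false escalations is visibly unaffected by the change.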

In the SDD loop, keep this file next to the project's memory and constitutional constraints. In Spec Kit, for such permanent rules it is convenient to use .specify/memory/constitution.md as a protective layer against drift (GitHub Spec Kit).

Full Track: Threshold Calibration

The weights of the auction formula, the keep/reject thresholds, and the signals for weight review are moved to Appendix D, section D.2. On the first pass, the section is not needed: one accepted and one rejected candidate with default weights is enough.

Examples and Application

Example: in the automatic triage project for appointments-api, the candidate shadow.p0.voice_handoff describes the following situation: at P0 the on-call does not write a long message in chat, but immediately initiates a voice handoff between the on-call and the service owner.

On 20 historical incidents from data/incidents.jsonl, this signal produced a score of 0.727: high MTTR gain (0.7541), confident early signal (1.0), narrow coverage (0.25), and zero false escalations. In five cases it shortened the time to involvement of the second shift. The candidate created no false escalations in this sample, because it was applied only with a confirmed P0 and a risk of transaction cascade.
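The score of 0.727 can be reproduced by hand from the four components quoted above and the default weights. The short check below only redoes that arithmetic; the dictionary layout is an illustration, not the scorebook format.

```python
# Default educational weights; the penalty enters with a negative sign.
weights = {"mttr_gain": 0.5, "early_signal": 0.3, "coverage": 0.2, "false_escalation": -0.4}

# Component values quoted in the text for shadow.p0.voice_handoff.
components = {"mttr_gain": 0.7541, "early_signal": 1.0, "coverage": 0.25, "false_escalation": 0.0}

contributions = {name: weights[name] * components[name] for name in weights}
total = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>16}: {value:+.4f}")
print(f"{'score':>16}: {total:.3f}")
```

The breakdown shows that the early signal and the MTTR gain contribute 0.30 and about 0.38 respectively, while the narrow coverage adds only 0.05. The candidate wins on precision, not on breadth.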

This candidate becomes a winner. But it lands in QWEN.md with a narrow applicability condition. Qwen Code should not recommend a voice channel for an ordinary P2, where the asynchronous text trail is more important than the speed of a call. The practical value here is not in the fact of "calling" itself, but in early recognition of a situation where handoff delay costs more than the loss of part of the written context.

Another candidate, shadow.alert.red_color_urgency, loses the auction although it looks intuitively convincing. The same runnable auction gives it a score of -0.3081: weak MTTR gain and a noticeable share of false escalations pull the score into the negative. The red color was often used in dashboards for visual emphasis, but did not match the radius of consequences, the SLO budget burn rate, or the actual escalation level.

This shadow specification had a triple negative effect:

it increased the share of false P1s,
it overloaded the triage phase,
it eroded trust in automatic recommendations.

Send it to quarantine with the reason high_false_escalation, a review date, and a return condition. First the team changes the alert visualization policy. Then it runs the candidate through the scorebook again.
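A quarantine record of this kind might look like the sketch below. The field names reason and return_condition follow the wording of this chapter; review_date, scorebook_ref, the date itself, and the helper due_for_review are assumptions made for illustration.

```python
from datetime import date

quarantine = [
    {
        "id": "shadow.alert.red_color_urgency",
        "score": -0.3081,
        "reason": "high_false_escalation",
        "review_date": "2026-08-17",  # assumed date
        "return_condition": "alert visualization policy has been changed",
        "scorebook_ref": "out/scorebook.json",  # assumed link to the calculation
    },
]


def due_for_review(records: list[dict], today: date) -> list[str]:
    """Return ids of quarantined candidates whose review date has arrived."""
    return [r["id"] for r in records if date.fromisoformat(r["review_date"]) <= today]


print(due_for_review(quarantine, date(2026, 6, 1)))
print(due_for_review(quarantine, date(2026, 9, 1)))
```

The review date only says when to look again. Whether the candidate actually returns to the auction still depends on the return condition being met and on a fresh run through the scorebook.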

A rare physical signal can win if the cost of a miss is significantly higher than the cost of a check. For example, shadow.dc.burn_smell_power_risk is applicable only to incidents with onsite observation in the data center. Its coverage is low, but early_signal is high: the smell of burning or overheating sometimes appears before the power monitoring shows degradation.

Such a candidate cannot be turned into a universal rule. Otherwise it will become toxic noise for cloud incidents without physical access. The correct form of inclusion is a rare few-shot example with three limiters: strict context, an explicit risk note, and a requirement to confirm the signal through the on-site operator channel.

flowchart TD
A[Chapter 6. Selection of Shadow Specifications]
A --> B[Interviews / postmortems / incident history]

B --> C[Extracting shadow candidates]
C --> D[Normalization context / signal / effect]
D --> E[Retro test on 50+ cases via Qwen Code]
E --> F["score = 0.5*mttr_gain + 0.3*early_signal + 0.2*coverage - 0.4*false_escalation"]
F --> G[Auction decision keep/quarantine/review]
G --> H[keep]
G --> I[quarantine]
G --> J[review]
H --> K[QWEN.md]
I --> L[quarantine with review date]
J --> L

Summary

The shadow specifications auction makes informal nuances manageable. Each candidate gets the structure context → signal → observed effect, goes through evaluation on historical incidents, competes for a limited budget — and either becomes a versioned few-shot example in QWEN.md, or goes to quarantine with a verifiable reason.

The main discipline of the process is not to trust vivid stories without a scorebook. Seed data, formula, thresholds, and decision protocol should make it possible to reproduce the result and challenge it when the infrastructure changes. The next chapter will translate this logic into a Specification CI gate, where the specification becomes an executable artifact.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| Local auction run from book2/examples/shadow-auction | smoke-pass; results are reproducible with the same weights and data |
| One winner | has source_ref, score, and review period; the winner does not expand the formal SDD contract and does not masquerade as a requirement |
| One rejected candidate | in quarantine with an explicit reason (for example, high_false_escalation) |
| Short block for QWEN.md or Shadow notes section in capstone/README.md | the few-shot example has a narrow applicability condition |

The full track adds .specify/memory/shadow-candidates.yaml in context → signal → effect format, .specify/memory/shadow-scorebook.json with formula and weights, .specify/memory/shadow-auction.json with winner/disputed/rejected decisions, and a versioned few-shot block or quarantine record. Consider it ready if every shadow specification has source_ref, scope, risk, and next_review, the score is computed reproducibly (without manual recalculation), and candidates are reviewed when weights, budget, or incident class change.

Practice

Run the auction on educational data: cd book2/examples/shadow-auction && python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json. *Expectation: diff -u outputs/scorebook.example.json out/scorebook.json gives 0 lines; among the scores there is at least one candidate with score >= 0.70 and at least one with score < 0.40.*
On the same scorebook.json, run python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json. *Expectation: out/auction.json and out/quarantine.json match the references in outputs/; out/quarantine.json has at least one entry with explicit reason and return_condition.*
Change the false escalation penalty weight from 0.4 to 0.8, recompute scorebook.json, and record the shift in capstone/README.md. *Expectation: capstone/README.md contains one line "with doubled false escalation penalty the candidate <id> moved from keep to quarantine"; the same line indicates which formula component became dominant under the new weights.*

Review Questions

How does a shadow specification differ from a full-fledged requirement, and why can't it be substituted for one?
Why should a few-shot example in QWEN.md have a review period?
How do you understand that a heuristic has become operational folklore?
The on-call engineer demands adding the rule "if the word ASAP is used in Slack — raise severity" to QWEN.md. How would you run it through the shadow specifications auction without refusing outright?