Topic: Applied Part 6. Shadow Spec Selection
Difficulty level: Medium
Estimated study time: 4-6 hours (theory + practice)
Prerequisites: Part 6 of volume 1: understanding the distinction between wishes and requirements
Part 19 of volume 1: separating agent memory from specification
Basic proficiency in Python 3 and command line
Understanding of YAML/JSON format for configuration files
Experience with incident management systems (preferred)
Learning objectives: Independently run the full shadow spec auction cycle: from candidate normalization to block generation for QWEN.md
Calculate a candidate's score from the formula with the given weights and interpret the result against the keep/reject thresholds
Format the auction winner as a versioned few-shot prompt with the required fields id, source_ref, ttl, score
Record a rejected candidate in quarantine with an explicit reason and return condition so that it does not disappear from history
Run a sensitivity analysis of the formula weights and record how changing the false escalation penalty affects the auction decision
Overview: This chapter teaches how to turn informal operational observations (tone of communication, intuitive assumptions, the "magical" decisions of experienced engineers) into a manageable layer of shadow specifications. The key technique: extract heuristics into a separate layer, constrain them with a few-shot slot budget, measure their predictive value on historical incidents, and never let them replace the formal specification. You will run a training auction with real scripts, see why one heuristic makes it into QWEN.md while another goes to quarantine, and record the result in a capstone project.
Key concepts: Shadow specs: Verifiable heuristics from operational practice that help during triage but are not mandatory system requirements. Format: context → signal → observed effect. Example: "P0 incident with cascade risk → on-call writes short imperative messages → within 5–10 minutes a manual bypass or rollback occurs".
Few-shot prompt: A short example in the prompt that shows the agent the desired response format for similar cases. Unlike regular memory, it has an explicit time-to-live (ttl) and must pass the admission auction.
Scorebook: The economic ledger of shadow specs: seed data, scoring formula, thresholds, budget, candidate versions, and decision protocol. It lets you reproduce a result and challenge it when the infrastructure changes.
Shadow spec auction: A ranked selection process under a limited context budget (e.g., 2000 tokens or 8 slots). Candidates are sorted by value_score and then split into three categories: keep (winner, written to QWEN.md), review (disputed, sent for manual revision), and quarantine (rejected, stored with a reason and return condition).
Scoring formula: A reproducible formula that replaces the expert judgment "seems useful". Training variant: score = 0.5×mttr_gain + 0.3×early_signal + 0.2×coverage − 0.4×false_escalation. The positive weights sum to 1, so with every component normalized to [0, 1] the final score lies in [−0.4, 1].
Quarantine: Explicit storage for rejected candidates with rejection reason, review date, and return condition. Protects against low-value candidates disappearing without a trace and allows returning them when infrastructure changes.
Anti-Goodhart: Protection against optimizing the metric at the expense of meaning. A reproducible scorebook lets you recalculate results after weight changes, check the impact of specific incidents, and separate real improvement from Goodhart's trap.
Data drift: Desynchronization of timelines and identifiers in incident sources. Requires deduplication, timestamp normalization, and binding to a unified incident identifier before evaluation.
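The scoring formula from the key concepts can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/score.py: the function name value_score and the WEIGHTS dictionary are hypothetical, and the check that every component lies in [0, 1] is an assumption inferred from the stated score range.

```python
# Training weights from the chapter: three rewards and one penalty.
WEIGHTS = {
    "mttr_gain": 0.5,
    "early_signal": 0.3,
    "coverage": 0.2,
    "false_escalation": 0.4,
}


def value_score(mttr_gain, early_signal, coverage, false_escalation, weights=WEIGHTS):
    """Weighted sum of the three rewards minus the false-escalation penalty."""
    components = {
        "mttr_gain": mttr_gain,
        "early_signal": early_signal,
        "coverage": coverage,
        "false_escalation": false_escalation,
    }
    # Assumption: every component is normalized to [0, 1] before scoring.
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    rewards = sum(
        weights[name] * components[name]
        for name in ("mttr_gain", "early_signal", "coverage")
    )
    return rewards - weights["false_escalation"] * false_escalation


# The two extremes reproduce the stated range of the score.
best = value_score(1.0, 1.0, 1.0, 0.0)
worst = value_score(0.0, 0.0, 0.0, 1.0)
print(f"best={best:.2f} worst={worst:.2f}")
```

Keeping the weights in one dictionary makes it easy to pass an alternative set when you want to test how sensitive a decision is to the penalty.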
Practice exercises: Name: Running the training auction: from score to decision
Problem: In the book2/examples/shadow-auction directory, run the full pipeline: scorebook calculation, auction decision, and block generation for QWEN.md. Ensure results are reproducible and match the reference outputs.
Solution: Step 1: cd book2/examples/shadow-auction.
Step 2: python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json. Check diff -u outputs/scorebook.example.json out/scorebook.json: expect 0 lines.
Step 3: python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json. Verify match with outputs/auction.example.json and outputs/quarantine.example.json.
Step 4: python3 scripts/write_qwen_block.py --auction out/auction.json --target-anchor 'QWEN.md#incident-triage-shadow' --today 2026-05-17 --out out/qwen_block.md. Compare with outputs/qwen_block.example.md.
Step 5: verify that quarantine contains a record with explicit reason and return_condition.
Complexity: intermediate
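To make the decision step of this exercise less of a black box, here is a hedged sketch of how scored candidates could be split into keep, review, and quarantine. It is not the code of scripts/decide.py. The function decide, the tokens field, the token counts, the candidate shadow.example.borderline, and the rule that a keep-worthy candidate that does not fit the budget falls back to review are all assumptions made for illustration. Only the thresholds, the budget, and the two real candidate scores come from the chapter.

```python
def decide(candidates, budget_tokens=2000, keep_threshold=0.70, reject_threshold=0.40):
    """Split scored candidates into keep / review / quarantine."""
    result = {"keep": [], "review": [], "quarantine": []}
    remaining = budget_tokens
    # Highest score first, so the budget goes to the most valuable candidates.
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if cand["score"] < reject_threshold:
            result["quarantine"].append(cand["id"])
        elif cand["score"] >= keep_threshold and cand["tokens"] <= remaining:
            result["keep"].append(cand["id"])
            remaining -= cand["tokens"]
        else:
            # Assumption: disputed or over-budget candidates wait for manual review.
            result["review"].append(cand["id"])
    return result


candidates = [
    # Scores for the first two ids come from the chapter; token counts are made up.
    {"id": "shadow.alert.red_color_urgency", "score": -0.3081, "tokens": 120},
    {"id": "shadow.p0.voice_handoff.v1", "score": 0.727, "tokens": 180},
    # Entirely hypothetical candidate that lands between the two thresholds.
    {"id": "shadow.example.borderline", "score": 0.55, "tokens": 150},
]

auction = decide(candidates)
for category, ids in auction.items():
    print(category, ids)
```

Sorting before spending the budget is the design choice that makes this an auction rather than a first-come queue.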
Name: Weight sensitivity analysis: doubling the false escalation penalty
Problem: Change the false escalation penalty weight from 0.4 to 0.8, recalculate the scorebook, and record which candidate changed status. Explain which formula component became dominant.
Solution: Step 1: Run score.py with --weights 0.5,0.3,0.2,0.8.
Step 2: Compare out/scorebook.json with the original. Find the candidate whose score crossed a threshold (e.g., shadow.alert.red_color_urgency at −0.3081 may have fallen further, or a borderline candidate dropped below reject_threshold).
Step 3: Record in capstone/README.md: "With doubled false escalation penalty, candidate <id> moved from <status1> to <status2>. The false_escalation component became dominant: at false_escalation=1 it subtracts 0.8, equivalent to losing 1.6 ideal MTTR reductions (0.8/0.5=1.6)".
Step 4: Explain why this is calibration for a higher cost of error in your team.
Complexity: intermediate
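The effect of doubling the penalty can be previewed without the scripts. The sketch below is illustrative only: the candidate shadow.example.noisy_but_fast and its component values are invented for the demonstration, and the status function assumes that a score at or above the keep threshold means keep and a score below the reject threshold means quarantine. The voice handoff components are the ones quoted in the chapter's case study.

```python
def score(components, penalty):
    """Training formula with a configurable false-escalation penalty."""
    return (
        0.5 * components["mttr_gain"]
        + 0.3 * components["early_signal"]
        + 0.2 * components["coverage"]
        - penalty * components["false_escalation"]
    )


def status(value, keep_threshold=0.70, reject_threshold=0.40):
    """Assumed mapping from score to auction category."""
    if value >= keep_threshold:
        return "keep"
    if value < reject_threshold:
        return "quarantine"
    return "review"


candidates = {
    # Components quoted in the chapter for the winner.
    "shadow.p0.voice_handoff.v1": {
        "mttr_gain": 0.7541, "early_signal": 1.0, "coverage": 0.25, "false_escalation": 0.0,
    },
    # Hypothetical candidate: strong rewards, some false escalations.
    "shadow.example.noisy_but_fast": {
        "mttr_gain": 0.9, "early_signal": 0.9, "coverage": 0.5, "false_escalation": 0.2,
    },
}

report = {}
for cand_id, components in candidates.items():
    before = score(components, penalty=0.4)
    after = score(components, penalty=0.8)
    report[cand_id] = (status(before), status(after))
    print(f"{cand_id}: {before:.3f} -> {after:.3f} ({status(before)} -> {status(after)})")
```

A candidate with zero false escalations is untouched by the penalty weight, while the hypothetical noisy candidate drops from keep to review. That asymmetry is what the capstone note should capture.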
Name: Normalizing an observation into shadow spec format
Problem: Given an informal observation from an on-call: "When people write ASAP in Slack, I immediately know it's serious and escalate severity". Transform this into a structured shadow spec, then analyze risks and prepare arguments for the auction.
Solution: Step 1: Normalization. Context: "incident of any level in team chat". Signal: "message contains the word ASAP". Effect: "on-call manually raises severity".
Step 2: Add fields: evidence (on-call interview from 2026-05-10), scope (only the Slack #incidents channel), risk (false escalations: ASAP is also used for routine requests), source_ref (interview, 3 incidents).
Step 3: Prepare auction arguments. mttr_gain: unmeasurable, no data on time reduction. early_signal: weak, ASAP often appears after the problem is already known. coverage: high, but here that counts against the candidate (the signal is too broad). false_escalation: high risk. Prediction: the score will be low and the candidate goes to review or quarantine.
Step 4: Format as shadow.slack.asap_urgency with status review and the note "requires 50+ incidents for calibration".
Complexity: intermediate
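One way to keep normalized observations consistent is a small record type that refuses incomplete entries. The sketch below is an assumption about how such a record might look, not the schema of candidates/candidates.yaml. The class name ShadowSpec and its validation rules are hypothetical, while the field names and values are taken from the ASAP exercise.

```python
from dataclasses import dataclass, asdict

ALLOWED_STATUSES = {"keep", "review", "quarantine"}


@dataclass
class ShadowSpec:
    id: str
    context: str
    signal: str
    effect: str
    evidence: str
    scope: str
    risk: str
    source_ref: list
    status: str = "review"
    note: str = ""

    def __post_init__(self):
        # A shadow spec without all three core parts is just folklore.
        for name in ("context", "signal", "effect"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status}")


asap = ShadowSpec(
    id="shadow.slack.asap_urgency",
    context="incident of any level in team chat",
    signal="message contains the word ASAP",
    effect="on-call manually raises severity",
    evidence="on-call interview from 2026-05-10",
    scope="only the Slack #incidents channel",
    risk="false escalations: ASAP is also used for routine requests",
    source_ref=["interview", "3 incidents"],
    note="requires 50+ incidents for calibration",
)

print(asdict(asap)["status"])
```

Defaulting status to review reflects the exercise's conclusion: a freshly normalized observation has not yet earned a place in QWEN.md.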
Name: Formatting winner and rejected candidate in capstone
Problem: Transfer training auction results into capstone/README.md in the Shadow notes section. The winner must not become a requirement in requirements.md.
Solution: Minimal fragment for capstone/README.md:
shadow_notes:
  keep:
    id: shadow.p0.voice_handoff.v1
    score: 0.727
    ttl: "14d"
    reason: "early signal for manual handoff"
    source_ref:
      - postmortem: "appointments-api-2026-02-11"
      - incident: "INC-1842"
  reject:
    id: shadow.alert.red_color_urgency
    reason: "false escalation risk"
    score: -0.3081
    return_condition: "change in alert visualization policy"
Verify: winner has narrow applicability condition (only P0 with cascade risk), rejected candidate did not disappear and received reason and return condition, neither is added to requirements.md.
Complexity: beginner
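The verification at the end of this exercise can be partly automated. The following sketch checks that the winner carries the required fields id, source_ref, ttl, score, that the rejected candidate has a reason and a return condition, and when the winner's ttl runs out. The helper names missing_fields and expiry are hypothetical. The "Nd" ttl syntax is an assumption generalized from the single value "14d", and the start date reuses the --today value from the first exercise purely as an example.

```python
import re
from datetime import date, timedelta

KEEP_FIELDS = ("id", "source_ref", "ttl", "score")
REJECT_FIELDS = ("id", "reason", "return_condition")


def missing_fields(record, required):
    """Names of required fields that are absent or empty."""
    return [name for name in required if record.get(name) in (None, "", [])]


def expiry(ttl, start):
    """Date on which a ttl such as "14d" runs out (assumed syntax: days only)."""
    match = re.fullmatch(r"(\d+)d", ttl)
    if match is None:
        raise ValueError(f"unsupported ttl format: {ttl!r}")
    return start + timedelta(days=int(match.group(1)))


shadow_notes = {
    "keep": {
        "id": "shadow.p0.voice_handoff.v1",
        "score": 0.727,
        "ttl": "14d",
        "reason": "early signal for manual handoff",
        "source_ref": [
            {"postmortem": "appointments-api-2026-02-11"},
            {"incident": "INC-1842"},
        ],
    },
    "reject": {
        "id": "shadow.alert.red_color_urgency",
        "reason": "false escalation risk",
        "score": -0.3081,
        "return_condition": "change in alert visualization policy",
    },
}

keep_gaps = missing_fields(shadow_notes["keep"], KEEP_FIELDS)
reject_gaps = missing_fields(shadow_notes["reject"], REJECT_FIELDS)
review_date = expiry(shadow_notes["keep"]["ttl"], date(2026, 5, 17))
print(keep_gaps, reject_gaps, review_date.isoformat())
```

An empty list for both checks means neither record can silently lose the information that makes the decision reproducible.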
Case studies: Name: Voice handoff vs dashboard color: training auction in action
Scenario: Over six months of incidents, the appointments-api SRE team accumulated two operational observations. First: during P0 incidents with cascade risk, an experienced on-call engineer skipped the long chat message and immediately initiated a voice handoff with the service owner. Second: the on-call engineer claimed that red color on a Grafana dashboard was a reliable urgency signal and proposed escalating severity automatically whenever it appeared.
Challenge: Both observations sounded convincing in conversation but had opposite consequences. Voice handoff required space in QWEN.md context budget, but its value was unmeasured. The red color rule risked spawning a flood of false P1s, since red was used for visual emphasis without binding to blast radius. The team could not distinguish useful heuristic from operational folklore.
Solution: Both observations were normalized into the context → signal → effect format and run through the auction on 20 historical incidents from data/incidents.jsonl. The formula score = 0.5×mttr_gain + 0.3×early_signal + 0.2×coverage − 0.4×false_escalation gave voice handoff a score of 0.727 (high mttr_gain 0.7541, perfect early_signal 1.0, narrow coverage 0.25, zero false escalations). Red color scored −0.3081: weak mttr_gain and a notable false escalation rate. Voice handoff won and entered QWEN.md with a narrow applicability condition. Red color went to quarantine with the reason high_false_escalation and a return condition tied to a change in visualization policy.
Result: Voice handoff was recorded as a versioned few-shot prompt with a 14-day ttl and a source_ref to postmortem appointments-api-2026-02-11. Red color did not disappear from history: it can return to the auction with new data. The team got a reproducible selection process instead of arguments from authority. Trust in automated recommendations increased, since every heuristic had a measurable justification.
Lessons learned: Vivid story ("red = urgency") does not replace scorebook — intuition often fails at scale
Narrow coverage is not always a minus: voice handoff was rarely used but with high precision, which compensated for narrowness
Return condition in quarantine protects against permanent deletion: infrastructure changes, and yesterday's rejection may become tomorrow's signal
Winner must not become a requirement — it remains a shadow prompt with explicit review date
Related concepts: Shadow specs
Scorebook
Shadow spec auction
Anti-Goodhart
Quarantine
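The numbers in this case study can be checked by hand, and a per-component breakdown shows where the winner's score comes from. The sketch below is illustrative: the helper contributions is hypothetical, and the lower bound for red color relies on the assumption that no reward component is negative. The chapter gives only the final score for red color, so its individual components are not reconstructed here.

```python
WEIGHTS = {
    "mttr_gain": 0.5,
    "early_signal": 0.3,
    "coverage": 0.2,
    "false_escalation": 0.4,
}


def contributions(components):
    """Signed contribution of each component to the final score."""
    signed = {}
    for name, weight in WEIGHTS.items():
        sign = -1.0 if name == "false_escalation" else 1.0
        signed[name] = sign * weight * components[name]
    return signed


# Component values quoted in the case study for voice handoff.
voice = {"mttr_gain": 0.7541, "early_signal": 1.0, "coverage": 0.25, "false_escalation": 0.0}
parts = contributions(voice)
total = sum(parts.values())
for name, value in parts.items():
    print(f"{name:>16}: {value:+.3f}")
print(f"{'score':>16}: {total:+.3f}")

# Red color: score = rewards - 0.4 * false_escalation with rewards >= 0,
# so false_escalation must be at least -score / 0.4.
red_score = -0.3081
min_false_escalation = -red_score / WEIGHTS["false_escalation"]
print(f"red color false_escalation >= {min_false_escalation:.3f}")
```

Most of the winner's score comes from mttr_gain and early_signal, while coverage adds only 0.05, which is why narrow coverage did not prevent the win. For red color, a score this negative is possible only if false_escalation is at least about 0.77.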
Name: Smell of burning in data center: when rare signal justifies budget slot
Scenario: Operators of physical data centers noticed that in rare cases the smell of overheating or burning precedes the power monitoring alert by 30–120 seconds. The observation came from an engineer with 10 years of experience and was initially perceived as "magic" that defied formalization.
Challenge: A physical smell signal has extremely low coverage (only on-site incidents in specific data centers), zero applicability to cloud incidents, and a risk of false triggers from non-system sources (cleaning, neighboring equipment). As a universal rule it would become toxic noise. Discarding it, however, would cost the team a rare but valuable early signal with a high miss cost.
Solution: The observation was normalized as shadow.dc.burn_smell_power_risk with hard constraints: context "incident with physical access in T3+ data center", signal "on-site operator detects overheating smell before PDU alert", effect "possible power failure precursor by 30–120 seconds". The formula gave a high early_signal (1.0), low coverage (0.1), and zero false_escalation given mandatory operator confirmation. Auction decision: include it as a rare few-shot prompt with three constraints: hard context, an explicit risk note, and a requirement to confirm through the on-site operator channel.
Result: The signal occupied 1 of the 8 budget slots, but in two historical cases it gave a critical advantage for graceful shutdown. The rule does not apply to cloud incidents, so it adds no noise there. During the partial migration of infrastructure to the cloud, the candidate moved to review automatically through its context condition, with no manual removal required.
Lessons learned: Rare signal with high miss cost can win auction despite low coverage
Hard context protects against toxic noise: rule applies only where it makes sense
Requirement for confirmation through independent channel (on-site operator) reduces false_escalation without losing early_signal
Automatic review upon context change (cloud migration) protects heuristic from obsolescence
Related concepts: Shadow specs
Shadow spec auction
Few-shot prompt
Data drift
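The hard context and the automatic move to review from this case study can be expressed as two small predicates. This is a sketch under stated assumptions: the Incident type, its field names, the environment labels "onsite" and "cloud", and both function names are invented for illustration and do not describe how the chapter's scripts evaluate context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incident:
    environment: str          # hypothetical labels: "onsite" or "cloud"
    dc_tier: Optional[int]    # data center tier, None for cloud incidents
    operator_confirmed: bool  # confirmation through the on-site operator channel


def burn_smell_applies(incident: Incident) -> bool:
    """The prompt is shown only inside its hard context and after confirmation."""
    if incident.environment != "onsite":
        return False
    if incident.dc_tier is None or incident.dc_tier < 3:
        return False
    return incident.operator_confirmed


def status_after_context_check(admitted_envs, current_envs) -> str:
    """Send the candidate to review when the environments it was admitted for change."""
    return "keep" if set(admitted_envs) == set(current_envs) else "review"


onsite = Incident(environment="onsite", dc_tier=3, operator_confirmed=True)
cloud = Incident(environment="cloud", dc_tier=None, operator_confirmed=False)
print(burn_smell_applies(onsite), burn_smell_applies(cloud))
print(status_after_context_check({"onsite"}, {"onsite", "cloud"}))
```

Keeping the applicability check separate from the review trigger mirrors the case: the rule stays silent for cloud incidents, and a change in the environment mix reopens the decision instead of deleting the candidate.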
Study tips: Go through material sequentially: first minimal training scenario (steps 1–5), then key concepts theory, then practice with weight changes. Attempting to immediately calibrate weights on 50+ incidents will distract from understanding the mechanism
Keep two terminal windows side by side: one for running scripts, another for viewing generated JSON/YAML. Visual comparison of scorebook.example.json and your out/scorebook.json accelerates formula understanding
Use diff -u for all reference checks. If there is a discrepancy, do not ignore it: find which formula component produced the different result. This is the best way to understand weight sensitivity
Create your personal "zoo" of operational folklore: write down vivid phrases from on-calls ("ASAP = P1", "red = urgency", "quiet channel = everything broke") and practice normalizing them into context → signal → effect. This develops intuition for distinguishing heuristic from noise
For visual learners: redraw the chapter's Mermaid block diagram on paper. Physically tracing the path from interview to QWEN.md helps you remember the process architecture
When studying quarantine, ask yourself: "Why shouldn't a rejected candidate disappear?" The answer (so that decisions can be challenged and candidates can return when infrastructure changes) is what distinguishes a mature process from a "black box"
Practice explaining Anti-Goodhart with an everyday example: "If you optimize only pizza delivery speed, couriers will break traffic rules". This helps you remember why a reproducible scorebook is needed rather than a bare metric
For auditory learners: recite the scoring formula aloud with emphasis on the penalty: "zero five em-tee-tee-ar, zero three early signal, zero two coverage, minus zero four false escalation". The rhythm helps you remember the proportions
Additional resources: book2/examples/shadow-auction/readme.md: Runnable training auction with scripts score.py, decide.py, write_qwen_block.py and reference outputs
book2/examples/shadow-auction/outputs/: Reference files scorebook.example.json, auction.example.json, quarantine.example.json, qwen_block.example.md for verification
book2/appendix-d-threshold-calibration.md#d2-shadow-spec-selection-chapter-6: Full track of threshold, weight, and signal calibration for review on 50+ incidents
book2/part-06-constitution.md (volume 1): Foundation: distinction between wishes and requirements, where "not-quite" observations come from
book2/part-19-agent-memory-sqlite.md (volume 1): Foundation: separating agent memory from specification, predecessor of few-shot with ttl
GitHub Spec Kit, .specify/memory/constitution.md: Example of a protective layer against drift in a spec-driven development (SDD) workflow
Postmortem appointments-api-2026-02-11: Training source_ref for winner shadow.p0.voice_handoff (inside data/incidents.jsonl)
Summary: The main takeaway of this part: informal operational observations can and should be managed, but they cannot replace the formal specification. Shadow specs are a verifiable layer between operational folklore and requirements. Each candidate is normalized into the context → signal → effect format, evaluated on historical incidents with a reproducible formula, and entered into an auction for the limited context budget. It then either becomes a versioned few-shot prompt with a ttl in QWEN.md or goes to quarantine with an explicit reason. The scorebook protects against Goodhart's trap and allows the decision to be challenged. The winner does not become a requirement: it remains a shadow prompt with a review date. This is the discipline of measurable selection instead of trust in vivid stories.