Study guide: Applied Part 6. Selection of Shadow Specifications

Lesson 3 of 5 in module «Applied Part 6. Selection of Shadow Specifications»


Topic: Applied Part 6. Selection of Shadow Specifications

Difficulty level: Medium

Estimated study time: 4-6 hours (including work with training scripts and calibration)

Prerequisites: Understanding of incident management principles (triage, post-mortem, MTTR)

Basic command line (CLI) and Python skills

General understanding of LLM agents and prompt engineering concepts (few-shot examples)

Familiarity with YAML and JSON data formats

Learning objectives: Translate informal observations and operational folklore into the structured format 'context → feature → observed effect'.

Calculate the predictive value (score) of heuristics based on historical incidents using a given formula.

Conduct an automated 'auction' to select winners under a limited token budget (few-shot slots).

Generate versioned blocks for QWEN.md and quarantine entries with rejection reasons.

Analyze the impact of weighting coefficients (in particular, the penalty for false escalation) on auction results.

Overview: This section covers how to manage the informal observations that accumulate while supporting software systems. Engineers often rely on useful heuristics (for example, 'respond more briefly at night' or 'request history on a repeated symptom') that are difficult to formalize as strict system requirements. The section introduces shadow specifications (shadow specs): verifiable heuristics that help during the triage phase. You will learn how to turn these observations into measurable slots, score them against historical incidents, run an auction over the scores, and make an informed decision: add the heuristic to QWEN.md as a versioned few-shot example, or send it to quarantine because of a high risk of false positives.

Key concepts: Shadow specification (shadow spec): A verifiable heuristic from operational practice. It is not a mandatory system requirement, but it helps the agent or engineer make decisions faster during the triage phase.

Few-shot example: A short example in the prompt that shows the desired response format or behavior logic. In the context of shadow specifications, this is a useful heuristic formatted for inclusion in QWEN.md.

Scorebook: A ledger of shadow specification economics. It stores source data, the scoring formula, thresholds, budget, candidate versions, and the decision protocol. It protects against arguments from authority and against the Goodhart's law trap.

Auction: The process of ranked selection of heuristics under a strictly limited context budget (in tokens or slots).

Quarantine: The status given to a heuristic that did not pass the quality threshold. The candidate is not deleted without a trace, but is retained with a reason for rejection (for example, high_false_escalation) and a review date.

Scoring formula: A mathematical expression for calculating the value of a heuristic. The basic training variant: score = 0.5 * mttr_gain + 0.3 * early_signal + 0.2 * coverage - 0.4 * false_escalation.

Data drift: Timelines and identifiers that have fallen out of sync across incident sources. The data must be deduplicated and normalized before evaluation to avoid false early signals.
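The deduplication and normalization step mentioned under data drift can be sketched as follows. This is a minimal illustration, not the course's own preprocessing: the record fields `id` and `opened_at` are hypothetical, every timestamp is assumed to carry an explicit UTC offset, and keeping the earliest timestamp per incident is just one possible policy.

```python
from datetime import datetime, timezone


def normalize(record):
    """Return (canonical_id, opened_at_utc) for one raw incident record.

    Assumption: 'id' and 'opened_at' are hypothetical field names, and
    'opened_at' is an ISO 8601 string with an explicit offset.
    """
    incident_id = record["id"].strip().upper()
    opened = datetime.fromisoformat(record["opened_at"]).astimezone(timezone.utc)
    return incident_id, opened


def deduplicate(records):
    """Collapse duplicates, keeping the earliest UTC timestamp per incident."""
    earliest = {}
    for record in records:
        incident_id, opened = normalize(record)
        if incident_id not in earliest or opened < earliest[incident_id]:
            earliest[incident_id] = opened
    return earliest


# Illustrative records only: the same incident reported by two sources.
raw = [
    {"id": "inc-101", "opened_at": "2024-05-01T12:00:00+03:00"},
    {"id": "INC-101 ", "opened_at": "2024-05-01T09:05:00+00:00"},
    {"id": "INC-102", "opened_at": "2024-05-01T10:00:00+00:00"},
]
clean = deduplicate(raw)
for incident_id, opened in sorted(clean.items()):
    print(incident_id, opened.isoformat())
```

Without the conversion to UTC, the first record would look three hours later than it really is, which is the kind of mismatch that distorts an early_signal measurement.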

Practice exercises: Name: Running an evaluation of shadow candidates

Problem: You need to calculate the score for a list of shadow specifications based on simulated historical incidents. Use the basic formula weights. The goal is to obtain a correct scorebook (scorebook.json) for subsequent decision making.

Solution:
1. Open a terminal and go to the example directory: cd book2/examples/shadow-auction
2. Run the scoring script: python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.4 --out out/scorebook.json
3. Compare the obtained result with the reference: diff -u outputs/scorebook.example.json out/scorebook.json

Complexity: beginner
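The calculation behind this exercise is the basic formula from the key concepts. The sketch below is not the contents of score.py: the `Weights` and `Metrics` names are hypothetical, and it assumes all four metrics have already been normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Coefficients of the basic training formula from the lesson."""
    mttr_gain: float = 0.5
    early_signal: float = 0.3
    coverage: float = 0.2
    false_escalation: float = 0.4  # penalty weight, subtracted from the score


@dataclass(frozen=True)
class Metrics:
    """Hypothetical per-candidate metrics, assumed normalized to [0, 1]."""
    mttr_gain: float
    early_signal: float
    coverage: float
    false_escalation: float


def score(m: Metrics, w: Weights = Weights()) -> float:
    """Weighted sum of the three benefits minus the false-escalation penalty."""
    return (
        w.mttr_gain * m.mttr_gain
        + w.early_signal * m.early_signal
        + w.coverage * m.coverage
        - w.false_escalation * m.false_escalation
    )


# Illustrative values only; they are not taken from the course data.
example = Metrics(mttr_gain=0.6, early_signal=0.8, coverage=0.5, false_escalation=0.25)
print(round(score(example), 4))
```

Under the normalization assumption, the score is bounded between -0.4 (no benefit, every escalation false) and 1.0 (full benefit, no false escalations), which helps when sanity-checking a scorebook.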

Name: Auction: selecting winners and filtering out noise

Problem: Based on the calculated scorebook.json, conduct an auction. Set the budget to 2000 tokens, the acceptance threshold to 0.70, and the rejection threshold to 0.40. You need to split the candidates into winners (keep) and those sent to quarantine (quarantine).

Solution:
1. While in the shadow-auction directory, run: python3 scripts/decide.py --scorebook out/scorebook.json --budget-tokens 2000 --keep-threshold 0.70 --reject-threshold 0.40 --out-auction out/auction.json --out-quarantine out/quarantine.json
2. Check the contents of out/quarantine.json: make sure that at least one entry is present (for example, shadow.alert.red_color_urgency) with an explicit reason for rejection.

Complexity: intermediate
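The selection logic of such an auction can be sketched as a greedy pass over candidates sorted by score. This is an illustration under stated assumptions, not the behavior of decide.py: the token costs, the third candidate, the `pending` bucket for scores between the two thresholds, and the fallback reason `low_score` are all hypothetical. Only the two named candidates and their scores come from the case studies in this lesson.

```python
def run_auction(candidates, budget_tokens, keep_threshold, reject_threshold):
    """Greedy auction: best score first, while the token budget allows.

    Returns (winners, quarantine, pending). Treating the band between the
    two thresholds as 'pending' is an assumption made for this sketch.
    """
    winners, quarantine, pending = [], [], []
    remaining = budget_tokens
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if cand["score"] < reject_threshold:
            quarantine.append({
                "id": cand["id"],
                "score": cand["score"],
                "reason": cand.get("reason", "low_score"),
            })
        elif cand["score"] >= keep_threshold and cand["tokens"] <= remaining:
            winners.append(cand["id"])
            remaining -= cand["tokens"]
        else:
            pending.append(cand["id"])
    return winners, quarantine, pending


# Scores for the first two candidates are from the case studies;
# token costs and the third candidate are invented for illustration.
candidates = [
    {"id": "shadow.alert.red_color_urgency", "score": -0.3081, "tokens": 400,
     "reason": "high_false_escalation"},
    {"id": "shadow.p0.voice_handoff", "score": 0.727, "tokens": 600},
    {"id": "shadow.example.mid_band", "score": 0.55, "tokens": 500},
]
winners, quarantine, pending = run_auction(
    candidates, budget_tokens=2000, keep_threshold=0.70, reject_threshold=0.40
)
print("keep:", winners)
print("quarantine:", quarantine)
print("pending:", pending)
```

A greedy pass is easy to audit, which suits a scorebook-driven process, but it does not guarantee the best total score for a given budget; a knapsack-style search would be the alternative if that mattered.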

Name: Calibrating the penalty for false escalation

Problem: In your team, the cost of a false P1 escalation turned out to be significantly higher than in the basic configuration. Double the weight of the penalty (false_escalation) and see how the composition of winners changes.

Solution:
1. Run score.py with modified weights: python3 scripts/score.py --candidates candidates/candidates.yaml --incidents data/incidents.jsonl --weights 0.5,0.3,0.2,0.8 --out out/scorebook_strict.json
2. Run decide.py with the new scorebook (out/scorebook_strict.json).
3. Record the observation, for example: 'With a doubled penalty, candidate X fell into quarantine due to a high share of false escalations'.

Complexity: advanced
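The effect of doubling the penalty can be checked by hand before running the scripts. The sketch below assumes that the four values passed to --weights follow the order of the terms in the formula (mttr_gain, early_signal, coverage, false_escalation); that ordering is inferred from the lesson, not confirmed. The metrics of `candidate_x` are invented for illustration.

```python
def parse_weights(text):
    """Parse 'a,b,c,d' into four floats (order assumed to follow the formula)."""
    values = tuple(float(part) for part in text.split(","))
    if len(values) != 4:
        raise ValueError("expected four comma-separated weights")
    return values


def score(metrics, weights):
    """Basic formula: three weighted benefits minus the weighted penalty."""
    mttr_gain, early_signal, coverage, false_escalation = metrics
    w_mttr, w_early, w_cov, w_penalty = weights
    return (
        w_mttr * mttr_gain
        + w_early * early_signal
        + w_cov * coverage
        - w_penalty * false_escalation
    )


KEEP_THRESHOLD = 0.70  # acceptance threshold from the auction exercise

# Hypothetical candidate: strong benefits, but 30% false escalations.
candidate_x = (0.9, 0.9, 0.7, 0.3)

base = score(candidate_x, parse_weights("0.5,0.3,0.2,0.4"))
strict = score(candidate_x, parse_weights("0.5,0.3,0.2,0.8"))

print(f"base={base:.2f} keep={base >= KEEP_THRESHOLD}")
print(f"strict={strict:.2f} keep={strict >= KEEP_THRESHOLD}")
```

For this invented candidate the score falls from about 0.74 to about 0.62, so it no longer clears the 0.70 acceptance threshold. The drop equals the added penalty weight times the false-escalation share: 0.4 * 0.3 = 0.12.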

Case studies: Name: Fighting false urgency: the red color of the dashboard

Scenario: In an automatic triage project for the appointments-api service, the engineering team proposed a heuristic: if the indicator on the dashboard lights up red, the agent should automatically raise the incident severity to P1, because it 'visually attracts attention'.

Challenge: The rule seemed intuitively useful for quick response. On review, however, it turned out that red was often used for visual emphasis in non-critical situations and did not reflect the actual blast radius of the incident.

Solution: The heuristic was formalized as the shadow specification shadow.alert.red_color_urgency and run through the auction. Scoring showed a weak mttr_gain and a high share of false_escalation, and the final score was negative (-0.3081). The candidate was sent to quarantine with the reason high_false_escalation.

Result: The system avoided toxic noise. The specification did not get into QWEN.md, which prevented overloading on-call engineers with false P1 escalations and preserved trust in the agent's automatic recommendations.

Lessons learned: Intuitively convincing heuristics ('operational folklore') can cause serious harm to the triage process.

It is important to evaluate not only the speedup of resolution (MTTR), but also the cost of an error (penalty for false escalation).

Quarantine should contain a reason for rejection and conditions for return (for example, after UI policy changes).
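A quarantine entry that carries a reason, a return condition, and a review date might look like the sketch below. The field names, the 90-day review interval, and the reference date are hypothetical; the actual layout of out/quarantine.json is not shown in this lesson. The identifier, score, and reason come from the case study.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta


@dataclass
class QuarantineEntry:
    """Hypothetical shape of one quarantine record."""
    spec_id: str
    score: float
    reason: str
    return_condition: str
    review_on: str  # ISO 8601 date of the next review


def make_entry(spec_id, score, reason, return_condition, today, review_after_days=90):
    """Build an entry whose review date is a fixed interval after 'today'."""
    review_on = today + timedelta(days=review_after_days)
    return QuarantineEntry(spec_id, score, reason, return_condition,
                           review_on.isoformat())


entry = make_entry(
    spec_id="shadow.alert.red_color_urgency",
    score=-0.3081,
    reason="high_false_escalation",
    return_condition="after UI policy changes",
    today=date(2025, 1, 1),  # fixed date so the example is reproducible
)
print(json.dumps(asdict(entry), indent=2))
```

Storing the return condition next to the reason keeps the rejection reviewable: a later reader can see both why the candidate lost and what would justify scoring it again.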

Related concepts: Scorebook

False escalations

Quarantine

Name: Voice handoff during cascading failures

Scenario: While analyzing post-mortems, the team noticed a pattern: during confirmed P0 incidents in critical services, on-call engineers sometimes skipped the standard text template and immediately initiated a voice handoff to transfer the issue to the second line.

Challenge: This informal observation needed to be formalized for an LLM agent, but it could not be made a universal rule (P2 incidents still required an asynchronous text trail). It was necessary to prove its value under a limited budget of few-shot slots.

Solution: The observation was converted into the 'context → feature → effect' format under the name shadow.p0.voice_handoff. On a sample of 20 incidents it showed a score of 0.727 (high early signal, zero false escalations, although with narrow coverage).

Result: The specification won the auction and was implemented in QWEN.md as a versioned few-shot example with a strict context (only for P0). The agent began recommending the voice channel only where the transfer delay is more expensive than the loss of written context.

Lessons learned: Rare but accurate signals have high value if their context is strictly limited.

The auction winner should not replace a formal specification; it is implemented as a few-shot example with a limited TTL (time to live).

The context → feature → effect format turns 'magic knowledge' into a reproducible slot.
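The 'context → feature → effect' slot with a strict context and a TTL can be sketched as a small record type. Everything about the shape is an assumption made for illustration: the field names, the version string, the 90-day TTL, and the rendered text layout are hypothetical, and the real block format in QWEN.md may differ. The wording of the three parts paraphrases the voice handoff case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowSpec:
    """Hypothetical record for one shadow specification."""
    spec_id: str
    version: str
    context: str
    feature: str
    effect: str
    severities: frozenset  # severities in which the heuristic may fire
    ttl_days: int

    def applies_to(self, severity: str) -> bool:
        """True only inside the strictly limited context."""
        return severity in self.severities

    def render(self) -> str:
        """Render a plain-text block (layout invented for this sketch)."""
        return "\n".join([
            f"[{self.spec_id} {self.version}, ttl {self.ttl_days}d]",
            f"context: {self.context}",
            f"feature: {self.feature}",
            f"effect: {self.effect}",
        ])


voice_handoff = ShadowSpec(
    spec_id="shadow.p0.voice_handoff",
    version="v1",
    context="confirmed P0 incident in a critical service",
    feature="on-call engineer skips the text template and starts a voice handoff",
    effect="faster transfer of the issue to the second line",
    severities=frozenset({"P0"}),
    ttl_days=90,
)

print(voice_handoff.render())
print(voice_handoff.applies_to("P0"), voice_handoff.applies_to("P2"))
```

Keeping the allowed severities as data, rather than burying them in the wording, makes the strict context checkable: a P2 incident fails the `applies_to` test and keeps its asynchronous text trail.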

Related concepts: Few-shot example

Auction

Observation normalization

Study tips: Do not skip the practical part: the theory of the auction is best absorbed when running the scripts score.py and decide.py in the examples/shadow-auction/ directory.

Pay attention to the formula score = 0.5 * mttr_gain + 0.3 * early_signal + 0.2 * coverage - 0.4 * false_escalation. Try mentally changing the priorities: which coefficient would you increase in a medical project? In e-commerce?

The core discipline of this section: do not trust vivid stories without a scorebook. Always require that calculations be reproducible.

Remember the difference: a formal requirement (requirements.md) expands the system's contract, while a shadow specification (shadow spec in QWEN.md) is only a temporary extension of the agent's memory with a specified review time (TTL).

Additional resources: Training scripts and data: book2/examples/shadow-auction/ (contains candidates.yaml, incidents.jsonl and scripts for calculation)

Previous course topics: Part 6 of the first volume (separating wishes from requirements) and Part 19 of the first volume (agent memory)

Appendix D (calibration): appendix-d-threshold-calibration.md, section D.2, for detailed tuning of weights on the production track

GitHub Spec Kit: https://github.com/github/spec-kit (using .specify/memory/constitution.md as a protective layer against drift)

Summary: The shadow specification auction is a mechanism for turning operational folklore and engineers' informal observations into a manageable, measurable layer of heuristics. You translate an observation into the 'context → feature → effect' format and score it on incident history using a transparent formula; the candidates then compete for a limited context budget. Winning specifications enter QWEN.md as versioned few-shot examples with a time to live and a strict context. The losers do not disappear: they go into quarantine with a clearly recorded reason (for example, the risk of false escalations), which makes the decision process reproducible and open to challenge.