Topic: Applied Part 9. Model Routing and Token Budget
Difficulty level: Medium
Estimated study time: 4-6 hours (theory + practice), additional 2 hours for full track with threshold calibration
Prerequisites: Familiarity with the SDD cycle (signal collection, detect, diagnostics from the first volume)
Part 15 of the first volume: agent replaceability
Part 14 of the first volume: project skills and Qwen Code hooks
Basic proficiency in Python and command line
Understanding of MTTR, SLA, and incident escalation concepts
Learning objectives: Model and verify a cheap tier model failure scenario, ensuring only critical tasks pass to the expensive tier while maintaining token_health_min ≥ 0.5
Build token distribution across incident management phases (triage, classification, diagnosis, plan, remediation, postmortem) with a 90/10 proportion between local-coder and frontier-reviewer
Formulate and apply an anti-Goodhart rule for validating budget optimization, linking MTTR with four guard metrics
Define red button emergency mode activation conditions and justify when manual mode is safer than continuing automation
Create a budget-note.md artifact for the capstone project documenting failure risk, effect, guard threshold, and decision
Overview: This chapter transforms the daily token budget from a static limit into a manageable SDD pipeline routing table. Tier budgets distribute tokens between model levels by work phase: the cheap local-coder model handles routine (triage, classification, draft diagnosis), while the expensive frontier-reviewer activates only for critical reviews, disputed decisions, and post-analysis. The key skill is failure modeling: when the cheap tier fails, the system must not automatically redirect the entire queue to the expensive tier, or the quota will be exhausted in minutes while real P0/P1 incidents remain without resources. The chapter includes runnable examples (compile.py, simulate.py, inspect.py), failure scenarios for 15 and 45 minutes, and integration with anti-Goodhart validation to prevent metric distortion.
Key concepts: Tier: A model level in the processing hierarchy. local-coder is the base tier for mass routine work, frontier-reviewer is the upper tier for disputed and high-risk cases. Roles, not model names: different projects may have different implementations.
Token health: A composite budget health indicator tracking the ratio of spent and remaining tokens to plan. The minimum value token_health_min is used as a guard threshold to block automatic switching.
Failover to frontier: A load switching plan when a tier fails. Ranked switching: only tasks with high severity and age go to frontier-reviewer, the rest go to degraded queue or manual channel.
Degraded queue: A degradation queue — tasks that cannot be processed in normal mode when the cheap tier fails, but do not yet require immediate escalation to the expensive tier.
Manual queue after 120s: A manual channel opening after a 120-second timeout for tasks not processed automatically. Not a rollback to chaos: inherits the same proof protocol and PostToolUse verification.
Control reserve: An insurance budget layer activated only when risk, queue, or uncertainty rises. Not a "remainder for everything."
Red button (emergency mode): A protected management mode activated under formal conditions: two windows of token_health risk growth, queue above limit, SLA breach for critical severity, or local-coder failure. Limits new queue, prohibits mass automatic remediation, preserves frontier-reviewer for P0/P1.
Anti-Goodhart validation: A rule prohibiting a release from being considered successful if one metric improved at the expense of degrading others. MTTR is validated together with escalation_share, silent_p0, unresolved_manual_ratio, postmortem_gap, and token_health_min.
Budget-keeper (budget guardian): A token budget control service recalculating spent[p], queue[p], quota[p], incident age, blast radius, and confidence-gap every minute to make routing decisions.
Budget-drill: A daily training run: yesterday's alert stream is replayed through the current budget_network.yaml with artificial local-coder shutdown for 15/30/45 minutes to calibrate system sensitivity.
Practice exercises: Name: Compiling the budget plan and checking proportions
Problem: In the book2/examples/budget-keeper directory, compile a budget plan from specs/budget_network.yaml. Verify that daily_budget_tokens equals 10,000,000, the local tier quota sum is 9,000,000, and the frontier tier sum is 1,000,000. Compare the result with the reference outputs/budget_plan.example.json.
Solution: 1. cd book2/examples/budget-keeper
- python3 scripts/compile.py --budget-spec specs/budget_network.yaml --out out/budget_plan.json
- Check the daily_budget_tokens field: 10000000
- Calculate the local-coder sum: 3000000 + 2000000 + 1500000 + 800000 + 700000 + 300000 + 700000 = 9000000
- Calculate the frontier-reviewer sum: 120000 + 140000 + 180000 + 120000 + 200000 + 240000 + 0 = 1000000
- diff out/budget_plan.json outputs/budget_plan.example.json — there should be no discrepancies (or only in comments)
Complexity: beginner
Name: Simulating a 45-minute failure and comprehensive inspection
Problem: Simulate a local-coder failure for 45 minutes with 20 incidents in the queue and a manual timeout of 120 seconds. Check four conditions simultaneously: failover_to_frontier == 5, degraded_queue == 15, manual_queue_after_120s == 15, token_health_min >= 0.5. Explain why metrics cannot be checked individually.
Solution: 1. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_45m.json --out out/fail_result.json
- python3 scripts/inspect.py --result out/fail_result.json --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5"
- Expected return code: 0 (all conditions met)
- Why not individually: checking only failover_to_frontier==5 does not guarantee that the remaining 15 tasks did not go to frontier through another mechanism; checking only token_health does not show queue distribution; the composite condition with && guarantees a complete picture — failure of any metric breaks the run and requires investigation
Complexity: intermediate
Name: Comparative analysis of 15-minute and 45-minute failures
Problem: Simulate a short local-coder failure for 15 minutes. Check token_health_min >= 0.7. Explain why a short failure burns the frontier tier less aggressively, and how this affects guard threshold selection for production.
Solution: 1. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_15m.json --out out/fail_15m_result.json
- python3 scripts/inspect.py --result out/fail_15m_result.json --query "token_health_min>=0.7"
- Return code: 0
- Explanation: with a 15-minute failure, fewer tasks accumulate in the queue, fewer escalations to frontier-reviewer are needed, less reserve budget is consumed. With a 45-minute failure, the queue grows, pressure on the frontier tier increases, token_health drops more significantly.
- Production conclusion: a guard threshold of 0.60 (below the observed floor of 0.5 for 45-minute failure, but with margin) blocks automatic switching for predictable outages; a 45-minute failure breaches the guard and requires manual decision
Complexity: intermediate
Name: Formulating risk for the capstone project
Problem: Based on simulation results, formulate an entry in capstone/budget-note.md for your main case (e.g., high_memory_usage). Record the risk, effect, simulated_floor, alert_threshold, and decision. Explain the difference between simulated_floor and alert_threshold.
Solution: Minimal fragment:
risk: "local-coder unavailable 45m"
effect: "5 tasks go to frontier-reviewer, 15 remain degraded/manual"
simulated_floor: "token_health_min == 0.5 (drop at 45m)"
alert_threshold: "token_health_min < 0.60 (guard from anti-Goodhart table)"
decision: "do not transfer entire queue to expensive tier"
Difference: simulated_floor (0.5) is the observed simulation floor, the actual minimum in the test scenario. alert_threshold (0.60) is the line below which the guard blocks automatic switching in production. Between 0.5 and 0.60 is a warning zone where the system still works but requires operator attention.
Complexity: intermediate
Name: Calibrating a compressed budget (5M)
Problem: Use specs/budget_network_5m.yaml for a calibration variant with 5M tokens. Verify the 90/10 proportion is preserved. Simulate the same 45-minute failure and analyze how halving the total budget affects system behavior and guard threshold acceptability.
Solution: 1. python3 scripts/compile.py --budget-spec specs/budget_network_5m.yaml --out out/budget_plan_5m.json
- Verify: daily_budget_tokens == 5000000, local sum == 4500000, frontier sum == 500000
- python3 scripts/simulate.py --plan out/budget_plan_5m.json --scenario scenarios/fail_local_45m.json --out out/fail_result_5m.json
- python3 scripts/inspect.py --result out/fail_result_5m.json --query "token_health_min>=0.5"
- Analysis: when the budget is halved, absolute quotas decrease but relative pressure on the system increases — the same flow of 20 incidents consumes a larger share of the reserve. token_health_min may drop below 0.5 or the guard threshold of 0.60 may become unattainable. Conclusion: when scaling the budget proportionally to flow, maintain margin in control_reserve; for disproportionate flow, reconsider the guard threshold or increase the frontier tier share to 15-20%
Complexity: advanced
Name: Verifying the anti-Goodhart gateway
Problem: Formulate a YAML fragment for validation.md for the budget gateway. Condition: if MTTR P95 < 5 minutes AND escalation share < 8%, then fail the check if silent_p0 > 2%, unresolved_manual_ratio > 5%, postmortem_gap > 10%, or token_health_min < 0.60. Explain why exactly this paired check.
Solution: Fragment for validation.md:
checks:
- id: anti_goodhart_budget
if:
all:
- mttr_p95 < "5m"
- escalation_ratio < 0.08
then:
fail_if:
- silent_p0 > 0.02
- unresolved_manual_ratio > 0.05
- postmortem_gap > 0.10
- token_health_min < 0.60
Why the paired check: without it, the system could optimize MTTR by suppressing escalations (silent P0), pushing complex tasks to manual channel without postmortem, or exhausting the frontier tier budget. Improving one metric while degrading others is the classic Goodhart distortion. The gateway blocks such "savings."
Complexity: advanced
Case studies: Name: local-coder failure in production: autoscale_200pct incident
Scenario: Production system appointments-api, morning load peak. At 11:00 the local endpoint of the cheap local-coder model becomes unavailable for 45 minutes (network failure in the cluster). 20 incidents related to autoscaling to 200% load fall into the queue. Manual timeout is set to 120 seconds. Daily budget: 10M tokens with 9M/1M distribution between local-coder and frontier-reviewer.
Challenge: Classic trap: when the cheap tier fails, automatically redirect all traffic to the expensive tier. This will exhaust the 1M frontier-reviewer quota in minutes, after which real P0/P1 incidents will be left without resources. Alternative trap: leave all tasks in the degradation queue, increasing MTTR and potentially violating SLA for critical cases. Third trap: not tracking token_health, missing the drop to a dangerous level.
Solution: Applied ranked failover policy: only 5 tasks with maximum blast radius (radius of consequences) and age > 90 seconds go to frontier-reviewer. 15 tasks remain in degraded queue. After 120 seconds, the manual_queue_after_120s channel opens for tasks not processed automatically. The budget keeper recalculates token_health, spent[p], queue[p], quota[p] every minute. At token_health_min < 0.60 (guard threshold), automatic switching is blocked, requiring manual decision. In this scenario, token_health_min reaches 0.5 — breaches the guard, activating red_button review.
Result: All 4 inspection conditions are met simultaneously: failover_to_frontier == 5, degraded_queue == 15, manual_queue_after_120s == 15, token_health_min >= 0.5. The expensive tier is preserved for real P0/P1. MTTR for critical tasks is unaffected. The manual channel inherited the proof protocol, allowing audit of all decisions after local-coder recovery. Staged rollback: first 30% of local-coder quota for triage/classification, then diagnosis/plan after three stable windows, full return only after PostToolUse audit.
Lessons learned: Failure metrics cannot be checked individually — only a composite condition with && guarantees integrity
Between simulated_floor (0.5) and alert_threshold (0.60), margin is needed for warning, not just emergency reaction
Manual mode is not a rollback to chaos, but controlled degradation preserving the proof chain
Staged rollback after stabilization prevents a second cascade of errors from premature recovery
Related concepts: failover_to_frontier
token_health_min
degraded_queue
manual_queue_after_120s
red_button
anti-Goodhart-validation
budget-drill
Name: Daily budget-drill in the payment gateway team
Scenario: The payment gateway incident processing team implemented tiered routing with a daily budget of 50M tokens (85/15 proportion due to high task complexity). Daily at 9:00, a budget-drill runs: yesterday's stream of ~800 alerts is replayed through budget_network.yaml with artificial local-coder shutdown for 15, 30, and 45 minutes.
Challenge: As the flow grew from 200 to 800 incidents, the 90/10 proportion inherited from the educational example proved insufficient: the frontier tier was overloaded even with a 15-minute failure. The team could not distinguish "normal pressure" from "requiring policy review." Silent_p0 began to rise — complex incidents were closed without escalation due to frontier resource shortage.
Solution: Drill run data showed: at 15-minute failure, token_health_min dropped to 0.55 (below the 0.60 guard threshold), at 30-minute — to 0.35. The team revised the proportion to 80/20, increased control_reserve from 700K to 2M, introduced an intermediate mid-coder tier for diagnosis/plan (medium-cost model). The guard threshold token_health_min was adjusted to 0.65 based on new sensitivity. The anti-Goodhart gateway was supplemented with a mid_tier_saturation metric.
Result: After calibration, a 15-minute failure maintains token_health_min >= 0.70, a 30-minute failure >= 0.50. Silent_p0 decreased from 4.2% to 1.1%. MTTR P95 stabilized at 3.2 minutes without rising unresolved_manual_ratio. The team established a rule: review proportions when flow changes by more than 25% or when drill token_health_min < 0.60 on two consecutive runs.
Lessons learned: The educational 90/10 proportion is a starting point, not dogma; scaling requires review
Budget-drill reveals problems before production incidents, but only if the team reads the results, not just CI
An intermediate tier can be more effective than simply increasing the frontier tier share
Drill metrics must influence operational policy, not just YAML configuration
Related concepts: budget-drill
tier-budgeting
control_reserve
anti-Goodhart-validation
token_health_trend_5m
Study tips: Go through the material sequentially: first compile (understanding structure), then simulate 45m (failure), then simulate 15m (comparison), then inspect with composite condition. Skipping a step breaks understanding of sensitivity
Keep a workbook with two columns: left — command and script output, right — interpretation ("what this means for my project"). This turns runnable examples into project decisions
Use diff between out/budget_plan.json and outputs/budget_plan.example.json as a diagnostic tool, not just a correctness check. Any deviation is a reason to investigate compile.py logic
For visual learners: draw a flow diagram of 6 phases with two tiers and three exits (frontier, degraded, manual). Mark where exactly SLA thresholds are applied and where the guard token_health_min is located
For auditory learners: speak aloud the composite inspect condition with four metrics, explaining each &&. If you cannot explain why exactly 5 tasks go to frontier — return to the "ranked switching" section
For kinesthetic learners: modify the fail_local_45m.json scenario — change queue from 20 to 40, increase duration to 60 minutes, decrease manual_timeout_sec to 60. Predict the result, then verify with simulate + inspect. An incorrect prediction is a valuable signal to refine understanding
Record in capstone/budget-note.md immediately after each successful run, while context is fresh. Delaying recording leads to loss of nuances between simulated_floor and alert_threshold
Conduct a "paired check" with a colleague: one explains why the entire queue cannot be sent to frontier, the other — invents a counterexample where it seems reasonable. The debate clarifies the boundaries of policy applicability
Additional resources: Examples/budget-keeper/readme.md: Local runnable example with complete scripts compile.py, simulate.py, inspect.py and failure scenarios
Book/part-04-environment.md: Choosing one model in the educational AgentClinic — contrast with production tiered routing
Book/part-14-build-your-own-workflow.md: Project skills and hooks — where routing fits naturally
Book/part-15-agent-replaceability.md: Agent replaceability — prerequisite for switching between tiers
Appendix-d-threshold-calibration.md#d3-tier-budgets-chapter-9: Full calibration track: "Low / Default / High" table for budget, proportions, and manual_timeout_sec
Examples/goodhart-validator/scripts/run validation.py: Runnable analog of anti-Goodhart checks from chapter 10
Book2/examples/budget-keeper/specs/budget network 5m.yaml: Calibration variant for the compressed budget exercise
Summary: Model routing and token budget transforms the static limit into a manageable loop: SDD phases → model tiering → SLA thresholds → ranked switching on failure → anti-Goodhart validation. Cheap local-coder handles routine, expensive frontier-reviewer protects disputed and critical cases. When the cheap tier fails, the key rule: do not automatically send the entire queue to the expensive tier. Runnable examples (compile, simulate, inspect) allow verifying this rule on 15 and 45 minute scenarios, recording results in budget-note.md and linking to the capstone project. Successful completion: four inspect conditions met simultaneously, token_health_min above threshold, guard ready for production, anti-Goodhart gateway blocking metric distortion.