Reading: Applied Part 9. Model Routing and Token Budget



Status: Recommendation. Splitting work between a cheap model for routine tasks and an expensive model for critical reviews is a stable practice. Specific thresholds, failover switching formulas, and a budget keeper run as a separate service are still frontier territory: Qwen Code does not manage the budget itself, so the implementation depends on your infrastructure.

For the educational walkthrough, it is sufficient to replay the local-coder failure in examples/budget-keeper/ and verify that the expensive tier does not receive the entire queue. A separate budget keeper and integration with providers belong to the full production track.

In the educational AgentClinic, we selected one model in Part 4 of Volume 1 and kept the process independent of it (Part 15). In production, one model is not enough. The expensive one should not spontaneously consume the entire incident queue. The cheap one should not silently degrade on contentious cases. Here we add a dimension that was not present in the educational project: managing a mix of models by pipeline phases. Routing fits conveniently into a user command or hook — techniques from Part 14. Your Own Process via Qwen Code Skills.

Before Reading

  • Foundation from the first volume: Part 15 requires agent replaceability, Part 14 shows project skills and hooks.
  • Local educational case: autoscale_200pct, because failure of the cheap tier provides an observable budget simulation.
  • Trace for capstone/: one risk for high_memory_usage: what happens when local-coder fails, how many tasks are allowed into frontier-reviewer, which token_health blocks switching.
  • Key terms for the first pass: tier and token_health. Budget keeper, failover_to_frontier, and manual_queue_after_120s are reference material.
  • What to defer: integration with providers, a separate budget keeper service, and regular drills.

Goal

The goal of the chapter is to turn the daily token budget (in the example — 10M) from a static limit into a manageable SDD pipeline routing table. This is tier-budgeting: distributing tokens between model levels by work phases. The cheap model (local-coder) takes routine tasks. The expensive one (frontier-reviewer) engages only for critical reviews and contentious decisions.

The figure of 10M is chosen to cover a flow of about 200 incidents per day at an average cost of about 50K tokens per incident across all phases (200 × 50K = 10M). For larger flows, scale the budget proportionally; for smaller ones, reduce it while preserving the proportions between phases. The 9M / 1M split between tiers reflects an observation: in calm mode, about 10% of the total budget goes to contentious reviews. If your project poses complex tasks more often, increase the upper tier share to 15–20%.
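
The sizing arithmetic above can be reproduced in a few lines. This is a minimal sketch, not part of the course tooling: the function name `size_budget` and its parameters are illustrative, and it assumes that the 50K figure is the average token cost of one incident, which is the reading under which 200 incidents add up to 10M.

```python
def size_budget(incidents_per_day: int,
                tokens_per_incident: int = 50_000,
                frontier_share: float = 0.10) -> dict:
    """Return a daily budget and its local/frontier split (illustrative only)."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    total = incidents_per_day * tokens_per_incident
    frontier = round(total * frontier_share)
    return {
        "daily_budget_tokens": total,
        "local-coder": total - frontier,
        "frontier-reviewer": frontier,
    }


baseline = size_budget(200)                          # the 10M plan, 90/10
halved = size_budget(100)                            # the 5M calibration variant
heavier = size_budget(200, frontier_share=0.15)      # more contentious reviews
print(baseline)
print(halved)
print(heavier)
```

Scaling the flow changes only the total; the share argument is the one knob that moves tokens between tiers.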

After this section you will be able to build token distribution by incident management phases, set SLA thresholds for each tier, verify system behavior when the cheap model fails, and prove that savings do not destroy MTTR (mean time to recovery), escalation quality, and post-analytics resilience. local-coder and frontier-reviewer are roles in your infrastructure, not model names: in one project these may be different models from the same provider, in another — local and cloud models.

Minimal Educational Scenario

Educational Case

Production incident autoscale_200pct for MVP phase appointments-api from book/part-12-mvp.md. In the morning the local tier is unavailable for 45 minutes (11:00–11:45), 20 incidents drop into the queue, manual timeout is 120 seconds. The goal of the educational run is to verify that failover only lets high-risk tasks into the upper tier, not the entire queue, and keeps token_health_min above the safe threshold.

Preparation

  • book2/examples/budget-keeper/specs/budget_network.yaml — description of the 10M token plan.
  • book2/examples/budget-keeper/specs/budget_network_5m.yaml — ready calibration variant at 5M tokens with the same proportions.
  • book2/examples/budget-keeper/scenarios/fail_local_45m.json and fail_local_15m.json — two failure scenarios.
  • book2/examples/budget-keeper/outputs/budget_plan.example.json, outputs/fail_result.example.json — reference benchmarks for comparison.
  • book2/examples/budget-keeper/scripts/compile.py, simulate.py, inspect.py.

Steps

  1. cd book2/examples/budget-keeper. Expected: you are in the example directory, no additional dependencies.
  2. python3 scripts/compile.py --budget-spec specs/budget_network.yaml --out out/budget_plan.json. *Expected: in out/budget_plan.json field daily_budget_tokens: 10000000, local tier sum equals 9,000,000, frontier — 1,000,000 (90/10).*
  3. Compare out/budget_plan.json with outputs/budget_plan.example.json via diff. Expected: no discrepancies, or deviations only in comments.
  4. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_45m.json --out out/fail_result.json. *Expected: failover_to_frontier == 5, degraded_queue == 15, token_health_min >= 0.5.*
  5. python3 scripts/inspect.py --result out/fail_result.json --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5". Expected: return code 0, all four conditions satisfied simultaneously.

Bad: checking one metric at a time — 5 tasks went to frontier, the rest "seems okay", token_health forgotten. Good: one inspect run with four conditions in && — failure of any metric breaks the run.

  6. Short failure. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_15m.json --out out/fail_15m_result.json && python3 scripts/inspect.py --result out/fail_15m_result.json --query "token_health_min>=0.7". *Expected: return code 0, token_health_min >= 0.7 (a short failure burns the frontier tier less aggressively).*
  7. Record the run as a short budget output: local-coder unavailable, upper tier gets only 5 tasks, the rest go to degraded/manual, token_health_min stays above the threshold. *Expected: at the next token_health regression, the comparison is made against both simulations rather than "green vs. old baseline".*

If you have Qwen Code installed and need an explanation for review, perform a separate optional step:

qwen -p "Read @out/fail_result.json and @out/fail_15m_result.json. Explain why the 45-minute failure drops token_health more than the 15-minute one. Do not modify the files." --approval-mode plan

Such output is useful as a comment, but does not replace inspect.py and is not counted as a runnable fact.

Control Fact

Four conditions from step 5 are satisfied simultaneously. token_health_min does not drop below 0.5 on 45-minute failure and not below 0.7 on 15-minute failure. Without both simulations the scenario is considered incomplete: one point does not show budget sensitivity to failure duration.
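
The control fact can be written down as one combined check over both result files. The sketch below is an illustration rather than the logic of inspect.py: it assumes each result is a flat JSON object whose keys match the names used in the inspect query, and the sample values are hypothetical stand-ins for out/fail_result.json and out/fail_15m_result.json.

```python
def control_fact_holds(result_45m: dict, result_15m: dict) -> bool:
    """True only if every condition of the control fact holds at the same time."""
    checks = [
        result_45m["failover_to_frontier"] == 5,
        result_45m["degraded_queue"] == 15,
        result_45m["manual_queue_after_120s"] == 15,
        result_45m["token_health_min"] >= 0.5,
        result_15m["token_health_min"] >= 0.7,
    ]
    return all(checks)


# Hypothetical sample values; real ones come from the two simulation runs.
sample_45m = {
    "failover_to_frontier": 5,
    "degraded_queue": 15,
    "manual_queue_after_120s": 15,
    "token_health_min": 0.5,
}
sample_15m = {"token_health_min": 0.8}

print(control_fact_holds(sample_45m, sample_15m))                  # True
print(control_fact_holds(sample_45m, {"token_health_min": 0.65}))  # False
```

Requiring both results as arguments makes the "one point is not enough" rule structural: the check cannot be called with a single simulation.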

How This Goes into capstone/

Transfer to capstone/budget-note.md not the entire budget table, but one risk and one limiter: what happens when local-coder fails, how many tasks go to frontier-reviewer, which token_health threshold blocks further switching. If the main credit case is high_memory_usage, record this run as a budget risk for the same contour: not the entire autoscale_200pct, but the principle "expensive tier does not accept the entire queue on cheap tier failure". Full budget_plan.json is only needed for the full track.

Minimal fragment:

risk: "local-coder unavailable 45m"
effect: "5 tasks go to frontier-reviewer, 15 remain degraded/manual"
simulated_floor: "token_health_min == 0.5 (dip at 45m)"
alert_threshold: "token_health_min < 0.60 (guard from anti-Goodhart table)"
decision: "do not transfer entire queue to expensive tier"

Two different thresholds must not be confused. 0.5 is the observed simulation floor; 0.60 is the line below which the guard blocks automatic switching in production. The educational scenario shows that a 45-minute failure breaches the guard and therefore requires manual decision.

Reviewable Trace

Directory out/ is a local simulation result and must not enter the repository. For the educational walkthrough, a line in capstone/budget-note.md with risk, effect, guard threshold, and decision is sufficient.

In your own production repository, you may additionally store a short drill run report: links to 45m and 15m scenarios, token_health_min invariant, and decision not to transfer the entire queue to the expensive tier. Such a report is only useful if read by a reviewer or CI; a commit by itself is not an SDD fact.

Key Ideas

Model routing starts with splitting an incident into phases: triage (initial sorting), classification, diagnosis, plan, remediation, post-analysis. For each phase, record three parameters: which model serves it, the expected token cost, and the risk level at which escalation to the upper tier occurs.

Triage and classification are dense, templated flows sensitive to delay. Therefore local-coder takes them as the main routine consumer: quickly normalizes alerts, groups similar symptoms, extracts service, severity, recent events, and initial blast radius (area of possible damage).

frontier-reviewer occupies the upper network level for contentious diagnoses, conflicting plans, critical remediations, and post-mortems. These are cases where an error may cost more than the entire model call.

Draw the boundary between tiers not by model prestige, but by recoverability of the decision. If an action is easy to roll back and can be checked by a local validator, it stays in the cheap contour. If rollback is expensive or consequences affect multiple services, the expensive contour is needed.
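
The recoverability rule can be sketched as a small predicate. The class and field names here are hypothetical, and one case is an assumption of the sketch rather than a rule from the text: an action that is easy to roll back but cannot be checked by a local validator is sent to the expensive tier as the conservative choice.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    easy_rollback: bool        # can the action be undone cheaply?
    locally_verifiable: bool   # can a local validator check the result?
    services_affected: int     # how many services the consequences touch


def choose_tier(decision: Decision) -> str:
    """Pick a tier by recoverability of the decision, not by model prestige."""
    cheap_is_safe = (
        decision.easy_rollback
        and decision.locally_verifiable
        and decision.services_affected <= 1
    )
    return "local-coder" if cheap_is_safe else "frontier-reviewer"


restart_one_pod = Decision(easy_rollback=True, locally_verifiable=True, services_affected=1)
shared_schema_change = Decision(easy_rollback=False, locally_verifiable=True, services_affected=3)

print(choose_tier(restart_one_pod))       # local-coder
print(choose_tier(shared_schema_change))  # frontier-reviewer
```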

flowchart TD
IN[Incident flow]
S[SDD phase S signal collection and normalization]
D1[SDD phase D1 anomaly detection]
D2[SDD phase D2 diagnosis and assessment]
Q[Processing queue length]
R[Risk level]
B[Token budget as energy]
P[Flow distributor]
A[local-coder base level]
G[frontier-reviewer upper level]
O[Incident resolution and feedback]

IN --> S --> D1 --> D2 --> O
D1 --> Q
D2 --> R
Q --> P
R --> P
B --> P
P -->|stable mode| A
P -->|queue and risk growth| G
A --> O
G --> O
A -->|escalation of complex case| G
O -->|correction of limits and queues| B

The diagram above shows only the input and decision phases of the SDD cycle (signal collection, detect, diagnosis); the full incident cycle continues with plan, remediation, postmortem phases, which have separate SLAs and quotas — they appear in the YAML below. That is, the three abstract diagram phases (S, D1, D2) expand into six specific quotas (triage, classification, diagnosis, plan, remediation, postmortem) plus control_reserve as a buffer.

Build token quotas by load shape, not only by desired savings. For 10M tokens per day, a base layout may assign 9M to local-coder and 1M to frontier-reviewer. The cheap tier covers triage, classification, draft diagnosis, and preliminary plan. The expensive tier gets reserve for validation, contentious remediation actions, and post-analysis.

Set SLA thresholds separately for each phase. For example: triage must fit in tens of seconds, diagnosis may spend more context, and post-mortem allows a longer pass for completeness of the evidence chain.

Do not turn reserve into "leftovers for everything". Reserve is a safety layer that activates only on risk growth, queue growth, or uncertainty.

Project file template: .specify/memory/budget_network.yaml.

daily_budget_tokens: 10000000
phases:
  triage:
    local-coder: 3000000
    frontier-reviewer: 120000
    sla_p95: "30s"
  classification:
    local-coder: 2000000
    frontier-reviewer: 140000
    sla_p95: "45s"
  diagnosis:
    local-coder: 1500000
    frontier-reviewer: 180000
    sla_p95: "90s"
  plan:
    local-coder: 800000
    frontier-reviewer: 120000
    sla_p95: "120s"
  remediation:
    local-coder: 700000
    frontier-reviewer: 200000
    sla_p95: "180s"
  postmortem:
    local-coder: 300000
    frontier-reviewer: 240000
    sla_p95: "10m"
  control_reserve:
    local-coder: 700000
    frontier-reviewer: 0
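
A quick way to confirm that the quotas in this template add up is to sum them per tier. The sketch below copies the numbers from the YAML into a Python dictionary so that it needs no YAML parser; it is a stand-alone illustration and makes no claim about how compile.py performs the same check.

```python
DAILY_BUDGET_TOKENS = 10_000_000

# Quotas copied from the budget_network.yaml template (SLA fields omitted).
PHASES = {
    "triage":          {"local-coder": 3_000_000, "frontier-reviewer": 120_000},
    "classification":  {"local-coder": 2_000_000, "frontier-reviewer": 140_000},
    "diagnosis":       {"local-coder": 1_500_000, "frontier-reviewer": 180_000},
    "plan":            {"local-coder": 800_000,   "frontier-reviewer": 120_000},
    "remediation":     {"local-coder": 700_000,   "frontier-reviewer": 200_000},
    "postmortem":      {"local-coder": 300_000,   "frontier-reviewer": 240_000},
    "control_reserve": {"local-coder": 700_000,   "frontier-reviewer": 0},
}


def tier_totals(phases: dict) -> dict:
    """Sum the per-phase quotas for each tier."""
    totals = {}
    for quotas in phases.values():
        for tier, tokens in quotas.items():
            totals[tier] = totals.get(tier, 0) + tokens
    return totals


totals = tier_totals(PHASES)
assert sum(totals.values()) == DAILY_BUDGET_TOKENS, "quotas must add up to the daily budget"
print(totals)  # {'local-coder': 9000000, 'frontier-reviewer': 1000000}
```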

In your own project, the same steps are structured as tools/budget_keeper.py compile|assert|simulate|inspect on top of integration with providers and CI. Inside the textbook, a runnable analog is launched:

> [runnable] — a runnable budget keeper example lives in [examples/budget-keeper/](examples/budget-keeper/) (see [examples/budget-keeper/README.md](examples/budget-keeper/README.md)): there is a sample budget_network.yaml, scripts compile.py, simulate.py, inspect.py, and failover switching scenarios.

cd book2/examples/budget-keeper
python3 scripts/compile.py \
  --budget-spec specs/budget_network.yaml \
  --out out/budget_plan.json

Model cascade failures as ranked failover, not as simple replacement of one model with another. Failover here is a load switching plan on tier failure. Let's look at the difference in approaches.

Bad:

> When local-coder fails, all traffic goes to frontier-reviewer.

Problem: the expensive tier will eat the daily quota in minutes and will be unable to serve real P0/P1 incidents when they arrive.

Good:

> When local-coder fails, only tasks with severity in [P0, P1] and age > 90s go to frontier-reviewer; the rest go to the degraded queue.

If local-coder fails, do not let the entire incoming flow pass automatically into frontier-reviewer. Otherwise the expensive tier will quickly exhaust its quota and lose the ability to serve truly critical cases.

Instead, the budget keeper (budget-keeper, token budget control service) calculates several parameters every minute: spent[p] and queue[p] (spent and queue length in phase p), quota[p] (remaining quota), incident age, blast radius, and model confidence gap. Based on this, it selects only those tasks where delay is more dangerous than cost. Such ranked failover changes escalation timing: some incidents go to frontier-reviewer immediately, some stay in degraded mode, and some transfer to manual channel after a set timeout.
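
A minimal sketch of ranked failover, under stated assumptions: the `Task` fields, the integer blast radius score, and the fixed number of frontier slots are all hypothetical simplifications. A real budget keeper would also weigh spent[p], quota[p], and the model confidence gap, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    severity: str      # "P0" .. "P3"
    age_s: int         # seconds since the incident was opened
    blast_radius: int  # hypothetical score, e.g. number of affected services


def ranked_failover(queue, frontier_slots, min_age_s=90):
    """Send only the most dangerous eligible tasks to the expensive tier."""
    eligible = [
        t for t in queue
        if t.severity in ("P0", "P1") and t.age_s > min_age_s
    ]
    # Larger blast radius first, then older incidents first.
    eligible.sort(key=lambda t: (t.blast_radius, t.age_s), reverse=True)
    to_frontier = eligible[:frontier_slots]
    chosen = {t.task_id for t in to_frontier}
    degraded = [t for t in queue if t.task_id not in chosen]
    return to_frontier, degraded


# Synthetic queue of 20 incidents: the first ten are P1, the rest P3.
queue = [
    Task(f"inc-{i:02d}", "P1" if i < 10 else "P3", age_s=60 + 10 * i, blast_radius=i % 4)
    for i in range(20)
]
to_frontier, degraded = ranked_failover(queue, frontier_slots=5)
print([t.task_id for t in to_frontier])
print(len(degraded))
```

The cap on slots is what keeps the expensive tier available: even when more tasks are eligible than slots exist, the remainder stays in the degraded queue instead of spilling over.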

Emergency mode, or "red button", is a switch to protected mode. The figurative name is acceptable, but in artifacts record the exact emergency mode conditions. It is needed as a separate management mode because automatic failover can itself become a source of emergency. Activation conditions are formal: two consecutive windows with token_health risk growth (token_health is a composite token budget health indicator), queue above limit, SLA breach on critical severities, or failure of the local endpoint serving local-coder.

After triggering, the system limits new queue, prohibits mass automatic remediations, preserves frontier-reviewer for P0/P1, and transfers other decisions to manual or quasi-manual mode. Manual mode is not a rollback to chaos. Let it inherit the same file protocol, evidence chain, and PostToolUse checks, so that after stabilization the reasons for each decision can be recovered.
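
The activation conditions can be recorded as an explicit function that returns the reasons for triggering, which is also what ends up in the artifact. This is a sketch with two assumptions: any single condition is treated as sufficient, and "risk growth in two consecutive windows" is read as token_health falling twice in a row. All names are hypothetical.

```python
def red_button_reasons(token_health_windows, queue_len, queue_limit,
                       critical_sla_breached, local_endpoint_up):
    """Return the list of emergency mode conditions that currently hold."""
    reasons = []
    h = token_health_windows
    # Two consecutive declines need three samples: h[-3] > h[-2] > h[-1].
    if len(h) >= 3 and h[-3] > h[-2] > h[-1]:
        reasons.append("token_health fell in two consecutive windows")
    if queue_len > queue_limit:
        reasons.append("queue above limit")
    if critical_sla_breached:
        reasons.append("SLA breach on critical severities")
    if not local_endpoint_up:
        reasons.append("local endpoint for local-coder is down")
    return reasons


reasons = red_button_reasons(
    token_health_windows=[0.82, 0.74, 0.66],
    queue_len=20,
    queue_limit=25,
    critical_sla_breached=False,
    local_endpoint_up=False,
)
print(reasons)
calm = red_button_reasons([0.80, 0.80, 0.80], 5, 25, False, True)
print(calm)  # [] means emergency mode stays off
```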

Anti-Goodhart logic in validation.md closes the main risk of budget optimization: improving reported metrics at the cost of hidden degradation of real incident management. The anti-Goodhart rule is a ban on counting a release as successful if one metric rose at the cost of degrading others.

If you control only MTTR, the system may faster close complex incidents as non-critical, suppress escalation share, or push inconvenient P0s to manual channels without full post-mortem. Therefore validate MTTR together with four guard metrics and one activation condition. Their role is conveniently kept in one table.

| Metric | What it measures | What it blocks |
| --- | --- | --- |
| escalation_share | share of escalations to total flow | activation condition for the check: drop below the historical corridor simultaneously with fast MTTR |
| silent_p0 | share of closed P0s without escalation | rise above 2% |
| unresolved_manual_ratio | share of unresolved manual tasks | rise above 5% |
| postmortem_gap | gaps in post-analytics | gaps above 10% |
| token_health_min | minimum budget health level | drop below 0.6 |

Count MTTR improvement as invalid if any guard metric crossed its boundary. The paired check is exactly for this: a pretty reported metric should not cover degradation of resilience, silent P0 failures, or rupture of the evidence chain.

Fragment for validation.md with budget gateway rules.

checks:
  - id: anti_goodhart_budget
    if:
      all:
        - mttr_p95 < "5m"
        - escalation_ratio < 0.08
    then:
      fail_if:
        - silent_p0 > 0.02
        - unresolved_manual_ratio > 0.05
        - postmortem_gap > 0.10
        - token_health_min < 0.60

  - id: ecology_warn
    if:
      any:
        - token_health_trend_5m < -0.12
        - queue_pressure > 0.80
        - degraded_mode_duration > "120s"
    then:
      require:
        - red_button_review == true
        - manual_channel_open == true
        - frontier_reserved_for_p0_p1 == true
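
The first check, anti_goodhart_budget, could be evaluated roughly as follows. This is a sketch and not the validation runner: it assumes the metrics arrive as a flat dictionary, and the key `mttr_p95_s` (MTTR in seconds, so that "5m" becomes 300) is a naming choice of this example.

```python
# Upper bounds taken from the fail_if list of anti_goodhart_budget.
UPPER_GUARDS = {
    "silent_p0": 0.02,
    "unresolved_manual_ratio": 0.05,
    "postmortem_gap": 0.10,
}
TOKEN_HEALTH_FLOOR = 0.60


def anti_goodhart_budget(metrics: dict) -> list:
    """Return violated guards; an empty list means pass or check not activated."""
    activated = metrics["mttr_p95_s"] < 300 and metrics["escalation_ratio"] < 0.08
    if not activated:
        return []
    failed = [name for name, limit in UPPER_GUARDS.items() if metrics[name] > limit]
    if metrics["token_health_min"] < TOKEN_HEALTH_FLOOR:
        failed.append("token_health_min")
    return failed


# Hypothetical metrics: MTTR looks good, but the budget health dipped to 0.5.
metrics = {
    "mttr_p95_s": 240,
    "escalation_ratio": 0.05,
    "silent_p0": 0.01,
    "unresolved_manual_ratio": 0.03,
    "postmortem_gap": 0.04,
    "token_health_min": 0.5,
}
violations = anti_goodhart_budget(metrics)
print(violations)  # ['token_health_min']
```

A non-empty list means the MTTR improvement is not counted, regardless of how good the headline number looks.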

In your own project, this check is structured as python3 tools/validation_runner.py run --spec validation.md --out .specify/artifacts/validation_health.json with subsequent jq check of anti_goodhart_budget and ecology_warn. A close runnable analog of the anti-Goodhart checks themselves is examples/goodhart-validator/scripts/run_validation.py (see Chapter 10).

Full Track: Threshold Calibration

The "Low / Default / High" table for budget size, local/frontier proportions, and manual_timeout_sec, the exercise with the compressed 5M variant, and the signals for review are in Appendix D, Section D.3. On the first pass, two failure simulations and a token_health line in budget-note.md are sufficient.

Examples and Application

Practical simulation of Scenario B verifies that local-coder failure does not turn frontier-reviewer into an emergency reserve for the entire queue. At 11:00 the local endpoint of the cheap model is unavailable for 45 minutes. The queue contains 20 incidents. Manual timeout is 120 seconds.

The policy selects three directions: 5 tasks with maximum blast radius and age go to frontier-reviewer, 15 tasks remain in the degradation queue, after two minutes the manual channel opens. The check is considered successful not because all tasks were processed automatically. Success is elsewhere: the system preserved the expensive tier, limited the queue, and did not let token_health_min drop below the safe threshold.

In your own project, this scenario is launched as tools/budget_keeper.py simulate ... --failure "11:00,local-coder,down,45m" --queue 20 --manual-timeout-sec 120 with subsequent inspect on condition failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5. The runnable analog is the same:

> [runnable] — scenario examples/budget-keeper/scenarios/fail_local_45m.json.

cd book2/examples/budget-keeper
python3 scripts/simulate.py \
  --plan out/budget_plan.json \
  --scenario scenarios/fail_local_45m.json \
  --out out/fail_result.json

python3 scripts/inspect.py \
  --result out/fail_result.json \
  --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5"

Make rollback after stabilization stepped; otherwise cheap tier recovery will create a second cascade. First return 30% of the local-coder quota, and only for triage/classification: these phases are easier to check by formal signs, and they unload the input flow faster. Return another 30% to diagnosis/plan after three stable token_health windows, no silent_p0_ratio growth, and queue normalization. Full return is permitted only after the PostToolUse audit. The reason: lifting manual mode prematurely may hide errors accumulated during degradation.
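
The stepped rollback can be sketched as a function from observed conditions to the restored share of local-coder quota. The function and argument names are hypothetical, and the sketch assumes that the full return requires the second-step conditions as well as the PostToolUse audit.

```python
EARLY_PHASES = ["triage", "classification"]
MID_PHASES = EARLY_PHASES + ["diagnosis", "plan"]
ALL_PHASES = MID_PHASES + ["remediation", "postmortem"]


def rollback_stage(stable_windows: int, silent_p0_growing: bool,
                   queue_normal: bool, audit_passed: bool) -> dict:
    """Return how much local-coder quota is restored and for which phases."""
    second_step_ok = stable_windows >= 3 and not silent_p0_growing and queue_normal
    if second_step_ok and audit_passed:
        return {"restored_share": 1.0, "phases": ALL_PHASES}
    if second_step_ok:
        return {"restored_share": 0.6, "phases": MID_PHASES}
    # Default first step right after the cheap tier comes back.
    return {"restored_share": 0.3, "phases": EARLY_PHASES}


first = rollback_stage(0, silent_p0_growing=False, queue_normal=False, audit_passed=False)
second = rollback_stage(3, silent_p0_growing=False, queue_normal=True, audit_passed=False)
full = rollback_stage(3, silent_p0_growing=False, queue_normal=True, audit_passed=True)
print(first)
print(second)
print(full)
```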

In operations, this model is conveniently checked as a daily budget drill. The team takes yesterday's alert flow, replays it through the current budget_network.yaml, and artificially disables local-coder for 15, 30, and 45 minutes. Then four indicators are compared: MTTR, escalation share, manual queue volume, and minimum token_health.

Signals for analysis:

  • if frontier-reviewer starts serving non-critical tasks during a short failure, failover is too broad;
  • if the manual channel opens at a moderate queue length, the SLA thresholds are too sensitive.

The goal of the drill is to find a profile where degradation is predictable, not invisible until quota exhaustion.

Summary

Token budget becomes a manageable resource only when five elements are linked in one management loop: SDD phases, model tiering, SLA thresholds, failover, and validation. In this loop, local-coder provides throughput for mass routine; frontier-reviewer protects contentious and high-risk decisions; emergency mode limits automation on risk growth; validation.md prevents improving MTTR at the cost of silent P0s and destroyed post-analytics. Such a scheme shows not only current spend, but also degradation order: which phases will starve first, which tasks must transfer to the expensive tier, and when manual mode is safer than further automation. Next, this loop will pass to Goodhart metrics and paired guard metrics.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| Local run of book2/examples/budget-keeper | quota sum matches 10M tokens and the given local/frontier split |
| out/budget_plan.json, out/fail_result.json, out/fail_15m_result.json | 45-minute scenario yields failover_to_frontier==5, degraded_queue==15, manual_queue_after_120s==15, token_health_min>=0.5; 15-minute scenario preserves token_health_min>=0.7; out/ is not committed |
| Entry in precedents.md or capstone/budget-note.md | explains what happens when local-coder fails, which tasks go to frontier-reviewer, and which token_health_min threshold protects the budget |

Full track adds .specify/memory/budget_network.yaml with phases and SLAs, budget_plan.json after compile, fail_scenario_B.json, validation.md with anti-Goodhart budget gateway, and validation_health.json. Consider it ready if emergency mode preserves frontier for P0/P1 and opens manual channel, the anti-Goodhart gateway blocks savings at the cost of silent_p0 or audit rupture, and budget simulation is included in regular drill or CI.

Practice

  1. cd book2/examples/budget-keeper && python3 scripts/compile.py --budget-spec specs/budget_network.yaml --out out/budget_plan.json. *Expected: daily_budget_tokens == 10_000_000, local tier sum 9M, frontier 1M (90/10).*
  2. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_45m.json --out out/fail_result.json && python3 scripts/inspect.py --result out/fail_result.json --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5". *Expected: code 0, four conditions satisfied simultaneously.*
  3. Transfer to capstone/budget-note.md five lines: risk, effect, simulated_floor, alert_threshold, decision. *Expected: format matches the reference from the "How This Goes into capstone/" section; full budget_plan.json does not go into capstone/.*

Review Questions

  1. Why should failover not let the entire queue into the expensive tier?
  2. Which metrics reveal degradation of budget routing?
  3. When is manual mode safer than continuing automation?
  4. The local model fell for 45 minutes at peak time. You have 60% of daily budget, but MTTR is creeping up. What do you switch — model, routing policy, or triage mode?