Reading: Applied Part 9. Model Routing and Token Budget




Status: Recommendation. Using a cheap model for routine work and an expensive model for critical reviews is a stable practice. Specific thresholds, the failover switching formula, and a budget keeper run as a separate service are still unsettled: Qwen Code does not manage the budget itself, so the implementation depends on your infrastructure.

For an educational run, it is enough to play out the local-coder failure in examples/budget-keeper/ and verify that the expensive tier does not receive the entire queue. A separate budget keeper and integration with providers belong to the full production track.

In the educational AgentClinic, we chose one model in part 4 of the first volume and kept the process independent of it (part 15). In production, one model is not enough. The expensive model should not spontaneously absorb the entire incident queue, and the cheap model should not silently degrade on disputed cases. This adds a dimension that was absent in the educational project: managing the mix of models across pipeline phases. Routing fits naturally into a custom command or a hook, both techniques from part 14, "Your Own Process via Qwen Code Skills".

Before Reading

Foundation from the first volume: part 15 requires agent replaceability, part 14 shows project skills and hooks.
Local educational case: autoscale_200pct, because a failure of the cheap tier provides an observable budget simulation.
Trail for capstone/: one risk for high_memory_usage that records what happens when local-coder fails, how many tasks are allowed into frontier-reviewer, and which token_health threshold blocks switching.

Key terms on first pass: tier and token_health. Budget keeper, failover_to_frontier, and manual_queue_after_120s are for reference only.
What to defer: integration with providers, a separate budget keeper service, and regular drills.

Goal

The goal of the chapter is to turn the daily token budget (10M in the example) from a static limit into a manageable routing table for the SDD pipeline. This is tier-budgeting: distributing tokens between model levels across work phases. The cheap model (local-coder) handles routine work. The expensive one (frontier-reviewer) is engaged only for critical reviews and disputed decisions.

The figure of 10M is chosen to cover a flow of about 200 incidents per day at an average cost of about 50K tokens per incident across all phases (200 × 50K = 10M). For larger flows, scale the budget proportionally; for smaller ones, reduce it, keeping the proportions between phases. The 9M / 1M split between tiers reflects an observation: in calm mode, disputed reviews consume about 10% of the total budget. If your project sets complex tasks more often, increase the upper tier's share to 15–20%.
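The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the budget-keeper example: the function name and parameters are hypothetical, and it assumes the 50K figure is the average token cost of one incident, since 200 × 50K gives the 10M in the text.

```python
def size_budget(incidents_per_day: int, tokens_per_incident: int,
                frontier_share: float = 0.10) -> dict:
    """Split a daily token budget between the two tiers (hypothetical helper)."""
    total = incidents_per_day * tokens_per_incident
    frontier = round(total * frontier_share)
    return {
        "daily_budget_tokens": total,
        "local-coder": total - frontier,
        "frontier-reviewer": frontier,
    }


# Default case from the text: 200 incidents per day, 10% for disputed reviews.
plan = size_budget(200, 50_000)

# A project with more complex tasks raises the upper tier's share to 15%.
heavy = size_budget(200, 50_000, frontier_share=0.15)

print(plan)
print(heavy)
```

Scaling the flow changes only the first argument; the proportions between tiers stay fixed unless frontier_share is changed explicitly.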

After this section, you will be able to build a token distribution across incident management phases, set SLA thresholds for each tier, check the system's behavior when the cheap model fails, and show that savings do not degrade MTTR (mean time to recovery), escalation quality, or post-analysis resilience. local-coder and frontier-reviewer are roles in your infrastructure, not model names: in one project they may be different models from one provider, in another a local model and a cloud model.

Baseline: Model Selection Table by Task

Before moving to tiering by phase, risk, and queue (which the rest of the chapter describes), it is useful to fix the starting heuristic for model selection. It operates not on the names of models of a specific provider, but on classes by power and cost: light, medium, and heavy. Different projects map to different models for these classes — the brand does not matter, the "expensive/smart" ratio does.

Model class	When to use
Light	exploration and file search, simple single-file edits, documentation generation
Medium	multi-file implementation, pull-request review; the default model for ~90% of coding tasks
Heavy	complex architecture, security analysis, debugging complex bugs

Keep the medium class as the default: it covers most coding tasks. Use the light class for cheap routine work where errors are easy to detect and roll back. Engage the heavy class not out of habit, but by an upgrade rule:

the first attempt on the medium class failed;
the task touches five or more files;
an architectural decision is being made;
the code is security-critical.
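The baseline heuristic and the upgrade rule can be written as one small function. This is a sketch under stated assumptions: the Task fields and the function name are hypothetical, and the condition for the light class (a single-file task that is easy to verify and roll back) is a simplification of the table row.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description; the field names are illustrative."""
    files_touched: int = 1
    medium_attempt_failed: bool = False
    architectural_decision: bool = False
    security_critical: bool = False
    easy_to_verify_and_roll_back: bool = False


def select_model_class(task: Task) -> str:
    """Return 'light', 'medium', or 'heavy' following the baseline heuristic."""
    # Upgrade rule: any one of the four conditions is enough for the heavy class.
    needs_heavy = (
        task.medium_attempt_failed
        or task.files_touched >= 5
        or task.architectural_decision
        or task.security_critical
    )
    if needs_heavy:
        return "heavy"
    # Light class: cheap routine work where errors are easy to detect and undo.
    if task.easy_to_verify_and_roll_back and task.files_touched == 1:
        return "light"
    # The medium class stays the default.
    return "medium"


print(select_model_class(Task()))
print(select_model_class(Task(files_touched=5)))
```

The heavy check runs first on purpose: a security-critical edit stays in the heavy class even when it touches a single file.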

This table is the baseline of the same logic expanded later in the chapter. The connection is direct: local-coder corresponds to the light/medium class for streaming routine, frontier-reviewer to the heavy class for disputed and high-risk decisions. The difference is that the baseline table selects a model by the type of a single task, while tiering adds three dimensions absent from the basic heuristic: decision risk and recoverability, queue pressure, and the SDD pipeline phase with its own token budget. So the development path is: "baseline selection by task → tiering by risk, queue, and phase." The upgrade rule to the heavy class does not disappear — it becomes one of the conditions for escalation from local-coder to frontier-reviewer.

Minimum Educational Scenario

Educational Case

Production incident autoscale_200pct for the MVP phase of appointments-api from book/part-12-mvp.md. In the morning, the local tier is unavailable for 45 minutes (11:00–11:45), 20 incidents fall into the queue, the manual timeout is 120 seconds. The goal of the educational run is to verify that failover passes only high-risk items to the upper tier, not the entire queue, and keeps token_health_min above the safe threshold.

Preparation

book2/examples/budget-keeper/specs/budget_network.yaml — description of the 10M token plan.
book2/examples/budget-keeper/specs/budget_network_5m.yaml — ready calibration variant for 5M tokens with the same proportions.
book2/examples/budget-keeper/scenarios/fail_local_45m.json and fail_local_15m.json — two failure scenarios.
book2/examples/budget-keeper/outputs/budget_plan.example.json, outputs/fail_result.example.json — references for comparison.
book2/examples/budget-keeper/scripts/compile.py, simulate.py, inspect.py.

Steps

cd book2/examples/budget-keeper. Expectation: you are in the example directory, no additional dependencies.

python3 scripts/compile.py --budget-spec specs/budget_network.yaml --out out/budget_plan.json. *Expectation: in out/budget_plan.json the field daily_budget_tokens: 10000000, the local tier sum is 9,000,000, frontier — 1,000,000 (90/10).*
Compare out/budget_plan.json with outputs/budget_plan.example.json via diff. Expectation: no discrepancies, or deviations only in comments.
python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_45m.json --out out/fail_result.json. *Expectation: failover_to_frontier == 5, degraded_queue == 15, token_health_min >= 0.5.*
python3 scripts/inspect.py --result out/fail_result.json --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5". Expectation: return code 0, all four conditions met simultaneously.

Bad: checking one metric at a time — 5 tasks went to frontier, the rest is "seems ok", token_health forgotten. Good: one inspect run with four conditions in && — failure of at least one metric breaks the run.

Short failure. python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_15m.json --out out/fail_15m_result.json && python3 scripts/inspect.py --result out/fail_15m_result.json --query "token_health_min>=0.7". *Expectation: return code 0, token_health_min >= 0.7 (a short failure burns the frontier less aggressively).*
Record the run as a short budget summary: local-coder is unavailable, the upper tier receives only 5 tasks, the rest goes to degraded/manual, token_health_min stays above the threshold. *Expectation: on the next regression on token_health, the comparison is not "green vs. old baseline" but against both simulations.*

If you have Qwen Code installed and need an explanation for review, perform a separate optional step:

qwen -p "Read @out/fail_result.json and @out/fail_15m_result.json. Explain why a 45-minute failure reduces token_health more than a 15-minute one. Do not modify the files." --approval-mode plan

Such output is useful as a comment, but does not replace inspect.py and is not considered a runnable fact.

Control Fact

The four conditions from step 5 are met simultaneously. token_health_min does not drop below 0.5 for a 45-minute failure and not below 0.7 for a 15-minute one. Without both simulations, the scenario is considered incomplete: a single point does not show the budget's sensitivity to failure duration.
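The idea of a control fact that holds only when every condition passes in both simulations can be sketched as follows. This is not the implementation of inspect.py: the helper is hypothetical, the result files are assumed to be flat JSON objects with the metric names used in the steps, and the two dictionaries are illustrative stand-ins, not real simulator output.

```python
import operator

OPS = {"==": operator.eq, ">=": operator.ge}


def failed_conditions(result: dict, conditions: list) -> list:
    """Return the conditions that do not hold; an empty list means the run passes."""
    failed = []
    for field, op, expected in conditions:
        # A missing metric counts as a failure, not as "seems ok".
        if field not in result or not OPS[op](result[field], expected):
            failed.append(f"{field}{op}{expected}")
    return failed


CHECK_45M = [
    ("failover_to_frontier", "==", 5),
    ("degraded_queue", "==", 15),
    ("manual_queue_after_120s", "==", 15),
    ("token_health_min", ">=", 0.5),
]
CHECK_15M = [("token_health_min", ">=", 0.7)]

# Illustrative stand-ins for out/fail_result.json and out/fail_15m_result.json.
result_45m = {"failover_to_frontier": 5, "degraded_queue": 15,
              "manual_queue_after_120s": 15, "token_health_min": 0.5}
result_15m = {"token_health_min": 0.7}

# The scenario is complete only when both simulations pass all their conditions.
complete = (not failed_conditions(result_45m, CHECK_45M)
            and not failed_conditions(result_15m, CHECK_15M))
print(complete)
```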

How This Lands in `capstone/`

Move into capstone/budget-note.md not the entire budget table, but one risk and one guard: what happens when local-coder fails, how many tasks go to frontier-reviewer, which token_health threshold blocks further switching. If the main graded case is high_memory_usage, record this run as a budget risk for the same contour: not the entire autoscale_200pct, but the principle that "the expensive tier does not accept the entire queue when the cheap one fails." The full budget_plan.json is needed only for the full track.

Minimum fragment:

risk: "local-coder unavailable 45m"
effect: "5 tasks go to frontier-reviewer, 15 remain degraded/manual"
simulated_floor: "token_health_min == 0.5 (dip at 45m)"
alert_threshold: "token_health_min < 0.60 (guard from anti-Goodhart table)"
decision: "do not switch the entire queue to the expensive tier"

Two different thresholds should not be confused. 0.5 is the observed simulation floor; 0.60 is the line below which the guard blocks automatic switching in production. The educational scenario shows that a 45-minute failure breaks through the guard and therefore requires a manual decision.

Reviewable Trail

The out/ directory is a local simulation result and should not be committed to the repository. For an educational pass, a single short entry in capstone/budget-note.md with the risk, effect, guard threshold, and decision is sufficient.

In your production repository, you can additionally store a short drill-run report: links to the 45m and 15m scenarios, the token_health_min invariant, and the decision not to switch the entire queue to the expensive tier. Such a report is useful only if it is read by a reviewer or CI; the commit itself is not an SDD fact.

Key Ideas

Model routing begins with splitting the incident into phases: triage (initial analysis), classification, diagnosis, plan, remediation, post-analysis. For each phase, fix three parameters: which model serves it, the expected cost in tokens, and at what risk escalation to the upper tier occurs.

Triage and classification are a dense, template-driven flow, sensitive to latency. Therefore local-coder takes it as the main routine consumer: quickly normalizes alerts, groups similar symptoms, extracts service, severity, recent events, and the initial blast radius.

frontier-reviewer occupies the top level of the network for disputed diagnoses, conflicting plans, critical remediations, and post-mortems. These are cases where a mistake can cost more than the entire model call.

Draw the boundary between tiers not by model prestige, but by decision recoverability. If an action is easy to roll back and can be verified by a local validator, it stays in the cheap tier. If rollback is expensive or the consequences touch multiple services, it needs the expensive tier.

flowchart TD
IN[Incident flow]
S[SDD phase S signal collection and normalization]
D1[SDD phase D1 anomaly detection]
D2[SDD phase D2 diagnosis and assessment]
Q[Processing queue length]
R[Risk level]
B[Token budget as energy]
P[Flow distributor]
A[local-coder base level]
G[frontier-reviewer top level]
O[Incident resolution and feedback]

IN --> S --> D1 --> D2 --> O

D1 --> Q
D2 --> R
Q --> P
R --> P
B --> P
P -->|stable mode| A
P -->|queue and risk growth| G
A --> O
G --> O
A -->|complex case escalation| G
O -->|limits and queue correction| B

The diagram above shows only the input and decision phases of the SDD cycle (signal collection, detection, diagnosis). The full incident cycle continues with the plan, remediation, and postmortem phases, which have their own SLAs and quotas; they appear in the YAML below. In other words, the three abstract phases of the diagram (S, D1, D2) unfold into six concrete quotas (triage, classification, diagnosis, plan, remediation, postmortem) plus control_reserve as a buffer.

Build token quotas by the shape of the load, not only by the desired savings. For 10M tokens per day, the basic split can fix 9M to local-coder and 1M to frontier-reviewer. The cheap tier covers triage, classification, draft diagnosis, and a preliminary plan. The expensive tier gets a reserve for validation, disputed remediation actions, and post-analysis.

Set SLA thresholds separately for each phase. For example: triage must fit into tens of seconds, diagnosis can spend more context, and post-mortem allows a longer pass for the sake of a complete evidence chain.

Do not turn the reserve into "the remainder for everything." The reserve is a safety layer that activates only when risk, queue, or uncertainty grows.

Project file template: .specify/memory/budget_network.yaml.

daily_budget_tokens: 10000000
phases:
  triage:
    local-coder: 3000000
    frontier-reviewer: 120000
    sla_p95: "30s"
  classification:
    local-coder: 2000000
    frontier-reviewer: 140000
    sla_p95: "45s"
  diagnosis:
    local-coder: 1500000
    frontier-reviewer: 180000
    sla_p95: "90s"
  plan:
    local-coder: 800000
    frontier-reviewer: 120000
    sla_p95: "120s"
  remediation:
    local-coder: 700000
    frontier-reviewer: 200000
    sla_p95: "180s"
  postmortem:
    local-coder: 300000
    frontier-reviewer: 240000
    sla_p95: "10m"
  control_reserve:
    local-coder: 700000
    frontier-reviewer: 0
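
The quotas in this template can be checked by hand before running compile.py. The sketch below copies the per-phase numbers into a Python dictionary (the sla_p95 fields are omitted) and verifies the 90/10 split; the function name is hypothetical and the check is independent of the example scripts.

```python
BUDGET = {
    "daily_budget_tokens": 10_000_000,
    "phases": {
        "triage":          {"local-coder": 3_000_000, "frontier-reviewer": 120_000},
        "classification":  {"local-coder": 2_000_000, "frontier-reviewer": 140_000},
        "diagnosis":       {"local-coder": 1_500_000, "frontier-reviewer": 180_000},
        "plan":            {"local-coder": 800_000,   "frontier-reviewer": 120_000},
        "remediation":     {"local-coder": 700_000,   "frontier-reviewer": 200_000},
        "postmortem":      {"local-coder": 300_000,   "frontier-reviewer": 240_000},
        "control_reserve": {"local-coder": 700_000,   "frontier-reviewer": 0},
    },
}


def tier_totals(budget: dict) -> dict:
    """Sum the per-phase quotas for each tier."""
    totals = {}
    for quotas in budget["phases"].values():
        for tier, tokens in quotas.items():
            totals[tier] = totals.get(tier, 0) + tokens
    return totals


totals = tier_totals(BUDGET)
# The phase quotas must add up to the daily budget exactly.
assert sum(totals.values()) == BUDGET["daily_budget_tokens"]
print(totals)  # {'local-coder': 9000000, 'frontier-reviewer': 1000000}
```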

In your project, the same steps are formalized as tools/budget_keeper.py compile|assert|simulate|inspect on top of provider integration and CI. Inside the textbook, a runnable analog is launched:

> [runnable] — a runnable budget keeper example lives in [examples/budget-keeper/](examples/budget-keeper/) (see [examples/budget-keeper/README.md](examples/budget-keeper/README.md)): there is a sample budget_network.yaml, scripts compile.py, simulate.py, inspect.py, and failover scenarios.

cd book2/examples/budget-keeper
python3 scripts/compile.py \
  --budget-spec specs/budget_network.yaml \
  --out out/budget_plan.json

Model the failure cascade as ranked failover, not as a simple replacement of one model with another. Failover here is a load-switching plan when a tier fails. Let us look at the difference in approaches.

Bad:

> When local-coder fails, all traffic goes to frontier-reviewer.

Problem: the expensive tier will eat the daily quota in minutes and will not be able to serve real P0/P1 when they arrive.

Good:

> When local-coder fails, only tasks with severity in [P0, P1] and age > 90s go to frontier-reviewer; the rest go to the degraded queue.

In other words, a local-coder failure must not redirect the entire incoming flow to frontier-reviewer: the expensive tier has to keep enough quota to serve truly critical cases.

Instead, the budget-keeper (token budget control service) every minute counts several parameters: spent[p] and queue[p] (spent and queue length in phase p), quota[p] (remaining quota), incident age, blast radius, and the model confidence gap. Based on this, it selects only those tasks where delay is more dangerous than spend. Such ranked failover changes the escalation time: some incidents go to frontier-reviewer immediately, some remain in degraded mode, and some move to the manual channel after a given timeout.
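A minimal sketch of ranked failover, assuming the "Good" rule above (severity P0 or P1 and age over 90 seconds) and a fixed number of frontier slots. The Incident fields, the ranking key, and the frontier_slots parameter are hypothetical; a real budget keeper would derive the slot count from the remaining quota and would also weigh the confidence gap.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical queue item; the field names are illustrative."""
    id: str
    severity: str      # "P0" .. "P3"
    age_s: int         # seconds since the incident was opened
    blast_radius: int  # number of affected services (illustrative measure)


def ranked_failover(queue, frontier_slots, min_age_s=90):
    """Split the queue into (to_frontier, degraded) while local-coder is down."""
    eligible = [i for i in queue
                if i.severity in ("P0", "P1") and i.age_s > min_age_s]
    # Rank: largest blast radius first, then oldest first.
    eligible.sort(key=lambda i: (-i.blast_radius, -i.age_s))
    to_frontier = eligible[:frontier_slots]
    chosen = {i.id for i in to_frontier}
    # Everything else waits in the degraded queue, including eligible overflow.
    degraded = [i for i in queue if i.id not in chosen]
    return to_frontier, degraded


# Synthetic queue of 20 incidents: six old P1s, the rest P3.
queue = [Incident(f"inc-{n}", "P1" if n < 6 else "P3",
                  age_s=100 + n, blast_radius=n % 4) for n in range(20)]
to_frontier, degraded = ranked_failover(queue, frontier_slots=5)
print(len(to_frontier), len(degraded))  # 5 15
```

The slot limit is what protects the quota: even a sixth eligible P1 stays in the degraded queue until a slot frees up or the manual timeout expires.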

Emergency mode, or the "red button", is a switch to a protected mode. The figurative name is acceptable, but in artifacts record the actual emergency mode conditions. It is needed as a separate control mode because automatic failover can itself become a source of incidents. The activation conditions are formal: two consecutive windows of growing token_health risk (token_health is the composite token budget health indicator), a queue above the limit, an SLA breach for critical severities, or a failure of the local endpoint that serves local-coder.

After triggering, the system limits the new queue, forbids mass automatic remediations, reserves frontier-reviewer for P0/P1, and moves the rest of the decisions to manual or quasi-manual mode. Manual mode is not a rollback to chaos. Let it inherit the same file protocol, evidence chain, and PostToolUse checks, so that after stabilization the reasons for each decision can be reconstructed.
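The activation conditions can be recorded as an explicit check that returns its reasons, which is what should land in artifacts instead of the "red button" label. This is a sketch under assumptions: the function and parameter names are hypothetical, any single condition is treated as sufficient, and "risk growth in two consecutive windows" is read as token_health falling twice in a row.

```python
def emergency_mode_reasons(token_health_windows, queue_len, queue_limit,
                           critical_sla_breached, local_endpoint_up):
    """Return the activation reasons; an empty list means normal mode."""
    reasons = []
    last = token_health_windows[-3:]
    # Two consecutive drops need three readings: h0 > h1 > h2.
    if len(last) == 3 and last[0] > last[1] > last[2]:
        reasons.append("token_health fell in two consecutive windows")
    if queue_len > queue_limit:
        reasons.append("queue above limit")
    if critical_sla_breached:
        reasons.append("SLA breach for critical severities")
    if not local_endpoint_up:
        reasons.append("local-coder endpoint down")
    return reasons


reasons = emergency_mode_reasons(
    token_health_windows=[0.9, 0.8, 0.7],
    queue_len=10, queue_limit=50,
    critical_sla_breached=False, local_endpoint_up=True,
)
print(reasons)
```

Returning the list of reasons, not a bare boolean, keeps the decision reconstructable after stabilization.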

The anti-Goodhart logic in validation.md closes the main risk of budget optimization: improving reported metrics at the cost of hidden degradation of real incident management. The anti-Goodhart rule is a prohibition on counting a release as successful if one metric grew at the expense of degradation of others.

If you control only MTTR, the system may close complex incidents faster as non-critical, lower the share of escalations, or push inconvenient P0s into manual channels without a full post-mortem. Therefore, validate MTTR together with four guard metrics and one check activation condition. It is convenient to keep their role in one table.

Metric	What it measures	Blocking condition
`escalation_share`	share of escalations in the total flow	activates the check rather than blocking: falls below the historical corridor while MTTR is fast
`silent_p0`	share of P0s closed without escalation	rises above 2%
`unresolved_manual_ratio`	share of unresolved manual tasks	rises above 5%
`postmortem_gap`	gaps in post-analysis	rises above 10%
`token_health_min`	minimum token budget health level	drops below 0.6

Consider an MTTR improvement invalid if even one guard metric crossed its boundary. The paired check exists for exactly this: a pretty reported metric should not mask resilience degradation, silent P0 failures, or a broken evidence chain.

Fragment for validation.md with budget gate rules.

checks:
  - id: anti_goodhart_budget
    if:
      all:
        - mttr_p95 < "5m"
        - escalation_ratio < 0.08
    then:
      fail_if:
        - silent_p0 > 0.02
        - unresolved_manual_ratio > 0.05
        - postmortem_gap > 0.10
        - token_health_min < 0.60

  - id: ecology_warn
    if:
      any:
        - token_health_trend_5m < -0.12
        - queue_pressure > 0.80
        - degraded_mode_duration > "120s"
    then:
      require:
        - red_button_review == true
        - manual_channel_open == true
        - frontier_reserved_for_p0_p1 == true

In your project, this check is formalized as python3 tools/validation_runner.py run --spec validation.md --out .specify/artifacts/validation_health.json followed by a jq check on anti_goodhart_budget and ecology_warn. A close runnable analog of the anti-Goodhart checks themselves is examples/goodhart-validator/scripts/run_validation.py (see chapter 10).
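The logic of the anti_goodhart_budget check can be sketched in plain Python. This is an illustration of the rule, not the code of validation_runner.py or run_validation.py: the function name and the metrics dictionary are hypothetical, mttr_p95 is assumed to be given in seconds (so "5m" becomes 300), and the thresholds are copied from the YAML fragment.

```python
# Guard thresholds copied from the fail_if list: metric -> (comparison, limit).
GUARDS = {
    "silent_p0": (">", 0.02),
    "unresolved_manual_ratio": (">", 0.05),
    "postmortem_gap": (">", 0.10),
    "token_health_min": ("<", 0.60),
}


def anti_goodhart_budget(m: dict) -> list:
    """Return the violated guards; an empty list means the gate passes."""
    # The check activates only when MTTR looks good and escalations are rare.
    triggered = m["mttr_p95_s"] < 300 and m["escalation_ratio"] < 0.08
    if not triggered:
        return []
    violations = []
    for name, (cmp, limit) in GUARDS.items():
        value = m[name]
        if (cmp == ">" and value > limit) or (cmp == "<" and value < limit):
            violations.append(name)
    return violations


# Illustrative metrics: fast MTTR, few escalations, but too many silent P0s.
metrics = {"mttr_p95_s": 240, "escalation_ratio": 0.05, "silent_p0": 0.03,
           "unresolved_manual_ratio": 0.01, "postmortem_gap": 0.02,
           "token_health_min": 0.75}
print(anti_goodhart_budget(metrics))  # ['silent_p0']
```

In this sketch the fast MTTR is rejected because silent_p0 crossed 2%, even though the other three guards are within bounds.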

Full Track: Threshold Calibration

The "Low / Default / High" table for budget size, local/frontier proportions, and manual_timeout_sec, an exercise with a compressed 5M variant, and signals for revision are in Appendix D, section D.3. On the first pass, two failure simulations and a token_health line in budget-note.md are enough.

Examples and Application

A practical simulation of scenario B verifies that a local-coder failure does not turn frontier-reviewer into an emergency reserve for the entire queue. At 11:00, the local endpoint of the cheap model becomes unavailable for 45 minutes. The queue contains 20 incidents. The manual timeout is 120 seconds.

The policy chooses three directions: 5 tasks with the largest blast radius and age go to frontier-reviewer, 15 tasks remain in the degraded queue, and after two minutes the manual channel opens. The check is considered successful not because all tasks are processed automatically. Success is different: the system preserved the expensive tier, bounded the queue, and did not let token_health_min fall below the safe threshold.

In your project, this scenario is run as tools/budget_keeper.py simulate ... --failure "11:00,local-coder,down,45m" --queue 20 --manual-timeout-sec 120 followed by inspect on the condition failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5. The runnable analog is the same:

> [runnable] — scenario examples/budget-keeper/scenarios/fail_local_45m.json.

cd book2/examples/budget-keeper
python3 scripts/simulate.py \
  --plan out/budget_plan.json \
  --scenario scenarios/fail_local_45m.json \
  --out out/fail_result.json

python3 scripts/inspect.py \
  --result out/fail_result.json \
  --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5"

Roll back in steps after stabilization; otherwise the recovery of the cheap tier will create a second cascade. First return 30% of the local-coder quota, only for triage/classification (these phases are easier to verify by formal signs, and they unload the input flow faster). Return another 30%, to diagnosis/plan, after three stable token_health windows, no growth in silent_p0, and queue normalization. Full return is allowed only after a PostToolUse audit. Reason: lifting manual mode prematurely can hide errors accumulated during degradation.
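The staged return can be sketched as a function that maps the current signals to a rollback stage. The function name, its parameters, and the returned structure are hypothetical; the sketch assumes it is called only after stabilization has begun, and that the audit unlocks the full return only once the conditions for the second step also hold.

```python
def rollback_stage(stable_windows: int, silent_p0_growing: bool,
                   queue_normal: bool, audit_passed: bool) -> dict:
    """Return the cumulative share of local-coder quota restored and its scope."""
    # Step 1: 30% of the quota, for triage and classification only.
    stage = {"restored_share": 0.3, "phases": ["triage", "classification"]}
    if stable_windows >= 3 and not silent_p0_growing and queue_normal:
        # Step 2: another 30% goes to diagnosis and plan.
        stage = {"restored_share": 0.6,
                 "phases": ["triage", "classification", "diagnosis", "plan"]}
        if audit_passed:
            # Step 3: full return only after the PostToolUse audit.
            stage = {"restored_share": 1.0, "phases": "all"}
    return stage


print(rollback_stage(stable_windows=1, silent_p0_growing=False,
                     queue_normal=False, audit_passed=False))
print(rollback_stage(stable_windows=3, silent_p0_growing=False,
                     queue_normal=True, audit_passed=True))
```

A passed audit alone does not skip a step in this sketch: if silent P0s are still growing, the function stays at 30%.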

In operations, this model is convenient to verify as a daily budget drill. The team takes yesterday's alert flow, plays it through the current budget_network.yaml, and artificially disables local-coder for 15, 30, and 45 minutes. Then four metrics are compared: MTTR, escalation share, manual queue volume, and minimum token_health.

Signals for analysis:

if frontier-reviewer starts serving non-critical tasks during a short failure, failover is too broad;
if the manual channel opens already at a moderate queue, the SLA thresholds are too sensitive.

The goal of the run is to find a profile where degradation is predictable, not invisible until the moment of quota exhaustion.

Summary

A token budget becomes a manageable resource only when five elements are linked into a single control loop: SDD phases, model tiering, SLA thresholds, failover, and validation. In this loop, local-coder provides throughput for mass routine; frontier-reviewer protects disputed and high-risk decisions; emergency mode limits automation when risk grows; validation.md does not allow improving MTTR at the cost of hidden P0 and broken post-analysis. Such a scheme shows not only current spend, but also the order of degradation: which phases will starve first, which tasks must move to the expensive tier, and when manual mode is safer than further automation. Next, this loop will move to Goodhart metrics and paired guard metrics.

Artifacts and Readiness Criteria

Artifact	Ready when
Local run of `book2/examples/budget-keeper`	quota sum corresponds to 10M tokens and the specified local/frontier split
out/budget_plan.json, out/fail_result.json, out/fail_15m_result.json	the 45-minute scenario gives failover_to_frontier==5, degraded_queue==15, manual_queue_after_120s==15, token_health_min>=0.5; the 15-minute scenario keeps token_health_min>=0.7; out/ is not committed
Entry in precedents.md or capstone/budget-note.md	explains what happens when local-coder fails, which tasks go to frontier-reviewer, and which token_health_min threshold protects the budget

The full track adds .specify/memory/budget_network.yaml with phases and SLAs, budget_plan.json after compile, fail_scenario_B.json, validation.md with the anti-Goodhart budget gate, and validation_health.json. Consider it ready if emergency mode preserves frontier for P0/P1 and opens the manual channel, the anti-Goodhart gate blocks savings at the cost of silent_p0 or audit breaks, and the budget simulation is included in a regular drill or CI.

Practice

cd book2/examples/budget-keeper && python3 scripts/compile.py --budget-spec specs/budget_network.yaml --out out/budget_plan.json — *expectation: daily_budget_tokens == 10_000_000, local tier sum 9M, frontier 1M (90/10).*

python3 scripts/simulate.py --plan out/budget_plan.json --scenario scenarios/fail_local_45m.json --out out/fail_result.json && python3 scripts/inspect.py --result out/fail_result.json --query "failover_to_frontier==5 && degraded_queue==15 && manual_queue_after_120s==15 && token_health_min>=0.5" — expectation: code 0, four conditions met simultaneously.
Move five lines into capstone/budget-note.md: risk, effect, simulated_floor, alert_threshold, decision. *Expectation: the format matches the reference from the section "How this lands in capstone/"; the full budget_plan.json does not land in capstone/.*

Review Questions

Why should failover not send the entire queue to the expensive tier?
Which metrics reveal degradation of budget routing?
When is manual mode safer than continuing automation?
The local model was down for 45 minutes at peak time. You have 60% of the daily budget, but MTTR is creeping up. What will you switch: the model, the routing policy, or the triage mode?