
Applied Part 8. File Arbitration of Disputed Changes: Roles, Verdicts, and Precedents

Status: Frontier. File arbitration has three voting roles (Verifier, Implementor, Safety) working under a Coordinator protocol, with results recorded in judgment.md (dispute resolution) and precedents.md (precedents). The technique is used in practice but is not built into Qwen Code. Compatibility and limitations are in appendix-b-qwen-code-compatibility.md.

For the educational walkthrough, it is sufficient to obtain judgment.md from the runnable example and understand what evidence the Verifier accepts. Tier rotation, external Coordinator, and model matrix belong to the full production track.

The boundary with Chapter 4 is as follows: the LLM duel answers the question "was a minimal counterexample found and how was it closed." File arbitration answers a different question: "what official verdict does the team of roles adopt, what evidence is deemed admissible, and what precedent remains for future disputes."

File arbitration does not find all defects itself. It accepts the results of other mechanisms as evidence: counterexample from the duel, Spec CI report, anti-Goodhart invariant, readiness gate, or mutant record. If evidence is not in the file, the Coordinator must not turn an agent's impression into a verdict.

Team review from Part 16 of the first volume is the basic scheme: a human reviewer checks a pull request against a package of evidence. Here the same scheme is raised one level. Instead of one person, several roles do the work: Verifier, Implementor, and Safety vote; the Coordinator keeps the protocol and does not vote. Instead of PR comments, there are two files: judgment.md (the dispute resolution journal) and precedents.md (the database of recurring disputes). The foundation remains unchanged: verdicts are checked against facts from validation.md from Part 9, and outcomes revise the roadmap just as in replanning from Part 10.

Before Reading

  • Foundation from the first volume: Part 16 establishes team review, Part 10 shows replanning after facts.
  • Local educational case: autoscale_200pct, because a duel, invariants, and judgment.md already exist for it.
  • Trace for capstone/: one verdict APPROVE, DENY, or DEFERRED with evidence_ref for high_memory_usage.
  • Key terms for the first pass: file arbitration and judgment.md. Roles (Verifier/Implementor/Safety + Coordinator-protocol) were already introduced in Part 3 — here they receive procedural formalization.
  • What to defer: model matrix, external Coordinator, and persistent precedents.md base.

Goal

You will learn to conduct file arbitration of a disputed change. This is a collective review of one change by multiple roles, where the result is recorded in files rather than in chat. The goal is to design a scheme in which one specification is checked reproducibly even with rotation of roles, models, and strictness modes.

Role rotation is running the same specification through different Implementor/Verifier pairs (local or strong agent in each position). It is needed so that the verdict does not depend on a specific model.

The practical gain is simple: a dispute ceases to be an exchange of opinions in chat and becomes a chain of evidence. The Coordinator runs the process. The Implementor proposes changes. The Verifier accepts or rejects them by formal criteria. Safety gets veto on critical_risk. The outcome is fixed in project artifacts.

This approach continues the SDD logic: the specification remains the source of truth for system behavior, not an optional description of developer intent (GitHub Spec Kit).

The engineering name of the mechanism is a file protocol for arbitration of disputed changes by multiple roles. The name tribunal remains a technical label for the runnable example directory, not a separate Qwen Code product.

Minimal Educational Scenario

Educational Case

The same autoscale_200pct, but now you need not only a counterexample, but an official protocol: duel, anti-Goodhart invariants, and final judgment.md.

Preparation

  • book2/examples/tribunal/specs/autoscale_spec.yaml.
  • book2/examples/tribunal/cases/.
  • book2/examples/tribunal/metrics/validation_metrics.json.
  • Scripts run_duel.py, check_invariants.py, write_judgment.py.

Steps

  1. cd book2/examples/tribunal. Expectation: you are in the runnable example directory.
  2. python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json. Expectation: the duel recorded verdicts on cases.
  3. python3 scripts/check_invariants.py --metrics metrics/validation_metrics.json --out out/invariants.json. Expectation: anti-Goodhart invariants checked separately from the duel.
  4. python3 scripts/write_judgment.py --duel-out out/duel.json --invariants-out out/invariants.json --to out/judgment.md. Expectation: the final markdown protocol appeared.
  5. Open out/judgment.md and transfer one repeatable conflict to precedents.md: condition, evidence, decision, applicability.

Control Fact

judgment.md contains not only PASS/FAIL, but also the basis: which case was checked, which invariant triggered, what the Implementor must do upon a repeated dispute. Without this, file arbitration remains the duel from Chapter 4.

How This Gets into capstone/

Transfer to capstone/judgment.md one verdict, reason, evidence_ref, and next verifiable step. If the conflict is repeatable, add a short precedent record. Do not transfer the entire out/duel.json if it can be reproduced by a command from the runnable example.

Minimal fragment:

verdict: DEFERRED
reason: "readiness passes by score, but stateful blocker has no backup evidence"
evidence_ref: "fixtures/readiness_block_stateful.json"
next_step: "add backup_verified evidence or keep remediation manual"
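Before a fragment like this goes into capstone/judgment.md, it can be checked mechanically. The following is a minimal sketch, not the behavior of write_judgment.py: the function name check_verdict and the rule that all four fields must be non-empty are assumptions made for illustration.

```python
# Hypothetical pre-commit check for one verdict record.
# Field names come from the fragment above; the rules are an assumption.
ALLOWED_VERDICTS = {"APPROVE", "DENY", "DEFERRED"}
REQUIRED_FIELDS = ("verdict", "reason", "evidence_ref", "next_step")


def check_verdict(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not str(record.get(name, "")).strip()
    ]
    verdict = record.get("verdict")
    if verdict and verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict}")
    return problems


record = {
    "verdict": "DEFERRED",
    "reason": "readiness passes by score, but stateful blocker has no backup evidence",
    "evidence_ref": "fixtures/readiness_block_stateful.json",
    "next_step": "add backup_verified evidence or keep remediation manual",
}
print(check_verdict(record))
```

A record that fails this check should stay out of judgment.md until the missing field is filled from a file, not from memory of the chat.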

Reviewable Trace

Keep judgment.md or an excerpt in precedents.md if they became part of the educational evidence package. Local out/duel.json and out/invariants.json can be left outside the repository if they can be reproduced by a command.

Key Ideas

The stage contract of file arbitration begins with the Coordinator role. They open the session, set the round order, manage the dispute queue, and are responsible for the official protocol in judgment.md. The judgment.md itself is the session decisions journal: which round passed, which diff was reviewed, which evidence was deemed sufficient.

The minimal cycle is as follows: the Coordinator receives the original specification, breaks it down into checkable files, assigns the Implementor and Verifier roles (plus Safety for critical risks), and prohibits advancing to the next stage until the previous stage's result is recorded. The full charter of roles with voting weights (vote_weight), quorum, and veto conditions is in Part 3. Here we are interested in how these same roles work around one specific disputed change.

What this means in practice. Take a short verdict from chat:

Bad:

> "The Verifier rejected the Implementor's proposal."

Problem: no basis and no evidence reference (evidence_ref), the dispute is irreproducible. The next reviewer will be unable to challenge or support the verdict.

Good:

> verdict=DENY, reason=violates_invariant:silent_p0, evidence_ref=tests/regression_001.json, next_step=Implementor adds severity check before auto-escalation

evidence_ref here is the same evidence marker as in Part 1: a reference to a specific place in a file, not a retelling. silent_p0 is the invariant "no P0 incident may close without escalation." If the Verifier returns DENY, do not close the dispute manually. Demand a formal basis from the party: a reference to a specific requirement, hook log, schema violation, or unproven scenario. Thus judgment.md becomes not a report of "who won," but a journal of procedural state.
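The difference between the two verdicts can be made machine-checkable. The sketch below assumes the one-line key=value form shown in the "Good" example; parse_verdict_line and is_reproducible are hypothetical helper names, not part of the runnable example.

```python
import re

KEYS = ("verdict", "reason", "evidence_ref", "next_step")
# Split only on commas that are followed by a known key, so free text
# inside next_step may itself contain commas.
SPLIT_RE = re.compile(r",\s*(?=(?:" + "|".join(KEYS) + r")=)")


def parse_verdict_line(line: str) -> dict:
    """Turn 'verdict=..., reason=..., ...' into a dict; ignore anything else."""
    record = {}
    for part in SPLIT_RE.split(line.strip()):
        key, sep, value = part.partition("=")
        if sep and key.strip() in KEYS:
            record[key.strip()] = value.strip()
    return record


def is_reproducible(record: dict) -> bool:
    """A verdict can be re-argued only if it names a reason and an evidence file."""
    return bool(record.get("reason")) and bool(record.get("evidence_ref"))


GOOD = ("verdict=DENY, reason=violates_invariant:silent_p0, "
        "evidence_ref=tests/regression_001.json, "
        "next_step=Implementor adds severity check before auto-escalation")
BAD = "The Verifier rejected the Implementor's proposal."

print(is_reproducible(parse_verdict_line(GOOD)))
print(is_reproducible(parse_verdict_line(BAD)))
```

The chat-style sentence yields an empty record, which is exactly why the next reviewer has nothing to challenge or support.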

In Qwen Code, such arbitration is not one built-in command. The minimal implementation is assembled from /review, headless qwen -p runs, project scripts, and, if needed, user commands. Save all verdicts to files so that another engineer can repeat the dispute without chat history. Detailed mapping of roles and built-in CLI capabilities is in [appendix-b-qwen-code-compatibility.md](appendix-b-qwen-code-compatibility.md).

> [runnable] — the runnable example of file arbitration is in [examples/tribunal/](examples/tribunal/) (see [examples/tribunal/README.md](examples/tribunal/README.md)). The real run is assembled from three scripts:

  • run_duel.py writes the duel JSON result;
  • check_invariants.py checks anti-Goodhart thresholds (a rule prohibiting improving one metric at the expense of degrading others);
  • write_judgment.py assembles the final judgment.md from the two previous outputs.

Run from the book2/examples/tribunal directory.

cd book2/examples/tribunal
python3 scripts/run_duel.py \
  --spec specs/autoscale_spec.yaml \
  --cases cases/ \
  --out out/duel.json

python3 scripts/check_invariants.py \
  --metrics metrics/validation_metrics.json \
  --out out/invariants.json

python3 scripts/write_judgment.py \
  --duel-out out/duel.json \
  --invariants-out out/invariants.json \
  --to out/judgment.md

run_duel.py reads the specification and runs counterexamples from cases/. check_invariants.py checks actual metrics against thresholds. write_judgment.py assembles the final markdown protocol. There are no external "coordinators" or "verifiers" as separate processes. In production, such arbitration is assembled from the built-in /review command, headless qwen -p calls with different roles in the prompt, and project scripts — each with its own artifact on disk.

A/B comparison of one specification between different Implementor/Verifier configurations shows how much the verdict depends on the agent of a given tier. Model tier here is the model level: cheap local (local-coder) or strong cloud (frontier-reviewer). The same rate_limit_spec.md is run through several pairs:

  • C1: cheap local Implementor vs. strong Verifier;
  • C2: strong Implementor vs. local Verifier;
  • C3: symmetric local pair;
  • C4: symmetric expensive pair.

If C1 and C4 give PASS, and C2 consistently returns FAIL, this is not a signal for immediate model replacement. First check the evidentiary framework: the Verifier with a weaker tier may have missed the implicit link between the request limit, cooldown window, and safe queue state.

The test is useful precisely because it keeps the specification unchanged and only changes the role configuration.

The educational runnable analog is in [examples/tribunal/matrix/](examples/tribunal/matrix/README.md): the same judge() from the duel is run through four tier pairs described in matrix/tiers.json. The config models the gap between evidence forms: local-coder gives a short diagnostic_code (minimal_form), frontier-reviewer gives an evidence_by_invariant structure (extended_form), and the weak Verifier recognizes only minimal_form. Therefore pair C2 (strong Implementor + weak Verifier) consistently fails while the other three pass. This is the educational signal: tier_dependent_spec.
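The evidence-form gap can be reproduced in a few lines. This is a simplified model under stated assumptions: the dictionaries below stand in for matrix/tiers.json, whose real layout is not shown here, and the signal name tier_independent is invented for the case where all pairs agree.

```python
# Hypothetical stand-in for matrix/tiers.json: which evidence form each tier
# produces as Implementor and which forms it recognizes as Verifier.
IMPLEMENTOR_FORM = {
    "local-coder": "minimal_form",
    "frontier-reviewer": "extended_form",
}
VERIFIER_ACCEPTS = {
    "local-coder": {"minimal_form"},
    "frontier-reviewer": {"minimal_form", "extended_form"},
}
# (Implementor tier, Verifier tier) for the four pairs from the text.
PAIRS = {
    "C1": ("local-coder", "frontier-reviewer"),
    "C2": ("frontier-reviewer", "local-coder"),
    "C3": ("local-coder", "local-coder"),
    "C4": ("frontier-reviewer", "frontier-reviewer"),
}


def run_matrix() -> dict:
    """Judge the same unchanged spec under every pair and summarize the spread."""
    verdicts = {}
    for name, (implementor, verifier) in PAIRS.items():
        form = IMPLEMENTOR_FORM[implementor]
        verdicts[name] = "PASS" if form in VERIFIER_ACCEPTS[verifier] else "FAIL"
    depends_on_tier = len(set(verdicts.values())) > 1
    signal = "tier_dependent_spec" if depends_on_tier else "tier_independent"
    return {"pairs": verdicts, "summary": {"signal": signal}}


print(run_matrix())
```

In this model only C2 fails, because the strong Implementor emits a form the weak Verifier cannot read. The remedy suggested by the model is to move the proof into a form every tier recognizes, not to swap the Verifier.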

cd book2/examples/tribunal
python3 scripts/matrix.py \
  --spec specs/autoscale_spec.yaml \
  --cases cases/ \
  --tiers matrix/tiers.json \
  --out out/matrix.json

# control: summary.signal != "tier_dependent_spec" is a reason to explain the discrepancy in validation.md or move it to precedents.md

On a production project, this output is backed by scripts/tribunal_matrix.py. It replaces judge() with real qwen -p calls with different roles in the prompt, but the artifact interface (summary.signal, pairs[*].verdict, pairs[*].cases[*].reasons) remains the same. When the textbook matrix detects a discrepancy, it exits with code 1, and smoke_all.sh wraps this run in expect_fail: the discrepancy here is the target educational signal, not a failure.
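Because the artifact interface is stable, a reviewer can inspect the matrix output without knowing which backend produced it. The sketch below relies only on the three paths named above (summary.signal, pairs[*].verdict, pairs[*].cases[*].reasons); the id field, the reason string, and the sample data are assumptions for illustration.

```python
from collections import Counter


def diverging_pairs(matrix: dict) -> list[dict]:
    """List the pairs whose verdict differs from the majority, with their reasons."""
    verdicts = [pair["verdict"] for pair in matrix["pairs"]]
    majority = Counter(verdicts).most_common(1)[0][0]
    result = []
    for pair in matrix["pairs"]:
        if pair["verdict"] != majority:
            reasons = sorted(
                {reason for case in pair.get("cases", []) for reason in case.get("reasons", [])}
            )
            result.append({"id": pair.get("id"), "verdict": pair["verdict"], "reasons": reasons})
    return result


# Hypothetical sample shaped like out/matrix.json.
sample = {
    "summary": {"signal": "tier_dependent_spec"},
    "pairs": [
        {"id": "C1", "verdict": "PASS", "cases": [{"reasons": []}]},
        {"id": "C2", "verdict": "FAIL",
         "cases": [{"reasons": ["extended_form_not_recognized"]}]},
        {"id": "C3", "verdict": "PASS", "cases": [{"reasons": []}]},
        {"id": "C4", "verdict": "PASS", "cases": [{"reasons": []}]},
    ],
}
print(diverging_pairs(sample))
```

The list of diverging pairs and their reasons is the raw material for the explanation that belongs in validation.md.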

Formulate strict evidentiary requirements for the Verifier, not a general request to "check the solution." There are three: hook logs, JSON Schema compliance, and formal Given/When/Then scenarios.

PreToolUse logs show which tool calls were allowed or blocked before execution. PostToolUse logs fix the actual result, exit code, diff checksum, and event reference in evidence.
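A PostToolUse record of this kind might be produced as follows. This is a sketch under assumptions: the field names (hook, tool, exit_code, diff_sha256, event_ref, recorded_at) are illustrative and do not describe a hook format defined by Qwen Code.

```python
import hashlib
import json
from datetime import datetime, timezone


def evidence_event(tool: str, exit_code: int, diff_text: str, event_ref: str) -> str:
    """Build one NDJSON line that fixes the result of a tool call."""
    record = {
        "hook": "PostToolUse",
        "tool": tool,
        "exit_code": exit_code,
        # The checksum lets the Verifier confirm the reviewed diff is the applied one.
        "diff_sha256": hashlib.sha256(diff_text.encode("utf-8")).hexdigest(),
        "event_ref": event_ref,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


diff = "-burst_window_sec: 30\n+burst_window_sec: 60\n"
line = evidence_event("edit_file", 0, diff, "evidence/rate_limit.ndjson#1")
print(line)
```

Storing the checksum rather than the diff itself keeps the log small while still letting anyone detect a diff that was changed after review.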

JSON Schema closes the class of errors where the agent generates convincing text but violates the data contract. Examples of such violations:

  • missing required field;
  • parameter type changes from integer to string;
  • limit set outside the allowed range.
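The three violation classes above can be shown with a small contract check. This is a hand-rolled stand-in for a JSON Schema validator, written only to stay dependency-free; the 1..300 range for burst_window_sec is an assumed value, not one taken from the specification.

```python
# Hypothetical data contract for a rate-limit event.
CONTRACT = {
    "tenant_id": {"type": str},
    "limit_reason": {"type": str},
    "expires_at": {"type": str},
    "burst_window_sec": {"type": int, "min": 1, "max": 300},  # assumed range
}


def violations(event: dict) -> list[str]:
    """Return contract violations for one event; an empty list means it conforms."""
    found = []
    for field, rule in CONTRACT.items():
        if field not in event:
            found.append(f"{field}: missing required field")
            continue
        value = event[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            found.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            found.append(f"{field}: outside {rule['min']}..{rule['max']}")
    return found


valid = {"tenant_id": "tenant_a", "limit_reason": "burst",
         "expires_at": "2024-01-01T00:01:00Z", "burst_window_sec": 60}
print(violations(valid))
print(violations({**valid, "burst_window_sec": "60"}))
print(violations({**valid, "burst_window_sec": 900}))
```

Each violation names a field, so it can be cited in evidence_ref without retelling what went wrong.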

Given/When/Then scenarios add causal checking: under what initial conditions is the action allowed, what event triggers it, and what observable result must confirm safety.

flowchart TD
    COORD[Coordinator: record in requirements]
    IMPL[Implementor: patch_plan and hooks]
    PRE[PreToolUse: block dangerous actions]
    POST[PostToolUse: evidence and hash]
    VER["Verifier: check against validation, verdict"]
    SAFETY["Safety: veto on critical_risk"]
    DISPUTE["Dispute: diff in requirements/hooks/validation"]
    COORD --> IMPL
    IMPL --> PRE
    PRE --> POST
    POST --> VER
    VER --> SAFETY
    SAFETY --> DISPUTE
    DISPUTE --> COORD

Conflict is resolved only through diffs in requirements.md, hooks.md, and validation.md. Any hidden edits in the chat dialog are excluded from the evidentiary base.

If the Implementor believes the rejection is erroneous, they do not rewrite the explanation in free form. Instead, they add a verifiable change: refine the requirement, strengthen the hook, or extend the validation scenario.

The Coordinator accepts a repeated round only after the diff is linked to the original specification and to a specific evidence event. Otherwise the dispute turns into an irreproducible private history. Upon repeated conflict, transfer the decision to precedents.md — a precedent journal where exactly five fields are fixed for each case:

  • case_id — stable precedent identifier;
  • verdict — arbitration rule outcome (APPROVE / DENY / DEFERRED);
  • evidence_ref — reference to the diff, hook log, schema, or scenario that proved the verdict;
  • applies_to — precedent applicability boundaries (tiers, strictness modes, domains);
  • next_check — condition under which the precedent must be reviewed.

- case_id: PREC-021
  verdict: DENY
  evidence_ref: "tests/rate_limit_tenant_isolation.json"
  applies_to: "rate-limit without tenant_id deduplication, all tiers, strict_guardrails_prompt"
  next_check: "burst_window_sec increases above 60 or tenant_id isolation evidence appears"

The anti-Goodhart rule protects file arbitration from a situation where one metric improves at the expense of system degradation. A better MTTR (mean time to recovery) cannot justify more false escalations, silent failures, or rollback-flapping. This is true even if an individual round shows a quick PASS.

Therefore set hard stop conditions in validation.md:

  • false_escalation_rate <= 0.05;
  • rollback_flapping < 3/h;
  • silent_p0_ratio == 0.

Exceeding any threshold moves the verdict to FAIL regardless of time gain. This turns Goodhart protection from a moral warning into an enforceable arbitration rule.
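As an enforceable rule, the three stop conditions reduce to a short function. This is a sketch and not the code of check_invariants.py: the metric key rollback_flapping_per_hour is an assumed name for the "3/h" threshold, and the sample numbers are invented.

```python
def goodhart_violations(metrics: dict) -> list[str]:
    """Return the stop conditions that the measured metrics break."""
    broken = []
    if metrics["false_escalation_rate"] > 0.05:
        broken.append("false_escalation_rate > 0.05")
    if metrics["rollback_flapping_per_hour"] >= 3:
        broken.append("rollback_flapping >= 3/h")
    if metrics["silent_p0_ratio"] != 0:
        broken.append("silent_p0_ratio != 0")
    return broken


def verdict(metrics: dict) -> str:
    """Any broken threshold yields FAIL, regardless of the MTTR gain."""
    return "FAIL" if goodhart_violations(metrics) else "PASS"


# Invented numbers: MTTR improved, but false escalations more than doubled.
fast_plan = {
    "mttr_min": 2,
    "false_escalation_rate": 0.12,
    "rollback_flapping_per_hour": 1,
    "silent_p0_ratio": 0,
}
print(verdict(fast_plan), goodhart_violations(fast_plan))
```

Note that mttr_min is never read by the check: the time gain cannot offset a broken threshold because it is not an input to the verdict at all.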

Examples and Application

Example: a specification for automatic rate limiting in an API gateway requires temporarily restricting a specific client (tenant) during a request burst, but not blocking the entire service and not escalating every burst as P0.

The Implementor proposes a patch:

  • add tenant_id to the deduplication key;
  • introduce a burst_window_sec=60 window;
  • write an event to evidence/rate_limit.ndjson after each limit application.

The Verifier makes a decision only in the presence of three pieces of evidence:

  • JSON Schema requires tenant_id, limit_reason, expires_at;
  • PreToolUse prohibits changing the global limit without a specific client scope;
  • Given/When/Then shows that one client's burst does not reduce the neighboring client's quota.

If any of this evidence is missing, the Verifier returns DENY, even if the patch looks technically plausible.

In an A/B round, the configuration Implementor=local-coder, Verifier=frontier-reviewer may pass. The strong Verifier recognizes the sufficient linkage between schema, hook logs, and scenarios.

The reverse configuration Implementor=frontier-reviewer, Verifier=local-coder may reject the same approach. This happens if the safety proof is hidden in the Implementor's lengthy reasoning rather than extracted into validation.md.

This does not mean one agent is "right" and the other "wrong." The arbitration reveals that the requirement is not sufficiently portable between model tiers. The fix must appear as a diff — for example, adding the scenario Given tenant A exceeds burst limit / When tenant B sends normal traffic / Then tenant B quota remains unchanged.

Scenario: Tenant isolation during load burst
  Given tenant A sends 800 req/min
  And tenant B sends 40 req/min
  When rate-limit hook applies restriction
  Then tenant A receives temporary limit for 60 seconds
  And tenant B keeps base quota
  And evidence contains tenant_id, limit_reason, and expires_at
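The scenario becomes evidence only when it runs. Below is a minimal executable version against a toy limiter; apply_rate_limit and all quota constants are hypothetical and exist only to show the Given/When/Then structure as assertions.

```python
BASE_QUOTA = 600        # hypothetical per-tenant quota, req/min
BURST_THRESHOLD = 600   # hypothetical burst trigger, req/min
LIMITED_QUOTA = 100     # hypothetical temporary quota, req/min
BURST_WINDOW_SEC = 60   # window length from the Implementor's patch


def apply_rate_limit(traffic: dict, now: int) -> tuple[dict, list]:
    """Limit only the tenants that exceed the threshold and log one event each."""
    quotas = {tenant_id: BASE_QUOTA for tenant_id in traffic}
    evidence = []
    for tenant_id, req_per_min in traffic.items():
        if req_per_min > BURST_THRESHOLD:
            quotas[tenant_id] = LIMITED_QUOTA
            evidence.append({
                "tenant_id": tenant_id,
                "limit_reason": "burst",
                "expires_at": now + BURST_WINDOW_SEC,
            })
    return quotas, evidence


# Given tenant A sends 800 req/min and tenant B sends 40 req/min
traffic = {"tenant_a": 800, "tenant_b": 40}
# When the rate-limit hook applies the restriction
quotas, evidence = apply_rate_limit(traffic, now=1_000)
# Then tenant A is limited for 60 seconds and tenant B keeps the base quota
assert quotas["tenant_a"] == LIMITED_QUOTA
assert evidence[0]["expires_at"] - 1_000 == BURST_WINDOW_SEC
assert quotas["tenant_b"] == BASE_QUOTA
# And the evidence contains tenant_id, limit_reason, and expires_at
assert {"tenant_id", "limit_reason", "expires_at"} <= set(evidence[0])
print("tenant isolation scenario holds")
```

A Verifier of any tier can rerun these assertions, which is what makes the proof portable between model tiers.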

The stress test against the Goodhart trap is conducted as a separate mini-analysis. The Implementor is tasked with reducing MTTR from 6 to 2 minutes and proposes aggressive auto-escalation on the first alert event.

Make the Verifier check not only speed but also side effects:

  • false escalation rate;
  • rollback-flapping frequency (repeated rollbacks in a short window);
  • repeated notification volume;
  • presence of a cooldown window.

If the fast plan increases false_escalation_rate above the allowed threshold, the Coordinator records FAIL(reason=metric corruption) in judgment.md and demands a validation.md fix, not a cosmetic chat explanation. Thus arbitration learns to distinguish real improvement from optimizing one number at the cost of operational resilience.

Summary

File arbitration makes dispute resolution reproducible. The Coordinator manages stages and protocol. The Implementor changes only controlled artifacts. The Verifier demands hook logs, JSON Schema, and Given/When/Then. All conflicts go through diffs in requirements.md, hooks.md, validation.md and, if necessary, enter precedents.md.

Role rotation turns different tiered agents into a tool for checking specification robustness. If the verdict changes when the Implementor/Verifier pair is swapped, strengthen evidence rather than relying on a specific model's authority.

The anti-Goodhart rule closes the loop: it prohibits accepting quick decisions that improve MTTR at the cost of false escalations, silent failures, or rollback-flapping. Next, this arbitration loop will move into tier routing economics and token distribution between roles.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| judgment.md (or its excerpt) | the verdict has a reason and evidence_ref to a diff, hook log, schema, or Given/When/Then, not to a retelling |
| out/duel.json and out/invariants.json | locally reproducible; the runnable example in book2/examples/tribunal passes smoke-pass |
| precedents.md record | created if the conflict is repeatable; otherwise skipped |

The full track adds judgment.md with rounds of voting roles (Verifier/Implementor/Safety) under the Coordinator protocol, a verdict matrix by tier pairs for one unchanged specification, and anti-Goodhart invariants as a mandatory part of arbitration. Consider it ready if tier pair verdict discrepancies are explained by a diff in validation.md, anti-Goodhart thresholds block a quick but harmful plan, and recurring conflicts are entered in precedents.md.

Practice

  1. cd book2/examples/tribunal && python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases --out out/duel.json && python3 scripts/check_invariants.py --metrics metrics/validation_metrics.json --out out/invariants.json && python3 scripts/write_judgment.py --duel-out out/duel.json --invariants-out out/invariants.json --to out/judgment.md — *expectation: in out/judgment.md the final verdict with evidence_ref to a specific case.*
  2. Record the evidence that the Verifier is entitled to accept: diff, hook log, schema, Given/When/Then. *Expectation: in out/judgment.md the evidence_ref field points to a file, not a retelling.*
  3. Transfer a recurring conflict to capstone/precedents.md using this stub (minimum fields):
   - case_id: "PREC-001"
     verdict: "DENY"
     evidence_ref: "tests/regression_001.json"
     applies_to: "auto-remediation without full audit_trace"
     next_check: "repeat duel when manual_review_floor changes"

*Expectation: the next similar dispute is resolved by reference to PREC-001, not by a repeated round.*

Review Questions

  1. How does the Coordinator differ from the Verifier and Safety, and why do they not vote on equal terms?
  2. Why must a dispute be resolved by diffs rather than by rewriting?
  3. What does a verdict discrepancy upon changing tiered agents show?
  4. The Implementor and Verifier do not reach a decision for three rounds in a row, the incident queue grows. What stop condition and what artifact will you record before handing the dispute to a human?