Reading: Applied Part 8. File Arbitration of Disputed Changes: Roles, Verdicts, and Precedents

Applied Part 8. File-Based Arbitration of a Disputed Change: Roles, Verdicts, and Precedents

Status: Frontier. File-based arbitration is an applied technique, but it is not built into Qwen Code. In it, the Verifier, Implementor, and Safety roles vote under a protocol run by the Coordinator, and the results are recorded in judgment.md (the dispute decision) and precedents.md (the precedents). Compatibility and limitations are in appendix-b-qwen-code-compatibility.md.

For a learning run-through, it is enough to obtain judgment.md from the runnable example and understand which pieces of evidence the Verifier accepts. Role rotation, an external Coordinator, and a model matrix belong to the full production track.

The boundary with chapter 4 is this: the LLM duel answers the question "was a minimal counterexample found and what closed it." File-based arbitration answers a different question: "what official verdict does the team of roles adopt, what evidence is deemed admissible, and what precedent will remain for future disputes."

File-based arbitration does not search for all defects on its own. It accepts the results of other mechanisms as evidence: a counterexample from the duel, a Spec CI report, an anti-Goodhart invariant, a readiness gate, or a mutant record. If the evidence is not in a file, the Coordinator must not turn an agent's impression into a verdict.

The team review from part 16 of the first volume is the basic scheme: a human reviewer checks a pull request against a bundle of evidence. Here the same scheme is raised one level higher. Instead of a single person, roles work: the Verifier, the Implementor, and Safety vote; the Coordinator runs the protocol and does not vote. Instead of comments in a PR, there are two files: judgment.md (dispute decision log) and precedents.md (a base of recurring disputes). The backbone does not change: verdicts are cross-checked against the facts of validation.md from part 9, and the outcomes revise the roadmap in the same way as during replanning from part 10.

Before reading

The backbone from the first volume: part 16 sets out team review, part 10 shows replanning after the facts.

Local learning case: autoscale_200pct, because it already has a duel, invariants, and judgment.md.
Trace for capstone/: one verdict APPROVE, DENY, or DEFERRED with evidence_ref for high_memory_usage.
Key first-pass terms: file-based arbitration and judgment.md. The roles (Verifier/Implementor/Safety + Coordinator as note-taker) were introduced in part 3 — here they receive their procedural form.
What to defer: the model matrix, an external Coordinator, and a permanent precedents.md base.

Goal

You will learn to run file-based arbitration of a disputed change. This is a collective review of a single change by several roles, where the result is recorded in files rather than in chat. The goal is to design a scheme in which one specification is checked reproducibly even under rotation of roles, models, and strictness modes.

Role rotation is running the same specification through different Implementor/Verifier pairs (a local or a strong agent in each position). It is needed so that the verdict does not depend on a specific model.

The practical win is simple: a dispute stops being an exchange of opinions in chat and turns into a chain of evidence. The Coordinator runs the process. The Implementor proposes changes. The Verifier accepts or rejects them by formal criteria. Safety holds veto power at critical_risk. The outcome is recorded in the project's artifacts.

This approach continues the SDD logic: the specification remains the source of truth for system behavior, not an optional description of developer intent (GitHub Spec Kit).

The engineering name of the mechanism is a file-based multi-role arbitration protocol for a disputed change. The name tribunal remains a technical label for the runnable example's directory, not a separate Qwen Code product.

Minimal learning scenario

Learning case

The same autoscale_200pct, but now we need not just a counterexample but an official protocol: the duel, anti-Goodhart invariants, and the final judgment.md.

Preparation

book2/examples/tribunal/specs/autoscale_spec.yaml.
book2/examples/tribunal/cases/.
book2/examples/tribunal/metrics/validation_metrics.json.
Scripts run_duel.py, check_invariants.py, write_judgment.py.

Steps

cd book2/examples/tribunal. Expectation: you are in the runnable example's directory.
python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json. Expectation: the duel recorded verdicts for the cases.
python3 scripts/check_invariants.py --metrics metrics/validation_metrics.json --out out/invariants.json. Expectation: anti-Goodhart invariants are checked separately from the duel.
python3 scripts/write_judgment.py --duel-out out/duel.json --invariants-out out/invariants.json --to out/judgment.md. Expectation: the final markdown protocol has appeared.
Open out/judgment.md and move one recurring conflict into precedents.md: condition, evidence, decision, applicability.

Control fact

judgment.md contains not only PASS/FAIL but also the rationale: which case was checked, which invariant fired, what the Implementor must do in a repeat dispute. Without this, file-based arbitration remains the duel from chapter 4.

How this gets into `capstone/`

Move one verdict, the reason, evidence_ref, and the next verifiable step into capstone/judgment.md. If the conflict is recurring, add a short precedent entry. Do not move the entire out/duel.json if it can be reproduced by a command from the runnable example.

Minimal fragment:

verdict: DEFERRED
reason: "readiness passes by score, but stateful blocker has no backup evidence"
evidence_ref: "fixtures/readiness_block_stateful.json"
next_step: "add backup_verified evidence or keep remediation manual"

Reviewable trace

Keep judgment.md or an excerpt in precedents.md if they became part of the learning evidence bundle. Local out/duel.json and out/invariants.json can be left out of the repository if they can be reproduced by a command.

Key ideas

The contract for the stages of file-based arbitration starts with the Coordinator role. They open a session, set the order of rounds, keep the dispute queue, and are responsible for the official protocol in judgment.md. The judgment.md itself is the session's decision log: which round passed, which diff was considered, which pieces of evidence were deemed sufficient.

The minimal cycle is this: the Coordinator accepts the initial specification, breaks it down into checkable files, assigns the Implementor and Verifier roles (plus Safety for critical risks), and forbids moving to the next stage until the result of the previous one is recorded. The full role charter with voting weights (vote_weight), quorum, and veto conditions is in part 3. Here we are interested in how these same roles work around one specific disputed change.

What this means in practice. Take a short verdict from chat:

Bad: > "The Verifier rejected the Implementor's proposal."

Problem: there is no rationale and no evidence link (evidence_ref), the dispute is not reproducible. The next reviewer will not be able to challenge or support the verdict.

Good: > verdict=DENY, reason=violates_invariant:silent_p0, evidence_ref=tests/regression_001.json, next_step=Implementor adds a severity check before auto-escalation

evidence_ref here is the same evidence marker as in part 1: a reference to a specific place in a file, not a paraphrase. silent_p0 is the invariant "no P0 incident should close without escalation." If the Verifier returns DENY, do not close the dispute manually. Demand a formal rationale from the Verifier: a reference to a concrete requirement, a hook log, a schema violation, or an unproven scenario. This is how judgment.md stops being a "who won" report and becomes a log of procedural state.
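The difference between the "bad" and "good" verdicts above can be checked mechanically. The following is a minimal sketch, not part of the runnable example: the function name verdict_problems and the heuristic "a reference looks like a file path with an extension" are assumptions made for illustration.

```python
from pathlib import PurePosixPath

ALLOWED_VERDICTS = {"APPROVE", "DENY", "DEFERRED"}
REQUIRED_FIELDS = ("verdict", "reason", "evidence_ref", "next_step")


def verdict_problems(record: dict) -> list[str]:
    """Return the reasons why a verdict record is not reviewable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            problems.append(f"missing field: {field}")
    verdict = record.get("verdict")
    if verdict and verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict}")
    ref = str(record.get("evidence_ref", "")).strip()
    # Assumption: a usable reference looks like a file path with an
    # extension, not a sentence. A real project may allow other forms.
    if ref and (" " in ref or not PurePosixPath(ref).suffix):
        problems.append("evidence_ref does not look like a file path")
    return problems
```

An empty list means the record carries enough for the next reviewer to replay the dispute; anything else is a reason to return the record to its author before it reaches judgment.md.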

In Qwen Code, such arbitration is not a single built-in command. The minimal implementation is assembled from /review, headless runs of qwen -p, project scripts, and, if needed, custom commands. Save all verdicts to files so that another engineer can replay the dispute without chat history. The detailed mapping of roles to the CLI's built-in capabilities is in [appendix-b-qwen-code-compatibility.md](appendix-b-qwen-code-compatibility.md).

> [runnable] — the runnable example of file-based arbitration lives in [examples/tribunal/](examples/tribunal/) (see [examples/tribunal/README.md](examples/tribunal/README.md)). The real run is assembled from three scripts:

run_duel.py writes the duel's JSON result;
check_invariants.py checks anti-Goodhart thresholds (a rule that forbids improving one metric at the cost of degrading others);
write_judgment.py assembles the final judgment.md from the two previous outputs.

Run from the book2/examples/tribunal directory.

cd book2/examples/tribunal
python3 scripts/run_duel.py \
  --spec specs/autoscale_spec.yaml \
  --cases cases/ \
  --out out/duel.json

python3 scripts/check_invariants.py \
  --metrics metrics/validation_metrics.json \
  --out out/invariants.json

python3 scripts/write_judgment.py \
  --duel-out out/duel.json \
  --invariants-out out/invariants.json \
  --to out/judgment.md

run_duel.py reads the specification and runs the counterexamples from cases/. check_invariants.py compares the actual metrics against thresholds. write_judgment.py assembles the final markdown protocol. There are no external "coordinators" or "verifiers" as separate processes. In production, such arbitration is assembled from the built-in /review command, headless qwen -p calls with different roles in the prompt, and project scripts — each with its own artifact on disk.

A/B comparison of a single specification across different Implementor/Verifier configurations shows how much the verdict depends on the tier of the agent. A tier here is a model class: a cheap local one (local-coder) or a strong cloud one (frontier-reviewer). The same rate_limit_spec.md is run through several pairs:

C1: cheap local Implementor against strong Verifier;
C2: strong Implementor against local Verifier;
C3: symmetric local pair;
C4: symmetric expensive pair.

If C1 and C4 return PASS and C2 consistently returns FAIL, do not swap models right away. First, check the evidence framework: a Verifier with a weaker tier may not have seen the implicit link between the request limit, the cooldown window, and the safe queue state.

The test is useful precisely because it keeps the specification unchanged and varies only the role configuration.

The learning runnable analogue lives in [examples/tribunal/matrix/](examples/tribunal/matrix/README.md): the same judge() from the duel is run across four tier pairs described in matrix/tiers.json. The config models the gap between evidence forms — local-coder gives a short diagnostic_code (minimal_form), frontier-reviewer gives the evidence_by_invariant structure (extended_form), and a weak Verifier recognizes only minimal_form. That is why pair C2 (strong Implementor + weak Verifier) stably fails, while the other three pass — this is the learning signal: tier_dependent_spec.

cd book2/examples/tribunal
python3 scripts/matrix.py \
  --spec specs/autoscale_spec.yaml \
  --cases cases/ \
  --tiers matrix/tiers.json \
  --out out/matrix.json

Control: summary.signal != "tier_dependent_spec" is a reason to explain the discrepancy in validation.md or to record it in precedents.md.

In a production project, behind this output is scripts/tribunal_matrix.py — it swaps judge() for real qwen -p calls with different roles in the prompt, but the artifact interface (summary.signal, pairs[*].verdict, pairs[*].cases[*].reasons) stays the same. If the matrix in the textbook produces a discrepancy, the exit code is 1, and in smoke_all.sh it is wrapped in expect_fail: a discrepancy here is a target learning signal, not a failure.
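The logic of the learning matrix can be sketched in a few lines. This is an illustration of the idea only, not the contents of scripts/matrix.py or matrix/tiers.json: the TIERS layout, the run_matrix name, and the "consistent" signal label are assumptions, while the tier names, evidence forms, and the tier_dependent_spec signal come from the text above.

```python
# Hypothetical tier table: which evidence form each tier emits as an
# Implementor and which forms it can read as a Verifier.
TIERS = {
    "local-coder": {"emits": "minimal_form", "reads": {"minimal_form"}},
    "frontier-reviewer": {
        "emits": "extended_form",
        "reads": {"minimal_form", "extended_form"},
    },
}

# (Implementor, Verifier) pairs C1..C4 as listed in the lesson.
PAIRS = {
    "C1": ("local-coder", "frontier-reviewer"),
    "C2": ("frontier-reviewer", "local-coder"),
    "C3": ("local-coder", "local-coder"),
    "C4": ("frontier-reviewer", "frontier-reviewer"),
}


def run_matrix(pairs=PAIRS, tiers=TIERS) -> dict:
    """Give each pair a verdict and summarize whether the tiers agree."""
    verdicts = {}
    for name, (implementor, verifier) in pairs.items():
        readable = tiers[implementor]["emits"] in tiers[verifier]["reads"]
        verdicts[name] = "PASS" if readable else "FAIL"
    # "consistent" is an assumed label for the case with no discrepancy.
    agree = len(set(verdicts.values())) == 1
    signal = "consistent" if agree else "tier_dependent_spec"
    return {"pairs": verdicts, "summary": {"signal": signal}}
```

With this table only C2 fails, because the weak Verifier cannot read the extended form. The specification did not change between pairs, so the discrepancy points at the evidence form rather than at either model.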

Formulate strict evidence requirements for the Verifier, not a generic "check the solution" request. There are three of them: hook logs, JSON Schema compliance, and formal Given/When/Then scenarios.

PreToolUse logs show which tool calls were allowed or blocked before execution. PostToolUse logs record the actual result, the exit code, the diff checksum, and a link to the event in the evidence.
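A PostToolUse entry of this kind could look roughly as follows. This is a sketch of a project-level log line, not the hook payload format of Qwen Code: the field names and the post_tool_use_record helper are hypothetical.

```python
import hashlib
import json


def post_tool_use_record(tool: str, exit_code: int, diff_text: str,
                         evidence_ref: str) -> str:
    """Build one NDJSON line describing the actual result of a tool call."""
    record = {
        "hook": "PostToolUse",
        "tool": tool,
        "exit_code": exit_code,
        # The checksum lets a Verifier confirm that the diff under review
        # is the same diff that the tool actually produced.
        "diff_sha256": hashlib.sha256(diff_text.encode("utf-8")).hexdigest(),
        "evidence_ref": evidence_ref,
    }
    return json.dumps(record, sort_keys=True)
```

Storing the checksum rather than the diff itself keeps the log short while still letting a later round detect that the diff was edited after the fact.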

JSON Schema closes a class of errors in which an agent generates convincing text but violates the data contract. Examples of such violations:

a required field is missing;
a parameter type changes from integer to string;
a limit is set outside the allowed range.
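The three violations above can be illustrated without a schema library. The sketch below is a hand-written check, not a JSON Schema validator: the required field names are borrowed from the rate-limit example later in this lesson, while the limit field and its allowed range are assumptions.

```python
REQUIRED = ("tenant_id", "limit_reason", "expires_at")
LIMIT_RANGE = (1, 10_000)  # assumed allowed range, requests per minute


def contract_violations(event: dict) -> list[str]:
    """Return data-contract violations for one rate-limit event."""
    violations = [
        f"missing required field: {key}" for key in REQUIRED if key not in event
    ]
    limit = event.get("limit")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool):
        violations.append("limit must be an integer")
    elif not LIMIT_RANGE[0] <= limit <= LIMIT_RANGE[1]:
        violations.append("limit is outside the allowed range")
    return violations
```

In a real project the same rules would more likely live in a JSON Schema file, so that the Verifier can cite the schema itself as evidence_ref.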

Given/When/Then scenarios add causal checking: under which preconditions an action is allowed, which event triggers it, and which observable result must confirm safety.

flowchart TD
    COORD[Coordinator: record into requirements]
    IMPL[Implementor: patch_plan and hooks]
    PRE[PreToolUse: block dangerous actions]
    POST[PostToolUse: evidence and hash]
    VER["Verifier: cross-check with validation, verdict"]
    SAFETY["Safety: veto at critical_risk"]
    DISPUTE["Dispute: diff in requirements/hooks/validation"]
    COORD --> IMPL
    IMPL --> PRE
    PRE --> POST
    POST --> VER
    VER --> SAFETY
    SAFETY --> DISPUTE
    DISPUTE --> COORD

A conflict is resolved only through diffs in requirements.md, hooks.md, validation.md. Any hidden edits in the chat dialog are excluded from the evidence base.

If the Implementor believes the rejection is mistaken, they do not rewrite the explanation in free form. Instead, they add a verifiable change: clarify the requirement, strengthen the hook, or extend the validation scenario.

The Coordinator accepts a new round only after the diff is linked to the original specification and to a specific evidence event. Otherwise the dispute turns into an unreproducible private story. On a repeat conflict, move the decision into precedents.md, the precedents log, where exactly five fields are recorded for each case:

case_id — a stable precedent identifier;
verdict — the outcome under the arbitration rule (APPROVE / DENY / DEFERRED);
evidence_ref — a reference to the diff, hook log, schema, or scenario that proved the verdict;
applies_to — the boundaries of the precedent's applicability (tiers, strictness modes, domains);
next_check — the condition under which the precedent must be revised.

- case_id: PREC-021
  verdict: DENY
  evidence_ref: "tests/rate_limit_tenant_isolation.json"
  applies_to: "rate-limit without tenant_id deduplication, all tiers, strict_guardrails_prompt"
  next_check: "burst_window_sec rises above 60 or tenant_id isolation evidence appears"
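The "exactly five fields" rule is easy to enforce before an entry lands in precedents.md. A minimal sketch, assuming the entry has already been parsed into a dictionary; the check_precedent name is hypothetical.

```python
PRECEDENT_FIELDS = {"case_id", "verdict", "evidence_ref", "applies_to", "next_check"}
VERDICTS = {"APPROVE", "DENY", "DEFERRED"}


def check_precedent(entry: dict) -> list[str]:
    """Return problems that block an entry from entering precedents.md."""
    problems = []
    for name in sorted(PRECEDENT_FIELDS - entry.keys()):
        problems.append(f"missing field: {name}")
    for name in sorted(entry.keys() - PRECEDENT_FIELDS):
        problems.append(f"unexpected field: {name}")
    if "verdict" in entry and entry["verdict"] not in VERDICTS:
        problems.append(f"unknown verdict: {entry['verdict']}")
    return problems
```

Rejecting unexpected fields is a deliberate choice in this sketch: free-form additions are where paraphrase tends to creep back into the precedent base.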

The anti-Goodhart rule protects file-based arbitration from a situation in which one metric improves at the cost of system degradation. MTTR (mean time to recovery) cannot justify a rise in false escalations, silent failures, or rollback-flapping. This is true even if an individual round shows a quick PASS.

For this reason, set hard stop conditions in validation.md:

false_escalation_rate <= 0.05;
rollback_flapping < 3/h;
silent_p0_ratio == 0.

Exceeding any threshold switches the verdict to FAIL regardless of the time win. This turns Goodhart protection from a moral warning into an executable arbitration rule.
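As an executable rule, the three stop conditions can be sketched like this. The function name and the metric key names are assumptions and may differ from those in metrics/validation_metrics.json; the thresholds are the ones listed above.

```python
def goodhart_verdict(metrics: dict) -> tuple[str, list[str]]:
    """Return (verdict, violated thresholds) for one set of metrics."""
    violated = []
    if metrics["false_escalation_rate"] > 0.05:
        violated.append("false_escalation_rate")
    # Assumed unit: rollbacks per hour, matching "rollback_flapping < 3/h".
    if metrics["rollback_flapping"] >= 3:
        violated.append("rollback_flapping")
    if metrics["silent_p0_ratio"] != 0:
        violated.append("silent_p0_ratio")
    # MTTR is deliberately not an input: a time win cannot offset a breach.
    return ("FAIL" if violated else "PASS", violated)
```

For example, a plan that lowers MTTR but reports false_escalation_rate of 0.12 gets FAIL with false_escalation_rate named as the violated threshold, whatever its recovery time.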

Examples and application

Example: a specification for automatic rate limiting in an API gateway requires that, under a burst of requests, a specific tenant be temporarily limited, but the entire service must not be blocked and every burst must not be escalated as P0.

The Implementor proposes a patch:

add tenant_id to the deduplication key;
introduce a burst_window_sec=60 window;
write an event to evidence/rate_limit.ndjson after each limit application.

The Verifier makes a decision only if three pieces of evidence are present:

JSON Schema requires tenant_id, limit_reason, expires_at;
PreToolUse forbids changing the global limit without a specific tenant scope;
Given/When/Then shows that one tenant's burst does not reduce a neighboring tenant's quota.

If one of these pieces of evidence is missing, the Verifier returns DENY, even if the patch looks technically plausible.

In an A/B round, the configuration Implementor=local-coder, Verifier=frontier-reviewer may pass. A strong Verifier recognizes a sufficient link between the schema, the hook logs, and the scenarios.

The reverse configuration Implementor=frontier-reviewer, Verifier=local-coder may reject the same approach. This happens when the safety evidence is hidden in the Implementor's long reasoning instead of being surfaced into validation.md.

This does not mean that one agent is "right" and the other is "wrong." Arbitration reveals that the requirement is not portable enough between model tiers. The fix must appear as a diff — for example, adding a scenario Given tenant A exceeds burst limit / When tenant B sends normal traffic / Then tenant B quota remains unchanged.

Scenario: Tenant isolation under a load burst
  Given tenant A sends 800 req/min
  And tenant B sends 40 req/min
  When the rate-limit hook applies the limit
  Then tenant A gets a temporary limit for 60 seconds
  And tenant B retains the base quota
  And evidence contains tenant_id, limit_reason, and expires_at
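The scenario above can be turned into an executable check. The sketch below uses a toy limiter in place of the real hook: apply_rate_limit and the BURST_THRESHOLD value are assumptions, while the 60-second window and the evidence fields come from the Implementor's patch.

```python
BURST_THRESHOLD = 600  # assumed per-tenant threshold, requests per minute
BURST_WINDOW_SEC = 60  # from the proposed patch


def apply_rate_limit(traffic: dict, now: int) -> list[dict]:
    """Return one evidence event per tenant that exceeds the threshold."""
    events = []
    for tenant_id, req_per_min in traffic.items():
        if req_per_min > BURST_THRESHOLD:
            events.append({
                "tenant_id": tenant_id,
                "limit_reason": "burst",
                "expires_at": now + BURST_WINDOW_SEC,
            })
    return events


def scenario_tenant_isolation() -> bool:
    """Given/When/Then: tenant A's burst must not touch tenant B."""
    # Given tenant A sends 800 req/min and tenant B sends 40 req/min
    traffic = {"tenant-a": 800, "tenant-b": 40}
    # When the rate-limit hook applies the limit
    events = apply_rate_limit(traffic, now=1000)
    limited = {event["tenant_id"] for event in events}
    # Then only tenant A is limited, for 60 seconds, with full evidence
    return (
        limited == {"tenant-a"}
        and events[0]["expires_at"] == 1000 + BURST_WINDOW_SEC
        and {"tenant_id", "limit_reason", "expires_at"} <= events[0].keys()
    )
```

The value of the scenario for arbitration is that a weaker Verifier can run it and read a boolean, instead of reconstructing the isolation argument from the Implementor's reasoning.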

A stress test against the Goodhart trap is run as a separate short walkthrough. The Implementor is given the task of reducing MTTR from 6 to 2 minutes and proposes aggressive auto-escalation on the first alert event.

Force the Verifier to check not only speed but also side effects:

the share of false escalations;
the frequency of rollback-flapping (repeated rollbacks within a short window);
the volume of repeated notifications;
the presence of a cooldown window.

If the fast plan raises false_escalation_rate above the allowed threshold, the Coordinator records FAIL(reason=metric corruption) in judgment.md and requires a change to validation.md, not a cosmetic explanation in chat. This is how arbitration learns to distinguish a real improvement from optimizing a single number at the cost of operational stability.

Summary

File-based arbitration makes dispute resolution reproducible. The Coordinator manages the stages and the protocol. The Implementor changes only controlled artifacts. The Verifier demands hook logs, JSON Schema, and Given/When/Then. All conflicts go through diffs in requirements.md, hooks.md, validation.md and, when needed, are recorded in precedents.md.

Role rotation turns agents of different tiers into a tool for checking specification robustness. If the verdict changes when the Implementor/Verifier pair changes, strengthen the evidence instead of relying on the authority of a specific model.

The anti-Goodhart rule closes the loop: it forbids accepting fast decisions that improve MTTR at the cost of false escalations, silent failures, or rollback-flapping. Next, this arbitration loop will move into the economics of tier routing and the distribution of tokens between roles.

Decision trace instead of hidden reasoning

Arbitration does not need the full flow of the model's thoughts. It needs a reproducible decision protocol: which facts were extracted, which red flags were checked, which policy was applied, and what verdict came out. That is why a disputed conclusion is shaped as a phased decision_trace, not as free-form text that says "the model thought."

Minimal structure:

case_id: "JDG-001"
facts:
  - "readiness_block_audit.json gives score=22/25"
  - "audit_trace_coverage=0.7"
checks:
  - rule: "auto mode requires audit_trace_coverage=1.0"
    status: "fail"
  - rule: "score >= 23"
    status: "fail"
policy_outcome: "deny_auto_mode"
verdict: "DENY"
evidence_ref: "fixtures/readiness_block_audit.json"
customer_safe_summary: "automatic mode is blocked until full audit_trace"
internal_note: "fix Process evidence, then repeat readiness"

Such a trace can be handed to another Verifier or Safety role without chat history. If the verdict changes, the team compares the facts, checks, and policy_outcome fields, not the style of the explanation.
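Comparing two traces field by field can be sketched as follows, assuming both traces are already parsed into dictionaries with the structure shown above. The trace_diff name is hypothetical.

```python
COMPARED_FIELDS = ("facts", "checks", "policy_outcome")


def trace_diff(first: dict, second: dict) -> dict:
    """Return the compared fields whose values differ between two traces.

    Summaries and internal notes are ignored on purpose: they describe
    the decision and are not part of its evidence.
    """
    return {
        name: (first.get(name), second.get(name))
        for name in COMPARED_FIELDS
        if first.get(name) != second.get(name)
    }
```

An empty result combined with different verdicts is itself a finding: the two roles saw the same facts, checks, and policy outcome, so the disagreement lies in how the verdict was derived.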

Artifacts and readiness criteria

| Artifact | Ready when |
| --- | --- |
| `judgment.md` (or an excerpt from it) | the verdict has a reason and an `evidence_ref` to a diff, hook log, schema, or Given/When/Then, not to a paraphrase |
| `decision_trace` | facts, checks, policy outcome, and the final verdict are separated from each other |
| `out/duel.json` and `out/invariants.json` | are locally reproducible; the runnable example in `book2/examples/tribunal` passes the smoke pass |
| `precedents.md` entry | is created if the conflict is recurring; otherwise it is skipped |

The full track adds judgment.md with rounds of voting roles (Verifier/Implementor/Safety) under the Coordinator's protocol, a verdict matrix across tier pairs for a single unchanged specification, and anti-Goodhart invariants as a mandatory part of arbitration. Consider it ready if the verdict discrepancy across tier pairs is explained by a difference in validation.md, the anti-Goodhart thresholds block a fast but harmful plan, and recurring conflicts are recorded in precedents.md.

Practice

cd book2/examples/tribunal && python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases --out out/duel.json && python3 scripts/check_invariants.py --metrics metrics/validation_metrics.json --out out/invariants.json && python3 scripts/write_judgment.py --duel-out out/duel.json --invariants-out out/invariants.json --to out/judgment.md — *expectation: in out/judgment.md, a final verdict with evidence_ref pointing to a specific case.*
Lock in the evidence the Verifier is allowed to accept: a diff, a hook log, a schema, a Given/When/Then. *Expectation: in out/judgment.md, the evidence_ref field points to a file, not a paraphrase.*

Move a recurring conflict into capstone/precedents.md using this template (minimum fields):

   - case_id: "PREC-001"
     verdict: "DENY"
     evidence_ref: "tests/regression_001.json"
     applies_to: "auto-remediation without full audit_trace"
     next_check: "re-run the duel when manual_review_floor changes"

*Expectation: the next similar dispute is resolved by a reference to PREC-001, not by a repeat round.*

Review questions

How does the Coordinator differ from the Verifier and Safety, and why does the Coordinator not vote on equal terms with them?
Why must a dispute be resolved by diffs rather than by a back-and-forth?
What does a verdict discrepancy show when agents of different tiers are swapped?
The Implementor and Verifier fail to reach a decision for three rounds in a row, and the incident queue is growing. What stop condition and which artifact will you record before handing the dispute to a human?