Study guide: Applied Part 8. File Arbitration of Contested Changes: Roles, Verdicts, and Precedents

Lesson 3 of 5 in module «Applied Part 8. File Arbitration of Contested Changes: Roles, Verdicts, and Precedents»

Topic: Applied Part 8. File Arbitration of Disputed Changes: Roles, Verdicts, and Precedents

Difficulty level: Medium

Estimated study time: 4-6 hours (theory + practice with runnable example)

Prerequisites: Understanding of Verifier/Implementor/Safety/Coordinator roles from Part 3

Experience with LLM duels from Chapter 4 (counterexamples and minimal counterexamples)

Basic command-line and Python scripting skills

Familiarity with markdown format and YAML/JSON structures

Understanding of specification-as-source-of-truth principles (SDD, GitHub Spec Kit)

Learning objectives: Design a file arbitration schema where the verdict on a disputed change is recorded in judgment.md with mandatory fields verdict, reason, evidence_ref, and next_step

Distinguish admissible and inadmissible evidence for the Verifier: diffs, hook logs, JSON Schema, Given/When/Then scenarios — and reject chat summaries as insufficient grounds

Conduct role rotation through model tier matrices, explain divergent verdicts as a tier_dependent_spec signal, and strengthen the specification in validation.md instead of replacing the model

Apply the anti-Goodhart rule: block a PASS verdict if improving one metric (e.g., MTTR) violates stop-thresholds for false escalations, silent failures, or rollback flapping

Formalize recurring conflicts in precedents.md with mandatory fields case_id, verdict, evidence_ref, applies_to, next_check for reproducible resolution of future disputes

Overview: File arbitration of disputed changes is a procedural mechanism for collective review of a single change by multiple roles (Verifier, Implementor, Safety, Coordinator), where the result is recorded in project files rather than in ephemeral chat correspondence. Unlike the LLM duel from Chapter 4, which answers the question "was a minimal counterexample found and how was it closed," file arbitration answers a different question: "what official verdict does the team of roles adopt, what evidence is deemed admissible, and what precedent remains for future disputes." The mechanism is built around two key artifacts: judgment.md (decision log for current session disputes) and precedents.md (recurring conflict database). The Coordinator manages the process and maintains the protocol but does not vote. The Implementor proposes changes only in controlled artifacts. The Verifier accepts or rejects based on formal criteria. Safety holds veto power upon critical_risk. All verdicts are cross-checked against validation.md facts, and outcomes feed back into the roadmap. Role rotation through different model tiers (cheap local vs. strong cloud agents) becomes a tool for specification resilience testing: if the verdict changes when the pair changes, the problem lies in insufficient portability of evidence, not in an "error" of a specific agent. The anti-Goodhart rule completes the loop by prohibiting quick decisions that improve one metric at the cost of operational stability. The minimal learning scenario uses the autoscale_200pct case and is assembled from three scripts: run_duel.py, check_invariants.py, write_judgment.py.

Key concepts: Judgment.md: Decision log for disputes in the current arbitration session. Contains not just PASS/FAIL but a full procedural verdict: which round was conducted, which diff was reviewed, what evidence was deemed sufficient, and what the Implementor must do upon recurring dispute. Minimum mandatory fields: verdict (APPROVE/DENY/DEFERRED), reason (formulated grounds), evidence_ref (reference to a specific file, not a summary), next_step (specific action for resolution). Without these fields, file arbitration remains a Chapter 4 duel.
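The mandatory-field rule can be checked mechanically. The sketch below is a hypothetical validator, not part of the course scripts; it assumes a judgment record has already been parsed into a Python dict. The field names and verdict values are the ones listed above.

```python
# Hypothetical validator: field names and verdict values come from the lesson text,
# the function itself is an illustration and not part of write_judgment.py.
REQUIRED_FIELDS = ("verdict", "reason", "evidence_ref", "next_step")
ALLOWED_VERDICTS = {"APPROVE", "DENY", "DEFERRED"}


def validate_judgment(record: dict) -> list:
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    verdict = record.get("verdict")
    if isinstance(verdict, str) and verdict.strip() and verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict}")
    return problems


if __name__ == "__main__":
    # A record that carries only a raw duel result is reported as incomplete.
    for problem in validate_judgment({"verdict": "FAIL"}):
        print(problem)
```

A record with only a raw PASS/FAIL value is rejected twice over: three fields are missing, and the value is not one of the three procedural verdicts.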

Precedents.md: Database of recurring conflicts for reproducible resolution of future disputes. Each record contains exactly five fields: case_id (stable identifier), verdict (outcome per arbitration rule), evidence_ref (reference to diff, hook log, schema, or scenario), applies_to (applicability boundaries: tiers, strictness modes, domains), next_check (precedent review condition). Created only for repeatable conflicts; unique disputes remain in judgment.md.
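One way to keep the "exactly five fields" rule honest is to model a precedent as a fixed record type. This is a minimal sketch under assumptions: the class and function names are hypothetical, and the rendered layout simply mirrors the list-item style of the precedents.md entries shown in this lesson.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Precedent:
    """The five mandatory fields of a precedents.md record."""
    case_id: str
    verdict: str
    evidence_ref: str
    applies_to: str
    next_check: str


def parse_precedent(raw: dict) -> Precedent:
    """Build a Precedent, rejecting records with missing or extra fields."""
    expected = {f.name for f in fields(Precedent)}
    if set(raw) != expected:
        missing = sorted(expected - set(raw))
        extra = sorted(set(raw) - expected)
        raise ValueError(f"precedent fields mismatch: missing={missing}, extra={extra}")
    return Precedent(**raw)


def render_entry(p: Precedent) -> str:
    """Render one list item; json.dumps is used only to quote the string values."""
    return "\n".join([
        f"- case_id: {p.case_id}",
        f"  verdict: {p.verdict}",
        f"  evidence_ref: {json.dumps(p.evidence_ref)}",
        f"  applies_to: {json.dumps(p.applies_to)}",
        f"  next_check: {json.dumps(p.next_check)}",
    ])
```

Rejecting extra fields as well as missing ones keeps free-form commentary out of the precedent base, where it would be hard to apply mechanically.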

Coordinator: Role that opens the arbitration session, sets round order, manages the dispute queue, and is responsible for the official protocol in judgment.md. Key distinction from Verifier and Safety: the Coordinator does not vote. Its function is procedural, not decision-making. It prohibits advancing to the next stage without recording the previous stage's result, preventing "forgotten" disputes in chat.
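The "no advancing without a recorded result" rule can be pictured as a small gate. The class below is a hypothetical model for illustration only; the stage names and the in-memory storage are assumptions, whereas in the lesson the record lives in judgment.md.

```python
class ArbitrationSession:
    """Hypothetical model: a stage must have a recorded result before the next one opens."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.results = {}
        self.current = 0

    def record(self, result: str) -> None:
        """Store the result of the current stage (stands in for writing judgment.md)."""
        self.results[self.stages[self.current]] = result

    def advance(self) -> str:
        """Move to the next stage, refusing if the current one has no recorded result."""
        stage = self.stages[self.current]
        if stage not in self.results:
            raise RuntimeError(f"stage {stage!r} has no recorded result")
        if self.current + 1 >= len(self.stages):
            raise RuntimeError("no further stages")
        self.current += 1
        return self.stages[self.current]
```

Note that the gate never inspects what the result says: it only enforces that one exists, which matches the Coordinator's procedural, non-voting function.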

Verifier: Role that accepts or rejects Implementor changes based on formal criteria. Acts only when admissible file-based evidence is present: diffs in requirements.md/hooks.md/validation.md, PreToolUse/PostToolUse logs, JSON Schema, Given/When/Then scenarios. Does not accept chat summaries, "persuasive" reasoning without references, or agent impressions. Upon DENY, requires specific evidence_ref and next_step.

Implementor: Role that proposes changes only in controlled project artifacts. Upon verdict rejection, does not rewrite explanations in free form but adds verifiable changes: refines requirements, strengthens hooks, or expands validation scenarios. Works with patch_plan and hooks, records results in evidence/.

Safety: Role with veto power upon detection of critical_risk. Votes alongside Verifier and Implementor, but its vote is blocking for critical risks. The full charter with voting weights (vote_weight), quorum, and veto conditions is given in Part 3.

Admissible evidence (evidence_ref): Strictly limited set of forms that the Verifier is authorized to accept: (1) diffs in specification files, (2) PreToolUse logs (blocking dangerous actions before execution) and PostToolUse logs (actual result, exit code, diff hash), (3) JSON Schema (data contract validation), (4) Given/When/Then scenarios (causal condition-action-result verification). Anything outside these categories is excluded from the evidentiary base.
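The closed list of evidence forms lends itself to a simple allow-list check. In this sketch the kind labels (spec_diff, pretooluse_log, and so on) and the item layout with "kind" and "path" keys are assumptions made for illustration; the lesson defines the categories but not a machine-readable encoding.

```python
# Hypothetical labels for the four admissible categories named in the lesson.
ADMISSIBLE_KINDS = {
    "spec_diff",        # diff in a specification file
    "pretooluse_log",   # hook log written before execution
    "posttooluse_log",  # hook log written after execution
    "json_schema",      # data contract
    "gwt_scenario",     # Given/When/Then scenario
}


def classify_evidence(item: dict) -> tuple:
    """Return (admissible, explanation) for one evidence item."""
    kind = item.get("kind")
    path = item.get("path")
    if kind not in ADMISSIBLE_KINDS:
        return False, f"kind {kind!r} is outside the admissible categories"
    if not path:
        return False, "no file reference, so evidence_ref cannot be filled"
    return True, "admissible"


if __name__ == "__main__":
    items = [
        {"kind": "chat_message", "path": None},
        {"kind": "spec_diff", "path": "hooks.md"},
    ]
    for item in items:
        print(item["kind"], classify_evidence(item))
```

The second check matters as much as the first: an item of the right kind but without a file path still cannot be cited in evidence_ref.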

Tier rotation: Running the same specification through different Implementor/Verifier pairs (local vs. strong agent in each position). Four standard configurations: C1 (cheap Implementor + strong Verifier), C2 (strong Implementor + cheap Verifier), C3 (symmetric local pair), C4 (symmetric expensive pair). If the verdict changes when the pair changes — this is a tier_dependent_spec signal requiring stronger evidence in validation.md, not model replacement.
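The divergence test itself is small. The sketch below assumes the matrix results are available as a mapping from configuration name to PASS/FAIL; the label "tier_stable" for the non-divergent case and both function names are hypothetical, only tier_dependent_spec comes from the lesson.

```python
from collections import Counter


def matrix_signal(verdicts: dict) -> str:
    """Return tier_dependent_spec if the configurations disagree, else tier_stable."""
    if not verdicts:
        raise ValueError("no matrix results to compare")
    return "tier_dependent_spec" if len(set(verdicts.values())) > 1 else "tier_stable"


def divergent_configs(verdicts: dict) -> list:
    """List configurations whose verdict differs from the most common one."""
    majority, _count = Counter(verdicts.values()).most_common(1)[0]
    return sorted(name for name, verdict in verdicts.items() if verdict != majority)


if __name__ == "__main__":
    results = {"C1": "PASS", "C2": "FAIL", "C3": "PASS", "C4": "PASS"}
    print(matrix_signal(results), divergent_configs(results))
```

The output names the configuration to investigate, not the model to replace: the next step is to compare the evidence forms exchanged in that pair.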

Anti-Goodhart invariant: Rule prohibiting improvement of one metric at the system's expense. Hard stop conditions in validation.md: false_escalation_rate <= 0.05, rollback_flapping < 3/h, silent_p0_ratio == 0. Exceeding any threshold shifts the verdict to FAIL regardless of time gains (e.g., MTTR). Transforms Goodhart protection from moral warning into enforceable arbitration rule.
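The three stop conditions translate directly into comparisons. This is a minimal sketch, not the actual check_invariants.py: the flat metrics dict and the function names are assumptions, while the thresholds are the ones stated above.

```python
import operator

# Thresholds as stated in the lesson: (comparison that must hold, limit, symbol).
STOP_THRESHOLDS = {
    "false_escalation_rate": (operator.le, 0.05, "<="),
    "rollback_flapping": (operator.lt, 3, "<"),      # events per hour
    "silent_p0_ratio": (operator.eq, 0, "=="),
}


def check_invariants(metrics: dict) -> list:
    """Return human-readable violations; a missing metric counts as a violation."""
    violations = []
    for name, (holds, limit, symbol) in STOP_THRESHOLDS.items():
        if name not in metrics:
            violations.append(f"{name}: metric missing")
        elif not holds(metrics[name], limit):
            violations.append(f"{name}: {metrics[name]} violates {symbol} {limit}")
    return violations


def invariant_result(duel_result: str, metrics: dict) -> str:
    """Any violated threshold forces FAIL, whatever the duel reported."""
    return "FAIL" if check_invariants(metrics) else duel_result


if __name__ == "__main__":
    fast_patch = {"false_escalation_rate": 0.12, "rollback_flapping": 5, "silent_p0_ratio": 0}
    print(invariant_result("PASS", fast_patch))
    for violation in check_invariants(fast_patch):
        print(violation)
```

Treating a missing metric as a violation is a design choice of this sketch: it prevents a patch from passing simply because a measurement was not collected.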

Evidence forms (minimal_form vs. extended_form): Learning signal in the tier matrix. local-coder outputs short diagnostic_code (minimal_form), frontier-reviewer outputs evidence_by_invariant structure (extended_form). If a weak Verifier recognizes only minimal_form, while a strong Implementor outputs extended_form, pair C2 consistently fails. This demonstrates that evidence form must match the receiving role's capability.
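The form mismatch can be reproduced with a toy model. Everything here is an assumption made for illustration: the enumeration values, the evidence layout, and the idea that a weak Verifier can be approximated by a function that reads only diagnostic_code.

```python
# Hypothetical fixed enumeration of diagnostic codes.
KNOWN_CODES = {"TENANT_ISOLATION_OK", "GLOBAL_LIMIT_BLOCKED"}


def weak_verifier(evidence: dict) -> str:
    """Toy model of a Verifier that reads only minimal_form (a diagnostic_code)."""
    return "PASS" if evidence.get("diagnostic_code") in KNOWN_CODES else "FAIL"


# extended_form only: structured reasoning, no diagnostic_code.
extended_only = {
    "evidence_by_invariant": {"tenant_isolation": "tenant B quota unchanged"},
}

# Same reasoning plus a code from the enumeration.
standardized = {**extended_only, "diagnostic_code": "TENANT_ISOLATION_OK"}

if __name__ == "__main__":
    print(weak_verifier(extended_only))
    print(weak_verifier(standardized))
```

The underlying reasoning is identical in both inputs; only the presence of a code the receiving role can read changes the outcome, which is why the failure says nothing about the quality of the change itself.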

Dispute through diff: Fundamental rule: conflict is resolved only through diffs in requirements.md, hooks.md, validation.md. Any hidden edits in chat dialogue are excluded from the evidentiary base. The Coordinator accepts a repeat round only after the diff is linked to the original specification and a specific evidentiary event.

Practice exercises: Name: Exercise 1: Assembling a minimal judgment.md

Problem: You are in the book2/examples/tribunal directory. You have specification specs/autoscale_spec.yaml, cases in cases/, and metrics in metrics/validation_metrics.json. You need to produce the final out/judgment.md with a complete verdict. However, after the first run of write_judgment.py, you discover that the file contains only verdict: FAIL without the reason, evidence_ref, and next_step fields. Fix the script chain or their outputs so that the final judgment.md satisfies file arbitration criteria.

Solution: Step 1: Check the output of run_duel.py (out/duel.json). Ensure that each case contains not only pass/fail but also case_id, tested_invariant, failure_reason. If not — add JSON files with these fields to cases/ or modify run_duel.py to extract them from the specification.

Step 2: Check the output of check_invariants.py (out/invariants.json). Ensure that each violated invariant contains invariant_name, threshold, actual_value, metrics_source. Example: {"invariant_name": "silent_p0", "threshold": 0, "actual_value": 3, "metrics_source": "metrics/validation_metrics.json"}.

Step 3: Modify write_judgment.py to assemble a structure with mandatory fields from both inputs. Minimal template:

verdict: DENY
reason: "violates_invariant:silent_p0 — 3 P0 incidents closed without escalation"
evidence_ref: "metrics/validation_metrics.json#silent_p0"
next_step: "Implementor adds severity check before auto-escalation in hooks.md"

Step 4: Rerun the chain and verify that out/judgment.md contains all four fields. If evidence_ref points to a non-existent file — fix the path to be relative from the project root.
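Steps 3 and 4 can be sketched as follows. This is not the real write_judgment.py: the function names are hypothetical, the violation keys follow the field list from Step 2, and the rendering is naive (it does not escape quotes inside values).

```python
from pathlib import Path


def judgment_from_violation(violation: dict, next_step: str) -> dict:
    """Assemble the four mandatory fields from one violated invariant."""
    name = violation["invariant_name"]
    return {
        "verdict": "DENY",
        "reason": (
            f"violates_invariant:{name} "
            f"(actual {violation['actual_value']}, threshold {violation['threshold']})"
        ),
        "evidence_ref": f"{violation['metrics_source']}#{name}",
        "next_step": next_step,
    }


def render_judgment(judgment: dict) -> str:
    """Render the record in the key: value layout of the minimal template."""
    lines = [f"verdict: {judgment['verdict']}"]
    for key in ("reason", "evidence_ref", "next_step"):
        lines.append(f'{key}: "{judgment[key]}"')
    return "\n".join(lines) + "\n"


def evidence_file_exists(project_root: Path, evidence_ref: str) -> bool:
    """Check the file part of evidence_ref (before '#') relative to the project root."""
    relative_path = evidence_ref.split("#", 1)[0]
    return (project_root / relative_path).is_file()
```

Running evidence_file_exists before writing out/judgment.md catches the broken-path case from Step 4 early, while the record can still be corrected.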

Complexity: beginner

Name: Exercise 2: Distinguishing admissible and inadmissible evidence

Problem: You are given five "evidence" items from the Implementor after its patch was rejected. Which ones is the Verifier authorized to accept, and which must be rejected with a demand to redo?

  1. "I checked locally, everything works" (message in Slack)
  2. diff in hooks.md, adding PreToolUse blocking for global limit
  3. Screenshot of logs from developer's personal terminal
  4. JSON Schema with mandatory fields tenant_id, limit_reason, expires_at
  5. Gherkin scenario: Given tenant A exceeds burst limit / When rate-limit hook applies / Then tenant B quota remains unchanged

Solution: Admissible evidence: 2, 4, 5.

Inadmissible and why:

  • 1: Chat summary, no evidence_ref to a specific file. Verdict cannot rely on agent "confidence." Requirement: convert to diff in validation.md or hook log.
  • 3: Screenshot — non-reproducible artifact. No hash, no machine-readable structure. Requirement: replace with PostToolUse log with checksum or JSON export.

For admissible evidence, check completeness:

  • 2: The diff must be backed by a PreToolUse log that contains the rule, exit code, and event reference. A log with only the text "blocked" is insufficient.
  • 4: JSON Schema must be applied to a specific endpoint/structure, not in abstract form.
  • 5: Scenario must be bound to a specific case in cases/ with reproducible input.
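The completeness check for hook logs can be automated. The sketch assumes the log is stored as NDJSON (one JSON object per line) and that the three required items are stored under the keys rule, exit_code, and event_ref; both the storage format and the key names are assumptions, not something the lesson specifies.

```python
import json

# Hypothetical key names for the three items a hook log entry must carry.
REQUIRED_LOG_KEYS = ("rule", "exit_code", "event_ref")


def incomplete_entries(ndjson_text: str) -> list:
    """Return (line_number, missing_keys) for every entry lacking a required key."""
    problems = []
    for number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # "is None" keeps a legitimate exit_code of 0 from being treated as missing.
        missing = [key for key in REQUIRED_LOG_KEYS if entry.get(key) is None]
        if missing:
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    log = "\n".join([
        '{"rule": "block_global_limit", "exit_code": 2, "event_ref": "evt-1"}',
        '{"message": "blocked"}',
    ])
    print(incomplete_entries(log))
```

An entry that says only "blocked" is flagged with all three keys missing, which is the machine-readable version of the "insufficient" verdict above.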

Complexity: intermediate

Name: Exercise 3: Interpreting divergence in the tier matrix

Problem: You ran scripts/matrix.py for specification rate_limit_spec.md and obtained results:

  • C1 (local-coder Implementor + frontier-reviewer Verifier): PASS
  • C2 (frontier-reviewer Implementor + local-coder Verifier): FAIL
  • C3 (local-coder + local-coder): PASS
  • C4 (frontier-reviewer + frontier-reviewer): PASS

Signal summary.signal = tier_dependent_spec. Your task: without changing models, explain the divergence in validation.md and prepare a record for precedents.md.

Solution: Step 1: Diagnose evidence forms. Compare Implementor outputs in C1 and C2. Most likely, frontier-reviewer Implementor outputs extended_form (evidence_by_invariant with structured reasoning), while local-coder Verifier understands only minimal_form (diagnostic_code as short code). Pair C2 fails due to form mismatch, not due to specification error.

Step 2: Strengthen validation.md. Add requirement: "All tenant isolation safety evidence must contain diagnostic_code from a fixed enumeration, regardless of Implementor tier." This forces a strong Implementor to adapt output to a weak Verifier's capability.

Step 3: Record in precedents.md:

- case_id: PREC-RL-001
  verdict: DEFERRED
  evidence_ref: "matrix/tiers.json#C2_signal"
  applies_to: "rate-limit with tenant isolation, all tiers, evidence form not standardized"
  next_check: "validation.md updated with diagnostic_code requirement or universal evidence parser introduced"

Step 4: Verification. Rerun C2 after updating validation.md. Expected result: either PASS (if form is standardized) or meaningful FAIL with specific schema violation, not "unclear."

Complexity: intermediate

Name: Exercise 4: Applying the anti-Goodhart rule

Problem: The Implementor proposes a patch for autoscale_200pct that reduces MTTR from 6 to 2 minutes through aggressive automatic escalation upon the first alert event. The duel shows PASS on speed. However, metrics show: false_escalation_rate = 0.12, rollback_flapping = 5/h, silent_p0_ratio = 0. What verdict and artifact does the Coordinator record?

Solution: Step 1: Check stop-thresholds from validation.md. Compare against thresholds: false_escalation_rate <= 0.05 (0.12 > 0.05, violation), rollback_flapping < 3/h (5 > 3, violation), silent_p0_ratio == 0 (0 = 0, OK).

Step 2: Verdict. Although the duel shows PASS on MTTR, the invariant check is FAIL under the anti-Goodhart rule, so the Coordinator records DENY:

verdict: DENY
reason: "metric corruption — false_escalation_rate 0.12 exceeds threshold 0.05; rollback_flapping 5/h exceeds threshold 3/h"
evidence_ref: "metrics/validation_metrics.json#post_patch"
next_step: "Implementor adds cooldown_window_sec and manual_review_floor to hooks.md, updates validation.md thresholds"

Step 3: Requirement for Implementor. Not a cosmetic explanation ("we'll fix it"), but a concrete change: add cooldown window between escalations, manual confirmation threshold for primary events, rollback counter with hysteresis.

Step 4: Re-verification. New duel + invariants round must show false_escalation_rate <= 0.05 and rollback_flapping < 3/h before possible APPROVE.

Complexity: intermediate

Name: Exercise 5: Full arbitration cycle with capstone transfer

Problem: Conduct a full file arbitration cycle for the high_memory_usage case in the learning project. Initial state: readiness passes by score, but stateful_blocker lacks backup_verified evidence. Implementor insists on APPROVE, Verifier demands DENY, Safety not activated (no critical_risk). Three rounds yield no agreement. Incident queue grows. What stop-condition to apply and what minimal fragment to transfer to capstone/judgment.md?

Solution: Step 1: Apply stop-condition. Per arbitration rules, after three rounds without resolution the Coordinator records DEFERRED: a verdict that defers the decision until the missing evidence is obtained or the dispute is transferred to a human. This is neither APPROVE under pressure nor DENY without full grounds.

Step 2: Record in judgment.md:

verdict: DEFERRED
reason: "readiness passes by score, but stateful blocker has no backup evidence after 3 rounds"
evidence_ref: "fixtures/readiness_block_stateful.json"
next_step: "add backup_verified evidence or keep remediation manual"
rounds_conducted: 3
escalation_to_human: "pending — auto-escalate if not resolved in 24h or queue exceeds 10"

Step 3: Check for repeatability. If this conflict recurs regularly (e.g., with every autoscale_200pct update), add to capstone/precedents.md:

- case_id: PREC-HMU-001
  verdict: DEFERRED
  evidence_ref: "fixtures/readiness_block_stateful.json"
  applies_to: "high_memory_usage with readiness score pass and missing backup_verified, all tiers"
  next_check: "appearance of backup_verified evidence or change to readiness scoring algorithm"

Step 4: Transfer to capstone. Do not copy entire out/duel.json — it is reproduced by command. Save only judgment.md or its excerpt if they became part of the learning evidence package. Leave local out/duel.json and out/invariants.json outside the repository.
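The stop-condition from Step 1 and the escalation clause from Step 2 can be sketched as two small functions. The names are hypothetical; the limits (three rounds, 24 hours, queue above 10) are taken from the solution text, and reading "not resolved in 24h" as "open for 24 hours or more" is an interpretation.

```python
from typing import Optional

MAX_ROUNDS = 3          # rounds without resolution before DEFERRED
ESCALATION_HOURS = 24   # time limit from escalation_to_human
ESCALATION_QUEUE = 10   # queue length that must be exceeded


def verdict_after_round(rounds_conducted: int, agreed_verdict: Optional[str]) -> Optional[str]:
    """Return the verdict to record, or None if another round is still allowed."""
    if agreed_verdict is not None:
        return agreed_verdict
    if rounds_conducted >= MAX_ROUNDS:
        return "DEFERRED"
    return None


def should_escalate_to_human(hours_open: float, queue_length: int) -> bool:
    """Escalate a DEFERRED dispute on timeout or when the queue exceeds the limit."""
    return hours_open >= ESCALATION_HOURS or queue_length > ESCALATION_QUEUE


if __name__ == "__main__":
    print(verdict_after_round(3, None))
    print(should_escalate_to_human(hours_open=4, queue_length=11))
```

Keeping the two conditions separate reflects the procedure: DEFERRED is decided by round count alone, while escalation depends on how long the deferred dispute stays open.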

Complexity: advanced

Case studies: Name: Case: Rate-limit with tenant isolation and tier-dependent specification

Scenario: A team develops an API gateway with automatic rate limiting. The specification requires: upon request burst, limit the specific client (tenant), do not block the entire service, do not escalate every burst as P0. Implementor proposes a patch: add tenant_id to deduplication key, introduce burst_window_sec=60 window, write event to evidence/rate_limit.ndjson.

Challenge: Upon role rotation through the tier matrix, divergence was discovered: configuration C1 (cheap Implementor + strong Verifier) yields PASS, C2 (strong Implementor + cheap Verifier) consistently FAIL. The team initially interpreted this as "weak Verifier makes mistakes" and planned to replace the model. Meanwhile, the strong Implementor output structured reasoning (extended_form), while the weak Verifier understood only short diagnostic_code (minimal_form).

Solution: Instead of model replacement, the Coordinator required strengthening validation.md with a universal evidence form requirement. The Implementor adapted output: all tenant isolation safety evidence now contains mandatory diagnostic_code from a fixed enumeration, regardless of tier. The Verifier received three verifiable evidence items: JSON Schema with tenant_id/limit_reason/expires_at, PreToolUse blocking of global limit without scope, Given/When/Then tenant neighbor isolation scenario.

Result: After form standardization, all four matrix configurations yielded consistent PASS. Dispute resolution time dropped from 2-3 days of correspondence to one arbitration round. Record PREC-RL-001 in precedents.md enabled automatic resolution of similar conflicts in three subsequent sprints without repeat rounds.

Lessons learned: Verdict divergence upon tier change is a signal of insufficient specification portability, not a specific model error

Evidence form must match the receiving role's capability; universality is achieved through standardization in validation.md, not model selection

Every recurring conflict is worth formalizing in precedents.md — this is an investment in reducing future arbitration rounds

Related concepts: Tier rotation

Evidence forms (minimal_form vs. extended_form)

Admissible evidence (evidence_ref)

precedents.md

Name: Case: Anti-Goodhart in auto-escalation and blocking a fast but harmful patch

Scenario: The operations team sets a goal to reduce MTTR (mean time to recovery) from 6 to 2 minutes for the autoscale_200pct service. The Implementor proposes aggressive automatic escalation upon the first alert event without cooldown windows and without manual confirmation. The duel shows PASS on speed: MTTR indeed drops to 1.8 minutes.

Challenge: The Verifier discovered side effects: false_escalation_rate rose to 0.12 (threshold 0.05), rollback_flapping reached 5/h (threshold 3/h). The Implementor insisted that the MTTR goal was achieved and this is an "operational detail." The Coordinator faced pressure: the incident queue is growing, the business wants quick results.

Solution: The Coordinator applied the anti-Goodhart rule as an enforceable stop-condition: exceeding any threshold in validation.md shifts the verdict to FAIL regardless of time gains. The verdict was recorded: DENY(reason=metric corruption). The Implementor was prescribed not a cosmetic explanation but a concrete change: add cooldown_window_sec, manual_review_floor for primary events, rollback counter hysteresis. All changes — as diffs in hooks.md and validation.md.

Result: After rework, the patch passed repeat arbitration: MTTR = 2.3 minutes (close to the 2-minute goal), false_escalation_rate = 0.03, rollback_flapping = 1/h. The team realized that the "fast" initial version would have caused operational collapse: false escalations would overload on-call engineers, rollback flapping would destabilize production. The anti-Goodhart rule became a mandatory part of all subsequent auto-escalation arbitrations.

Lessons learned: Improving one metric at the system's expense is not improvement but metric corruption; anti-Goodhart rule must be enforceable, not moral

"Faster, the queue is growing" pressure does not cancel procedural stop-conditions; the Coordinator protects the process from expedient decisions

Concrete next_step in judgment.md (which files to change, which thresholds to add) is more effective than any "we'll figure it out" in chat

Related concepts: Anti-Goodhart invariant

Stop-conditions in validation.md

Coordinator as process guardian

judgment.md

Name: Case: Three-round deadlock and DEFERRED verdict for stateful blocker

Scenario: In the learning project capstone, a dispute arose over the high_memory_usage case. readiness passed by score, but stateful_blocker lacked backup_verified evidence — a critical requirement for stateful services. The Implementor insisted on APPROVE, citing high readiness score. The Verifier demanded DENY, citing lack of backup evidence. Safety was not activated (risk did not reach critical_risk).

Challenge: Three arbitration rounds yielded no agreement. The Implementor added increasingly persuasive textual explanations but created no verifiable changes. The Verifier rejected textual arguments but could not propose an alternative beyond complete prohibition. The incident queue grew, pressure mounted to "accept as is" or "reject outright."

Solution: The Coordinator applied the procedural stop-condition: after three rounds without resolution, DEFERRED is recorded, a verdict that defers the decision until the missing evidence is obtained or the dispute is transferred to a human. The judgment.md record contained verdict: DEFERRED, a specific reason, an evidence_ref to the current state, a next_step with two alternatives (add backup_verified evidence or keep remediation manual), and a condition for automatic escalation to a human. The result was neither APPROVE under pressure nor DENY without full grounds.

Result: A human architect decided within 4 hours: for the learning service, manual remediation is acceptable, but the production track requires mandatory backup_verified. This decision was formalized in precedents.md as PREC-HMU-001. In subsequent sprints, similar conflicts were resolved automatically by precedent without human involvement. The team realized that DEFERRED is not "arbitration failure" but a correct procedural state preventing pressured decisions.

Lessons learned: DEFERRED is a valid and important verdict; it prevents pressured decisions with incomplete evidence

Persuasive textual explanations without verifiable changes are repetition of the same round, not arbitration progress

Clear human escalation condition in judgment.md (time, queue threshold) prevents infinite deferral

Related concepts: DEFERRED verdict

Arbitration stop-conditions

Coordinator as protocol keeper without vote

precedents.md

Study tips: Start with the runnable example: physically run the three scripts (run_duel.py, check_invariants.py, write_judgment.py) and open the resulting files. Arbitration theory becomes obvious only after working with real artifacts.

Compare "bad" and "good" verdicts literally: print side by side a chat message ("Verifier rejected") and a judgment.md record with evidence_ref. The difference in reproducibility will become obvious.

Practice tier rotation as a diagnostic tool, not a way to "find the right model." Run the matrix, get tier_dependent_spec, and this is the target learning signal. Do not try to "fix" it by replacing the model.

Create your own precedents.md manually, not by copying a template. Try describing a conflict from your practice in five mandatory fields — you will immediately see which knowledge parts are not formalized.

Use diff as the sole language of dispute. When a disagreement arises in your learning group — do not discuss, propose a concrete change in requirements.md, hooks.md, or validation.md. This trains arbitration thinking.

Test anti-Goodhart invariants by intentionally "breaking" metrics. Create a test patch that improves MTTR at the cost of false escalations, and verify that check_invariants.py blocks it. This builds intuition for enforceable rules.

Keep a personal "Coordinator protocol": for each completed exercise, record which rounds passed, where deadlocks arose, which stop-conditions were applied. This develops procedural thinking, not just technical.

Actively study the connection with previous course parts: return to Part 16 (human team review) and compare how the same roles work "one level up" with file artifacts instead of PR comments.

Additional resources: book2/examples/tribunal/readme.md: Main runnable example of file arbitration with three scripts and the autoscale_200pct learning case

book2/examples/tribunal/matrix/readme.md: Tier matrix for testing tier_dependent_spec with four pair configurations

appendix-b-qwen-code-compatibility.md: Mapping of arbitration roles to built-in Qwen Code CLI capabilities (/review, headless qwen -p)

../book/part-16-team-code-review.md: Basic human team review scheme — predecessor of file arbitration by roles

../book/part-10-project-replanning.md: Replanning after facts — how arbitration outcomes affect the roadmap

../book/part-09-feature-validation.md: validation.md as source of facts for Verifier verdict cross-checking

GitHub Spec Kit (https://github.com/github/spec-kit): External reference for the SDD approach: specification as source of truth for system behavior

book2/examples/tribunal/scripts/run_duel.py: Source code of the duel script for studying output format

book2/examples/tribunal/scripts/check_invariants.py: Source code of anti-Goodhart invariant checking with thresholds

book2/examples/tribunal/scripts/write_judgment.py: Source code of judgment.md assembly for understanding evidence aggregation logic

Summary: File arbitration of disputed changes transforms conflict resolution from ephemeral chat correspondence into a reproducible procedure with project artifact recording. Key principles: the Coordinator manages the process and protocol but does not vote; the Verifier accepts only verifiable file-based evidence (diffs, hook logs, JSON Schema, Given/When/Then), rejecting summaries; the Implementor proposes changes only in controlled artifacts; Safety holds veto upon critical_risk. Verdicts are recorded in judgment.md with mandatory fields verdict, reason, evidence_ref, next_step. Role rotation through different model tiers reveals tier_dependent_spec and requires specification strengthening, not model replacement. The anti-Goodhart rule blocks metric corruption: improving one metric at the cost of false escalations, silent failures, or rollback flapping shifts the verdict to FAIL regardless of speed gains. Recurring conflicts are formalized in precedents.md with five mandatory fields for automatic resolution of future disputes. The minimal learning scenario uses the autoscale_200pct case and is assembled from three scripts in book2/examples/tribunal/.

