Study guide: Applied Part 4. LLM Duel: Verifier vs Implementor in Formal Assertions

Lesson 3 of 5 in module «Applied Part 4. LLM Duel: Verifier vs Implementor in Formal Assertions»

Topic: Applied Part 4. LLM Duel: Verifier vs Implementor in Formal Assertions

Difficulty level: Medium

Estimated study time: 6-8 hours (theory + practice + capstone transfer)

Prerequisites: Completion of Part 9 of Volume 1: verifiable facts and "specification guides, facts permit merge"

Completion of Part 16 of Volume 1: independent human review of fact packages

Basic familiarity with JSON Schema and Given/When/Then format

Ability to run Python scripts from command line

Understanding of Kubernetes autoscaling basics (replicas, quotas, limits)

Learning objectives: Formulate incident scenarios in strict Given/When/Then format and link them to JSON Schema

Build minimal counterexamples: inputs valid according to schema but violating the Then assertion

Implement the LLM duel protocol Verifier↔Implementor with results recorded in validation.md

Convert discovered counterexamples into verifiable next_guard rules for CI pipeline

Translate operational boundaries (quota, limit, blast radius) from verbal agreements into formal specification

Overview: This chapter teaches how to turn formally correct but dangerous requests into manageable, verifiable rules through adversarial validation. The central scenario is autoscale_200pct: a webhook requests a 200% replica increase while 12 replicas are already running, but the remaining quota allows only 3 more and the replica limit is 15. Instead of failing mid-action, the system must either safely constrain the step or reject it with diagnostics.

The LLM duel technique assigns two roles that argue over files: the Verifier finds minimal counterexamples, and the Implementor fixes the rule and the implementation. The argument ends when the counterexample becomes part of the specification and a re-run produces PASS. This is neither the poisonous specification technique (chapter 2) nor mutation testing (chapter 5): it checks the resilience of an already formulated rule against specific violating inputs.

The study minimum shows how one counterexample becomes a verifiable verdict. The full production track adds model rotation, tiers, and an external Coordinator, which are covered in Part 8.

Key concepts: Counterexample: A minimal input that satisfies the JSON Schema but violates the Then assertion. For autoscale_200pct: current_replicas=12, remaining_quota=3, scale_up_percent=200. Minimality means that removing any field destroys the violation. The counterexample is published as counterexample.json with the fields given_snapshot, when_payload, assertion_id, and minimality_trace.
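As an illustration of the counterexample.json layout described above, here is a minimal Python sketch. The four top-level field names and the assertion_id value come from this guide; how the values are split between given_snapshot and when_payload, and the shape of minimality_trace, are assumptions made for the example.

```python
import json

# Top-level keys named in the guide for counterexample.json.
REQUIRED_KEYS = {"given_snapshot", "when_payload", "assertion_id", "minimality_trace"}

# Assumption: cluster state goes into given_snapshot, the webhook request
# into when_payload, and minimality_trace is a list of short notes.
counterexample = {
    "assertion_id": "allowed_delta_within_quota",
    "given_snapshot": {"current_replicas": 12, "remaining_quota": 3},
    "when_payload": {"scale_up_percent": 200},
    "minimality_trace": [
        "drop current_replicas: violation no longer reproducible",
        "drop remaining_quota: violation no longer reproducible",
        "drop scale_up_percent: violation no longer reproducible",
    ],
}


def missing_keys(doc: dict) -> set:
    """Return the required top-level keys that are absent from doc."""
    return REQUIRED_KEYS - set(doc)


# Stable key order keeps diffs of the published file readable.
serialized = json.dumps(counterexample, indent=2, sort_keys=True)
```

Writing `serialized` to counterexample.json gives the Verifier a reproducible artifact rather than a prose explanation.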

Verifier: The role that searches for minimal counterexamples to the Then assertion. It wins if it builds a valid minimal counterexample, one that satisfies the input schema but violates Then. The Verifier does not explain; it demonstrates through a reproducible input.

Implementor: The role that fixes the code and the rule after a duel failure. It wins only when two conditions hold: the code and the rule are both updated, and a re-run of the duel no longer finds the same class of failure and does not break existing invariants. The Implementor is obligated to produce four artifacts: repair.patch, schema_delta, rationale, and affected_assertions.

Counterexample minimality: The requirement that a counterexample contain exactly those fields and values without which the violation disappears. Bad: noisy fields such as cluster_id, labels, annotations, node_pool, and region, which make it unclear which edit breaks Then. Good: only current_replicas, remaining_quota, and scale_up_percent (plus pod_cpu once the Then formula divides the quota by it, as in the minimality exercise).

Operational boundaries: Verifiable constraints that move the specification from the logical plane to the operational one: quota, rate limit, blast radius, deduplication, retry window, maximum change size. They become part of Then, for example target_replicas <= max_replicas and executed_delta <= remaining_quota / pod_cpu.
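The two example Then clauses can be written as a single predicate. This is a sketch only: the function name is hypothetical, and reading "increase by 200%" as 24 extra replicas on top of 12 is an assumption about how the request is interpreted.

```python
def then_holds(target_replicas: int, executed_delta: int,
               max_replicas: int, remaining_quota: float, pod_cpu: float) -> bool:
    """Check both operational boundaries from the Then assertion."""
    within_limit = target_replicas <= max_replicas
    within_quota = executed_delta <= remaining_quota / pod_cpu
    return within_limit and within_quota


# Unclamped request (assumed to mean +24 replicas): both boundaries are broken.
unclamped_ok = then_holds(target_replicas=36, executed_delta=24,
                          max_replicas=15, remaining_quota=3, pod_cpu=1)

# Step constrained to 3 replicas: both boundaries hold.
clamped_ok = then_holds(target_replicas=15, executed_delta=3,
                        max_replicas=15, remaining_quota=3, pod_cpu=1)
```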

Next guard: A new rule that must be checked in all future runs, formulated in Given/When/Then form. Example: "a repeated webhook within 2 seconds does not increase executed_delta". It turns a single incident into a catalog of precedents for CI regression.
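To make the example guard concrete, the sketch below models a handler that ignores a repeated webhook inside the 2-second window. The Scaler class, the webhook_id key, and passing the clock value explicitly are all hypothetical choices for illustration, not part of the course code.

```python
from dataclasses import dataclass, field

DEDUP_WINDOW_SECONDS = 2.0  # window taken from the example guard


@dataclass
class Scaler:
    """Hypothetical handler used only to exercise the guard."""
    executed_delta: int = 0
    last_seen: dict = field(default_factory=dict)

    def handle(self, webhook_id: str, delta: int, now: float) -> bool:
        """Apply delta unless the same webhook was applied within the window."""
        previous = self.last_seen.get(webhook_id)
        if previous is not None and now - previous < DEDUP_WINDOW_SECONDS:
            return False  # duplicate: executed_delta must not grow
        self.last_seen[webhook_id] = now
        self.executed_delta += delta
        return True
```

A regression test for the guard then reads almost like the Given/When/Then text: given one applied webhook, when the same webhook arrives 1.5 seconds later, then executed_delta is unchanged.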

validation.md: The duel journal that stores the chain of evidence: duel_id, assertion_id, the failing case, the specification version before the fix, the JSON Schema change, the code change, the new verdict, and a link to the passing duel test. It is not a free-form ticket comment but a reproducible regression asset.
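A journal is only a regression asset if its entries can be checked mechanically. The sketch below parses a "key: value" entry and reports missing fields. It is not the course's lint_validation.py; the entry layout and the choice of five required fields (the ones this guide uses in its sample entry) are assumptions.

```python
# Fields used in the guide's sample validation.md entry.
REQUIRED = ("duel_id", "assertion_id", "counterexample", "verdict", "next_guard")


def parse_entry(text: str) -> dict:
    """Parse 'key: value' lines; only the first colon on a line separates key and value."""
    entry = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            entry[key.strip()] = value.strip().strip('"')
    return entry


def missing_fields(entry: dict) -> list:
    """Return required fields that are absent or empty, in a fixed order."""
    return [name for name in REQUIRED if not entry.get(name)]
```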

JSON Schema as contract: The schema constrains the admissible input space and links Given/Then fields to types and constraints. After a counterexample, the Implementor must change not only the code but also the schema; otherwise similar failures will pass through.
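The following sketch shows why the schema change matters: once clamp_policy and max_actions_per_window are required, a payload without them is rejected before any code runs. The schema fragment is an assumption assembled from values in this guide (the 1-1000 range and the two policy names), and the hand-rolled validator covers only `required`, `enum`, `minimum`, and `maximum`; a real pipeline would use a full JSON Schema validator.

```python
# Assumed schema fragment after the Implementor's schema_delta.
SCHEMA = {
    "type": "object",
    "required": ["current_replicas", "remaining_quota", "scale_up_percent",
                 "max_actions_per_window", "clamp_policy"],
    "properties": {
        "scale_up_percent": {"type": "integer", "minimum": 1, "maximum": 1000},
        "clamp_policy": {"enum": ["hard_block", "soft_clamp"]},
    },
}


def validate(payload: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable violations (empty means valid)."""
    errors = [f"missing required field: {name}"
              for name in schema["required"] if name not in payload]
    for name, rule in schema["properties"].items():
        if name not in payload:
            continue
        value = payload[name]
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} not in {rule['enum']}")
        if isinstance(value, int):
            if "minimum" in rule and value < rule["minimum"]:
                errors.append(f"{name}: {value} below minimum {rule['minimum']}")
            if "maximum" in rule and value > rule["maximum"]:
                errors.append(f"{name}: {value} above maximum {rule['maximum']}")
    return errors
```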

Coordinator: The arbiter brought in when the Verifier and Implementor do not converge after a set number of rounds. It marks the duel DEFERRED and hands it over to manual-review with an explicit description of the disputed invariant, which prevents infinite diagnostic loops.
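The round limit can be sketched as a bounded loop. The function and callback names below are hypothetical, and the default of 8 rounds simply mirrors the limit used in the CI exercise of this guide.

```python
from typing import Callable, Optional


def run_duel(find_counterexample: Callable[[], Optional[object]],
             repair: Callable[[object], None],
             max_rounds: int = 8) -> dict:
    """Alternate Verifier and Implementor until no counterexample is found.

    PASS is declared only when the Verifier comes back empty-handed within
    the round budget; otherwise the duel is handed to manual review.
    """
    for round_no in range(1, max_rounds + 1):
        case = find_counterexample()
        if case is None:
            return {"verdict": "PASS", "rounds": round_no}
        repair(case)
    return {"verdict": "DEFERRED", "rounds": max_rounds, "handoff": "manual-review"}
```

Note the design choice: a repair made in the final round is not re-verified, so it still ends in DEFERRED and a human looks at it.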

Neighbor case replay: After the fix, the Verifier must replay not only the original counterexample but also equivalent cases: missing cluster_id, zero quota, repeated webhook, remaining_quota=1 with current_replicas=max_replicas, and a soft_clamp conflict with blast_radius_limit. This protects against narrow patches.
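A replay can be sketched as one invariant checked over a small table of cases. Only the numeric neighbors are modeled here; the case names, the replay helper, and the way the requested delta is derived from scale_up_percent are assumptions for illustration.

```python
import math


def allowed_delta(case: dict) -> int:
    """Clamp the request by quota and replica limit (formula from the case study)."""
    requested = case["current_replicas"] * case["scale_up_percent"] // 100  # assumed reading
    by_quota = math.floor(case["remaining_quota"] / case["pod_cpu"])
    by_limit = case["max_replicas"] - case["current_replicas"]
    return max(0, min(requested, by_quota, by_limit))


BASE = {"current_replicas": 12, "remaining_quota": 3, "pod_cpu": 1,
        "max_replicas": 15, "scale_up_percent": 200}
NEIGHBORS = {
    "original": BASE,
    "zero_quota": {**BASE, "remaining_quota": 0},
    "at_max_replicas": {**BASE, "current_replicas": 15, "remaining_quota": 1},
}


def replay(rule, cases: dict) -> list:
    """Return the names of cases where the rule breaks an operational boundary."""
    failures = []
    for name, case in cases.items():
        delta = rule(case)
        ok = (0 <= delta <= case["remaining_quota"] / case["pod_cpu"]
              and case["current_replicas"] + delta <= case["max_replicas"])
        if not ok:
            failures.append(name)
    return failures
```

A narrow patch that hard-codes the answer 3 passes the original case but fails both neighbors, which is exactly what the replay is meant to expose.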

Practice exercises: Name: Running offline duel autoscale_200pct

Problem: Navigate to the directory book2/examples/tribunal and run the script run_duel.py with the specified parameters. Find the entry for autoscale_counter_200pct in the output file out/duel.json. Determine which assertion_id was checked, what verdict was obtained, what allowed_delta value was actually applied, and what diagnostic_code was issued.

Solution:

  1. cd book2/examples/tribunal
  2. python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json
  3. Open out/duel.json and find the object with counterexample_id: "autoscale_counter_200pct"
  4. Check the fields: assertion_id should be "allowed_delta_within_quota" and verdict should be "PASS"
  5. In actual, find allowed_delta: 3 (constrained by quota) and diagnostic_code: "QUOTA_EXCEEDED_AFTER_CLAMP"
  6. Confirm that the input is schema-valid (scale_up_percent=200 is within the range 1-1000) and that Then is satisfied by constraining the step, not by fulfilling the full request

Complexity: beginner
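Steps 3 to 5 of this exercise can also be done programmatically. The sketch below assumes that out/duel.json is a top-level list of objects and that allowed_delta and diagnostic_code sit under an "actual" key; the real file layout may differ, so treat the structure and the helper names as hypothetical.

```python
import json


def find_entry(report: list, counterexample_id: str):
    """Return the first entry with the given counterexample_id, or None."""
    for entry in report:
        if entry.get("counterexample_id") == counterexample_id:
            return entry
    return None


def summarize(entry: dict) -> tuple:
    """Pull out the four values the exercise asks about."""
    actual = entry.get("actual", {})
    return (entry.get("assertion_id"), entry.get("verdict"),
            actual.get("allowed_delta"), actual.get("diagnostic_code"))


# In-memory stand-in for out/duel.json (assumed structure).
sample = json.loads("""
[{"counterexample_id": "autoscale_counter_200pct",
  "assertion_id": "allowed_delta_within_quota",
  "verdict": "PASS",
  "actual": {"allowed_delta": 3, "diagnostic_code": "QUOTA_EXCEEDED_AFTER_CLAMP"}}]
""")
summary = summarize(find_entry(sample, "autoscale_counter_200pct"))
```

Against the real file, replace `sample` with the result of `json.load` on an opened out/duel.json.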

Name: Checking counterexample minimality

Problem: Given the counterexample {current_replicas: 12, remaining_quota: 3, pod_cpu: 1, scale_up_percent: 200, cluster_id: "agentclinic-prod", namespace: "appointments", labels: {team: "platform"}, node_pool: "standard", region: "us-east-1"}, reduce it to a minimal one. Explain why the removed fields do not affect the violation, and formulate the minimality criterion for your case.

Solution:

  1. Remove cluster_id: the violation remains (the quota check does not depend on the cluster identifier)
  2. Remove namespace, labels, node_pool, region: the violation remains
  3. Remove pod_cpu: the violation disappears, because without pod_cpu it is impossible to compute floor(remaining_quota / pod_cpu) and the allowed_delta formula breaks
  4. Minimal counterexample: {current_replicas: 12, remaining_quota: 3, pod_cpu: 1, scale_up_percent: 200}
  5. Minimality criterion: removing any of the four fields makes the violation irreproducible

Complexity: intermediate
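The manual reduction above can be sketched as a single greedy pass that drops each field and keeps the drop only if the violation still reproduces. The `violates` predicate models a rule that would execute the request unclamped; that model, and reading the request as current_replicas * scale_up_percent / 100 extra replicas, are assumptions for this example.

```python
import math

NEEDED = ("current_replicas", "remaining_quota", "pod_cpu", "scale_up_percent")


def violates(case: dict) -> bool:
    """True if an unclamped execution of the request would exceed the quota."""
    if any(name not in case for name in NEEDED):
        return False  # the violation cannot be reproduced without these inputs
    requested = case["current_replicas"] * case["scale_up_percent"] // 100
    return requested > math.floor(case["remaining_quota"] / case["pod_cpu"])


def minimize(case: dict, predicate) -> dict:
    """Greedy one-pass reduction: drop every field whose removal keeps the violation."""
    minimal = dict(case)
    for name in list(case):
        trial = {k: v for k, v in minimal.items() if k != name}
        if predicate(trial):
            minimal = trial
    return minimal


noisy = {"current_replicas": 12, "remaining_quota": 3, "pod_cpu": 1,
         "scale_up_percent": 200, "cluster_id": "agentclinic-prod",
         "namespace": "appointments", "labels": {"team": "platform"},
         "node_pool": "standard", "region": "us-east-1"}
minimal = minimize(noisy, violates)
```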

Name: Formulating next_guard for high_memory_usage

Problem: Transfer the principle from autoscale_200pct into the capstone project. Scenario: for high_memory_usage, the restart_pod rule permits a dry-run if readiness >= 23/25. But a stateful pod with backup_verified=false must not be restarted even at readiness=24/25. Build a minimal counterexample and formulate a next_guard in Given/When/Then form.

Solution:

  1. Minimal counterexample: readiness=24/25, stateful=true, backup_verified=false
  2. Minimality check: removing stateful=true makes the violation disappear (for a non-stateful pod, readiness >= 23/25 legitimately permits the dry-run); removing backup_verified=false makes it disappear (stateful=true with backup_verified=true is allowed); removing readiness=24/25 makes it disappear (at readiness < 23/25 the dry-run is already blocked)
  3. next_guard formulation: "Given stateful=true and backup_verified=false When readiness >= 23/25 Then dry-run blocked with diagnostic STATEFUL_BACKUP_REQUIRED"
  4. Entry in validation.md:

     duel_id: duel-high-memory-001
     assertion_id: HM-READINESS-01
     counterexample: "readiness=24/25, stateful=true, backup_verified=false"
     verdict: PASS
     next_guard: "Given stateful=true and backup_verified=false When readiness >= 23/25 Then dry-run blocked with diagnostic STATEFUL_BACKUP_REQUIRED"

Complexity: intermediate
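The next_guard from this exercise can be expressed as a small decision function and checked against the counterexample and its neighbors. The function name, the return shape, and the READINESS_BELOW_THRESHOLD code are hypothetical; only the 23/25 threshold and STATEFUL_BACKUP_REQUIRED come from the exercise.

```python
READINESS_THRESHOLD = 23  # ready pods out of 25, from the restart_pod rule


def restart_decision(ready: int, stateful: bool, backup_verified: bool) -> tuple:
    """Return (decision, diagnostic_code) for a restart_pod dry-run request."""
    if ready < READINESS_THRESHOLD:
        # Hypothetical diagnostic code; the guide does not name one for this branch.
        return ("BLOCKED", "READINESS_BELOW_THRESHOLD")
    if stateful and not backup_verified:
        return ("BLOCKED", "STATEFUL_BACKUP_REQUIRED")
    return ("DRY_RUN_ALLOWED", None)
```

Each branch corresponds to one line of the minimality check: flipping any one of the three inputs of the counterexample changes the outcome.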

Name: Analyzing procedure error: code only, no schema

Problem: The Verifier found the counterexample autoscale_200pct. The Implementor changed only the code, adding the allowed_delta formula to a Python function, but did not update the JSON Schema: the fields max_actions_per_window and clamp_policy were left out of required. A week later a new webhook arrived without clamp_policy and the system crashed with an error. Where is the error in the duel procedure? Which artifact did the Implementor fail to produce?

Solution:

  1. Error: the Implementor violated the win condition by changing only the code and not updating the rule (specification and schema)
  2. Missing artifact: schema_delta, the JSON Schema change that would have made the new response policy fields mandatory
  3. Why this matters: the schema is a contract checked before execution. Without the schema change, a new input with a missing clamp_policy passes validation and breaks at runtime
  4. Correct procedure: the Implementor is obligated to produce repair.patch (code), schema_delta (schema), rationale (justification), and affected_assertions (list of impacted assertions)
  5. The re-run must include neighbor cases: missing clamp_policy and a soft_clamp conflict with blast_radius_limit

Complexity: advanced

Name: Building CI pipeline with duel

Problem: Formulate a command sequence for CI that (1) validates the schema, (2) runs the duel with an 8-round limit, and (3) requires next_guard to be present in validation.md. Explain why step three, lint_validation.py, is necessary, and what happens if the Verifier finds a new counterexample mid-sprint.

Solution:

  1. Step 1: python3 scripts/spec_ci/lint_spec.py spec/incident-autoscale.md checks the syntax and the linkage of fields to the JSON Schema
  2. Step 2: python3 scripts/tribunal/run_duel.py --scenario autoscale --case autoscale_counter_200pct.json --max-rounds 8 --out .artifacts/duels/autoscale.json runs the duel with a round limit
  3. Step 3: python3 scripts/spec_ci/lint_validation.py validation.md --require next_guard checks that every failure produced a verifiable next_guard rule
  4. Why lint_validation.py: without it the counterexample stays in the duel output file (.artifacts/duels/autoscale.json here), a local artifact, and does not become a regression asset. CI will not block a repetition of the same error class
  5. If the Verifier finds a new counterexample mid-sprint: the pipeline must require a schema_delta, a rule update, and a repeated green pass. A red status alone is insufficient; a reproducible trail in validation.md is needed

Complexity: advanced
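The three steps can be chained so that the pipeline stops at the first failure. In this sketch the command lines are the ones from the solution above, while the runner function and its injectable `run` parameter are hypothetical conveniences that make the sequencing testable without the real scripts.

```python
import subprocess

# Command lines copied from the exercise solution.
STEPS = [
    ["python3", "scripts/spec_ci/lint_spec.py", "spec/incident-autoscale.md"],
    ["python3", "scripts/tribunal/run_duel.py", "--scenario", "autoscale",
     "--case", "autoscale_counter_200pct.json", "--max-rounds", "8",
     "--out", ".artifacts/duels/autoscale.json"],
    ["python3", "scripts/spec_ci/lint_validation.py", "validation.md",
     "--require", "next_guard"],
]


def run_pipeline(steps=STEPS, run=subprocess.run) -> int:
    """Run steps in order; return 0 on success or the 1-based index of the failing step."""
    for index, command in enumerate(steps, start=1):
        result = run(command)
        if result.returncode != 0:
            return index  # later steps are skipped
    return 0
```

A CI job would call `run_pipeline()` and exit non-zero when the result is not 0, so a missing next_guard blocks the merge just like a failing duel.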

Case studies: Name: AgentClinic-production: autoscale_200pct and quota failure

Scenario: The appointments-api service runs in the AgentClinic-production cluster. CPU load is 98%, 12 replicas are running, the quota allows 3 more, and the replica limit is 15. A webhook arrives from Grafana: "increase replica count by 200%". Formally the request is correct: all fields are filled and all ranges are valid. Executing it would require 24 additional replicas (200% of 12), but the quota allows 3 and the limit of 15 leaves room for only 3.

Challenge: There are two failure modes: (1) the rule checks only formal input correctness, so the autoscaler errors mid-action, having created some replicas and broken consistency; (2) the rule does not include operational boundaries in Then, because quota, limit, and blast radius are considered "obvious" and are never formally checked. The team discovers that the same error class repeats with different inputs: zero quota, max_replicas=current_replicas, duplicate webhook.

Solution: The team implements the LLM duel Verifier↔Implementor. The Verifier builds a minimal counterexample: current_replicas=12, remaining_quota=3, scale_up_percent=200. The Implementor responds with the formula allowed_delta = min(requested_delta, floor(remaining_quota / pod_cpu), max_replicas - current_replicas) and the policy hard_block | soft_clamp. The key step is that the formula is anchored in the JSON Schema through the new fields max_actions_per_window and clamp_policy, which become required. The duel re-run includes neighbor cases: missing cluster_id, zero quota, repeated webhook within the deduplication window, and remaining_quota=1 with current_replicas=max_replicas. Every failure and every win is recorded in validation.md with duel_id, assertion_id, and next_guard.
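The Implementor's formula and the two policies can be sketched as follows. The formula and the QUOTA_EXCEEDED_AFTER_CLAMP code come from this case study; the function name, the return shape, the REQUEST_BLOCKED code for hard_block, and passing requested_delta in directly are assumptions.

```python
import math


def decide(current_replicas: int, remaining_quota: float, pod_cpu: float,
           max_replicas: int, requested_delta: int, clamp_policy: str) -> dict:
    """Apply allowed_delta = min(requested, floor(quota / pod_cpu), max - current)."""
    allowed = min(requested_delta,
                  math.floor(remaining_quota / pod_cpu),
                  max_replicas - current_replicas)
    allowed = max(allowed, 0)  # never scale down as a side effect of clamping
    if allowed == requested_delta:
        return {"allowed_delta": allowed, "diagnostic_code": None}
    if clamp_policy == "soft_clamp":
        return {"allowed_delta": allowed, "diagnostic_code": "QUOTA_EXCEEDED_AFTER_CLAMP"}
    if clamp_policy == "hard_block":
        # Hypothetical diagnostic code: reject the whole request, change nothing.
        return {"allowed_delta": 0, "diagnostic_code": "REQUEST_BLOCKED"}
    raise ValueError(f"unknown clamp_policy: {clamp_policy!r}")
```

With the case-study numbers and requested_delta=24, soft_clamp yields allowed_delta=3, while hard_block leaves the replica count untouched. Both decisions are made before any state change.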

Result: The system moves from "break in production" mode to "reject with diagnostics before state change" mode. The counterexample autoscale_counter_200pct gets the verdict PASS with diagnostic_code QUOTA_EXCEEDED_AFTER_CLAMP and allowed_delta=3. A new webhook with the same parameters is processed in 50 ms with clear diagnostics and without partial changes. The CI pipeline blocks regression: if new code brings back the old rule without clamp_policy, lint_validation.py fails. The team accumulates 12 entries in validation.md per quarter, covering quota, rate, and deduplication failures.

Lessons learned: A minimal counterexample is more valuable than an explanation: it is reproducible and automatically verifiable

Operational boundaries must live in Then, not in verbal agreements; otherwise every new engineer rediscovers them

The Implementor must change the schema together with the code: schema_delta is not optional but a win condition

Neighbor cases in the replay protect against a narrow patch that closes one example and leaves an equivalent failure open

validation.md turns a single incident into a precedent catalog that CI uses for regression protection

Related concepts: Counterexample

Verifier

Implementor

Counterexample minimality

Operational boundaries

next_guard

validation.md

Neighbor case replay

Name: Transferring principle to capstone: high_memory_usage and stateful blocker

Scenario: A student completes the study track and must transfer the autoscale_200pct principle into their own capstone project. Their scenario: under high memory load, the restart_pod rule permits a restart if readiness >= 23/25. In production it turns out that a stateful pod with an unverified backup (backup_verified=false) is restarted at readiness=24/25 and loses its state.

Challenge: The student copies the autoscale_200pct counterexample instead of formulating the principle. An entry about current_replicas and remaining_quota, fields irrelevant to restart_pod, appears in capstone/validation.md. The reviewer rejects it: the counterexample is not minimal and the next_guard does not apply to the domain. The student needs to build their own minimal counterexample and formulate a next_guard, preserving the structure but changing the content.

Solution: The student analyzes the structure of autoscale_200pct: the minimal counterexample contains only the fields without which the violation disappears; the next_guard formulates a new rule in Given/When/Then; the operational boundary turns a verbal constraint into a verifiable one. For high_memory_usage, the minimal counterexample is readiness=24/25, stateful=true, backup_verified=false. Minimality check: removing stateful=true or backup_verified=false removes the violation. next_guard: "Given stateful=true and backup_verified=false When readiness >= 23/25 Then dry-run blocked with diagnostic STATEFUL_BACKUP_REQUIRED". Operational boundary: restart_pod does not expand to namespace level.

Result: The student masters transferring the principle rather than copying. The entry in capstone/validation.md is accepted by the reviewer. When the rule is later expanded to namespace-level restart, the student automatically hits a conflict with the next_guard and is forced either to weaken the guard explicitly (with justification) or to preserve the restriction. This prevents implicit blast radius expansion.

Lessons learned: Transfer from the study case to the capstone means formulating the principle, not copying data

Minimality is checked per domain: for autoscale it is the quota, for restart_pod it is stateful + backup

A next_guard creates future friction: any rule expansion is forced to interact with the recorded constraint

The validation.md structure is universal: duel_id, assertion_id, counterexample, verdict, and next_guard apply across domains

Related concepts: next_guard

Counterexample minimality

Operational boundaries

validation.md

Study tips: Start with the offline run: cd book2/examples/tribunal && python3 scripts/run_duel.py. Seeing PASS in stderr is a quick win that motivates digging into the details

Before reading the theory, run the script and break it: remove the clamp_policy field from the input and see exactly where it fails. This shows why the schema matters more than the code

Use the "noisy field" technique: add cluster_id, labels, annotations to the counterexample and make sure the duel still finds the violation. Then remove fields one by one until you reach the minimum. This is a manual version of what the Verifier does automatically

Keep a parallel validation.md by hand before automating: write down duel_id, assertion_id, counterexample, next_guard on paper or in an editor. Compare with the lint_validation.py output to understand which fields the machine checks strictly and which it leaves to interpretation

For visual style: draw a five-step block diagram (Given/When/Then → Verifier → Implementor → Replay → validation.md) and hang it near your workspace. Walk through it for every new incident to develop muscle memory

For auditory style: dictate the Given/When/Then aloud before writing it. If the phrase does not fit in three lines, the specification is not yet ready for a duel

For kinesthetic style: role-play the duel with a colleague. One is the Verifier and finds a counterexample; the other is the Implementor and fixes it in 5 minutes. Timeboxing shows where the procedure gets stuck

Compare with completed material: return to Part 9 (verifiable facts) and Part 16 (human review). A counterexample in a duel is an automated review that happens before merge, not after

Defer production complexities: model rotation, tiers, and the Coordinator are material for Part 8. For now, focus on a single round for a single rule

Check yourself with the end-of-chapter questions: why minimality matters, why an explanation does not replace a proof, what the Implementor changes besides code, and where the error lies if only the code changes without the schema

Additional resources: examples/tribunal/readme.md: Local runnable duel analog with run instructions and expected output

examples/tribunal/specs/autoscale_spec.yaml: Study specification for autoscale with Given/When/Then and JSON Schema

examples/tribunal/cases/autoscale_counter_200pct.json: Minimal counterexample for the run

examples/tribunal/scripts/run_duel.py: Offline duel script for the study walkthrough

Book/part-09-feature-validation.md: Verifiable facts and "specification guides, facts permit merge"

Book/part-16-team-code-review.md: Independent human review of fact packages

GitHub Spec Kit: https://github.com/github/spec-kit (specification-first practice in SDD)

Wikipedia: formal specification: https://en.wikipedia.org/wiki/Formal_specification — Given/When/Then as formal specification

Judgment.example.md: Example verdict formatting with counterexample_id and assertion_id alignment

Summary: The LLM duel Verifier↔Implementor turns a formal specification into a manageable mechanism for checking incident decisions. The key result is not abstract text verification but a working four-step protocol: the incident scenario is linked to a JSON Schema, disputed conditions are checked by minimal counterexamples, operational limits become part of the specification, and every failure is recorded as a reproducible improvement in validation.md.

The main value lies in operational boundaries. Quota, rate limit, and blast radius move from verbal agreements into verifiable Then assertions, so automatic remediation no longer replaces safety with a formally correct but dangerous action.

The study minimum requires three artifacts: a Given/When/Then scenario linked to the schema; a minimal counterexample.json or an entry in validation.md; and a next_guard formulated in Given/When/Then form. The full track adds repair.patch, schema_delta, a neighbor counterexample matrix, and a local smoke pass.

Transfer to the capstone means formulating the principle, not copying. The structure of minimality and next_guard is taken from autoscale_200pct, while the content is built for the high_memory_usage domain or your own project.

Course: Production SDD for Qwen Code CLI. Part 2