Reading: Applied Part 4. LLM Duel: Verifier vs Implementer in Formal Statements

Lesson 1 of 5 in module «Applied Part 4. LLM Duel: Verifier vs Implementer in Formal Statements»
You are viewing the lesson without signing in. Sign in to save progress and take tests.

Applied Part 4. LLM Duel: Verifier vs. Implementor in Formal Assertions

Status: Frontier. For educational walkthrough, the offline run from examples/tribunal/ is sufficient: it demonstrates how one counterexample is turned into a verifiable verdict. Real LLM roles, model rotation, and an external Coordinator are only needed for the full production track.

To avoid retelling the chapter before we begin, let's start from one scenario. In the AgentClinic-production cluster, the appointments-api service is loaded. CPU load is 98%, replicas are 12, quota allows adding 3 more, replica limit is 15. A webhook arrives: "increase replica count by 200%". Formally the request is correct — all fields are filled, ranges are valid. But executing it is impossible: quota won't suffice, limit won't allow. For the rest of the chapter, everything will revolve around this autoscale_200pct — the same AgentClinic that we brought to MVP in Part 12 of the first volume, only now under load.

Two reaction scenarios are possible. First: the behavior rule is configured only for "formal correctness of input", and the autoscaler errors out mid-action. Second: the rule has a separate check of operational boundaries — quota, limit, blast radius — and the autoscaler either safely constrains its step or refuses with diagnostics. This chapter needs to teach the second: bring the rule to a state where a simple violating input cannot break it.

The technique we will apply for this is called adversarial validation in the literature: one model searches for a minimal violating example, the second fixes the rule and implementation until a stable PASS. In shorter text — LLM duel: Verifier and Implementor argue over files until the minimal counterexample — a specific input that passes the schema but breaks the stated rule — becomes part of the specification. In Qwen Code this is not a built-in command; result quality depends on model choice, context length, protocol discipline, and role composition.

This chapter should not be confused with other techniques. Poisoned specification from Chapter 2 checks whether you can create and fix one requirements defect. Mutants from Chapter 5 check whether the validator catches an entire class of defects. The duel checks a third thing: whether the Verifier can construct a minimal counterexample to an already formulated rule, and whether the Implementor can close exactly that gap. In Chapter 8 the same dispute will be formalized as a file arbitration procedure with Coordinator, judgment.md and precedents.md; here we only need one round on one rule.

The chapter builds on two ideas from the first volume: "specification directs, facts permit merge" from Part 9 and independent human review of a fact package from Part 16. The difference is one thing: the counterexample is built not by a human reviewer, but by a second model, and it does so before merge, not after.

Before Reading

  • Foundation from the first volume: Part 9 gives verifiable facts, Part 16 gives independent review.
  • Local educational case: autoscale_200pct, because quota and replica limit provide a compact counterexample.
  • Trail for capstone/: one next_guard for high_memory_usage, for example a prohibition on bypassing a stateful blocker even with a good readiness score.
  • Main term for first pass: counterexample. Roles (Verifier/Implementor/Safety) are analyzed in detail in Part 8; here the Verifier–Implementor pair is sufficient.
  • What to defer: model rotation, tiers, and external Coordinator.

Goal

You will be able to implement a Verifier↔Implementor LLM duel in an automatic incident management project. The goal is to bring a formal Given/When/Then specification to a state resistant to counterexample attacks.

The practical result is not an abstract text check, but a working protocol. It consists of four steps:

  • incident scenario is linked to JSON Schema;
  • disputed conditions are checked with minimal counterexamples;
  • operational limits become part of the specification;
  • every failure is recorded as a reproducible improvement in validation.md.

Minimal Educational Scenario

Educational Case

autoscale_200pct: a webhook asks to increase replica count by 200%, but remaining_quota=3 and max_replicas=15. Need to prove that the action either constrains itself to a safe allowed_delta, or is blocked with diagnostics.

Preparation

  • book2/examples/tribunal/specs/autoscale_spec.yaml.
  • book2/examples/tribunal/cases/autoscale_counter_200pct.json.
  • Script book2/examples/tribunal/scripts/run_duel.py.

Steps

  1. cd book2/examples/tribunal. Expected: you are in the runnable example directory.
  2. python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json. *Expected: out/duel.json is created with verdicts on counterexamples.*
  3. Find the autoscale_counter_200pct case in out/duel.json. Expected: you can see which Then was checked and why the counterexample is admissible by input schema.
  4. Rewrite the output into validation.md: duel_id, assertion_id, counterexample, verdict, next_guard.
  5. Do not proceed to full file arbitration. In this minimum, only proving that one counterexample turns into a new verifiable rule matters.

Control Fact

A counterexample contains only the fields necessary for violation: current replicas, quota, limit, and scaling percentage. If extra fields are needed for explanation, the counterexample is not yet minimal.

How This Gets into capstone/

Transfer to capstone/validation.md one duel_id, one assertion_id, a minimal counterexample, and next_guard. The runnable example uses autoscale_200pct, while the main graded case is high_memory_usage. Transfer is done not by copying the counterexample, but by formulating the principle.

What to take from autoscale_200pctWhat to write in capstone/validation.md for high_memory_usage
Minimal counterexample: only fields without which the violation disappearsMinimal counterexample to one restart_pod rule: readiness=24/25, stateful=true, backup_verified=false
next_guard: duplicate_webhook_must_not_double_scalenext_guard: stateful + backup=false blocks dry-run even with readiness >= 23/25
Operational boundary: quota, blast-radiusOperational boundary: restart_pod does not expand to namespace

Minimal fragment:

duel_id: duel-high-memory-001
assertion_id: HM-READINESS-01

counterexample: "readiness=24/25, stateful=true, backup_verified=false"
verdict: PASS
next_guard: "Given stateful=true and backup_verified=false When readiness >= 23/25 Then dry-run is blocked with diagnostic STATEFUL_BACKUP_REQUIRED"

Reviewable Trail

out/duel.json is a local result. In the educational package, save not it, but a record in validation.md or a short precedent indicating which guard appeared after the duel.

Key Ideas

Format the incident scenario in strict Given/When/Then. A minimal example is enough to keep in three lines:

  • Given: current_replicas=12, remaining_quota=3, max_replicas=15.
  • When: webhook requests scale_up_percent=200.
  • Then: either scaling fits within the limit, or the action is refused with diagnostics and without changing state.

Each Given and Then field is later linked to a type and JSON Schema constraint; the schema itself is analyzed below in fragments. The full list of fields that in a real rule will go into Given (cluster, namespace, deduplication window, webhook source, trusted monitoring context), and into Then (diagnostic code, no repeated action within deduplication window, audit trail preserved) — add as the scenario grows, not as a pre-filled template, but as a reaction to found counterexamples.

This format aligns with the "specification-first" practice in SDD (GitHub Spec Kit) and with user stories with Given/When/Then criteria (Wikipedia: Formal specification).

Set duel rules before launch. Otherwise the dispute between agents will quickly turn into negotiations about the meaning of requirements. Let's introduce roles. Verifier — a role that searches for a minimal counterexample to a Then assertion. Implementor — a role that fixes code and rule after failure. The Verifier wins if it constructs a valid minimal counterexample: it satisfies the input schema but violates the Then assertion. The Implementor wins only under two conditions: code and rule are updated; the rerun of the duel no longer finds the same class of failure and does not break existing invariants.

Counterexample minimality is a separate requirement. A counterexample must contain exactly those fields and values without which the violation disappears. Not an arbitrary set of noisy conditions, but a narrow squeezing example.

Bad:

> counterexample with many noisy fields: cluster_id, namespace, labels, annotations, node_pool, region, current_replicas, remaining_quota, scale_up_percent, last_deploy_at, owner_team.

Problem: when fixing, it's unclear which field actually breaks Then. Regression is not reproduced in clean form.

Good:

> minimal counterexample with only critical fields: current_replicas=12, remaining_quota=3, scale_up_percent=200.

For example, for autoscale current_replicas=12, remaining_quota=3, pod_cpu=1, scale_up_percent=200 is sufficient. For reproducibility the Verifier publishes counterexample.json with fields given_snapshot, when_payload, assertion_id, minimality_trace. The Implementor responds with four artifacts: repair.patch, schema_delta, rationale, and a list of affected_assertions.

Record operational boundaries as part of the specification, not as team verbal agreements. Let's list them:

  • quota (quota),
  • rate limiting (rate-limit),
  • blast radius (blast-radius),
  • deduplication,
  • repeated action window,
  • maximum change size.

Why this matters. If the schema only checks types, then scale_up_percent can be an integer and simultaneously lead to unacceptable resource expenditure.

Therefore add conditions to Then like:

  • target_replicas <= max_replicas,
  • executed_delta <= remaining_quota / pod_cpu,
  • actions_per_window <= max_actions_per_window,
  • affected_services <= blast_radius_limit.

This shifts verification from purely logical plane to operational. The system does not just "reason correctly". It proves that the action will not exceed the safe radius.

Save every disputed run in validation.md as a proof chain, not as a free-form ticket comment. Include in the record:

  • duel_id,
  • assertion_id,
  • failing case,
  • specification version before fix,
  • JSON Schema change,
  • code change,
  • new verdict,
  • link to duel test pass.

A separate next_guard field sets a new rule that must be checked in future runs. For example, "a repeated webhook within 2 seconds does not increase executed_delta". Such a log turns a single incident into a catalog of precedents. If a similar error occurs again, CI will be able to reproduce the old failing case and block regression before merge.

Embed the duel into the educational incident project pipeline so that every new incident automatically tightens the specification. A normalized webhook from PagerDuty or Grafana goes through four steps:

  1. schema check (schema lint),
  2. Given/When/Then validation,
  3. Verifier↔Implementor duel,
  1. history replay from validation.md after fix.

What happens if the Verifier found a new counterexample. The pipeline should not be limited to a red status. It must require schema_delta, rule update, and repeated green pass. As a result the project learns not from declarations, but from verifiable trails: new incidents expand the verification matrix, strengthen CI blocking, and reduce the space of implicit interpretations.

Examples and Application

flowchart TD
  A[Given/When/Then of incident]
  B[Verifier: minimal counterexample]
  C[Implementor: constraint policy and schema fix]
  D[Duel replay]
  E[Record in validation.md]
  A --> B --> C --> D --> E

The scenario is the same autoscale_200pct that we launched in "Minimal Educational Scenario". Here we look at it from another angle: how the Implementor closes the failure through JSON Schema, not only through the rule. The requested increase requires 12 additional replicas, quota only allows adding 3, and target_replicas=24 violates max_replicas=15. The Implementor responds with the formula allowed_delta = min(requested_delta, floor(remaining_quota / pod_cpu), max_replicas - current_replicas) and a hard_block | soft_clamp policy. But a formula without a schema is still a verbal agreement.

JSON Schema anchors the rule. To avoid confusion in ten fields at once, let's look at it in three short blocks: what identifies the source, what describes current state, and what sets the response policy.

First, source identification. Without it two identical requests from different monitoring systems cannot be distinguished:

{
  "cluster_id": {"type": "string", "minLength": 1},
  "source_service": {"type": "string", "enum": ["pagerduty", "grafana"]},
  "scale_up_percent": {"type": "integer", "minimum": 1, "maximum": 1000}
}

Next — cluster state at request time. These are the fields the Verifier operates with when building a counterexample:

{
  "current_replicas": {"type": "integer", "minimum": 0},
  "pod_cpu": {"type": "number", "exclusiveMinimum": 0},
  "remaining_quota": {"type": "integer", "minimum": 0},
  "max_replicas": {"type": "integer", "minimum": 1}
}

Finally, response policy. These are the fields the Implementor is forced to add after the very first counterexample, because without them the rule can only break:

{
  "max_actions_per_window": {"type": "integer", "minimum": 1},
  "clamp_policy": {"type": "string", "enum": ["hard_block", "soft_clamp"]}
}

In assembled form this is one object with required: [cluster_id, source_service, scale_up_percent, current_replicas, pod_cpu, remaining_quota, max_replicas, max_actions_per_window, clamp_policy]. The main thing in it is not the number of fields, but that the response policy is described on par with state.

After the fix, the Verifier must replay not only the original autoscale_200pct, but also neighboring cases:

  • missing cluster_id,
  • zero quota,
  • repeated webhook within deduplication window,
  • remaining_quota=1 with current_replicas=max_replicas,
  • soft_clamp conflict with blast_radius_limit.

This protects against a narrow patch that closes one example and leaves an equivalent failure nearby.

In CI such a run is represented as a sequence of commands. The first check validates the schema. The second runs the duel. The third requires a journal record:

> [project script]lint_spec.py and lint_validation.py are project gateways here; runnable duel analog see in examples/tribunal/README.md.

python3 scripts/spec_ci/lint_spec.py spec/incident-autoscale.md

python3 scripts/tribunal/run_duel.py \
  --scenario autoscale \
  --case autoscale_counter_200pct.json \
  --max-rounds 8 \
  --out .artifacts/duels/autoscale.json

python3 scripts/spec_ci/lint_validation.py \
  validation.md \
  --require next_guard

A validation.md fragment must be specific enough that another agent or engineer can repeat the dispute without verbal explanations.

For example, record du-2026-001 stores:

  • failing case autoscale_counter_200pct,
  • old rule target_replicas = current_replicas + requested_delta,
  • new rule with allowed_delta,
  • chosen strategy soft_clamp,
  • verdict PASS after replay,
  • next_guard: duplicate_webhook_must_not_double_scale.

What to do if the Verifier and Implementor do not converge after the set number of rounds. Here another role connects — Coordinator, an arbiter that maintains the duel protocol and fixes the outcome. The Coordinator marks DEFERRED and transfers the case to manual review. It does this only with an explicit description of the disputed invariant. This prevents endless diagnostic loops and leaves a point in history to which one can return after policy clarification.

Summary

The Verifier↔Implementor LLM duel makes a living specification a manageable mechanism for checking incident decisions. Let's gather roles by steps:

  • Given/When/Then sets the behavioral contract;
  • JSON Schema constrains the admissible input space;
  • Verifier searches for a minimal counterexample;
  • Implementor fixes the rule and implementation;
  • validation.md saves the failure as a regression asset.

The main value of the approach manifests in operational boundaries. Quota, rate limiting, and blast radius become part of a verifiable assertion. Therefore automatic remediation does not substitute safety with a formally correct but dangerous action. The next chapter will translate the duel into a stress specification generator.

Artifacts and Readiness Criteria

Educational minimum — three artifacts and three conditions by which they can be considered ready.

ArtifactReady when
Given/When/Then scenariocovers one disputed requirement, verifiable fields are linked to JSON Schema
counterexample.json or record in validation.mdinput is valid by schema and violates only the checked Then; counterexample is minimal or explicitly marked as not minimal
next_guardnew rule is formulated in Given/When/Then form and will be checked after the fix

Full track adds repair.patch / schema_delta from the Implementor, validation.md record with duel_id and link to rerun, matrix of neighboring counterexamples, and local smoke-pass of the runnable duel analog from examples/tribunal/. Consider the full track ready if the Implementor changes the rule and contract (not only explanation) and the repeated duel no longer finds the same class of failure.

Practice

  1. cd book2/examples/tribunal && python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases --out out/duel.json — *expected: stderr shows PASS autoscale_counter_200pct and PASS duplicate_webhook_within_dedup_window; in out/duel.json for autoscale_counter_200pct field verdict: "PASS", actual.diagnostic_code: "QUOTA_EXCEEDED_AFTER_CLAMP", actual.allowed_delta: 3.*
  2. Open judgment.example.md and verify that for autoscale_counter_200pct.json the counterexample_id field equals the filename without .json, and assertion_id equals allowed_delta_within_quota. *Expected: identifiers are consistent — counterexample_id matches filename, assertion_id references the violated Then.*
  1. Transfer to capstone/validation.md one line: "counterexample <counterexample_id> violates Then <assertion_id>; added next_guard: <…>". *Expected: counterexample name matches counterexample_id from out/duel.json, next_guard formulation is written in Given/When/Then form.*

Control Questions

  1. Why must a counterexample be minimal?
  2. Why does free-form explanation not replace proof?
  3. What must the Implementor change after a duel failure — besides code?
  4. The Verifier found a counterexample, but the Implementor only fixes code without updating JSON Schema. A week later a similar counterexample passes. Where is the error in the duel procedure?
My notes
0 / 10000

Notes are saved in this browser. They will not appear on another device.

Course menu

Course

Production SDD for Qwen Code CLI. Part 2
Progress 0 / 100