Reading: Applied Part 4. LLM Duel: Verifier vs. Implementor in Formal Statements



Applied Part 4. LLM Duel: Verifier vs. Implementor on Formal Assertions

Status: Frontier. For a study run-through, the offline pass from examples/tribunal/ is enough: it shows how a single counterexample turns into a verifiable verdict. Real LLM roles, model rotation, and an external Coordinator are only needed for the full production track.

To avoid retelling the chapter before we begin, let's start with a single scenario. The appointments-api service is under load in the AgentClinic-production cluster. CPU load is 98%, there are 12 replicas, the quota allows adding 3 more, and the replica limit is 15. A webhook arrives: "increase the number of replicas by 200%." Formally the request is correct: all fields are filled in and the ranges are valid. But it cannot be executed, because the quota is insufficient and the limit will not allow it. The whole chapter revolves around this autoscale_200pct case. It is the same AgentClinic that we brought to MVP in part 12 of volume 1, except now under load.

Two response scenarios are possible. The first: the behavior rule is configured only for "formal correctness of input," and the autoscaler errors out mid-action. The second: the rule includes a separate check of operational boundaries — quotas, limit, blast radius — and the autoscaler either safely limits its step, or refuses with diagnostics. This chapter aims to teach the second: to bring the rule to a state in which it is not broken by a simple violating input.

The technique we will apply is called adversarial validation in the literature: one model searches for a minimal violating example, the other fixes the rule and the implementation until a stable PASS is reached. In this text we use a shorter name, LLM duel: a Verifier and an Implementor argue over files until the minimal counterexample (a specific input that passes the schema but breaks the declared rule) becomes part of the specification. In Qwen Code this is not a built-in command; the quality of the result depends on the choice of models, context length, protocol discipline, and the composition of roles.

The duel should not be confused with techniques from other chapters. The poisoned specification from chapter 2 tests whether you can create and fix a single requirements defect. The mutants from chapter 5 test whether the validator catches a whole class of defects. The duel tests a third thing: whether the Verifier can build a minimal counterexample to an already formulated rule, and whether the Implementor can close exactly that gap. In chapter 8 the same dispute will be formalized into a file arbitration procedure with a Coordinator, judgment.md, and precedents.md; here we need only one round on one rule.

The chapter rests on two ideas from volume 1: "the specification guides, facts admit to merging" from part 9, and independent human review of the fact package from part 16. There is one difference: the counterexample is built not by a human reviewer but by a second model, and it is built before merging, not after.

Before Reading

Foundation from volume 1: part 9 provides verifiable facts, part 16 provides independent review.
Local study case: autoscale_200pct, because the quota and replica limit yield a compact counterexample.
Trace for capstone/: one next_guard for high_memory_usage, for example a ban on bypassing a stateful blocker even with a good readiness score.
The main term of the first pass: counterexample. Roles (Verifier/Implementor/Safety) are examined in detail in part 8; a pair of Verifier–Implementor is enough here.
What to defer: model rotation, tiers, and the external Coordinator.

Goal

You will be able to introduce a Verifier↔Implementor LLM duel into an automated incident-management project. The goal is to bring a formal Given/When/Then specification to a state that is resilient to counterexample attacks.

The practical outcome is not an abstract text check, but a working protocol. It consists of four steps:

the incident scenario is linked to a JSON Schema;
disputed conditions are checked with minimal counterexamples;
operational limits become part of the specification;
every failure is recorded as a reproducible improvement in validation.md.

Minimal Study Scenario

Study Case

autoscale_200pct: a webhook asks to increase the number of replicas by 200%, but remaining_quota=3, and max_replicas=15. It is necessary to prove that the action is either limited by a safe allowed_delta, or blocked with diagnostics.

Preparation

book2/examples/tribunal/specs/autoscale_spec.yaml.
book2/examples/tribunal/cases/autoscale_counter_200pct.json.
Script book2/examples/tribunal/scripts/run_duel.py.

Steps

cd book2/examples/tribunal. Expectation: you are in the runnable example directory.
python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json. Expectation: out/duel.json is created with verdicts on the counterexamples.
Find the autoscale_counter_200pct case in out/duel.json. Expectation: it is clear which Then was checked and why the counterexample is admissible by the input schema.
Record the result in validation.md: duel_id, assertion_id, counterexample, verdict, next_guard.
Do not move on to full file arbitration. For this minimum, it is enough to show that one counterexample turns into a new checkable rule.

Checkable Fact

The counterexample contains only the fields needed for the violation: current replicas, quota, limit, and scaling percent. If extra fields are needed to explain it, the counterexample is not yet minimal.

How This Gets Into `capstone/`

Transfer to capstone/validation.md one duel_id, one assertion_id, the minimal counterexample, and next_guard. The runnable example uses autoscale_200pct, while the main graded case is high_memory_usage. The transfer is done not by copying the counterexample, but by formulating the principle.

What to take from `autoscale_200pct`	What to record in `capstone/validation.md` for `high_memory_usage`
Minimal counterexample: only the fields without which the violation disappears	Minimal counterexample to one `restart_pod` rule: `readiness=24/25`, `stateful=true`, `backup_verified=false`
`next_guard: duplicate_webhook_must_not_double_scale`	`next_guard: stateful + backup=false blocks dry-run even with readiness >= 23/25`
Operational boundary: `quota`, `blast-radius`	Operational boundary: `restart_pod` is not extended to namespace

Minimal fragment:

duel_id: duel-high-memory-001
assertion_id: HM-READINESS-01
counterexample: "readiness=24/25, stateful=true, backup_verified=false"
verdict: PASS
next_guard: "Given stateful=true and backup_verified=false When readiness >= 23/25 Then dry-run is blocked with diagnostics STATEFUL_BACKUP_REQUIRED"

Reviewable Trace

out/duel.json is a local result. Do not save it in the study package; save the entry in validation.md or a short precedent that states which guard appeared after the duel.

Key Ideas

Format the incident scenario in strict Given/When/Then. The minimal example fits in three lines:

Given: current_replicas=12, remaining_quota=3, max_replicas=15.
When: a webhook requests scale_up_percent=200.
Then: either the scaling fits within the limit, or the action is refused with diagnostics and without changing state.

Each field of Given and Then is later linked to a JSON Schema type and constraint; the schema itself is broken down into fragments below. A real rule will need more fields in Given (cluster, namespace, deduplication window, webhook source, trusted monitoring context) and in Then (diagnostic code, absence of repeated action in the deduplication window, preservation of the audit trail). Add them as the scenario grows, in response to the counterexamples found, not as a pre-filled template.

This format aligns with the "specification-first" practice in SDD (GitHub Spec Kit) and with user stories with Given/When/Then-style criteria (Wikipedia: Formal specification).

Set the rules of the duel before starting. Otherwise, the dispute between agents will quickly turn into negotiations about the meaning of requirements. Let's introduce the roles. Verifier is the role that searches for a minimal counterexample to the Then assertion. Implementor is the role that fixes the code and the rule after a failure. The Verifier wins if it builds a valid minimal counterexample: it satisfies the input schema, but violates the Then assertion. The Implementor wins only under two conditions: the code and the rule are updated; a re-run of the duel no longer finds the same class of failure and does not break existing invariants.

Minimality of a counterexample is a separate requirement. The counterexample must contain exactly those fields and values without which the violation disappears: not an arbitrary set of noisy conditions, but a tightly reduced example.

Bad:

> a counterexample with many noisy fields: cluster_id, namespace, labels, annotations, node_pool, region, current_replicas, remaining_quota, scale_up_percent, last_deploy_at, owner_team.

Problem: during a fix it is unclear which field actually breaks Then. The regression is not reproduced in a clean form.

Good:

> a minimal counterexample with only critical fields: current_replicas=12, remaining_quota=3, scale_up_percent=200.

For autoscale, four values are enough: current_replicas=12, remaining_quota=3, pod_cpu=1, scale_up_percent=200 (pod_cpu is included only when the quota condition divides remaining_quota by it). For reproducibility the Verifier publishes counterexample.json with the fields given_snapshot, when_payload, assertion_id, and minimality_trace. The Implementor responds with four artifacts: repair.patch, schema_delta, rationale, and a list of affected_assertions.
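The minimality requirement can be checked mechanically. The sketch below is only an illustration under stated assumptions: `violates_then` and `minimality_trace` are hypothetical helpers, not part of examples/tribunal/, and scale_up_percent=200 is read as "scale to 200% of the current count". It drops one field at a time and keeps only the fields whose removal makes the violation disappear.

```python
import json

# Hypothetical list of fields the unguarded rule reads (an assumption).
NEEDED = ("current_replicas", "remaining_quota", "pod_cpu", "scale_up_percent")


def violates_then(fields: dict) -> bool:
    """True when the unguarded rule would exceed the quota for this input."""
    if any(name not in fields for name in NEEDED):
        return False  # the violation cannot be demonstrated without this field
    current = fields["current_replicas"]
    requested_delta = current * fields["scale_up_percent"] // 100 - current
    return requested_delta > fields["remaining_quota"] // fields["pod_cpu"]


def minimality_trace(fields: dict) -> dict:
    """Drop one field at a time and record whether the violation survives."""
    trace = {}
    for name in fields:
        reduced = {k: v for k, v in fields.items() if k != name}
        trace[name] = "noise" if violates_then(reduced) else "required"
    return trace


# A candidate with one noisy field (region is an illustrative value).
candidate = {"current_replicas": 12, "remaining_quota": 3, "pod_cpu": 1,
             "scale_up_percent": 200, "region": "eu-west-1"}
trace = minimality_trace(candidate)
minimal = {k: v for k, v in candidate.items() if trace[k] == "required"}

counterexample = {
    "assertion_id": "allowed_delta_within_quota",
    "given_snapshot": {k: v for k, v in minimal.items() if k != "scale_up_percent"},
    "when_payload": {"scale_up_percent": minimal["scale_up_percent"]},
    "minimality_trace": trace,
}
print(json.dumps(counterexample, indent=2))
```

One-at-a-time removal only shows that no single field can be dropped; it is a cheap first filter, not a proof that no smaller combination exists.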

Fix operational boundaries as part of the specification, not as verbal team agreements. Let's list them:

quota,
rate limit,
blast radius,
deduplication,
repeat-action window,
maximum size of change.

Why this matters: if the schema checks only types, scale_up_percent can be a valid integer and still lead to inadmissible resource consumption.

Therefore, add conditions of the following kind to Then:

target_replicas <= max_replicas,
executed_delta <= remaining_quota / pod_cpu,
actions_per_window <= max_actions_per_window,
affected_services <= blast_radius_limit.

This shifts the check from a purely logical plane to an operational one. The system does not just "reason correctly." It proves that the action will not go beyond the safe radius.
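The four Then conditions can be evaluated as plain predicates. The following is a minimal sketch, assuming a hypothetical `Limits` container and a hypothetical `then_violations` helper; the values for max_actions_per_window and blast_radius_limit are made up for illustration, while the replica numbers come from autoscale_200pct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_replicas: int
    remaining_quota: int
    pod_cpu: float
    max_actions_per_window: int
    blast_radius_limit: int


def then_violations(target_replicas: int, executed_delta: int,
                    actions_per_window: int, affected_services: int,
                    limits: Limits) -> list[str]:
    """Return the Then conditions that the planned action would break."""
    checks = {
        "target_replicas <= max_replicas":
            target_replicas <= limits.max_replicas,
        "executed_delta <= remaining_quota / pod_cpu":
            executed_delta <= limits.remaining_quota / limits.pod_cpu,
        "actions_per_window <= max_actions_per_window":
            actions_per_window <= limits.max_actions_per_window,
        "affected_services <= blast_radius_limit":
            affected_services <= limits.blast_radius_limit,
    }
    return [name for name, ok in checks.items() if not ok]


# Unguarded execution of autoscale_200pct: 12 replicas scaled to 24.
limits = Limits(max_replicas=15, remaining_quota=3, pod_cpu=1,
                max_actions_per_window=1, blast_radius_limit=1)
broken = then_violations(target_replicas=24, executed_delta=12,
                         actions_per_window=1, affected_services=1,
                         limits=limits)
print(broken)
```

Returning the names of the broken conditions, rather than a single boolean, gives the Implementor a direct pointer to the assertion that needs a guard.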

Save each disputed run in validation.md as a chain of evidence, not as a free comment in a ticket. Include in the entry:

duel_id,
assertion_id,
the failing case,
the specification version before the fix,
the JSON Schema change,
the code change,
the new verdict,
a link to the duel test pass.

A separate next_guard field sets a new rule that must be checked in future runs. For example, "a repeated webhook within 2 seconds does not increase executed_delta." Such a journal turns a single incident into a catalog of precedents. If a similar error occurs again, CI can reproduce the old failing case and block the regression before merging.
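A journal entry is only useful as evidence if every listed field is actually filled in. The sketch below shows one possible completeness gate; the key names (`failing_case`, `spec_version_before`, `replay_link`, and so on) are hypothetical spellings of the fields listed above, and the function is not the project's lint_validation.py.

```python
# Hypothetical key names for the fields of a validation.md entry.
REQUIRED_FIELDS = (
    "duel_id",
    "assertion_id",
    "failing_case",
    "spec_version_before",
    "schema_delta",
    "code_change",
    "verdict",
    "replay_link",
    "next_guard",
)


def missing_fields(entry: dict) -> list[str]:
    """List required fields that are absent or blank in the entry."""
    return [name for name in REQUIRED_FIELDS
            if not str(entry.get(name, "")).strip()]


# A partially filled entry: the evidence chain is not yet complete.
entry = {
    "duel_id": "du-2026-001",
    "assertion_id": "allowed_delta_within_quota",
    "failing_case": "autoscale_counter_200pct",
    "verdict": "PASS",
    "next_guard": "duplicate_webhook_must_not_double_scale",
}
gaps = missing_fields(entry)
print(gaps)
```

A gate like this can fail the build while `gaps` is non-empty, so a PASS verdict without a schema change or a replay link does not reach the journal.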

Embed the duel into the study pipeline of the incident project so that each new incident automatically tightens the specification. A normalized webhook from PagerDuty or Grafana goes through four steps:

schema check (schema lint),
Given/When/Then validation,
Verifier↔Implementor duel,
replay of the history from validation.md after the fix.

What happens if the Verifier finds a new counterexample? The pipeline must not stop at a red status. It must require schema_delta, an update of the rule, and a repeated green pass. As a result, the project learns not from declarations but from verifiable traces: new incidents expand the verification matrix, strengthen the CI block, and reduce the space of implicit interpretations.

Examples and Application

flowchart TD
  A[Given/When/Then of the incident]
  B[Verifier: minimal counterexample]
  C[Implementor: limiting policy and schema fix]
  D[Duel replay]
  E[Entry in validation.md]
  A --> B --> C --> D --> E

The scenario is the same autoscale_200pct that we launched in "Minimal Study Scenario." Here we look at it from a different angle: how the Implementor closes the failure through JSON Schema, and not only through the rule. Reading scale_up_percent=200 as a target of 200% of the current count, the request needs 12 additional replicas (requested_delta=12, target_replicas=24). The quota allows adding only 3, and target_replicas=24 violates max_replicas=15. The Implementor responds with the formula allowed_delta = min(requested_delta, floor(remaining_quota / pod_cpu), max_replicas - current_replicas), which gives min(12, 3, 3) = 3 for pod_cpu=1, and with a policy of hard_block | soft_clamp. But a formula without a schema is still just a verbal agreement.
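Before turning to the schema, the formula can be checked against the case numbers. This is a sketch under assumptions, not the code from examples/tribunal/: the function name `decide` is hypothetical, scale_up_percent=200 is read as "scale to 200% of the current count", and the diagnostic code `BLOCKED_BY_LIMIT` for the hard_block branch is invented for illustration.

```python
from math import floor


def decide(current_replicas: int, max_replicas: int, remaining_quota: int,
           pod_cpu: float, scale_up_percent: int, clamp_policy: str) -> dict:
    """Apply allowed_delta and the clamp policy to one scale-up request."""
    # Assumption: the percent is a target relative to the current count.
    requested_delta = max(
        0, current_replicas * scale_up_percent // 100 - current_replicas)
    allowed_delta = max(0, min(requested_delta,
                               floor(remaining_quota / pod_cpu),
                               max_replicas - current_replicas))
    if allowed_delta == requested_delta:
        executed, code = requested_delta, None
    elif clamp_policy == "soft_clamp" and allowed_delta > 0:
        executed, code = allowed_delta, "QUOTA_EXCEEDED_AFTER_CLAMP"
    else:
        # hard_block, an unknown policy, or nothing left to add: refuse.
        executed, code = 0, "BLOCKED_BY_LIMIT"  # hypothetical code
    return {"requested_delta": requested_delta,
            "allowed_delta": allowed_delta,
            "executed_delta": executed,
            "diagnostic_code": code}


soft = decide(12, 15, 3, 1, 200, "soft_clamp")
hard = decide(12, 15, 3, 1, 200, "hard_block")
print(soft)
print(hard)
```

Treating any unrecognized policy value as a block keeps the failure mode safe: a typo in clamp_policy refuses the action instead of scaling.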

JSON Schema pins down the rule. To avoid getting confused in ten fields at once, let's view it in three short blocks: what identifies the source, what describes the current state, and what sets the response policy.

First, source identification. Without it, two identical requests from different monitoring systems cannot be distinguished:

{
  "cluster_id": {"type": "string", "minLength": 1},
  "source_service": {"type": "string", "enum": ["pagerduty", "grafana"]},
  "scale_up_percent": {"type": "integer", "minimum": 1, "maximum": 1000}
}

Next — the cluster state at the moment of the request. These are the fields the Verifier operates on when building a counterexample:

{
  "current_replicas": {"type": "integer", "minimum": 0},
  "pod_cpu": {"type": "number", "exclusiveMinimum": 0},
  "remaining_quota": {"type": "integer", "minimum": 0},
  "max_replicas": {"type": "integer", "minimum": 1}
}

Finally, the response policy. These are the fields the Implementor is forced to add after the very first counterexample, because without them the rule can only break:

{
  "max_actions_per_window": {"type": "integer", "minimum": 1},
  "clamp_policy": {"type": "string", "enum": ["hard_block", "soft_clamp"]}
}

Assembled, this is a single object with required: [cluster_id, source_service, scale_up_percent, current_replicas, pod_cpu, remaining_quota, max_replicas, max_actions_per_window, clamp_policy]. The main thing in it is not the number of fields, but the fact that the response policy is described on a par with the state.
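As an illustration of that assembly, the three fragments can be merged into one object schema. The variable names `SOURCE`, `STATE`, and `POLICY` are hypothetical; the properties and the order of the required list are taken from the fragments above.

```python
import json

# Source identification.
SOURCE = {
    "cluster_id": {"type": "string", "minLength": 1},
    "source_service": {"type": "string", "enum": ["pagerduty", "grafana"]},
    "scale_up_percent": {"type": "integer", "minimum": 1, "maximum": 1000},
}
# Cluster state at the moment of the request.
STATE = {
    "current_replicas": {"type": "integer", "minimum": 0},
    "pod_cpu": {"type": "number", "exclusiveMinimum": 0},
    "remaining_quota": {"type": "integer", "minimum": 0},
    "max_replicas": {"type": "integer", "minimum": 1},
}
# Response policy added after the first counterexample.
POLICY = {
    "max_actions_per_window": {"type": "integer", "minimum": 1},
    "clamp_policy": {"type": "string", "enum": ["hard_block", "soft_clamp"]},
}

properties = {**SOURCE, **STATE, **POLICY}
schema = {
    "type": "object",
    "properties": properties,
    # Every property is required; dict order preserves the block order.
    "required": list(properties),
}
print(json.dumps(schema, indent=2))
```

Deriving `required` from the merged properties keeps the two lists from drifting apart when the Implementor adds a field in a later schema_delta.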

After the fix, the Verifier must replay not only the original autoscale_200pct, but also neighboring cases:

missing cluster_id,
zero quota,
a repeated webhook inside the deduplication window,
remaining_quota=1 with current_replicas=max_replicas,
conflict of soft_clamp with blast_radius_limit.

This protects against a narrow patch that closes one example and leaves an equivalent failure next to it.

In CI such a run is represented as a sequence of commands. The first check validates the schema. The second runs the duel. The third requires an entry in the journal:

> Project scripts: lint_spec.py and lint_validation.py are project gates here; for a runnable analog of the duel, see examples/tribunal/README.md.

python3 scripts/spec_ci/lint_spec.py spec/incident-autoscale.md

python3 scripts/tribunal/run_duel.py \
  --scenario autoscale \
  --case autoscale_counter_200pct.json \
  --max-rounds 8 \
  --out .artifacts/duels/autoscale.json

python3 scripts/spec_ci/lint_validation.py \
  validation.md \
  --require next_guard

The validation.md fragment must be concrete enough that another agent or engineer can repeat the dispute without verbal explanations.

For example, the du-2026-001 entry stores:

the failing case autoscale_counter_200pct,
the old rule target_replicas = current_replicas + requested_delta,
the new rule with allowed_delta,
the chosen strategy soft_clamp,
the verdict PASS after the replay,
next_guard: duplicate_webhook_must_not_double_scale.

What if the Verifier and the Implementor do not converge after a given number of rounds? Here a third role steps in: the Coordinator, an arbiter who runs the duel protocol and records the outcome. The Coordinator marks the case DEFERRED and moves it to manual-review, but only with an explicit description of the disputed invariant. This prevents endless diagnostic cycles and leaves a point in the history to return to once the policy is clarified.
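The round limit and the DEFERRED outcome can be sketched as a bounded loop. Everything here is a hypothetical illustration: `run_rounds` and the toy Verifier and Implementor functions stand in for real model calls, and the result keys are invented for the example.

```python
from typing import Callable, Optional


def run_rounds(find_counterexample: Callable[[dict], Optional[dict]],
               repair: Callable[[dict, dict], dict],
               rule: dict, max_rounds: int, disputed_invariant: str) -> dict:
    """Alternate Verifier and Implementor until PASS or the round budget ends."""
    if not disputed_invariant.strip():
        raise ValueError("DEFERRED requires an explicit disputed invariant")
    for round_no in range(1, max_rounds + 1):
        counterexample = find_counterexample(rule)
        if counterexample is None:
            return {"verdict": "PASS", "rounds": round_no, "rule": rule}
        rule = repair(rule, counterexample)
    # Budget exhausted: hand the case to a human instead of looping on.
    return {"verdict": "DEFERRED", "route": "manual-review",
            "rounds": max_rounds, "disputed_invariant": disputed_invariant}


def toy_verifier(rule: dict) -> Optional[dict]:
    """Finds the counterexample until the rule carries a quota guard."""
    return None if rule.get("quota_guard") else {"case": "autoscale_counter_200pct"}


def toy_implementor(rule: dict, counterexample: dict) -> dict:
    return {**rule, "quota_guard": True}


def idle_implementor(rule: dict, counterexample: dict) -> dict:
    return rule  # explains, but changes nothing


invariant = "executed_delta <= remaining_quota / pod_cpu"
converged = run_rounds(toy_verifier, toy_implementor, {}, 8, invariant)
stuck = run_rounds(toy_verifier, idle_implementor, {}, 8, invariant)
print(converged["verdict"], stuck["verdict"])
```

PASS is returned only after a replay in which the Verifier finds nothing, so a repair made in the final round without a clean replay still ends as DEFERRED.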

Summary

The Verifier↔Implementor LLM duel turns a living specification into a manageable verification mechanism for incident decisions. Let's collect the roles step by step:

Given/When/Then sets the behavioral contract;
JSON Schema constrains the admissible input space;
the Verifier searches for a minimal counterexample;
the Implementor fixes the rule and the implementation;
validation.md records the failure as a regression asset.

The main value of the approach manifests itself in operational boundaries. Quota, rate limit, and blast radius become part of the checkable assertion. Therefore, automatic remediation does not replace safety with a formally correct but dangerous action. The next chapter will turn the duel into a stress-specification generator.

Artifacts and Readiness Criteria

The study minimum is three artifacts and three conditions under which they can be considered ready.

Artifact	Ready when
Given/When/Then scenario	covers one disputed requirement, the checkable fields are linked to JSON Schema
`counterexample.json` or entry in `validation.md`	the input is valid by schema and violates only the checked Then; the counterexample is minimal or explicitly marked as non-minimal
`next_guard`	a new rule is formulated in Given/When/Then form and will be checked after the fix

The full track adds the Implementor's repair.patch / schema_delta, a validation.md entry with duel_id and a link to the repeated run, a matrix of neighboring counterexamples, and a local smoke-pass of the runnable analog of the duel from examples/tribunal/. Consider the full track ready if the Implementor changes the rule and the contract (not just the explanation) and the repeated duel does not find the same class of failure.

Practice

cd book2/examples/tribunal && python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases --out out/duel.json. Expectation: stderr shows PASS autoscale_counter_200pct and PASS duplicate_webhook_within_dedup_window; in out/duel.json the autoscale_counter_200pct entry has verdict: "PASS", actual.diagnostic_code: "QUOTA_EXCEEDED_AFTER_CLAMP", and actual.allowed_delta: 3.
Open judgment.example.md and check that for autoscale_counter_200pct.json the field counterexample_id equals the file name without .json, and assertion_id equals allowed_delta_within_quota. Expectation: the identifiers are consistent: counterexample_id matches the file name, and assertion_id refers to the violated Then.
Move one line to capstone/validation.md: "counterexample <counterexample_id> violates Then <assertion_id>; added next_guard: <…>." Expectation: the counterexample name matches counterexample_id from out/duel.json, and next_guard is written in Given/When/Then form.

Check Questions

Why must a counterexample be minimal?
Why doesn't a free-form explanation replace a proof?
What must the Implementor change after a duel failure — besides the code?
The Verifier has found a counterexample, but the Implementor fixes only the code without editing JSON Schema. A week later a similar counterexample passes. Where is the error in the duel procedure?