Study guide: Applied Part 4. LLM Duel: Verifier vs. Implementer in Formal Statements

Lesson 3 of 5 in module "Applied Part 4. LLM Duel: Verifier vs. Implementer in Formal Statements"


Topic: Applied Part 4. LLM Duel: Verifier vs. Implementer in Formal Statements

Difficulty level: Medium

Estimated study time: 3-4 hours

Prerequisites: Basic understanding of JSON Schema and data validation formats

Familiarity with the BDD (Behavior-Driven Development) approach and Given/When/Then syntax

Experience working with REST APIs, webhooks, and monitoring systems (e.g., Prometheus, Grafana, PagerDuty)

Understanding of the principles of automatic scaling (autoscaling) and incident management

Learning objectives: Describe the roles of the Verifier and the Implementer in the process of adversarial validation (LLM duels).

Learn how to formulate minimal counterexamples to check the operational boundaries of a specification.

Master the practice of extending JSON Schema by incorporating operational limits (quotas, blast-radius).

Implement a validation.md log protocol to record precedents and new rules (next_guard).

Set up a local CI pipeline for automatic specification verification via run_duel.py.

Overview: This module is dedicated to the methodology of adversarial validation of formal specifications using LLMs — the so-called "LLM duels." In real automatic incident resolution systems, incoming data (for example, webhooks) can be formally correct but lead to catastrophic consequences due to violations of operational boundaries. In this course, we examine an architecture in which one language model (the Verifier) attempts to find a minimal destructive counterexample, while the second (the Implementer) fixes the rules and JSON Schema so that the system robustly rejects dangerous actions. Using the autoscale_200pct case study, you will learn how to move verification from a purely logical plane to the operational plane.

Key concepts: LLM duel: A methodology of adversarial validation in which roles are split between two LLMs: the Verifier searches for vulnerabilities (counterexamples), and the Implementer fixes the specification and code.

Minimal counterexample: Input data that contains exactly those fields and values without which the violation disappears. It is valid according to JSON Schema but violates the approved Then rule. Using a minimal counterexample avoids regressions and precisely localizes the issue.
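The removal test behind this definition can be sketched as a one-pass greedy reduction. This is only an illustration under assumptions: the `violates_quota` predicate, the `minimize` helper, and the noise value `"prod"` are hypothetical and are not taken from the course's `run_duel.py`.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def violates_quota(payload: Payload) -> bool:
    """Hypothetical Then-rule check: True when the requested step exceeds the quota."""
    try:
        requested = payload["current_replicas"] * payload["scale_up_percent"] // 100
        return requested > payload["remaining_quota"]
    except KeyError:
        # Without one of these fields the violation cannot be reproduced.
        return False


def minimize(payload: Payload, violates: Callable[[Payload], bool]) -> Payload:
    """Drop every field whose removal still leaves the violation in place."""
    minimal = dict(payload)
    for key in list(payload):
        candidate = {k: v for k, v in minimal.items() if k != key}
        if violates(candidate):
            minimal = candidate  # the field was noise
    return minimal


noisy = {
    "namespace": "prod",  # illustrative metadata, assumed value
    "service": "appointments-api",
    "current_replicas": 12,
    "remaining_quota": 3,
    "scale_up_percent": 200,
}
minimal_case = minimize(noisy, violates_quota)
print(minimal_case)
```

Only the three fields that the predicate actually reads survive; the metadata is dropped because the violation persists without it.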

Operational boundaries: Limitations of real infrastructure (quota, rate-limit, blast radius, deduplication) that must be formalized in the specification on par with data types.

Given/When/Then: A strict format for recording a behavioral contract: Given (the initial state), When (the incoming event, e.g., a webhook), Then (the expected result or safeguard).

validation.md: A log file that stores the history of duels. Each entry contains duel_id, assertion_id, counterexample, verdict, and the generated next_guard.

Next guard: A new safety rule formulated after a successful counterexample, which the system must check on all subsequent runs.

Coordinator: An arbiter in the LLM-duel process that is engaged if the Verifier and the Implementer cannot reach an agreement within a given number of rounds, and moves the incident to DEFERRED status.
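The interplay of the three roles can be sketched as a bounded loop. This is a toy model under assumptions: the round limit of 3, the callable signatures, and the `quota_guard` flag are all hypothetical, and real Verifier and Implementer roles would be LLM calls rather than the stub functions used here.

```python
from typing import Callable, Dict, Optional, Tuple

Spec = Dict[str, bool]
Verifier = Callable[[Spec], Optional[dict]]
Implementer = Callable[[Spec, dict], Spec]


def run_duel(spec: Spec, verifier: Verifier, implementer: Implementer,
             max_rounds: int = 3) -> Tuple[str, Spec]:
    """Alternate attack and patch; hand over to the Coordinator when rounds run out."""
    for _ in range(max_rounds):
        counterexample = verifier(spec)
        if counterexample is None:
            return "PASS", spec  # the Verifier found nothing to break
        spec = implementer(spec, counterexample)
    # Rounds exhausted: give the last patch one final check.
    if verifier(spec) is None:
        return "PASS", spec
    return "DEFERRED", spec  # Coordinator outcome: no agreement reached


def toy_verifier(spec: Spec) -> Optional[dict]:
    """Stub: attacks any spec that lacks the (hypothetical) quota_guard flag."""
    if spec.get("quota_guard"):
        return None
    return {"current_replicas": 12, "remaining_quota": 3, "scale_up_percent": 200}


def fixing_implementer(spec: Spec, counterexample: dict) -> Spec:
    return {**spec, "quota_guard": True}


def stubborn_implementer(spec: Spec, counterexample: dict) -> Spec:
    return spec  # never patches, so the duel cannot converge


fixed_verdict, fixed_spec = run_duel({}, toy_verifier, fixing_implementer)
stuck_verdict, _ = run_duel({}, toy_verifier, stubborn_implementer)
print(fixed_verdict, stuck_verdict)  # PASS DEFERRED
```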

Practice exercises: Name: Running a training LLM-duel pass

Problem: You need to check the autoscale_spec.yaml specification for resilience to an attack with a request to increase replicas by 200%. Run the local duel script and analyze the baseline verdict.

Solution:

1. Open a terminal and navigate to the example directory: cd book2/examples/tribunal.

2. Run the command: python3 scripts/run_duel.py --spec specs/autoscale_spec.yaml --cases cases/ --out out/duel.json.

3. Open the generated file out/duel.json, find the autoscale_counter_200pct case, and make sure its verdict is PASS. If the verdict is FAIL, the specification has not yet been patched; analyze the failure.

Complexity: beginner
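Step 3 of the solution above can also be done programmatically. The layout of out/duel.json is not documented in this lesson, so the sketch below deliberately assumes very little: it walks any JSON tree looking for an object that mentions the case id and carries a `verdict` key. The `find_verdict` helper and the sample document are hypothetical.

```python
import json
from typing import Any, Optional


def find_verdict(node: Any, case_id: str) -> Optional[str]:
    """Depth-first search for a verdict attached to case_id, in any nesting."""
    if isinstance(node, dict):
        keyed = node.get(case_id)
        if isinstance(keyed, dict) and "verdict" in keyed:
            return keyed["verdict"]  # shape: {"<case_id>": {"verdict": ...}}
        if "verdict" in node and any(value == case_id for value in node.values()):
            return node["verdict"]  # shape: {"case": "<case_id>", "verdict": ...}
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_verdict(child, case_id)
        if found is not None:
            return found
    return None


# Assumed sample; for a real run, load the file instead:
#   with open("out/duel.json", encoding="utf-8") as fh:
#       report = json.load(fh)
report = json.loads(
    '{"results": [{"case": "autoscale_counter_200pct", "verdict": "FAIL"}]}'
)
verdict = find_verdict(report, "autoscale_counter_200pct")
print(verdict)  # FAIL
```

If the real file uses a different key than `verdict`, the helper returns None, which is a signal to inspect the file by hand.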

Name: Formulating a minimal counterexample

Problem: A webhook arrived in the system to restart a pod with the parameters: readiness=24/25, stateful=true, backup_verified=false. Formulate a minimal counterexample in JSON format that proves that the dry-run must be blocked.

Solution: A minimal counterexample keeps only the fields that directly affect the safety logic: { "readiness": 24, "stateful": true, "backup_verified": false }. Fields such as namespace and pod_id are excluded: removing them does not make the violation (an attempt at a dry-run of a stateful pod without a verified backup) disappear, and the analysis becomes more focused.

Complexity: intermediate

Name: Integrating next_guard into validation.md

Problem: The Implementer successfully defended the specification against repeated webhook triggering (deduplication). Record the duel result in validation.md in Given/When/Then format, using the principles described in the lesson.

Solution: Add the following entry to validation.md:

assertion_id: DEDUP-SCALE-01
counterexample: "two webhooks with scale_up_percent=100 arrive with an interval of 1 second"
verdict: PASS
next_guard: "Given a deduplication window of 2 seconds When a duplicate scaling webhook is received Then executed_delta does not increase again and the diagnostic code DUPLICATE_WEBHOOK_IGNORED is returned"

Complexity: advanced
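Writing such entries by hand is error-prone, so a small helper can render them. The sketch below assumes a plain `key: value` layout with one field per line; the lesson names the fields but not the exact file syntax, and the `DuelRecord` class, the `append_entry` function, and the `duel-0001` identifier are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class DuelRecord:
    duel_id: str
    assertion_id: str
    counterexample: str
    verdict: str
    next_guard: str

    def to_entry(self) -> str:
        """Render one 'name: value' line per field, in declaration order."""
        return "\n".join(f"{f.name}: {getattr(self, f.name)}" for f in fields(self))


def append_entry(path: str, record: DuelRecord) -> None:
    """Append the entry to the log, separated from the next one by a blank line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(record.to_entry() + "\n\n")


record = DuelRecord(
    duel_id="duel-0001",  # hypothetical identifier
    assertion_id="DEDUP-SCALE-01",
    counterexample="two webhooks with scale_up_percent=100 arrive with an interval of 1 second",
    verdict="PASS",
    next_guard=(
        "Given a deduplication window of 2 seconds "
        "When a duplicate scaling webhook is received "
        "Then executed_delta does not increase again and the diagnostic code "
        "DUPLICATE_WEBHOOK_IGNORED is returned"
    ),
)
entry = record.to_entry()
print(entry)
```

Calling `append_entry("validation.md", record)` would then add the entry to the log without touching earlier precedents.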

Case studies: Name: Critical incident in AgentClinic-production: Autoscale 200%

Scenario: A service called appointments-api is running in the cluster. The current CPU load is 98%, and 12 replicas are running. The quota allows adding 3 more (the cluster limit is 15 replicas). At this moment, the automation system sends a webhook: "increase the number of replicas by 200%".

Challenge: Formally, the input is valid: the scale_up_percent field is well-formed and the types match. However, executing this command requests 24 additional replicas (200% of 12) against a remaining quota of 3, which would exhaust the quota, violate the limits, and cause the service to fail in the middle of the scaling operation.
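The gap between structural and operational validity can be shown with a few lines of arithmetic. The sketch assumes that the webhook and the cluster state are bundled into one payload; the `TYPE_SCHEMA` mapping and the `type_errors` helper are hypothetical stand-ins for a type-only schema check.

```python
from typing import Any, Dict, List

# Hypothetical type-only schema: field name -> required Python type.
TYPE_SCHEMA = {"current_replicas": int, "remaining_quota": int, "scale_up_percent": int}


def type_errors(payload: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are missing or have the wrong type."""
    return [
        name
        for name, expected in TYPE_SCHEMA.items()
        if not isinstance(payload.get(name), expected)
    ]


webhook = {"current_replicas": 12, "remaining_quota": 3, "scale_up_percent": 200}

# 200% of 12 running replicas means 24 additional replicas.
requested_delta = webhook["current_replicas"] * webhook["scale_up_percent"] // 100

structurally_valid = not type_errors(webhook)
exceeds_quota = requested_delta > webhook["remaining_quota"]
print(structurally_valid, requested_delta, exceeds_quota)  # True 24 True
```

The payload passes every type check and still asks for eight times the remaining quota, which is exactly the class of defect a type-only schema cannot see.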

Solution: The LLM-duel technique was applied. The Verifier generated a minimal counterexample: { "current_replicas": 12, "remaining_quota": 3, "scale_up_percent": 200 }. The Implementer added operational boundaries to the JSON Schema and to the logic: the formula allowed_delta = min(requested_delta, floor(remaining_quota / pod_cpu), max_replicas - current_replicas) and a clamp_policy field with the values hard_block and soft_clamp.

Result: The autoscaler stopped breaking on infrastructure-invalid (but formally correct) requests. When a request for 200% arrived, the system safely clamped the step to +3 replicas (soft_clamp) and recorded the diagnostic code QUOTA_EXCEEDED_AFTER_CLAMP in the audit trail.
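The formula and the two clamp policies can be sketched as follows. Several details are assumptions rather than facts from the case: `pod_cpu` is not defined in the scenario, so it is set to 1 here (the quota of 3 is read as room for 3 pods); the lower bound of 0 is added for safety; and the codes `OK` and `QUOTA_EXCEEDED_BLOCKED` are hypothetical. Only `QUOTA_EXCEEDED_AFTER_CLAMP` comes from the case study.

```python
from math import floor
from typing import Tuple


def allowed_delta(requested_delta: int, remaining_quota: float, pod_cpu: float,
                  max_replicas: int, current_replicas: int) -> int:
    """min(requested, quota-derived headroom, replica headroom), never below 0."""
    by_quota = floor(remaining_quota / pod_cpu)
    by_limit = max_replicas - current_replicas
    return max(0, min(requested_delta, by_quota, by_limit))


def apply_scale(current_replicas: int, scale_up_percent: int, remaining_quota: float,
                max_replicas: int, pod_cpu: float = 1.0,
                clamp_policy: str = "soft_clamp") -> Tuple[int, str]:
    """Return (replicas to add, diagnostic code) for a scale-up webhook."""
    requested = current_replicas * scale_up_percent // 100
    allowed = allowed_delta(requested, remaining_quota, pod_cpu,
                            max_replicas, current_replicas)
    if requested <= allowed:
        return requested, "OK"  # hypothetical code for the in-bounds path
    if clamp_policy == "hard_block":
        return 0, "QUOTA_EXCEEDED_BLOCKED"  # hypothetical code
    if clamp_policy == "soft_clamp":
        return allowed, "QUOTA_EXCEEDED_AFTER_CLAMP"
    raise ValueError(f"unknown clamp_policy: {clamp_policy}")


soft = apply_scale(12, 200, remaining_quota=3, max_replicas=15)
hard = apply_scale(12, 200, remaining_quota=3, max_replicas=15, clamp_policy="hard_block")
print(soft)  # (3, 'QUOTA_EXCEEDED_AFTER_CLAMP')
print(hard)  # (0, 'QUOTA_EXCEEDED_BLOCKED')
```

With the scenario's numbers, min(24, 3, 3) yields the +3 step described in the result, while hard_block refuses the request outright.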

Lessons learned: Formal schema validation is not enough for safe automation; quotas and limits must be part of the specification.

A minimal counterexample allows the system to be tested for resilience to specific classes of failures without data noise.

The result of a duel should automatically become a new rule (next_guard) for the CI pipeline.
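As an illustration of a next_guard turned into an executable check, the deduplication guard from the third practice exercise might look like the sketch below. The handler class, the dedup key (service plus scale_up_percent), the `EXECUTED` code, and the delta of 3 are assumptions; only the 2-second window and the `DUPLICATE_WEBHOOK_IGNORED` code come from the exercise.

```python
from typing import Dict, Tuple

DEDUP_WINDOW_SECONDS = 2.0  # from the next_guard: "a deduplication window of 2 seconds"


class ScaleWebhookHandler:
    """Toy handler that ignores repeated scaling webhooks inside the window."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS) -> None:
        self.window = window
        self.executed_delta = 0
        self._last_executed: Dict[Tuple[str, int], float] = {}

    def handle(self, service: str, scale_up_percent: int, delta: int, now: float) -> str:
        key = (service, scale_up_percent)  # assumed notion of "duplicate"
        last = self._last_executed.get(key)
        if last is not None and now - last < self.window:
            return "DUPLICATE_WEBHOOK_IGNORED"  # executed_delta is left untouched
        self._last_executed[key] = now
        self.executed_delta += delta
        return "EXECUTED"  # hypothetical code for the normal path


# Given a 2-second window, When the same webhook arrives 1 second later,
# Then executed_delta does not increase again.
handler = ScaleWebhookHandler()
first = handler.handle("appointments-api", 100, delta=3, now=0.0)
second = handler.handle("appointments-api", 100, delta=3, now=1.0)
delta_after_duplicate = handler.executed_delta
third = handler.handle("appointments-api", 100, delta=3, now=3.0)
print(first, second, delta_after_duplicate, third)
```

Timestamps are passed in explicitly so that the check is deterministic and can run in CI without sleeping.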

Related concepts: Minimal counterexample

Operational boundaries

JSON Schema

Adversarial validation

Study tips: Do not try to immediately introduce an external Coordinator — start with a manual offline run to understand the duel mechanics.

When creating counterexamples, always ask yourself: "What will happen if I remove this field?" If the violation remains, the field does not belong in the minimal counterexample and should be removed.

Focus on the validation.md format. In real work, this file is your body of precedents for automatically blocking regressions.

Distinguish between the concepts: "poisoned specification" (a requirements defect), "mutants" (a class of defects), and "duel counterexample" (a specific input that breaks Then).

Additional resources: GitHub Spec Kit: https://github.com/github/spec-kit (for learning the specification-first approach).

Wikipedia, Formal specification: https://en.wikipedia.org/wiki/Formal_specification (the theoretical foundation for formal specifications).

Offline example tribunal: book2/examples/tribunal/ (the source code of the script and JSON examples for local execution).

Summary: The LLM duel (Verifier vs. Implementer) turns formal specifications into a reliable safeguard for incident management. Instead of checking only the correctness of data types, the system moves on to verifying operational boundaries (quotas, blast-radius limits). Minimal counterexamples make it possible to isolate vulnerabilities, and all changes and failures are recorded in validation.md, turning every error into a regression test (next_guard) for future CI.

Course: Using SDD in Development for Qwen Code CLI. Applied Course
