Reading: Applied Part 1. Recovering Specifications from Legacy


Status: Recommendation. Gathering evidence, normalizing the timeline, and separating requirements from the memory bank are well-established engineering techniques. The three-party file-based arbitration at the end of the chapter is on the frontier.

For the learning track, it is enough to assemble one genealogy.md and separate an approved requirement from a hypothesis. File-based arbitration, normalizers, and historical replay are only needed for the full production track.

This chapter continues part 13 of the first volume: there we recovered the constitution of an existing project; here we recover a single production requirement from the traces of an incident. Keep the focus narrow: one claim, two sources, one open question. Anything that requires normalizers, historical replay, or file-based arbitration belongs to the full track.

Before Reading

Foundation from the first volume: part 13 teaches how to recover the constitution of an existing project; here you recover a single production requirement.
Local learning case: node_not_ready, because it makes it easy to demonstrate provenance and uncertainty.

Trace for capstone/: one genealogy.md entry for the main high_memory_usage case, with two evidence_ref entries and one open question.
Key terms on the first pass: evidence_ref and memory bank (the boundary between requirement and background context). Other chapter terms — Verifier/Implementor/Safety, Coordinator-Recorder, normalizer, file-based arbitration — are reference material, covered in detail in part 8.
What to defer: log normalizers, historical replay, and file-based arbitration.

In the first volume, AgentClinic was a learning project built with TypeScript, Hono, server-side JSX, SQLite, and Vitest. The second volume uses the learning model AgentClinic-production: the same project, imagined as deployed in Kubernetes. Grafana and PagerDuty send webhooks into its triage loop, and long-running replicas have accumulated operational history. Python is used in the second volume only for small runnable scripts in examples/, not as the main application stack.

There is no need to set up a real cluster. The legacy traces that chapters 1–11 work with are training post-mortems, dashboards, and logs from the production scenario. The specific incidents that follow (node_not_ready, appointment_latency / appointment_latency_spike, autoscale_200pct, cdn_error_budget_burn, high_memory_usage) are events from this model, not abstract scenarios.

The engineering name for this technique is recovering specifications from observable artifacts: logs, metrics, chats, post-mortems, and verifiable decision traces. If you encounter the figurative formulation "Spec necromancy," treat it only as a short label for this reconstruction, not a separate technique.

Goal

After the SRE team churn, the automated incident management project was left with fragments: 47 pages of unstructured logs, several Slack threads, dashboard screenshots, and post-mortems without a formal SDD. The goal of the chapter is to show how to use such traces to recover an engineering-grade specification for a triage pipeline based on Qwen Code. The alternative — a set of plausible guesses — is not acceptable.

After the section you will be able to:

separate requirements from the background memory bank model (the full definition is in "Key Ideas" below);
assemble evidence into a single chain of events;
extract implicit rules and turn them into verifiable user stories;
anchor the provenance of every item so that disputed decisions can be audited and re-justified later (this follows the SDD framing of "specification as an executable artifact" from GitHub Spec Kit).

Minimum Learning Scenario

Learning Case

Production incident node_not_ready: based on the metrics log, PagerDuty escalation, and one post-mortem, you need to recover a single requirement — when the NodeNotReady event becomes a P1 and when it cannot be auto-closed.

Preparation

book2/examples/templates/genealogy.md — the provenance template.
The training excerpt below is a minimal substitute for logs, post-mortems, and Slack threads.
One disputed fact: the planned deploy window, the canary namespace, or the manual cancellation of the escalation.

A natural question is how genealogy.md differs from git log or git blame. The short answer: git lacks the fields that are essential here. git log shows which file was changed and who changed it. genealogy.md shows where the requirement itself came from, how confident we are in it (uncertainty), which sources support it (evidence_ref), and which open questions remain unanswered. An "added requirement" commit in git history does not distinguish "we know this firmly from two post-mortems" from "we assumed this in chat." In genealogy.md this distinction is mandatory.

Minimum training excerpt:

grafana:NR-2026-05-17-01  cluster=prod-k8s node=worker-07 event=NodeNotReady count=3 window=10m
pagerduty:NR-2026-05-17-01 escalation=created owner=platform_oncall severity=P1
postmortem:node-not-ready-2026-05 note="auto-resolve was rejected until two stable OK windows"
open_questions:
  - "Does the canary namespace rule out a P1 or only lower confidence?"

If you don't have your own logs, use this excerpt. If you have real materials, replace the lines with your own, but keep the same minimum: two sources, one claim, one open question.
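The excerpt lines share a `source:id key=value ...` shape, so they can be split into records mechanically before any interpretation starts. A minimal sketch, assuming that shape holds for every line; `parse_excerpt_line` is a hypothetical helper, not a script from the textbook repository.

```python
import shlex


def parse_excerpt_line(line: str) -> dict:
    """Split 'source:id key=value ...' into a flat record.

    Assumes the shape of the training excerpt; real logs need their own parser.
    """
    # shlex keeps quoted values such as note="..." together as one token.
    head, *pairs = shlex.split(line)
    record = {"source": head.partition(":")[0], "evidence_ref": head}
    for pair in pairs:
        key, _, value = pair.partition("=")
        record[key] = value
    return record


EXCERPT = [
    "grafana:NR-2026-05-17-01  cluster=prod-k8s node=worker-07 event=NodeNotReady count=3 window=10m",
    "pagerduty:NR-2026-05-17-01 escalation=created owner=platform_oncall severity=P1",
    'postmortem:node-not-ready-2026-05 note="auto-resolve was rejected until two stable OK windows"',
]
records = [parse_excerpt_line(line) for line in EXCERPT]
```

Every value stays a string on purpose: converting `count=3` to a number is already an interpretation step and belongs to the normalizer, not to the parser.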

Steps

Copy the genealogy.md template into the working directory. Expectation: a file appears with sections for source, status, confidence, and open questions.
Write one candidate statement: for example, >=3 NodeNotReady in 10 minutes creates a P1.
Add at least two evidence markers (evidence_ref) and one missing context. Expectation: the statement cannot be read as "just the author's opinion."
Separate the requirement from the memory bank: cluster topology and on-call names must not become the contract.
Rewrite the statement in Given/When/Then and indicate next to it which field of the future JSON Schema will check the threshold, severity, and close condition.
Set the status to approved, needs_clarity, or rejected. Expectation: a disputed fact is not disguised as an approved requirement.

Verification Fact

In genealogy.md there is one entry where the statement, sources, confidence level, missing context, and connection to verifiable behavior are all visible at once. If the threshold or SLA cannot be defended with a reference to a source, the requirement remains a hypothesis.
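The verification fact above can be approximated with a small check over one entry. This is a sketch under assumptions: the entry is already loaded as a Python dict with the field names used in this chapter's capstone fragment, and the rule that an approved entry needs `uncertainty: low` and no open questions is an illustrative reading of the text, not a rule taken from the template.

```python
REQUIRED_KEYS = ("claim", "status", "evidence_ref", "uncertainty", "open_questions")


def review_entry(entry: dict) -> list[str]:
    """Return the reasons why an entry cannot stand as an approved requirement."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in entry]
    if len(entry.get("evidence_ref", [])) < 2:
        problems.append("fewer than two evidence_ref")
    if entry.get("status") == "approved":
        # Assumption: approval is incompatible with unresolved doubt.
        if entry.get("open_questions"):
            problems.append("approved entry still has open questions")
        if entry.get("uncertainty") != "low":
            problems.append("approved entry with uncertainty above low")
    return problems


entry = {
    "claim": ">=3 NodeNotReady in 10 minutes creates a P1.",
    "status": "needs_clarity",
    "evidence_ref": ["grafana:NR-2026-05-17-01", "postmortem:node-not-ready-2026-05"],
    "uncertainty": "medium",
    "open_questions": ["Does the canary namespace rule out a P1 or only lower confidence?"],
}
disguised = {**entry, "status": "approved", "evidence_ref": ["postmortem:node-not-ready-2026-05"]}
```

Here `review_entry(entry)` returns an empty list because the entry is honest about its status, while `review_entry(disguised)` reports three problems.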

How This Gets Into `capstone/`

Move into capstone/genealogy.md only one defended entry: the claim, two evidence_ref, the confidence level, and an open question. Do not move the entire timeline, log excerpts, and Slack quotes if they did not become evidence for a specific requirement.

Minimum fragment for high_memory_usage:

- claim: "When memory_percent >= 90% over 10m for appointments-api, a P1 is created."
  status: needs_clarity
  evidence_ref: ["grafana:HM-2026-05-17-01", "postmortem:api-memory-2026-05"]
  uncertainty: medium
  open_questions:
    - "Is the ban on auto-resolve without two stable windows confirmed?"

Reviewable Trace

In the training package, keep only the filled-in genealogy.md or its fragment. Draft log excerpts and temporary tables are not needed in the repository unless they have become verifiable evidence.

Key Ideas

The first discipline of specification recovery is to strictly separate actual requirements from the background memory bank model. By memory bank we mean a separate layer of infrastructure context: everything that helps interpret facts, but is not itself a contract.

If this term seems new, look at it through the lens of the first volume. What lived there in tech-stack.md (what we build on) and in QWEN.md (the agent's persistent context) is called in the second volume by one common name, memory bank. It is the same background layer, only now it is explicitly separated from requirements, because in production scenarios the difference between "contract vs context" becomes critical.

A requirement, unlike the memory bank, describes the behavior of a feature: what counts as a trigger, when an incident is created, which SLA applies, who receives the escalation, and under what conditions the event is closed.

The memory bank stores something else: cluster topology, team rosters, historical agreements, API limitations, customary communication channels, and operational vocabulary. Why is it important to separate them? If you mix the levels, a false rule can easily appear in the SDD, such as "canary is always non-escalatable." In reality, this may be only the context of a test namespace, not a universal product behavior.

Introduce the separation already at the artifact inventory stage. In the SDD, place statements that can be verified by an observable scenario: >=3 NodeNotReady in 10 minutes creates a P1, NOC receives a notification within 15 minutes, closing requires 2 consecutive OKs.

In the memory bank place everything that helps interpret facts but is not a contract:

who was on call the night of the incident;
why an old service name was used in Slack;
which teams have access to Grafana.

This filter reduces the risk that Qwen Code will mistake the infrastructure background for a business rule and start designing behavior based on an incidental detail.
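One way to make the filter mechanical is to keep an explicit allow-list of contract fields and send everything else to the memory bank. A minimal sketch; the key names in `CONTRACT_KEYS` and the sample facts are hypothetical and only mirror the kinds of behavior listed above (trigger, severity, SLA, escalation, close condition).

```python
# Hypothetical field names: only these may end up in the SDD contract.
CONTRACT_KEYS = {"trigger", "severity", "sla_minutes", "escalation_target", "close_condition"}


def split_facts(facts: dict) -> tuple[dict, dict]:
    """Partition recovered facts into (requirement, memory_bank)."""
    requirement = {k: v for k, v in facts.items() if k in CONTRACT_KEYS}
    memory_bank = {k: v for k, v in facts.items() if k not in CONTRACT_KEYS}
    return requirement, memory_bank


facts = {
    "trigger": ">=3 NodeNotReady in 10 minutes",
    "severity": "P1",
    "close_condition": "2 consecutive OKs",
    "on_call_that_night": "platform_oncall",       # context, not a contract
    "grafana_access": "teams with dashboard access",  # context, not a contract
}
requirement, memory_bank = split_facts(facts)
```

An allow-list fails safe: a new, unclassified fact lands in the memory bank until someone deliberately promotes it to the contract.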

The second idea is to assemble and normalize evidence into a single timeline of events. Each source has its own profile:

logs give observable states and the order of events;
Slack shows the operators' intentions and manual workarounds;
post-mortems capture causes and consequences;
metrics let you assess the scale of degradation.

Before analysis, bring the sources to a common time (UTC). Remove duplicates, isolate event codes, and link records by a single incident, cluster, node, or deployment identifier. Without this, SDD recovery turns into an argument about memories rather than a reconstruction of system behavior.

The normalized chain is built as a sequence ts → source → event_code → actor → affected_scope → evidence_ref, where the last field is the evidence marker: a reference to a specific place in the source artifact. In the node_not_ready case, the skeleton may show that three NodeNotReady events in 10 minutes almost always preceded the creation of a P1, that an escalation to NOC followed 15 minutes later, and that closing happened only after two consecutive stable OKs.
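The chain can be modeled as a small record type plus a builder that converts timestamps to UTC, drops exact duplicates, and sorts. This is a sketch, not the textbook's normalizer: it assumes ISO 8601 timestamps with an explicit offset, and the sample rows (including their timestamps and the `slack:sample-thread#msg_1` reference) are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    ts: datetime
    source: str
    event_code: str
    actor: str
    affected_scope: str
    evidence_ref: str


def to_utc(raw_ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(raw_ts)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp without offset: {raw_ts}")
    return parsed.astimezone(timezone.utc)


def build_timeline(rows: list[dict]) -> list[TimelineEvent]:
    # A set of frozen dataclasses removes rows that are identical after UTC conversion.
    events = {
        TimelineEvent(to_utc(r["ts"]), r["source"], r["event_code"],
                      r["actor"], r["affected_scope"], r["evidence_ref"])
        for r in rows
    }
    return sorted(events, key=lambda e: e.ts)


grafana_row = {"ts": "2026-05-17T14:02:00+03:00", "source": "grafana",
               "event_code": "NodeNotReady", "actor": "system",
               "affected_scope": "prod-k8s/worker-07",
               "evidence_ref": "grafana:NR-2026-05-17-01"}
rows = [
    grafana_row,
    {**grafana_row, "ts": "2026-05-17T11:02:00+00:00"},  # same event, other offset
    {**grafana_row, "ts": "2026-05-17T10:55:00+00:00", "source": "slack",
     "event_code": "operator_note", "actor": "oncall",
     "evidence_ref": "slack:sample-thread#msg_1"},
]
timeline = build_timeline(rows)
```

Rejecting timestamps without an offset is deliberate: guessing a timezone silently would reintroduce the "argument about memories" the normalization is meant to remove.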

Record exceptions separately: the planned deploy window, canary namespace, temporary loss of metrics, or manual cancellation of an escalation. Do not remove such exceptions as noise — they often point to hidden conditions of the future specification.

> [conceptual interface] — these commands show the expected interface of local normalizers. There are no ready-made timeline_builder.py and evidence_matrix.py in the textbook repository; implement them in your project if you move from the training minimum to the full track.

rg -n "NotReady|NodeNotReady|ALERT|deploy" evidence/raw/* > evidence/index.txt
python3 tools/timeline_builder.py --input evidence/raw --out evidence/timeline.ndjson
python3 tools/evidence_matrix.py \
  --timeline evidence/timeline.ndjson \
  --slack evidence/slack_export.json \
  --metrics evidence/metrics.csv \
  --out evidence/matrix.csv

Verification: each row in evidence/timeline.ndjson contains ts, source, event_code, cluster, namespace, actor, and evidence_ref; empty fields block the transition to requirement derivation.
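The "empty fields block the transition" rule can be enforced with a short gate over the NDJSON text. A minimal sketch assuming one JSON object per line; the sample values (for example `namespace: default`) are illustrative, and `find_incomplete_rows` is a hypothetical name.

```python
import json

REQUIRED_FIELDS = ("ts", "source", "event_code", "cluster", "namespace", "actor", "evidence_ref")


def find_incomplete_rows(ndjson_text: str) -> list[tuple[int, list[str]]]:
    """Return (line_number, missing_fields) for every row that blocks derivation."""
    incomplete = []
    for number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank separator lines
        row = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            incomplete.append((number, missing))
    return incomplete


complete = {"ts": "2026-05-17T11:02:00+00:00", "source": "grafana",
            "event_code": "NodeNotReady", "cluster": "prod-k8s",
            "namespace": "default", "actor": "system",
            "evidence_ref": "grafana:NR-2026-05-17-01"}
broken = {**complete, "namespace": ""}
del broken["actor"]
sample = "\n".join(json.dumps(row) for row in (complete, broken))
blocking = find_incomplete_rows(sample)
```

If `blocking` is non-empty, requirement derivation should not start; the line numbers point back to the rows that need repair.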

Next, the diagram shows how the recovered SDD is obtained from legacy. On the right side, an "Arbitration" block appears with three roles and a coordinator: this is the full track, which is covered in detail in part 8. On the first pass, treat the "Arbitration" block as a single step "disputed requirements are checked by an independent role" — the detailed composition of the roles does not need to be read here.

flowchart TD
  subgraph Input["Input: legacy"]
    L[Logs, post-mortem, Slack, metrics]
  end
  subgraph Processing["Processing"]
    P[Parsing and timeline]
    R[Requirement hypotheses and user stories]
  end
  subgraph Arbitration["Arbitration (full track, ch.8)"]
    TBR[Independent role checks disputed requirements]
  end
  subgraph Result["Result"]
    S[Recovered SDD and genealogy.md]
  end
  L --> P --> R --> TBR --> S

The third idea is to extract implicit requirements through Qwen Code, but evaluate each statement by its source and context. Qwen Code here works not as the author of business logic, but as an extraction intermediary. It is given facts, environmental constraints, and a strict response format, where statements without a reference to evidence are forbidden.

A good prompt asks not to "come up with an SDD," but to do something else:

find repeating rules in the event chain;
indicate the supporting sources;
name counterexamples;
assign a confidence level.

Thus the model strengthens the analysis, but is not given the right to turn guesses into requirements. Expect from Qwen Code a list of candidate statements (claims), not a final specification.

Bad:

> REQ-NR-01: frequent NodeNotReady on a node creates a P1.

Problem: there is no threshold, no window, no evidence marker. The rule cannot be verified or challenged.

Good:

> REQ-NR-01: at >=3 NodeNotReady in 10 minutes on one node and correlated 5xx growth, a P1 is created. evidence: logs/node-2026-05-12.parquet#row_4123, slack/thread_11#msg_7, grafana/node_5xx#segment_11:00. confidence: medium. missing_context: planned deploy window.

In practice, such a record is more useful than a polished user story: it immediately shows where the requirement is stable and where it requires verification with the service owner. If a rule is supported by only one post-mortem and does not match the metrics, it remains a hypothesis, even if it sounds convincing.
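The difference between the bad and the good record can be turned into a first-pass gate over candidate claims. This is a sketch under assumptions: claims are dicts, evidence references start with a source kind before the first `/` (as in the good example), and the status names `hypothesis` and `ready_for_review` are hypothetical labels for this illustration.

```python
def triage_claim(claim: dict) -> str:
    """Classify a candidate claim by its evidence, never by how convincing it sounds."""
    evidence = claim.get("evidence", [])
    if not evidence:
        return "rejected"  # statements without evidence are forbidden
    # "logs/...", "slack/...", "grafana/..." -> {"logs", "slack", "grafana"}
    source_kinds = {ref.split("/", 1)[0] for ref in evidence}
    if len(source_kinds) < 2:
        return "hypothesis"  # a single kind of source cannot confirm itself
    if claim.get("missing_context"):
        return "needs_clarity"
    return "ready_for_review"


bad = {"id": "REQ-NR-01", "text": "frequent NodeNotReady on a node creates a P1", "evidence": []}
good = {
    "id": "REQ-NR-01",
    "text": ">=3 NodeNotReady in 10 minutes on one node and correlated 5xx growth creates a P1",
    "evidence": ["logs/node-2026-05-12.parquet#row_4123", "slack/thread_11#msg_7",
                 "grafana/node_5xx#segment_11:00"],
    "confidence": "medium",
    "missing_context": "planned deploy window",
}
single_source = {"id": "REQ-NR-02", "text": "sample", "evidence": ["postmortem/sample#p1"]}
```

The gate never returns an approved status: approval stays with a human or, on the full track, with arbitration.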

> [project script] — qwen -p is runnable on its own, but the input @evidence/matrix.csv must first be assembled in your project. Stabilize the format of the resulting JSON with a separate parser-normalizer.

qwen -p "Read @evidence/matrix.csv. Find the recurring rules
of the node_not_ready incident. Return claims with evidence, counterexample,
missing_context, and confidence. Do not assert facts without evidence." \
  --approval-mode plan \
  --output-format json \
  > sdd/drafts/nr-claims.qwen.json

qwen -p "Read @sdd/drafts/nr-claims.qwen.json and cross-examine it:
for each claim, check the source, counterexample, and missing_context.
Mark each claim as approved, needs_clarity, or rejected." \
  --approval-mode plan \
  --output-format json \
  > sdd/drafts/nr-claims-cross.qwen.json

Verification: Qwen here works in headless Plan Mode. The final JSON from Qwen Code is a report with session messages; if the project needs a strict claims.json, add a separate parser-normalizer and check it with tests.

The fourth idea is to encode requirements simultaneously in Given/When/Then and a machine-readable contract, such as JSON Schema. Given/When/Then keeps the requirement in the language of behavior: initial state, event, expected result.

JSON Schema captures required fields, valid values, numeric boundaries, and data structure. The contract can be validated in CI or in a local validation pipeline. The dual recording eliminates the gap between "understandable to a human" and "verifiable by a machine."

For node_not_ready the behavioral story looks like this:

Given the cluster prod-k8s is in active shift and >=3 NodeNotReady is recorded for one node in 10 minutes;
When the event is correlated with a deployment or 5xx growth in related metrics;
Then an incident severity=P1 is created, the initial response is expected in 8 minutes, auto-escalation to NOC occurs in 15 minutes, and closing is allowed only after 2 consecutive OKs within 10 minutes.

Designate the exception for the canary namespace as a separate condition, not as a note at the end. Otherwise, the validator will not be able to distinguish the standard path from the relaxed threshold. This format moves the conversation about "fast response" into specific numbers, events, and statuses.
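The story above can be written as executable expectations, which makes the numbers hard to blur. A minimal sketch: the `Observation` type and the result keys `first_response_minutes` and `noc_escalation_minutes` are hypothetical names, and the canary path is left unimplemented on purpose because the source treats it as a separate, still open condition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Observation:
    not_ready_count: int      # NodeNotReady events for one node
    window_minutes: int       # window in which they were counted
    correlated: bool          # deployment or 5xx growth in related metrics
    namespace_rule: str = "standard"


def expected_incident(obs: Observation) -> Optional[dict]:
    """Given/When -> Then: return the expected incident, or None."""
    if obs.namespace_rule != "standard":
        raise NotImplementedError("canary path must be specified as its own condition")
    if obs.not_ready_count >= 3 and obs.window_minutes <= 10 and obs.correlated:
        return {"severity": "P1", "first_response_minutes": 8, "noc_escalation_minutes": 15}
    return None


def may_close(checks: list[tuple[datetime, str]]) -> bool:
    """Closing is allowed only after 2 consecutive OKs within 10 minutes."""
    if len(checks) < 2:
        return False
    (first_ts, first_status), (second_ts, second_status) = checks[-2:]
    return first_status == second_status == "OK" and second_ts - first_ts <= timedelta(minutes=10)


t0 = datetime(2026, 5, 17, 11, 0)  # illustrative timestamp
incident = expected_incident(Observation(3, 10, True))
```

Treating the 10-minute boundary as inclusive is an assumption of this sketch; the source does not say whether exactly 10 minutes counts.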

The minimum JSON Schema of the same contract (the full form with triggers and a regular expression for auto_resolve_window is in the full track):

{
  "$id": "urn:spec:node-not-ready:v1",
  "type": "object",
  "required": ["rule_id", "severity", "sla_minutes", "conditions"],
  "properties": {
    "rule_id":      {"type": "string"},
    "severity":     {"type": "string", "enum": ["P0", "P1", "P2", "P3"]},
    "sla_minutes":  {"type": "integer", "minimum": 1, "maximum": 120},
    "conditions": {
      "type": "object",
      "required": ["event_code", "count", "window_minutes", "namespace_rule"],
      "properties": {
        "count":          {"type": "integer", "minimum": 3},
        "window_minutes": {"type": "integer", "minimum": 1},
        "namespace_rule": {"type": "string", "enum": ["standard", "canary"]}
      }
    }
  }
}
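To see the contract reject a bad rule without installing anything, the handful of keywords used above can be checked by hand. This is a stdlib-only sketch that covers just `type`, `required`, `properties`, `enum`, `minimum`, and `maximum`; it is not a JSON Schema implementation, and a real project would more likely rely on a full validator such as the third-party `jsonschema` package. The `contract` instance is illustrative.

```python
SCHEMA = {
    "type": "object",
    "required": ["rule_id", "severity", "sla_minutes", "conditions"],
    "properties": {
        "rule_id": {"type": "string"},
        "severity": {"type": "string", "enum": ["P0", "P1", "P2", "P3"]},
        "sla_minutes": {"type": "integer", "minimum": 1, "maximum": 120},
        "conditions": {
            "type": "object",
            "required": ["event_code", "count", "window_minutes", "namespace_rule"],
            "properties": {
                "count": {"type": "integer", "minimum": 3},
                "window_minutes": {"type": "integer", "minimum": 1},
                "namespace_rule": {"type": "string", "enum": ["standard", "canary"]},
            },
        },
    },
}
PY_TYPES = {"object": dict, "string": str, "integer": int}


def validate(instance, schema, path="$"):
    """Return a list of human-readable violations (empty list means valid)."""
    expected = PY_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(instance, bool) or not isinstance(instance, expected):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: {instance!r} is not one of {schema['enum']}")
    if "minimum" in schema and instance < schema["minimum"]:
        errors.append(f"{path}: {instance} is below minimum {schema['minimum']}")
    if "maximum" in schema and instance > schema["maximum"]:
        errors.append(f"{path}: {instance} is above maximum {schema['maximum']}")
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"{path}.{key}: required field is missing")
    for key, sub in schema.get("properties", {}).items():
        if key in instance:
            errors.extend(validate(instance[key], sub, f"{path}.{key}"))
    return errors


contract = {
    "rule_id": "NR-01", "severity": "P1", "sla_minutes": 8,
    "conditions": {"event_code": "NodeNotReady", "count": 3,
                   "window_minutes": 10, "namespace_rule": "standard"},
}
violations = validate(contract, SCHEMA)
```

A rule with `count: 2` or `severity: "P9"` produces one violation per broken field, each with the path to it, which is the kind of verdict a CI step can report.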

The fifth idea applies only to the full track: disputed recovered requirements can be submitted to file-based arbitration. Three roles vote — Verifier, Implementor, Safety; the Coordinator keeps the log, without voting. The Verifier checks the consistency of numbers and statuses, the Implementor — feasibility in the current triage pipeline, Safety — the boundaries of safe action and the right of veto on critical_risk. The roles, verdicts, and precedents are covered in detail in part 8; the runnable training analogue is [examples/tribunal/](examples/tribunal/). For the training minimum, this step is not needed: genealogy.md with sources, confidence level, and an open question is enough.

The sixth idea is to maintain genealogy.md, a separate provenance registry for each requirement. It is needed because the recovered SDD quickly loses value if a month later it is impossible to explain:

why the threshold of 3 events in 10 minutes was chosen;
who confirmed the 8-minute SLA;
why canary received a separate mode.

genealogy.md links a statement to logs, Slack, metrics, post-mortems, the file-based arbitration decision, and the current level of uncertainty. Thus the specification becomes a chain of evidence, not a textual snapshot of collective memory.

- req_id: NR-01
  statement: "When >=3 NodeNotReady occur within 10m for one node and 5xx grows, a P1 is created."
  source:
    - logs: evidence/normalized_node_logs.parquet#row_4123
    - slack: export/slack_thread_11.json#msg_7
    - metrics: grafana/node_5xx_timeseries.csv#segment_2026-05-12T11:00
  status: approved
  adjudicated_by: [Verifier, Implementor, Safety]
  uncertainty: low
  open_questions: []

If an item remains disputed, do not disguise it as an approved contract. Set uncertainty: medium or uncertainty: high, indicate the reason for the doubt, and add a verification plan:

contact the service owner;
run a replay against historical data;
compare with a neighboring cluster;
collect the missing metric.

Such a provenance registry is especially important for the project's future Constitution. Only rules with a clear origin, scope, and revision mechanism should transfer into it.

Examples and Application

The 4-line training excerpt in "Minimum Learning Scenario" is already the filtered result of normalization. The original set contains:

9 hours of observations;
11 relevant Slack messages;
47 pages of unprocessed logs;
1,248 NodeNotReady events;
63 alerts;
8 previously closed incidents.

After normalization, it is clear that the sharp increase in NodeNotReady coincided with a deployment, some events went into the canary segment with a different auto-escalation logic, and two branches of behavior appear: the standard P1 and the canary path with relaxed thresholds.

> [conceptual interface] — normalizer pseudocode. The runnable examples of the second volume remain on the Python stdlib and live in book2/examples/.

read evidence/normalized_node_logs
sort events by ts
filter event_code == "NodeNotReady"
group by cluster,node in 10m windows
mark windows where count >= 3
link marked windows to alerts and Slack messages in [-15m,+5m]

The [-15m,+5m] window is needed because an operator could have discussed the problem before the formal incident record or after the automatic alert. If an event belongs to a canary namespace without SLO degradation, give it a separate label instead of deleting it as noise. If the planned deploy window explains part of the NodeNotReady events, state explicitly in the requirement whether this blocks the creation of a P1 or only lowers the confidence.
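The pseudocode can be sketched in stdlib Python as follows. Two points are assumptions of this sketch rather than statements of the source: the 10-minute window slides from each event instead of being a fixed bucket, and the [-15m,+5m] range is measured from the moment the threshold is reached. The sample events and messages are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)
LINK_BEFORE = timedelta(minutes=15)
LINK_AFTER = timedelta(minutes=5)


def marked_windows(events, threshold=3):
    """Find windows with >= threshold NodeNotReady events per (cluster, node)."""
    by_node = defaultdict(list)
    for event in events:
        if event["event_code"] == "NodeNotReady":
            by_node[(event["cluster"], event["node"])].append(event["ts"])
    marked = []
    for scope, stamps in by_node.items():
        stamps.sort()
        for i in range(len(stamps) - threshold + 1):
            breach_ts = stamps[i + threshold - 1]
            if breach_ts - stamps[i] <= WINDOW:
                marked.append({"scope": scope, "start": stamps[i], "breach_ts": breach_ts})
    return marked


def link_related(window, related):
    """Keep alerts and Slack messages within [-15m, +5m] of the breach."""
    earliest = window["breach_ts"] - LINK_BEFORE
    latest = window["breach_ts"] + LINK_AFTER
    return [item for item in related if earliest <= item["ts"] <= latest]


base = datetime(2026, 5, 17, 11, 0, tzinfo=timezone.utc)
events = [
    {"ts": base + timedelta(minutes=m), "event_code": "NodeNotReady",
     "cluster": "prod-k8s", "node": "worker-07"}
    for m in (0, 4, 9)
] + [{"ts": base, "event_code": "NodeNotReady", "cluster": "prod-k8s", "node": "worker-08"}]
related = [
    {"ts": base - timedelta(minutes=10), "kind": "slack"},   # too early
    {"ts": base - timedelta(minutes=3), "kind": "slack"},    # discussed before the breach
    {"ts": base + timedelta(minutes=12), "kind": "alert"},   # shortly after the breach
    {"ts": base + timedelta(minutes=30), "kind": "slack"},   # too late
]
windows = marked_windows(events)
linked = link_related(windows[0], related)
```

Canary events are not filtered out here; following the text, they would get their own label in a later step rather than being dropped.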

The recovered SDD becomes a working artifact only after replay: run historical incidents through the new JSON contract and check whether the created severities, SLAs, and escalations match the confirmed outcomes. Mismatches do not always mean a contract error — sometimes they show that the old practice was inconsistent or depended on a specific on-call engineer. What to change in that case — the specification, the memory bank, or the hypothesis status in genealogy.md — is decided by the file-based arbitration from part 8.
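A replay can be as small as a loop that compares what the rule predicts with what was confirmed at the time. This is a sketch with invented sample incidents; `replay` and `p1_rule` are hypothetical names, and the rule is a simplified stand-in for the recovered contract.

```python
def replay(incidents, rule):
    """Run historical incidents through a rule and collect mismatches."""
    mismatches = []
    for incident in incidents:
        predicted = rule(incident["observed"])
        if predicted != incident["confirmed_severity"]:
            mismatches.append({
                "id": incident["id"],
                "predicted": predicted,
                "confirmed": incident["confirmed_severity"],
            })
    return mismatches


def p1_rule(observed):
    """Simplified stand-in for the contract: threshold plus correlation."""
    breached = observed["count"] >= 3 and observed["window_minutes"] <= 10
    return "P1" if breached and observed["correlated"] else None


history = [
    {"id": "sample-1", "confirmed_severity": "P1",
     "observed": {"count": 4, "window_minutes": 10, "correlated": True}},
    {"id": "sample-2", "confirmed_severity": "P1",
     "observed": {"count": 3, "window_minutes": 10, "correlated": False}},
]
mismatches = replay(history, p1_rule)
```

The mismatch on `sample-2` does not by itself say who is wrong: either the correlation condition is too strict, or the historical P1 depended on an operator's judgment, which is exactly the question the text hands over to arbitration.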

Summary

Recovering specifications from legacy restores the SDD not from intuition, but from a verifiable chain of evidence. The route is as follows:

legacy artifacts are normalized into a timeline;
Qwen Code extracts candidate statements with a confidence level;
requirements are separated from the memory bank;
they are then encoded in Given/When/Then and JSON Schema;
for the full track, they pass through file-based arbitration by Verifier/Implementor/Safety, with the Coordinator keeping the log;
finally, they receive provenance in genealogy.md.

Such a process turns the chaos of logs, chats, and post-mortems into a contract. The contract can be validated, challenged, replayed against historical data, and transferred into a stricter system of rules. In the next chapter, we will deliberately poison the specifications with contradictions and study where Qwen Code starts to get stuck.

Artifacts and Readiness Criteria

| Artifact | Ready when |
| --- | --- |
| `genealogy.md` with one requirement or hypothesis | the requirement is separated from the `memory bank`, disputed facts are marked as hypotheses |
| At least two `evidence_ref` and one missing context | the statement cannot be read as "the author's opinion", the threshold/SLA is defended by a source reference or is explicitly marked as not yet approvable |
| Given/When/Then formulation | verifiable fields are linked to what the JSON Schema covers |

The full track adds evidence/timeline.ndjson, evidence/matrix.csv with references to logs, Slack, metrics, and post-mortems, sdd/drafts/nr-claims.qwen.json with candidate statements, contracts/node_not_ready.schema.json, and a file-based arbitration record for requirements that cannot be approved manually. Consider the full track ready if Given/When/Then and JSON Schema describe the same contract, the normalizer produces a reproducible timeline, and the validator or file-based arbitration delivers a verifiable verdict.

Practice

Copy [examples/templates/genealogy.md](examples/templates/genealogy.md) into capstone/genealogy.md and fill in one entry for the main high_memory_usage case: a statement, at least two evidence_ref, a confidence level, and one open question. The training excerpt from "Minimum Learning Scenario" can be used as a substitute for real logs.
Rewrite your statement in Given/When/Then and indicate next to it which three fields of the JSON Schema check the threshold, severity, and close condition. A field that cannot be defended with a source reference should be left as uncertainty: medium, not as an approved contract.
Open [appendix-a-bridges-to-book.md](appendix-a-bridges-to-book.md) and note which chapter of the first volume was the foundation for your genealogy.md. If there is no foundation, this is a signal that the requirement is not yet tied to the training model.

Review Questions

Why is evidence more important than a confident formulation of a requirement?
How does the memory bank differ from an SDD contract and why is it dangerous to mix them?
When can a hypothesis not be moved to an approved requirement?
You have recovered a rule from two post-mortems, but the service owner quit six months ago. What will you do with this rule before adding it to requirements.md?