Defining the shape of what we need to know and why.

Report 001 named the gap: AI-assisted engineering produces opaque process. The output is fine; the conversation that produced it has vanished. The missing artefact is a receipt — signed, dated, hash-chained, queryable.
This report goes a layer deeper. It defines the contract — what a receipt has to contain, and why each piece is in there. Not how to capture it, not where to store it, not how to query it. The contract first; everything else has to satisfy the contract.
That sequencing is deliberate. A clear contract makes the implementation testable. An unclear contract makes the implementation a moving target. So this is about the contract.
The diagram above shows the phases of an engineering project at a high level. Depending on the project and the team, AI assistance can appear at any of these phases — sometimes lightly, sometimes as the primary author. The methodology this report is defining cares most about the phases where AI authorship is most consequential.
Receipts are emitted continuously across the four core phases — Specify, Design, Build, Validate — where the agent is doing meaningful authorship and the cost of opacity is highest. The boundary phases (Define Goals, Deploy & Improve) are conditionally in scope: if AI did substantial authorship there, receipts are emitted; if a human-only process was used, they aren't.
That's where receipts live. Now to what they contain.
Before going further, three terms the methodology uses with precision, because they're going to recur across this report and the ones that follow.
The receipt ledger is the temporal sequence of receipts — each one anchored to the previous by hash, in the order they were produced. The ledger is what makes the methodology tamper-evident: alteration of any earlier receipt invalidates the chain forward. Fig. 5, later in this report, visualises the ledger.
The delegation tree is the authority structure within and across receipts. A principal authorises an agent. That agent may authorise a sub-agent for a scoped task. Each authorisation has its own scope, its own permissions, its own halt discipline. Halts where new permissions are requested and granted are themselves authority transfers, recorded on the tree. The delegation tree expresses these relationships — who authorised whom, with what scope, where halts were due, where new scope was opened in response. This report sketches the field that begins to capture it; future reports will go deeper as the methodology engages with multi-agent work.
Multi-agent work is structurally a tree, often a directed acyclic graph. An orchestrator fans out to parallel sub-agents. Those sub-agents may spawn their own helpers, cross-communicate with siblings, or produce work that flows back into shared descendants. The methodology models this deliberately as a graph — not a chain — because the questions worth asking are graph traversals: which sub-agent in which branch was granted which permission, which skills were honoured across the fleet, where in the branching structure a destructive action originated. A chain cannot answer those questions; a graph can.
The provenance graph is the dependency structure: which work was built on which other work, and which authorities, permissions, tools, and skills were used along the way. When agent X consumes the output of agent Y as an input, that's a graph edge. When agent X invokes a particular tool or skill, that's recorded against the graph. When agent Y raises a halt and is granted a permission that propagates to sub-agents, those grants are graph edges too. In a single agent's work, much of this is also visible in steps[]. When one principal's intent fans out across multiple sub-agent invocations (whether those run as discrete sessions, nested invocations, or some other runtime structure), the graph is what makes the traversal questions above answerable. The provenance graph is reconstructed by following the references that link receipts and the events within them. Future reports will deepen this.
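As a rough illustration of reconstructing the graph by following references, the sketch below assumes each receipt carries a hypothetical `consumed_receipts` list naming the receipts whose outputs it used. That field name is an assumption made for this example and is not part of the v0.1 schema described in this report.

```python
def build_provenance_edges(receipts):
    """Map each receipt_id to the receipt_ids whose outputs it consumed."""
    # "consumed_receipts" is a hypothetical reference field, used for illustration.
    return {r["receipt_id"]: list(r.get("consumed_receipts", [])) for r in receipts}


def upstream_of(receipt_id, edges):
    """Return every receipt reachable by following consumption edges backwards."""
    seen = set()
    stack = list(edges.get(receipt_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared ancestors are visited once
        seen.add(current)
        stack.extend(edges.get(current, []))
    return seen


# An orchestrator fans out to two sub-agents whose work flows into one descendant.
receipts = [
    {"receipt_id": "orchestrator", "consumed_receipts": []},
    {"receipt_id": "sub-a", "consumed_receipts": ["orchestrator"]},
    {"receipt_id": "sub-b", "consumed_receipts": ["orchestrator"]},
    {"receipt_id": "merge", "consumed_receipts": ["sub-a", "sub-b"]},
]
edges = build_provenance_edges(receipts)
```

Here `upstream_of("merge", edges)` walks both branches and reaches the orchestrator once, which is the kind of traversal a linear chain cannot express.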
This report is about the receipt itself and the ledger that connects them. Delegation trees and provenance graphs are mentioned where the schema engages with them, but their full treatment comes later.
Start with the schema-naming drift from Report 001. Two products, same architect, same house style, divergent schemas. Three plausible explanations, indistinguishable from the outside.
Now imagine a receipt for the build that produced the second product. What would it have to contain to tell you which of the three scenarios actually occurred?
You would need a record of the dialogue: every clarifying question the agent raised and how it was answered. Without that, you cannot tell whether the agent asked anything at all, let alone what it asked. So the receipt needs steps[]: an ordered, typed sequence of every action and exchange in the session.
Each step needs to be typed by intent, not just by mechanism. A tool call is a mechanism. "Agent asked a clarifying question; principal gave a response; agent acknowledged the rule; agent did something different anyway" — that is a sequence of typed events: clarifying_question, response, confirmation, override. Without typing, you can reconstruct what happened but not what kind of conversation it was.
The halt events need to be called out as first-class. The moment an agent encounters something not in its closed vocabulary and refuses to invent — "this pattern doesn't match what I'm allowed to do; I'm asking rather than guessing" — is the methodology working. That refusal needs to be visible in the receipt, not buried inside step content. So steps[].halt_triggered is its own boolean, and halt_reason is its own typed enum.
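One way to picture typed steps with first-class halts is the minimal sketch below. The `StepType` values come from the text; the `HaltReason` members, the `actor` field, and the helper function are hypothetical, since this report says halt_reason is a typed enum without listing its values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepType(Enum):
    CLARIFYING_QUESTION = "clarifying_question"
    RESPONSE = "response"
    CONFIRMATION = "confirmation"
    OVERRIDE = "override"
    HALT = "halt"


class HaltReason(Enum):
    # Hypothetical members, for illustration only.
    OUTSIDE_CLOSED_VOCABULARY = "outside_closed_vocabulary"
    DESTRUCTIVE_ACTION = "destructive_action"


@dataclass(frozen=True)
class Step:
    step_type: StepType
    actor: str  # hypothetical: "agent" or "principal"
    content: str
    halt_triggered: bool = False
    halt_reason: Optional[HaltReason] = None

    def __post_init__(self):
        # A halt must carry a reason, and a reason implies a halt.
        if self.halt_triggered != (self.halt_reason is not None):
            raise ValueError("halt_triggered and halt_reason must agree")


def overrides_after_confirmation(steps):
    """Return indices of override steps that occur after any confirmation."""
    seen_confirmation = False
    hits = []
    for i, step in enumerate(steps):
        if step.step_type is StepType.CONFIRMATION:
            seen_confirmation = True
        elif step.step_type is StepType.OVERRIDE and seen_confirmation:
            hits.append(i)
    return hits


session = [
    Step(StepType.CLARIFYING_QUESTION, "agent", "Which naming convention applies?"),
    Step(StepType.RESPONSE, "principal", "Use the house style."),
    Step(StepType.CONFIRMATION, "agent", "Acknowledged."),
    Step(StepType.OVERRIDE, "agent", "Used a different convention."),
]
```

Because the steps are typed by intent, `overrides_after_confirmation(session)` can pick out the "acknowledged the rule, then did something different" pattern without parsing step content.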
Halts matter beyond the schema-drift case. Surfacing halts and activities within receipts makes it possible to review what happened when a session with an AI agent results in the deletion of production data. Without that record, it is all too easy to wave a hand and blame new technology and processes with claims that can be neither verified nor countered. Imagine instead being able to review a particular session: to see who initiated it, what permissions were granted, what activity was requested, and whether the agent asked before proceeding with a destructive step. That's the difference between an incident report that can only speculate and one that can reconstruct.
That gets you most of the way to diagnosing the schema-naming drift. But not all the way.
A few more fields are needed before the three scenarios can be distinguished cleanly.
Authority. You need to know which agent ran, on whose authority. A receipt without a principal (the accountable human) and an actor (who actually executed) cannot answer the "on whose authority" question that regulated environments will eventually ask. And in any team that uses agents to invoke other agents — Marlow telling Iris to do something, in studio terms — you need a delegation chain, not just a flat actor field. So delegation_path[] captures the chain of authority transfers, each with its own scope and granted-at timestamp.
This field came from a specific incident. The studio's agent operating manual records a violation where one agent directed another to read directly from a database layer it shouldn't have touched. The fact of the violation could be reconstructed afterwards from session logs. But the chain of authority — who told whom they could do what, and on what basis — could not. A flat actor field would not have helped. A delegation chain would have.
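A minimal sketch of what a delegation chain makes checkable, assuming scope is represented as a set of permission strings. The representation, the permission names, and the timestamps are hypothetical; only the idea of per-transfer scope and granted-at time comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Delegation:
    grantor: str
    grantee: str
    scope: frozenset  # hypothetical: permissions as plain strings
    granted_at: datetime


def is_unbroken(principal, actor, path):
    """True if the path links principal to actor hop by hop, in time order."""
    if not path:
        return principal == actor
    if path[0].grantor != principal or path[-1].grantee != actor:
        return False
    return all(
        prev.grantee == nxt.grantor and prev.granted_at <= nxt.granted_at
        for prev, nxt in zip(path, path[1:])
    )


def grants_beyond_own_scope(path):
    """List hops where a grantor passed on permissions it was never given."""
    findings = []
    for prev, nxt in zip(path, path[1:]):
        extra = nxt.scope - prev.scope
        if extra:
            findings.append((nxt.grantor, nxt.grantee, sorted(extra)))
    return findings


t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)  # illustrative timestamp
path = [
    Delegation("tony", "marlow", frozenset({"read:api", "write:code"}), t0),
    Delegation("marlow", "iris", frozenset({"read:db"}), t0 + timedelta(minutes=5)),
]
```

In this made-up example the chain is unbroken, yet `grants_beyond_own_scope(path)` shows Marlow handing Iris a database permission Marlow never held. A flat actor field could record that Iris read the database, but not where the authority to do so came from.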
Identity at session start. You need to know who began the work, on what device, in what context, before any work happens. The principal's identity, the device fingerprint, the network context, the session start time — all attested at the moment the session begins. Tony, on Tony's device, in Tony's office, at this time is a corroborating context that an auditor can verify; Tony, on a device that's never been seen before, in a different country is a flag. The opening identity attestation is non-optional — without it, the receipt has no anchor for the who and the whence of the work it records.
Runtime. You need to know what the agent was running on. Model family, model version, sampling parameters, the system prompt the model was operating under. Drift between two builds may be explained by drift between two models — a Sonnet that halts reliably and an Opus that waves things through, or the reverse. Without runtime.model.{family, version} and runtime.system_prompt_hash, this hypothesis is untestable.
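To show how the runtime fields make the model-drift hypothesis testable, here is a small sketch that compares two receipts' runtime sections. The dictionary layout, the version strings, and the use of SHA-256 for the prompt hash are assumptions for illustration.

```python
import hashlib


def system_prompt_hash(prompt):
    """Hash the system prompt so the receipt can reference it without storing it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def runtime_differences(a, b):
    """Return the names of the runtime fields on which two receipts disagree."""
    getters = {
        "model.family": lambda r: r["model"]["family"],
        "model.version": lambda r: r["model"]["version"],
        "system_prompt_hash": lambda r: r["system_prompt_hash"],
    }
    return [name for name, get in getters.items() if get(a) != get(b)]


prompt = "You are a careful engineering agent."  # placeholder prompt
build_one = {
    "model": {"family": "sonnet", "version": "v1"},  # placeholder version
    "system_prompt_hash": system_prompt_hash(prompt),
}
build_two = {
    "model": {"family": "opus", "version": "v1"},
    "system_prompt_hash": system_prompt_hash(prompt),
}
```

If `runtime_differences(build_one, build_two)` is empty, model drift can be ruled out as the explanation; if it is not, the list says where to look first.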
Context state. You need to know how full the context was when each step happened. There is a hypothesis the methodology cares about: agents drift more when context is loaded. Consider what happens during a long session — the agent has loaded skill documentation, conversation history, tool outputs, and intermediate work. As the context fills, the model is increasingly summarising or compressing what it can hold. The skills that produce halt-and-ask behaviour live in that context. If those skills are the first to degrade under compression, the agent's halt discipline drops late in the session — exactly when complex decisions are being made.
If the hypothesis is true, halt rate should drop as context_state.tokens_used_pct rises. If it's false, no harm done — the field is cheap to capture. Either way, the only way to find out is to capture it at every step. So context_state appears at receipt level and at step level. It also matters operationally: engineering leaders need visibility into cost, performance, and how token consumption maps to outcomes. The same field that tests the methodology's hypothesis serves the budget conversation.
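The hypothesis can be checked with a very simple aggregation once per-step context state is captured. The sketch below assumes steps are plain dictionaries carrying `halt_triggered` and `context_state.tokens_used_pct`; the banding scheme and function name are illustrative choices, not part of the schema.

```python
from collections import defaultdict


def halt_rate_by_context_band(steps, band_width=25):
    """Group steps into context-utilisation bands and return the halt rate per band.

    band_width is assumed to divide 100 evenly; 100% falls into the top band.
    """
    totals = defaultdict(int)
    halts = defaultdict(int)
    for step in steps:
        pct = step["context_state"]["tokens_used_pct"]
        band = min(int(pct // band_width) * band_width, 100 - band_width)
        totals[band] += 1
        if step["halt_triggered"]:
            halts[band] += 1
    return {band: halts[band] / totals[band] for band in sorted(totals)}


# Fabricated sample steps, purely to exercise the function.
sample_steps = [
    {"halt_triggered": True, "context_state": {"tokens_used_pct": 10}},
    {"halt_triggered": False, "context_state": {"tokens_used_pct": 20}},
    {"halt_triggered": False, "context_state": {"tokens_used_pct": 80}},
    {"halt_triggered": False, "context_state": {"tokens_used_pct": 100}},
]
```

A falling rate across bands in a real corpus would support the hypothesis; a flat one would count against it. The sample data above is invented and says nothing either way.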
Inputs classified by sensitivity. PII, PHI, PCI, MNPI — the controlled vocabulary differs by regulatory regime, but every regulated environment cares whether the receipt-producing process touched data of a given class. And you need to record this by hash, not by content; the receipt must not become a secondary data exposure.
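A minimal sketch of recording an input by hash rather than by content, using the inputs[] field names from the schema. The helper name, the file name, and the choice of plain SHA-256 are assumptions; note that a plain hash of short, guessable data can be matched by brute force, so a real capture layer may need a keyed hash instead.

```python
import hashlib

# Closed vocabulary taken from the examples in the text; real regimes will differ.
CLASSIFICATIONS = {"public", "pii", "phi", "pci", "mnpi"}


def record_input(source, classification, content, redacted=False):
    """Build an inputs[] entry that references the data without containing it."""
    if classification not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {classification!r}")
    return {
        "source": source,
        "classification": classification,
        "content_hash": hashlib.sha256(content).hexdigest(),
        "redacted": redacted,
    }


# Hypothetical input: the raw bytes are hashed and then discarded.
entry = record_input("crm_export.csv", "pii", b"jane@example.com")
```

The resulting entry can prove later that a specific input was the one ingested, because anyone holding the original data can recompute the hash, while the receipt itself carries no raw data.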
The spec being built against. A spec reference — the planning document, ADR, or specification the build was meant to satisfy — and a diff against spec with severity typed as a closed enum (none, cosmetic, material, breaking). Without this, divergence is invisible. With it, divergence becomes queryable: "show me all builds where the diff was material and the principal didn't acknowledge it."
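The quoted query can be expressed directly once severity is a closed enum. This sketch assumes receipts are dictionaries shaped like the diff_against_spec fields in the schema; the receipt ids and sample data are invented.

```python
SEVERITIES = ("none", "cosmetic", "material", "breaking")


def unacknowledged(receipts, severity="material"):
    """Return ids of receipts with the given diff severity and no acknowledgement."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    return [
        r["receipt_id"]
        for r in receipts
        if r["diff_against_spec"]["severity"] == severity
        and not r["diff_against_spec"]["human_acknowledged"]
    ]


# Invented sample corpus.
corpus = [
    {"receipt_id": "r1", "diff_against_spec": {"severity": "none", "human_acknowledged": False}},
    {"receipt_id": "r2", "diff_against_spec": {"severity": "material", "human_acknowledged": True}},
    {"receipt_id": "r3", "diff_against_spec": {"severity": "material", "human_acknowledged": False}},
    {"receipt_id": "r4", "diff_against_spec": {"severity": "breaking", "human_acknowledged": False}},
]
```

With free-text severity this filter would need fuzzy matching; with a closed enum it is an equality test.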
That is the receipt, in narrative form. Fig. 4 shows the structural shape it takes.
The diagram shows the envelope in full: a header with the receipt's identity and ledger anchors, a body containing the major sections, and a footer carrying the integrity attestations. The steps[] section is highlighted because it carries the methodology — the typed dialogue between agent and principal is what distinguishes a receipt from a log.
A receipt is anchored by signatures, but not by a single closing signature applied at session end. Three roles, doing three different jobs:
The opening identity attestation is signed at session start. Tony, on this device, in this context, initiated this work, under this authority. The who is established before any work happens — not reconstructed afterwards. This is non-optional. A receipt without an opening identity attestation has no anchor.
The integrity attestation is signed by whatever process captures the receipt — ideally per-step as the work progresses, so each step's provenance is anchored at the moment it occurs rather than retroactively. This makes the recording itself trustworthy: an investigator looking at a step weeks later can verify it was captured by the trusted process, signed at the time, and not altered since. The capture process is itself an accountable party — who recorded this matters separately from who did the work.
The accountability attestation is the principal's review and acceptance of the record — applied not on every session but on the receipts that warrant attention. The methodology is realistic about this: comprehensive principal review of every step of every session is not a methodology, it is a thing that won't happen, and methodologies that depend on things that won't happen perform accountability rather than enable it. The receipt's job is to surface what would otherwise be invisible — anomalies, drift, moments where things diverged from expected behaviour. The principal's job is to be findable and accountable when something flagged warrants investigation, not to perform comprehensive review on clean sessions.
A clean session — opening identity attested, every step's integrity attested, no anomalies surfaced for review — is a defensible record without a closing accountability attestation. A flagged session — diff severity material, halt anomalies, scope grants outside policy — is one where the accountability attestation is the next required step. The methodology doesn't require principals to inspect; it requires them to be findable when inspection is needed.
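The clean-versus-flagged distinction can be sketched as a small decision function. The inputs are hypothetical summaries of a session, and two rules here are assumptions rather than statements of the methodology: that a breaking diff flags as well as a material one, and that a missing per-step integrity attestation flags the session.

```python
# Assumption: "breaking" flags a session, since it is more severe than "material".
FLAGGING_SEVERITIES = {"material", "breaking"}


def signoff_state(opening_attested, steps_attested, diff_severity,
                  halt_anomalies=0, out_of_policy_grants=0):
    """Return "clean" or "flagged" for a session, per the rules sketched above."""
    if not opening_attested:
        # The text treats the opening identity attestation as non-optional.
        raise ValueError("no opening identity attestation: the receipt has no anchor")
    flagged = (
        not steps_attested  # assumption: gaps in integrity attestation are flagged
        or diff_severity in FLAGGING_SEVERITIES
        or halt_anomalies > 0
        or out_of_policy_grants > 0
    )
    return "flagged" if flagged else "clean"


def accountability_required(state):
    """Only flagged sessions need the principal's accountability attestation."""
    return state == "flagged"
```

A session with `diff_severity="cosmetic"` and no anomalies comes back clean and needs no principal review, while one out-of-policy grant is enough to make the accountability attestation the next required step.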
The mechanics of how these attestations are captured, where the keys live, what the capture process looks like, and how anomalies are surfaced for review — these are the subject of Report 003. What matters here is that the contract requires all three roles, not a single closing signature.
A single receipt is an artefact. A sequence of receipts, each anchored to the previous, is a ledger.
The ledger mechanic is simple: each receipt's chain.previous_receipt_hash is the SHA-256 hash of the receipt before it, and each receipt's own hash is what the next one will reference. To verify, recompute each receipt's hash and compare it to what the next receipt records. If every pair matches, the ledger is intact. If one does not, the ledger has been tampered with, and the location of the break tells you which receipt was altered.
This is what makes the ledger an audit-grade artefact rather than a logfile. A log can be edited after the fact and the edit is undetectable. A ledger cannot be edited silently: any edit to a past receipt invalidates the hash chain from that point forward, and the invalidation is observable to anyone holding a later receipt.
The first receipt in a ledger is the genesis — its previous_receipt_hash is null, indicating no predecessor. Every subsequent receipt anchors to its predecessor. The ledger is open on the right; each new receipt extends it.
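The chain mechanic can be sketched in a few lines. This is an illustration under stated assumptions, not the capture layer's implementation: receipts are plain dictionaries, the hash is taken over JSON serialised with sorted keys (a stand-in for whatever canonical form the schema eventually fixes), and sequence numbers start at zero.

```python
import hashlib
import json


def receipt_hash(receipt):
    """SHA-256 over a deterministic JSON encoding of the receipt (assumed canonical form)."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_receipt(ledger, body):
    """Anchor a new receipt to the current head; the genesis receipt gets None."""
    previous = receipt_hash(ledger[-1]) if ledger else None
    receipt = dict(body, chain={
        "previous_receipt_hash": previous,
        "sequence_number": len(ledger),
    })
    ledger.append(receipt)
    return receipt


def first_break(ledger):
    """Return the index of the first altered receipt, or None if the chain is intact."""
    for i in range(len(ledger) - 1):
        recorded = ledger[i + 1]["chain"]["previous_receipt_hash"]
        if recorded != receipt_hash(ledger[i]):
            return i
    return None


# Invented build purposes, used only to populate a three-receipt ledger.
ledger = []
append_receipt(ledger, {"build": {"purpose": "initial schema"}})
append_receipt(ledger, {"build": {"purpose": "add index"}})
append_receipt(ledger, {"build": {"purpose": "rename column"}})
```

Editing `ledger[1]` after the fact makes `first_break(ledger)` return 1, because the third receipt still records the hash of the unedited version. The newest receipt has no successor to vouch for it, which is one reason an external anchor for the ledger head is useful.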

Every field in the receipt has an origin — one of reasoned, evidenced, regulatory, or operational. The origin is itself a typed value, part of the schema definition.
Reasoned fields came from first-principles thinking about what makes an artefact auditable. Evidenced fields came from a specific incident in the studio's own work. Regulatory fields are required by a named regime. Operational fields came from observing the studio's own agent operations and noticing a pattern worth capturing.
The reason this matters: by tagging origin, the schema documents why each field is in there. A reader can ask "how do you know this field is necessary?" and the schema answers — "we reasoned it," or "we observed it," or "the FCA requires it," or "this incident showed us we needed it."
The honest disposition is that v0.1 is more reasoned than evidenced. The first hundred real receipts will tell us where the reasoning held and where it needs revising. Some fields will graduate from reasoned to evidenced. Some will turn out to be unnecessary and get retired. The methodology's evolution is not hidden; it is part of the schema's structure.
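Treating origin as a typed value might look like the sketch below. The four origins and the sample field assignments are taken from this report; the registry structure and the `graduate` helper are hypothetical.

```python
from enum import Enum


class Origin(Enum):
    REASONED = "reasoned"
    EVIDENCED = "evidenced"
    REGULATORY = "regulatory"
    OPERATIONAL = "operational"


# A small subset of the schema's fields, with the origins this report assigns them.
FIELD_ORIGINS = {
    "receipt_id": Origin.REASONED,
    "initiation.principal_id": Origin.REGULATORY,
    "delegation_path[]": Origin.EVIDENCED,
    "runtime.skills_loaded[]": Origin.OPERATIONAL,
    "runtime.tools_available[]": Origin.REASONED,
}


def fields_with_origin(origins, origin):
    """List the fields carrying a given origin tag, in sorted order."""
    return sorted(name for name, tag in origins.items() if tag is origin)


def graduate(origins, field):
    """Return a copy of the registry with a reasoned field promoted to evidenced."""
    if origins[field] is not Origin.REASONED:
        raise ValueError(f"{field} is not a reasoned field")
    return {**origins, field: Origin.EVIDENCED}
```

Keeping `graduate` non-mutating means each schema version retains its own record of which fields were still reasoned at the time.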
Agreeing on what needs to be captured, and why, is as much a part of the methodology as the capturing and analysis themselves.
The fields below are grouped by receipt section, matching the structural anatomy in Fig. 4. The narrative above explained the load-bearing fields; this table is the precise specification.
| Field | Description | Origin |
|---|---|---|
| receipt_id | UUID for the receipt | reasoned |
| schema_version | Which version this receipt conforms to | reasoned |
| chain.previous_receipt_hash | Hash of the previous receipt in the ledger | reasoned |
| chain.sequence_number | Position in the ledger | reasoned |

| Field | Description | Origin |
|---|---|---|
| initiation.principal_id | The accountable human who began this session | regulatory |
| initiation.device_fingerprint | Hardware/OS identity of the originating device | reasoned |
| initiation.network_context | Network signature corroborating origin | reasoned |
| initiation.started_at | Session start timestamp | reasoned |
| initiation.signature | Signed at session start, anchoring the who | regulatory |

| Field | Description | Origin |
|---|---|---|
| build.purpose | Why this build was undertaken | reasoned |
| spec_reference | Hash or URI of the spec being built against | reasoned |

| Field | Description | Origin |
|---|---|---|
| principal | The accountable human (id, role) | regulatory |
| actor | Who actually executed (id, type) | reasoned |
| delegation_path[] | Chain from principal to actor, with scope at each transfer | evidenced |

| Field | Description | Origin |
|---|---|---|
| runtime.agent_name | Agent identity (e.g. claude-code) | reasoned |
| runtime.model.{family, version, provider} | Which model ran | reasoned |
| runtime.sampling | Temperature, top_p, max_tokens | reasoned |
| runtime.system_prompt_hash | Hash of the system prompt at session start | reasoned |
| runtime.skills_loaded[] | Skills in context at session start, with version | operational |
| runtime.tools_available[] | Tools accessible at runtime | reasoned |

| Field | Description | Origin |
|---|---|---|
| context_state.tokens_used_pct | Context utilisation at receipt boundary | operational |
| context_state.position_in_session | Step N of total | reasoned |
| context_state.compaction_events | Times context was summarised during the session | operational |

| Field | Description | Origin |
|---|---|---|
| inputs[].source | Where data crossed into the model context | reasoned |
| inputs[].classification | Sensitivity (public, pii, phi, pci, mnpi, etc.) | regulatory |
| inputs[].content_hash | Hash of the input, never the raw data | regulatory |
| inputs[].redacted | Whether redaction was applied before ingestion | regulatory |

| Field | Description | Origin |
|---|---|---|
| steps[] | Ordered, typed sequence of dialogue and actions | reasoned |
| steps[].step_type | Typed by intent (clarifying_question, response, override, halt, etc.) | reasoned |
| steps[].halt_triggered | Whether this step was a halt-and-ask event | evidenced |
| steps[].halt_reason | Why the halt fired | evidenced |
| steps[].context_state | Per-step snapshot of context state | operational |
| steps[].integrity_signature | Per-step signature by the capture process | reasoned |

| Field | Description | Origin |
|---|---|---|
| outputs[].artefact_type | What was produced (ddl, code, doc, etc.) | reasoned |
| outputs[].artefact_hash | Hash of the produced artefact | reasoned |

| Field | Description | Origin |
|---|---|---|
| diff_against_spec.severity | none, cosmetic, material, breaking | reasoned |
| diff_against_spec.human_acknowledged | Did the principal review the diff | regulatory |

| Field | Description | Origin |
|---|---|---|
| signoff.state | clean, flagged, accountability_attested, accountability_pending, accountability_declined | regulatory |
| signoff.accountability_signature | Principal's review signature, when applied | regulatory |
| signoff.applied_at | Timestamp of accountability attestation | reasoned |

| Field | Description | Origin |
|---|---|---|
| source_control.system | Version control system (git, mercurial, none, etc.) | reasoned |
| source_control.commit_hash | Commit at session start | regulatory |
| source_control.is_dirty | Were there uncommitted changes at session start | evidenced |
| source_control.commits_during_session[] | Commits made during the session | reasoned |

A handful of plumbing fields — output URIs, retention policy references, repository identifiers — are part of the v0.1 schema but don't bear on the narrative above. They are documented in the full schema reference, which will move to its own published location when the schema reaches v0.1 stable.

One question this schema invites: if tamper-evidence matters, why not blockchain?
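Three properties are worth distinguishing: (1) tamper-evidence, meaning any alteration of a past record is detectable; (2) attribution, meaning each record is signed by an accountable party; and (3) decentralised consensus, meaning no single party controls the record.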
The receipt schema commits to (1) and (2). It does not commit to (3). The reason is operational: a regulator does not need decentralised consensus to trust evidence; they need the evidence to be signed by an accountable party and demonstrably unaltered. Hash-chained, signed receipts in append-only storage achieve that. Periodic anchoring to a public timestamp source provides external verifiability without putting receipt content on-chain.
A blockchain would provide all three properties at the cost of latency, transaction fees, public exposure of receipt metadata, and operational overhead. The minimum mechanism that achieves the audit purpose is the right one.
The signing model the methodology uses — opening identity attestation, per-step integrity attestation, deferred accountability attestation — is what existing supply-chain provenance standards (in-toto, SCITT) were designed for. Receipts compose with these standards rather than reinventing them. Where stronger properties are required — multi-party attestation, cross-organisation provenance — the composition is straightforward.
This is the working principle. Don't reinvent. Compose. Cite. Adapt.
This work has been operating under a working name. AI-Verifiable Engineering is the term used through this report — engineering work that can be verified, after the fact, to have followed an agreed practice, with evidence captured at the time of the work rather than reconstructed afterwards.
The name is internally precise. It is not a claim that this methodology verifies AI. It is a claim that AI-assisted engineering can be made verifiable — through closed vocabularies, halt conditions, signed receipts of the dialogue that produced the work, and a ledger that anchors them in time.
The name is provisional. There may be a better one. The methodology is not committed to a name until the practice is settled and the name is genuinely earned. For now, the working term carries the meaning, and the schema above is its v0.1 contract.
That contract remains a draft until it has survived contact with at least three real builds across two products. Some fields will graduate from reasoned to evidenced as the methodology meets reality. Some will turn out to be unnecessary. The version itself will be revised when the evidence requires it.
That is how the methodology is meant to work — slowly, with evidence, with its own basis visible. The schema is not handed down complete. It is built one receipt at a time, refined when the corpus reveals what the reasoning missed, and republished honestly when the version shifts.
Report 003 will introduce the capture layer — where and how the various attributes that make up a receipt are captured, what the gaps are between what the schema asks for and what the runtime can actually provide, and how the three signing roles are operationally implemented. Report 004 will look at the first real receipts, what the schema held under contact with reality, and what changed when the methodology entered the workflow. Subsequent reports will go deeper on delegation trees, provenance graphs, and the corpus-level questions the schema makes answerable.
The contract is now defined. The work continues.

Navigare necesse est
If this resonates — particularly if you're working in regulated environments where AI-assisted engineering work needs to be defensible — I'd be interested to hear from you. Detailed schema documentation, including frontmatter not surfaced here, lives in the methodology workspace and will move to its own published location when the contract is stable.
Subscribe to The Approach. Working notes on AI-assisted engineering. Published when there's something worth saying.