Why R1 · the case against model-driven agents

1. What breaks with model-driven agents

Model-driven agents are loops where the language model decides, on each turn, whether to keep working or to stop. The loop is short, the freedom is large, and the failures are characteristic. Four shapes recur in production deployments.

Premature termination

The model declares the task complete after one shallow attempt. The artifact looks plausible, sometimes runs, and is wrong in a way the user discovers only later. There is no record of why the agent thought it was done, only that it stopped.

Confident silence on the unknown

The model encounters something outside its training distribution (a private API, an unusual environment, an under-documented library) and proceeds anyway, hallucinating a plausible call shape. No retry against ground truth, no confession of uncertainty, no escalation.

Drift across long horizons

For tasks that span dozens of tool calls, the model's working memory ends up reflecting the most recent turns disproportionately. Earlier decisions, constraints, and TODOs decay. The agent ends in a state inconsistent with the task it was given.

No reproducibility, no audit

The artifact ships, but there is no canonical record of what the agent thought, what tools it called, what data it read, or why it chose any particular path. Reviewing the agent's work means redoing the agent's work.

2. What the harness does differently

R1 is harness-driven. The harness, not the model, decides when a phase is complete. The model proposes; the harness gates. This is the entire architectural commitment, and everything below follows from it.

PLAN → EXECUTE → VERIFY → COMMIT

Every task moves through four explicit phases. PLAN fixes scope and produces a tracked task list. EXECUTE runs tools against that list. VERIFY runs a different model family against the artifact, with the implementer's original reasoning kept out of its context. COMMIT is the only point at which anything is allowed to merge, publish, pay, or reply. Each transition is gate-checked.
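
The phase ordering above can be sketched as a small state machine. This is a minimal illustration, not R1's implementation: the `Harness` class, the `GateCheck` signature, and the method names are all hypothetical, and the gate is reduced to a single yes/no callable.

```python
from enum import Enum
from typing import Callable, Optional


class Phase(Enum):
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    VERIFY = "VERIFY"
    COMMIT = "COMMIT"


# Fixed order taken from the text: PLAN -> EXECUTE -> VERIFY -> COMMIT.
ORDER = [Phase.PLAN, Phase.EXECUTE, Phase.VERIFY, Phase.COMMIT]

# Hypothetical gate signature: (from_phase, to_phase) -> True if allowed.
GateCheck = Callable[[Phase, Phase], bool]


class Harness:
    """Illustrative harness: the model proposes, the gate decides."""

    def __init__(self, gate_check: GateCheck) -> None:
        self.phase = Phase.PLAN
        self.gate_check = gate_check

    def next_phase(self) -> Optional[Phase]:
        i = ORDER.index(self.phase)
        return ORDER[i + 1] if i + 1 < len(ORDER) else None

    def advance(self) -> bool:
        """Move to the next phase only if the gate allows the transition."""
        target = self.next_phase()
        if target is None:
            return False  # already at COMMIT
        if not self.gate_check(self.phase, target):
            return False  # gate refused; phase is unchanged
        self.phase = target
        return True

    def can_side_effect(self) -> bool:
        """Merge, publish, pay, and reply are only permitted in COMMIT."""
        return self.phase is Phase.COMMIT
```

A harness built with a gate that refuses the VERIFY to COMMIT transition stays in VERIFY however many times `advance()` is called, so `can_side_effect()` never becomes true for that task.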

Cross-family adversarial review

By default, the implementer and reviewer must come from different model families. A reviewer that watched the implementer think will rationalise the implementer's mistakes; a fresh reviewer will not. An override is allowed but recorded: the receipt makes the decision visible.
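
One way to picture the default and its recorded override is the sketch below. It is an assumption-laden illustration: `ReviewerChoice`, `choose_reviewer`, the `allow_same_family` flag, and the family names are hypothetical and do not come from R1's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReviewerChoice:
    """Hypothetical record of the reviewer decision, destined for the receipt."""
    implementer_family: str
    reviewer_family: str
    overridden: bool
    reason: Optional[str] = None


def choose_reviewer(
    implementer_family: str,
    candidates: List[str],
    allow_same_family: bool = False,
    reason: str = "",
) -> ReviewerChoice:
    """Pick a reviewer family; same-family review requires an explicit override."""
    if allow_same_family:
        # Assumption: an override must carry a reason so it can be recorded.
        if not reason:
            raise ValueError("an override needs a reason to record")
        return ReviewerChoice(implementer_family, implementer_family, True, reason)
    for family in candidates:
        if family != implementer_family:
            return ReviewerChoice(implementer_family, family, False)
    raise LookupError("no reviewer from a different model family is available")
```

The point of returning a record rather than a bare family name is that the override, when it happens, leaves something behind to inspect.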

STOKE tracing

STOKE is the protocol for the trace. Every thought, tool call, memory read, memory write, and gate decision becomes a content-addressed event in a graph. The graph is canonicalised, signed with Ed25519, and published as the receipt for the task. Two separate runs produce two separate receipts, comparable byte-for-byte.
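
Content addressing and canonicalisation can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions, not the STOKE wire format: the event fields (`kind`, `payload`, `parents`), the use of sorted-key JSON as the canonical form, and SHA-256 as the hash are all choices made for illustration. The Ed25519 signing step is left out to keep the example dependency-free; the value returned by `receipt_digest` is what a signer would sign in this sketch.

```python
import hashlib
import json
from typing import Any, Dict, List


def canonical_bytes(obj: Any) -> bytes:
    """Serialise to one stable byte form: sorted keys, no extra whitespace."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def event_id(kind: str, payload: Dict[str, Any], parents: List[str]) -> str:
    """Content address of an event: the hash of its canonical form.

    `parents` holds the ids of the events this one depends on, which is
    what turns a flat log into a graph.
    """
    body = {"kind": kind, "payload": payload, "parents": sorted(parents)}
    return hashlib.sha256(canonical_bytes(body)).hexdigest()


def receipt_digest(event_ids: List[str]) -> str:
    """Digest over the whole graph, independent of insertion order."""
    return hashlib.sha256(canonical_bytes(sorted(event_ids))).hexdigest()
```

Because the id is derived from canonical bytes, two events with the same content but different key order get the same id, which is the property that makes receipts comparable byte-for-byte.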

That receipt model now extends cleanly to remote control and long-running work. Beacon is a sibling protocol to STOKE, not a replacement for it: Beacon establishes who may act, from which device, with which capability token; STOKE records what the action actually did. Daemon Mode keeps that same harness loop alive across process crashes instead of depending on an external cron shell.

The gate machine

Gates are first-class. Built-in gates ship for syntax, tests, lint, secret scanning, and policy. Custom gates register as small WASM modules; they observe a phase transition and return allow, block, or escalate. The harness will not advance a phase past a blocking gate.
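
The allow, block, or escalate contract can be sketched as follows. Real custom gates are described above as WASM modules; this Python version only illustrates the decision logic. The gate functions, the artifact fields (`diff`, `touches_payments`), and the rule that a block outranks an escalation are assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"


Transition = Tuple[str, str]  # (from_phase, to_phase)
Gate = Callable[[Transition, Dict[str, object]], Verdict]


def secret_scan_gate(transition: Transition, artifact: Dict[str, object]) -> Verdict:
    """Toy secret scan: block if the diff appears to contain a private key."""
    text = str(artifact.get("diff", ""))
    return Verdict.BLOCK if "BEGIN PRIVATE KEY" in text else Verdict.ALLOW


def policy_gate(transition: Transition, artifact: Dict[str, object]) -> Verdict:
    """Toy policy: hand payment-related commits to a human."""
    if transition[1] == "COMMIT" and artifact.get("touches_payments"):
        return Verdict.ESCALATE
    return Verdict.ALLOW


def evaluate(gates: List[Gate], transition: Transition,
             artifact: Dict[str, object]) -> Verdict:
    """Combine verdicts. Assumed precedence: block > escalate > allow."""
    verdicts = [gate(transition, artifact) for gate in gates]
    if Verdict.BLOCK in verdicts:
        return Verdict.BLOCK
    if Verdict.ESCALATE in verdicts:
        return Verdict.ESCALATE
    return Verdict.ALLOW
```

Under this sketch the harness would advance only when `evaluate` returns `Verdict.ALLOW`; any other result leaves the task in its current phase.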

3. Most agents are black boxes; most transparent agents are toys

The state of the field has been an ugly choice: pick a black-box framework that ships impressive demos and gives you nothing to debug, or pick a transparent toy that traces every step and cannot complete a real task.

R1 is both transparent and capable. The harness gives you full transparency: every event in the receipt, every memory tier readable on disk, every gate decision logged. The harness also runs to completion: the four-phase loop, the multi-model adversarial review, and the gate machine produce a coding agent that finishes work, not just narrates it.

Model-driven loop

The model decides when to stop. Tools are optional. Memory is the conversation. The artifact ships when the model says it's ready. The trace, if any, is unstructured chat history.

R1 harness loop

The harness decides when a phase is complete. Tools are required by gate. Memory is tiered (L0 identity, L1 critical, L2 topical, L3 semantic). The artifact ships when every gate passes. The trace is the STOKE receipt: signed, comparable, and standalone.

4. When R1 is, and isn't, the right tool

Use R1 when

The work product needs to be reviewable later by someone who wasn't there.
The cost of a wrong decision is higher than the cost of an extra phase.
You want a record that survives the model that produced it.
The task spans long enough that drift is a real failure mode.

Skip R1 when

You want a chat companion, not an agent that ships artifacts.
The task is one tool call wide and one turn deep.
Reproducibility doesn't matter; the artifact is throwaway.

If you want the protocol that powers the receipt, the STOKE spec is the next page. If you want the companion control plane, read Beacon Protocol. If you want deterministic skill authoring, start at Skill Wizard. If you want to install and run, start at the quickstart.
