Build agents extend the RL improvement loop to greenfield building. Instead of optimizing an existing metric, they build new modules from specs and iterate until every assertion passes.
The Pattern
spec → eval assertions → agent TOML → `tenet peter agent {name}` → Karpathy loop → PR
“Granularity of feedback determines speed of convergence.” A monolithic eval with 16 checks stalled at 7%. The same eval decomposed into 6 page-level evals — each hit 100% in one round. Same agent, same code, different gradient.
Writing a Build Eval
```ts
// eval/build/storage-adapter.ts
import { existsSync } from "node:fs";
import { resolve } from "node:path";
// fileContains and tscPasses are project-local helpers assumed to be in scope.

export async function evaluate(): Promise<number> {
  const checks = [
    { name: "interface-exists", pass: existsSync(resolve("src/lib/storage/interface.ts")) },
    { name: "has-read-method", pass: fileContains("src/lib/storage/interface.ts", "read(") },
    { name: "local-impl", pass: existsSync(resolve("src/lib/storage/local.ts")) },
    { name: "compiles", pass: tscPasses() },
  ];
  return checks.filter(c => c.pass).length / checks.length;
}
```
The eval checks the `AGENT_WORKTREE` env var so it tests the agent's worktree, not the main repo.
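A minimal sketch of that worktree-aware path resolution, assuming `AGENT_WORKTREE` holds an absolute path to the agent's worktree (the helper name `repoPath` is illustrative, not part of the tool):

```ts
import { resolve } from "node:path";

// Resolve a repo-relative path against the agent's worktree when
// AGENT_WORKTREE is set, otherwise against the main repo (cwd).
export function repoPath(relative: string): string {
  const root = process.env.AGENT_WORKTREE ?? process.cwd();
  return resolve(root, relative);
}
```

Each `existsSync(...)` check in the eval would then go through `repoPath(...)` instead of a bare `resolve(...)`.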
Agent TOML Config
```toml
[agent]
name = "build-storage-adapter"
scope = "build" # triggers build-specific behavior
metric = "spec_compliance"
direction = "maximize"
time_budget_seconds = 600

[eval]
script = "eval/build/storage-adapter.ts"
data = "eval/fixtures/build-baseline.jsonl"

[task]
description = """
Create the TenetStorage adapter with interface,
LocalStorage, and CloudStorage implementations.
Exact file paths: src/lib/storage/interface.ts, etc.
"""
```
Build vs RL Agents
| | RL Agent | Build Agent |
|---|---|---|
| Goal | Improve existing metric | Build from spec |
| Baseline | Current score | Zero |
| Rounds | 5-50, small changes | 3-10, creates files |
| Worktree | From origin/main | From HEAD (inherits merged work) |
| Turns | 15 per round | 40 per round |
| Early stop | No | Yes (stops at 1.0) |
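The table above can be sketched as a branch in a runner config. This is a hypothetical sketch — `Scope`, `RunnerConfig`, and `configFor` are illustrative names, not the tool's API:

```ts
type Scope = "build" | "rl";

interface RunnerConfig {
  baseRef: string;       // ref the worktree branches from
  maxRounds: number;
  turnsPerRound: number;
  earlyStopAt?: number;  // stop as soon as the eval reaches this score
}

// Build agents start from HEAD (inheriting merged work), get more turns
// per round, and stop early at a perfect score; RL agents start from
// origin/main and run their full round budget.
function configFor(scope: Scope): RunnerConfig {
  return scope === "build"
    ? { baseRef: "HEAD", maxRounds: 10, turnsPerRound: 40, earlyStopAt: 1.0 }
    : { baseRef: "origin/main", maxRounds: 50, turnsPerRound: 15 };
}
```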
Build Supervisor
Between rounds, `checkRound()` detects patterns:
- Stalled: 3+ rounds at same score → injects hint
- Filename mismatch: files created but eval can’t find them → alerts
- Repeated reverts: same checks failing → suggests different approach
The supervisor logs learnings to `.jfl/build-learnings.jsonl` for future sessions.
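A hedged sketch of those three between-round checks — `RoundResult` and the hint strings are illustrative, not the actual supervisor API:

```ts
interface RoundResult {
  score: number;
  failingChecks: string[];  // names of checks that did not pass
  filesCreated: string[];
}

// Returns a hint to inject into the next round, or null if the run looks healthy.
function checkRound(history: RoundResult[]): string | null {
  const last = history[history.length - 1];
  // Stalled: 3+ consecutive rounds at the same score.
  if (history.length >= 3 &&
      history.slice(-3).every(r => r.score === last.score)) {
    return "stalled: inspect the exact paths the eval expects";
  }
  // Filename mismatch: files were created but the eval still scores zero.
  if (last.filesCreated.length > 0 && last.score === 0) {
    return "filename mismatch: created files may not match the eval's paths";
  }
  // Repeated reverts: the same checks failing two rounds in a row.
  const prev = history[history.length - 2];
  if (prev && last.failingChecks.length > 0 &&
      last.failingChecks.every(c => prev.failingChecks.includes(c))) {
    return "repeated failures: try a different approach for " + last.failingChecks.join(", ");
  }
  return null;
}
```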
Eval Decomposition
Break complex builds into sub-evals. Instead of one frontend eval with 16 checks, create 6 page-level evals with 2-3 checks each. Each scores independently, giving the agent gradient from round 1.
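Decomposition can be sketched as a set of independently scored sub-evals. `SubEval` and `runSubEvals` are hypothetical names for illustration:

```ts
type SubEval = { name: string; run: () => Promise<number> };

// Each sub-eval scores 0..1 on its own, so progress on any one page shows
// up immediately instead of being diluted across a 16-check monolith.
async function runSubEvals(evals: SubEval[]): Promise<Record<string, number>> {
  const scores: Record<string, number> = {};
  for (const e of evals) scores[e.name] = await e.run();
  return scores;
}
```

An agent targeting a single sub-eval can take it from 0 to 1.0 in one round even while the other pages are still untouched.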