Build evals extend the RL improvement loop to greenfield building. Instead of optimizing an existing metric, agents build new modules from specs and iterate until every assertion passes.
The Pattern
spec → eval assertions → agent TOML → `tenet peter agent {name}` → Karpathy loop → PR
- Write a spec describing what to build
- Write an eval script with assertions (file exists? method exists? compiles?)
- Create an agent TOML config with the spec as the task
- Run Peter Parker (`tenet peter agent {name}`); the agent iterates from 0% → 100%
- PR created automatically when score hits 1.0
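The iterate-until-green step above can be sketched as a small loop. This is only an illustration of the control flow, not how the tool is implemented: `buildLoop`, `runRound`, `maxRounds`, and `LoopResult` are hypothetical names, and the only behavior taken from the text is that the eval returns a score and the run stops once that score reaches 1.0.

```typescript
// Hypothetical types; none of these names come from the tool itself.
type Evaluate = () => Promise<number>
type RunRound = (round: number) => Promise<void>

interface LoopResult {
  score: number
  rounds: number
  done: boolean // true when the score reached 1.0
}

// Run agent rounds until every assertion passes or the round budget runs out.
export async function buildLoop(
  runRound: RunRound,
  evaluate: Evaluate,
  maxRounds: number,
): Promise<LoopResult> {
  let score = await evaluate() // a greenfield build normally starts at 0
  let rounds = 0
  while (score < 1 && rounds < maxRounds) {
    rounds += 1
    await runRound(rounds) // the agent creates or edits files
    score = await evaluate() // re-score against the same assertions
  }
  return { score, rounds, done: score >= 1 }
}
```

In this sketch the caller would open the PR only when `done` is true; a run that exhausts `maxRounds` returns its last score instead.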
Writing a Build Eval
A build eval is a TypeScript file that checks spec compliance:
// eval/build/storage-adapter.ts
import { existsSync } from "node:fs"
// fileContains and tscPasses are helpers assumed to be defined elsewhere in the eval code.

export async function evaluate(): Promise<number> {
  // Each check is one binary assertion about the spec.
  const checks = [
    { name: "interface-exists", pass: existsSync("src/lib/storage/interface.ts") },
    { name: "has-read-method", pass: fileContains("src/lib/storage/interface.ts", "read(") },
    { name: "has-write-method", pass: fileContains("src/lib/storage/interface.ts", "write(") },
    { name: "local-impl", pass: existsSync("src/lib/storage/local.ts") },
    { name: "cloud-impl", pass: existsSync("src/lib/storage/cloud.ts") },
    { name: "compiles", pass: tscPasses() },
  ]
  // The score is the fraction of passing checks, from 0 to 1.
  return checks.filter(c => c.pass).length / checks.length
}
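The eval above calls `fileContains` and `tscPasses` without showing them. A minimal sketch of what such helpers might look like follows; both bodies are assumptions rather than the project's actual code. In particular, `tscPasses` assumes a `tsconfig.json` at the repository root and a `tsc` binary reachable through `npx`.

```typescript
import { execSync } from "node:child_process"
import { existsSync, readFileSync } from "node:fs"

// Assumed helper: true when the file exists and contains the substring.
export function fileContains(path: string, needle: string): boolean {
  return existsSync(path) && readFileSync(path, "utf8").includes(needle)
}

// Assumed helper: true when the project type-checks without emitting output.
export function tscPasses(): boolean {
  try {
    execSync("npx tsc --noEmit", { stdio: "ignore" })
    return true
  } catch {
    return false // execSync throws when the command exits non-zero
  }
}
```

A substring match such as `"read("` is a cheap signal, not proof that a method is declared correctly; the `compiles` check is what catches a signature that does not type-check.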
Agent TOML Config
[agent]
name = "build-storage-adapter"
scope = "build"
metric = "spec_compliance"
direction = "maximize"
time_budget_seconds = 600
[eval]
script = "eval/build/storage-adapter.ts"
data = "eval/fixtures/build-baseline.jsonl"
[task]
description = """
Create the TenetStorage adapter with interface,
LocalStorage, and CloudStorage implementations.
"""
Key Insight
“Granularity of feedback determines speed of convergence.”
A monolithic eval with 16 checks stalled at 7% for hours. When the same eval was decomposed into 6 page-level evals, each one hit 100% in one round. Same agent, same code, different gradient.
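One way to picture the decomposition is to tag each check with the unit it belongs to and score each unit separately. The sketch below is illustrative only: the `group` field and the `scoreByGroup` function are hypothetical, and the source does not say how the 16 checks were actually split.

```typescript
interface Check {
  name: string
  group: string // hypothetical label, e.g. the page a check belongs to
  pass: boolean
}

// Fraction of passing checks; an empty list scores 0.
export function score(checks: Check[]): number {
  if (checks.length === 0) return 0
  return checks.filter(c => c.pass).length / checks.length
}

// Split one monolithic check list into per-group scores.
export function scoreByGroup(checks: Check[]): Record<string, number> {
  const groups: Record<string, Check[]> = {}
  for (const check of checks) {
    const bucket = groups[check.group] ?? []
    bucket.push(check)
    groups[check.group] = bucket
  }
  const scores: Record<string, number> = {}
  for (const group of Object.keys(groups)) {
    scores[group] = score(groups[group] ?? [])
  }
  return scores
}
```

The single aggregate number says only how much is left overall, while the per-group scores show which unit is finished and which one to work on next. Running each group as its own eval gives the agent a smaller target per run.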
Build vs RL Agents
| | RL Agent | Build Agent |
|---|---|---|
| Goal | Improve existing metric | Build new code from spec |
| Baseline | Current score (e.g., 0.43) | Zero (nothing exists) |
| Rounds | 5-50, small changes | 3-10, creates files |
| Worktree | From origin/main | From HEAD (inherits merged work) |
| Turns | 15 per round | 40 per round |
| Early stop | No (keep improving) | Yes (stops at 1.0) |