Build agents extend the RL improvement loop to greenfield building. Instead of optimizing an existing metric, they build new modules from specs and iterate until every assertion passes.

The Pattern

spec → eval assertions → agent TOML → `tenet peter agent {name}` → Karpathy loop → PR

“Granularity of feedback determines speed of convergence.” A monolithic eval with 16 checks stalled at 7%. The same eval, decomposed into 6 page-level evals, hit 100% on each in one round. Same agent, same code, different gradient.

Writing a Build Eval

```typescript
// eval/build/storage-adapter.ts
import { existsSync } from "node:fs"
import { resolve } from "node:path"
// fileContains and tscPasses are project eval helpers (import path assumed).
import { fileContains, tscPasses } from "../helpers"

export async function evaluate(): Promise<number> {
  const checks = [
    { name: "interface-exists", pass: existsSync(resolve("src/lib/storage/interface.ts")) },
    { name: "has-read-method", pass: fileContains("src/lib/storage/interface.ts", "read(") },
    { name: "local-impl", pass: existsSync(resolve("src/lib/storage/local.ts")) },
    { name: "compiles", pass: tscPasses() },
  ]
  // Partial credit: the fraction of passing checks gives the agent a gradient.
  return checks.filter(c => c.pass).length / checks.length
}
```
The eval checks the `AGENT_WORKTREE` env var so it tests the agent's worktree, not the main repo.
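A minimal sketch of what that worktree resolution could look like. `AGENT_WORKTREE` comes from the text above; the `inWorktree` helper name and its fallback behavior are assumptions for illustration.

```typescript
import { existsSync } from "node:fs"
import { resolve } from "node:path"

// Resolve an eval path against the agent's worktree when AGENT_WORKTREE is
// set, falling back to the current repo when the eval runs standalone.
// (Helper name and fallback are hypothetical, not from the source.)
function inWorktree(relPath: string): string {
  const root = process.env.AGENT_WORKTREE ?? process.cwd()
  return resolve(root, relPath)
}

// Usage inside a check:
// { name: "interface-exists", pass: existsSync(inWorktree("src/lib/storage/interface.ts")) }
```

With this pattern, every filesystem check in the eval targets the agent's sandbox rather than the main checkout.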

Agent TOML Config

```toml
[agent]
name = "build-storage-adapter"
scope = "build"           # triggers build-specific behavior
metric = "spec_compliance"
direction = "maximize"
time_budget_seconds = 600

[eval]
script = "eval/build/storage-adapter.ts"
data = "eval/fixtures/build-baseline.jsonl"

[task]
description = """
Create the TenetStorage adapter with interface,
LocalStorage, and CloudStorage implementations.
Exact file paths: src/lib/storage/interface.ts, etc.
"""
```

Build vs RL Agents

| | RL Agent | Build Agent |
| --- | --- | --- |
| Goal | Improve existing metric | Build from spec |
| Baseline | Current score | Zero |
| Rounds | 5-50, small changes | 3-10, creates files |
| Worktree | From `origin/main` | From `HEAD` (inherits merged work) |
| Turns | 15 per round | 40 per round |
| Early stop | No | Yes (stops at 1.0) |

Build Supervisor

Between rounds, `checkRound()` detects patterns:

- **Stalled:** 3+ rounds at the same score → injects a hint
- **Filename mismatch:** files created but the eval can't find them → alerts
- **Repeated reverts:** same checks failing → suggests a different approach

The supervisor logs learnings to `.jfl/build-learnings.jsonl` for future sessions.
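A sketch of the stall check and the learnings log, assuming the supervisor keeps a per-agent array of round scores. The 3-round threshold and the `.jfl/build-learnings.jsonl` path come from the text; the function names and JSONL fields are hypothetical.

```typescript
import { appendFileSync } from "node:fs"

// Stalled: the last `window` rounds all produced the same score.
// (Function name and signature are assumptions, not the real supervisor API.)
function checkStalled(scores: number[], window = 3): boolean {
  if (scores.length < window) return false
  const recent = scores.slice(-window)
  return recent.every(s => s === recent[0])
}

// Append one learning per line so future sessions can replay the history.
// (Entry fields are illustrative.)
function logLearning(agent: string, note: string): void {
  const entry = { ts: new Date().toISOString(), agent, note }
  appendFileSync(".jfl/build-learnings.jsonl", JSON.stringify(entry) + "\n")
}
```

When `checkStalled` fires, the supervisor would inject a hint into the next round's prompt and record the event via `logLearning`.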

Eval Decomposition

Break complex builds into sub-evals. Instead of one frontend eval with 16 checks, create 6 page-level evals with 2-3 checks each. Each scores independently, giving the agent gradient from round 1.
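The decomposition can be sketched as a map of independent sub-evals sharing the same scoring rule as the build eval above. The page names and sample checks here are hypothetical placeholders, not the actual frontend eval.

```typescript
type Check = { name: string; pass: boolean }

// Same scoring rule as the build eval: fraction of passing checks.
function score(checks: Check[]): number {
  return checks.filter(c => c.pass).length / checks.length
}

// Each sub-eval owns 2-3 checks and reports its own score, so the agent
// sees per-page progress instead of one nearly flat global number.
// (Page names and check results are illustrative placeholders.)
const pageEvals: Record<string, () => Check[]> = {
  home: () => [
    { name: "route-exists", pass: true },
    { name: "renders", pass: true },
  ],
  settings: () => [
    { name: "route-exists", pass: true },
    { name: "form-saves", pass: false },
  ],
}

function evaluateAll(): Record<string, number> {
  return Object.fromEntries(
    Object.entries(pageEvals).map(([page, run]) => [page, score(run())])
  )
}
```

A page that is done scores 1.0 and can early-stop, while unfinished pages keep supplying gradient, which is exactly the effect the 16-check monolith lacked.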