TENET implements a simplified reinforcement learning loop for code improvement. It’s not traditional RL with neural network policies playing Atari — it’s the Karpathy autoresearch pattern applied to codebases.
## The Three Components
### 1. State (World Model)
Before each round, TENET captures the system state:
```typescript
interface WorldState {
  systemState: {
    activeAgents: string[]
    hubConnections: number
    buildStatus: Record<string, string>
    pendingEvals: number
  }
  contextState: {
    recentCommits: number
    openPRs: number
    failingTests: number
    codeChurn: number
  }
  agentState: {
    lastEvalScore: number
    rewardEMA: number
    actionHistory: string[]
    consecutiveFailures: number
  }
}
```
This gets converted to an RLState for the policy head:
```typescript
interface RLState {
  composite_score: number
  dimension_scores: Record<string, number>
  tests_passing: number
  tests_total: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}
```
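As a rough sketch of that conversion, assuming `lastEvalScore` carries over as `composite_score` and the action history length becomes `trajectory_length` (the real mapping is internal to TENET and may differ):

```typescript
// Trimmed RLState with just the fields this sketch fills in.
interface RLStateLite {
  composite_score: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}

// Hypothetical converter; the field mapping is an assumption, not TENET's actual code.
function toRLState(
  agent: string,
  lastEvalScore: number,
  actionHistory: string[],
  recentDeltas: number[],
): RLStateLite {
  return {
    composite_score: lastEvalScore,        // assumed: last eval score carries over
    trajectory_length: actionHistory.length,
    recent_deltas: recentDeltas.slice(-5), // keep only the most recent deltas
    agent,
  }
}
```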
### 2. Action (Policy Head Selects)
The policy head is a 14M-parameter transformer that predicts reward for candidate actions:
```typescript
interface RLAction {
  type: "fix" | "refactor" | "feature" | "test" | "config" | "experiment"
  description: string
  files_affected: string[]
  scope: "small" | "medium" | "large"
}
```
The agent generates task descriptions informed by:
- Experiment history (what worked, what didn’t)
- Policy head predictions (which action type is most promising)
- Product context (what the team is focused on)
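Reward-ranked selection could be sketched like this, with `predictReward` as a hypothetical stand-in for a forward pass of the policy head:

```typescript
type ActionType = "fix" | "refactor" | "feature" | "test" | "config" | "experiment"

interface Candidate {
  type: ActionType
  description: string
}

// Greedy selection: score every candidate with the policy head and take the best.
// `predictReward` is a placeholder, not TENET's actual inference API.
function selectAction(
  candidates: Candidate[],
  predictReward: (c: Candidate) => number,
): Candidate {
  return candidates.reduce((best, c) =>
    predictReward(c) > predictReward(best) ? c : best,
  )
}
```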
### 3. Reward (Eval Delta)
After the agent makes changes:
```
reward = eval_score_after - eval_score_before
```
- Positive delta → KEPT (change merged to session branch)
- Zero or negative delta → REVERTED (`git reset --hard HEAD~1`)
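The keep/revert rule is small enough to state directly (a sketch; the actual git plumbing is elided):

```typescript
// Decide whether a change survives the round, from the two eval scores.
function decideOutcome(before: number, after: number): "KEPT" | "REVERTED" {
  const reward = after - before
  // reward > 0  → the commit stays on the session branch
  // reward <= 0 → `git reset --hard HEAD~1`
  return reward > 0 ? "KEPT" : "REVERTED"
}
```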
## Training Tuple
Every round produces a training tuple, regardless of outcome:
```json
{
  "agent": "test-coverage",
  "state": {
    "composite_score": 0.1276,
    "dimension_scores": { "test_pass_rate": 1.0, "build_health": 1.0 },
    "trajectory_length": 3,
    "recent_deltas": [0.0031, -0.0002]
  },
  "action": {
    "type": "test",
    "description": "Add tests for claude-md-generator.ts",
    "files_affected": ["src/utils/__tests__/claude-md-generator.test.ts"],
    "scope": "medium"
  },
  "reward": {
    "composite_delta": 0.0031,
    "improved": true
  }
}
```
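Tuples like this are natural to persist one per line (JSONL); the storage format here is an assumption for illustration, not something the text specifies:

```typescript
interface TrainingTuple {
  agent: string
  state: Record<string, unknown>
  action: Record<string, unknown>
  reward: { composite_delta: number; improved: boolean }
}

// Serialize a batch of tuples as JSONL: one training example per line,
// appended regardless of whether the round's change was kept or reverted.
function toJSONL(tuples: TrainingTuple[]): string {
  return tuples.map((t) => JSON.stringify(t)).join("\n")
}
```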
## Why This Works
Traditional RL needs millions of episodes. TENET works with hundreds because:
- The action space is constrained: agents modify specific files in a focused scope
- The eval is deterministic: the same code produces the same score
- The environment resets cleanly: `git reset` provides perfect rollback
- History informs action: agents see what worked and what failed in past rounds
## The Karpathy Connection
This is the autoresearch pattern:
1. Propose an experiment (agent generates a code change)
2. Run the experiment (eval script measures the result)
3. Evaluate the outcome (delta > 0?)
4. Learn from the result (training tuple → policy head)
5. Repeat with better-informed proposals
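The steps above collapse into a short loop body. A sketch, with `propose`, `runEval`, `revert`, and `recordTuple` as hypothetical callbacks rather than TENET's real interfaces:

```typescript
// One round of the autoresearch loop: propose → measure → keep or revert → log.
function runRound(
  propose: () => void,                  // agent applies a code change
  runEval: () => number,                // deterministic eval script
  revert: () => void,                   // e.g. `git reset --hard HEAD~1`
  recordTuple: (delta: number) => void, // every round trains the policy head
): boolean {
  const before = runEval()
  propose()
  const after = runEval()
  const delta = after - before
  recordTuple(delta) // logged whether or not the change is kept
  if (delta <= 0) revert()
  return delta > 0
}
```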
The key insight: you don’t need massive compute. You need a good reward signal (the eval script) and focused actions (`scope_files`).
## Common Pitfalls
Bad reward signal = wasted compute. If your eval is at ceiling (100% test pass rate), agents can’t improve it. If your eval measures the wrong thing (code hygiene when the agent changes functionality), agents will be reverted every time. Always verify your eval has room to improve before running agents.
| Pitfall | Symptom | Fix |
|---|---|---|
| Eval at ceiling | 0% keep rate, delta always 0 | Measure something with a gradient |
| Wrong metric | Agent makes good changes, still reverted | Align the eval with what the agent actually changes |
| Eval tests wrong code | Agent’s worktree not evaluated | Use the `AGENT_WORKTREE` env var |
| Scope too broad | Agent changes unrelated files | Narrow `scope_files` in the agent config |
| Too many rounds | Diminishing returns | Cap at 5-10 rounds per session |
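A quick headroom check along the lines of the first row (a hypothetical helper; the 1.0 ceiling is an assumption about the eval’s scale):

```typescript
// Returns false when the eval is already at its ceiling, i.e. deltas can
// only ever be zero or negative and every change would be reverted.
function hasHeadroom(score: number, ceiling = 1.0, epsilon = 1e-6): boolean {
  return ceiling - score > epsilon
}
```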