TENET implements a simplified reinforcement learning loop for code improvement. It’s not traditional RL with neural network policies playing Atari — it’s the Karpathy autoresearch pattern applied to codebases.

The Three Components

1. State (World Model)

Before each round, TENET captures the system state:
```typescript
interface WorldState {
  systemState: {
    activeAgents: string[]
    hubConnections: number
    buildStatus: Record<string, string>
    pendingEvals: number
  }
  contextState: {
    recentCommits: number
    openPRs: number
    failingTests: number
    codeChurn: number
  }
  agentState: {
    lastEvalScore: number
    rewardEMA: number
    actionHistory: string[]
    consecutiveFailures: number
  }
}
```
This gets converted to an RLState for the policy head:
```typescript
interface RLState {
  composite_score: number
  dimension_scores: Record<string, number>
  tests_passing: number
  tests_total: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}
```
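As a sketch, the conversion from captured state to an RLState might look like the following. The names toRLState and EvalResult are hypothetical, not TENET's actual API; RLState is repeated so the snippet stands alone.

```typescript
// Hypothetical sketch: toRLState and EvalResult are assumed names.
// RLState is repeated from the interface above so this compiles standalone.
interface RLState {
  composite_score: number
  dimension_scores: Record<string, number>
  tests_passing: number
  tests_total: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}

interface EvalResult {
  composite: number
  dimensions: Record<string, number>
  passing: number
  total: number
}

function toRLState(
  agent: string,
  ev: EvalResult,
  actionHistory: string[],
  deltas: number[],
): RLState {
  return {
    composite_score: ev.composite,
    dimension_scores: ev.dimensions,
    tests_passing: ev.passing,
    tests_total: ev.total,
    trajectory_length: actionHistory.length,
    recent_deltas: deltas.slice(-5), // keep only a short window of deltas
    agent,
  }
}
```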

2. Action (Policy Head Selects)

The policy head is a 14M-parameter transformer that predicts reward for candidate actions:
```typescript
interface RLAction {
  type: "fix" | "refactor" | "feature" | "test" | "config" | "experiment"
  description: string
  files_affected: string[]
  scope: "small" | "medium" | "large"
}
```
The agent generates task descriptions informed by:
  • Experiment history (what worked, what didn’t)
  • Policy head predictions (which action type is most promising)
  • Product context (what the team is focused on)
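The selection step can be sketched as a greedy argmax over predicted rewards. Here predictReward is a stub standing in for the 14M-parameter policy head, and selectAction is a hypothetical name, not TENET's API.

```typescript
// Sketch only: selectAction is hypothetical; predictReward stands in for
// the policy head's reward prediction.
interface RLAction {
  type: "fix" | "refactor" | "feature" | "test" | "config" | "experiment"
  description: string
  files_affected: string[]
  scope: "small" | "medium" | "large"
}

function selectAction(
  candidates: RLAction[], // assumed non-empty
  predictReward: (a: RLAction) => number,
): RLAction {
  // Greedy argmax over the predicted reward of each candidate action.
  return candidates.reduce((best, a) =>
    predictReward(a) > predictReward(best) ? a : best
  )
}
```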

3. Reward (Eval Delta)

After the agent makes changes:
```
reward = eval_score_after - eval_score_before
```
  • Positive delta → KEPT (change merged to session branch)
  • Zero or negative delta → REVERTED (git reset --hard HEAD~1)
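The keep/revert rule above can be sketched in a few lines; judgeRound is a hypothetical helper, not TENET's actual code.

```typescript
// Sketch of the keep/revert decision: judgeRound is an assumed name.
function judgeRound(before: number, after: number): { delta: number; kept: boolean } {
  const delta = after - before
  // Strictly positive delta is kept; zero or negative is reverted
  // (in TENET, via `git reset --hard HEAD~1`).
  return { delta, kept: delta > 0 }
}
```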

Training Tuple

Every round produces a training tuple, regardless of outcome:
```json
{
  "agent": "test-coverage",
  "state": {
    "composite_score": 0.1276,
    "dimension_scores": { "test_pass_rate": 1.0, "build_health": 1.0 },
    "trajectory_length": 3,
    "recent_deltas": [0.0031, -0.0002]
  },
  "action": {
    "type": "test",
    "description": "Add tests for claude-md-generator.ts",
    "files_affected": ["src/utils/__tests__/claude-md-generator.test.ts"],
    "scope": "medium"
  },
  "reward": {
    "composite_delta": 0.0031,
    "improved": true
  }
}
```
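One natural way to persist these tuples is one JSON object per line, so they can be streamed into policy-head training. This is a sketch under that assumption; serializeTuple and logTuple are hypothetical names, not TENET's actual storage code.

```typescript
// Sketch: JSONL persistence for training tuples. Names are assumptions.
import { appendFileSync } from "fs"

function serializeTuple(tuple: object): string {
  // One tuple per line keeps the log append-only and easy to stream.
  return JSON.stringify(tuple) + "\n"
}

function logTuple(path: string, tuple: object): void {
  appendFileSync(path, serializeTuple(tuple))
}
```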

Why This Works

Traditional RL needs millions of episodes. TENET works with hundreds because:
  1. The action space is constrained — agents modify specific files in a focused scope
  2. The eval is deterministic — same code produces the same score
  3. The environment resets cleanly — git reset provides perfect rollback
  4. History informs action — agents see what worked/failed in past rounds

The Karpathy Connection

This is the autoresearch pattern:
  • Propose an experiment (agent generates a code change)
  • Run the experiment (eval script measures the result)
  • Evaluate the outcome (delta > 0?)
  • Learn from the result (training tuple → policy head)
  • Repeat with better-informed proposals
The key insight: you don’t need massive compute. You need a good reward signal (the eval script) and focused actions (scoped files).
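The loop above can be sketched end to end. Every name here (autoresearchRound, propose, runEval, record) is a hypothetical stand-in for TENET internals.

```typescript
// Sketch of one propose/run/evaluate/learn round. All names are assumptions.
interface RoundResult { delta: number; kept: boolean }

function autoresearchRound(
  evalBefore: number,
  propose: () => string,            // agent generates a code change
  runEval: () => number,            // eval script measures the result
  record: (r: RoundResult) => void, // training tuple -> policy head
): RoundResult {
  propose()
  const delta = runEval() - evalBefore
  const result = { delta, kept: delta > 0 }
  record(result) // every round produces a training tuple, kept or reverted
  return result
}
```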

Common Pitfalls

Bad reward signal = wasted compute. If your eval is at ceiling (100% test pass rate), agents can’t improve it. If your eval measures the wrong thing (code hygiene when the agent changes functionality), agents will be reverted every time.

Always verify your eval has room to improve before running agents.
| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Eval at ceiling | 0% keep rate, delta always 0 | Measure something with gradient |
| Wrong metric | Agent makes good changes, still reverted | Align eval with what the agent actually changes |
| Eval tests wrong code | Agent’s worktree not evaluated | Use AGENT_WORKTREE env var |
| Scope too broad | Agent changes unrelated files | Narrow scope_files in agent config |
| Too many rounds | Diminishing returns | Cap at 5-10 rounds per session |
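A cheap pre-flight guard for the ceiling pitfall might look like this; hasHeadroom and its thresholds are assumptions, not part of TENET.

```typescript
// Sketch: pre-flight check for the "eval at ceiling" pitfall.
// hasHeadroom, ceiling, and margin are all assumed, not TENET's API.
function hasHeadroom(score: number, ceiling = 1.0, margin = 0.01): boolean {
  // If the composite score is within `margin` of the ceiling, agents have
  // nothing left to improve and every round will be reverted.
  return ceiling - score > margin
}
```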