TENET implements a simplified reinforcement learning loop for code improvement. It’s not traditional RL with neural network policies playing Atari — it’s the Karpathy autoresearch pattern applied to codebases.
## The Three Components
### 1. State (World Model)
Before each round, TENET captures the system state:
```typescript
interface WorldState {
  systemState: {
    activeAgents: string[]
    hubConnections: number
    buildStatus: Record<string, string>
    pendingEvals: number
  }
  contextState: {
    recentCommits: number
    openPRs: number
    failingTests: number
    codeChurn: number
  }
  agentState: {
    lastEvalScore: number
    rewardEMA: number
    actionHistory: string[]
    consecutiveFailures: number
  }
}
```
This gets converted to an RLState for the policy head:
```typescript
interface RLState {
  composite_score: number
  dimension_scores: Record<string, number>
  tests_passing: number
  tests_total: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}
```
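As a rough sketch of that conversion, assuming `lastEvalScore` carries over as `composite_score` and the action history length becomes `trajectory_length` (the real mapping is internal to TENET and may differ):

```typescript
// Trimmed RLState with just the fields this sketch fills in.
interface RLStateLite {
  composite_score: number
  trajectory_length: number
  recent_deltas: number[]
  agent: string
}

// Hypothetical converter; the field mapping is an assumption, not TENET's actual code.
function toRLState(
  agent: string,
  lastEvalScore: number,
  actionHistory: string[],
  recentDeltas: number[],
): RLStateLite {
  return {
    composite_score: lastEvalScore,        // assumed: last eval score carries over
    trajectory_length: actionHistory.length,
    recent_deltas: recentDeltas.slice(-5), // keep only the most recent deltas
    agent,
  }
}
```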
### 2. Action (Policy Head Selects)
The policy head is a 14M-parameter transformer that predicts reward for candidate actions:
```typescript
interface RLAction {
  type: "fix" | "refactor" | "feature" | "test" | "config" | "experiment"
  description: string
  files_affected: string[]
  scope: "small" | "medium" | "large"
}
```
The agent generates task descriptions informed by:
- Experiment history (what worked, what didn’t)
- Policy head predictions (which action type is most promising)
- Product context (what the team is focused on)
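Reward-ranked selection could be sketched like this, with `predictReward` as a hypothetical stand-in for a forward pass of the policy head:

```typescript
type ActionType = "fix" | "refactor" | "feature" | "test" | "config" | "experiment"

interface Candidate {
  type: ActionType
  description: string
}

// Greedy selection: score every candidate with the policy head and take the best.
// `predictReward` is a placeholder, not TENET's actual inference API.
function selectAction(
  candidates: Candidate[],
  predictReward: (c: Candidate) => number,
): Candidate {
  return candidates.reduce((best, c) =>
    predictReward(c) > predictReward(best) ? c : best,
  )
}
```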
### 3. Reward (Eval Delta)
After the agent makes changes:
```
reward = eval_score_after - eval_score_before
```
- Positive delta → KEPT (change merged to session branch)
- Zero or negative delta → REVERTED (`git reset --hard HEAD~1`)
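The keep/revert rule is small enough to state directly (a sketch; the actual git plumbing is elided):

```typescript
// Decide whether a change survives the round, from the two eval scores.
function decideOutcome(before: number, after: number): "KEPT" | "REVERTED" {
  const reward = after - before
  // reward > 0  → the commit stays on the session branch
  // reward <= 0 → `git reset --hard HEAD~1`
  return reward > 0 ? "KEPT" : "REVERTED"
}
```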
## Training Tuple
Every round produces a training tuple, regardless of outcome:
```json
{
  "agent": "test-coverage",
  "state": {
    "composite_score": 0.1276,
    "dimension_scores": { "test_pass_rate": 1.0, "build_health": 1.0 },
    "trajectory_length": 3,
    "recent_deltas": [0.0031, -0.0002]
  },
  "action": {
    "type": "test",
    "description": "Add tests for claude-md-generator.ts",
    "files_affected": ["src/utils/__tests__/claude-md-generator.test.ts"],
    "scope": "medium"
  },
  "reward": {
    "composite_delta": 0.0031,
    "improved": true
  }
}
```
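Tuples like this are natural to persist one per line (JSONL); the storage format here is an assumption for illustration, not something the text specifies:

```typescript
interface TrainingTuple {
  agent: string
  state: Record<string, unknown>
  action: Record<string, unknown>
  reward: { composite_delta: number; improved: boolean }
}

// Serialize a batch of tuples as JSONL: one training example per line,
// appended regardless of whether the round's change was kept or reverted.
function toJSONL(tuples: TrainingTuple[]): string {
  return tuples.map((t) => JSON.stringify(t)).join("\n")
}
```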
## Why This Works
Traditional RL needs millions of episodes. TENET works with hundreds because:
- The action space is constrained: agents modify specific files in a focused scope
- The eval is deterministic: the same code produces the same score
- The environment resets cleanly: `git reset` provides perfect rollback
- History informs action: agents see what worked and what failed in past rounds
## The Karpathy Connection
This is the autoresearch pattern:
1. Propose an experiment (agent generates a code change)
2. Run the experiment (eval script measures the result)
3. Evaluate the outcome (delta > 0?)
4. Learn from the result (training tuple → policy head)
5. Repeat with better-informed proposals
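The steps above collapse into a short loop body. A sketch, with `propose`, `runEval`, `revert`, and `recordTuple` as hypothetical callbacks rather than TENET's real interfaces:

```typescript
// One round of the autoresearch loop: propose → measure → keep or revert → log.
function runRound(
  propose: () => void,                  // agent applies a code change
  runEval: () => number,                // deterministic eval script
  revert: () => void,                   // e.g. `git reset --hard HEAD~1`
  recordTuple: (delta: number) => void, // every round trains the policy head
): boolean {
  const before = runEval()
  propose()
  const after = runEval()
  const delta = after - before
  recordTuple(delta) // logged whether or not the change is kept
  if (delta <= 0) revert()
  return delta > 0
}
```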
The key insight: you don’t need massive compute. You need a good reward signal (the eval script) and focused actions (`scope_files`).
## Common Pitfalls
Bad reward signal = wasted compute. If your eval is at ceiling (100% test pass rate), agents can’t improve it. If your eval measures the wrong thing (code hygiene when the agent changes functionality), agents will be reverted every time. Always verify your eval has room to improve before running agents.
| Pitfall | Symptom | Fix |
|---|---|---|
| Eval at ceiling | 0% keep rate, delta always 0 | Measure something with a gradient |
| Wrong metric | Agent makes good changes, still reverted | Align the eval with what the agent actually changes |
| Eval tests wrong code | Agent’s worktree not evaluated | Use the `AGENT_WORKTREE` env var |
| Scope too broad | Agent changes unrelated files | Narrow `scope_files` in the agent config |
| Too many rounds | Diminishing returns | Cap at 5-10 rounds per session |
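A quick headroom check along the lines of the first row (a hypothetical helper; the 1.0 ceiling is an assumption about the eval’s scale):

```typescript
// Returns false when the eval is already at its ceiling, i.e. deltas can
// only ever be zero or negative and every change would be reverted.
function hasHeadroom(score: number, ceiling = 1.0, epsilon = 1e-6): boolean {
  return ceiling - score > epsilon
}
```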