The eval system is the reward function for the RL loop. It runs before and after an agent makes changes, and the delta determines whether the change is kept.

Eval Flow

Baseline eval (before changes)
  → Agent makes code change
  → Post-change eval (same script)
  → delta = post - baseline
  → delta > 0 → KEPT
  → delta ≤ 0 → REVERTED
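The keep/revert rule above can be sketched as follows. This is a hypothetical helper, not TENET's actual implementation:

```python
def decide(baseline: float, post: float) -> str:
    """Keep the change only if the eval delta is strictly positive."""
    delta = post - baseline
    return "KEPT" if delta > 0 else "REVERTED"

# A zero delta counts as no improvement, so the change is reverted.
print(decide(0.1276, 0.1307))  # delta = +0.0031 -> KEPT
print(decide(0.1307, 0.1307))  # delta = 0 -> REVERTED
```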

Eval Store

All eval results are stored in .jfl/eval.jsonl:
{
  "v": 1,
  "ts": "2026-03-22T21:30:00Z",
  "agent": "test-coverage",
  "run_id": "test-coverage-4bc3ff95",
  "metrics": {
    "coverage_percent": 0.1307,
    "line_pct": 13.37,
    "branch_pct": 12.27
  },
  "composite": 0.1307,
  "delta": 0.0031,
  "improved": true
}
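Because the store is plain JSONL, it is easy to process outside the CLI. A minimal sketch that filters one agent's records, assuming the record shape shown above (`load_trajectory` is a hypothetical helper, not part of `jfl`):

```python
import json
from pathlib import Path

def load_trajectory(store: Path, agent: str) -> list[dict]:
    """Return eval records for one agent, in file (chronological) order."""
    records = []
    for line in store.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        if rec.get("agent") == agent:
            records.append(rec)
    return records

# Usage: load_trajectory(Path(".jfl/eval.jsonl"), "test-coverage")
```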

Viewing Eval History

# Current eval status
jfl eval status

# Compare two snapshots
jfl eval compare

# View trajectory for an agent
jfl eval trajectory --agent test-coverage

Eval Snapshots

When an agent starts, TENET freezes the eval script into a snapshot (SHA-based). This ensures the eval doesn’t change mid-run — the same script measures baseline and post-change. Snapshots are cached at ~/.cache/jfl/eval-snapshots/<hash>/.
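Content-addressed caching of this kind can be sketched as below. This is an illustration only: the hash algorithm (sha256 over the script bytes) and the function name are assumptions, and the cache root is parameterized here for clarity:

```python
import hashlib
import shutil
from pathlib import Path

def snapshot_eval_script(script: Path, cache: Path) -> Path:
    """Copy the eval script into a content-addressed cache directory so the
    exact same bytes measure both the baseline and post-change runs."""
    digest = hashlib.sha256(script.read_bytes()).hexdigest()
    dest_dir = cache / digest
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / script.name
    if not dest.exists():  # already snapshotted: reuse the frozen copy
        shutil.copy2(script, dest)
    return dest
```

In TENET the cache root is `~/.cache/jfl/eval-snapshots/`; because the directory name is derived from the script's content, editing the script mid-run cannot affect an in-flight snapshot.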

Writing Good Evals

See Eval Scripts for the complete guide on writing eval scripts that produce real gradient. Key principles:
  1. Output JSON with a primary metric
  2. Use AGENT_WORKTREE for cross-repo agents
  3. Ensure the metric has room to improve (not at ceiling)
  4. Keep evals fast (under 30s) and deterministic
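A minimal sketch of an eval script following these principles, assuming the JSON record shape shown in the Eval Store section; the measurement here is a placeholder, and the authoritative contract lives in the Eval Scripts guide:

```python
import json
import os

def measure(root: str) -> dict:
    """Return metrics for the checkout at `root`."""
    covered, total = 1307, 10000  # placeholder numbers, not real coverage
    return {
        "metrics": {"coverage_percent": covered / total},
        "composite": covered / total,  # primary metric the harness compares
    }

# Cross-repo agents should measure the agent's worktree, not the CWD.
print(json.dumps(measure(os.environ.get("AGENT_WORKTREE", "."))))
```

Keeping the measurement deterministic and the script side-effect free is what makes the baseline/post-change delta meaningful.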