Eval scripts are the reward function for your agents. They measure a metric before and after an agent makes changes. The delta determines whether the change is kept or reverted.

Anatomy of an Eval Script

An eval script is a bash or TypeScript file that:
  1. Runs tests, measurements, or analysis
  2. Outputs a JSON object with a primary metric
  3. Exits with code 0 (even if the metric is bad)
eval/test-coverage.sh
#!/usr/bin/env bash
set -euo pipefail

# Use AGENT_WORKTREE when running in agent context,
# fall back to main repo for manual runs
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"

# Run jest with coverage
cd "$CLI_DIR"
npx jest --coverage --coverageReporters=json-summary --silent >&2

# Parse coverage report
COVERAGE_FILE="coverage/coverage-summary.json"
LINE_PCT=$(python3 -c "import json; print(json.load(open('$COVERAGE_FILE'))['total']['lines']['pct'])")

# Output JSON with primary metric
SCORE=$(python3 -c "print(round($LINE_PCT / 100.0, 4))")
echo "{\"coverage_percent\": $SCORE, \"line_pct\": $LINE_PCT}"

The AGENT_WORKTREE Pattern

Critical: Agents work in isolated git worktrees at /tmp/jfl-agent-*. Your eval script must test the agent's changes, not the main branch. Always use AGENT_WORKTREE when referencing the target repo:
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
This uses the worktree when running in agent context, and falls back to the main repo for manual testing.
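A minimal sketch of the fallback behavior (the worktree path below is a hypothetical stand-in):

```shell
# With AGENT_WORKTREE unset, the default project path is used.
unset AGENT_WORKTREE
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
echo "manual run -> $CLI_DIR"

# With AGENT_WORKTREE set (as in agent context), the worktree wins.
AGENT_WORKTREE=/tmp/jfl-agent-demo
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
echo "agent run -> $CLI_DIR"
```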

Good vs Bad Eval Scripts

❌ Bad: Metric at ceiling

# Test pass rate — already 100%, no gradient
RESULTS=$(npx jest --json --silent 2>/dev/null)
PASSED=$(echo "$RESULTS" | jq '.numPassedTests')
TOTAL=$(echo "$RESULTS" | jq '.numTotalTests')
echo "{\"pass_rate\": $(python3 -c "print(round($PASSED / $TOTAL, 4))")}"
# Always outputs 1.0 — agent can't improve this

✅ Good: Metric with real gradient

# Coverage percentage — 12.8% baseline, 87% room to improve
npx jest --coverage --coverageReporters=json-summary --silent >&2
LINE_PCT=$(jq '.total.lines.pct' coverage/coverage-summary.json)
echo "{\"coverage_percent\": $(python3 -c "print(round($LINE_PCT / 100, 4))")}"
# Outputs 0.128 — every test file moves this number

❌ Bad: Binary pass/fail

# Does the build succeed? Yes/no
if npm run build >&2; then echo '{"build": 1.0}'; else echo '{"build": 0.0}'; fi
# No gradient — agent gets 1.0 or 0.0, nothing in between

✅ Good: Continuous signal

# Console.log count — 4461 instances, every cleanup moves the number
COUNT=$(grep -r "console\.\(log\|error\|warn\)" src/ | wc -l)
SCORE=$(python3 -c "print(round(1.0 - min($COUNT, 5000) / 5000, 4))")
echo "{\"quality_score\": $SCORE, \"console_count\": $COUNT}"
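The clamped linear transform above maps the raw count into [0, 1]. A quick endpoint check, mirroring the script's python3 one-liner:

```shell
# 0 instances -> perfect score; at or above the 5000 cap -> floor of 0.0
python3 -c "print(round(1.0 - min(0, 5000) / 5000, 4))"      # 1.0
python3 -c "print(round(1.0 - min(4461, 5000) / 5000, 4))"   # 0.1078
python3 -c "print(round(1.0 - min(9999, 5000) / 5000, 4))"   # 0.0
```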

Eval Script Requirements

  1. Output JSON — Must output a single JSON object on stdout
  2. Primary metric — The JSON must include the metric field matching your agent config
  3. Exit 0 — Even if the metric is bad. Non-zero exit = eval failure, not bad score
  4. Deterministic — Same code should produce the same score
  5. Fast — Under 30 seconds is ideal. Maximum is the agent’s time_budget_seconds
  6. Stderr for logs — Write debug output to stderr, not stdout
echo "Running coverage..." >&2  # Goes to stderr (logs)
echo '{"coverage_percent": 0.128}'  # Goes to stdout (result)
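You can verify the separation by capturing only stdout; a quick sketch:

```shell
# Run both echoes; discard stderr and keep only what lands on stdout.
RESULT=$(
  {
    echo "Running coverage..." >&2          # log line (stderr)
    echo '{"coverage_percent": 0.128}'      # result (stdout)
  } 2>/dev/null
)
echo "$RESULT"
```

If RESULT contains anything besides the JSON object, a log line leaked onto stdout.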

Composite Metrics

For multi-dimensional quality, create a weighted composite:
eval/code-quality.sh
#!/usr/bin/env bash
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"

# Multiple signals
TODOS=$(grep -r "TODO\|FIXME" "$CLI_DIR/src" | wc -l)
CONSOLE_LOGS=$(grep -r "console\.log" "$CLI_DIR/src" | wc -l)
ANY_USAGE=$(grep -r ": any" "$CLI_DIR/src" | wc -l)

# Individual scores (0-1)
TODO_SCORE=$(python3 -c "print(round(1.0 - min($TODOS, 50) / 100, 4))")
CONSOLE_SCORE=$(python3 -c "print(round(1.0 - min($CONSOLE_LOGS, 5000) / 5000, 4))")
ANY_SCORE=$(python3 -c "print(round(1.0 - min($ANY_USAGE, 1000) / 1000, 4))")

# Weighted composite
QUALITY=$(python3 -c "print(round($TODO_SCORE * 0.3 + $CONSOLE_SCORE * 0.4 + $ANY_SCORE * 0.3, 4))")

echo "{\"quality_score\": $QUALITY, \"todos\": $TODOS, \"console_logs\": $CONSOLE_LOGS, \"any_usage\": $ANY_USAGE}"
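The weights (0.3 + 0.4 + 0.3) sum to 1.0, so the composite stays in [0, 1] as long as each sub-score does. To sanity-check the formula, you can plug in hypothetical counts (chosen only to exercise the math):

```shell
# Hypothetical counts, not real measurements
TODOS=20; CONSOLE_LOGS=2500; ANY_USAGE=400

TODO_SCORE=$(python3 -c "print(round(1.0 - min($TODOS, 50) / 100, 4))")              # 0.8
CONSOLE_SCORE=$(python3 -c "print(round(1.0 - min($CONSOLE_LOGS, 5000) / 5000, 4))") # 0.5
ANY_SCORE=$(python3 -c "print(round(1.0 - min($ANY_USAGE, 1000) / 1000, 4))")        # 0.6

# 0.8*0.3 + 0.5*0.4 + 0.6*0.3 = 0.62
QUALITY=$(python3 -c "print(round($TODO_SCORE * 0.3 + $CONSOLE_SCORE * 0.4 + $ANY_SCORE * 0.3, 4))")
echo "{\"quality_score\": $QUALITY}"
```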

Testing Your Eval

Run it manually first:
# From your project root
bash eval/test-coverage.sh
# {"coverage_percent": 0.128, "line_pct": 12.8}

# With agent worktree simulation
AGENT_WORKTREE=/tmp/test-worktree bash eval/test-coverage.sh
Make sure the output is valid JSON and the primary metric has room to improve.
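One way to check both properties in a single step (a sketch; the inline JSON stands in for your script's actual output):

```shell
# Stand-in for: OUTPUT=$(bash eval/test-coverage.sh 2>/dev/null)
OUTPUT='{"coverage_percent": 0.128, "line_pct": 12.8}'

# Exits non-zero if stdout isn't valid JSON or the primary metric is missing/out of range.
echo "$OUTPUT" | python3 -c '
import json, sys
data = json.load(sys.stdin)
assert 0.0 <= data["coverage_percent"] <= 1.0, "metric out of range"
print("eval output OK")
'
```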