Anatomy of an Eval Script
An eval script is a bash or TypeScript file that:- Runs tests, measurements, or analysis
- Outputs a JSON object with a primary metric
- Exits with code 0 (even if the metric is bad)
eval/test-coverage.sh
The AGENT_WORKTREE Pattern
Good vs Bad Eval Scripts
❌ Bad: Metric at ceiling
✅ Good: Metric with real gradient
❌ Bad: Binary pass/fail
✅ Good: Continuous signal
Eval Script Requirements
- Output JSON — Must output a single JSON object on stdout
- Primary metric — The JSON must include the metric field matching your agent config
- Exit 0 — Even if the metric is bad. Non-zero exit = eval failure, not bad score
- Deterministic — Same code should produce the same score
- Fast — Under 30 seconds is ideal. Maximum is the agent’s
time_budget_seconds - Stderr for logs — Write debug output to stderr, not stdout
Composite Metrics
For multi-dimensional quality, create a weighted composite:eval/code-quality.sh