Documentation Index
Fetch the complete documentation index at: https://docs.10et.ai/llms.txt
Use this file to discover all available pages before exploring further.
Eval scripts are the reward function for your agents. They measure a metric before and after an agent makes changes. The delta determines whether the change is kept or reverted.
Anatomy of an Eval Script
An eval script is a bash or TypeScript file that:
- Runs tests, measurements, or analysis
- Outputs a JSON object with a primary metric
- Exits with code 0 (even if the metric is bad)
#!/usr/bin/env bash
set -euo pipefail
# Use AGENT_WORKTREE when running in agent context,
# fall back to main repo for manual runs
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
# Run jest with coverage
cd "$CLI_DIR"
npx jest --coverage --coverageReporters=json-summary --silent
# Parse coverage report
COVERAGE_FILE="coverage/coverage-summary.json"
LINE_PCT=$(python3 -c "import json; print(json.load(open('$COVERAGE_FILE'))['total']['lines']['pct'])")
# Output JSON with primary metric
SCORE=$(python3 -c "print(round($LINE_PCT / 100.0, 4))")
echo "{\"coverage_percent\": $SCORE, \"line_pct\": $LINE_PCT}"
The AGENT_WORKTREE Pattern
Critical: Agents work in isolated git worktrees at /tmp/tenet-agent-*. Your eval script must test the agent’s changes, not the main branch.Always use AGENT_WORKTREE when referencing the target repo:CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
This uses the worktree when running in agent context, and falls back to the main repo for manual testing.
Good vs Bad Eval Scripts
❌ Bad: Metric at ceiling
# Tests pass rate — already 100%, no gradient
PASSED=$(npx jest --json | jq '.numPassedTests')
TOTAL=$(npx jest --json | jq '.numTotalTests')
echo "{\"pass_rate\": $(echo "$PASSED / $TOTAL" | bc -l)}"
# Always outputs 1.0 — agent can't improve this
✅ Good: Metric with real gradient
# Coverage percentage — 12.8% baseline, 87% room to improve
npx jest --coverage --coverageReporters=json-summary
LINE_PCT=$(jq '.total.lines.pct' coverage/coverage-summary.json)
echo "{\"coverage_percent\": $(echo "$LINE_PCT / 100" | bc -l)}"
# Outputs 0.128 — every test file moves this number
❌ Bad: Binary pass/fail
# Does the build succeed? Yes/no
if npm run build; then echo '{"build": 1.0}'; else echo '{"build": 0.0}'; fi
# No gradient — agent gets 1.0 or 0.0, nothing in between
✅ Good: Continuous signal
# Console.log count — 4461 instances, every cleanup moves the number
COUNT=$(grep -r "console\.\(log\|error\|warn\)" src/ | wc -l)
SCORE=$(python3 -c "print(round(1.0 - min($COUNT, 5000) / 5000, 4))")
echo "{\"quality_score\": $SCORE, \"console_count\": $COUNT}"
Eval Script Requirements
- Output JSON — Must output a single JSON object on stdout
- Primary metric — The JSON must include the metric field matching your agent config
- Exit 0 — Even if the metric is bad. Non-zero exit = eval failure, not bad score
- Deterministic — Same code should produce the same score
- Fast — Under 30 seconds is ideal. Maximum is the agent’s
time_budget_seconds
- Stderr for logs — Write debug output to stderr, not stdout
echo "Running coverage..." >&2 # Goes to stderr (logs)
echo '{"coverage_percent": 0.128}' # Goes to stdout (result)
Composite Metrics
For multi-dimensional quality, create a weighted composite:
#!/usr/bin/env bash
CLI_DIR="${AGENT_WORKTREE:-$HOME/CascadeProjects/my-project}"
# Multiple signals
TODOS=$(grep -r "TODO\|FIXME" "$CLI_DIR/src" | wc -l)
CONSOLE_LOGS=$(grep -r "console\.log" "$CLI_DIR/src" | wc -l)
ANY_USAGE=$(grep -r ": any" "$CLI_DIR/src" | wc -l)
# Individual scores (0-1)
TODO_SCORE=$(python3 -c "print(round(1.0 - min($TODOS, 50) / 100, 4))")
CONSOLE_SCORE=$(python3 -c "print(round(1.0 - min($CONSOLE_LOGS, 5000) / 5000, 4))")
ANY_SCORE=$(python3 -c "print(round(1.0 - min($ANY_USAGE, 1000) / 1000, 4))")
# Weighted composite
QUALITY=$(python3 -c "print(round($TODO_SCORE * 0.3 + $CONSOLE_SCORE * 0.4 + $ANY_SCORE * 0.3, 4))")
echo "{\"quality_score\": $QUALITY, \"todos\": $TODOS, \"console_logs\": $CONSOLE_LOGS, \"any_usage\": $ANY_USAGE}"
Testing Your Eval
Run it manually first:
# From your project root
bash eval/test-coverage.sh
# {"coverage_percent": 0.128, "line_pct": 12.8}
# With agent worktree simulation
AGENT_WORKTREE=/tmp/test-worktree bash eval/test-coverage.sh
Make sure the output is valid JSON and the primary metric has room to improve.