The policy head is a small neural network trained on your project’s training buffer. Given the current system state, it predicts the reward each candidate action is likely to produce.

Architecture

  • Input: RLState (composite score, dimensions, trajectory)
  • Core: 4-layer transformer (512 hidden, 8 heads)
  • Output: predicted reward for each candidate action
Specs:
  • 14M parameters
  • Trained on MPS (Apple Silicon) or CPU
  • Checkpoint: .jfl/checkpoints/policy-head-v2.json
  • Weights: .jfl/checkpoints/best_policy_head.pt
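As a sanity check on the specs above, the parameter count can be roughly reconstructed from the architecture. This is a back-of-the-envelope sketch: the FFN expansion factor of 4 and the presence of an input projection from 768 to 512 dims are assumptions, not read from the checkpoint.

```python
# Rough parameter-count estimate for a 4-layer, 512-hidden, 8-head
# transformer head over 768-dim input embeddings.
def estimate_params(embed_dim: int = 768, hidden: int = 512,
                    layers: int = 4, ffn_mult: int = 4) -> int:
    """Approximate trainable parameters of a small transformer head."""
    input_proj = embed_dim * hidden + hidden           # project 768 -> 512
    attn = 4 * (hidden * hidden + hidden)              # Q, K, V, O projections
    ffn = 2 * hidden * (ffn_mult * hidden) + ffn_mult * hidden + hidden
    norms = 2 * 2 * hidden                             # two LayerNorms per layer
    per_layer = attn + ffn + norms
    head = hidden + 1                                  # scalar reward output
    return input_proj + layers * per_layer + head

print(estimate_params())  # ~13M; the checkpoint reports 14,191,628
```

The estimate lands in the same ballpark as the reported 14M; the gap would come from details (positional encodings, biases, pooling) this sketch omits.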

How It’s Used

During agent runs, the policy head scores candidate actions:
# Score a single action
jfl policy score --type fix --description "Add error handling to auth module" --scope small

# Rank multiple actions
jfl policy rank '[
  {"type": "test", "description": "Add tests for config loader", "scope": "small"},
  {"type": "refactor", "description": "Extract auth middleware", "scope": "medium"},
  {"type": "fix", "description": "Fix memory leak in hub", "scope": "large"}
]'
  Ranked Actions (predicted reward):
  1. [+0.0042] test: Add tests for config loader (small)
  2. [+0.0018] fix: Fix memory leak in hub (large)
  3. [-0.0003] refactor: Extract auth middleware (medium)
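Conceptually, `jfl policy rank` scores each candidate and sorts by predicted reward, descending. The sketch below mirrors that flow; `predict_reward` is a hypothetical stand-in for the real model, hard-coded to reproduce the example scores above.

```python
# Minimal sketch of ranking candidate actions by predicted reward.
def predict_reward(action: dict) -> float:
    # Stand-in scores mirroring the example output above (not the real model).
    demo = {"test": 0.0042, "fix": 0.0018, "refactor": -0.0003}
    return demo[action["type"]]

def rank_actions(actions: list[dict]) -> list[tuple[float, dict]]:
    scored = [(predict_reward(a), a) for a in actions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

candidates = [
    {"type": "test", "description": "Add tests for config loader", "scope": "small"},
    {"type": "refactor", "description": "Extract auth middleware", "scope": "medium"},
    {"type": "fix", "description": "Fix memory leak in hub", "scope": "large"},
]
for reward, action in rank_actions(candidates):
    print(f"[{reward:+.4f}] {action['type']}: {action['description']} ({action['scope']})")
```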

Training

When It Trains

The nightly loop retrains the policy head automatically once 50 or more new tuples have accumulated since the last training run:
# In peter daily:
BUFFER_SIZE=$(wc -l < .jfl/training-buffer.jsonl)
LAST_TRAINED=$(jq '.trained_on' .jfl/checkpoints/policy-head-v2.json)
NEW_TUPLES=$((BUFFER_SIZE - LAST_TRAINED))
if [ "$NEW_TUPLES" -ge 50 ]; then
  jfl train transform && jfl train policy-head --force
fi
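The shell check above reduces to a simple threshold comparison, sketched here in Python for clarity:

```python
# Retrain when 50+ new tuples have accumulated since the checkpoint
# was last trained (same logic as the shell snippet above).
RETRAIN_THRESHOLD = 50

def should_retrain(buffer_size: int, trained_on: int,
                   threshold: int = RETRAIN_THRESHOLD) -> bool:
    return buffer_size - trained_on >= threshold

# E.g. with 2764 buffered tuples and a checkpoint trained on 1565:
print(should_retrain(2764, 1565))  # True: 1199 new tuples
```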

Manual Training

# Transform raw tuples into training format
jfl train transform

# Train policy head
jfl train policy-head --force

Training Data

The training buffer (.jfl/training-buffer.jsonl) contains tuples from:
  • Agent autoresearch rounds (kept and reverted)
  • Manual journal entries (mined by tuple miner)
  • Cross-service events
Current stats: 2764 tuples, 91.6% validation accuracy.
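Since the buffer is JSON Lines, it is easy to inspect which sources contributed tuples. The `source` field name below is an assumption about the record schema, used purely for illustration:

```python
# Sketch: count training-buffer tuples per source. In practice you would
# open .jfl/training-buffer.jsonl; an in-memory buffer is used here so
# the example is self-contained. The "source" field is a schema assumption.
import io
import json

buffer = io.StringIO(
    '{"source": "autoresearch", "reward": 0.01}\n'
    '{"source": "journal", "reward": -0.002}\n'
    '{"source": "autoresearch", "reward": 0.004}\n'
)

counts: dict[str, int] = {}
for line in buffer:
    tup = json.loads(line)
    counts[tup["source"]] = counts.get(tup["source"], 0) + 1

print(counts)  # {'autoresearch': 2, 'journal': 1}
```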

Checkpoint

{
  "version": 2,
  "architecture": "transformer-4layer-512h",
  "embedding_dim": 768,
  "hidden_dim": 512,
  "num_layers": 4,
  "num_heads": 8,
  "trained_on": 1565,
  "val_accuracy": 0.9164,
  "parameters": 14191628,
  "tool_to_index": {
    "fix_bug": 0,
    "refactor_code": 1,
    "add_feature": 2,
    "add_tests": 3,
    "update_config": 4,
    "run_experiment": 5
  }
}
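The checkpoint metadata above is plain JSON, so tooling can read it directly. The sketch below loads a trimmed copy and resolves an action type to its model input index via `tool_to_index`; the mapping from CLI action types (e.g. `fix`) to tool names (e.g. `fix_bug`) is an assumption for illustration.

```python
# Sketch: read checkpoint metadata and map an action type to its
# tool_to_index slot. The ACTION_TO_TOOL mapping is hypothetical.
import json

checkpoint = json.loads("""{
  "version": 2,
  "val_accuracy": 0.9164,
  "tool_to_index": {
    "fix_bug": 0, "refactor_code": 1, "add_feature": 2,
    "add_tests": 3, "update_config": 4, "run_experiment": 5
  }
}""")

ACTION_TO_TOOL = {"fix": "fix_bug", "refactor": "refactor_code",
                  "test": "add_tests"}

def tool_index(action_type: str) -> int:
    return checkpoint["tool_to_index"][ACTION_TO_TOOL[action_type]]

print(tool_index("test"))  # 3
```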

When to Use GPUs

The policy head is small (14M params). Training on Apple Silicon MPS takes ~2 minutes. You don’t need cloud GPUs unless:
  • You’re training on 10K+ tuples
  • You want to experiment with larger architectures
  • You’re running parallel training across multiple projects
For most users, MPS or CPU is sufficient.