Generalized from Karpathy's autoresearch. Same loop, any domain.
An AI agent runs an infinite hill-climbing loop: modify → run → measure → keep or revert → repeat. No human in the loop. Wake up to a TSV of completed experiments.
This works for any project where you have:
- A quantifiable metric (one or two scalars)
- A fast feedback loop (seconds to minutes per run)
- A sandboxed file to modify
- An immutable eval harness
LOOP FOREVER:
1. Read current state (code, config, prompt — whatever you're optimizing)
2. Form a hypothesis ("what if I try X?")
3. Edit the target file
4. Git commit (audit trail)
5. Run the experiment (fixed time/resource budget)
6. Extract metrics from output
7. If improved → keep (advance branch); if equal/worse → revert (git reset)
8. Log to results.tsv
9. NEVER STOP — human will interrupt when done
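The nine steps above can be sketched in Python. The agent's tools are injected as callables with illustrative names (`propose_and_edit`, `run_experiment`, etc.); this is a sketch of the control flow, not a fixed API, and a `max_steps` cap stands in for "never stop":

```python
def autoexp(propose_and_edit, run_experiment, extract_metric, revert,
            best, lower_is_better=True, max_steps=50, log=print):
    """Hill-climb: keep an edit only if the metric strictly improves.

    propose_and_edit writes the change and git-commits it;
    revert undoes it (e.g. `git reset --hard HEAD~1`).
    """
    for _ in range(max_steps):                         # budget cap, not "done"
        description = propose_and_edit()               # steps 2-4: hypothesis, edit, commit
        try:
            metric = extract_metric(run_experiment())  # steps 5-6: run, measure
            improved = metric < best if lower_is_better else metric > best
            status = "keep" if improved else "discard" # step 7: ties are discarded
        except Exception:
            metric, status = 0.0, "crash"
        if status == "keep":
            best = metric                              # advance the branch
        else:
            revert()                                   # git reset the failed attempt
        log(f"{metric:.4f}\t{status}\t{description}")  # step 8: results.tsv row
    return best
```

Note the asymmetry in step 7: equal is a revert, not a keep, so the loop never drifts sideways through no-op edits.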
Every autoexp run needs four things defined upfront:
| Component | What it is | Example |
|---|---|---|
| Target file | The ONE file the agent can modify | train.py, prompt.txt, config.yaml |
| Eval harness | Immutable script that produces the metric | evaluate.py, run_bench.sh |
| Metric(s) | 1-2 scalar values, lower or higher = better | val_bpb ↓, accuracy ↑, latency_ms ↓ |
| Budget | Time/cost cap per experiment | 5 min wall clock, max 100 runs, $X total |
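The four components fit in a small spec object the loop can be handed at startup. A minimal sketch with illustrative field names (the document prescribes no particular schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the spec is immutable, like the eval harness
class ExperimentSpec:
    """The four upfront definitions for an autoexp run."""
    target_file: str        # the ONE file the agent may modify
    eval_command: str       # immutable harness that produces the metric
    metric_name: str        # 1-2 scalars; here just the primary
    lower_is_better: bool   # direction of improvement
    minutes_per_run: float  # per-experiment budget
    max_experiments: int    # whole-run budget

spec = ExperimentSpec("train.py", "python evaluate.py", "val_bpb",
                      True, 5.0, 100)
```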
For two metrics (e.g., accuracy ↑ and latency ↓), define a dominance rule:
- Primary/secondary: Improve primary; secondary is a soft constraint (e.g., "accuracy must improve; latency shouldn't 2x")
- Pareto: Keep if better on either metric without regressing on the other
- Weighted: collapse to a single scalar, `score = w1 * metric1 + w2 * metric2`
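The three rules can be written as one decision function. This sketch assumes both metrics have been normalized so lower is better (e.g., error rate rather than accuracy); the helper and its parameters are hypothetical, not part of any prescribed API:

```python
def dominates(new, old, rule="pareto", weights=(1.0, 0.0), secondary_cap=2.0):
    """True if metric pair `new` should replace `old` under the given rule."""
    n1, n2 = new
    o1, o2 = old
    if rule == "primary":
        # primary must strictly improve; secondary must not blow past the cap
        return n1 < o1 and n2 <= o2 * secondary_cap
    if rule == "pareto":
        # better on at least one metric, worse on neither
        return n1 <= o1 and n2 <= o2 and (n1 < o1 or n2 < o2)
    if rule == "weighted":
        # collapse to a single scalar and compare
        w1, w2 = weights
        return w1 * n1 + w2 * n2 < w1 * o1 + w2 * o2
    raise ValueError(f"unknown rule: {rule}")
```

Pick the rule before the run starts; changing it mid-run makes the keep/discard history incoherent.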
Results go in a flat results.tsv. Tab-separated. Simple. Diffable. No infrastructure.
```
commit    metric_1  metric_2  status   description
a1b2c3d   0.9979    44.0      keep     baseline
b2c3d4e   0.9932    44.2      keep     increased learning rate to 0.04
c3d4e5f   1.0050    44.0      discard  switched to GeLU activation
d4e5f6g   0.0000    0.0       crash    doubled model width (OOM)
```
Statuses: keep | discard | crash
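Because the log is plain TSV, post-hoc analysis is a few lines of stdlib Python. A sketch of a hypothetical helper that finds the best surviving run, assuming the header shown above:

```python
import csv

def best_kept(path="results.tsv", lower_is_better=True):
    """Return the best row with status 'keep' from a results TSV."""
    with open(path, newline="") as f:
        rows = [r for r in csv.DictReader(f, delimiter="\t")
                if r["status"] == "keep"]
    pick = min if lower_is_better else max
    return pick(rows, key=lambda r: float(r["metric_1"]))
```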
Always experiment on a dedicated branch:
```bash
git checkout -b autoexp/<tag>   # e.g., autoexp/mar8-prompt-tuning
```

Main stays clean. The branch is your lab notebook. Every commit is a recoverable experiment.
| Domain | Target file | Metric | Eval harness |
|---|---|---|---|
| ML training | train.py | val_loss ↓ | Fixed eval function |
| Prompt engineering | prompt.txt | eval_accuracy ↑ | LLM judge or test suite |
| RAG pipelines | config.yaml | retrieval_precision ↑ | Benchmark query set |
| Compiler/perf tuning | flags.conf | runtime_ms ↓ | Benchmark binary |
| API optimization | handler.py | p99_latency ↓ | Load test script |
| System prompts | system.md | task_score ↑ | Eval suite with rubric |
| CSS/layout | styles.css | lighthouse_score ↑ | Lighthouse CI |
| SQL queries | query.sql | exec_time_ms ↓ | EXPLAIN ANALYZE wrapper |
| Infrastructure | terraform.tf | cost_per_hour ↓ | terraform plan parser |
The constraints are load-bearing, not incidental:
- Single file constraint — prevents the agent from refactoring the universe
- Fixed budget — no runaway experiments
- Git commit per try — perfect audit trail, trivial revert
- TSV logging — zero-infra results tracking
- "NEVER STOP" — the agent is a tireless researcher, not an assistant waiting for permission
Skip this pattern when any of the following apply:
- Subjective quality (UI aesthetics, writing style) — no scalar metric
- Slow feedback (deploy → wait for user traffic → measure) — loop stalls
- Multi-file changes (architectural refactors) — too much surface area
- Safety-critical systems — autonomous modification without human review is a bad idea
Karpathy's version costs GPU-hours. In an LLM-agent context, each iteration costs API tokens. Add a budget cap:
```yaml
max_experiments: 50       # hard stop
max_cost_usd: 10.00       # estimated from token usage
max_wall_clock_hours: 8   # total run time
```
The agent should log cumulative cost in the TSV or a separate budget.log.
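One way to enforce those caps is a small tracker the loop consults before each experiment. The field names mirror the config keys above, but the class itself is an illustrative sketch:

```python
import time

class Budget:
    """Hard stops for an autoexp run: count, dollars, and wall clock."""
    def __init__(self, max_experiments=50, max_cost_usd=10.0,
                 max_wall_clock_hours=8.0):
        self.max_experiments = max_experiments
        self.max_cost_usd = max_cost_usd
        self.deadline = time.monotonic() + max_wall_clock_hours * 3600
        self.experiments = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd):
        """Record one completed experiment and its estimated token cost."""
        self.experiments += 1
        self.cost_usd += cost_usd

    def exhausted(self):
        """True once any single cap is hit; the loop should then halt."""
        return (self.experiments >= self.max_experiments
                or self.cost_usd >= self.max_cost_usd
                or time.monotonic() >= self.deadline)
```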
If each experiment takes ~5 minutes, you get ~12/hour. Over an 8-hour sleep, that's ~100 experiments. You wake up to a results table your agent built while you dreamed.
The human's job: define the metric, set the constraints, review the results. The agent's job: relentlessly explore the space.
Inspired by @karpathy/autoresearch. Generalized for any quantifiable optimization problem.