@adhishthite
Created March 7, 2026 21:28

autoexp — Autonomous Experimentation Loop

Generalized from Karpathy's autoresearch. Same loop, any domain.


The Idea

An AI agent runs an infinite hill-climbing loop: modify → run → measure → keep or revert → repeat. No human in the loop. Wake up to a TSV of completed experiments.

This works for any project where you have:

  1. A quantifiable metric (one or two scalars)
  2. A fast feedback loop (seconds to minutes per run)
  3. A sandboxed file to modify
  4. An immutable eval harness

The Pattern

LOOP FOREVER:
  1. Read current state (code, config, prompt — whatever you're optimizing)
  2. Form a hypothesis ("what if I try X?")
  3. Edit the target file
  4. Git commit (audit trail)
  5. Run the experiment (fixed time/resource budget)
  6. Extract metrics from output
  7. If improved → keep (advance branch)
     If equal/worse → revert (git reset)
  8. Log to results.tsv
  9. NEVER STOP — human will interrupt when done
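The loop above can be sketched in Python. This is a toy, not the agent harness: `evaluate` and `mutate` are stand-ins for the eval harness and the agent's file edit, and a finite budget plus an in-memory copy replace the git machinery.

```python
import random

def hill_climb(evaluate, mutate, state, max_experiments=50, lower_is_better=True):
    """Generic keep-or-revert loop: measure a mutated copy, keep only improvements."""
    best_metric = evaluate(state)
    log = [("baseline", best_metric, "keep")]
    for i in range(max_experiments):            # budget cap stands in for NEVER STOP
        candidate = mutate(dict(state))         # step 3: edit (a copy of) the target
        try:
            metric = evaluate(candidate)        # steps 5-6: run, extract metric
        except Exception:
            log.append((f"exp{i}", 0.0, "crash"))
            continue                            # crash -> revert by keeping old state
        improved = metric < best_metric if lower_is_better else metric > best_metric
        if improved:
            state, best_metric = candidate, metric       # step 7: keep (advance)
            log.append((f"exp{i}", metric, "keep"))
        else:
            log.append((f"exp{i}", metric, "discard"))   # step 7: revert
    return state, best_metric, log

# Toy target: minimize (x - 3)^2 by nudging one "config" value at random.
random.seed(0)
evaluate = lambda s: (s["x"] - 3.0) ** 2
mutate = lambda s: {**s, "x": s["x"] + random.uniform(-1, 1)}
final_state, best, log = hill_climb(evaluate, mutate, {"x": 0.0})
```

In the real setup, `mutate` is the agent editing the target file plus a `git commit`, `evaluate` is the immutable harness, and the discard branch is a `git reset --hard HEAD~1`.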

Setup Contract

Every autoexp run needs four things defined upfront:

| Component | What it is | Example |
| --- | --- | --- |
| Target file | The ONE file the agent can modify | `train.py`, `prompt.txt`, `config.yaml` |
| Eval harness | Immutable script that produces the metric | `evaluate.py`, `run_bench.sh` |
| Metric(s) | 1–2 scalar values, lower or higher = better | `val_bpb` ↓, `accuracy` ↑, `latency_ms` ↓ |
| Budget | Time/cost cap per experiment | 5 min wall clock, max 100 runs, $X total |

Dual Metric Mode

For two metrics (e.g., accuracy ↑ and latency ↓), define a dominance rule:

  • Primary/secondary: Improve the primary; the secondary is a soft constraint (e.g., "accuracy must improve; latency shouldn't more than double")
  • Pareto: Keep if better on either metric without regressing on the other
  • Weighted: score = w1 * metric1 + w2 * metric2 — collapse to single scalar
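Each dominance rule is a small predicate. A sketch, assuming metric 1 is higher-is-better (accuracy) and metric 2 is lower-is-better (latency); weights and the 2x cap are illustrative:

```python
def primary_secondary(new, old, max_ratio=2.0):
    """Primary (accuracy) must improve; secondary (latency) is a soft
    constraint: it must not blow past max_ratio times the old value."""
    acc_new, lat_new = new
    acc_old, lat_old = old
    return acc_new > acc_old and lat_new <= max_ratio * lat_old

def pareto(new, old):
    """Keep if strictly better on at least one metric, worse on neither."""
    acc_new, lat_new = new
    acc_old, lat_old = old
    no_regression = acc_new >= acc_old and lat_new <= lat_old
    strictly_better = acc_new > acc_old or lat_new < lat_old
    return no_regression and strictly_better

def weighted(new, old, w1=1.0, w2=-0.01):
    """Collapse to one scalar; w2 is negative because lower latency is better."""
    score = lambda m: w1 * m[0] + w2 * m[1]
    return score(new) > score(old)
```

Metrics are `(accuracy, latency_ms)` tuples; each function returns the keep/discard decision for one experiment.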

Results Logging

Tab-separated. Simple. Diffable. No infrastructure.

commit	metric_1	metric_2	status	description
a1b2c3d	0.9979	44.0	keep	baseline
b2c3d4e	0.9932	44.2	keep	increased learning rate to 0.04
c3d4e5f	1.0050	44.0	discard	switched to GeLU activation
d4e5f6g	0.0000	0.0	crash	doubled model width (OOM)

Statuses: keep | discard | crash
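Appending a row needs nothing beyond the stdlib. A sketch (the commit hash is assumed to come from `git rev-parse --short HEAD`; the column order and formats follow the example above):

```python
import csv

def log_result(path, commit, metric_1, metric_2, status, description):
    """Append one experiment row to the TSV results file."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [commit, f"{metric_1:.4f}", f"{metric_2:.1f}", status, description]
        )

log_result("results.tsv", "a1b2c3d", 0.9979, 44.0, "keep", "baseline")
```

Because the file is append-only and tab-separated, every run is diffable and greppable with zero infrastructure.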

Branch Isolation

Always experiment on a dedicated branch:

git checkout -b autoexp/<tag>  # e.g., autoexp/mar8-prompt-tuning

Main stays clean. The branch is your lab notebook. Every commit is a recoverable experiment.

Where This Applies

| Domain | Target file | Metric | Eval harness |
| --- | --- | --- | --- |
| ML training | `train.py` | `val_loss` ↓ | Fixed eval function |
| Prompt engineering | `prompt.txt` | `eval_accuracy` ↑ | LLM judge or test suite |
| RAG pipelines | `config.yaml` | `retrieval_precision` ↑ | Benchmark query set |
| Compiler/perf tuning | `flags.conf` | `runtime_ms` ↓ | Benchmark binary |
| API optimization | `handler.py` | `p99_latency` ↓ | Load test script |
| System prompts | `system.md` | `task_score` ↑ | Eval suite with rubric |
| CSS/layout | `styles.css` | `lighthouse_score` ↑ | Lighthouse CI |
| SQL queries | `query.sql` | `exec_time_ms` ↓ | EXPLAIN ANALYZE wrapper |
| Infrastructure | `terraform.tf` | `cost_per_hour` ↓ | `terraform plan` parser |

What Makes This Work

  • Single file constraint — prevents the agent from refactoring the universe
  • Fixed budget — no runaway experiments
  • Git commit per try — perfect audit trail, trivial revert
  • TSV logging — zero-infra results tracking
  • "NEVER STOP" — the agent is a tireless researcher, not an assistant waiting for permission

What This Doesn't Work For

  • Subjective quality (UI aesthetics, writing style) — no scalar metric
  • Slow feedback (deploy → wait for user traffic → measure) — loop stalls
  • Multi-file changes (architectural refactors) — too much surface area
  • Safety-critical systems — autonomous modification without human review is a bad idea

Cost Awareness

Karpathy's version costs GPU-hours. In an LLM-agent context, each iteration costs API tokens. Add a budget cap:

max_experiments: 50        # hard stop
max_cost_usd: 10.00        # estimated from token usage
max_wall_clock_hours: 8    # total run time

The agent should log cumulative cost in the TSV or a separate budget.log.
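The caps above can be enforced with a small guard object checked once per iteration. A sketch; per-run cost estimation is an assumption you would replace with real token accounting:

```python
import time

class BudgetGuard:
    """Stop the loop when any cap is hit; mirrors the YAML caps above."""

    def __init__(self, max_experiments=50, max_cost_usd=10.0, max_wall_clock_hours=8.0):
        self.max_experiments = max_experiments
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_wall_clock_hours * 3600
        self.runs = 0
        self.cost = 0.0
        self.start = time.monotonic()

    def charge(self, run_cost_usd):
        """Record one completed experiment and its estimated cost."""
        self.runs += 1
        self.cost += run_cost_usd

    def exhausted(self):
        """True once any of the three caps is reached."""
        return (self.runs >= self.max_experiments
                or self.cost >= self.max_cost_usd
                or time.monotonic() - self.start >= self.max_seconds)
```

The loop condition then becomes `while not guard.exhausted():` instead of a literal forever, which gives the "never stop" behavior a hard safety floor.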

The Philosophy

If each experiment takes ~5 minutes, you get ~12/hour. Over an 8-hour sleep, that's ~100 experiments. You wake up to a results table your agent built while you dreamed.

The human's job: define the metric, set the constraints, review the results. The agent's job: relentlessly explore the space.


Inspired by @karpathy/autoresearch. Generalized for any quantifiable optimization problem.
