@adhishthite
Created March 7, 2026 21:28

autoexp — Autonomous Experimentation Loop

Generalized from Karpathy's autoresearch. Same loop, any domain.


The Idea

An AI agent runs an infinite hill-climbing loop: modify → run → measure → keep or revert → repeat. No human in the loop. Wake up to a TSV of completed experiments.

This works for any project where you have:

  1. A quantifiable metric (one or two scalars)
  2. A fast feedback loop (seconds to minutes per run)
  3. A sandboxed file to modify
  4. An immutable eval harness

The Pattern

LOOP FOREVER:
  1. Read current state (code, config, prompt — whatever you're optimizing)
  2. Form a hypothesis ("what if I try X?")
  3. Edit the target file
  4. Git commit (audit trail)
  5. Run the experiment (fixed time/resource budget)
  6. Extract metrics from output
  7. If improved → keep (advance branch)
     If equal/worse → revert (git reset)
  8. Log to results.tsv
  9. NEVER STOP — human will interrupt when done
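The loop above can be sketched in Python. This is a toy, not the agent harness: `evaluate` and `mutate` are stand-ins for the eval harness and the agent's file edit, and a finite budget plus an in-memory copy replace the git machinery.

```python
import random

def hill_climb(evaluate, mutate, state, max_experiments=50, lower_is_better=True):
    """Generic keep-or-revert loop: measure a mutated copy, keep only improvements."""
    best_metric = evaluate(state)
    log = [("baseline", best_metric, "keep")]
    for i in range(max_experiments):            # budget cap stands in for NEVER STOP
        candidate = mutate(dict(state))         # step 3: edit (a copy of) the target
        try:
            metric = evaluate(candidate)        # steps 5-6: run, extract metric
        except Exception:
            log.append((f"exp{i}", 0.0, "crash"))
            continue                            # crash -> revert by keeping old state
        improved = metric < best_metric if lower_is_better else metric > best_metric
        if improved:
            state, best_metric = candidate, metric       # step 7: keep (advance)
            log.append((f"exp{i}", metric, "keep"))
        else:
            log.append((f"exp{i}", metric, "discard"))   # step 7: revert
    return state, best_metric, log

# Toy target: minimize (x - 3)^2 by nudging one "config" value at random.
random.seed(0)
evaluate = lambda s: (s["x"] - 3.0) ** 2
mutate = lambda s: {**s, "x": s["x"] + random.uniform(-1, 1)}
final_state, best, log = hill_climb(evaluate, mutate, {"x": 0.0})
```

In the real setup, `mutate` is the agent editing the target file plus a `git commit`, `evaluate` is the immutable harness, and the discard branch is a `git reset --hard HEAD~1`.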

Setup Contract

Every autoexp run needs four things defined upfront:

| Component | What it is | Example |
| --- | --- | --- |
| Target file | The ONE file the agent can modify | `train.py`, `prompt.txt`, `config.yaml` |
| Eval harness | Immutable script that produces the metric | `evaluate.py`, `run_bench.sh` |
| Metric(s) | 1–2 scalar values, lower or higher = better | `val_bpb` ↓, `accuracy` ↑, `latency_ms` ↓ |
| Budget | Time/cost cap per experiment | 5 min wall clock, max 100 runs, $X total |

Dual Metric Mode

For two metrics (e.g., accuracy ↑ and latency ↓), define a dominance rule:

  • Primary/secondary: Improve the primary; the secondary is a soft constraint (e.g., "accuracy must improve; latency shouldn't more than double")
  • Pareto: Keep if better on either metric without regressing on the other
  • Weighted: score = w1 * metric1 + w2 * metric2 — collapse to single scalar
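Each dominance rule is a small predicate. A sketch, assuming metric 1 is higher-is-better (accuracy) and metric 2 is lower-is-better (latency); weights and the 2x cap are illustrative:

```python
def primary_secondary(new, old, max_ratio=2.0):
    """Primary (accuracy) must improve; secondary (latency) is a soft
    constraint: it must not blow past max_ratio times the old value."""
    acc_new, lat_new = new
    acc_old, lat_old = old
    return acc_new > acc_old and lat_new <= max_ratio * lat_old

def pareto(new, old):
    """Keep if strictly better on at least one metric, worse on neither."""
    acc_new, lat_new = new
    acc_old, lat_old = old
    no_regression = acc_new >= acc_old and lat_new <= lat_old
    strictly_better = acc_new > acc_old or lat_new < lat_old
    return no_regression and strictly_better

def weighted(new, old, w1=1.0, w2=-0.01):
    """Collapse to one scalar; w2 is negative because lower latency is better."""
    score = lambda m: w1 * m[0] + w2 * m[1]
    return score(new) > score(old)
```

Metrics are `(accuracy, latency_ms)` tuples; each function returns the keep/discard decision for one experiment.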

Results Logging

Tab-separated. Simple. Diffable. No infrastructure.

commit	metric_1	metric_2	status	description
a1b2c3d	0.9979	44.0	keep	baseline
b2c3d4e	0.9932	44.2	keep	increased learning rate to 0.04
c3d4e5f	1.0050	44.0	discard	switched to GeLU activation
d4e5f6g	0.0000	0.0	crash	doubled model width (OOM)

Statuses: keep | discard | crash
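Appending a row needs nothing beyond the stdlib. A sketch (the commit hash is assumed to come from `git rev-parse --short HEAD`; the column order and formats follow the example above):

```python
import csv

def log_result(path, commit, metric_1, metric_2, status, description):
    """Append one experiment row to the TSV results file."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [commit, f"{metric_1:.4f}", f"{metric_2:.1f}", status, description]
        )

log_result("results.tsv", "a1b2c3d", 0.9979, 44.0, "keep", "baseline")
```

Because the file is append-only and tab-separated, every run is diffable and greppable with zero infrastructure.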

Branch Isolation

Always experiment on a dedicated branch:

git checkout -b autoexp/<tag>  # e.g., autoexp/mar8-prompt-tuning

Main stays clean. The branch is your lab notebook. Every commit is a recoverable experiment.

Where This Applies

| Domain | Target file | Metric | Eval harness |
| --- | --- | --- | --- |
| ML training | `train.py` | `val_loss` ↓ | Fixed eval function |
| Prompt engineering | `prompt.txt` | `eval_accuracy` ↑ | LLM judge or test suite |
| RAG pipelines | `config.yaml` | `retrieval_precision` ↑ | Benchmark query set |
| Compiler/perf tuning | `flags.conf` | `runtime_ms` ↓ | Benchmark binary |
| API optimization | `handler.py` | `p99_latency` ↓ | Load test script |
| System prompts | `system.md` | `task_score` ↑ | Eval suite with rubric |
| CSS/layout | `styles.css` | `lighthouse_score` ↑ | Lighthouse CI |
| SQL queries | `query.sql` | `exec_time_ms` ↓ | EXPLAIN ANALYZE wrapper |
| Infrastructure | `terraform.tf` | `cost_per_hour` ↓ | `terraform plan` parser |

What Makes This Work

  • Single file constraint — prevents the agent from refactoring the universe
  • Fixed budget — no runaway experiments
  • Git commit per try — perfect audit trail, trivial revert
  • TSV logging — zero-infra results tracking
  • "NEVER STOP" — the agent is a tireless researcher, not an assistant waiting for permission

What This Doesn't Work For

  • Subjective quality (UI aesthetics, writing style) — no scalar metric
  • Slow feedback (deploy → wait for user traffic → measure) — loop stalls
  • Multi-file changes (architectural refactors) — too much surface area
  • Safety-critical systems — autonomous modification without human review is a bad idea

Cost Awareness

Karpathy's version costs GPU-hours. In an LLM-agent context, each iteration costs API tokens. Add a budget cap:

max_experiments: 50        # hard stop
max_cost_usd: 10.00        # estimated from token usage
max_wall_clock_hours: 8    # total run time

The agent should log cumulative cost in the TSV or a separate budget.log.
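The caps above can be enforced with a small guard object checked once per iteration. A sketch; per-run cost estimation is an assumption you would replace with real token accounting:

```python
import time

class BudgetGuard:
    """Stop the loop when any cap is hit; mirrors the YAML caps above."""

    def __init__(self, max_experiments=50, max_cost_usd=10.0, max_wall_clock_hours=8.0):
        self.max_experiments = max_experiments
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_wall_clock_hours * 3600
        self.runs = 0
        self.cost = 0.0
        self.start = time.monotonic()

    def charge(self, run_cost_usd):
        """Record one completed experiment and its estimated cost."""
        self.runs += 1
        self.cost += run_cost_usd

    def exhausted(self):
        """True once any of the three caps is reached."""
        return (self.runs >= self.max_experiments
                or self.cost >= self.max_cost_usd
                or time.monotonic() - self.start >= self.max_seconds)
```

The loop condition then becomes `while not guard.exhausted():` instead of a literal forever, which gives the "never stop" behavior a hard safety floor.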

The Philosophy

If each experiment takes ~5 minutes, you get ~12/hour. Over an 8-hour sleep, that's ~100 experiments. You wake up to a results table your agent built while you dreamed.

The human's job: define the metric, set the constraints, review the results. The agent's job: relentlessly explore the space.


Inspired by @karpathy/autoresearch. Generalized for any quantifiable optimization problem.
