Autoresearch
From charlesreid1
What is autoresearch?
Autoresearch is an idea originated by Andrej Karpathy. The core idea is to create an agent loop in which the agent repeatedly tries to improve a program (that is, to improve the program's objective function). You give the agent a bite-sized program and a prompt telling it how to experiment on that program, and you put a fixed time limit on how long the program can run.
Example
Andrej's original application of this idea was to a neural network. He set a wall-time limit of 5 minutes and let the autoresearch loop run overnight, allowing about 100 experiments to be run. Over the course of those 100 experiments, the agent was able to find several techniques to improve the performance of the neural network.
Link: https://myoid.com/karpathy-autoresearch-autonomous-experiments/
More Examples
Autoresearch/Example Applications
Procedure
The basic procedure for creating an autoresearch repository is as follows:
- repo root: this is the meta-layer. contains a readme with a description of how it all works, gitignore, maybe the initial prompt or any other follow-up prompts, or notes.
- autoresearch/ directory: this is a "clean" directory where the only files in the directory are the ones the autoresearch bot needs to see. this makes it very straightforward to sandbox the autoresearcher.
contents of the autoresearch/ directory:
- bench.py: benchmark script, runs the benchmarks and prints the metrics (this is usually where you set a wall time limit)
- program.py (this is the thing the bot will tinker with.)
- program.md (this is where you give the autoresearch bot instructions and rules, set its objectives, tell it what it can and can't do, provide it with specialist knowledge and hints)
- requirements.txt (packages to install. this is not something the autoresearcher should be able to change.)
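This layout can be scaffolded in a few lines of Python. A sketch, with placeholder file contents (the directory name and file bodies below are illustrative, not prescribed):

```python
from pathlib import Path

def scaffold(root: str) -> None:
    """Create the skeleton of an autoresearch repository."""
    repo = Path(root)
    (repo / "autoresearch").mkdir(parents=True, exist_ok=True)

    # Meta-layer: notes for humans, not for the bot.
    (repo / "README.md").write_text("# Autoresearch repo\n")
    (repo / ".gitignore").write_text("run.log\nresults.tsv\n")

    # Clean directory: only the files the bot needs to see.
    sandbox = repo / "autoresearch"
    (sandbox / "bench.py").write_text("# benchmark harness (do not let the bot edit)\n")
    (sandbox / "program.py").write_text("# the file the bot tinkers with\n")
    (sandbox / "program.md").write_text("# Autoresearch\n")
    (sandbox / "requirements.txt").write_text("numpy\n")

scaffold("my-autoresearch-repo")
```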
bench.py
Runs program.py for a fixed wall-clock duration and validates correctness at checkpoints.
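The shape of such a harness can be sketched as follows. The `step` function and checkpoint values here are hypothetical stand-ins for whatever program.py actually exposes:

```python
import time

# Hypothetical stand-in for the function program.py exposes.
def step(state):
    return state + 1

# Expected values at checkpoint generations (stand-in data).
CHECKPOINTS = {1: 1, 5: 5, 100: 100}

def run_bench(wall_limit_s: float = 180.0):
    state, gen = 0, 0
    start = time.perf_counter()
    while time.perf_counter() - start < wall_limit_s:
        state = step(state)
        gen += 1
        # Correctness gate: divergence at a checkpoint is a hard failure.
        if gen in CHECKPOINTS and state != CHECKPOINTS[gen]:
            raise SystemExit(f"FAIL: checkpoint mismatch at generation {gen}")
    elapsed = time.perf_counter() - start
    print(f"total_generations: {gen}")
    print(f"total_walltime: {elapsed:.1f}")
    print(f"generations_per_s: {gen / elapsed:.2f}")

run_bench(wall_limit_s=0.05)  # tiny budget for illustration
```

The point is that the wall-clock limit and the correctness gate both live in the harness, outside the agent's reach.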
program.py
Think of this as the heart of the operation. This file contains the code that the autoresearch loop actually tinkers with. Think of it as the library that defines the general operation, and think of bench.py as the driver that feeds specific inputs to the library.
program.md
Arguably the most important file in an autoresearch loop.
The program.md file needs to:
- Include a setup step to ensure the autoresearcher has a branch set up and adds its commits to that branch
- Include instructions about what executable to use (Python, virtual environment, etc.)
- Describe the experiments:
- What is the program's objective function?
- What files are allowed to change? What files are not allowed to change?
- How to run an experiment
- How to obtain the metric measurement from the output/log
- How and where to log each experiment
- The experiment loop
- Running autonomously
- Possible areas of exploration
It's a lot of housekeeping, but as you are writing or tweaking the program prompt, it can be helpful to actually run the autoresearcher, see how it fails, and update the prompt to address that.
Here is an example:
# Autoresearch
You are an autonomous research agent optimizing (description of program.py).
Your goal is to maximize (metric) while preserving (exact correctness?).
## Setup
To set up a new experiment, work with the user to:
1. **Agree on a run tag**: propose a tag based on today's date (e.g. `mar5`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.
2. **Create the branch**: `git checkout -b autoresearch/<tag>` from current master.
3. **Read the in-scope files**: The repo is small. Read these files for full context:
- `bench.py` — benchmark harness with hardcoded test case, correctness checkpoints, and evaluation. Do not modify.
- `program.py` — the file you modify. Contains a toroidal cellular automata class and a run_benchmark entry point.
4. **Initialize results.tsv**: Create `results.tsv` with just the header row. The baseline will be recorded after the first run.
5. **Confirm and go**: Confirm setup looks good.
Once you get confirmation, kick off the experimentation.
## Using Pypy
Use the Python binary at `vpp/bin/python` to run the benchmarking step.
Ensure scipy and numpy are installed; if not, stop and ask the human to install them.
## Experimentation
Each experiment will run a cellular automata simulation (B3/S23 Game of Life, 150 x 240 toroidal grid)
for a fixed time budget of 3 minutes.
### Experiment optimization target
- **Metric:** `generations_per_s` (higher is better) is the number of generations per second
that the simulator sustained over the 3 minute running duration.
- **Hard constraint:** All checkpoint cell counts (team1, team2) must
exactly match the hardcoded checkpoints in bench.py. Any divergence = FAIL,
the change is rejected.
### Experiment rules
**What you CAN do:**
- Modify `program.py` - this is the only file you may edit. You may change:
- Data structures (the sparse row-list representation, dicts, sets, numpy arrays, etc.)
- Algorithms (neighbor counting, dead-neighbor collection, birth/survival logic)
- Toroidal wrapping strategy
- Memory layout and access patterns
- Use of numpy or other pure-Python optimizations
- Caching, precomputation, lookup tables
**What you CANNOT do:**
- Do not modify `bench.py`. It is read-only. It contains the hardcoded test case, correctness checkpoints, and evaluation harness.
- Do not install new packages or add dependencies. You can only use what's already in `pyproject.toml`.
- Do not modify the `run_benchmark()` function signature or return format
- Do not modify the cellular automata rules (B3/S23, Game of Life)
- Do not modify the toroidal boundary conditions or grid initial conditions
- Do not modify the two-color team assignments (majority rule, checkerboard tiebreak)
- Do not modify the victory detection (it must continue to work correctly, although you may optimize its implementation)
### Program output format
Once the script finishes, it will output a summary like this in the log file:
```
---
total_generations: 1000
total_walltime: 180
generations_per_s: 5.56
```
Note that the script is configured to always stop after 3 minutes, so the numbers may vary depending on the computing platform.
You can extract the key metric from the log file:
```
grep "^generations_per_s:" run.log
```
### Logging experiment results
When an experiment is done, log it to `results.tsv` (tab-separated, NOT comma-separated).
The TSV has a header row and 4 columns:
```
commit generations_per_s status description
```
1. git commit hash (short, 7 chars)
2. `generations_per_s` achieved (e.g. 1.234), use 0.000000 for crashes
3. status: `keep`, `discard`, or `crash`
4. short text description of what this experiment tried
### The experiment loop
The experiment runs on a dedicated branch (e.g. `autoresearch/mar5` or `autoresearch/mar5-gpu0`).
LOOP FOREVER:
1. Look at the git state: the current branch/commit we're on
2. Tune `program.py` with an experimental idea by directly hacking the code.
3. git commit
4. Run the experiment/benchmark: `vpp/bin/python bench.py > run.log 2>&1` (redirect everything, do NOT use tee or let output flood your context)
5. Read out the results: `grep "^generations_per_s:" run.log`
6. If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
7. Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git)
8. If `generations_per_s` improved (higher), you "advance" the branch, keeping the git commit
9. If `generations_per_s` did not improve (equal or lower), you git reset back to where you started
### Correctness checkpoints
The benchmark checks cell counts at generations:
0, 1, 2, 3, 4, 5, 60, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 5000
Checkpoints are validated in order from earliest to latest. If your change
breaks the simulation, it will be caught within the first few generations.
This implies:
- A totally wrong algorithm fails at generation 1 (instant feedback)
- An off-by-one or edge case might fail at generation 60 or 100
- Subtle numerical issues might only show at generation 1000+
## Loop
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).
**Timeout**: Each experiment should take ~3 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
**Crashes**: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
**NEVER STOP**: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working *indefinitely* until you are manually stopped. You are autonomous. If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
As an example use case, a user might leave you running while they sleep. If each experiment takes you ~5 minutes then you can run approx 12/hour, for a total of about 100 over the duration of the average human sleep. The user then wakes up to experimental results, all completed by you while they slept!
## Addendum
### Known bottlenecks and optimization ideas
The current implementation has several performance characteristics worth
investigating:
1. **Sparse row-list representation** — State is stored as `[[y, x1, x2, ...], ...]`.
Every `add_cell` call does a linear scan. A dict-of-sets or 2D numpy array
could be faster for lookup-heavy operations.
2. **Neighbor counting** — `get_neighbors_from_alive` and `get_color_from_alive`
do linear scans through the state lists to find adjacent rows. With a
dict-based structure, neighbor lookups become O(1).
3. **String key construction** — Dead neighbors are tracked via string keys
like `"x,y"`. Tuple keys `(x, y)` avoid string allocation and parsing.
4. **Redundant toroidal wrapping** — Modulo arithmetic is applied repeatedly
in many methods. Precomputing wrapped coordinates or using a dense grid
eliminates this overhead.
5. **get_cell_color linear scan** — Called once per neighbor per live cell.
With a color lookup dict or array, this becomes O(1).
6. **Dense grid approach** — For a 150×240 grid, a pair of numpy arrays
(one per team) with convolution-based neighbor counting could be
dramatically faster than the sparse approach.
### Simplicity criterion
Prefer simple changes with clear wins. A 10-line refactor that gives 2x
speedup is better than a 200-line rewrite that gives 2.1x. If a complex
change produces only marginal improvement, discard it and try something
else.
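Stepping outside the example prompt for a moment: the dense-grid idea in its addendum (item 6) can be sketched concretely. A convolution-based neighbor count for a toroidal B3/S23 grid using scipy might look like this (a single-color version; the two-team coloring in the example would need extra bookkeeping):

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 kernel that sums the 8 neighbors of each cell.
KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])

def life_step(grid: np.ndarray) -> np.ndarray:
    """One B3/S23 generation on a toroidal grid (boundary='wrap')."""
    neighbors = convolve2d(grid, KERNEL, mode="same", boundary="wrap")
    born = (grid == 0) & (neighbors == 3)
    survives = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survives).astype(np.uint8)

# A glider on a small toroidal grid.
grid = np.zeros((8, 8), dtype=np.uint8)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = life_step(grid)
print(int(grid.sum()))  # a glider keeps exactly 5 live cells
```

The whole generation is a handful of vectorized array operations, which is exactly the kind of "escape Python" win a dense representation makes possible.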
Adapting Autoresearch
Autoresearch Strengths and Weaknesses
A paper by Trehan and Chopra cites six failure modes for autonomous experiments:
- bias toward training-data defaults
- implementation drift under execution pressure
- memory degradation across long-time-horizon tasks
- overexcitement (ending too early)
- insufficient domain intelligence
- weak scientific taste in experimental design
What autoresearch is good at:
- AI/ML and CS research - AI training data has heavy coverage of these topics, so agents start with a supercharged understanding
- experiments in this space also have properties that make them convenient for this type of loop (fast feedback, single scalar metrics, code-is-the-experiment)
What autoresearch is not good at:
- Low-repetition experiments - everything is one-off, everything requires customization
- Systems with slow feedback (hours or days)
- Systems where it is difficult to define "good", because it is subjective or depends on many things
- (Example: instead of Karpathy's scalar val_bpb metric that is computed quickly at the end of the loop, a domain problem in biology or chemistry might require a free energy estimate that takes hours to compute, requires statistical analysis to interpret and grade, and depends entirely on the experimental data set)
Adapting autoresearch to research questions isn't about finding mitigations for each of the six failure modes. It's about redesigning the feedback loop for domains where the problems of interest give slower, more expensive, or more ambiguous feedback. Not trying to replicate the full scientific research loop - just subproblems.
The most promising applications are going to be subproblems that do have the fast scalar metrics and tighter loops:
- optimizing sequence alignment parameters against known benchmarks
- tuning feature engineering for property prediction
- searching hyperparameter spaces for model improvements
Strategies for finding subproblems: 6 steps
In the previous section we mentioned that the most promising applications of autoresearch are subproblems with the right "shape". Let's talk about how to help scientists find problems with the right shape in their work.
Step 1: Map the full workflow, not the science question.
The scientist will almost always start any description of their work with the research question. What you really want to know is the workflow. Figure out what the concrete computational steps are between (raw data) and (paper/report).
The flow might look something like (data cleaning) -> (alignment) -> (filtering) -> (feature extraction) -> (model fitting) -> viz -> (interpretation) -> (writing). Each of those is a candidate subproblem. Scientists might collapse the boring parts, or skip over them in the description. Do the opposite - expand the boring parts back out.
Step 2: Score each step on three dimensions.
Score each step of the workflow on three dimensions:
- how repetitive is it
- how fast is the feedback
- how well-defined is good
We mentioned these criteria in the prior section; here you grade each step of the workflow and look for the steps with the highest scores, since those are the best autoresearch candidates. The lowest-scoring steps are where the scientist's own judgment remains most valuable.
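One lightweight way to run this scoring exercise in practice; the workflow steps and every score below are made-up illustrations:

```python
# Score each workflow step 1-5 on the three dimensions; higher totals
# are better autoresearch candidates. All numbers are illustrative.
steps = {
    "data cleaning":  {"repetitive": 5, "fast_feedback": 4, "well_defined": 3},
    "alignment":      {"repetitive": 4, "fast_feedback": 2, "well_defined": 4},
    "model fitting":  {"repetitive": 3, "fast_feedback": 2, "well_defined": 4},
    "interpretation": {"repetitive": 1, "fast_feedback": 1, "well_defined": 1},
}

ranked = sorted(steps.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
for name, scores in ranked:
    print(f"{sum(scores.values()):2d}  {name}")
```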
Step 3: Learn the groan points
This is a good tactic for surfacing problems that can be fixed in general, but also for surfacing subproblems where autoresearch might be able to help.
Listen for complaints like "spent two days manually doing X" or "the most annoying part is Y" or "I did a grid search" or "I eyeballed it to see if Z". These map to classification/filtering problems, optimization problems, automated QC problems, pipeline automation problems.
These are good problems because they are almost always high-repetition, they are tasks people hate doing, and they don't threaten anyone's contributions. Some of them probably won't need autoresearch to solve, but this is still a useful way to surface autoresearch-shaped problems.
Step 4: 100x scenario
Another strategy for uncovering problems where autoresearch can help is to think about which steps in a workflow are currently done manually, and whether those would be feasible if throughput were to increase 100x. Again, some steps probably won't need autoresearch to unblock or to automate, but it can be a useful way to surface problems to apply autoresearch to.
Step 5: Propose the subproblem, not the autoresearch solution
This is more about how you get researchers onboard. Rather than coming to the researcher and saying "I want to use autoresearch to solve your XYZ problem," instead, describe the subproblem/pain point, describe what aspects of the problem make it a good autoresearch problem, and ask if it's a bottleneck worth removing. If they agree it's a real bottleneck, then move on to discussing the autoresearch loop.
Step 6: Build a problem inventory
Shared spreadsheet or running document with:
- subproblem description
- what project it supports
- repetitiveness score
- feedback speed
- metric clarity
- estimated time savings
- status
Review monthly with the team. This makes contributions visible, gives scientists a structure to start their requests with, and accumulates institutional knowledge.
Example subproblems
Computational biology:
- data wrangling and format conversion (something nearly every team has to deal with)
- parameter sensitivity analysis ("how much do results change if I tweak this threshold?" is a natural autoresearch loop; the scientist can define a metric and have the agent sweep the space)
- benchmark reproduction (trying to reproduce a benchmark from a paper on different data - well-defined, clear target metric, and tedious)
- visualization iteration (generating 20 variations of a figure to find the right representation of complex data. fast feedback, clear preference signal, high repetition.)
- literature-grounded hypothesis filtering (given a list of 500 candidate genes/compounds/targets, which ones have prior evidence supporting investigation? RAG over lit databases)
Counter "this problem is too hard" claims with "you're right, the problem is way beyond what AI can do end-to-end. But what about step X, or step Y, are those bottlenecks?"
Tips for solving your subproblem with autoresearch: 8 observations
Once you find a problem with the right shape for autoresearch, here are some things to keep in mind:
Observation 1: Define what the agent is and is not allowed to change.
This is something covered in the program.md document example above, but it also requires nuance. Asking the agent to "optimize" a program can have a range of meanings. On one end, modifying parameters within an existing implementation; on the other, rewriting the entire algorithm, using different data structures, or major refactors.
It can be useful to run separate autoresearch loops, one where you limit the changes that can be made and look for micro-optimizations, and another where you allow more drastic changes and find macro-optimizations.
Observation 2: Correctness is a silent risk.
If your agent is optimizing a metric around speed, it may achieve a major speedup by making the code return an incorrect result. If correctness is not enforced, the "optimizations" the agent finds will be useless. The benchmark script needs to include a correctness gate along with the speed measurement.
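One way to structure the gate is to check the agent's code against a slow but trusted reference on fixed inputs before any timing happens. A sketch with hypothetical function names (`reference_impl`, `optimized_impl`):

```python
def reference_impl(xs):
    """Slow but trusted implementation."""
    return sorted(xs)

def optimized_impl(xs):
    """The implementation the agent is tinkering with."""
    return sorted(xs)  # placeholder: imagine the agent's rewrite here

def gated_benchmark(inputs):
    # Correctness gate first: any mismatch fails the run outright,
    # so a fast-but-wrong "optimization" can never score.
    for xs in inputs:
        if optimized_impl(xs) != reference_impl(xs):
            raise SystemExit("FAIL: correctness gate")
    # ... only after passing does the timed benchmark run and report the metric.
    print("PASS: correctness gate")

gated_benchmark([[3, 1, 2], [], [5, 5, 4]])
```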
Observation 3: Choose benchmark workload carefully, then diversify.
Many computational problems have costs that depend on the inputs (dense vs. sparse, for example). If you use only one input data set, the agent will optimize the code for that data set, potentially at the cost of poor performance on other inputs.
If possible, the benchmark should include multiple inputs. If not, optimizations should be tested against a robust set of test inputs that will uncover any performance issues. You can also run separate autoresearch loops for different types of inputs.
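When multiple inputs are feasible, aggregate the metric so one easy case can't dominate. A sketch using a geometric mean over hypothetical workloads (`bench_one` and its numbers are made up):

```python
import math

def bench_one(workload: str) -> float:
    """Hypothetical per-input benchmark; returns generations/s."""
    return {"dense": 120.0, "sparse": 30.0, "mixed": 60.0}[workload]

def overall_metric(workloads):
    scores = [bench_one(w) for w in workloads]
    # Geometric mean rewards balanced improvements across inputs;
    # min(scores) would instead reward worst-case performance.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(overall_metric(["dense", "sparse", "mixed"]), 1))
```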
Observation 4: Wall time measurements need care.
Beware of warm-up effects, OS background/scheduling noise, thermal throttling, JIT compilation, and memory-allocation amortization. These can all affect metrics that rely on wall time. The benchmark script can account for this in different ways when computing the final metric: discarding the first N generations, taking the median of multiple runs, reporting mean and variance, etc. An agent fed noisy measurements can end up chasing noise and wasting time on imaginary optimizations. Give the agent a clean signal.
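A sketch of one mitigation: run the workload several times, discard warm-up runs, and report the median rather than a single wall-time sample:

```python
import statistics
import time

def timed(fn, *, warmup: int = 1, reps: int = 5) -> float:
    """Median wall time of fn() over reps runs, after warmup runs."""
    for _ in range(warmup):
        fn()  # discarded: caches, allocators, and JITs settle here
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # Median is robust to one-off scheduling spikes; reporting the
    # spread as well helps the agent tell signal from noise.
    return statistics.median(samples)

workload = lambda: sum(i * i for i in range(10_000))
print(f"median: {timed(workload):.6f} s")
```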
Observation 5: Track what you are learning.
Karpathy's autoresearch idea uses git commits as the experiment record. But you might also want to include a record of why things worked. After 100 experiments, you want to be able to answer the question: are the gains from algorithm changes, from better memory layout, from parallelization strategy, or from something else?
You can update program.md to instruct agents to log a one-line hypothesis before each experiment and a one-line explanation after, then cluster successful experiments by category to understand which class of experiments had the most impact.
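A sketch of what the extended log row might look like; the hypothesis and explanation columns (and all the values below) are made up:

```python
import csv

def log_experiment(path, commit, metric, status, hypothesis, explanation):
    """Append one experiment row to a tab-separated log."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([commit, f"{metric:.6f}", status, hypothesis, explanation])

log_experiment("results.tsv", "a1b2c3d", 6.21, "keep",
               "replace string keys with tuple keys",
               "avoided string alloc/parse in the hot loop")
```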
Observation 6: Watch for plateau, and plan for it
The low-hanging fruit (obvious parallelization, cache friendliness, compiler flag tuning) will probably be exhausted within 20-30 experiments, and then gains will start to fall off. That's normal.
Have a plan for when the agent gets there. Let the agent expand the search space (make bigger architectural changes), change constraints (different grid sizes, different rules), or even "declare victory" on a subproblem and move on.
(WORST OUTCOME: burning through API tokens running 500 experiments to chase a 0.0001% improvement.)
Observation 7: Language and toolchain matter
The library's language matters, because it affects where the big wins come from. Python's big wins come from escaping Python (numpy, cython, vectorization). C's big wins come from memory layout, SIMD, and cache behavior. Rust's big wins come from zero-cost abstractions and avoiding unnecessary allocations.
Observation 8: 80/20 benchmark/instructions.
Rule of thumb: spend 80% of your setup time on the benchmark suite, 20% on the agent instructions.