From charlesreid1

This page contains some examples of problems solved/lessons learned with Autoresearch.

Not all problems are a good fit for autoresearch. We cover some of those attempts too.

Game of Life

Repo: https://github.com/golly-splorts/autoresearch-gollyx-python

This was my first experience running an autoresearch loop.

I relied pretty heavily on Karpathy's autoresearch program.md template, with similar guardrails (no installing new packages, keep the experimentation focused), but updated it to focus exclusively on optimizing performance for cellular automata.

The autoresearch loop would modify a core program.py, but run a benchmark in bench.py (which imported the program). That structure allowed the benchmark to do some fast checks early on in the simulation, and stop if an experiment broke the code or started giving wrong answers. I also gave it a hard stop at 3 minutes. The metric I had autoresearch optimize was number of generations per second.
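The harness shape described above can be sketched roughly like this. This is a minimal sketch with made-up names (run_benchmark, check_invariant, and the toy step function are all illustrative); the actual program.py and bench.py in the repo are more involved:

```python
import time

# Hypothetical sketch of the bench.py structure described above:
# run the simulation, sanity-check early generations, enforce a hard
# wall-time stop, and report generations per second (the metric).

def run_benchmark(step, make_initial_state, check_invariant,
                  max_seconds=180, max_generations=1000):
    """Return generations per second for the given step function."""
    state = make_initial_state()
    generations = 0
    start = time.perf_counter()
    while time.perf_counter() - start < max_seconds:  # hard 3-minute stop
        state = step(state)
        generations += 1
        # Fast checks early in the run: bail out if an experiment broke
        # the code or started giving wrong answers.
        if generations <= 10 and not check_invariant(state, generations):
            raise RuntimeError(f"invariant failed at generation {generations}")
        if generations >= max_generations:  # cap so the sketch terminates
            break
    elapsed = time.perf_counter() - start
    return generations / elapsed

# Toy stand-ins for program.py's real step function and state:
gps = run_benchmark(step=lambda s: s + 1,
                    make_initial_state=lambda: 0,
                    check_invariant=lambda s, g: s == g)
```

Keeping the benchmark in a separate file that imports the program means an experiment can rewrite program.py freely without being able to tamper with the scoring.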

I let the autoresearch loop run for about 3 hours. It achieved a massive 160x speedup within the first hour. In fact, it was so successful that it completely maxed out the benchmark: the test case it had been given had previously taken about 15 minutes to run to completion, but with the huge speedups it finished in a few seconds, which cranked up the number of generations per second.

That means autoresearch was basically a smashing success, so much so that I need to come up with new test cases for future autoresearch efforts involving that library.

Jane Street - Permutation of Neural Network Layers

Problem: https://huggingface.co/spaces/jane-street/droppedaneuralnet

Repo: https://github.com/charlesreid1/jane-street-droppedaneuralnet-2026-03

This was an interesting test case for autoresearch that was less about optimization and more about solving a puzzle. Basically, the gist of the Jane Street puzzle is that you have a layered ResNet neural network with 96 distinct layers that go in a particular order. However, the original ordering of the layers has been lost, and needs to be reconstructed. You have known inputs and matching outputs from the original model.

The autoresearch task here was twofold:

  • First, rewrite the harness to focus on a different metric - the mean square error between the output produced by the predicted permutation of the layers and the output produced by the original model.
  • Second, on the Dwarkesh Podcast, the host mentioned that someone had solved this problem, and gave a high-level overview of the solution procedure. I wanted to see if autoresearch could find its way to implementing the solution itself, given that general description. So, I dropped more hints in this autoresearch prompt.
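The metric from the first bullet can be sketched as follows. This is a toy illustration with hypothetical names (apply_layers, permutation_mse); the real harness operates on the actual 96-layer ResNet, not scalar lambdas:

```python
import numpy as np

# Hypothetical sketch: score a candidate layer ordering by the MSE
# between its output and the reference output from the original model.

def apply_layers(x, layers, order):
    """Apply the layers in the candidate order (simplified residual blocks)."""
    for i in order:
        x = x + layers[i](x)
    return x

def permutation_mse(x, y_ref, layers, order):
    y_pred = apply_layers(x, layers, order)
    return float(np.mean((y_pred - y_ref) ** 2))

# Toy example with two "layers"; the true order is [0, 1].
layers = [lambda v: 2 * v, lambda v: v + 1]
x = np.array([1.0, 2.0])
y_ref = apply_layers(x, layers, [0, 1])      # known input/output pair
exact = permutation_mse(x, y_ref, layers, [0, 1])   # 0.0 for the true order
wrong = permutation_mse(x, y_ref, layers, [1, 0])   # non-zero otherwise
```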

It took about 20 experiments (1-2 hours) for the agent to land on an effective method for narrowing the combinatorial search space by pairing layers together. It took about 20 more experiments to rule out several ways of arranging the layers, until it struck upon an efficient approach, and a few experiments later it had found the arrangement of layers that led to a mean square error of 0 - the exact solution.
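In spirit, the pairing idea looks something like scoring how plausibly one layer follows another, then chaining the best matches instead of enumerating all orderings. This is purely a hypothetical sketch - the agent's actual method and scoring heuristic are not shown in this post, and the compatibility function here is a made-up toy:

```python
import numpy as np

# Hypothetical sketch: greedily chain layers by a pairwise
# compatibility score, turning an O(n!) permutation search into
# roughly O(n^2) pairwise comparisons.

def compatibility(layer_a, layer_b, probes):
    """Lower is better: how plausible is it that layer_b follows layer_a?
    (Toy heuristic: a good successor keeps activations bounded.)"""
    out = layer_a(probes)
    return float(np.mean(np.abs(layer_b(out))))

def greedy_chain(layers, probes, start):
    order = [start]
    remaining = set(range(len(layers))) - {start}
    while remaining:
        prev = layers[order[-1]]
        nxt = min(remaining,
                  key=lambda j: compatibility(prev, layers[j], probes))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy usage: three scalar "layers" probed with a constant input.
layers = [lambda v: 0.5 * v, lambda v: 2.0 * v, lambda v: 1.0 * v]
order = greedy_chain(layers, probes=np.ones(4), start=0)
```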

4 hours total, end to end. I only manually intervened once, around 2 hours in, to fix an authentication snag. Even though I'd set a strict wall time of 5 minutes, I noticed that the next experiment it ran after I fixed the authentication error was going to run for 15+ minutes. It had somehow decided to increase the allowed wall time so that it could test a hypothesis on more samples, and that was really slowing down the loop. I reiterated the strict 5-minute limit, and it pivoted strategies, broadening the ideas it would consider instead of getting lost down a rabbit hole.

This one intervention turned out to be important, because the different arrangements of the neural network layers all led to a fairly uniform non-zero MSE, except for the ONE EXACT arrangement that was correct. Spending a bunch of compute cycles trying to squeeze an incorrect arrangement's MSE a little lower would be a waste. The better approach was keeping the loop short and the benchmark fast.

Counting Castles

The Mount Everest of Project Euler problems.

Going in, I thought this problem would be computationally intensive, so I figured it might be a good fit for autoresearch. In fact, it was poorly suited for autoresearch, for a couple of reasons:

  • Project Euler problems are designed to be fun to discover and dive into - autoresearch tends to paper that over. Admittedly, this problem has been an itch I couldn't scratch for years, so this was less about the fun and more about cracking the problem once and for all, however many tokens it might take. It was, as they say, personal.
  • The structure of this and other Project Euler problems doesn't fit well with autoresearch. The problem presents a handful of inputs with corresponding solutions (small inputs), then one real problem without a known solution (the real input). This means whatever "research" the agent puts into improving the program for the smaller problems probably won't feed into the larger problem in the end, because the larger problem requires a totally different algorithm and approach. Distinguishing between the two in the program.md and the benchmark ate up a lot of time and effort that was not really necessary.
  • Figuring out the metric (wall time) was a big issue. I started by giving the autoresearch agent both the smaller problems (with known inputs/outputs) and the large problem, and it unsuccessfully tried to run a slow algorithm on a huge input. It also seemed to get stuck after the program would time out, and not know what to do, so it would sit and think and think until I interrupted it. If autoresearch tried to optimize too small a problem, it would start hunting for tricks around Java virtual machine runtime overhead, bit packing, and other super obscure language-specific optimizations.
  • My program.md prompt was too complicated. I was trying to give the agent the autoresearch instructions, and on top of that I gave it several sections of problem hints, plus extra instructions tailored for Java. The autoresearch prompt itself took much, much too long to write, longer than it took autoresearch to find a solution.
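One way to keep a timeout from stalling the loop is to enforce the wall-time limit from outside the candidate program, so a timeout becomes a clean failed experiment rather than a hang. This is a sketch, not the harness actually used; run_candidate and the 300-second default are illustrative:

```python
import subprocess
import sys

# Hypothetical sketch: run the candidate program in a subprocess with a
# hard wall-time limit, and report timeouts as a normal failure status
# the autoresearch loop can react to.

def run_candidate(cmd, limit_seconds=300):
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": None}
    if result.returncode != 0:
        return {"status": "error", "output": result.stderr}
    return {"status": "ok", "output": result.stdout.strip()}

# Toy usage: a trivial candidate that finishes well under the limit.
outcome = run_candidate([sys.executable, "-c", "print(2 + 2)"])
```

The same wrapper works for a Java candidate by swapping the command for a `java` invocation.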

In hindsight, I should have fed the original problem, plus all of my hints, to the agent, without any of the autoresearch scaffolding. I estimate it probably would have taken an unsupervised agent working in a loop less than an hour to solve the problem, given sufficient mathematics knowledge (and that's where the hints really shine).