How I Won a Solver Medal at AIMO3
- AIMO3
- Gemma 4
- Math Reasoning
- Agentic LLMs
- Kaggle
AIMO3 is the third iteration of the AI Mathematical Olympiad Progress Prize, a $2.2M Kaggle competition backed by XTX Markets. You get one H100, zero internet, a five-hour clock, and 50 original olympiad-level math problems. Your model has to return a non-negative integer between 0 and 99999 for each one. That's it. No partial credit, no explanation score, just right or wrong.
Previous winners scored 34/50 using GPU clusters. This edition bumped the difficulty up to full IMO level: number theory, combinatorics, geometry, and algebra, all inside a single offline Kaggle notebook. I ended up with a solver medal, and this is what the pipeline looked like.
What I tried first
The obvious first attempt was a single-pass solver: load a Qwen 3 base model, write a strong system prompt, generate once, extract the boxed answer. It works fine on textbook problems. It falls apart completely on IMO-level problems where the model needs to verify, backtrack, or test a conjecture numerically before committing. A single generation pass gives the model no feedback loop; it's just vibing with no error correction.
I also tried building a verifier layer: a second model pass that reads the first model's solution and checks it for logical errors. In practice this just doubled inference time, and the verifier wasn't reliably better than the generator at catching mistakes. What I actually needed was not a smarter critic, but a way for the model to run code against its own reasoning mid-solution and course-correct on the spot.
What finally worked: parallel agents with voting
The winning design is dead simple in concept: run 8 independent attempts per problem in parallel, where each attempt is an agentic loop in which the model can call a Python tool as many times as it wants; collect all final answers; and pick the one with the most weighted votes. Each attempt runs in its own persistent Jupyter kernel: the model writes code, the sandbox executes it, the output goes back into the conversation, and the model keeps going for up to 128 turns or until the time budget runs out.
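One attempt's loop looks roughly like this. This is a sketch, not the real pipeline: `generate_turn`, `run` and the `Reply` fields are hypothetical names standing in for the actual model wrapper and sandbox API.

```python
import time

def run_attempt(model, sandbox, problem, max_turns=128, deadline=None):
    """One agentic attempt: alternate LLM turns and tool executions
    until the model commits to an answer, the turn cap, or the deadline."""
    messages = [{"role": "user", "content": problem}]
    for _ in range(max_turns):
        if deadline is not None and time.time() > deadline:
            break                                    # time budget exhausted
        reply = model.generate_turn(messages)        # one LLM turn (hypothetical API)
        messages.append({"role": "assistant", "content": reply.text})
        if reply.final_answer is not None:           # model committed to an answer
            return reply.final_answer
        if reply.code:                               # model asked to run Python
            output = sandbox.run(reply.code)         # persistent Jupyter kernel
            messages.append({"role": "tool", "content": output})
    return None                                      # no answer within budget
```

The key property is that tool output is appended to the conversation, so every subsequent turn is conditioned on real execution results rather than the model's guess about what its code would print.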
Voting is weighted by a confidence proxy: attempts with fewer Python errors get higher weight. If 4 out of 8 attempts agree on the same answer before the timeout, the solver stops early; there's no point burning compute when the consensus is already clear. This early stop at CFG.early_stop = 4 alone recovered significant wall-clock time on easier problems, leaving budget for the hard ones.
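A minimal version of the weighted vote could look like the following. The exact weighting function is my assumption here; the pipeline only guarantees that attempts with fewer Python errors count for more.

```python
from collections import defaultdict

def weighted_vote(attempts):
    """attempts: list of (answer, n_python_errors) tuples.
    Returns the answer with the highest total weight, or None."""
    scores = defaultdict(float)
    for answer, n_errors in attempts:
        if answer is None:
            continue                      # attempt never committed to an answer
        scores[answer] += 1.0 / (1 + n_errors)   # fewer errors -> higher weight
    if not scores:
        return None
    return max(scores, key=scores.get)
```

With `1/(1+errors)` as the weight, a clean attempt counts as a full vote and an error-ridden one counts fractionally, so a noisy majority can still lose to a clean minority.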
Trick 1 - preloading model files into OS page cache
Gemma-4-31B at full precision is a lot of disk to read at model load time. The standard transformers from_pretrained() call reads each shard sequentially, which is brutally slow on Kaggle's storage. So before calling from_pretrained(), I walk the entire model directory, collect every file, and read them all in parallel across 8 threads; each thread reads 1GB chunks and discards the data immediately. The only point is to force the OS to page those bytes into RAM so the actual model load hits cache instead of disk.
This is not a transformers feature or an HF trick. It's just reading files before PyTorch needs them so the kernel has them warm. On the H100 node, this took about 40-50 seconds but saved several minutes off the actual model load. In a 5-hour competition where the first problem doesn't arrive until the model is loaded, that delta matters.
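The warming pass is small enough to show in full. This is a sketch under the same assumptions as the text: plain stdlib file reads, 8 worker threads, 1GB chunks, bytes discarded immediately.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 30  # 1 GB reads

def _warm(path):
    # Read the file in large chunks and throw the bytes away; the only
    # side effect we want is the kernel caching the pages in RAM.
    with open(path, "rb", buffering=0) as f:
        while f.read(CHUNK):
            pass

def warm_page_cache(model_dir, workers=8):
    """Touch every file under model_dir in parallel so a subsequent
    from_pretrained() hits the OS page cache instead of disk."""
    files = [os.path.join(root, name)
             for root, _, names in os.walk(model_dir)
             for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(_warm, files))
```

Threads are fine here despite the GIL because the work is pure I/O; `read()` releases the GIL while blocking on the disk.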
Trick 2 - a pool of 16 persistent Jupyter kernels
Instead of spawning a new Python process for each tool call, I initialize 16 AIMO3Sandbox instances at startup. Each one is a full KernelManager with its own set of 5 ports, pre-loaded with math, numpy, sympy, itertools, collections, and mpmath at 64 decimal places. They sit in a thread-safe queue. When an attempt needs to run code, it grabs a kernel from the pool, executes, and puts it back after a %reset -f.
This means kernel startup overhead happens exactly once per kernel per run, not once per tool call. With 8 attempts running in parallel, each making multiple tool calls per turn, cold-starting kernels on demand would've been a catastrophe. The pool also made sandbox timeout handling cleaner: on timeout, I interrupt the kernel with km.interrupt_kernel() so it stays alive in the pool rather than dying and needing to be replaced.
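The pool pattern itself is a few lines of `queue.Queue`. In this sketch, `make_sandbox` stands in for whatever wraps a jupyter_client KernelManager; `execute` and `reset` are illustrative method names, not the real AIMO3Sandbox API.

```python
import queue

class SandboxPool:
    """Thread-safe pool of pre-started sandboxes. Startup cost is paid
    once per sandbox at construction, never per tool call."""

    def __init__(self, make_sandbox, size=16):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_sandbox())   # start all kernels up front

    def run(self, code, timeout=30):
        sandbox = self._pool.get()           # blocks until a kernel is free
        try:
            return sandbox.execute(code, timeout=timeout)
        finally:
            sandbox.reset()                  # e.g. %reset -f in a real kernel
            self._pool.put(sandbox)          # kernel stays alive for reuse
```

Returning the sandbox in a `finally` block is the important detail: even if an execution raises or times out, the kernel goes back into the pool instead of leaking.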
Trick 3 - budget allocation across problems
Five hours across 50 problems averages to 6 minutes each, but olympiad problems are wildly uneven in difficulty. Spending 6 minutes on a combinatorics warmup and 6 minutes on a hard number theory problem is a bad trade. Instead, before each problem I compute remaining wall-clock time, subtract a 300-second reserve for each unsolved problem still in the queue, and allocate whatever's left to the current problem, capped at 900 seconds.
This means easy problems that hit early-stop at 4 agreeing votes bank time for harder ones. Hard problems that need 8 full attempts and still have no consensus get the most budget. The system naturally weights time toward difficulty without any explicit difficulty classifier. It's just greedy allocation with a floor and ceiling, and it kept me from bleeding out on a single hard problem.
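The whole allocator fits in one function. One detail is my reading rather than something the pipeline spells out: I'm treating the 300-second reserve as the per-problem floor as well.

```python
RESERVE_PER_PROBLEM = 300   # seconds held back for each unsolved problem
BUDGET_CAP = 900            # never give one problem more than 15 minutes

def problem_budget(seconds_remaining, unsolved_after_this):
    """Greedy allocation: reserve a floor for everything still queued,
    hand the rest (capped) to the current problem."""
    spare = seconds_remaining - RESERVE_PER_PROBLEM * unsolved_after_this
    return max(RESERVE_PER_PROBLEM, min(BUDGET_CAP, spare))
```

Early in the run, `spare` is large and every problem gets the full 900-second cap; as the clock drains, the reserve term dominates and allocations shrink toward the floor, which is exactly the "bank time on easy problems, spend it on hard ones" behavior described above.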