How I Reached #1 on ARC-AGI-2
- ARC-AGI-2
- Test-Time Training
- Qwen
- Kaggle
- Reasoning
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus, a benchmark designed to measure fluid intelligence in AI systems. The task sounds deceptively simple: you're shown a few input-output grid pairs demonstrating some transformation rule, then asked to apply that same rule to a new input grid. The grids use 10 colors and measure up to 30x30. No math, no language, just pattern recognition and reasoning about spatial transformations. Frontier models with hundreds of billions of parameters still barely crack it.
When I looked at this competition, the thing I kept coming back to was that I'd already solved a harder version of the core problem at AIMO3: time-budget-aware parallel solving with aggregated voting. The domain was completely different, but the skeleton was the same. I started adapting from there.
What I tried first
The natural first attempt was pure few-shot prompting: feed the model the training grid pairs as context and ask it to predict the test output. This fails almost immediately on ARC-AGI-2 because the transformations are compositional and often involve rules the model has never seen expressed in language. Greedy decoding gives you one guess and zero recovery mechanism when it's wrong.
I also tried ensembling outputs from multiple random prompt orderings. Shuffling which training pair comes first changes what the model attends to, but without any actual adaptation, you're just sampling the same confused prior from different angles. The model doesn't know the puzzle any better after seeing the examples in a different order. What I actually needed was for the model to internalize the transformation rule, not just see it in context.
What finally worked: test-time training per puzzle
The core insight behind the winning approach is that each ARC puzzle is its own tiny dataset. The training pairs in the puzzle are literally labeled examples of the exact rule you need to apply. So instead of treating inference as a static lookup, I fine-tune the model on those examples, with heavy augmentation, before running inference on the test grid. This is Test-Time Training (TTT): the model temporarily learns the specific transformation rule for this puzzle, then solves it.
I used Qwen3-4B loaded in 4-bit quantization with Unsloth, wrapped in a LoRA adapter with r=256 targeting all projection layers, embed_tokens, and lm_head. For each puzzle, the LoRA weights are reset to defaults, fine-tuned for one epoch on the augmented puzzle data, and then used for inference. The base weights never change. This runs on two T4 GPUs in parallel via mp.spawn, with a shared puzzle queue; it's the same pattern I used for the kernel pool in AIMO3.
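None of the orchestration code appears above, but the shared-queue pattern can be sketched. This is a minimal stand-in, not the actual competition code: threads play the role of the two mp.spawn GPU processes, and `solve` is a hypothetical placeholder for the per-puzzle routine (reset LoRA, train one epoch, run inference).

```python
import queue
import threading

def run_workers(puzzles, solve, n_workers=2):
    """Shared-queue worker pool: each worker pulls the next unsolved
    puzzle, so a fast puzzle never leaves a worker idle. Threads stand
    in here for the two GPU processes launched with mp.spawn."""
    q = queue.Queue()
    for p in puzzles:
        q.put(p)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                p = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            out = solve(p)  # in the real pipeline: reset LoRA, TTT, infer
            with lock:
                results[p] = out

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The point of the pull model is load balancing: neither worker is pre-assigned half the puzzles, so one slow puzzle can't strand a batch of easy ones behind it.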
Aggressive augmentation during training
A puzzle might have only 3 or 4 training pairs. Fine-tuning on 4 examples is almost useless unless you can manufacture more. The augmentation pipeline generates 16 variants per puzzle by composing three types of transformations: transpose, three rotations (rot90, rot90×2, rot90×3), and random color permutations. Each variant is a completely valid instance of the same underlying rule, just viewed from a different angle or with relabeled colors. After augmentation, the training set is large enough for the LoRA adapter to actually learn something.
The color permutation is particularly important. ARC grids use 10 colors (0–9) and models trained purely on original grids can accidentally learn color-specific patterns instead of structural ones. Randomly permuting all color indices across every augmented variant forces the model to learn the transformation as a structural relationship, not a color shortcut. The same permutation is applied consistently to both input and output grids so the rule still holds.
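A minimal sketch of this augmentation scheme, assuming the 16 variants come from 8 geometric views (identity/transpose × 4 rotations) each paired with two random color permutations; the exact composition in the real pipeline may differ:

```python
import random

def rot90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

def permute_colors(grid, perm):
    """Relabel every cell through a permutation of the 10 ARC colors."""
    return [[perm[c] for c in row] for row in grid]

def augment_pairs(pairs, n_perms=2, seed=0):
    """Generate variants of (input, output) pairs: 8 geometric views
    x n_perms color permutations = 16 variants with n_perms=2. The same
    transform is applied to input and output so the rule still holds."""
    rng = random.Random(seed)
    variants = []
    for t in (False, True):
        for k in range(4):
            geo = []
            for inp, out in pairs:
                gi, go = (transpose(inp), transpose(out)) if t else (inp, out)
                for _ in range(k):
                    gi, go = rot90(gi), rot90(go)
                geo.append((gi, go))
            for _ in range(n_perms):
                perm = list(range(10))
                rng.shuffle(perm)  # all 10 color indices shuffled
                variants.append([(permute_colors(i, perm),
                                  permute_colors(o, perm))
                                 for i, o in geo])
    return variants
```

Because each variant applies one permutation consistently to input and output, every augmented pair is still a valid instance of the same underlying rule.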
Turbo DFS with vocabulary restriction and KV cache
Standard beam search for grid generation is expensive because the vocabulary is the full tokenizer's, tens of thousands of tokens. ARC grids only ever contain the digits 0–9, newlines, and the EOS token. That's 12 tokens total. I restricted decoding to only this ARC vocabulary and built a custom DFS decoder called turbo_dfs that explores the solution space depth-first, pruning any path whose cumulative NLL exceeds -log(0.2) ≈ 1.61, i.e. any path whose total probability drops below 0.2. Paths that exceed the budget are abandoned early rather than completed and discarded at the end.
The KV cache reuse is what makes this fast enough to be practical. Each node in the DFS tree corresponds to a partial grid sequence. When the DFS extends a node by one token, I pass the existing past_key_values cache from the parent node forward, so the model only processes the single new token at each step, not the entire sequence from scratch. Since the search tree fans out from a shared prefix (the puzzle context), the first forward pass is the only expensive one. Everything after that is single-token extensions on top of cached activations.
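The real turbo_dfs drives a transformer with a reused past_key_values cache; the control flow can still be shown with a toy stand-in, where a hypothetical `next_logprobs` function plays the role of a single-token forward pass:

```python
import math

# the restricted 12-token ARC vocabulary described above
ARC_VOCAB = [str(d) for d in range(10)] + ["\n", "<eos>"]

NLL_BUDGET = -math.log(0.2)  # ~1.61: drop paths with total prob < 0.2

def turbo_dfs(next_logprobs, max_len=16):
    """Depth-first search over ARC tokens with NLL pruning.
    `next_logprobs(prefix)` returns {token: logprob}; in the real
    decoder this is a model call that reuses the parent node's
    past_key_values cache, so each extension costs one token."""
    results = []
    stack = [((), 0.0)]  # (partial sequence, cumulative NLL)
    while stack:
        seq, score = stack.pop()
        for tok, lp in next_logprobs(seq).items():
            new_score = score - lp       # accumulate negative log-likelihood
            if new_score > NLL_BUDGET:
                continue                 # prune early, never finish this path
            if tok == "<eos>":
                results.append(("".join(seq), new_score))
            elif len(seq) < max_len:
                stack.append((seq + (tok,), new_score))
    return sorted(results, key=lambda r: r[1])  # best (lowest NLL) first
```

With a toy distribution where each of two positions emits "1" with p=0.6 and "2" with p=0.4, the path "22" (probability 0.16) falls under the 0.2 budget and is pruned at its second token, while the other three paths complete.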
Augmented re-scoring to pick the best solution
The DFS returns multiple candidate solutions for each puzzle: different grids that were all plausible under the model's distribution. Picking the right one is not trivial. The beam score (cumulative NLL) alone isn't a reliable signal because it only measures how fluent the output is, not whether it's actually the correct transformation. To get a better signal, I score each candidate solution against augmented versions of the puzzle: I create the augmented dataset with the candidate as the answer, then ask the model to score how well it predicts that candidate as the output across several augmented query-answer pairs.
A correct solution should score well across all augmentations of the puzzle, because the same rule applies regardless of rotation or color permutation. An incorrect solution that looks plausible from one angle will typically score badly once you rotate the puzzle or permute the colors. The final ranking combines beam score and mean augmented score using two selection algorithms, score_full_probmul_3 and score_kgmon, and the top-2 predictions per puzzle are submitted as attempt_1 and attempt_2.
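The internals of score_full_probmul_3 and score_kgmon aren't spelled out above, so here is a generic stand-in that captures the idea: multiply a candidate's probabilities across augmented views (a sum in log space) and combine that with the beam score to pick the top two attempts.

```python
def rank_candidates(candidates, aug_logprobs, beam_nll, top_k=2):
    """Rank candidate grids by mean augmented log-prob minus beam NLL.

    aug_logprobs[c] : log-probs the model assigned to candidate c under
                      each augmented view of the puzzle. Summing logs
                      (multiplying probabilities) rewards answers that
                      stay plausible under every rotation and color
                      relabeling; one bad view tanks the score.
    beam_nll[c]     : cumulative NLL from the DFS decode (lower = better).
    """
    def combined(c):
        mean_aug = sum(aug_logprobs[c]) / len(aug_logprobs[c])
        return mean_aug - beam_nll[c]    # higher is better
    return sorted(candidates, key=combined, reverse=True)[:top_k]
```

In this sketch a candidate that scores -0.1-ish on every augmentation beats one that is excellent on a single view but collapses under rotation, which is exactly the failure mode the augmented re-scoring is meant to catch; the two survivors map to attempt_1 and attempt_2.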
Adapting from AIMO3
Coming off AIMO3, three patterns carried over directly. First, the global time budget logic: a 12-hour wall clock with a 600-second reserve, and a per-puzzle hard cutoff at 1200 seconds. Puzzles that finish fast bank time for harder ones. Second, the parallel worker pattern: instead of a kernel pool, I had two GPU processes pulling from a shared mp.Manager queue, each handling a disjoint subset of puzzles. Third, the multiple-candidate-plus-voting structure: in AIMO3 it was 8 attempts voting on an integer answer; here it's DFS beams ranked by augmented scores. The framing is identical.
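The allocator itself isn't shown; this is a minimal, hypothetical version of the banking behavior described, using the constants from above (12-hour wall clock, 600-second reserve, 1200-second per-puzzle cap):

```python
import time

TOTAL_BUDGET = 12 * 3600   # 12-hour wall clock, in seconds
RESERVE = 600              # kept back for submission overhead
PER_PUZZLE_CAP = 1200      # hard cutoff for any single puzzle

def puzzle_deadline(start_time, puzzles_left, now=None):
    """Seconds the next puzzle may use: an even split of whatever
    remains after the reserve, capped at the per-puzzle limit.
    Puzzles that finish early leave their unused share in the pool,
    so later (harder) puzzles automatically inherit the banked time."""
    now = time.time() if now is None else now
    remaining = TOTAL_BUDGET - RESERVE - (now - start_time)
    fair_share = remaining / max(puzzles_left, 1)
    return max(0.0, min(PER_PUZZLE_CAP, fair_share))
```

Early in the run the fair share is well under the cap; once most puzzles are done, the survivors each get the full 1200 seconds.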
The main new thing ARC-AGI-2 added that AIMO3 didn't need was the test-time training loop itself. In AIMO3 the model was fixed and the loop was multi-turn reasoning. Here, the model adapts to each puzzle before reasoning at all. That felt like a natural extension, instead of giving the model more turns, give it a few gradient steps first.
Open source and what came after
After the submission went through I open-sourced the full notebook on Kaggle. The response was way beyond what I expected. Other competitors forked it, swapped in larger Qwen checkpoints, extended the augmentation pipeline, and started building ensemble strategies on top of the turbo_dfs decoder. Within a few weeks, multiple teams had pushed the core approach further than I could alone: stronger LoRA configs, better augmentation seeds, refined scoring functions. Seeing the notebook become a foundation rather than just a personal submission was genuinely the most satisfying part of this.
If you're looking at this to build something stronger: the highest-leverage changes I'd chase next are a larger base model with more LoRA capacity, a learned scoring function to replace the heuristic augmented re-scoring, and a smarter per-puzzle time allocator that uses early inference results to decide whether to keep training or cut losses and move on.