Why OpenAI Sent Me $500 for a Research Project
- Parameter Golf
- OpenAI
- Cross-Sparse Attention
- GPTQ
- Model Compression
On March 18, 2026, OpenAI launched Parameter Golf, a competition to train the best language model that fits inside 16MB and trains in under 10 minutes on 8×H100 GPUs. Submissions are judged on a single number: bits-per-byte (BPB) on a held-out slice of the FineWeb dataset. The baseline they shipped scores 1.2244. The current SOTA is 1.1147. I submitted PR #531 on March 22. My score: 1.1271 BPB. Top 5 on the global leaderboard. OpenAI sent me $500.
My submission was called XSA4_EMA_GPQ. Three techniques stacked on top of each other, each one doing a different job. This post explains exactly what each of them is, why it works, and why I picked this specific combination.
What bits-per-byte actually means
BPB stands for bits-per-byte. It answers a deceptively simple question: if you used this model as a compressor, how many bits would it need to encode each raw byte of text? A model that perfectly predicts every next character needs 0 bits: it already knows what's coming. A model that's completely random needs 8 bits per byte, since a byte has 8 bits of entropy. English text is estimated to have an entropy of around 1.0–1.1 BPB, so every submission in this competition is trying to get a model to compress text nearly as well as the statistical limits of English allow.
BPB is tokenizer-agnostic because the denominator is always raw UTF-8 bytes, not tokens. This matters because models with different vocabulary sizes would otherwise be unfairly compared: a byte-level model with a 256-token vocab and a BPE model with a 16,384-token vocab would look wildly different on a per-token loss metric even if they model the same information. BPB normalises everything to the same denominator. Lower is strictly better.
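The conversion is a one-liner: sum the model's next-token negative log-likelihood over the text, convert nats to bits, and divide by the byte count. A minimal sketch (the example losses are made up for illustration):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed next-token negative log-likelihood (in nats)
    into bits per raw UTF-8 byte of the evaluated text."""
    total_bits = total_nll_nats / math.log(2)   # nats -> bits
    return total_bits / total_bytes

# Two tokenizations of the same 4096-byte text: if both assign the
# text the same total probability, BPB is identical regardless of
# token count or per-token loss.
bpb_byte_level = bits_per_byte(4096 * 0.76, 4096)   # 4096 byte-tokens
bpb_bpe        = bits_per_byte(1024 * 3.04, 4096)   # 1024 BPE tokens
```

The BPE model's per-token loss (3.04 nats) looks four times worse than the byte-level model's (0.76 nats), yet both come out at the same BPB, which is exactly the normalisation the metric is for.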
The constraint: 16MB is nothing
At float32, 16MB gives you exactly 4 million parameters. At float16, 8 million. At int8, 16 million. At int4 (4 bits per weight), 32 million. The entire architecture game in this competition is about packing as many useful parameters as possible into that budget, which immediately makes quantization not optional, but central. The 10-minute training cap on 8×H100s means you can do roughly 4,000–6,000 gradient steps depending on sequence length and batch size. Every architectural choice has to earn its keep within that compute budget.
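The budget arithmetic above can be checked directly. A minimal sketch, assuming the post's decimal megabytes (16,000,000 bytes) and ignoring any overhead for quantization scales or metadata:

```python
BUDGET_BYTES = 16_000_000  # 16 MB budget (decimal, as used in the post)

def max_params(bits_per_weight: int) -> int:
    """How many weights fit in the budget at a given precision."""
    return BUDGET_BYTES * 8 // bits_per_weight

for bits in (32, 16, 8, 6, 4):
    print(f"int/float{bits:>2}: {max_params(bits):>12,} params")
```

The int6 row is the interesting one for this submission: 6-bit weights buy roughly 21.3 million parameter slots, a third more than int8.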
The baseline model ships as a 9-layer, 512-dim transformer with a 1024-token vocabulary, tied embeddings, and int8 quantization. It scores 1.2244 BPB. The top submission improved that to 1.1147, a 9% compression improvement, by stacking techniques across quantization, architecture, and training strategy. My submission sits at 1.1271, contributing primarily through the combination of XSA4, EMA, and GPTQ Int6.
Cross-Sparse Attention (XSA4)
Standard transformer self-attention is applied at every single layer. Every token attends to every other token, at every depth of the network. This is expensive in terms of both compute and the number of parameters sitting in Q, K, V, and output projection matrices. Cross-Sparse Attention (XSA) is an architectural change: instead of running full self-attention at every layer, you run it only at specific layers, and make the other layers MLP-only. XSA4 means full self-attention is applied only at the last 4 layers of the model. Every earlier layer is pure feedforward.
Why does this help? Two reasons. First, early layers in a transformer mostly learn local, position-sensitive features: character n-grams, short-range syntax, and the like. Full attention with a global receptive field is overkill for that job. Second, removing attention from early layers frees up parameter budget. The parameters you save on Q/K/V/O projections for those layers can be reallocated to wider MLP layers or more depth. XSA variants dominate 3 of the top 6 leaderboard slots. The intuition is correct: not every layer needs to see everything at once.
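As a sketch of the layer plan (the labels `mlp` and `attn+mlp` are my own shorthand, not the competition's):

```python
def xsa_layout(n_layers: int, n_attn_tail: int) -> list[str]:
    """XSA layer plan: MLP-only blocks early, full self-attention
    only in the final n_attn_tail layers."""
    return ["mlp" if i < n_layers - n_attn_tail else "attn+mlp"
            for i in range(n_layers)]

# The baseline's 9 layers under XSA4: 5 feedforward-only blocks,
# then 4 full transformer blocks at the top of the stack.
plan = xsa_layout(9, 4)
```

With this plan, 5 of 9 layers carry no Q/K/V/O projections at all, and that saved budget is what gets reinvested in the MLPs.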
Exponential Moving Average (EMA)
During training, model weights are updated every step via gradient descent. At any given step, the current weights reflect the last gradient update, which is noisy, especially early in training when the loss surface is steep. Exponential Moving Average (EMA) maintains a shadow copy of the weights that is a smoothed version of the training trajectory. At every step, the EMA weights are updated as: EMA_weights = 0.997 × EMA_weights + 0.003 × current_weights. The shadow copy changes slowly, accumulating a weighted average of all past weight states with recent states contributing more.
The key insight is that the EMA weights are almost always better than the final raw checkpoint for inference. The optimizer takes aggressive steps to minimise loss quickly, and those steps sometimes overshoot into slightly worse regions of the loss surface. The EMA smooths out those overshoots. When you evaluate at the end of training, you evaluate the EMA weights, not the raw optimiser weights. In practice, EMA at decay 0.997 reliably drops BPB by 0.002–0.005 compared to the raw final checkpoint, which in this competition is a meaningful gap. However, as other competitors discovered, EMA only helps when training runs long enough: at under ~4,000 steps the shadow weights are still contaminated by early, poorly-converged states and can actually hurt.
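The update rule above is a few lines of code. A minimal sketch over a named set of scalar weights (real implementations apply the same rule tensor-by-tensor):

```python
def ema_update(ema: dict, current: dict, decay: float = 0.997) -> dict:
    """One EMA step: the shadow copy keeps 99.7% of its old value
    and takes 0.3% of the current training weights."""
    return {name: decay * ema[name] + (1 - decay) * current[name]
            for name in ema}

# The shadow weights drift slowly toward wherever training settles:
# after 1000 steps toward w=1.0, the shadow has covered 1 - 0.997^1000
# of the distance, about 95%.
shadow = {"w": 0.0}
for _ in range(1000):
    shadow = ema_update(shadow, {"w": 1.0})
```

The closed form also explains the short-run failure mode from the paragraph above: after only 1,000 steps the shadow still carries ~5% of its starting state, and at a few hundred steps far more, so early garbage lingers in the average.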
GPTQ with Int6 mixed precision
GPTQ takes its name from the models it was designed for: it is post-training quantization for generative pre-trained transformers, and it is much smarter than naive rounding. The naive approach is: take each float32 weight, round it to the nearest int8 value, done. GPTQ instead uses second-order information (the Hessian of the loss with respect to each weight) to work out which weights are most sensitive to rounding error, and compensates for the error introduced by quantizing one weight by adjusting the not-yet-quantized weights around it. This error propagation, made efficient via a Cholesky decomposition of the inverse Hessian, means GPTQ can quantize more aggressively while preserving nearly the same model quality.
I used GPTQ at int6 precision for the MLP weights: 6 bits per weight instead of 8. Int6 fits more parameters into the 16MB budget than int8, and because GPTQ compensates for rounding error rather than just hoping the model survives it, the quality degradation is minimal. The attention projections stayed at int8 because attention heads are more sensitive to weight perturbations. This mixed-precision strategy (int6 for MLP, int8 for attention) is the same philosophy as my AIMO3 work: allocate the budget where sensitivity is lowest, protect it where it matters most.
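To make the int6-vs-int8 trade-off concrete, here is a naive symmetric round-to-nearest quantizer. This is a stand-in, not GPTQ itself: real GPTQ additionally redistributes each weight's rounding error onto not-yet-quantized weights using the Hessian, but the grid sizes (and hence the footprint math) are the same. The example weights are made up:

```python
def fake_quantize(weights: list[float], bits: int) -> list[float]:
    """Symmetric round-to-nearest quantize/dequantize. A naive
    stand-in for GPTQ: same int grid, no error compensation."""
    qmax = 2 ** (bits - 1) - 1               # 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

weights = [0.8, -0.31, 0.05, -0.77]
mlp_q  = fake_quantize(weights, bits=6)   # coarser grid, smaller footprint
attn_q = fake_quantize(weights, bits=8)   # finer grid for sensitive layers
```

Int6 has 64 grid points against int8's 256, so its worst-case rounding error is roughly 4× larger per weight; GPTQ's job is to claw most of that back, which is why int6 is viable for the MLPs at all.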
Why this combination specifically
XSA4 improves the architecture's parameter efficiency: you get the same modelling quality with fewer attention parameters, freeing room for deeper or wider MLPs. EMA improves weight quality at inference time: the final model you submit is smoother and more generalised than any single checkpoint. GPTQ Int6 compresses the weights further into the 16MB budget without the quality loss that naive quantization would cause. Each technique attacks a different axis: architecture, training stability, and compression. They are additive because they don't interfere with each other.
The combination lands at 1.1271 BPB, 0.0124 above the #1 slot at 1.1147. The top submission adds self-generated GPTQ calibration: after training, the model generates its own calibration sequences from its own distribution, collects Hessians from forward hooks, and runs GPTQ on those self-generated sequences rather than held-out data. That extra step is what closes the gap, and in a future submission it would be the first thing I'd add on top of this stack.
What AIMO3 and ARC-AGI taught me here
The instinct that carried over from both competitions was: don't optimise the obvious thing in isolation. In AIMO3, I didn't just pick a better model; I built a better inference loop. In ARC-AGI, I didn't just prompt better; I fine-tuned per puzzle. In Parameter Golf, the obvious thing is to make the model smaller. The actual lever is information density per parameter, and XSA + EMA + GPTQ are three completely different angles on that same problem: architectural density, training-time density, and compression-time density.
The budget-allocation mindset also carried over directly. In AIMO3 I distributed wall-clock time across problems. Here I distributed bit budget across components: int6 where sensitivity is low, int8 where it's high, full attention only in the last 4 layers. It's the same greedy allocation logic; only the resource is bits instead of seconds.
The $500 and what it actually is
OpenAI awards prizes and compute credits to submissions that meaningfully advance the leaderboard or demonstrate novel technique combinations. Mine was PR #531, XSA4_EMA_GPQ, dated March 22, 2026. The competition is explicitly designed as a talent-scouting pipeline: top performers, including students and olympiad competitors, are considered for junior researcher roles in June 2026.
The $500 is less interesting than the fact that the submission is fully open source on the OpenAI repository: code, weights, and training config. The combination of XSA4 + EMA + GPTQ was immediately available for anyone to fork and extend. That's the real point of doing it publicly. The techniques that improve BPB in a 16MB toy competition are the exact same techniques that make models run on phones, edge hardware, and constrained inference budgets. Parameter Golf is a talent hunt dressed up as a compression challenge, and it turns out competing seriously in toy problems is one of the faster ways to learn real things.