Blog archive
I use this space to write about AI systems, model training, and the practical engineering decisions behind shipping reliable machine learning work.
The writing follows the same pattern as the rest of this portfolio: direct notes, measurable trade-offs, and lessons that came out of real implementation work rather than abstract speculation.
How I Built a Custom Harness to Beat the TerminalBench Leaderboard — Using GLM
Apr 2026 · 13 min read
- The harness I built around GLM for TerminalBench, covering entry-point discovery, enforced planning state, trace retention, and parallel subagents.
- Why the leaderboard data convinced me that harness engineering can close the gap between a mid-tier model and frontier agents.
Read article
- TerminalBench
- GLM
- Agent Harnesses
- Benchmarking
- Tool Use
Why OpenAI Sent Me $500 for a Research Project
Apr 2026 · 8 min read
- How I built XSA4 + EMA + GPTQ-Int6 to place top-5 globally in OpenAI's Parameter Golf challenge.
- Explaining bits-per-byte, Cross-Sparse Attention, EMA smoothing, and what GPTQ actually does under the hood.
Read article
- Parameter Golf
- OpenAI
- Cross-Sparse Attention
- GPTQ
- Model Compression
How I Reached #1 on ARC-AGI-2
Apr 2026 · 7 min read
- How I adapted my parallel agent setup from AIMO3 to hit the top of the ARC Prize 2026 leaderboard.
- I used per-puzzle test-time training, a heavily restricted DFS beam search, and an augmented re-scoring trick to make it work.
Read article
- ARC-AGI-2
- Test-Time Training
- Qwen
- Kaggle
- Reasoning
How I Won a Solver Medal at AIMO3
Apr 2026 · 6 min read
- A breakdown of the system I built for AIMO3, a $2.2M Kaggle math competition.
- I used Gemma-4-31B, parallel sandboxed execution, weighted voting, and a nasty budget-aware time allocator to scrape a solver medal.
Read article
- AIMO3
- Gemma 4
- Math Reasoning
- Agentic LLMs
- Kaggle
Designing LLM Systems That Stay Fast Under Load
Apr 2026 · 6 min read
- How I split a strict latency budget across retrieval, prompt assembly, inference, and post-processing when moving AI systems from a neat demo to actual production.
- Engineering trade-offs I care about: caching, parallel execution, and why I sometimes deliberately pick a smaller model over a massive one.
Read article
- LLM Systems
- Latency
- Inference
- Observability
What Fine-Tuning Smaller Models Taught Me About Reasoning
Mar 2026 · 8 min read
- Notes on building compact reasoning models when compute is tight.
- Why I think dataset hygiene and strict evaluation loops matter way more than scaling up expensive, unrepeatable training runs.
Read article
- Fine-tuning
- Reasoning
- PEFT
- Evaluation
Pragnyan Ramtha · 2026