A new mini-model called TRM shows that recursive reasoning with tiny networks can outperform large language models on tasks like Sudoku and the ARC-AGI test - using only a fraction of the compute power.
Researchers at Samsung SAIL Montreal introduced the "Tiny Recursive Model" (TRM), a compact design that outperforms large models such as o3-mini and Gemini 2.5 Pro on complex reasoning tasks, despite having just seven million parameters. By comparison, the smallest language models typically range from 3 to 7 billion parameters.
According to the paper "Less is More: Recursive Reasoning with Tiny Networks," TRM reaches 45 percent on ARC-AGI-1 and 8 percent on ARC-AGI-2, outperforming much larger models including o3-mini-high (3.0 percent on ARC-AGI-2), Gemini 2.5 Pro (4.9 percent), DeepSeek R1 (1.3 percent), and Claude 3.7 (0.7 percent). The authors say TRM achieves this with less than 0.01 percent of the parameters used by those large models. Only heavier frontier systems such as Grok-4-thinking (16.0 percent) and Grok-4-Heavy (29.4 percent) still score higher on ARC-AGI-2.
In other benchmarks, TRM boosted test accuracy on Sudoku-Extreme from 55.0 to 87.4 percent and on Maze-Hard from 74.5 to 85.3 percent compared to the "Hierarchical Reasoning Model" that inspired its design.
Small model, big impact
TRM functions like a tight, repeating correction loop. It maintains two pieces of short-term memory: the current solution ("y") and a sort of scratchpad for intermediate steps ("z"). At each stage, the model updates this scratchpad by reviewing the task, its current solution, and its prior notes, then produces an improved output based on that information.
This loop runs multiple times, gradually refining earlier mistakes without requiring a massive model or lengthy chains of reasoning. The researchers say a small network with only a few million parameters is enough to make this process work.
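To make this loop concrete, here is a minimal PyTorch-style sketch of the idea. The names and sizes (TinyRecursiveSketch, n_think, n_improve, 128-dimensional states) are illustrative assumptions, not the authors' implementation; the real model embeds puzzle tokens and decodes y back into a grid.

```python
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    """Illustrative TRM-style refinement loop (a sketch, not the paper's code).

    x: embedded task, y: current solution, z: latent scratchpad.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        # A single tiny network is reused for every update; here a two-layer MLP stand-in.
        self.think = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.revise = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, y, z, n_think: int = 6, n_improve: int = 3):
        for _ in range(n_improve):                # outer loop: repeatedly improve the answer
            for _ in range(n_think):              # inner loop: update the scratchpad z
                z = self.think(torch.cat([x, y, z], dim=-1))
            y = self.revise(torch.cat([y, z], dim=-1))  # rewrite the solution using the notes
        return y, z

# Toy usage: a batch of four embedded puzzles, starting from a blank guess and scratchpad.
model = TinyRecursiveSketch()
x = torch.randn(4, 128)
y = torch.zeros(4, 128)
z = torch.zeros(4, 128)
y, z = model(x, y, z)
```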
During training, TRM receives feedback after each improvement step (deep supervision) and learns to estimate a stop probability that ends the loop once further refinement is unlikely to help. Depending on the task, it uses either simple MLPs (for small, fixed-size grids like Sudoku) or self-attention (for larger structures such as ARC-AGI puzzles).
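A hedged sketch of how such a learned stopping rule could look at inference time, reusing the hypothetical TinyRecursiveSketch above; the halting head, threshold, and step budget are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical halting head: maps the current solution state to a stop probability.
halt_head = nn.Linear(128, 1)

def refine_until_confident(model, x, y, z, max_steps: int = 16, threshold: float = 0.5):
    """Run improvement steps until the learned stop probability clears a threshold."""
    for step in range(max_steps):
        y, z = model(x, y, z)                        # one block of TRM-style refinement
        p_stop = torch.sigmoid(halt_head(y)).mean()  # simplified: average over the batch
        if p_stop > threshold:                       # in training, this head would be
            break                                    # supervised on whether y is correct
    return y, step + 1
```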
What the results do - and don't - mean
TRM shows that small, specialized models can be highly effective on narrow, structured reasoning tasks. It improves its answers incrementally and relies heavily on data augmentation. The paper also stresses that architecture choices - such as preferring MLPs over attention for smaller grids - depend on the dataset, and that on these benchmarks TRM consistently beats much larger general-purpose systems.
However, the findings don't mean large language models are obsolete as a path toward more general capabilities. TRM operates on well-defined grid problems and, because it is not a generative system, it isn't suited for open-ended, text-based, or multimodal tasks.
Instead, it represents a promising building block for reasoning tasks, not a replacement for transformer-based language models. Further experiments adapting TRM to new domains are already underway and could expand its potential applications.
Independent replication and evaluations on the private ARC-AGI datasets held by the ARC Prize Foundation are still pending.