With its new MiMo-7B model, Xiaomi aims to demonstrate that mathematical and programming tasks can be handled effectively by relatively small language models. According to the company, the results are intended to match or even surpass those achieved by larger competitors.
While many current open-source reasoning models use 32 billion parameters, Xiaomi opts for a 7B-parameter architecture and seeks to maximize efficiency through tailored pre- and post-training strategies.
The research team reports that MiMo-7B was pre-trained on roughly 25 trillion tokens, with the goal of exposing the model to reasoning patterns early on. To support this, the team developed new tools for extracting mathematical formulas and code from formats such as HTML and PDF, and applied a three-stage data-mixing process that emphasizes synthetically generated tasks.
During the final pre-training phase, the proportion of math and code data was increased to about 70 percent. The context length was extended to 32,768 tokens to allow the model to process more complex, extended reasoning.
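The staged mixing described above can be sketched as a stage-dependent sampling schedule. The weights below are illustrative assumptions; only the roughly 70 percent math-and-code share in the final stage is stated in the report, and the source category names are hypothetical:

```python
import random

# Hypothetical stage-wise mixture weights (illustrative only).
# Stage 3 reflects the reported ~70% math + code share.
STAGE_MIX = {
    1: {"web": 0.60, "math": 0.15, "code": 0.15, "synthetic": 0.10},
    2: {"web": 0.40, "math": 0.25, "code": 0.25, "synthetic": 0.10},
    3: {"web": 0.10, "math": 0.35, "code": 0.35, "synthetic": 0.20},
}

def sample_source(stage: int, rng: random.Random) -> str:
    """Pick the data source for the next training document."""
    weights = STAGE_MIX[stage]
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

rng = random.Random(0)
batch = [sample_source(3, rng) for _ in range(1000)]
```

In a real pipeline the schedule would also interact with deduplication and quality filtering; the point here is only that the math/code proportion rises sharply in the final stage.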
Another element of the training process is multi-token prediction (MTP), in which the model attempts to anticipate several subsequent tokens at once. This technique is designed to improve accuracy and speed up inference.
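A minimal sketch of the multi-token prediction idea, assuming a tiny stand-in for the transformer trunk: one hidden state per position feeds K separate output heads, one per future offset. All sizes and the greedy-draft step are illustrative, not MiMo's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, K = 100, 32, 3  # K = number of future tokens predicted at once

# One output head per future offset (t+1, t+2, ..., t+K).
W_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(K)]

def mtp_logits(hidden_state: np.ndarray) -> list[np.ndarray]:
    """Return logits for tokens t+1 .. t+K from a single hidden state."""
    return [hidden_state @ W for W in W_heads]

h = rng.standard_normal(HIDDEN)
logits = mtp_logits(h)
draft = [int(np.argmax(step_logits)) for step_logits in logits]
```

During training the extra heads provide additional supervision per position; at inference the multi-token draft can be verified in a single forward pass, speculative-decoding-style, which is where the speedup comes from.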
Reinforcement learning with test-case-based rewards
After pre-training, two versions of the model were further refined using reinforcement learning (RL): MiMo-7B-RL-Zero was trained directly from the base model, while MiMo-7B-RL started from a version that had first undergone supervised fine-tuning (SFT). The training data comprises 130,000 verifiable math and programming tasks.
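"Verifiable" here means the reward can be computed by a rule-based check rather than a learned judge. A minimal sketch for the math case, with a hypothetical `####` answer-marker convention (the actual extraction format is an assumption):

```python
def math_reward(model_output: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted final answer
    matches the reference after light normalization, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "")

    # Hypothetical convention: the final answer follows a '####' marker.
    answer = model_output.rsplit("####", 1)[-1]
    return 1.0 if normalize(answer) == normalize(reference_answer) else 0.0
```

Because the check is exact, such rewards are hard to game but give no partial credit, which motivates the difficulty-weighted scheme used for code tasks.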
Special attention was paid to the reward system for code-related tasks, which uses a “Test Difficulty Driven Reward” to weight individual test cases by difficulty. This approach is intended to address the common issue of sparse rewards, where models receive little feedback for particularly challenging problems.
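One way to realize such a scheme is to give partial credit proportional to the difficulty of the tests a solution passes. This is a simplified sketch of the idea, not the paper's exact formula; the difficulty values would come from, e.g., empirical pass rates:

```python
def difficulty_weighted_reward(passed: list[bool],
                               difficulty: list[float]) -> float:
    """Each test case contributes reward proportional to its difficulty
    weight, so solving only the easy tests yields a small but nonzero
    reward instead of the all-or-nothing 0 of a pass/fail check."""
    total = sum(difficulty)
    earned = sum(d for ok, d in zip(passed, difficulty) if ok)
    return earned / total
```

For example, passing only an easy test (weight 1.0) out of tests weighted 1.0 and 3.0 yields a reward of 0.25, giving the policy a gradient signal even on problems it cannot yet fully solve.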
To improve training stability, an “Easy Data Re-Sampling” method was employed. Tasks that the model already handles well are sampled less frequently, increasing sampling efficiency without distorting training.
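A rough sketch of down-weighting already-solved tasks, assuming per-task pass rates tracked during training. The weighting rule and the floor value are illustrative assumptions, not MiMo's published procedure:

```python
import random

def resample_weights(pass_rates: dict[str, float],
                     floor: float = 0.1) -> dict[str, float]:
    """Weight each task roughly by how often the model still fails it,
    with a small floor so easy tasks are down-weighted, not dropped."""
    return {task: max(1.0 - rate, floor) for task, rate in pass_rates.items()}

def sample_task(pass_rates: dict[str, float], rng: random.Random) -> str:
    weights = resample_weights(pass_rates)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]
```

Keeping a nonzero floor guards against catastrophic forgetting on easy tasks while spending most rollouts where the reward signal is still informative.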
Benchmark results and competitive performance
According to the report, MiMo-7B-RL scores 55.4 on the AIME 2025 math benchmark, 4.7 points higher than OpenAI's o1-mini. On LiveCodeBench v5 it reaches 57.8 percent, well ahead of Alibaba's 32B QwQ-Preview at 41.9 percent. However, Alibaba's recently released Qwen3-30B-A3B achieves 62.6 percent on the same benchmark, and even the much smaller Qwen3-4B surpasses the older 32B model at 54.2 percent. These results position MiMo-7B-RL as a competitive entry in the trend toward smaller, high-performing reasoning models.
The authors also note ongoing challenges. Maintaining a stable balance between math and code capabilities during RL training is difficult, and issues such as unintended language mixing (for example, Chinese output appearing in English-language tasks) remain unresolved.
Xiaomi has published MiMo-7B-Base, MiMo-7B-RL-Zero, and MiMo-7B-RL under an open license on GitHub. The company also sees the project as a methodological contribution, showing that smaller models can make inroads into areas traditionally dominated by larger systems through targeted training strategies.