
The Allen Institute for AI (Ai2) has launched OLMo 3, a new line of fully open AI models. This release includes the first open 32B "thinking" model, designed to make its reasoning process visible while running 2.5 times more efficiently than similar models.


The OLMo 3 family comes in three versions: OLMo 3-Base (7B and 32B), OLMo 3-Think (7B and 32B), and OLMo 3-Instruct (7B). Each model supports a 65,000-token context window, roughly 16 times larger than OLMo 2's.

Ai2 says this is the first time researchers and developers get access to everything from training data to deployment. Every training step, checkpoint, and dataset is open for inspection, and users can trace individual reasoning steps back to the exact data that produced them.

Efficiency gains without sacrificing performance

According to Ai2, the OLMo 3-Base 7B model is trained with 2.5 times the compute efficiency of Meta’s Llama-3.1-8B, measured by GPU hours per token. Despite the efficiency boost, OLMo 3 models are said to achieve performance that rivals much larger systems. OLMo 3 outperforms open competitors like Apertus-70B and SmolLM 3 on reasoning, comprehension, and long-context benchmarks.
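Ai2's "GPU hours per token" metric can be made concrete with a small sketch. The figures below are purely illustrative placeholders, not Ai2's or Meta's published numbers; they only show how a 2.5x efficiency ratio would be computed under this metric:

```python
def gpu_hours_per_token(gpu_hours: float, tokens: float) -> float:
    """Compute cost per training token: total GPU hours / tokens seen."""
    return gpu_hours / tokens

# Hypothetical placeholder values (NOT published figures):
llama_cost = gpu_hours_per_token(1.5e6, 15e12)  # assumed Llama-3.1-8B-style run
olmo_cost = gpu_hours_per_token(0.24e6, 6e12)   # assumed OLMo 3-style run

# A 2.5x efficiency claim means the baseline spends 2.5x the GPU hours
# for every token it trains on.
print(f"relative efficiency: {llama_cost / olmo_cost:.2f}x")
```

Lower GPU hours per token means cheaper training at the same data scale, which is the sense in which Ai2 frames the comparison.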


CEO Ali Farhadi explained that "high performance doesn't have to come at high cost" and that the system demonstrates how "responsible, sustainable AI can scale without compromise." Here's how OLMo 3-Think stacks up on benchmarks:

| Skill | Benchmark | Olmo 3-Think (32B) | Qwen 3 32B | Qwen 3 VL 32B Thinking | Gemma 3 27B Instruct | DeepSeek R1 Distill 32B |
|---|---|---|---|---|---|---|
| Math | MATH | 96.1 ▲ | 95.4 | 96.7 | 87.4 | 92.6 |
| | AIME 2024 | 76.8 | 80.8 | 86.3 | 28.9 | 70.3 |
| | AIME 2025 | 72.5 | 70.9 | 78.8 | 22.9 | 56.3 |
| | OMEGA | 50.8 ▲ | 47.7 | 50.8 | 24.0 | 38.9 |
| Reasoning | BigBenchHard | 89.8 ▲ | 90.6 | 91.1 | 82.4 | 89.7 |
| | ZebraLogic | 76.0 | 88.3 | 96.1 | 24.8 | 69.4 |
| | AGI Eval English | 88.2 | 90.0 | 92.2 | 76.9 | 88.1 |
| Coding | HumanEvalPlus | 91.4 ▲ | 91.2 | 90.6 | 79.2 | 92.3 |
| | MBPP+ | 68.0 | 70.6 | 66.2 | 65.7 | 70.1 |
| | LiveCodeBench v3 | 83.5 | 90.2 | 84.8 | 39.0 | 79.5 |
| IF | IFEval | 89.0 ★ | 86.5 | 85.5 | 85.4 | 78.7 |
| | IFBench | 47.6 | 37.3 | 55.1 | 31.3 | 23.8 |
| Knowledge & QA | MMLU | 85.4 | 88.8 | 90.1 | 74.6 | 88.0 |
| | PopQA | 31.9 ▲ | 30.7 | 32.2 | 30.2 | 26.7 |
| | GPQA | 58.1 | 67.3 | 67.4 | 45.0 | 61.8 |
| Chat | AlpacaEval 2 LC | 74.2 | 75.6 | 80.9 | 65.5 | 26.2 |
| Safety | Safety | 68.8 | 69.0 | 82.7 | 68.6 | 63.6 |

(★ indicates Olmo won the category; ▲ indicates Olmo is within 2.0 points of the top score. Additional comparisons are available in the full report.)
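The ★/▲ markers follow a simple rule: ★ for an outright category win, ▲ for a score within 2.0 points of the top. A minimal sketch of that logic, applied to the MATH and IFEval rows from the table above:

```python
def marker(scores: dict, model: str, margin: float = 2.0) -> str:
    """Return '★' if `model` holds the single top score, '▲' if it is
    within `margin` points of the top score, '' otherwise."""
    top = max(scores.values())
    score = scores[model]
    if score == top and sum(1 for s in scores.values() if s == top) == 1:
        return "★"
    return "▲" if top - score <= margin else ""

# Scores copied from the table's MATH and IFEval rows.
math_row = {"Olmo 3-Think 32B": 96.1, "Qwen 3 32B": 95.4,
            "Qwen 3 VL 32B": 96.7, "Gemma 3 27B": 87.4, "DeepSeek R1 32B": 92.6}
ifeval_row = {"Olmo 3-Think 32B": 89.0, "Qwen 3 32B": 86.5,
              "Qwen 3 VL 32B": 85.5, "Gemma 3 27B": 85.4, "DeepSeek R1 32B": 78.7}

print(marker(math_row, "Olmo 3-Think 32B"))    # ▲ (0.6 points behind Qwen 3 VL)
print(marker(ifeval_row, "Olmo 3-Think 32B"))  # ★ (top score)
```

Note that a tie at the top (as in the OMEGA row) yields ▲ rather than ★, matching the table's markings.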

Bringing transparency to reasoning models

OLMo 3-Think is the first fully open model to generate explicit, step-by-step reasoning chains. Until now, this kind of visible logic was limited to closed systems like OpenAI’s o1 series. With OLMo 3, users can see exactly how the model reaches its conclusions and follow the entire process from data to output. The new models are available for testing in the Ai2 Playground.

Most so-called open-source models only release their weights, keeping their datasets and training process private. These are really "open weights" models, offering only partial transparency. The best open-weight reasoning models, like Kimi K2 Thinking from Moonshot AI, have mostly come from China. OLMo 3 goes further by opening up the full pipeline.

Open tools for custom training and evaluation

OLMo 3 is trained on the Dolma 3 dataset, which contains six trillion tokens from web content, scientific papers, and code. Ai2 also released the Dolci Suite for fine-tuning reasoning skills and OLMES for reproducible model evaluation.


All models are released under the Apache 2.0 license and are available on Hugging Face and in the Ai2 Playground. Teams can fine-tune these models for new domains, experiment with different training goals, or build on the published checkpoints.

Earlier this year, Ai2’s OLMo 2 32B matched the performance of commercial models like GPT-4o mini while using only about a third of the compute resources. OLMo 3 continues this work, focusing on further improvements in openness, efficiency, and transparency.

Summary
  • The Allen Institute for AI has launched OLMo 3, a new generation of fully transparent AI models, including the first openly available 32B model that reveals its reasoning processes.
  • OLMo 3 offers complete transparency on training steps, checkpoints, and datasets.
  • Built on the Dolma 3 dataset, the models are openly licensed and can be accessed immediately on Hugging Face and the Ai2 Playground.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.