With Nemotron 3 Nano Omni, Nvidia reveals what really goes into a modern multimodal model

Key Points

  • Nvidia has released Nemotron 3 Nano Omni, an open AI model that processes text, images, video, and audio and is built for agentic applications.
  • Training involved 717 billion tokens. Much of the synthetic training data comes from competing models like Qwen, gpt-oss, and DeepSeek-OCR.
  • Along with the model weights, Nvidia is also releasing parts of the training data and pipelines. The model is cleared for commercial use.

Nvidia has released Nemotron 3 Nano Omni, an open multimodal model that handles text, images, video, and audio. The interesting part isn't just the performance but the training data, which draws on models like Qwen, gpt-oss, Kimi, and DeepSeek-OCR.

Nemotron 3 Nano Omni processes text, images, video, and audio in a single architecture. The 30-billion-parameter model uses a Mamba-Transformer hybrid with Mixture-of-Experts and activates about three billion parameters per query. It relies on Nvidia's own C-RADIOv4-H vision encoder and the Parakeet-TDT audio encoder, and supports a context window of up to 256,000 tokens. The only officially supported language is English.
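To put the Mixture-of-Experts figures in perspective, here is a minimal sketch of the arithmetic. It assumes the rounded numbers quoted above (30 billion total, about three billion active); the function name is illustrative and not part of any Nvidia tooling.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters that are active for a single query."""
    return active_params / total_params


# Rounded figures from the article, not exact parameter counts.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active share: {fraction:.0%}")
```

With these rounded inputs, roughly a tenth of the weights take part in any one forward pass, which is where sparse models get their compute savings relative to a dense model of the same size.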

According to the technical report, Nemotron 3 Nano Omni is built mainly for agentic applications: document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks like OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model beats its predecessor, Nemotron Nano V2 VL, and goes toe-to-toe with Alibaba's Qwen3-Omni. On OSWorld, a benchmark for GUI agents, the score rises from 11.1 for the previous version to 47.4. Nvidia says throughput at the same interactivity level is up to nine times higher than Qwen3-Omni's.

How rival models shaped the training data

Beyond the benchmarks, the technical report goes into detail on the training data, the kind of detail you only get with a true open-source release. Nvidia processed roughly 717 billion tokens across seven training stages, with the context window expanding at each step.

A big chunk of the synthetic training data comes from competing models. Image captions, question-answer pairs, and reasoning traces were generated using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen2.5-VL-72B-Instruct, OpenAI's gpt-oss-120b, Kimi-K2.5, GLM-4.1V-9B-Thinking, and DeepSeek-OCR. Nvidia also pulled in GPT-4o and Gemini 3 Flash Preview to handle filtering.

Using other models to train new ones is common practice across the industry, though most developers aren't this upfront about it. Companies like OpenAI, Anthropic, and Google have repeatedly accused Chinese AI labs of large-scale distillation efforts.

The audio data includes Nvidia's own Granary and SIFT-50M datasets, along with captions from Qwen's Omni-Captioner. For the reinforcement learning stage, the team built a five-stage pipeline spanning 25 environments, covering tasks like visual grounding, chart and document understanding, GUI clicks, and automatic speech recognition.

Along with the weights in BF16, FP8, and NVFP4, Nvidia is releasing parts of the training data, the training pipelines on Megatron-Bridge, and the RL recipes on NeMo-RL. That sets this release apart from projects that only ship weights. Reasoning mode is on by default, so users have to turn it off manually for tasks that don't need chain-of-thought. The model ships under the NVIDIA Open Model Agreement, which allows commercial use.
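As a rough guide to what the three weight formats mean in practice, the following sketch estimates the storage needed for the weights alone. It assumes a nominal 30 billion parameters and 16, 8, and 4 bits per weight; it ignores scaling-factor overhead in the quantized formats, any layers kept at higher precision, and runtime memory such as activations and caches, so actual checkpoint sizes will differ.

```python
# Nominal bits per weight for each released format (assumption: no per-block
# scale overhead and no mixed-precision layers are counted).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

NOMINAL_PARAMS = 30e9  # rounded figure from the article


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


estimates = {
    name: weight_gigabytes(NOMINAL_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gigabytes in estimates.items():
    print(f"{name}: about {gigabytes:.0f} GB")
```

Each halving of precision roughly halves the weight footprint, which is the main reason lower-precision checkpoints matter for deployment on smaller GPUs.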
