Nvidia has unveiled Cosmos, a platform built around video-based "world models" that generate training data for robots and self-driving cars.
According to Nvidia CEO Jensen Huang, robotics may be approaching its "ChatGPT moment" thanks to these world models - a breakthrough he likens to the impact of large language models.
These models, which Nvidia calls World Foundation Models (WFMs), are trained on a massive dataset: 9,000 trillion tokens drawn from 20 million hours of video covering real-world human interactions, industrial processes, robotics, and driving.
They can generate physics-based videos from various inputs - text, images, video, and data from robot sensors or motion. Nvidia says it has specifically optimized these models to handle physical interactions, maintain object consistency, and create simulated industrial and driving environments.
The idea is that developers could generate simulations and realistic training data without expensive real-world testing, accelerating the development and real-world deployment of robots.
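To make that workflow concrete, here is a minimal sketch of what requesting a synthetic training clip from a text prompt might look like. The endpoint, the `generate_world_video` helper, and its parameters are hypothetical illustrations, not Nvidia's published Cosmos API; the preview on Nvidia's API catalog will define the actual interface.

```python
import requests

# Hypothetical endpoint and payload for illustration only; Nvidia's actual
# Cosmos preview API may use different routes, fields, and auth.
API_URL = "https://example-cosmos-endpoint/v1/generate"  # placeholder
API_KEY = "YOUR_API_KEY"

def generate_world_video(prompt: str, num_frames: int = 121) -> bytes:
    """Request a physics-based video clip for a text prompt (sketch)."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=300,
    )
    response.raise_for_status()
    return response.content  # e.g. an MP4 byte stream

if __name__ == "__main__":
    clip = generate_world_video(
        "A warehouse robot arm picks a box from a conveyor belt"
    )
    with open("synthetic_training_clip.mp4", "wb") as f:
        f.write(clip)
```

A pipeline like this could, in principle, be looped over thousands of prompts describing rare scenarios to build out a synthetic dataset without real-world capture.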
The processing pipeline running on Nvidia's Blackwell platform can chew through 20 million hours of video in two weeks, compared to 40 days on Hopper GPUs or three years on CPUs, according to Nvidia. Its Cosmos Tokenizer (GitHub) achieves eight times higher compression on image and video data and runs twelve times faster than previous solutions.
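The tokenizer's GitHub repository shows a Python interface along these lines. The sketch below follows that pattern, but the checkpoint name and paths are assumptions to verify against the repository's README, which lists several variants with different compression rates.

```python
import torch
# Interface from the Cosmos Tokenizer GitHub repository; verify against its README.
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

# Checkpoint name and directory layout are assumptions based on the repo's examples.
model_name = "Cosmos-Tokenizer-CV4x8x8"
encoder = CausalVideoTokenizer(checkpoint_enc=f"checkpoints/{model_name}/encoder.jit")
decoder = CausalVideoTokenizer(checkpoint_dec=f"checkpoints/{model_name}/decoder.jit")

# Dummy clip: batch x channels x frames x height x width, bfloat16 on GPU.
video = torch.randn(1, 3, 9, 512, 512, dtype=torch.bfloat16, device="cuda")

(latent,) = encoder.encode(video)        # compress the clip into a latent representation
reconstruction = decoder.decode(latent)  # decode back to pixel space
print(latent.shape, reconstruction.shape)
```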
Several companies are already testing the waters, according to Nvidia. Uber is exploring the technology for autonomous vehicle development. Robotics companies such as 1X, Agile Robots, and Figure AI, along with autonomous vehicle software developers Waabi and Wayve, plan to use the platform for training, safety testing, surfacing edge and corner cases in driving scenarios, and more.
Availability and access
Nvidia offers the models in three main variants. The "Nano" version contains 4 billion parameters and is designed as a lightweight option for real-time applications running directly on devices. The "Super" variant serves as a general-purpose base model for a wide range of applications. The "Ultra" version, packing 14 billion parameters, offers the highest level of detail and is primarily intended for training smaller specialized models, a process known as distillation.
Alongside these three main variants, Nvidia is also including two specialized models: a 12-billion-parameter upsampling model for refining text prompts, and a 7-billion-parameter video decoder specifically optimized for AR applications.
The Cosmos models are available now as a preview in Nvidia's API catalog, with the full family accessible through the NGC catalog and Hugging Face. Nvidia has released all models under its Open Model License, which permits commercial use of the technology.
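Since the models are listed on Hugging Face, fetching a checkpoint can be as simple as a standard `huggingface_hub` download. The repo id below is an assumption based on Nvidia's Cosmos naming, so check Nvidia's Hugging Face organization for the exact model card.

```python
from huggingface_hub import snapshot_download

# The repo id is an assumed example; browse Nvidia's Hugging Face
# organization for the exact Cosmos model you want.
path = snapshot_download(
    repo_id="nvidia/Cosmos-1.0-Diffusion-7B-Text2World",
    local_dir="checkpoints/Cosmos-1.0-Diffusion-7B-Text2World",
)
print(f"Model files downloaded to {path}")
```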
Developers can deploy the models through Nvidia's DGX Cloud service, while enterprise customers can access them through the Nvidia AI Enterprise software platform. Nvidia has also announced that the models will soon be available as optimized Nvidia NIM microservices.
Critics question video-based world models
While tech giants Google DeepMind and OpenAI, along with startups like Wayve, are pursuing the idea that video models can serve as effective world models by combining images, audio, and text, the concept has faced significant criticism.
A recent study from ByteDance Research and Tsinghua University argues that current video AI models, including OpenAI's Sora, fail to grasp fundamental laws of physics. Instead of learning generally valid rules, the researchers found that these models merely respond to superficial characteristics of the training data, such as color patterns.
Meta's chief AI scientist Yann LeCun has taken an even stronger stance against this approach, describing the attempt to predict the world through pixel generation as "wasteful and doomed to failure."