Researchers at DeepMind have developed Genie, a model that turns images into playable worlds and moves video game characters through them on its own. It sounds like a gimmick, but it could be the basis for something much bigger.

"What if, given a large corpus of videos from the Internet, we could not only train models capable of generating novel images or videos, but entire interactive experiences?"  That's the question researchers at Google DeepMind asked themselves as they developed their new AI model, Genie (Generative Interactive Environments).

Video: Google DeepMind

Genie can transform various types of image prompts into virtual worlds and plausibly move game characters within them. At first glance, this is most interesting for video games, but the researchers believe their approach with Genie could be an important step toward world models for robotics applications.

Foundation model for 2D platformers

In its largest form, Genie is an 11-billion-parameter AI model with the properties of a foundation model for 2D platformers: given a visual input the model has never seen and a human-specified action roughly equivalent to pressing a gamepad button, Genie generates a virtual world in which that action is carried out.
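
To make that interaction loop concrete, here is a minimal sketch in Python/PyTorch of how such a model could be driven at inference time. The `play` function and the `tokenizer` and `dynamics` objects are hypothetical stand-ins for Genie's components (described below), not DeepMind's actual API.

```python
# Hypothetical interaction loop for a Genie-style model.
import torch

def play(tokenizer, dynamics, prompt_image, actions):
    """Roll out a playable world from a single image prompt.

    prompt_image: (C, H, W) image tensor.
    actions: iterable of ints, each a "button press" drawn from the
             model's small discrete action vocabulary.
    """
    # Encode the unseen prompt image into discrete frame tokens.
    tokens = tokenizer.encode(prompt_image.unsqueeze(0))
    frames = [prompt_image]
    for a in actions:
        # The dynamics model predicts the next frame's tokens from
        # the token history plus the user-chosen latent action.
        tokens = dynamics.next_tokens(tokens, torch.tensor([a]))
        frames.append(tokenizer.decode(tokens).squeeze(0))
    return torch.stack(frames)  # (1 + len(actions), C, H, W)
```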

Image: Google DeepMind

The actor itself, whether a sword-wielding hero or a ball in a hand-drawn sketch, is not fixed: through training, the model has learned which elements in an image typically perform actions, and it moves them accordingly. Another interesting observation: Genie even reproduces the parallax effect, in which foreground and background move at different speeds.

Image: Google DeepMind

Unlabeled gameplay videos from the internet as training material

What sets the model apart is that it learns exclusively from video; it receives no additional information, such as gamepad inputs, during training. The basis was a collection of originally 200,000 hours of freely available gameplay videos from the internet, which the researchers filtered down to 30,000 hours of 2D platformer footage.

Genie consists of three components: a video tokenizer that converts frames into discrete tokens, a latent action model that infers the action taken between consecutive frames, and a dynamics model that predicts the next frame of the video. For the latent action model, the team limits the predicted actions to a small, discrete set of codes, which keeps the model playable by humans and further improves controllability.
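
That discrete action bottleneck is the part that makes Genie playable, and it can be sketched in a few lines. The code below is an illustrative simplification, with made-up layer sizes and a `LatentActionModel` name of our own; only the core idea, quantizing frame-to-frame transitions against a tiny codebook of action embeddings, comes from the paper. The straight-through gradient trick needed to actually train such a quantizer is omitted.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Infers a discrete action code from two consecutive frames.

    Quantizing to a tiny codebook (here 8 actions) is what keeps the
    model playable: a human can map each code to a button.
    """
    def __init__(self, embed_dim=256, num_actions=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # VQ codebook: one learned embedding per latent action.
        self.codebook = nn.Embedding(num_actions, embed_dim)

    def forward(self, frame_t, frame_t1):
        # Encode the frame pair (RGB channels concatenated) jointly.
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        # Nearest codebook entry gives the discrete action index.
        dists = torch.cdist(z, self.codebook.weight)
        action = dists.argmin(dim=-1)
        return action, self.codebook(action)
```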

Image: Google DeepMind

Genie uses "spatiotemporal (ST) transformers" for its components. As is often the case with transformers, the team found that Genie's performance improved as the number of parameters increased.

Genie's framework works beyond 2D platformers

The team also trained a smaller, 2.5-billion-parameter model on videos of robotic arms. Here, too, Genie demonstrated an impressive ability to imagine a coherent environment and carry out inputs such as specific motion sequences. It even simulates the deformation of objects.

Video: Tim Rocktäschel/X

The team believes the experiment shows that the method underlying Genie can be used to train a foundation model for robotics on larger video datasets. This could yield simulations that can be controlled at a low level and used for a variety of applications, such as training robotic agents.

Researchers exercise caution

Genie thus lays an important foundation for further research. Thanks to its general approach, the model could be trained on an even larger volume of internet videos to simulate more realistic and coherent environments. For now, the model's weak points are a very limited memory of 16 frames and a generation speed of only one frame per second.

Genie could enable a large amount of people to generate their own game-like experiences. This could be positive for those who wish to express their creativity in a new way, for example children who could design and step into their own imagined worlds. We also recognize that with significant advances, it will be critical to explore the possibilities of using this technology to amplify existing human game generation and creativity—and empowering relevant industries to utilize Genie to enable their next generation of playable world development.

From the paper

As with OpenAI's Sora, Google DeepMind has chosen not to release Genie's code or model weights.

Summary
  • Google DeepMind has unveiled Genie, a model that can create a virtual world from a single image and plausibly move game characters within it.
  • Genie shows the characteristics of a foundation model for 2D platformers, and in an experiment the method also transferred to robotics, where it could be used to train robotic agents.
  • To ensure responsible and safe development, DeepMind has not published the model.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.