
Deepmind's SIMA can perform tasks in different video game worlds, such as Valheim or No Man's Sky, using only text prompts.

Google Deepmind researchers introduce SIMA (Scalable Instructable Multiworld Agent), an AI agent for 3D video game environments that can translate natural language instructions into actions.

SIMA was trained and tested in collaboration with eight game studios across nine different video games, including No Man's Sky, Valheim, and Teardown.

Video: Google Deepmind

The Deepmind team trained SIMA on game recordings in which one player either gave instructions to another player or narrated their own gameplay. The team then linked these instructions to the corresponding in-game actions.

The agent is trained primarily through behavioral cloning: it imitates the actions human players performed in the collected data while following the accompanying language instructions.

In this way, the agent learns to make connections between the language descriptions, visual impressions, and corresponding actions.
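As a rough illustration of this objective, here is a minimal behavioral-cloning training step in PyTorch. This is not Deepmind's code: the embedding sizes, the discretized action set, and the small policy network are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 32  # assumed size of a discretized keyboard/mouse action set

class Policy(nn.Module):
    """Maps an (observation, instruction) embedding pair to action logits."""
    def __init__(self, obs_dim=512, text_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_ACTIONS),
        )

    def forward(self, obs_emb, text_emb):
        return self.net(torch.cat([obs_emb, text_emb], dim=-1))

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch of (frame, instruction, human action).
obs_emb = torch.randn(8, 512)                  # stand-ins for encoded frames
text_emb = torch.randn(8, 512)                 # stand-ins for encoded instructions
human_actions = torch.randint(0, NUM_ACTIONS, (8,))

optimizer.zero_grad()
logits = policy(obs_emb, text_emb)
loss = loss_fn(logits, human_actions)          # imitate the demonstrated action
loss.backward()
optimizer.step()
```

The key point of behavioral cloning is visible in the loss: the agent is rewarded only for reproducing what the human demonstrator did, conditioned on what was seen and what was said.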

Google Deepmind's SIMA uses pre-trained models and learns from humans

The core of the SIMA agent consists of several components that work together to convert visual input (what the agent "sees") and language input (the instructions it receives) into actions (keyboard and mouse commands).

Image: Google Deepmind

Image and text encoders are responsible for translating the visual and language input into a form that the agent can process. This is done using pre-trained models that already have a comprehensive understanding of images and text.
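To make this concrete, the sketch below uses CLIP as a stand-in for such pre-trained encoders. SIMA's actual encoders are not public, so the model choice, the placeholder frame, and the example instruction are all assumptions; the sketch only shows how a game frame and a text instruction become embeddings an agent can process.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic pre-trained vision-language encoders (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.new("RGB", (224, 224))   # placeholder for a captured game frame
instruction = "chop down the tree"     # example natural language instruction

inputs = processor(text=[instruction], images=frame,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

print(image_emb.shape, text_emb.shape)  # e.g. torch.Size([1, 512]) each
```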

A transformer model integrates the information from the encoders and past actions to form an internal representation of the current state. A special memory mechanism helps the agent to remember previous actions and their results, which is crucial for understanding multi-step tasks.

Finally, the agent uses this state representation to decide which actions to perform next. These actions are keyboard and mouse commands executed in the virtual environment.
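The following sketch shows how these two steps could fit together: a transformer attends over the instruction embedding, recent frame embeddings, and embeddings of past actions (the memory over previous steps), and an action head turns the resulting state into logits over keyboard/mouse commands. The dimensions, context length, and layer counts are assumptions, not SIMA's published configuration.

```python
import torch
import torch.nn as nn

D, NUM_ACTIONS, CONTEXT = 512, 32, 8   # assumed sizes

class AgentCore(nn.Module):
    """Integrates instruction, frames, and past actions into a state,
    then decodes the next action from that state."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.action_embed = nn.Embedding(NUM_ACTIONS, D)  # memory of past actions
        self.action_head = nn.Linear(D, NUM_ACTIONS)

    def forward(self, text_emb, frame_embs, past_actions):
        # text_emb: (B, D); frame_embs: (B, T, D); past_actions: (B, T) ints
        tokens = torch.cat(
            [text_emb.unsqueeze(1), frame_embs, self.action_embed(past_actions)],
            dim=1,
        )
        state = self.transformer(tokens)        # internal state representation
        return self.action_head(state[:, -1])   # logits for the next action

core = AgentCore()
logits = core(torch.randn(1, D),
              torch.randn(1, CONTEXT, D),
              torch.randint(0, NUM_ACTIONS, (1, CONTEXT)))
next_action = logits.argmax(dim=-1)  # id of the next keyboard/mouse command
```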

SIMA does not require access to the game's source code, only screen images and natural language instructions. The agent interacts with the virtual environment via keyboard and mouse and is therefore potentially compatible with any virtual environment.
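In principle, that interface is nothing more than a screen-capture-and-input loop. The sketch below uses the generic pyautogui library to make the idea concrete; this is an assumption about tooling, not SIMA's actual stack, and `agent_act` is a hypothetical placeholder for the trained policy.

```python
import time
import pyautogui  # generic OS-level screen capture and keyboard control

ACTION_KEYS = ["w", "a", "s", "d", "space"]  # assumed discretized action set

def agent_act(frame):
    # Placeholder for the trained policy; here it always walks forward.
    return "w"

for _ in range(10):                 # a few agent steps
    frame = pyautogui.screenshot()  # observation: raw screen pixels only
    key = agent_act(frame)
    pyautogui.press(key)            # action: a keyboard command
    time.sleep(0.1)
```

Because the loop touches only pixels and input events, nothing in it is specific to one game, which is what makes the approach potentially portable across virtual environments.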

SIMA masters 600 skills

In tests, SIMA mastered 600 basic skills such as navigation, object interaction, and menu control. The team expects future agents to be able to perform complex strategic planning and multifaceted tasks.

Image: Google Deepmind

SIMA differs from other AI systems for video games in that it takes a broad approach, learning in a variety of environments rather than focusing on one or a few specific tasks.

Research shows that an agent trained in many games performs better than an agent specialized in a single game. In addition, SIMA integrates pre-trained models to take advantage of existing knowledge about language and visual perception, and combines this with specific training data from the 3D environments.

The team hopes that this research will contribute to the development of a new generation of general-purpose, language-driven AI agents. With more sophisticated models, projects like SIMA could one day achieve complex goals and become useful on the Internet and in the real world.

Summary
  • Google Deepmind introduces SIMA (Scalable Instructable Multiworld Agent), an AI agent that can translate natural language instructions into actions in 3D video game environments such as Valheim or No Man's Sky.
  • Developed in collaboration with eight game studios and nine different video games, SIMA focuses on behavioral cloning to make connections between language instructions, visual impressions, and corresponding actions.
  • In tests, SIMA has mastered 600 basic skills, and the Deepmind team expects future agents to handle complex strategic planning and multifaceted tasks, which could contribute to a new generation of general-purpose, language-driven AI agents.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.