Content
summary Summary

Foundation models have dominated AI research over the past few years. Now, Microsoft is unveiling an Interactive Agent Foundation Model designed to perform better in the virtual and real world.

In their new work, researchers from Microsoft Research, Stanford University, and the University of California present the Interactive Agent Foundation Model, which has been trained for a variety of applications using text, image, and action data. The team uses a unified framework that combines different pre-training strategies for image, text, and action. With this work, the team aims to demonstrate the feasibility of such a versatile and adaptable AI framework and test it in three domains: robotics, game AI, and healthcare.

Image: Durante, Sarkar, Gong et al.

The 277 million parameter model was pre-trained with 13.4 million video frames. These included a variety of robotics and gaming tasks, including Minecraft, with the data including text descriptions and robot action tokens along with the videos. In tests, the model demonstrated its ability to effectively model actions in a variety of domains, such as controlling a robot or predicting actions in Minecraft. The team has also shown that the model can be used in domains such as healthcare, where it was fine-tuned with additional video and text data and its performance benefited from the other data sources.

Microsoft's "Interactive Agent Foundation Model" as an embodied agent

The work can be seen as a contribution to basic research on the path from static, task-specific models to dynamic, agent-based systems, and is reminiscent of work such as Deepmind's GATO. The researchers emphasize the need to generate dynamic behavior that is grounded in an understanding of environmental contexts. To this end, they also define a new paradigm for embodied agents:

Ad
Ad

We define the embodied agent paradigm as “any intelligent agent capable of autonomously taking suitable and seamless action based on sensory input, whether in the physical world or in a virtual or mixed-reality environment representing the physical world.”

Image: Durante, Sarkar, Gong et al.

Importantly, an embodied agent is conceptualized as a member of a collaborative system in which it communicates with humans and performs a variety of actions based on human needs. In this way, according to the team, embodied agents can facilitate tedious tasks in the virtual and physical world.

The researchers plan to make their code and models publicly available soon.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers from Microsoft Research, Stanford University, and the University of California present the Interactive Agent Foundation Model, an AI framework for text, image, and action data in diverse applications such as robotics, game AI, and healthcare.
  • The model, with 277 million parameters, was trained on 13.4 million video frames and demonstrates capabilities in controlling robots and predicting actions in games such as Minecraft.
  • The researchers are defining a new paradigm for “embodied agents” that can autonomously perform appropriate and seamless actions based on sensory input in both the physical and virtual worlds.
Sources
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.