Summary

Dynalang is an AI agent that understands language and its environment by predicting the future with a multimodal world model.

A major challenge in AI research is to enable AI agents, such as robots, to communicate naturally with humans. Today's agents, such as Google's PaLM-SayCan, understand simple commands like "get the blue block". But they struggle with more complex language situations, such as knowledge transfer ("the top left button turns off the TV"), situational information ("we're running out of milk"), or coordination ("the living room has already been vacuumed").

For example, when an agent hears "I put the bowls away," it should respond differently depending on the task: If it is washing dishes, it should move on to the next cleaning step; if it is serving dinner, it should go get the bowls.

In a new paper, UC Berkeley researchers hypothesize that language can help AI agents anticipate the future: what they will see, how the world will react, and what situations are important. With the right training, this could create an agent that learns a model of its environment through language and responds better in those situations.


Dynalang relies on token and image prediction in DeepMind's DreamerV3

The team has developed Dynalang, an AI agent that learns a model of the world from visual and textual input. It is based on Google DeepMind's DreamerV3: it condenses the multimodal inputs into a shared representation and is trained to predict future representations based on its actions.

The approach is similar to training large language models, which learn to predict the next token in a sentence. What makes Dynalang different is that the agent learns by predicting future text as well as visual observations and rewards. This also sets it apart from most other reinforcement learning approaches, which typically predict only optimal actions.

While learning the world model, Dynalang compresses image and text observations into a latent representation. The model is trained to predict the next representation and to reconstruct the observations from it. During policy learning, rollouts are sampled from the world model and the policy is trained to maximize predicted rewards. | Image: Lin et al.
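To make the two training phases concrete, here is a minimal numpy sketch of the idea: encode image and text into a latent, train the dynamics to predict the next latent and a decoder to reconstruct the observations, then score actions by rolling the learned dynamics forward in "imagination". All names, sizes, and the linear stand-ins for neural networks are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; the real model uses learned neural networks.
OBS_DIM, TXT_DIM, LATENT_DIM, ACT_DIM = 8, 4, 6, 2

# Hypothetical linear stand-ins for the encoder, dynamics model, and decoder.
W_enc = rng.normal(size=(OBS_DIM + TXT_DIM, LATENT_DIM)) * 0.1
W_dyn = rng.normal(size=(LATENT_DIM + ACT_DIM, LATENT_DIM)) * 0.1
W_dec = rng.normal(size=(LATENT_DIM, OBS_DIM + TXT_DIM)) * 0.1
W_rew = rng.normal(size=LATENT_DIM) * 0.1  # reward head

def encode(image, text):
    # Compress one image frame plus one text-token embedding into a latent.
    return np.concatenate([image, text]) @ W_enc

def predict_next(latent, action):
    # Dynamics: predict the next latent from the current latent and action.
    return np.concatenate([latent, action]) @ W_dyn

def reconstruct(latent):
    # Decode the latent back into (image, text) observations.
    out = latent @ W_dec
    return out[:OBS_DIM], out[OBS_DIM:]

# --- World-model objective on one transition ---
img_t, txt_t = rng.normal(size=OBS_DIM), rng.normal(size=TXT_DIM)
img_t1, txt_t1 = rng.normal(size=OBS_DIM), rng.normal(size=TXT_DIM)
action = rng.normal(size=ACT_DIM)

z_t, z_t1 = encode(img_t, txt_t), encode(img_t1, txt_t1)
pred_loss = np.mean((predict_next(z_t, action) - z_t1) ** 2)
rec_img, rec_txt = reconstruct(z_t)
rec_loss = np.mean((rec_img - img_t) ** 2) + np.mean((rec_txt - txt_t) ** 2)
world_model_loss = pred_loss + rec_loss

# --- Policy learning: imagined rollout inside the world model ---
def imagined_return(z, policy, horizon=5):
    # Roll the learned dynamics forward with policy actions; the real
    # environment is never queried, only predicted rewards are summed.
    total = 0.0
    for _ in range(horizon):
        z = predict_next(z, policy(z))
        total += float(z @ W_rew)
    return total

ret = imagined_return(z_t, lambda z: np.tanh(z[:ACT_DIM]))
```

The key design point the sketch captures is that the policy is optimized against the model's predictions rather than against real environment steps, which is what allows language in the latent to influence behavior.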

According to the team, Dynalang extracts relevant information from text and learns multimodal associations. For example, if the agent reads, "The book is in the living room," and later sees the book there, the agent will correlate the language and visuals through their impact on its predictions.

The team evaluated Dynalang in a number of interactive environments with different language contexts. These included a simulated home environment, where the agent receives cues about future observations, dynamics, and corrections to perform cleaning tasks more efficiently; a gaming environment; and realistic 3D house scans for navigation tasks.

Dynalang can also learn from web data

Across all tasks, Dynalang learned to use language and image prediction to improve its performance, often outperforming specialized AI architectures. The agent can also generate text and read manuals to learn new games. The team also shows that the architecture allows Dynalang to be pretrained on offline data without actions or rewards, i.e. text and video that the agent did not collect by exploring an environment itself. In one test, the researchers pretrained Dynalang on a small dataset of short stories, which improved the agent's performance.
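The reason action-free, reward-free data can still train the model is that the prediction objective does not strictly require actions or rewards: those terms can simply be masked out for offline samples. A minimal sketch of that masking idea, with an identity-plus-action stand-in for the dynamics and entirely hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6  # toy latent/action size, illustrative only

def world_model_losses(z, z_next, action, reward, has_action_reward):
    # Hypothetical per-sample losses. For offline text/video data that was not
    # collected by an acting agent, the action is treated as zero and the
    # reward term is skipped; only future prediction trains the model.
    a = action if has_action_reward else np.zeros(D)
    pred = z + a  # stand-in dynamics (identity plus action effect)
    losses = {"prediction": float(np.mean((pred - z_next) ** 2))}
    if has_action_reward:
        losses["reward"] = float((reward - z.mean()) ** 2)
    return losses

# Online RL sample: prediction and reward losses are both active.
online = world_model_losses(rng.normal(size=D), rng.normal(size=D),
                            rng.normal(size=D), 1.0, True)

# Offline text sample (e.g. a short story): no action, no reward logged.
offline = world_model_losses(rng.normal(size=D), rng.normal(size=D),
                             None, None, False)
```

Because the same prediction loss applies to both kinds of data, web-scale text and video can in principle be mixed into the same training stream as interactive experience.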


Video: Lin et al.

"The ability to pretrain on video and text without actions or rewards suggests that Dynalang could be scaled to large web datasets, paving the way towards a self-improving multimodal agent that interacts with humans in the world."

The team cites the architecture as a limitation: it is not well suited to certain long-horizon environments. In addition, the quality of the generated text is still far below that of large language models, though it could approach it in the future.

More information and the code are available on the Dynalang project page.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • UC Berkeley researchers are developing Dynalang, an AI agent that uses visual and textual input to build a model of the world and act in test environments.
  • Based on Google DeepMind's DreamerV3, Dynalang learns to predict future text, images, and rewards to develop a better understanding of language grounded in the world.
  • Dynalang has been tested in simulated home environments, game environments, and 3D house scans, and could improve robots in the future.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.