Google's new robot system relies on a language model for control. Thanks to inner monologues, the system can interact more flexibly with its environment.
Flexible robots that can one day perform many different tasks in the real world will first need a large repertoire of basic skills and the ability to plan how to use them. That includes recognizing when to change course because an individual action, or an entire plan, is failing.
Such planning, continuous perceptual feedback, and control of the system at all levels are some of the subtasks an embodied agent must seamlessly combine to act intelligently in its environment. AI researchers tackle these challenges with a variety of approaches, many of which rely on reinforcement learning to teach robots how to move. Planning and flexible replanning remain the harder problems.
Meta's AI chief Yann LeCun, for example, unveiled his plans for an autonomous AI in March. While it is not meant to be embodied in a robot (for now), it otherwise contains all the building blocks for a flexible agent that can plan.
Large language models for embodied agents
Central to LeCun's architecture is a world model, a component meant to give the AI system a basic understanding of how the world works. Such world models do not yet exist.
One reason to believe they are possible has come in recent years from large language models. These models can generate and process text, but they have also absorbed a wealth of knowledge about the world from training on gigantic amounts of text. In some cases they even show a rudimentary, though not robust, ability to reason, as in Google's PaLM experiments.
AI researchers from Google's robotics division, among others, are therefore asking: can language models serve as reasoning models that combine multiple feedback sources and become interactive problem solvers for embodied tasks, such as robot manipulation?
Other work has already shown that language models can be used to plan robot actions. The Google team now asks whether language models can also be used to replan when things go wrong.
Google shows inner robot monologues
The approach relies on what is known as "thinking in language." As an example, the team cites the inner monologue a person might run through while trying to unlock a door: "I need to unlock the door; I'm trying to take this key and put it in the lock ... no, wait, it doesn't fit, I'll try another one ... this one worked, now I can turn the key."
This thought process involves decisions about immediate actions in service of the higher-level task (picking up the key, unlocking the door), observations about the results of those actions (the key does not fit), and corrective actions taken in response (trying a different key).
Such a thought process is therefore a natural framework for integrating feedback for large language models, the researchers say. They call the approach "Inner Monologue."
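Written down, such a monologue becomes a structured transcript that a language model can simply continue line by line. The snippet below is a purely illustrative rendering of the door example in that style; the labels and wording are invented here and are not Google's actual prompt format:

```python
# Purely illustrative transcript in the inner-monologue style; the labels
# ("Robot thought", "Success", ...) are invented, not Google's prompt format.
monologue = """\
Human: unlock the door
Robot thought: I need to unlock the door. I'll try this key.
Robot action: insert key 1 into the lock
Success: False
Robot thought: No, wait, it doesn't fit. I'll try another one.
Robot action: insert key 2 into the lock
Success: True
Robot thought: This one worked, now I can turn the key.
Robot action: turn key 2
"""
```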
Video: Google
While older approaches let a language model generate a complete plan for a goal in one pass, leaving no opportunity for corrections, the Google team continuously feeds the language model additional information as the robot interacts with its environment.
This can include a description of the objects visible in a scene or feedback on whether an action succeeded. Based on this information, the language model can also ask the human follow-up questions if an instruction is unclear or no longer executable.
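A rough sketch of this closed loop in Python, purely illustrative: llm_complete, describe_scene, and execute are hypothetical stubs standing in for a language model, an object detector, and the robot's low-level skills, none of them Google's actual APIs.

```python
import random

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a language model; continues the transcript."""
    # Toy policy: retry the action until a "Success: True" line appears
    # in the prompt, then declare the task done.
    if prompt.rfind("Success: True") > prompt.rfind("Success: False"):
        return "Robot thought: done"
    return "Robot action: pick up the ketchup bottle"

def describe_scene() -> str:
    """Hypothetical object detector that reports the scene as text."""
    return "a plastic banana and a ketchup bottle on the table"

def execute(action: str) -> bool:
    """Hypothetical low-level skill execution; fails at random to force retries."""
    return random.random() > 0.3

def control_loop(instruction: str, max_steps: int = 10) -> str:
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        prompt += f"Scene: {describe_scene()}\n"    # perception feedback
        step = llm_complete(prompt)                 # model picks the next step
        prompt += step + "\n"
        if step.startswith("Robot thought: done"):  # model signals completion
            break
        prompt += f"Success: {execute(step)}\n"     # success feedback re-enters the prompt
    return prompt

print(control_loop("put the ketchup bottle in the box"))
```

The transcript the loop accumulates is the inner monologue itself: every perception result and success signal becomes text the model conditions on when deciding its next step.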
Google Inner Monologue controls robots in simulation and reality
Google's team tests Inner Monologue in simulation and in the real world, with the language model also generating the commands that control the robot. The model was prepared for this with only a few examples (few-shot learning). In simulation, a virtual robot arm sorts virtual objects; in the real world, a robot arm sorts plastic bananas and ketchup bottles. If an action fails, the language model issues the same command again.
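"Prepared with a few examples" here means few-shot prompting: one or more complete example episodes are placed in front of the live transcript so the model imitates their format for the new task. A minimal sketch, with an example episode invented purely for illustration:

```python
# Minimal few-shot prompting sketch; the example episode is invented
# for illustration and is not taken from Google's actual prompts.
FEW_SHOT_EXAMPLES = """\
Human: put the apple in the drawer
Scene: an apple and a sponge on the counter
Robot action: pick up the apple
Success: True
Robot action: put the apple in the drawer
Success: True
Robot thought: done
"""

def build_prompt(instruction: str, scene: str) -> str:
    """Prepend the worked episodes, then open a new episode for the live task."""
    return f"{FEW_SHOT_EXAMPLES}\nHuman: {instruction}\nScene: {scene}\n"

print(build_prompt("sort the plastic banana into the bin",
                   "a plastic banana and a ketchup bottle on the table"))
```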
Video: Google
The robot's performance in a real test environment is impressive: a mobile robot arm picks up, stows, or discards beverage cans and snacks while coping with human interference. It repeats failed actions, describes scenes, and asks appropriate follow-up questions.
Thanks to its language capabilities, the system can continuously adapt to new instructions and set itself new goals when old ones are not achievable. It also understands multiple languages, can use past actions and environmental feedback to better understand a scene, and can handle typos. Video examples of this can be found on Inner Monologue's project page.
In the future, the team hopes to reduce the model's reliance on human feedback, for example through advanced image and video captioning and visual question answering.