
In collaboration with OpenAI, robotics company Figure has developed a robot that can hold full conversations and plan and execute its actions.


This is achieved by connecting the robot to a multimodal model trained by OpenAI that understands images and text.

By connecting to the model, Figure's robot, called "Figure 01," is able to describe its environment, interpret everyday situations, and perform actions based on highly ambiguous, context-dependent requests.

All actions in the video are learned, not remote-controlled, and executed at normal speed.


Corey Lynch, robotics and AI engineer at Figure, is excited about the progress: "Even just a few years ago, I would have thought having a full conversation with a humanoid robot while it plans and carries out its own fully learned behaviors would be something we would have to wait decades to see. Obviously, a lot has changed."

Similar robotics research has already been demonstrated by Google with its RT models, which allow a robot to navigate an everyday environment and plan and execute complex actions based on the input and output of language and image models. But Google's demo robots weren't so chatty.

A robot that listens, plans, thinks, reasons and acts

Lynch describes the robot's capabilities in detail. These include describing its visual experiences, planning future actions, reflecting on its memories, and verbally explaining its conclusions that lead to actions.

To accomplish this, OpenAI's multimodal model processes the entire conversation history, including past images, to generate spoken responses that a human can respond to. The same model also decides which learned behavior the robot should perform to execute a given command.
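
Figure has not published how the model is wired into the robot, but the loop described here, full conversation history plus the latest camera image in, a spoken reply plus a choice of learned behavior out, can be sketched roughly. The sketch below assumes OpenAI's public chat completions API; the model name, the behavior list, and the JSON contract are illustrative assumptions, not Figure's actual interface.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical names for the robot's learned behaviors (not Figure's real catalog).
BEHAVIORS = ["pick_up_apple", "place_in_dish_rack", "hand_over_object"]


def respond_and_choose_behavior(history: list[dict], camera_jpeg: bytes) -> dict:
    """Send the running conversation plus the latest camera frame to a
    vision-language model; get back a spoken reply and a behavior to run."""
    image_b64 = base64.b64encode(camera_jpeg).decode()
    messages = [
        {
            "role": "system",
            "content": (
                "You control a humanoid robot. Answer in JSON with two keys: "
                '"speech" (what the robot says aloud) and "behavior" '
                f"(one of {BEHAVIORS})."
            ),
        },
        *history,  # earlier turns, including any previously sent images
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here is what you currently see."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        },
    ]
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; Figure has not named the model it uses
        messages=messages,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The spoken part of the reply would go to a text-to-speech system, while the selected behavior name would be handed to the low-level policy that actually moves the robot.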

For example, the robot can correctly answer the question "Can you put that there?" by referring to previous parts of the conversation to determine what is meant by "that" and "there." In one example, it understands that the dishes lying around should probably go in the dish rack, something I couldn't figure out until I was over 40.


The robot's actions are controlled by what are called visuomotor transformers, which translate images directly into actions. They process the images from the robot's cameras at a frequency of 10 Hz and generate actions with 24 degrees of freedom (wrist positions and finger angles) at a frequency of 200 Hz.
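
As a rough illustration of those two update rates, the sketch below runs a placeholder policy: the visual latent is refreshed at 10 Hz while 24-dimensional actions (wrist poses plus finger angles) are emitted at 200 Hz. The class and function names are hypothetical; the real transformer and its interface are not public.

```python
import numpy as np

PERCEPTION_HZ = 10   # camera frames fed to the visuomotor transformer
CONTROL_HZ = 200     # action outputs per second
ACTION_DIM = 24      # wrist positions + finger angles


class VisuomotorPolicy:
    """Stand-in for the learned visuomotor transformer (hypothetical interface)."""

    def encode(self, image: np.ndarray) -> np.ndarray:
        # The real system would run a transformer over the camera image;
        # here we just return a dummy latent vector.
        return np.zeros(256)

    def act(self, latent: np.ndarray) -> np.ndarray:
        # Decode the most recent visual latent into a 24-DoF action.
        return np.zeros(ACTION_DIM)


def control_loop(policy, get_frame, send_action, steps: int = CONTROL_HZ):
    """Run one second of control: 10 image updates, 200 action updates."""
    latent = policy.encode(get_frame())
    for step in range(steps):
        # Refresh the visual latent every 20 control ticks (200 Hz / 10 Hz).
        if step % (CONTROL_HZ // PERCEPTION_HZ) == 0:
            latent = policy.encode(get_frame())
        send_action(policy.act(latent))  # 24 values sent to the actuators
```

The key design point the article describes is this split: slow visual processing feeding a fast action decoder, so the hands keep moving smoothly between camera updates.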

Summary
  • In partnership with OpenAI, the robotics company Figure AI has developed a robot called "Figure 01" that can hold complex conversations and plan and execute independent actions thanks to a multimodal model trained by OpenAI that interprets images and text.
  • The robot can describe its visual experiences, plan future actions, reflect on its memories, and verbally explain its conclusions. OpenAI's multimodal model processes the entire conversation history, including past images, to generate verbal responses and decide which learned behavior the robot should perform.
  • The robot can access and analyze its entire conversation history to respond to complex queries. For example, it can correctly answer the question "Can you put that there?" by referring to previous parts of the conversation and understanding what "that" and "there" mean.