
A study from the University of California, Berkeley, shows how robots can learn to walk using the same word-prediction principle that powers language models. The approach could pave the way for a new generation of robots that navigate complex environments with minimal training.

In their paper, "Humanoid Locomotion as Next Token Prediction," the researchers treat the complex task of robot locomotion as a sequence prediction problem, similar to predicting the next word in language generation.

They adapt the same Transformer architecture that enabled the breakthrough in large language models to the task of predicting robot movements.

The robot's steps are treated as "tokens", comparable to words in a sentence. The Transformer learns to predict the next token - that is, the next movement - from the sequence of movements that came before. In other words, the robot predicts each next step based on the steps it has already taken.
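To make the analogy concrete, here is a minimal sketch of what such an autoregressive model could look like. It is not the authors' code; the class name, token width, and loss are illustrative assumptions. A causal Transformer reads the tokens of a trajectory so far and regresses the next one, just as a language model predicts the next word.

```python
# Minimal sketch (not the paper's implementation): a causal Transformer that
# predicts the next movement token from the tokens seen so far.
import torch
import torch.nn as nn

class NextTokenLocomotionModel(nn.Module):
    def __init__(self, token_dim=64, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)           # project trajectory tokens
        self.pos = nn.Embedding(max_len, d_model)            # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)            # output: prediction of the next token

    def forward(self, tokens):                               # tokens: (batch, time, token_dim)
        t = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        h = self.backbone(x, mask=causal)                    # each step attends only to the past
        return self.head(h)                                  # position t holds the guess for step t+1

# Training objective: regress each next token from the preceding ones.
model = NextTokenLocomotionModel()
trajectory = torch.randn(8, 100, 64)                         # dummy batch of tokenized trajectories
pred = model(trajectory[:, :-1])                             # predictions for steps 1..99
loss = nn.functional.mse_loss(pred, trajectory[:, 1:])       # compare with the true next tokens
```

At deployment, the same loop runs forward in time: the model's predicted token becomes part of the context for predicting the one after it.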


The model was trained on a mix of data sources, including human motion data and YouTube videos. According to the researchers, the robot was able to navigate the streets of San Francisco zero-shot, without having seen any specific examples of that environment beforehand. This was achieved with just 27 hours of walking data for model training.

The model can also execute commands that it had not seen in training, such as walking backwards. This adaptability could enable robots to move flexibly in complex real-world environments with a fraction of the training effort otherwise required.

Predictions help exploit diverse multimodal training data

A strength of the researchers' method is its ability to handle diverse data sources, ranging from videos and sensor readings to computer simulations, by converting this information into a uniform token format for the Transformer.
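A rough sketch of what such a unification step might look like is shown below. The modality names and dimensions are assumptions for illustration; the idea is simply that each data source is projected into a shared token space so a single Transformer can consume all of it.

```python
# Illustrative sketch (assumed names and sizes): map each modality to a common
# token width so heterogeneous data sources become one sequence format.
import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    def __init__(self, modality_dims, token_dim=64):
        super().__init__()
        self.proj = nn.ModuleDict({name: nn.Linear(dim, token_dim)
                                   for name, dim in modality_dims.items()})

    def forward(self, frames):
        # frames: dict of modality name -> (batch, time, dim)
        tokens = [self.proj[name](x) for name, x in frames.items()]
        # Concatenate along the time axis into one token sequence; a full pipeline
        # might interleave observations and actions per timestep instead.
        return torch.cat(tokens, dim=1)

tokenizer = ModalityTokenizer({"observations": 40, "actions": 19})
batch = {"observations": torch.randn(2, 50, 40), "actions": torch.randn(2, 50, 19)}
sequence = tokenizer(batch)   # (2, 100, 64): a unified token sequence for the Transformer
```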

They also devised a strategy for making use of incomplete data: learnable mask tokens stand in for missing information, so the model can still be trained to predict whatever is available and the gaps in the data no longer block training. For example, for YouTube videos, the researchers used the joint positions of the human body to transfer the motion to the humanoid robot.

The research team used diverse data for robot training: neural network policies with complete sequences, model-based controllers without actions, approximately retargeted human motion capture, and human poses reconstructed from internet videos. | Image: Radosavovic et al.

The team's core insight is that even with incomplete trajectories, where certain sensory or motor information is missing, the model can still learn effectively by predicting the available information and filling in the gaps with learnable mask tokens.
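The following sketch illustrates one way such mask tokens could work; it is an assumption-laden example, not the authors' code. A learnable placeholder replaces missing action information (for instance in clips reconstructed from video), so every trajectory keeps a full-length token sequence and the loss can be computed only where ground truth exists.

```python
# Hedged sketch of the mask-token idea: a learnable token stands in for
# missing actions so incomplete trajectories can still be used for training.
import torch
import torch.nn as nn

class MaskedActionTokens(nn.Module):
    def __init__(self, action_dim=19, token_dim=64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, token_dim)
        self.mask_token = nn.Parameter(torch.zeros(token_dim))  # learned "action missing" token

    def forward(self, actions, available):
        # actions: (batch, time, action_dim); available: (batch, time) bool per step
        tokens = self.action_proj(actions)
        return torch.where(available.unsqueeze(-1), tokens, self.mask_token)

layer = MaskedActionTokens()
acts = torch.zeros(2, 50, 19)                    # no recorded actions, e.g. a YouTube clip
avail = torch.zeros(2, 50, dtype=torch.bool)     # mark every step as missing
tokens = layer(acts, avail)                      # every step now carries the learnable mask token
# During training, the prediction loss would be applied only at positions where
# ground-truth data is available.
```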


The researchers believe that the model's success in making accurate predictions under these conditions suggests it is developing a more sophisticated understanding of the physical world, an intuition that holds exciting potential for the future of autonomous robotics. It is as if the robot is learning to think on its feet, which could change the way robots move in the future.

Summary
  • Researchers at UC Berkeley have developed a new model for humanoid robots that treats steps as "tokens" and learns movement as an autoregressive sequence, analogous to language models: instead of the next word, the model predicts the next movement.
  • The model has been trained on a variety of data, including human motion data and YouTube videos, enabling robots to navigate unfamiliar environments - in the researchers' example, San Francisco - without prior adaptation.
  • The technology shows the potential to efficiently deploy robots in complex environments with minimal training. It could open up new possibilities for flexible and intelligent robotic applications.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.