
A study from the University of California, Berkeley, shows how robots can learn to walk using the same word-prediction principle that powers language models. The approach could pave the way for a new generation of robots that navigate complex environments with minimal training.

In their paper, "Humanoid Locomotion as Next Token Prediction," the researchers treat the complex task of robot locomotion as a sequence prediction problem, similar to predicting the next word in language generation.

They adapt the same Transformer architecture that enabled the breakthrough in large language models to the task of predicting robot movements.

The robot's steps are treated as "tokens", comparable to words in a sentence. The Transformer learns to predict the next token - that is, the next movement - from the sequence of movements that came before. In other words, the robot predicts each next step based on the steps it has already taken.
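To make the analogy concrete, here is a minimal sketch of what such an autoregressive model could look like. It is not the authors' code; the class name, token width, and loss are illustrative assumptions. A causal Transformer reads the tokens of a trajectory so far and regresses the next one, just as a language model predicts the next word.

```python
# Minimal sketch (not the paper's implementation): a causal Transformer that
# predicts the next movement token from the tokens seen so far.
import torch
import torch.nn as nn

class NextTokenLocomotionModel(nn.Module):
    def __init__(self, token_dim=64, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)           # project trajectory tokens
        self.pos = nn.Embedding(max_len, d_model)            # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)            # output: prediction of the next token

    def forward(self, tokens):                               # tokens: (batch, time, token_dim)
        t = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        h = self.backbone(x, mask=causal)                    # each step attends only to the past
        return self.head(h)                                  # position t holds the guess for step t+1

# Training objective: regress each next token from the preceding ones.
model = NextTokenLocomotionModel()
trajectory = torch.randn(8, 100, 64)                         # dummy batch of tokenized trajectories
pred = model(trajectory[:, :-1])                             # predictions for steps 1..99
loss = nn.functional.mse_loss(pred, trajectory[:, 1:])       # compare with the true next tokens
```

At deployment, the same loop runs forward in time: the model's predicted token becomes part of the context for predicting the one after it.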


The model was trained on a mix of data sources, including human motion data and YouTube videos. According to the researchers, the robot was able to navigate the streets of San Francisco zero-shot, without having seen any specific examples of that environment beforehand. This was achieved with just 27 hours of walking data for model training.

The model can also execute commands that it had not seen in training, such as walking backwards. This adaptability could enable robots to move flexibly in complex real-world environments with a fraction of the training effort otherwise required.

Predictions help exploit diverse multimodal training data

A strength of the researchers' method is its ability to handle diverse data sources, ranging from videos and sensor readings to computer simulations, by converting this information into a uniform token format for the Transformer.
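A rough sketch of what such a unification step might look like is shown below. The modality names and dimensions are assumptions for illustration; the idea is simply that each data source is projected into a shared token space so a single Transformer can consume all of it.

```python
# Illustrative sketch (assumed names and sizes): map each modality to a common
# token width so heterogeneous data sources become one sequence format.
import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    def __init__(self, modality_dims, token_dim=64):
        super().__init__()
        self.proj = nn.ModuleDict({name: nn.Linear(dim, token_dim)
                                   for name, dim in modality_dims.items()})

    def forward(self, frames):
        # frames: dict of modality name -> (batch, time, dim)
        tokens = [self.proj[name](x) for name, x in frames.items()]
        # Concatenate along the time axis into one token sequence; a full pipeline
        # might interleave observations and actions per timestep instead.
        return torch.cat(tokens, dim=1)

tokenizer = ModalityTokenizer({"observations": 40, "actions": 19})
batch = {"observations": torch.randn(2, 50, 40), "actions": torch.randn(2, 50, 19)}
sequence = tokenizer(batch)   # (2, 100, 64): a unified token sequence for the Transformer
```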

They also devised a strategy for making use of incomplete data: learnable mask tokens stand in for missing information, so the model can still be trained to predict whatever is available and the gaps in the data no longer block training. For example, for YouTube videos, the researchers used the joint positions of the human body to transfer the motion to the humanoid robot.

The research team used diverse data for robot training: neural network policies with complete sequences, model-based controllers without actions, approximately retargeted human motion capture, and human poses reconstructed from internet videos. | Image: Radosavovic et al.

The team's core insight is that even with incomplete trajectories, where certain sensory or motor information is missing, the model can still learn effectively by predicting the available information and filling in the gaps with learnable mask tokens.
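The following sketch illustrates one way such mask tokens could work; it is an assumption-laden example, not the authors' code. A learnable placeholder replaces missing action information (for instance in clips reconstructed from video), so every trajectory keeps a full-length token sequence and the loss can be computed only where ground truth exists.

```python
# Hedged sketch of the mask-token idea: a learnable token stands in for
# missing actions so incomplete trajectories can still be used for training.
import torch
import torch.nn as nn

class MaskedActionTokens(nn.Module):
    def __init__(self, action_dim=19, token_dim=64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, token_dim)
        self.mask_token = nn.Parameter(torch.zeros(token_dim))  # learned "action missing" token

    def forward(self, actions, available):
        # actions: (batch, time, action_dim); available: (batch, time) bool per step
        tokens = self.action_proj(actions)
        return torch.where(available.unsqueeze(-1), tokens, self.mask_token)

layer = MaskedActionTokens()
acts = torch.zeros(2, 50, 19)                    # no recorded actions, e.g. a YouTube clip
avail = torch.zeros(2, 50, dtype=torch.bool)     # mark every step as missing
tokens = layer(acts, avail)                      # every step now carries the learnable mask token
# During training, the prediction loss would be applied only at positions where
# ground-truth data is available.
```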


The researchers believe that the model's success in making accurate predictions under these conditions suggests it is developing a more sophisticated understanding of the physical world, an intuition that holds exciting potential for the future of autonomous robotics. It is as if the robot is learning to think on its feet, which could change the way robots move in the future.

Summary
  • Researchers at UC Berkeley have developed a new model for humanoid robots that treats steps as "tokens" and learns movement as an autoregressive sequence, analogous to language models: instead of the next word, the model predicts the next movement.
  • The model has been trained on a variety of data, including human motion data and YouTube videos, enabling robots to navigate unfamiliar environments - in the researchers' example, San Francisco - without prior adaptation.
  • The technology shows the potential to efficiently deploy robots in complex environments with minimal training. It could open up new possibilities for flexible and intelligent robotic applications.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.