Researchers have developed DrEureka, a method that uses language models to automate the transfer of robot skills learned in simulation to the real world.
Transferring robot skills learned in simulation to the real world, known as sim-to-real transfer, is a promising approach for developing robot skills at scale. However, the process still demands substantial manual effort, such as designing reward functions and tuning simulation parameters.
Researchers at the University of Pennsylvania, Nvidia, and UT Austin have now developed DrEureka, a method that uses large language models to automate the sim-to-real process.
Nvidia's DrEureka uses physical knowledge from language models
DrEureka requires only the physics simulation for the target task and automatically creates appropriate reward functions and configurations for domain randomization.
Domain randomization is a technique for training robot controllers in simulation so that they transfer better to the real world. Physical parameters such as friction or mass are randomly varied during simulated training, so the resulting controller stays robust to the disturbances it will encounter on real hardware.
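The idea can be illustrated with a minimal sketch: each training episode runs with freshly sampled physics, so the policy never overfits to one exact simulation. The parameter names and ranges below are assumptions for illustration, not DrEureka's actual values.

```python
import random

# Illustrative domain randomization ranges (assumed values, not from the paper).
RANDOMIZATION_RANGES = {
    "friction":       (0.3, 1.2),  # ground friction coefficient
    "mass_scale":     (0.9, 1.1),  # scale factor on the robot's nominal mass
    "motor_strength": (0.8, 1.2),  # scale factor on nominal motor torque
}

def sample_physics_params():
    """Draw a fresh set of physics parameters for one training episode."""
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

# Each episode trains in a slightly different "world".
for episode in range(3):
    params = sample_physics_params()
    print(f"episode {episode}: {params}")
    # simulator.reset(physics=params); rollout and policy update would follow here
```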
With DrEureka, the team proposes to let AI language models choose suitable parameter distributions for domain randomization. Until now, this step has typically been done by hand, as it is a complex optimization task.
According to the researchers, it requires a good understanding of physics, such as how friction on different surfaces affects a robot's movement, as well as knowledge of the specific robot system.
This is where the scientists see potential for language models: they have broad knowledge of physics and can generate hypotheses. According to the researchers, this lets them tackle the complex search and optimization problem of finding suitable training parameters on their own.
DrEureka relies on OpenAI's GPT-4
DrEureka uses GPT-4 to automatically generate effective reward functions for a given robotic task. The model receives a task description together with safety instructions and produces several candidate reward functions as code.
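A rough sketch of this step might look like the following. It assumes the OpenAI Python package (v1+); the prompt wording and the task and safety descriptions are made up for illustration and differ from the prompts used in the paper.

```python
from openai import OpenAI  # assumes the openai Python package (v1 or later)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical task description and safety instructions.
task_description = "Quadruped robot: walk forward on a yoga ball without falling off."
safety_instructions = "Penalize joint torques near their limits and excessive body tilt."

prompt = (
    "You are designing a reward function for reinforcement learning.\n"
    f"Task: {task_description}\n"
    f"Safety: {safety_instructions}\n"
    "Return a Python function compute_reward(state, action) -> float."
)

# Request several independent candidates; each one is later used to
# train a policy in simulation.
candidates = []
for _ in range(4):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    candidates.append(response.choices[0].message.content)
```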
The generated reward function is then used to train an initial policy in simulation. This policy is evaluated in simulations with different parameter settings (for example, higher friction), and based on its performance across these settings, DrEureka constructs a prior for the distribution of the simulation parameters, i.e., the value ranges within which each parameter should lie.
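Conceptually, building such a prior amounts to sweeping each physics parameter and keeping the range in which the initial policy still performs acceptably. The sketch below is a simplified illustration of that idea; the helper functions, test values, and threshold are assumptions, not DrEureka's implementation.

```python
def feasible_range(evaluate_policy, param_name, test_values, min_return):
    """Sweep one physics parameter and keep the values where the initial
    policy still performs acceptably; the min and max of those values
    form the prior range handed to the language model."""
    good = [v for v in test_values
            if evaluate_policy(**{param_name: v}) >= min_return]
    return (min(good), max(good)) if good else None

# Stand-in for "run the initial policy in simulation with this friction
# value and return its average reward" (hypothetical numbers).
def evaluate_policy(friction=1.0):
    return 100.0 if 0.4 <= friction <= 1.5 else 10.0

prior = {"friction": feasible_range(evaluate_policy, "friction",
                                    [0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 4.0],
                                    min_return=50.0)}
print(prior)  # {'friction': (0.5, 1.5)}
```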
GPT-4 then takes this parameter prior as context and generates the domain randomization parameters. The final policies are trained in simulation using the generated reward function and DrEureka's domain randomization parameters, and are then ready to be deployed on the real robot.
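The language model's output can be thought of as a set of value ranges that plug directly into per-episode sampling, as in the earlier domain randomization sketch. The configuration below is an invented example of what such output might look like, not the actual parameters from the paper.

```python
import random

# Hypothetical language-model-proposed domain randomization configuration.
llm_dr_config = {
    "friction":        (0.5, 1.5),
    "added_base_mass": (-1.0, 3.0),  # kilograms added to the robot's base
    "motor_strength":  (0.9, 1.1),   # scale on nominal torque limits
}

def sample_from_config(config):
    """Sample one concrete set of physics parameters from the proposed ranges."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in config.items()}

# Final training sketch: every episode uses the GPT-4-generated reward
# function together with freshly sampled physics from the proposed ranges.
for episode in range(2):
    physics = sample_from_config(llm_dr_config)
    # env = make_env(physics=physics, reward_fn=best_reward_candidate)  # placeholder
    print(f"episode {episode}: training with {physics}")
```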
DrEureka outperforms human experts
The researchers evaluated DrEureka on legged and manipulator robot platforms. For the walking robots, policies trained with DrEureka outperformed those trained with human-designed reward functions and domain randomization by 34% in forward speed and 20% in distance traveled across different surfaces.
In the manipulation task, the best DrEureka policy performed nearly 300% more in-hand cube rotations in a given time than the policy trained with the human-designed setup.
To demonstrate how DrEureka can accelerate sim-to-real transfer on a previously unsolved task, the researchers tested it on walking on a ball: the walking robot has to balance and walk on a yoga ball for as long as possible. The policy trained with DrEureka balanced on a real yoga ball for several minutes in different environments, both indoors and outdoors.
The study demonstrates the potential of foundation models to automate difficult design steps in sim-to-real learning, which could significantly accelerate future research in robot learning and bring us closer to the vision of humanoid robots on the factory floor. However, there is still room for improvement, for example in adapting parameters dynamically during training and in selecting the most promising policies for deployment in the real world, the team writes.
More information and examples can be found on GitHub.