Nvidia researchers have built a small neural network that controls humanoid robots more effectively than specialized systems, even though it uses far fewer resources. The system works with multiple input methods, from VR headsets to motion capture.
The new system, called HOVER, needs only 1.5 million parameters to handle complex robot movements. For context, typical large language models use hundreds of billions of parameters.
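For a sense of how small that is, a plain multilayer perceptron with a few hidden layers already lands in that range. The sketch below is purely illustrative; the layer widths, input size, and output size are assumptions, not HOVER's published architecture.

```python
import torch.nn as nn

# Hypothetical sketch: an ordinary MLP of roughly HOVER's size.
# The input size (proprioception + command vector) and output size
# (one target per actuated joint) are illustrative guesses.
policy = nn.Sequential(
    nn.Linear(512, 1024), nn.ELU(),
    nn.Linear(1024, 512), nn.ELU(),
    nn.Linear(512, 256),  nn.ELU(),
    nn.Linear(256, 25),
)

n_params = sum(p.numel() for p in policy.parameters())
print(f"{n_params:,} parameters")  # ~1.19 million, the same order as HOVER's 1.5 million
```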
The team trained HOVER in Nvidia's Isaac simulation environment, which simulates robot motion 10,000 times faster than real time. According to Nvidia researcher Jim Fan, this means a full year of training in the virtual world takes just about 50 minutes of actual computing time on a single GPU.
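The arithmetic checks out: one simulated year compressed by a factor of 10,000 comes to just under an hour of wall-clock time.

```python
# Back-of-the-envelope check of the stated speedup (illustrative only).
SPEEDUP = 10_000                # simulated seconds per wall-clock second
sim_hours = 365 * 24            # one year of virtual experience
wall_clock_minutes = sim_hours / SPEEDUP * 60
print(f"{wall_clock_minutes:.0f} minutes")  # ~53 minutes, matching "about 50"
```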
Small and versatile
HOVER transfers zero-shot from simulation to physical robots, without any fine-tuning, says Fan. The system accepts input from multiple sources, including head and hand tracking from XR devices such as Apple Vision Pro, full-body positions from motion capture or RGB cameras, joint angles from exoskeletons, and standard joystick controls.
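One plausible way to reconcile such different devices in a single policy is a fixed-size command vector plus a mask telling the network which entries the current device actually drives: sparse inputs like head and hand tracking fill only a few slots, while full-body motion capture fills them all. The sketch below illustrates that idea; the field names, keypoint count, and helper functions are assumptions for illustration, not HOVER's actual interface.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Command:
    target_pose: np.ndarray   # desired body keypoint positions, flattened
    mask: np.ndarray          # 1.0 where the input device supplies a value

def from_vr_headset(head_xyz, left_hand_xyz, right_hand_xyz, n_keypoints=23):
    """Head and hand tracking fills only 3 of the keypoint slots."""
    pose = np.zeros((n_keypoints, 3))
    mask = np.zeros(n_keypoints)
    for slot, xyz in [(0, head_xyz), (1, left_hand_xyz), (2, right_hand_xyz)]:
        pose[slot], mask[slot] = xyz, 1.0
    return Command(pose.ravel(), np.repeat(mask, 3))

def from_mocap(all_keypoints_xyz):
    """Full-body motion capture fills every slot."""
    n = len(all_keypoints_xyz)
    return Command(np.asarray(all_keypoints_xyz).ravel(), np.ones(n * 3))
```

Under this scheme, the policy consumes the command vector and its mask together, so a single network can serve every device without per-mode retraining.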
For each control method, the system outperforms systems built specifically for that single type of input. Lead author Tairan He speculates that this may be because the generalist policy acquires a broad understanding of physical concepts such as balance and precise limb control, which it applies across all control types.
The system builds on the open-source H2O and OmniH2O projects and works with any humanoid robot that can run in the Isaac simulator. Nvidia has posted examples and code on GitHub.