
A new optimization technique called DisTrO reduces communication between GPUs during AI training by up to 10,000 times. This breakthrough could make it possible to train large language models over standard Internet connections.


Researchers have created DisTrO, a new family of optimizers that dramatically reduces data exchange between GPUs when training large AI models, including large language models (LLMs) and diffusion models.

Traditional distributed training requires synchronizing full gradients between all participating accelerators (GPUs, TPUs) after each training step. This process demands extremely high bandwidth and specialized high-speed connections.
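
For context, here is a minimal sketch of what that per-step synchronization looks like in a conventional data-parallel setup, written with PyTorch's torch.distributed; the function names and fp16 sizing are illustrative assumptions, not details from the DisTrO paper.

```python
# Minimal sketch of conventional data-parallel training (NOT DisTrO):
# every worker all-reduces the full gradient of every parameter on every
# step, which is what drives the extreme bandwidth requirements.
import torch
import torch.distributed as dist

def training_step(model, optimizer, loss_fn, batch, world_size):
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Full gradient synchronization across all workers (e.g. NCCL all-reduce).
    # For a 1.2 billion parameter model in fp16, this moves roughly 2.4 GB of
    # gradient data per worker on every single training step.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()
```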

DisTrO slashes these communication requirements by up to four to five orders of magnitude. During the pre-training of a 1.2 billion parameter language model, the required bandwidth per training step dropped from 74.4 GB to just 86.8 MB - an 857-fold reduction.
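
The 857-fold figure follows directly from the two reported numbers:

```python
# Reduction factor implied by the reported per-step bandwidth figures.
baseline_mb = 74.4 * 1000   # 74.4 GB per step with full gradient synchronization
distro_mb = 86.8            # 86.8 MB per step with DisTrO
print(round(baseline_mb / distro_mb))  # -> 857
```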


The team reports that reductions of up to 10,000 times are possible during fine-tuning. DisTrO works independently of network topology and neural network architecture.

DisTrO aims to make AI training more accessible

The researchers believe DisTrO could democratize the training of large AI models. The drastically reduced bandwidth requirements could enable model training via normal internet connections, eliminating the need for specialized high-speed links.

This advancement could allow researchers and organizations with limited resources to participate in developing state-of-the-art AI models. Until now, this capability has been limited to governments and large tech companies in wealthy countries with the necessary funding and infrastructure.

The team suggests DisTrO could enable a fully decentralized network for collaboratively training AI models. The method is highly resilient to node failures or degradation and can easily incorporate new nodes.

The researchers also see great potential for applications like federated learning, where models are trained collaboratively while keeping training data private and decentralized. DisTrO could make federated learning practical for efficiently training LLMs over the internet.
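
As a rough illustration of that setting, the sketch below shows plain federated averaging (FedAvg), not DisTrO's own update rule: each participant trains on its private data locally, and only model weights - never raw data - are sent to a coordinator.

```python
# Generic federated-averaging (FedAvg) sketch -- not DisTrO's algorithm.
# Participants train on private local data; only model weights are shared
# and combined into a new global model each round.
from typing import Dict, List
import torch

def federated_average(local_states: List[Dict[str, torch.Tensor]],
                      num_examples: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of the participants' model state dicts."""
    total = sum(num_examples)
    global_state = {}
    for name in local_states[0]:
        global_state[name] = sum(
            state[name].float() * (n / total)
            for state, n in zip(local_states, num_examples)
        )
    return global_state
```

In such a setup, DisTrO's reduced communication volume is what would make each round cheap enough to run over ordinary internet links.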

Summary
  • Researchers have developed a new optimization technique called DisTrO that reduces data exchange between GPUs by up to 10,000 times when training large AI models.
  • DisTrO reduces the bandwidth required to pre-train a 1.2 billion-parameter language model from 74.4 GB to 86.8 MB per training step. This could enable training over standard Internet connections without the need for dedicated high-speed links.
  • The method could democratize the training of large AI models by enabling researchers and organizations with limited resources to participate in the development of state-of-the-art models. The researchers also see potential for applications such as federated learning.