Google DeepMind researchers have introduced "Mixture-of-Depths", a method that uses the computing power of transformer models more efficiently.

Traditional transformer models spend the same amount of computing power on every token in a sequence. In contrast, Google DeepMind's "Mixture-of-Depths" (MoD) lets the model flexibly and selectively distribute this computing power to the tokens that need it most.

This is done by setting a fixed upper limit on the amount of computation per forward pass: for example, at most 50 percent of the tokens in a sequence may go through a block's computationally intensive calculations. Each block contains a "router" that assigns a weight to every token. The tokens with the highest weights, up to that budget, are selected for computation, while the rest are skipped.

The skipped tokens are passed on unchanged. In this way, expensive computation is avoided for tokens that don't need it, and the model learns on its own which tokens require more or less processing.
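To make the routing concrete, here is a minimal PyTorch-style sketch of such a block under the assumptions described above (a 50 percent capacity, a linear router, top-k selection per sequence). The class and parameter names, such as MoDBlock and capacity, are illustrative; this is not DeepMind's implementation.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Transformer block with Mixture-of-Depths-style token routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity: float = 0.5):
        super().__init__()
        self.capacity = capacity                     # fraction of tokens that get full compute
        self.router = nn.Linear(d_model, 1)          # scalar routing weight per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(self.capacity * t))           # per-sequence compute budget (top-k)
        weights = self.router(x).squeeze(-1)         # (batch, seq_len) router scores
        top = weights.topk(k, dim=-1).indices        # tokens selected for full computation
        idx = top.unsqueeze(-1).expand(-1, -1, d)

        # Run attention + MLP only on the selected tokens.
        sel = x.gather(1, idx)
        q = self.norm1(sel)
        h = sel + self.attn(q, q, q)[0]
        h = h + self.mlp(self.norm2(h))

        # Scale the block's contribution by the router weight so routing stays
        # differentiable, then scatter it back; skipped tokens pass through unchanged.
        gate = torch.sigmoid(weights.gather(1, top)).unsqueeze(-1)
        return x.scatter(1, idx, sel + gate * (h - sel))
```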

Mixture-of-Depths matches baseline model performance

Despite the significant reduction in FLOPs required per prediction, the MoD models matched or exceeded the performance of the baseline models after training. According to the team, this suggests that the traditional, uniform allocation of computational resources in transformer models is not always optimal, and that a more targeted allocation of compute can improve model performance. A query to a fully trained MoD model requires only a fraction of the computing power and can be up to 50 percent faster.

The method can also be combined with the now widely used Mixture-of-Experts architecture. MoD is complementary to MoE because the two approaches optimize different dimensions of the model: MoE decides which expert processes a token, while MoD decides how much computation a token receives at all.
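As one illustration of how the two ideas could fit together, the sketch below (same hedged PyTorch style, not the paper's reference design) gives an MoE layer an extra "no-op" routing slot, so the router either sends a token to a feed-forward expert or skips computation for it entirely. All names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoDExpertLayer(nn.Module):
    """MoE layer with an extra skip slot, sketching one way to combine MoD and MoE."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # One additional routing slot acts as the identity / "skip" path.
        self.router = nn.Linear(d_model, n_experts + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                      # (batch, seq, n_experts + 1)
        choice = logits.argmax(dim=-1)               # hard top-1 routing, for clarity
        gate = torch.softmax(logits, dim=-1)
        out = x.clone()
        for i, expert in enumerate(self.experts):
            mask = choice == i                       # tokens routed to expert i
            if mask.any():
                out[mask] = x[mask] + gate[mask][:, i:i+1] * expert(x[mask])
        # Tokens whose argmax lands on the last slot take the skip path:
        # out already equals x at those positions.
        return out
```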

The ability to dynamically allocate computing power, and thus use it more efficiently, could be particularly valuable in applications with tight compute and latency budgets, but the saved FLOPs could also be put toward training larger models. According to the researchers, the latter also brings memory savings: some MoD variants require fewer accelerators, suggesting that the efficiency gains could become significant when scaling to larger models.

Summary
  • Google DeepMind introduces Mixture-of-Depths (MoD), a method that allows transformer models to flexibly allocate the available computing power to the tokens that need it most.
  • A router in each block calculates a weight for every token. Only tokens with high weights go through the compute-intensive operations, while the rest are passed on unchanged. The model learns on its own which tokens require more or less computation.
  • MoD models match or exceed the performance of baseline models despite reduced computational requirements. The method can be combined with the Mixture-of-Experts architecture and could be particularly important in computationally intensive applications or when training larger models.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.