Google Deepmind researchers have introduced "Mixture-of-Depths", a method to use the computing power of transformer models more efficiently.
Traditional transformer models use the same amount of computing power for each token in a sequence. In contrast, Google Deepmind's "Mixture-of-Depths" (MoD) allows the model to flexibly and selectively distribute this computing power to the tokens that need it most.
This is done by setting a fixed upper limit on the amount of computation per forward pass: for example, at most 50 percent of the tokens in a sequence are allowed to go through the computationally intensive self-attention and MLP calculations. Each block contains a "router" that assigns a weight to every token, and the tokens with the highest weights, up to this budget, are selected for computation while the rest are skipped.
The remaining tokens are passed on unchanged. In this way, computationally intensive steps can be skipped for tokens that don't need them. The model learns which tokens require more or less computation.
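As a rough illustration, this routing can be sketched in a few lines of PyTorch. The block below is not DeepMind's implementation: the class name MoDBlock, the 50 percent capacity default, and the sigmoid scaling of the block output are assumptions chosen to mirror the description above - a per-token router, top-k selection under a fixed budget, and unchanged pass-through for everything else.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Transformer block with Mixture-of-Depths-style routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity: float = 0.5):
        super().__init__()
        self.capacity = capacity                      # fraction of tokens processed per block
        self.router = nn.Linear(d_model, 1)           # scalar routing weight per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))            # hard compute budget per sequence

        weights = self.router(x).squeeze(-1)          # (B, T) routing weights
        top_w, top_idx = weights.topk(k, dim=-1)      # tokens with the highest weights
        top_idx, order = top_idx.sort(dim=-1)         # restore original token order
        top_w = top_w.gather(1, order)

        # Gather only the selected tokens for the expensive attention + MLP path.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
        selected = x.gather(1, idx)                   # (B, k, D)

        causal = torch.triu(torch.ones(k, k, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(selected)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        h = selected + attn_out
        h = h + self.mlp(self.norm2(h))

        # Scale the block's contribution by the (squashed) router weight so the
        # routing decision stays on the gradient path, then scatter it back.
        update = (h - selected) * torch.sigmoid(top_w).unsqueeze(-1)
        out = x.clone()
        out.scatter_add_(1, idx, update)              # skipped tokens pass through unchanged
        return out
```

Because the budget k is fixed in advance, the tensor shapes are known before the forward pass, which is what makes the compute savings predictable rather than dependent on how many tokens the router happens to favor.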
Mixture-of-Depths achieves baseline model performance
Despite the significant reduction in FLOPs per prediction, the MoD models matched or exceeded the performance of the baseline models after training. According to the team, this suggests that the traditional, uniform distribution of computational resources in transformer models is not always optimal and that a more targeted allocation of computation can improve model performance. A query to the fully trained model then requires only a fraction of the computing power and can be sampled up to 50 percent faster.
The method can also be combined with the now widely used Mixture-of-Experts (MoE) architecture. MoD is complementary to MoE because the two approaches optimize different dimensions of the model: MoE routes tokens between parallel experts, while MoD decides whether a token receives a block's computation at all.
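One way the two routing schemes might fit together can also be sketched, again as an assumption-laden approximation rather than the paper's exact design: the hypothetical MoDWrappedMoE module below first applies MoD-style routing to decide which tokens are processed at all, and a separate expert router then distributes the surviving tokens among a small set of feed-forward experts.

```python
import torch
import torch.nn as nn

class MoDWrappedMoE(nn.Module):
    """Sketch: MoD routing staged in front of a small MoE feed-forward layer."""

    def __init__(self, d_model: int, n_experts: int = 4, capacity: float = 0.5):
        super().__init__()
        self.capacity = capacity
        self.depth_router = nn.Linear(d_model, 1)            # MoD: process or skip
        self.expert_router = nn.Linear(d_model, n_experts)   # MoE: which expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))

        # Stage 1 (MoD): keep only the top-k tokens per sequence.
        depth_w = self.depth_router(x).squeeze(-1)            # (B, T)
        top_w, top_idx = depth_w.topk(k, dim=-1)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
        selected = x.gather(1, idx)                           # (B, k, D)

        # Stage 2 (MoE): route each selected token to its top-1 expert
        # (all experts are run densely here for simplicity).
        gate = self.expert_router(selected).softmax(dim=-1)   # (B, k, E)
        expert_id = gate.argmax(dim=-1)                       # (B, k)
        out_sel = torch.zeros_like(selected)
        for e, expert in enumerate(self.experts):
            mask = (expert_id == e).unsqueeze(-1)
            out_sel = out_sel + mask * expert(selected)

        # Scale by the MoD router weight and scatter back; skipped tokens
        # keep their residual stream untouched.
        out = x.clone()
        out.scatter_add_(1, idx, out_sel * torch.sigmoid(top_w).unsqueeze(-1))
        return out
```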
The ability to dynamically allocate computing power and use it more efficiently could be particularly valuable in applications with tight demands on computing time and resources, but the freed-up FLOPs could also be spent on training larger models. According to the researchers, the approach brings memory savings as well: some MoD variants require fewer accelerators, an efficiency gain that could become significant when scaling to larger models.