Scientists at ETH Zurich have developed a technique that can drastically speed up large language models.
By making targeted changes to the BERT model's computation, the researchers reduced the number of neurons needed for inference to 0.3 percent of the original count. Just 12 of the 4,095 neurons in each layer were enough for their model, UltraFastBERT, to achieve results on par with the unmodified model.
In language models based on the transformer architecture, a large share of the parameters and computation is concentrated in so-called feedforward networks. The researchers replace these layers with differently structured fast feedforward networks.
This swap turns the usual dense matrix multiplications into conditional matrix multiplications: instead of multiplying every input with every neuron, binary decision trees identify and evaluate only the neurons needed for a given input.
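The NumPy sketch below illustrates the idea under simplified assumptions (the class name, the ReLU-style activation, and the exact routing rule are illustrative choices, not details taken from the paper). The numbers line up because 4,095 = 2^12 - 1, the size of a complete binary tree whose root-to-leaf paths contain exactly 12 nodes, so each input only ever touches 12 neurons per layer.

```python
import numpy as np


class FastFeedforwardSketch:
    """Minimal sketch of a fast feedforward layer (illustration, not the authors' code).

    The layer's neurons form a complete binary tree with `depth` levels,
    i.e. 2**depth - 1 nodes; depth 12 gives the 4,095 neurons mentioned above.
    Each input evaluates only the `depth` neurons on one root-to-leaf path.
    """

    def __init__(self, d_model: int, depth: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_nodes = 2 ** depth - 1
        # Each tree node is one neuron with an input and an output weight vector.
        self.w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Conditional 'matrix multiplication': descend the tree and let each
        visited neuron's pre-activation decide which child to visit next."""
        y = np.zeros_like(x)
        node = 0
        for _ in range(self.depth):
            pre = self.w_in[node] @ x                 # one dot product per level
            y += max(pre, 0.0) * self.w_out[node]     # ReLU-style output (an assumption)
            node = 2 * node + 1 if pre < 0 else 2 * node + 2  # left or right child
        return y


layer = FastFeedforwardSketch(d_model=768, depth=12)  # 2**12 - 1 = 4,095 neurons
out = layer.forward(np.ones(768))                     # evaluates just 12 of them
```

The sketch covers only inference; training additionally requires making the tree decisions differentiable, which is handled in the published approach but omitted here.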
Experiments on the GLUE language understanding benchmark showed that UltraFastBERT achieved up to 96 percent of the original BERT model's performance despite the greatly reduced number of neurons.
Huge speed-up potential for large models
According to the researchers, the approach has enormous speed-up potential, especially for very large language models. For example, in OpenAI's GPT-3 language model, the number of neurons required per inference could theoretically be reduced to just 0.03 percent of the current amount.
However, software-level optimizations are still needed to realize this potential in practice. According to the scientists, an efficient implementation of conditional matrix multiplication could speed up the feedforward computation by a factor of 341, roughly the ratio of total to visited neurons (4,095 / 12 ≈ 341).
However, the necessary knowledge is not freely available:
Dense matrix multiplication is the most optimized mathematical operation in the history of computing. A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been – be it for their complexity or for competitive advantage – kept confidential and exposed to the end user only through powerful but restrictive programming interfaces. Therefore, despite having no need for new hardware, we are still forced to rely on combining high-level linear-algebraic routines to implement CMM, hence the reduction in the speedup.
From the paper
As a first step, the researchers wrote working CPU-level code that achieved a 78-fold speedup over the optimized baseline feedforward implementation.
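As a rough intuition for where such gains come from, the toy NumPy timing below (an illustration with invented variable names, not the authors' benchmark code) contrasts a dense feedforward pass over all 4,095 neurons with the dozen dot products needed to walk one tree path per token.

```python
import time
import numpy as np

d_model, n_neurons, depth = 768, 4095, 12
rng = np.random.default_rng(0)
W = rng.standard_normal((n_neurons, d_model))
x = rng.standard_normal(d_model)

t0 = time.perf_counter()
for _ in range(1_000):
    _ = W @ x                              # dense: every neuron is evaluated
t_dense = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(1_000):
    node = 0
    for _ in range(depth):                 # conditional: one path of 12 neurons
        pre = W[node] @ x
        node = 2 * node + 1 if pre < 0 else 2 * node + 2
t_cond = time.perf_counter() - t0

print(f"dense: {t_dense:.3f} s   conditional: {t_cond:.3f} s")
```

How much of the arithmetic saving survives in practice depends on memory layout and on how well the conditional path can be expressed with existing low-level routines, which is exactly the limitation the quote above describes.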
Drastic reductions in computation are also being achieved for image models: Stability AI's SDXL Turbo cuts the number of image generation steps from 50 to one while maintaining almost the same quality.