
Scientists at ETH Zurich have developed a technique that can drastically speed up large language models.

By modifying how the BERT model performs its computations, the researchers reduced the number of neurons used per inference to 0.3 percent of the original value. With UltraFastBERT, just 12 of the 4,095 neurons in each layer were enough to achieve results similar to those of the unmodified model.

In language models based on the transformer architecture, a large share of the parameters and computation is concentrated in the feedforward networks. The researchers replace these layers with fast feedforward networks, which are structured differently.

As a result of this change, the usual dense matrix multiplications become conditional matrix multiplications (CMM). Instead of multiplying every input by the weights of every neuron, a binary decision tree identifies the neurons needed for a given input, and only those are computed.
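The idea of walking a binary tree and evaluating only the neurons along one path can be illustrated with a minimal sketch. It is not the researchers' code: the heap-style node layout, the rule of branching on the sign of each neuron's pre-activation, the GELU activation, and every name and toy dimension below are assumptions made for illustration.

```python
import math
import random


def gelu(z: float) -> float:
    """GELU activation computed with the error function (assumed activation)."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fff_forward(x, w_in, w_out, depth):
    """Conditional forward pass that evaluates one neuron per tree level.

    w_in[n] and w_out[n] are the input and output weight vectors of node n.
    Nodes are stored in heap order, so the children of n are 2n+1 and 2n+2.
    Returns the output vector and the list of visited node indices.
    """
    out = [0.0] * len(w_out[0])
    node, path = 0, []
    for _ in range(depth + 1):
        path.append(node)
        logit = dot(x, w_in[node])          # the only neuron computed at this level
        act = gelu(logit)
        for j, w in enumerate(w_out[node]):
            out[j] += act * w               # accumulate this neuron's contribution
        # Assumed branching rule: go right if the pre-activation is positive.
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return out, path


DEPTH = 11                        # levels 0..11, so 12 neurons lie on each path
N_NODES = 2 ** (DEPTH + 1) - 1    # 4095 neurons in the whole tree
DIM = 8                           # toy hidden size, far smaller than BERT's

rng = random.Random(0)
w_in = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_NODES)]
w_out = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_NODES)]
x = [rng.gauss(0, 1) for _ in range(DIM)]

y, path = fff_forward(x, w_in, w_out, DEPTH)
print(len(path), "of", N_NODES, "neurons evaluated")  # 12 of 4095 neurons evaluated
```

The loop touches 12 weight vectors regardless of how many neurons the tree holds, which is where the savings come from. A dense layer would compute all 4,095 dot products for the same input.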


Experiments using the GLUE language comprehension benchmark showed that UltraFastBERT achieved up to 96 % of the performance of the original BERT model, despite the greatly reduced number of neurons.

Huge speed-up potential for large models

According to the researchers, the approach has enormous speed-up potential, especially for very large language models. For example, in the OpenAI language model GPT-3, the number of neurons required per inference could theoretically be reduced to just 0.03 % of the previous amount.

However, to exploit the theoretical possibilities in practice, optimizations at the software level are still necessary. According to the scientists, an efficient implementation of conditional matrix multiplication would make it possible to speed up the process by a factor of 341.
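The article does not spell out where the factor of 341 comes from. One plausible reading, assumed here, is that it is simply the ratio of neurons in a layer to neurons actually evaluated. A short check of that arithmetic, using the figures quoted above:

```python
neurons_total = 4095   # neurons per layer, as quoted in the article
neurons_used = 12      # neurons evaluated per inference, as quoted in the article

fraction = neurons_used / neurons_total   # share of neurons evaluated
speedup = neurons_total / neurons_used    # assumed source of the theoretical speedup

print(f"{fraction:.2%}")   # 0.29%
print(f"{speedup:.2f}")    # 341.25

# 4095 is also the node count of a full binary tree with 12 levels.
print(2 ** neurons_used - 1 == neurons_total)  # True
```

Under that assumption, the quoted 0.3 percent and the factor of 341 are two views of the same ratio.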

However, the necessary knowledge is not freely available:

 Dense matrix multiplication is the most optimized mathematical operation in the history of computing. A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been – be it for their complexity or for competitive advantage – kept confidential and exposed to the end user only through powerful but restrictive programming interfaces. Therefore, despite having no need for new hardware, we are still forced to rely on combining high-level linear-algebraic routines to implement CMM, hence the reduction in the speedup.

From the paper

As a first step, however, the researchers wrote working CPU-level code that achieved a 78-fold speedup over the optimized baseline feedforward implementation.


Researchers elsewhere have also made breakthroughs with image models, drastically reducing the computation time required. Stability AI's SDXL Turbo cuts the number of image generation steps from 50 to one while maintaining almost the same quality.

  • ETH Zurich scientists present a method that drastically accelerates AI language models using fast feedforward networks.
  • With "UltraFastBERT", they achieve similar results to the unmodified model while using only 0.3 percent of the original neurons per inference.
  • The method has enormous speed-up potential, especially for large language models such as OpenAI's GPT-3, but software optimizations are still needed.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.