Scientists at ETH Zurich have developed a technique that can drastically speed up large language models.
By making targeted changes to the BERT model's computation, the researchers reduced the number of neurons needed for inference to 0.3 percent of the original count. Just 12 of the 4,095 neurons in each layer were enough for their model, UltraFastBERT, to achieve results on par with the unmodified model.
In language models based on the transformer architecture, a large share of the parameters and computation is concentrated in so-called feedforward networks. The researchers replace these layers with differently structured fast feedforward networks.
This swap turns the usual dense matrix multiplications into conditional matrix multiplications: instead of multiplying every input with every neuron, binary decision trees identify and evaluate only the neurons needed for a given input.
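The NumPy sketch below illustrates the idea under simplified assumptions (the class name, the ReLU-style activation, and the exact routing rule are illustrative choices, not details taken from the paper). The numbers line up because 4,095 = 2^12 - 1, the size of a complete binary tree whose root-to-leaf paths contain exactly 12 nodes, so each input only ever touches 12 neurons per layer.

```python
import numpy as np


class FastFeedforwardSketch:
    """Minimal sketch of a fast feedforward layer (illustration, not the authors' code).

    The layer's neurons form a complete binary tree with `depth` levels,
    i.e. 2**depth - 1 nodes; depth 12 gives the 4,095 neurons mentioned above.
    Each input evaluates only the `depth` neurons on one root-to-leaf path.
    """

    def __init__(self, d_model: int, depth: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_nodes = 2 ** depth - 1
        # Each tree node is one neuron with an input and an output weight vector.
        self.w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Conditional 'matrix multiplication': descend the tree and let each
        visited neuron's pre-activation decide which child to visit next."""
        y = np.zeros_like(x)
        node = 0
        for _ in range(self.depth):
            pre = self.w_in[node] @ x                 # one dot product per level
            y += max(pre, 0.0) * self.w_out[node]     # ReLU-style output (an assumption)
            node = 2 * node + 1 if pre < 0 else 2 * node + 2  # left or right child
        return y


layer = FastFeedforwardSketch(d_model=768, depth=12)  # 2**12 - 1 = 4,095 neurons
out = layer.forward(np.ones(768))                     # evaluates just 12 of them
```

The sketch covers only inference; training additionally requires making the tree decisions differentiable, which is handled in the published approach but omitted here.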
Experiments on the GLUE language understanding benchmark showed that UltraFastBERT achieved up to 96 percent of the original BERT model's performance despite the greatly reduced number of neurons.
Huge speed-up potential for large models
According to the researchers, the approach has enormous speed-up potential, especially for very large language models. For example, in OpenAI's GPT-3 language model, the number of neurons required per inference could theoretically be reduced to just 0.03 percent of the current amount.
However, software-level optimizations are still needed to realize this potential in practice. According to the scientists, an efficient implementation of conditional matrix multiplication could speed up the feedforward computation by a factor of 341, roughly the ratio of total to visited neurons (4,095 / 12 ≈ 341).
However, the necessary knowledge is not freely available:
Dense matrix multiplication is the most optimized mathematical operation in the history of computing. A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been – be it for their complexity or for competitive advantage – kept confidential and exposed to the end user only through powerful but restrictive programming interfaces. Therefore, despite having no need for new hardware, we are still forced to rely on combining high-level linear-algebraic routines to implement CMM, hence the reduction in the speedup.
From the paper
As a first step, the researchers wrote working CPU-level code that achieved a 78-fold speedup over the optimized baseline feedforward implementation.
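As a rough intuition for where such gains come from, the toy NumPy timing below (an illustration with invented variable names, not the authors' benchmark code) contrasts a dense feedforward pass over all 4,095 neurons with the dozen dot products needed to walk one tree path per token.

```python
import time
import numpy as np

d_model, n_neurons, depth = 768, 4095, 12
rng = np.random.default_rng(0)
W = rng.standard_normal((n_neurons, d_model))
x = rng.standard_normal(d_model)

t0 = time.perf_counter()
for _ in range(1_000):
    _ = W @ x                              # dense: every neuron is evaluated
t_dense = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(1_000):
    node = 0
    for _ in range(depth):                 # conditional: one path of 12 neurons
        pre = W[node] @ x
        node = 2 * node + 1 if pre < 0 else 2 * node + 2
t_cond = time.perf_counter() - t0

print(f"dense: {t_dense:.3f} s   conditional: {t_cond:.3f} s")
```

How much of the arithmetic saving survives in practice depends on memory layout and on how well the conditional path can be expressed with existing low-level routines, which is exactly the limitation the quote above describes.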
Drastic reductions in computation are also being achieved for image models: Stability AI's SDXL Turbo cuts the number of image generation steps from 50 to one while maintaining almost the same quality.