UltraFastBERT: ETH Zurich develops technology to dramatically speed up LLMs


Scientists at ETH Zurich have developed a technique that can drastically speed up large language models.

By modifying how the BERT model performs its computations, the researchers reduced the share of neurons used during inference to 0.3 percent. UltraFastBERT engages only 12 of the 4,095 neurons in each layer per inference, yet achieves results similar to those of the unmodified model.

In language models based on transformer architectures, a large share of the parameters and computations is concentrated in so-called feedforward networks. The researchers replace these layers with fast feedforward networks, which organize their neurons in a tree structure instead of a single dense layer.

As a result of this component change, the usual dense matrix multiplications become conditional matrix multiplications. Instead of multiplying the input with the weights of every neuron, the network uses a binary decision tree to pick out only the neurons needed for a given input and computes those alone.
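The difference between the two kinds of multiplication can be illustrated with a minimal sketch. This is not the researchers' implementation. It assumes neurons stored in heap order (the children of node n are 2n+1 and 2n+2), a ReLU activation, and a branch decision based on the sign of each neuron's pre-activation. The names `dense_ff` and `conditional_ff` and the small input width of 16 are hypothetical and chosen only for illustration.

```python
import numpy as np


def dense_ff(x, w_in, w_out):
    """Dense feedforward layer: every neuron is evaluated."""
    hidden = np.maximum(x @ w_in.T, 0.0)  # ReLU over all neurons
    return hidden @ w_out


def conditional_ff(x, w_in, w_out, depth):
    """Evaluate only the neurons on one root-to-leaf path of a binary tree.

    Assumption: the sign of a neuron's pre-activation decides whether the
    walk continues to the left (2n+1) or right (2n+2) child.
    """
    node = 0
    y = np.zeros(w_out.shape[1])
    visited = []
    for _ in range(depth):
        pre = float(x @ w_in[node])       # one dot product per tree level
        y += max(pre, 0.0) * w_out[node]  # add this neuron's contribution
        visited.append(node)
        node = 2 * node + 1 + (pre > 0)   # left child if pre <= 0, else right
    return y, visited


rng = np.random.default_rng(0)
depth = 12                  # 12 levels give 2**12 - 1 = 4095 neurons
n_neurons = 2**depth - 1
d_model = 16                # hypothetical input width, kept small on purpose
w_in = rng.standard_normal((n_neurons, d_model))
w_out = rng.standard_normal((n_neurons, d_model))
x = rng.standard_normal(d_model)

y_dense = dense_ff(x, w_in, w_out)             # touches all 4095 neurons
y_cond, visited = conditional_ff(x, w_in, w_out, depth)
print(len(visited), "of", n_neurons, "neurons evaluated")
```

With random weights the two outputs naturally differ. The sketch only shows the cost structure: the dense version performs one dot product per neuron, the conditional version one per tree level.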

Experiments using the GLUE language comprehension benchmark showed that UltraFastBERT achieved up to 96 % of the performance of the original BERT model, despite the greatly reduced number of neurons.

Huge speed-up potential for large models

According to the researchers, the approach has enormous speed-up potential, especially for very large language models. For example, in the OpenAI language model GPT-3, the number of neurons required per inference could theoretically be reduced to just 0.03 % of the previous amount.

However, to exploit the theoretical possibilities in practice, optimizations at the software level are still necessary. According to the scientists, an efficient implementation of conditional matrix multiplication would make it possible to speed up the process by a factor of 341.
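The percentages and the speed-up factor quoted here follow from simple ratios, sketched below. The BERT figures come from the article. The GPT-3 figures are assumptions not stated in the article: a feedforward width of 49,152 neurons per layer (four times the 12,288-dimensional hidden size) and 16 neurons used per inference.

```python
import math

# Figures from the article: 12 of 4,095 neurons per layer are used.
bert_total, bert_used = 4095, 12
bert_share = bert_used / bert_total      # fraction of neurons engaged
bert_speedup = bert_total / bert_used    # theoretical reduction factor
bert_levels = int(math.log2(bert_total + 1))  # levels in a full binary tree

# Assumed GPT-3 figures (not from the article, see the note above).
gpt3_total, gpt3_used = 49152, 16
gpt3_share = gpt3_used / gpt3_total

print(f"BERT share of neurons used:  {bert_share:.2%}")
print(f"BERT theoretical speed-up:   {bert_speedup:.2f}x")
print(f"Tree levels for 4095 nodes:  {bert_levels}")
print(f"GPT-3 share of neurons used: {gpt3_share:.3%}")
```

The ratio 4,095 / 12 is where the factor of 341 comes from. It is a count of neurons skipped, which is why it represents an upper bound and not a measured running time.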

The know-how needed for such an implementation is not freely available, as the researchers write:

Dense matrix multiplication is the most optimized mathematical operation in the history of computing. A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been – be it for their complexity or for competitive advantage – kept confidential and exposed to the end user only through powerful but restrictive programming interfaces. Therefore, despite having no need for new hardware, we are still forced to rely on combining high-level linear-algebraic routines to implement CMM, hence the reduction in the speedup.

From the paper

As a first step, the researchers nevertheless wrote working CPU-level code that achieved a 78-fold speedup over the optimized baseline feedforward implementation.


Other researchers have recently made similar breakthroughs with image models, drastically reducing the computation time required. Stability AI's SDXL Turbo cuts the number of image generation steps from 50 to one while maintaining almost the same quality.
