Large language models require a lot of memory and computing power. Sparsification can reduce both.
Large language models from the GPT family have become the standard in natural language processing. However, their usability is limited by their sheer size and the computing power they require.
GPT-175B, for example, comprises 175 billion parameters, which take up at least 320 gigabytes of memory in half-precision format. Running the model therefore requires at least five A100 GPUs with 80 gigabytes of memory each.
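As a rough sanity check on those figures (simple arithmetic, not taken from the paper beyond the parameter count), the memory requirement follows directly from the number of parameters and the bytes per weight:

```python
# Back-of-the-envelope memory estimate for a 175-billion-parameter model in FP16.
params = 175e9          # parameter count
bytes_per_param = 2     # half precision (FP16) stores each weight in 2 bytes

gib = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{gib:.0f} GiB")            # ~326 GiB
print(f"A100 80GB GPUs needed: {gib / 80:.1f}")    # ~4.1, i.e. five GPUs in practice
```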
Almost all existing approaches to compression rely on quantization, which reduces the precision of the numerical representation of individual weights. This shrinks the networks, but it can also hurt their performance, since the information is no longer represented as precisely.
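To illustrate the idea, the sketch below shows simple uniform 8-bit quantization with NumPy. It is a generic example of the technique, not the specific quantization scheme used for any particular GPT model:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Uniform symmetric quantization of a weight tensor to 8-bit integers."""
    scale = np.abs(w).max() / 127.0                      # one scale factor per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the 8-bit integers back to floating point for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())    # the precision that is lost
```

The round trip shows where the loss comes from: every weight is snapped to one of 255 levels, so the reconstruction only approximates the original values.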
One-shot pruning without loss of accuracy
Pruning is an alternative method: the model is made more compact by removing redundant or less important weights. The approach is not new and is considered useful, but accuracy usually suffers.
This loss then has to be recovered through costly retraining of the model, and previous one-shot pruning methods, which avoid such retraining, are too computationally expensive to apply to models with billions of parameters.
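For intuition, the simplest variant is magnitude pruning: the weights with the smallest absolute values are set to zero. The sketch below is a generic illustration of that baseline, not SparseGPT itself, which selects and compensates weights far more carefully:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(w.size * sparsity)                        # number of weights to drop
    threshold = np.sort(np.abs(w), axis=None)[k]      # magnitude of the k-th smallest weight
    mask = np.abs(w) >= threshold                     # keep only the larger weights
    return w * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
w_sparse, mask = magnitude_prune(w, sparsity=0.5)
print("fraction of zeros:", 1.0 - mask.mean())        # roughly 0.5
```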
A new method called SparseGPT could be a solution. It was developed by Elias Frantar and Dan Alistarh from the Institute of Science and Technology Austria and is presented in a new paper titled "SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot".
According to the authors, SparseGPT is the first precise one-shot pruning method that works efficiently for models with ten to 100 billion parameters.
50 to 60 percent smaller, even with 175 billion parameters
Pruning with SparseGPT takes only about four hours on a single GPU, even for the largest openly available GPT-style models, OPT-175B and BLOOM-176B, the team said.
It also became clear that larger models are easier to sparsify: using SparseGPT, the researchers were able to slim the models down by 50 to 60 percent. Even at this level of sparsity, OPT-175B, for example, shows virtually no loss of accuracy compared to the dense model. In other words, roughly 100 billion parameters can be ignored during inference.
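The arithmetic behind that last figure is straightforward (my own illustration using the reported numbers, not a calculation from the paper):

```python
# What 60 percent unstructured sparsity means for a 175-billion-parameter model.
total_params = 175e9
sparsity = 0.6

zeroed = total_params * sparsity
print(f"Weights skippable at inference: ~{zeroed / 1e9:.0f} billion")                   # ~105 billion
print(f"Nonzero weights that remain:    ~{(total_params - zeroed) / 1e9:.0f} billion")  # ~70 billion
```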
Pruning and fine-tuning could achieve up to 90 percent sparsification
The team suspects that progressive pruning and fine-tuning could push sparsification to at least 80 to 90 percent. They also plan to investigate applying their approach during training, to reduce the computational cost of pre-training these massive models.
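Progressive pruning generally means alternating between removing a few more weights and fine-tuning to recover accuracy. The sketch below shows that generic loop in PyTorch; the `prune_to_sparsity` helper, the sparsity schedule, and the `train_step` callback are placeholders of my own, not the authors' published procedure:

```python
import torch

def prune_to_sparsity(model: torch.nn.Module, sparsity: float) -> None:
    """Placeholder: zero the smallest-magnitude weights in every linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k > 0:
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_(w.abs() > threshold)            # keep only weights above the threshold

def progressive_prune(model, train_step, sparsity_schedule=(0.5, 0.7, 0.8, 0.9),
                      finetune_steps=1000):
    """Alternate pruning and fine-tuning, increasing sparsity each round."""
    for target in sparsity_schedule:
        prune_to_sparsity(model, target)               # prune a little further
        for _ in range(finetune_steps):                # fine-tune to recover accuracy
            train_step(model)
        # A real implementation would also reapply the pruning mask after each
        # optimizer step so that the zeroed weights stay zero during fine-tuning.
    return model
```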
Sparsification could therefore make large models run more efficiently in the future and enable even larger ones.
This view is also shared by the German AI startup Aleph Alpha and the British AI chip maker Graphcore. The two companies demonstrated a sparsification approach for leaner language models as recently as November 2022.