- Bitnet.cpp added
Update from October 17, 2024:
The team behind BitNet has released bitnet.cpp, a new inference framework for 1-bit language models such as BitNet b1.58. It provides optimized kernels for fast, lossless inference on CPUs. According to the developers, bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs and 2.37x to 6.17x on x86 CPUs, while reducing energy consumption by 55.4% to 82.2%. bitnet.cpp currently supports three 1-bit models hosted on Hugging Face.
More models are expected to follow. BitNet is available on GitHub.
Original article from March 2, 2024
Researchers at Microsoft Research and the University of the Chinese Academy of Sciences have unveiled BitNet b1.58, a 1-bit language model that promises high performance at significantly reduced cost and power consumption.
The development of large language models such as GPT-4 has made significant progress in recent years, but their high energy and memory consumption, and the associated costs, continue to pose significant challenges for the environment and for the widespread use of AI. A recent study by Shuming Ma and colleagues at Microsoft Research and the University of the Chinese Academy of Sciences may offer a way out of this problem: they present a 1-bit language model called BitNet b1.58 that delivers performance similar to traditional 16-bit models (FP16 or BF16) with significantly lower latency, memory requirements, and power consumption.
These 1-bit models work with ternary parameters that can take the values -1, 0, and 1, and were introduced in the study as BitNet b1.58, an evolution of the original BitNet. The key change is that the parameters are no longer restricted to the two values -1 and 1 but may also be zero; since a ternary value carries log2(3) ≈ 1.58 bits of information, this yields an average of 1.58 bits per parameter. The additional zero state increases the modeling capacity and brings the model closer to the performance of classical full-precision language models.
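As a rough illustration of how full-precision weights can be mapped to this ternary set, here is a minimal sketch of absmean-style quantization along the lines described in the paper: the weight matrix is scaled by its mean absolute value, then rounded and clipped to {-1, 0, 1}. The function name and the per-tensor scale are illustrative assumptions; the actual training pipeline quantizes weights on the fly during training and also quantizes activations.

```python
import numpy as np

def quantize_ternary(W: np.ndarray, eps: float = 1e-8):
    """Map a full-precision weight matrix to {-1, 0, 1} plus a scale.

    Sketch of absmean-style quantization: scale by the mean absolute
    weight, then round and clip to the ternary set.
    """
    gamma = np.abs(W).mean() + eps                       # per-tensor scale
    W_q = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return W_q, gamma                                    # gamma is reused to rescale outputs

# Toy example: weights with small magnitude map to 0, larger ones to +/-1
W = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
W_q, gamma = quantize_ternary(W)
print(W_q)
```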
The researchers showed that from a size of 3 billion parameters upward, BitNet b1.58 achieves perplexity and task performance comparable to classical language models, while processing up to 2.71 times faster and using 3.55 times less memory. A 3.9-billion-parameter variant of BitNet b1.58 is reported to perform significantly better than Meta's Llama 3B.
1-bit models could benefit further from specialized hardware
A key advantage of the new 1-bit models is the efficiency of their matrix multiplications, which reduce largely to integer additions, an operation that consumes far less energy than the usual floating-point multiply-accumulate operations. The researchers suggest that these energy savings could also translate into faster computation, since the performance of many chips is limited by the power available to them.
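To make this concrete, here is a minimal sketch, in plain Python rather than the optimized bitnet.cpp kernels, of a matrix-vector product with ternary weights: every output element is just a signed sum of activations, so no weight-activation multiplications are needed, and a single rescaling by the quantization scale restores the original magnitude.

```python
import numpy as np

def ternary_matvec(W_q: np.ndarray, x: np.ndarray, gamma: float) -> np.ndarray:
    """Compute y = gamma * (W_q @ x) using only additions and subtractions,
    which is possible because every entry of W_q is -1, 0, or 1."""
    y = np.empty(W_q.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_q):
        y[i] = x[row == 1].sum() - x[row == -1].sum()    # no multiplications per weight
    return gamma * y                                     # one rescale per output

# Toy example with a random ternary matrix and an activation vector
rng = np.random.default_rng(0)
W_q = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)  # entries in {-1, 0, 1}
x = rng.standard_normal(8).astype(np.float32)
gamma = 0.05                                             # example per-tensor scale
print(ternary_matvec(W_q, x, gamma))
```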
Another positive aspect is the reduction in memory requirements. Because BitNet b1.58 stores each weight in far fewer bits, less data has to be transferred from DRAM to the on-chip memory of an accelerator, which makes inference faster and more efficient.
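A back-of-the-envelope calculation of the weight storage alone illustrates the order of magnitude. The 3-billion-parameter figure and the ideal 1.58-bit packing are assumptions for illustration; the end-to-end savings measured in the paper are smaller, since activations, embeddings, and runtime buffers are not stored ternary.

```python
def weight_storage_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in GB; ignores activations, KV cache, and packing overhead."""
    return n_params * bits_per_param / 8 / 1e9

n = 3e9                                                   # hypothetical 3B-parameter model
print(f"FP16:     {weight_storage_gb(n, 16):.2f} GB")     # ~6.00 GB
print(f"1.58-bit: {weight_storage_gb(n, 1.58):.2f} GB")   # ~0.59 GB
```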
The study also provides comparisons with current models and shows that a BitNet b1.58 with 70 billion parameters could achieve up to 11 times the batch size and 8.9 times the token throughput of a comparable LLaMA 70B model.
The researchers also point out that specialized hardware will be needed to fully exploit the potential of 1-bit language models, and they call for further research and development in this direction.