A new study by researchers from Harvard University, Stanford University, and other institutions shows that precision—the number of bits used to represent numbers in models—plays a more significant role in scaling laws than previously thought.
The study, titled "Scaling Laws for Precision," demonstrates that precision significantly affects language model performance. According to the researchers, previous scaling laws describing how model performance changes with parameter count and training data volume largely ignored precision.
The research team conducted over 465 training runs to test their hypotheses. They trained language models with precisions ranging from 3 to 16 bits and quantized them to various precision levels after training. The models contained up to 1.7 billion parameters and were trained on up to 26 billion tokens.
A key finding is that over-trained language models become more sensitive to quantization after training. A model counts as over-trained when its ratio of training tokens to parameters significantly exceeds the "Chinchilla-optimal" value of about 20 tokens per parameter. The researchers examined ratios of up to 1,000.
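To put those ratios in perspective, here is a quick back-of-the-envelope calculation; the 1-billion-parameter model below is a hypothetical example, not a configuration from the study:

```python
# Illustrative arithmetic only; the 1B-parameter model is hypothetical.
CHINCHILLA_RATIO = 20    # ~20 training tokens per parameter is "Chinchilla-optimal"
EXTREME_RATIO = 1000     # the highest tokens-per-parameter ratio examined in the study

params = 1_000_000_000   # hypothetical 1B-parameter model

print(f"Chinchilla-optimal data: {params * CHINCHILLA_RATIO / 1e9:.0f}B tokens")  # 20B tokens
print(f"Ratio of 1,000:          {params * EXTREME_RATIO / 1e12:.0f}T tokens")    # 1T tokens
```

This over-trained regime is exactly where the study finds models become most sensitive to post-training quantization.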
The experiments revealed that the performance degradation caused by post-training quantization grows with the amount of training data. For a model that will be quantized after training, additional pre-training data can therefore actually be harmful, because it amplifies the errors introduced by quantization.
New precision scaling laws emerge
Based on their experiments, the researchers developed new scaling laws that incorporate precision alongside parameter count and training data. Another important finding concerns the compute-optimal precision during pre-training: according to the study, it is largely independent of the compute budget when parameter count, data volume, and precision are optimized jointly.
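Schematically, this means extending a Chinchilla-style loss, which depends only on parameter count and training tokens, with a precision term. One way to picture it, as a schematic sketch rather than the paper's fitted formula or constants, is that training at P bits shrinks the model's effective parameter count:

```latex
% Schematic sketch only -- not the paper's fitted formula or constants.
% Chinchilla-style loss in parameters N and tokens D:
%   L(N, D) = A N^{-alpha} + B D^{-beta} + E
% Folding in training precision P via an "effective" parameter count:
\[
  L(N, D, P) \;\approx\; A\, N_{\mathrm{eff}}(N, P)^{-\alpha} \;+\; B\, D^{-\beta} \;+\; E,
  \qquad
  N_{\mathrm{eff}}(N, P) \;=\; N \cdot f(P),
\]
\[
  \text{with } f \text{ increasing in } P \text{ and saturating near } 1 \text{ at high precision.}
\]
```

In this picture, training at fewer bits behaves like training a somewhat smaller model, which is why very low precision eventually has to be paid for with a larger parameter count.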
The common practice of training models at 16 bits is therefore suboptimal, since much of that precision adds little. Training at 4 bits, on the other hand, would require a disproportionate increase in model size to maintain the same loss scaling. The researchers' calculations suggest that 7-8 bits are compute-optimal for larger models.
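To make the compute trade-off concrete, here is a rough calculation under the simplifying assumption, which is not the paper's exact cost model, that training cost scales linearly with bit width:

```python
# Rough illustration, assuming training compute scales linearly with bit width.
# This is a simplification for intuition, not the paper's exact cost model.
BASELINE_BITS = 16
budget = 1.0                       # fixed training budget (arbitrary units)

for bits in (16, 8, 4, 3):
    # params * tokens affordable at this precision, relative to a 16-bit run
    relative_capacity = (budget / bits) / (budget / BASELINE_BITS)
    print(f"{bits:2d}-bit training: {relative_capacity:.2f}x the params*tokens of a 16-bit run")
```

The study's point is that this nominal gain stops paying off below roughly 7-8 bits: the model then has to grow disproportionately to recover the lost quality, which eats up the saving.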
The situation changes when the model size is fixed from the start: larger and more heavily trained models should then be trained at higher precision; for a model like Llama 3.1 8B, for example, that means 16 bits.
However, the actual compute savings also depend on hardware support for lower precisions. In addition, the models studied (up to 1.7 billion parameters) are small by today's standards, so the laws have not been validated at the largest practical scale, though the general trends are expected to carry over to larger models.
As hardware development increasingly supports low-precision computing, these new scaling laws can help developers find the optimal balance between model size, data volume, and precision.
"The perfect storm for the end of scale"
For AI researcher Tim Dettmers from Carnegie Mellon University and Allen AI, this work is "the most important paper in a long time." He says it clearly shows that the community has reached the limits of quantization—with implications for AI research and GPUs.
Combined with the physical limitations of hardware, this amounts in his view to a "perfect storm" for the end of scaling. Efficient low-precision methods like 8-bit training are reaching their limits, especially for large models such as Llama 3.1 with 405 billion parameters. Dettmers sees only a few remaining paths to further gains, such as larger data centers, specialized models, or knowledge distillation, and believes the paradigm will soon shift from pure scaling toward human-centered applications. "Many of us efficiency researchers had some hunch that our data reflects this trend, but we had no hard evidence. Predictive trends that are verified by more experiments (scaling laws) is as robust evidence as you can get. So now it is very clear where we are."