A new study by researchers from Harvard University, Stanford University, and other institutions shows that precision—the number of bits used to represent numbers in models—plays a more significant role in scaling laws than previously thought.
The study, titled "Scaling Laws for Precision," demonstrates that precision significantly affects language model performance. According to the researchers, previous scaling laws describing how model performance changes with parameter count and training data volume largely ignored precision.
The research team conducted over 465 training runs to test their hypotheses. They trained language models with precisions ranging from 3 to 16 bits and quantized them to various precision levels after training. The models contained up to 1.7 billion parameters and were trained on up to 26 billion tokens.
A key finding is that over-trained language models become more sensitive to quantization after training. A model counts as over-trained when its ratio of training tokens to parameters significantly exceeds the "Chinchilla-optimal" value of about 20 tokens per parameter. The researchers examined ratios of up to 1,000.
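To put those ratios in perspective, here is a quick back-of-the-envelope calculation; the 1-billion-parameter model below is a hypothetical example, not a configuration from the study:

```python
# Illustrative arithmetic only; the 1B-parameter model is hypothetical.
CHINCHILLA_RATIO = 20    # ~20 training tokens per parameter is "Chinchilla-optimal"
EXTREME_RATIO = 1000     # the highest tokens-per-parameter ratio examined in the study

params = 1_000_000_000   # hypothetical 1B-parameter model

print(f"Chinchilla-optimal data: {params * CHINCHILLA_RATIO / 1e9:.0f}B tokens")  # 20B tokens
print(f"Ratio of 1,000:          {params * EXTREME_RATIO / 1e12:.0f}T tokens")    # 1T tokens
```

This over-trained regime is exactly where the study finds models become most sensitive to post-training quantization.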
The experiments revealed that the performance degradation caused by post-training quantization grows with the amount of training data. For a model that will be quantized after training, additional pre-training data can therefore actually be harmful, because it amplifies the errors introduced by quantization.
New precision scaling laws emerge
Based on their experiments, the researchers developed new scaling laws that incorporate precision alongside parameter count and training data. Another important finding concerns the compute-optimal precision during pre-training: according to the study, it is largely independent of the compute budget when parameter count, data volume, and precision are optimized jointly.
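Schematically, this means extending a Chinchilla-style loss, which depends only on parameter count and training tokens, with a precision term. One way to picture it, as a schematic sketch rather than the paper's fitted formula or constants, is that training at P bits shrinks the model's effective parameter count:

```latex
% Schematic sketch only -- not the paper's fitted formula or constants.
% Chinchilla-style loss in parameters N and tokens D:
%   L(N, D) = A N^{-alpha} + B D^{-beta} + E
% Folding in training precision P via an "effective" parameter count:
\[
  L(N, D, P) \;\approx\; A\, N_{\mathrm{eff}}(N, P)^{-\alpha} \;+\; B\, D^{-\beta} \;+\; E,
  \qquad
  N_{\mathrm{eff}}(N, P) \;=\; N \cdot f(P),
\]
\[
  \text{with } f \text{ increasing in } P \text{ and saturating near } 1 \text{ at high precision.}
\]
```

In this picture, training at fewer bits behaves like training a somewhat smaller model, which is why very low precision eventually has to be paid for with a larger parameter count.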
The common practice of training models at 16 bits is therefore suboptimal, since much of that precision adds little. Training at 4 bits, on the other hand, would require a disproportionate increase in model size to maintain the same loss scaling. The researchers' calculations suggest that 7-8 bits are compute-optimal for larger models.
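To make the compute trade-off concrete, here is a rough calculation under the simplifying assumption, which is not the paper's exact cost model, that training cost scales linearly with bit width:

```python
# Rough illustration, assuming training compute scales linearly with bit width.
# This is a simplification for intuition, not the paper's exact cost model.
BASELINE_BITS = 16
budget = 1.0                       # fixed training budget (arbitrary units)

for bits in (16, 8, 4, 3):
    # params * tokens affordable at this precision, relative to a 16-bit run
    relative_capacity = (budget / bits) / (budget / BASELINE_BITS)
    print(f"{bits:2d}-bit training: {relative_capacity:.2f}x the params*tokens of a 16-bit run")
```

The study's point is that this nominal gain stops paying off below roughly 7-8 bits: the model then has to grow disproportionately to recover the lost quality, which eats up the saving.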
The situation changes when the model size is fixed from the start: larger and more heavily trained models should then be trained at higher precision; for a model like Llama 3.1 8B, for example, that means 16 bits.
However, the actual compute savings also depend on hardware support for lower precisions. In addition, the models studied (up to 1.7 billion parameters) are small by today's standards, so the laws have not been validated at the largest practical scale, though the general trends are expected to carry over to larger models.
As hardware development increasingly supports low-precision computing, these new scaling laws can help developers find the optimal balance between model size, data volume, and precision.
"The perfect storm for the end of scale"
For AI researcher Tim Dettmers from Carnegie Mellon University and Allen AI, this work is "the most important paper in a long time." He says it clearly shows that the community has reached the limits of quantization—with implications for AI research and GPUs.
Combined with the physical limitations of hardware, this amounts in his view to a "perfect storm" for the end of scaling. Efficient low-precision methods like 8-bit training are reaching their limits, especially for large models such as Llama 3.1 with 405 billion parameters. Dettmers sees only a few remaining paths to further gains, such as larger data centers, specialized models, or knowledge distillation, and believes the paradigm will soon shift from pure scaling toward human-centered applications. "Many of us efficiency researchers had some hunch that our data reflects this trend, but we had no hard evidence. Predictive trends that are verified by more experiments (scaling laws) is as robust evidence as you can get. So now it is very clear where we are."