
Datology AI has introduced BeyondWeb, a new framework that generates synthetic data for training language models. The approach is designed to address the growing shortage of high-quality training data and, according to the company, is far more efficient than previous methods.

While training budgets for large language models now reach trillions of tokens, good web data is getting harder to find. Datology AI sees this "wall of data" as a central challenge and positions BeyondWeb as a solution. The framework restructures existing web documents to be more information-dense, improves the educational tone, and reorganizes content for better training.
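Datology AI has not published the BeyondWeb pipeline itself, but the core idea of rephrasing web documents into denser, more instructive variants can be sketched with any instruction-tuned model. The snippet below is a minimal illustration of that rephrasing step, not the actual framework; the model name, prompt wording, and style list are assumptions chosen for the example.

```python
# Minimal sketch of source rephrasing for synthetic pretraining data.
# NOT Datology AI's pipeline: model name, prompts, and styles are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Varying the target style is one simple way to add the stylistic
# diversity the BeyondWeb study highlights as important.
STYLES = [
    "a dense, information-rich encyclopedia entry",
    "a clear educational explanation aimed at students",
    "a question-and-answer dialogue between a user and an assistant",
]

def rephrase(document: str, style: str, model: str = "gpt-4o-mini") -> str:
    """Rewrite one web document in the requested style, keeping its facts."""
    prompt = (
        f"Rewrite the following web text as {style}. "
        "Preserve all factual content, remove boilerplate and filler, "
        "and keep the result self-contained.\n\n" + document
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    raw_doc = (
        "Welcome to our site! Click here to subscribe. Photosynthesis "
        "converts light energy into chemical energy in plants..."
    )
    for style in STYLES:
        print(rephrase(raw_doc, style)[:200], "\n---")
```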

Performance gains

According to Datology AI, BeyondWeb boosts accuracy by 5.1 percentage points on 8B parameter models compared to Hugging Face's Cosmopedia and by 2.6 percentage points over Nvidia's Nemotron-CC dataset.

Figure: Average accuracy at 1B, 3B, and 8B parameters. BeyondWeb reaches roughly 57.4, 60.8, and 63.7 percent, beating four baselines. Averages are computed across 14 standard benchmarks in 0-shot and 5-shot settings. | Image: Datology AI

The study also found that models train much faster on BeyondWeb: 7.7 times faster than on open web data and 2.7 times faster than on Nemotron Synthetic. In one test, a 3B parameter model trained on BeyondWeb outperformed an 8B model trained on Cosmopedia with the same token budget.

Figure: Accuracy of 8B models over training tokens (chart spans up to 180 billion tokens). BeyondWeb reached about 64 percent final accuracy after just 66 billion tokens, a 7.7x training speedup over RedPajama and 2.7x over Nemotron-Synth. | Image: Datology AI

The researchers looked at seven core questions around synthetic data generation. One key takeaway: diversity is essential for sustained progress. Standard methods may help early in training, but their lack of stylistic variety leads to diminishing returns.

Another finding: conversational style is underrepresented in web data, making up less than 2.7 percent, even though chat is the main use case for LLMs. Adding more conversational data helps, but gains plateau quickly.

Small models can be strong at reformulating text

Testing different model sizes, the team found that small language models can be effective at generating high-quality synthetic data. Moving from 1B to 3B parameters increased data quality by 1.5 percentage points, but improvements flattened out at 8B. This suggests that organizations with fewer resources can still generate strong synthetic datasets.

Figure: Quality of synthetic data by reformulator model: Llama-3.2-1B (47.3 percent), Llama-3.2-3B (48.8 percent), and Llama-3.1-8B (49.2 percent) versus the RPJ-HQ baseline (45.5 percent). Accuracy rises from 1B to 3B, with gains leveling off at 8B. | Image: Datology AI
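To illustrate how lightweight such a reformulator can be, the sketch below runs a small open-weight instruct model locally via Hugging Face transformers. The specific model, prompt, and generation settings are placeholders chosen for the example, not the setup used in the study.

```python
# Sketch: using a small instruct model as a local reformulator.
# Model choice and prompt are illustrative; this is not the study's exact setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    # ~1B parameters, runs on modest hardware; the checkpoint is gated on the Hub,
    # so any comparable small instruct model can be swapped in.
    model="meta-llama/Llama-3.2-1B-Instruct",
)

def reformulate(document: str) -> str:
    """Ask the small model to rewrite a web document more densely."""
    messages = [
        {"role": "user",
         "content": "Rewrite this text to be concise, educational, and "
                    "information-dense, keeping every fact:\n\n" + document},
    ]
    out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
    # With chat-format input, generated_text holds the full conversation;
    # the last message is the model's reply.
    return out[0]["generated_text"][-1]["content"]

print(reformulate("Photosynthesis is the process by which plants ..."))
```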

The researchers also tested different families of reformulator models and found that all produced similarly strong synthetic data. In other words, a model's overall benchmark score doesn't predict how good its synthetic data will be.

Real-world use

BeyondWeb has already been used to train ArceeAI's 4.5B parameter AFM model. For this, Datology AI built a scalable pipeline that can handle trillions of tokens. The team notes that generating top-quality synthetic data is complex, with many variables to fine-tune. BeyondWeb is not currently available for free research use.

Microsoft demonstrated the potential of synthetic data with Phi-4 in December 2024, training the model on 400 billion tokens of synthetic "textbook-style" data and introducing specialized "pivotal tokens" to improve learning. Phi-4 models deliver strong benchmark results, though they have received mixed reactions in real-world use.

Six months earlier, Nvidia released Nemotron-4 340B, a full open-source pipeline for generating synthetic data, with 98 percent of the Instruct model's training data created synthetically. Around the same time, researchers debunked the popular "model collapse" theory, showing that synthetic data can push AI development forward when used properly.

OpenAI also revealed during the GPT-5 announcement that the model was trained with synthetic data, likely produced by its in-house o3 model. While many companies use synthetic data primarily to cut costs, OpenAI said it focuses on carefully preparing data to enable real learning, not just to fill in gaps. Sébastien Bubeck, who previously led the Phi project at Microsoft and now works at OpenAI, explained this approach.

Summary
  • Datology AI has introduced BeyondWeb, a framework that reformulates existing web documents to create more information-dense and varied training data for language models, aiming to solve the issue of limited high-quality data.
  • The researchers report that BeyondWeb improved accuracy by 5.1 percentage points on 8B models compared to Cosmopedia and allowed training up to 7.7 times faster than using open web data.
  • The study highlights the importance of stylistic diversity and a carefully chosen share of conversational data for enhancing language model performance.