Sam Altman says OpenAI has enough data to train the next generation of AI
At a Glance
- In an interview, OpenAI CEO Sam Altman emphasizes the importance of using high-quality data to train AI models, whether it is human-generated or synthetic.
- OpenAI is experimenting with generating large amounts of synthetic data to explore different AI training techniques, but sees the key question as how AI systems can learn more with less data.
- According to Altman, OpenAI currently has enough data to train the next iteration after GPT-4, but acknowledges that much scientific progress is still needed to find the most appropriate data and techniques for increasingly powerful AI systems.
In an interview, Sam Altman, CEO of OpenAI, stressed the importance of high-quality data for training AI models. Altman said the company currently has enough data for the next version after GPT-4.
Speaking at the AI for Good Global Summit, Altman stressed that AI systems need high-quality data, whether it comes from humans or is synthetically generated. The possibility that too much AI-generated data could harm an AI system does not seem to worry Altman in itself; in his view, low-quality data is a problem regardless of its source.
"I think what you need is high-quality data. There's low-quality synthetic data, there's low-quality human data," Altman said.
For now, OpenAI has enough data to train the next model after GPT-4, Altman said.
The OpenAI CEO also said the company has been experimenting with generating large amounts of synthetic data to test different approaches to training AI.
But the main question is how AI systems can learn more from less data, rather than just generating massive amounts of synthetic data for training. Altman says it would be "very strange" if the best way to train a model was to "generate like a quadrillion tokens of synthetic data and feed that back in."
For Altman, the ability to learn efficiently from data is key; he describes the core question as "how do you learn more from less data?" He cautions that OpenAI and other companies still need to figure out which data and methods work best for training increasingly powerful AI systems.
Research supports Altman's emphasis on data quality: studies have repeatedly shown that better training data leads to better AI performance. His comments also fit OpenAI's recent strategy of spending hundreds of millions of dollars to license training data from major publishers.