In an interview at the AI for Good Global Summit, OpenAI CEO Sam Altman stressed the importance of high-quality data for training AI models and said the company currently has enough data for the next version after GPT-4.
Altman emphasized that AI systems need high-quality data, whether it comes from humans or is generated synthetically. The possibility that too much AI-generated data could harm an AI system does not seem to concern him in itself; the real problem, he said, is low-quality data from either source.
"I think what you need is high-quality data. There's low-quality synthetic data, there's low-quality human data," Altman said in an interview at the AI for Good Global Summit.
For now, OpenAI has enough data to train the next model after GPT-4, Altman said.
The OpenAI CEO also said the company has been experimenting with generating large amounts of synthetic data to explore different approaches to training AI.
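OpenAI has not shared details of these experiments, but the general pattern behind synthetic data generation is simple: prompt a strong model to produce candidate training examples, then filter them for quality before reusing them. The Python sketch below is purely illustrative and is not OpenAI's pipeline; the prompt, the model choice, and the crude length-and-deduplication filter are all assumptions standing in for far more sophisticated quality controls.

```python
# Illustrative synthetic-data loop (an assumption, not OpenAI's actual pipeline).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_examples(topic: str, n: int = 5) -> list[str]:
    """Ask a strong model to write candidate training examples on a topic."""
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",  # illustrative model choice
            messages=[{
                "role": "user",
                "content": f"Write one short, factual Q&A pair about {topic}.",
            }],
            temperature=1.0,  # higher temperature for more diverse outputs
        )
        examples.append(resp.choices[0].message.content.strip())
    return examples

def filter_low_quality(examples: list[str], min_len: int = 40) -> list[str]:
    """Crude stand-in for quality filtering: drop short and duplicate outputs.
    Real pipelines rely on classifiers, large-scale deduplication, and review."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        if len(ex) >= min_len and ex not in seen:
            seen.add(ex)
            kept.append(ex)
    return kept

if __name__ == "__main__":
    candidates = generate_examples("photosynthesis")
    training_data = filter_low_quality(candidates)
    print(f"kept {len(training_data)} of {len(candidates)} candidates")
```

The filtering step is the crux: as Altman's remarks suggest, generating tokens is easy, while deciding which of them are high-quality enough to train on is the hard part.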
But for Altman, the central question is not how to generate massive amounts of synthetic data for training, but how AI systems can learn more from less data. He said it would be "very strange" if the best way to train a model were to "generate like a quadrillion tokens of synthetic data and feed that back in." Instead, he framed the core question as "how do you learn more from less data?" and cautioned that OpenAI and other companies still need to figure out which data and methods work best for training increasingly powerful AI systems.
Research backs up Altman's comments: studies consistently show that higher-quality training data leads to better AI performance. His emphasis on data quality also fits OpenAI's recent strategy of spending hundreds of millions of dollars to license training data from major publishers.