Microsoft researchers see a spark of general AI in an early version of GPT-4.
“We believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond,” says the paper released by Microsoft.
GPT-4 shows a spark of general artificial intelligence (AGI), they say, but is only the first step on the road to increasingly general intelligent systems. The team works from a definition of AGI under which such a system performs at human level.
The team used an early version of GPT-4 for its research, but said it was already significantly better than models such as ChatGPT or Google’s PaLM.
Microsoft team sees ‘spark’ of general AI
“We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting,” the article says.
“Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance.”
But despite these human-level and, in some cases, superhuman capabilities, the model still struggles with hallucinations and with some mathematical tasks, they said. Moreover, its “patterns of intelligence are decidedly not human-like.” Given the conceptual fuzziness of terms like AGI or intelligence, the team clarifies:
“Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do (which is one of the usual definitions of AGI), or that it has inner motivation and goals (another key aspect in some definitions of AGI).”
What remains of the “spark”
So the question remains, what is the team actually claiming here? The key message seems to be: GPT-4 shows more general intelligence than older models. Whether this can be understood as a “spark” for AGI depends on two questions:
- How many of GPT-4’s new capabilities are based on the additional training data, the exact composition of which is so far a secret of OpenAI?
- And will scaling the approach lead to further progress?
The first question is relevant because a model performs significantly better on a benchmark if the benchmark’s test items are part of its training data.
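This kind of contamination is often probed with simple n-gram overlap checks between benchmark items and the training corpus. A minimal sketch of the idea (the function names, the n-gram size, and the 50% threshold are illustrative assumptions, not OpenAI’s actual pipeline):

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in a text (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_docs, n=5, threshold=0.5):
    """Flag a benchmark item whose n-grams overlap heavily with the training data.

    Returns True if at least `threshold` of the item's n-grams also occur
    somewhere in the training documents.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    overlap = len(item_grams & train_grams) / len(item_grams)
    return overlap >= threshold
```

A test question copied verbatim into the training data would be flagged, while an unseen question would not. Real contamination analyses are more elaborate (deduplication, fuzzy matching, scale), but the principle is the same: without knowing what is in the training data, benchmark scores are hard to interpret.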
Gary Marcus framed the second question like this back in 2012: “To paraphrase an old parable, Hinton has built a better ladder; but a better ladder doesn’t necessarily get you to the moon.”
Whether self-supervised transformers are building ladders or rockets remains to be seen. Research with small networks and toy examples shows that transformers can learn general and useful “circuits” to predict, say, Othello moves – an indication that they learn more than superficial statistics from the data.
Whether the same is true for large language models is still unclear. “Overall, elucidating the nature and mechanisms of AI systems such as GPT-4 is a formidable challenge that has suddenly become important and urgent,” the researchers conclude of their GPT-4 study.
Either way, the work is good publicity for GPT-4.