According to a new study, ChatGPT is a jack of all trades, master of none. But the chatbot will change artificial intelligence forever, the researchers say.
In a new paper, a team from Wrocław University of Science and Technology in Poland shows how OpenAI’s ChatGPT performs on numerous natural language processing (NLP) machine learning benchmarks.
To do this, the researchers compared the chatbot to today’s best AI models on 25 different tasks. Their conclusion: ChatGPT is a “jack of all trades, master of none.”
Researchers develop a custom API to send over 38,000 requests to ChatGPT
So far, ChatGPT has mainly been tested in generative tasks, i.e. tasks that require the AI model to write or summarize text, or to answer questions, e.g. in a legal or medical context. In contrast, the Polish team is focusing on the analytical capabilities, especially the semantic and pragmatic understanding of the OpenAI chatbot.
This includes typical NLP problems: simple text classification such as detecting humor or sarcasm, more complex tasks such as judging grammatical correctness or analyzing sentiment, and tasks in which ambiguous words must be classified correctly or reasoning is tested.
Such tasks are relevant not only for research but also for businesses, which can use AI to automatically classify product reviews or moderate content.
For each benchmark, the team creates custom prompts that instruct ChatGPT to answer in the required format. To handle the large volume of requests (over 38,000 prompts), the researchers use a custom PyGPT API and up to 20 OpenAI accounts.
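To illustrate what a format-constraining prompt can look like, here is a minimal Python sketch. It is not the researchers' code: the label set, the prompt wording, and the function names `build_prompt` and `parse_answer` are assumptions made for illustration, and no request is actually sent to ChatGPT.

```python
import re

# Hypothetical label set for a sarcasm-detection task; not taken from the paper.
LABELS = ("sarcastic", "not sarcastic")


def build_prompt(text, labels=LABELS):
    """Wrap the input text in an instruction that constrains the answer format."""
    options = ", ".join(f'"{label}"' for label in labels)
    return (
        f"Classify the following text. Answer with exactly one of: {options}.\n"
        f"Text: {text}\n"
        "Answer:"
    )


def parse_answer(reply, labels=LABELS):
    """Map a free-form reply onto one of the allowed labels, or None if none matches."""
    cleaned = reply.strip().lower()
    # Check longer labels first so "not sarcastic" is not read as "sarcastic".
    for label in sorted(labels, key=len, reverse=True):
        if re.search(rf"\b{re.escape(label)}\b", cleaned):
            return label
    return None
```

Even with a strict instruction, a chatbot may answer in a full sentence, so the parsing step matters as much as the prompt. Replies that match no label return `None` and could be counted as errors or sent again.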
ChatGPT is not yet on the level of state-of-the-art systems
In all 25 benchmarks, ChatGPT was consistently outperformed by today’s best AI models for each task. On average, the quality of the specialized models was 73.7 percent, while that of ChatGPT was 56.6 percent. ChatGPT was particularly weak on tasks involving a “very subjective problem of emotional perception and individual interpretation of the content”.
When the eight emotion-related tasks are excluded, the average quality of ChatGPT rises to 69.7 percent, while that of the other methods rises to 80 percent. In some cases, the quality of ChatGPT can be improved by a few percentage points with additional examples in the prompt.
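Adding examples to the prompt is commonly called few-shot prompting. The Python sketch below shows one way such a prompt could be assembled. The example texts, the labels, and the function name `few_shot_prompt` are made up for illustration and are not drawn from the study.

```python
def few_shot_prompt(examples, text, labels):
    """Prepend labeled examples to the query so the model sees the expected format."""
    options = ", ".join(labels)
    blocks = [f"Classify each text as one of: {options}."]
    for example_text, example_label in examples:
        blocks.append(f"Text: {example_text}\nLabel: {example_label}")
    # The final item is left unlabeled for the model to complete.
    blocks.append(f"Text: {text}\nLabel:")
    return "\n\n".join(blocks)


# Made-up demonstration data, not taken from any benchmark in the study.
EXAMPLES = [
    ("The battery died after two days.", "negative"),
    ("Works exactly as described.", "positive"),
]

prompt = few_shot_prompt(
    EXAMPLES, "Arrived late but works fine.", ["positive", "negative"]
)
```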
So ChatGPT’s performance is still below that of the state-of-the-art (SOTA) models, but apart from the emotion-related tasks, the gap is not very large, the researchers conclude. ChatGPT is thus a jack-of-all-trades, but without really mastering any task.
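The size of the gap can be worked out from the averages quoted above. The short Python sketch below does this. Note that it compares the averaged scores as reported in this article, which is not necessarily the metric used in the paper, and the function name `gap` is chosen here for illustration.

```python
def gap(sota, chatgpt):
    """Return (gap in percentage points, gap relative to the SOTA score in percent)."""
    absolute = sota - chatgpt
    relative = 100.0 * absolute / sota
    return round(absolute, 1), round(relative, 1)


# Averages as quoted in this article.
all_tasks = gap(73.7, 56.6)        # (17.1, 23.2)
without_emotion = gap(80.0, 69.7)  # (10.3, 12.9)
```

By this rough measure, ChatGPT trails the specialized models by about 17 percentage points across all tasks and by about 10 points once the emotion-related tasks are left out.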
ChatGPT will be “life-changing” and “AI-boosting”
The researchers, therefore, expect ChatGPT to be used in classical NLP areas as well. The team sees a special advantage in the interactivity of the bot. Disadvantages are the lower accuracy and the beta status of the system.
ChatGPT can also explain its own answers, a capability that makes it easier for people to understand the bot’s output. This is an important part of explainable artificial intelligence (XAI), the paper says. As a result, the researchers “strongly believe that ChatGPT can accelerate the development of various AI-related technologies and profoundly change our daily lives.” They expect that ChatGPT and similar AI systems will advance AI research and spark an “economic and social AI revolution.”
In the future, the team plans to test ChatGPT on more reasoning benchmarks and with a variety of prompt engineering methods.