
Researchers from Beihang University in China have developed a new dataset called TableBench to evaluate the performance of AI models when answering complex questions about tabular data. The benchmark reveals that even advanced systems perform significantly worse than humans in this area.

TableBench comprises 886 question-answer pairs across 18 subcategories within four major task types: fact checking, numerical reasoning, data analysis, and visualization. The dataset aims to bridge the gap between academic benchmarks and real-world application scenarios: answering a question requires an average of 6.26 reasoning steps, significantly more than comparable datasets demand.
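
For illustration, here is a minimal sketch of what a TableBench-style record might look like, written in Python. The field names and values are assumptions for this article, not the dataset's actual schema:

```python
# Hypothetical sketch of a TableBench-style record. Field names and values
# are illustrative assumptions, not the dataset's actual schema.
example = {
    "table": {
        "columns": ["Country", "GDP (trillion USD)", "Population (millions)"],
        "rows": [
            ["USA", 25.4, 333],
            ["China", 17.9, 1412],
            ["Germany", 4.1, 84],
        ],
    },
    "category": "NumericalReasoning",  # one of the 18 subcategories (assumed name)
    "question": "What is Germany's GDP per capita in USD?",
    # 4.1 trillion / 84 million ≈ 48,800 USD
    "answer": "about 48,800 USD",
    "reasoning_steps": 3,  # locate row, divide, round
}
```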

Image: Wu, Yang et al.

The research team evaluated over 30 large language models on TableBench, including both open-source and proprietary systems. Even the powerful GPT-4o model reached only around 54% of human performance, highlighting how far AI models still are from meeting the requirements of real-world applications.
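
The headline figure is a simple ratio: a model's benchmark score divided by the human score. A minimal sketch, with illustrative placeholder numbers rather than the paper's actual scores:

```python
# Minimal sketch of the comparison described above: a model's score as a
# percentage of human performance. The scores below are illustrative
# placeholders, not figures from the paper.
def relative_to_human(model_score: float, human_score: float) -> float:
    return 100 * model_score / human_score

print(f"{relative_to_human(46.4, 85.9):.0f}% of human performance")  # -> 54%
```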

Microsoft is working on solutions

Alongside TableBench, the researchers also presented TableInstruct, a training dataset with around 20,000 examples. They used it to train their own model, TABLELLM, which achieved performance comparable to GPT-3.5.
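
Instruction-tuning datasets like TableInstruct pair a table and a question with a target answer. Below is a minimal sketch of how records shaped like the example above could be formatted into prompt-completion pairs for supervised fine-tuning; the template and field names are assumptions, not the paper's actual format:

```python
# Minimal sketch of turning a TableBench/TableInstruct-style record (shaped
# like the example above) into a prompt-completion pair for supervised
# fine-tuning. The template is an assumption, not the paper's format.
def to_training_pair(record: dict) -> dict:
    header_and_rows = [record["table"]["columns"], *record["table"]["rows"]]
    table_text = "\n".join(
        " | ".join(str(cell) for cell in row) for row in header_and_rows
    )
    return {
        "prompt": (
            "Answer the question using the table below.\n\n"
            f"{table_text}\n\n"
            f"Question: {record['question']}\nAnswer:"
        ),
        "completion": record["answer"],
    }
```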

Microsoft researchers have also recently introduced SpreadsheetLLM, a method for encoding spreadsheets that improves language models' performance on table-processing tasks.
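
One idea behind such encoding methods is to compress repetitive sheets before handing them to a model. The following is a simplified sketch of an inverted-index encoding that maps each distinct cell value to the addresses containing it; it illustrates the general technique, not Microsoft's actual implementation:

```python
# Simplified sketch of one compression idea behind spreadsheet encodings
# such as SpreadsheetLLM: an inverted index mapping each distinct cell
# value to the addresses that contain it, so repetitive sheets cost fewer
# tokens. This illustrates the general technique, not Microsoft's code.
from collections import defaultdict

def inverted_index(cells: dict) -> dict:
    """cells maps addresses like 'A1' to values; returns value -> addresses."""
    index = defaultdict(list)
    for address, value in cells.items():
        if value:  # skip empty cells entirely
            index[str(value)].append(address)
    return dict(index)

sheet = {"A1": "Region", "B1": "Sales", "A2": "EU", "A3": "EU",
         "B2": 100, "B3": 100}
print(inverted_index(sheet))
# {'Region': ['A1'], 'Sales': ['B1'], 'EU': ['A2', 'A3'], '100': ['B2', 'B3']}
```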

Summary
  • Researchers at Beihang University have developed TableBench, a new benchmark for evaluating AI models at answering complex questions about tabular data.
  • When evaluating over 30 large language models on TableBench, even the best model, GPT-4o, achieved only about 54% of human performance.
  • At the same time, the researchers introduced TableInstruct, a training dataset of about 20,000 examples. They used it to train their own model, TABLELLM, which achieved performance comparable to GPT-3.5.