Researchers from Beihang University in China have developed a new dataset called TableBench to evaluate the performance of AI models when answering complex questions about tabular data. The benchmark reveals that even advanced systems perform significantly worse than humans in this area.
TableBench comprises 886 question-answer pairs across 18 categories, covering tasks such as fact checking, numerical calculations, data analysis, and visualization. The dataset aims to bridge the gap between academic benchmarks and real-world application scenarios: on average, a question requires 6.26 reasoning steps to answer, significantly more than in comparable datasets.
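To give a sense of what a multi-step table question involves, here is a minimal, hypothetical sketch in Python. The table, question, and reasoning steps are invented for illustration and are not taken from TableBench itself.

```python
# Hypothetical table-QA item: a small table plus a question that needs
# several reasoning steps (derive a value per row, then compare rows).
table = {
    "columns": ["Year", "Revenue (M$)", "Profit (M$)"],
    "rows": [
        [2021, 120, 15],
        [2022, 150, 22],
        [2023, 180, 30],
    ],
}

question = "In which year was the profit margin (profit / revenue) the highest?"

# Step 1: compute the profit margin for each year.
margins = {year: profit / revenue for year, revenue, profit in table["rows"]}

# Step 2: pick the year with the highest margin.
best_year = max(margins, key=margins.get)

print(best_year, round(margins[best_year], 3))  # -> 2023 0.167
```

Benchmark questions of this kind require a model to chain such steps together rather than simply look up a cell value.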
The research team evaluated more than 30 large language models on TableBench, including both open-source and proprietary systems. Even the powerful GPT-4o achieved only around 54% of human performance, highlighting how much room for improvement remains before AI models meet the requirements of real-world applications.
Microsoft is working on solutions
Alongside TableBench, the researchers also presented TableInstruct, a training dataset of around 20,000 examples. They used it to train their own model, TABLELLM, which achieved performance comparable to GPT-3.5.
Microsoft researchers have also recently introduced SpreadsheetLLM, a method that can improve the performance of language models in table processing.