
Researchers from Beihang University in China have developed a new dataset called TableBench to evaluate the performance of AI models when answering complex questions about tabular data. The benchmark reveals that even advanced systems perform significantly worse than humans in this area.

TableBench comprises 886 question-answer pairs across 18 subcategories within four major task types: fact checking, numerical reasoning, data analysis, and visualization. The dataset aims to bridge the gap between academic benchmarks and real-world application scenarios: answering a question requires an average of 6.26 reasoning steps, significantly more than comparable datasets demand.
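
For illustration, here is a minimal sketch of what a TableBench-style record might look like, written in Python. The field names and values are assumptions for this article, not the dataset's actual schema:

```python
# Hypothetical sketch of a TableBench-style record. Field names and values
# are illustrative assumptions, not the dataset's actual schema.
example = {
    "table": {
        "columns": ["Country", "GDP (trillion USD)", "Population (millions)"],
        "rows": [
            ["USA", 25.4, 333],
            ["China", 17.9, 1412],
            ["Germany", 4.1, 84],
        ],
    },
    "category": "NumericalReasoning",  # one of the 18 subcategories (assumed name)
    "question": "What is Germany's GDP per capita in USD?",
    # 4.1 trillion / 84 million ≈ 48,800 USD
    "answer": "about 48,800 USD",
    "reasoning_steps": 3,  # locate row, divide, round
}
```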

Image: Wu, Yang et al.

The research team evaluated over 30 large language models on TableBench, including both open-source and proprietary systems. Even the powerful GPT-4o model reached only around 54% of human performance, highlighting how far AI models still are from meeting the requirements of real-world applications.
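
The headline figure is a simple ratio: a model's benchmark score divided by the human score. A minimal sketch, with illustrative placeholder numbers rather than the paper's actual scores:

```python
# Minimal sketch of the comparison described above: a model's score as a
# percentage of human performance. The scores below are illustrative
# placeholders, not figures from the paper.
def relative_to_human(model_score: float, human_score: float) -> float:
    return 100 * model_score / human_score

print(f"{relative_to_human(46.4, 85.9):.0f}% of human performance")  # -> 54%
```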

Microsoft is working on solutions

Alongside TableBench, the researchers also presented TableInstruct, a training dataset with around 20,000 examples. They used it to train their own model, TABLELLM, which achieved performance comparable to GPT-3.5.
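
Instruction-tuning datasets like TableInstruct pair a table and a question with a target answer. Below is a minimal sketch of how records shaped like the example above could be formatted into prompt-completion pairs for supervised fine-tuning; the template and field names are assumptions, not the paper's actual format:

```python
# Minimal sketch of turning a TableBench/TableInstruct-style record (shaped
# like the example above) into a prompt-completion pair for supervised
# fine-tuning. The template is an assumption, not the paper's format.
def to_training_pair(record: dict) -> dict:
    header_and_rows = [record["table"]["columns"], *record["table"]["rows"]]
    table_text = "\n".join(
        " | ".join(str(cell) for cell in row) for row in header_and_rows
    )
    return {
        "prompt": (
            "Answer the question using the table below.\n\n"
            f"{table_text}\n\n"
            f"Question: {record['question']}\nAnswer:"
        ),
        "completion": record["answer"],
    }
```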

Microsoft researchers have also recently introduced SpreadsheetLLM, a method for encoding spreadsheets that improves language models' performance on table-processing tasks.
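
One idea behind such encoding methods is to compress repetitive sheets before handing them to a model. The following is a simplified sketch of an inverted-index encoding that maps each distinct cell value to the addresses containing it; it illustrates the general technique, not Microsoft's actual implementation:

```python
# Simplified sketch of one compression idea behind spreadsheet encodings
# such as SpreadsheetLLM: an inverted index mapping each distinct cell
# value to the addresses that contain it, so repetitive sheets cost fewer
# tokens. This illustrates the general technique, not Microsoft's code.
from collections import defaultdict

def inverted_index(cells: dict) -> dict:
    """cells maps addresses like 'A1' to values; returns value -> addresses."""
    index = defaultdict(list)
    for address, value in cells.items():
        if value:  # skip empty cells entirely
            index[str(value)].append(address)
    return dict(index)

sheet = {"A1": "Region", "B1": "Sales", "A2": "EU", "A3": "EU",
         "B2": 100, "B3": 100}
print(inverted_index(sheet))
# {'Region': ['A1'], 'Sales': ['B1'], 'EU': ['A2', 'A3'], '100': ['B2', 'B3']}
```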

Summary
  • Researchers at Beihang University have developed TableBench, a new benchmark for evaluating AI models at answering complex questions about tabular data.
  • When evaluating over 30 large language models on TableBench, even the best model, GPT-4o, achieved only about 54% of human performance.
  • At the same time, the researchers introduced TableInstruct, a training dataset of about 20,000 examples. They used it to train their own model, TABLELLM, which achieved performance comparable to GPT-3.5.