
Microsoft researchers have developed SpreadsheetLLM, a method to optimize language models for analyzing spreadsheets.

The researchers explain that conventional spreadsheets are often too large and complex for AI models. SpreadsheetLLM solves this problem by converting the data into a more compact format, potentially making language models useful for many scientific and financial applications.

The approach reduces the amount of data by up to 96 percent without losing important information, according to the team. This allows AI systems to analyze very large spreadsheets, which was not possible before.
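As a quick arithmetic check on that figure: a 96 percent reduction leaves 4 percent of the original, which corresponds to a compression ratio of 25 to 1. The short sketch below makes the conversion explicit. The function names and the 100,000-token example are illustrative assumptions, not figures from the researchers.

```python
def compression_ratio(reduction_percent: float) -> float:
    """Convert a size reduction in percent into an original:compressed ratio."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    # 96 percent removed means 4 percent remains: 100 / 4 = 25.
    return 100 / (100 - reduction_percent)


def tokens_after(original_tokens: int, reduction_percent: float) -> int:
    """Estimate how many tokens remain after the given reduction."""
    return round(original_tokens * (100 - reduction_percent) / 100)


# Hypothetical example: a sheet that would need 100,000 tokens uncompressed.
ratio = compression_ratio(96)
remaining = tokens_after(100_000, 96)
```

Under these assumptions, a spreadsheet that previously overflowed a model's context window by a wide margin could fit comfortably once encoded.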

The new method is based on three main techniques:

  • Structural Anchors: Identifies heterogeneous rows and columns at potential table boundaries, removes distant, homogeneous rows and columns and creates a condensed "skeleton" version of the spreadsheet for better layout insights.
  • Inverted-Index Translation: Replaces traditional row-by-column serialization with a JSON-format inverted-index translation. Creates a dictionary indexing non-empty cell texts and merges addresses with identical text to optimize token usage while maintaining data integrity.
  • Data Format Aggregation: Extracts number format strings and data types from adjacent numerical cells and clusters cells with similar formats or types together.
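The inverted-index idea from the list above can be illustrated with a minimal sketch. It assumes the sheet is already available as a mapping from cell address to cell text. The function name `inverted_index` and the sample data are hypothetical, and any further merging of neighboring addresses into ranges is left out.

```python
import json
from collections import defaultdict


def inverted_index(cells: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct non-empty cell text to the addresses holding it."""
    index: dict[str, list[str]] = defaultdict(list)
    for address, text in cells.items():
        if text != "":  # empty cells are skipped entirely
            index[text].append(address)
    return dict(index)


# Hypothetical sample sheet: address -> text.
sheet = {
    "A1": "Region", "B1": "Status",
    "A2": "North", "B2": "Open",
    "A3": "South", "B3": "Open",
    "A4": "", "B4": "Open",
}

# JSON serialization of the index; "Open" is stored once with three addresses.
encoded = json.dumps(inverted_index(sheet))
```

Because the repeated value "Open" appears only once as a key, and the empty cell A4 is not emitted at all, the serialized form grows with the number of distinct values rather than with the size of the grid.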

Using these techniques, the system captures the essential information of a spreadsheet without needing to store every single cell.
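To make the format aggregation step more concrete, here is a rough sketch that collapses runs of adjacent cells sharing a coarse type into a single range. The regular expressions stand in for the number format strings the researchers extract; they are an assumption for illustration, as are the names `cell_type` and `aggregate_column`.

```python
import re
from itertools import groupby

# Hypothetical coarse type rules, checked in order; not the researchers' rule set.
PATTERNS = [
    ("Percentage", re.compile(r"^-?\d+(\.\d+)?%$")),
    ("Date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("Integer", re.compile(r"^-?\d+$")),
    ("Float", re.compile(r"^-?\d+\.\d+$")),
]


def cell_type(text: str) -> str:
    """Return the first matching coarse type, or 'Text' if none matches."""
    for name, pattern in PATTERNS:
        if pattern.match(text):
            return name
    return "Text"


def aggregate_column(column: str, values: list[str], first_row: int = 1) -> list[tuple[str, str]]:
    """Collapse runs of adjacent same-type cells into (range, type) pairs."""
    result = []
    row = first_row
    for kind, run in groupby(values, key=cell_type):
        length = len(list(run))
        last = row + length - 1
        rng = f"{column}{row}" if length == 1 else f"{column}{row}:{column}{last}"
        result.append((rng, kind))
        row = last + 1
    return result


summary = aggregate_column("C", ["Growth", "12%", "7.5%", "-3%", "2024-01-31", "42"])
```

In this sketch the three percentage cells are described by one entry, `("C2:C4", "Percentage")`, so the model sees the shape of the data without every individual number.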

SpreadsheetLLM improves accuracy by up to 75 percentage points

The researchers tested their method with various AI models, including OpenAI's GPT-4 and open-source models like Llama 2. In the task of recognizing tables in spreadsheets, the system achieved an accuracy of 79 percent, an improvement of 13 percentage points over the previous best score.

The advantage of the new method was particularly evident with very large spreadsheets. For the largest files tested, accuracy improved by 75 percentage points compared to conventional techniques, as the token limits of the language models were no longer exceeded.

The researchers also developed a technique called "Chain of Spreadsheet" (CoS) to answer complex queries about spreadsheets. This divides the task into two steps: First, the system identifies the relevant table area, then it generates the answer. Using this method, the system achieved 74 percent accuracy on question-answering tasks related to spreadsheets.
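The two-step flow can be sketched as follows. This is only an outline under stated assumptions: `llm` is a hypothetical callable that takes a prompt and returns text, the prompts are invented for illustration, and the relevant area is simplified to a row span written as "first-last". None of this is taken from the researchers' implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns text


def crop(cells: dict[str, str], rows: range) -> dict[str, str]:
    """Keep only the cells whose row number falls inside `rows`."""
    def row_of(address: str) -> int:
        return int("".join(ch for ch in address if ch.isdigit()))
    return {addr: text for addr, text in cells.items() if row_of(addr) in rows}


def chain_of_spreadsheet(llm: LLM, cells: dict[str, str], question: str) -> str:
    # Step 1: ask the model which rows hold the relevant table.
    reply = llm(
        f"Cells: {cells}\nQuestion: {question}\n"
        "Reply with the relevant rows as 'first-last'."
    )
    first, last = (int(part) for part in reply.strip().split("-"))
    region = crop(cells, range(first, last + 1))
    # Step 2: answer the question using only the identified region.
    return llm(f"Cells: {region}\nQuestion: {question}\nAnswer concisely.")
```

Splitting the task this way means the second prompt carries only the cells of the identified table, which keeps unrelated parts of the sheet out of the answer step.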

The scientists acknowledge that their method still has limitations. Currently, formatting details such as background colors, which could provide additional information, are not considered. The researchers also see room for improvement in the semantic condensation of text cells.

Summary
  • Microsoft researchers have developed SpreadsheetLLM, a method that optimizes language models for analyzing spreadsheets by converting data into a more compact format, reducing the amount of data by up to 96 percent without losing important information.
  • The approach uses three main techniques: Structural Anchors to create a condensed "skeleton" version of the spreadsheet, Inverted-Index Translation to optimize token usage, and Data Format Aggregation to cluster cells with similar formats or types together.
  • In tests, SpreadsheetLLM improved accuracy by up to 75 percentage points for the largest spreadsheets and achieved 79 percent accuracy in recognizing tables, outperforming previous methods. The researchers also developed a "Chain of Spreadsheet" technique for answering complex queries about spreadsheets.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.