AI in practice

Supersimple fine-tunes GPT-4 and sees diminishing returns in performance leaps

Matthias Bastian

Midjourney prompted by THE DECODER

Data analysis platform Supersimple recently fine-tuned GPT-4 and, despite measurable gains, finds the returns somewhat underwhelming.

Supersimple, a data analysis platform that allows users to perform complex ad-hoc queries in natural language, received early access to OpenAI's GPT-4 fine-tuning API a few weeks ago.

The company uses large language models like GPT-3.5 and GPT-4 to answer user queries in natural language. Each model was fine-tuned for three epochs on a proprietary dataset of tens of millions of tokens consisting of question-answer examples.
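Supersimple has not published its training pipeline, but with OpenAI's fine-tuning API such a job boils down to uploading a JSONL file of chat examples and launching a job with the desired number of epochs. The file name, example content, and GPT-4 model identifier below are assumptions for illustration:

```python
# Minimal sketch of a chat fine-tuning job via OpenAI's Python SDK.
# File name, example content, and the GPT-4 model identifier are
# assumptions; GPT-4 fine-tuning was only available via early access.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Training data is JSONL, one chat example per line, e.g.:
# {"messages": [
#   {"role": "system", "content": "Translate the question into the DSL."},
#   {"role": "user", "content": "Weekly signups by plan?"},
#   {"role": "assistant", "content": "<DSL program>"}]}
training_file = client.files.create(
    file=open("qa_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4-0613",  # assumed identifier
    hyperparameters={"n_epochs": 3},  # three epochs, as described above
)
print(job.id, job.status)
```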

The models output Supersimple's own domain-specific language (DSL), which is then compiled into JSON and database queries. Unlike text-to-SQL approaches, the output is an explainable, no-code exploration that interacts directly with the data platform and can be easily edited.

Video: Supersimple

The complex output is broken down into individual blocks representing logical steps in the thought process. The complexity of creating correct SQL queries is shifted to the platform. When generating the output, the models also consider existing dashboards and user-defined concepts.
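The DSL itself is proprietary and unpublished, so the following is purely a hypothetical sketch of the general pattern: each step of the model's output becomes one inspectable, editable JSON block, and only the compiled plan ever touches the database.

```python
# Hypothetical illustration: Supersimple's actual DSL and compiler are
# not public. This sketch only shows the pattern of turning a
# block-structured plan into JSON before it becomes a database query.
import json

def compile_dsl(dsl_text: str) -> str:
    """Turn one DSL step per line into a list of JSON blocks."""
    blocks = []
    for line in dsl_text.strip().splitlines():
        op, _, args = line.strip().partition(" ")
        # Each block is one logical step the user can inspect and edit.
        blocks.append({"op": op, "args": args})
    return json.dumps({"steps": blocks}, indent=2)

plan = """
filter signups where plan = 'pro'
group_by week
aggregate count
"""
print(compile_dsl(plan))
```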

Fine-tuning GPT-4 yields smaller gains than fine-tuning GPT-3.5

A comparison of OpenAI models shows that while the fine-tuned GPT-4 outperforms GPT-3.5 by 56 percent, the performance jump is smaller than the one from GPT-3 to GPT-3.5. Nevertheless, the fine-tuned GPT-4 significantly outperforms both the standard GPT-4 and GPT-3.5.

Image: Supersimple

Despite these improvements, the fine-tuned GPT-4 still showed weaknesses on broad, open-ended questions that were meant to be resolved in a single answer, according to Supersimple's internal benchmark of 100 different questions.

Image: Supersimple

"Worryingly, there is an observable trend of diminishing returns from fine-tuning. While fine-tuned Davinci showed marked improvement over its base model, fine-tuned GPT-3.5 offered lesser gains, and the progress achieved by fine-tuning GPT-4 was even smaller," the company writes.

The main drawbacks of the fine-tuned GPT-4, according to Supersimple, are latency, at roughly six times that of GPT-3.5, and cost: inference is 15 times and training 11 times more expensive than with GPT-3.5.

Image: Supersimple

To overcome these limitations, Supersimple rarely relies on a single model call in production. Instead, it uses a mix of specialized models, prompts, and heuristics to improve both accuracy and response time.

Partly because of this latency, Supersimple uses GPT-4 only for a specific subset of questions and for some of the most critical reasoning steps. Other models, such as GPT-3.5, handle the rest.
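How exactly this routing works is not disclosed; a minimal sketch of the pattern, with an assumed difficulty heuristic and placeholder model names, might look like this:

```python
# Hedged sketch of model routing: send only a hard subset of questions
# to the (slower, pricier) fine-tuned GPT-4 and the rest to GPT-3.5.
# The heuristic and model identifiers are assumptions, not Supersimple's
# actual production logic.
from openai import OpenAI

client = OpenAI()

def is_hard_question(question: str) -> bool:
    # Placeholder heuristic; a real router might use a classifier,
    # prompt features, or per-step criteria.
    return len(question.split()) > 25 or "why" in question.lower()

def answer(question: str) -> str:
    model = (
        "ft:gpt-4-0613:acme::abc123"  # hypothetical fine-tune ID
        if is_hard_question(question)
        else "gpt-3.5-turbo"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```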

For many real-world applications that require non-trivial reasoning skills, a single model with a single answer is not sufficient, and it is important for an AI to explain its results accurately to the user, the company says.
