Are we running out of AI benchmarks?

Benchmarks are an important way to measure progress in AI research - but artificial intelligence is constantly achieving new bests. Are we running out of AI benchmarks?

Artificial Intelligence benchmarks are closely linked to AI research: They create measurability and comparability, and often even become a research goal. AI benchmarks thus steer - at least in part - AI progress.

The successes of artificial neural networks in the ImageNet benchmark, for example, are considered to have triggered the AI boom of recent years: the results demonstrated the potential of the technology and led to increased investment in further research. Benchmarks are consequently a key element in the further development of AI.

What to do when we run out of benchmarks?

Today, the ImageNet benchmark still plays a central role in research: new models such as Google's Vision Transformer, which rely on Transformer architectures and self-supervised learning for image analysis, are still measured against earlier ImageNet results.

This reliance on benchmarks to determine progress becomes a problem when models consistently reach top scores on a benchmark and no high-quality successor is available.

An example of the rapid pace of AI research came from Google and Microsoft in early 2021. Researchers from DeepMind and Facebook, among others, had unveiled the SuperGLUE language benchmark in August 2019 to replace the already outdated GLUE benchmark.

Less than a year and a half later, AI systems from Google and Microsoft achieved top scores that even surpassed the human baseline. The SuperGLUE benchmark has thus become obsolete.

33 percent of AI benchmarks are not being used

Researchers at the Medical University of Vienna and the University of Oxford now show in a meta-study of AI benchmarks that saturated or stagnant benchmarks are common. The researchers examined 1,688 benchmarks with 406 tasks in computer vision and natural language processing since 2013, and draw the following conclusions:

In some cases, there is continuous growth, such as in the ImageNet benchmark.
However, a majority of all benchmarks quickly reach technological stagnation or saturation.
In some cases, a lack of research interest is also a cause of stagnation. The researchers cite the UCF101 action recognition benchmark as an example of saturation.
However, the dynamics of performance improvement do not follow a clearly discernible pattern: in some cases, phases of stagnation are followed by unpredictable leaps. This is what happened in the PROTEINS benchmark.

The researchers identify three patterns: steady growth, stagnation or saturation, and stagnation followed by a jump. | Image: Barbosa-Silva et al.
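The three patterns can be illustrated with a small Python sketch. This is not the method used in the study: the function name, the thresholds `flat_tol` and `jump`, and the example scores are all hypothetical, and the sketch assumes a chronologically ordered list of top scores where higher is better.

```python
from typing import List


def classify_trajectory(sota: List[float], flat_tol: float = 0.5, jump: float = 3.0) -> str:
    """Label a chronologically ordered series of top scores on one benchmark.

    flat_tol and jump are illustrative thresholds in score points,
    not values taken from the study.
    """
    if len(sota) < 3:
        return "too few results"

    # Improvement from each result to the next one.
    gains = [later - earlier for earlier, later in zip(sota, sota[1:])]

    # A nearly flat step that is later followed by a large gain.
    for i, gain in enumerate(gains):
        if gain <= flat_tol and any(g >= jump for g in gains[i + 1:]):
            return "stagnation followed by a jump"

    # The two most recent steps brought almost no improvement.
    if all(g <= flat_tol for g in gains[-2:]):
        return "stagnation or saturation"

    return "steady growth"


# Hypothetical score histories, one per pattern.
print(classify_trajectory([60, 65, 70, 75, 80]))
print(classify_trajectory([90, 95, 95.2, 95.3]))
print(classify_trajectory([70, 70.2, 70.3, 78, 80]))
```

A real analysis would have to normalize scores across different metrics first, since a gain of three points means something different for accuracy than for, say, an error rate.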

Moreover, of the 1,688 benchmarks, only 66 percent have more than three results at different points in time - so in practice, roughly a third of all AI benchmarks are hardly used and therefore of little value. According to the researchers, this reflects a trend of recent years: benchmarks tend to be dominated by datasets from established institutions and companies.
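How such a usage share could be counted is sketched below in Python. The data structure, the benchmark names, and the function `share_with_enough_results` are assumptions for illustration only; the study's actual data and procedure are not reproduced here.

```python
from datetime import date
from typing import Dict, List


def share_with_enough_results(results: Dict[str, List[date]], min_points: int = 3) -> float:
    """Fraction of benchmarks with more than `min_points` distinct result dates."""
    if not results:
        return 0.0
    active = sum(
        1 for dates in results.values()
        if len(set(dates)) > min_points  # duplicates on the same day count once
    )
    return active / len(results)


# Hypothetical result dates for three made-up benchmarks.
example = {
    "benchmark_a": [date(2015, 1, 1), date(2016, 3, 1), date(2017, 6, 1), date(2019, 2, 1)],
    "benchmark_b": [date(2018, 5, 1), date(2018, 5, 1)],
    "benchmark_c": [date(2014, 1, 1), date(2015, 1, 1), date(2016, 1, 1), date(2018, 1, 1)],
}

share = share_with_enough_results(example)
print(f"{share:.0%} of the example benchmarks have more than three dated results")
```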

Quality over quantity for AI benchmarks

While benchmark successes in computer vision dominated the first half of the last decade, the second half saw a boom in natural language processing, according to the researchers.

In 2020, the number of new benchmarks declined and new tests increasingly focused on tasks with a higher level of difficulty, e.g., those that test reasoning. Examples of such benchmarks are Google's BIG-bench and FAIR's NetHack Challenge.

On the one hand, the trend toward benchmarks from established institutions, including industry, raises concerns about the bias and representativeness of benchmarks. On the other hand, criticism of the validity of many benchmarks for capturing the performance of AI systems under real-world conditions suggests that the development of fewer but quality-assured benchmarks covering multiple AI capabilities may be desirable.

In the future, new benchmarks should be developed by large, collaborative teams from many institutions, knowledge domains, and cultures to ensure high-quality benchmarks and avoid fragmentation of the benchmark landscape, the researchers conclude.
