Researchers from Google DeepMind have developed a method called JEST that makes training AI models for image and text processing significantly more efficient.


Multimodal AI models learn to link images and texts by maximizing the correspondence of related image-text pairs and minimizing the correspondence of unrelated pairs. Traditionally, the training examples in each batch are either drawn at random or chosen according to their individual relevance.
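The training objective described above can be illustrated with a minimal sketch. It assumes a softmax-style contrastive loss computed in the image-to-text direction only, with toy two-dimensional embeddings; the function names and the temperature value are illustrative and not taken from the researchers' work.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Mean image-to-text softmax contrastive loss over one batch.

    Assumption: image_embs[i] and text_embs[i] form a related pair,
    and every other combination in the batch counts as unrelated.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [dot(image_embs[i], text_embs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Low loss when the matching text scores higher than the others.
        total += log_norm - logits[i]
    return total / n

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its image
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # each text matches the wrong image

loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_aligned < loss_swapped)
```

Because each example's loss is normalized over every other example in the batch, the loss of one pair depends on which other pairs it is trained with.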

However, the researchers argue that the quality of a batch depends not only on the sum of the individual data points but also on their composition. Therefore, they have developed an algorithm that selects subsets of data from a larger "super batch" based on their collective learnability.
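A toy sketch can show why batch composition matters. It assumes a hypothetical pairwise score matrix for a super batch of four candidates, where the diagonal holds each example's individual score and the off-diagonal entries describe how two examples interact when placed in the same batch. The greedy loop is a simplified stand-in for illustration, not necessarily the selection procedure the researchers use.

```python
def joint_score(scores, subset):
    """Sum of all pairwise scores inside the chosen sub-batch."""
    return sum(scores[i][j] for i in subset for j in subset)

def greedy_select(scores, k):
    """Greedily grow a sub-batch of size k that maximizes the joint score.

    Simplified illustration: at each step, add the candidate that
    raises the joint score of the current sub-batch the most.
    """
    chosen = []
    remaining = set(range(len(scores)))
    while len(chosen) < k:
        best = max(sorted(remaining),
                   key=lambda c: joint_score(scores, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical scores: candidates 0 and 1 look best individually
# but interact badly when they share a batch.
scores = [
    [3.0, -2.0, 0.0, 0.0],
    [-2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]

# Baseline: rank candidates by their individual (diagonal) score only.
top_individual = sorted(range(len(scores)),
                        key=lambda i: scores[i][i], reverse=True)[:2]
jointly_selected = greedy_select(scores, 2)

print(top_individual, joint_score(scores, top_individual))
print(jointly_selected, joint_score(scores, jointly_selected))
```

In this toy case, the two individually best candidates form a weaker batch than a pair chosen with their interaction in mind.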

JEST uses AI models for data selection

To determine which data is most learnable, JEST (Joint Example Selection Technique) uses two AI models: the model currently being trained and an already trained reference model. Data that is difficult for the model being trained but easy for the reference model is considered particularly useful.
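One simple way to express this criterion is as the gap between the two models' losses on an example. The sketch below assumes that formulation; the candidate descriptions and loss values are made up for illustration.

```python
def learnability(learner_loss, reference_loss):
    """High when the example is hard for the learner but easy for the reference."""
    return learner_loss - reference_loss

# Hypothetical (learner loss, reference loss) values for three kinds of data.
candidates = {
    "clean, not yet learned": (2.5, 0.3),  # hard for learner, easy for reference
    "already learned": (0.2, 0.2),         # easy for both models
    "noisy or mislabeled": (2.6, 2.4),     # hard for both models
}

# Rank the candidates from most to least learnable.
ranked = sorted(candidates,
                key=lambda name: learnability(*candidates[name]),
                reverse=True)

for name in ranked:
    print(name, round(learnability(*candidates[name]), 2))
```

Under this assumption, data that both models find hard, which is often noise, scores low, as does data the learner has already mastered.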


With this method, the team was able to shorten the training time for certain tasks by a factor of 13. At the same time, ten times less computing power was needed to achieve the same performance as with conventional methods.

According to the researchers, the choice of the reference model, which is pre-trained on a small, high-quality dataset, is crucial because its quality limits the potential improvements. When the team enlarged the reference dataset from 100 to 600 million examples while maintaining high quality, the results improved further.

Flexi-JEST achieves top score with 10 percent of training data

To reduce the additional computational effort of evaluating the "super batch," the scientists also introduced a variant called Flexi-JEST. It evaluates the data with a simplified version of the model at coarser image resolution, and trains the model at both full and reduced resolution in parallel.

With Flexi-JEST, a model achieved better average performance on eight standard tasks after 4 billion training examples than SigLIP, the current best model, after 40 billion examples. This corresponds to a saving of 90 percent of the computing operations.

According to the researchers, the results show the potential to learn from small, carefully curated datasets to filter much larger, unstructured amounts of data - a process they call "data quality bootstrapping." This could pave the way for more efficient AI models that require less computing power and training data.

Summary
  • Google DeepMind researchers have developed a method called JEST that makes training multimodal AI models for image and text processing more efficient by selecting subsets of data according to their joint learnability.
  • JEST uses two AI models - the model to be trained and a pre-trained reference model - to find out which data is particularly instructive. This reduces the training time by a factor of 13 and the required computing power by 90%.
  • The Flexi-JEST variant uses a simplified version of the model for data evaluation and achieves better performance than the current leading model with only 10% of the training data. The researchers see potential in learning from small, carefully curated datasets to filter large, unstructured amounts of data.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.