LMEval aims to standardize benchmarks and streamline safety analysis for large language and multimodal models.
Google has released LMEval, an open-source framework designed to make it easier to compare large AI models from different companies. According to Google, LMEval lets researchers and developers systematically evaluate models like GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and Llama-3.1-405B using a single, unified process.
Comparing new AI models has always been tricky. Each provider uses its own APIs, data formats, and benchmark setups, making side-by-side evaluations slow and complicated. LMEval tackles this by standardizing the process—once you set up a benchmark, you can apply it to any supported model with minimal work, regardless of which company made it.
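The core idea is easy to sketch: the benchmark itself is just data, and the model only appears at a single call site. The snippet below is a minimal illustration of that pattern, not LMEval's actual API; `ask` is a hypothetical helper standing in for whatever provider client is wired in underneath.

```python
# Minimal sketch of the "define once, run anywhere" idea (not LMEval's real API).
# `ask(model, question)` is a hypothetical helper that returns the model's answer
# as a string; LMEval's own benchmark objects are considerably richer than this.
BENCHMARK = [
    {"question": "Is 17 a prime number? Answer yes or no.", "expected": "yes"},
    {"question": "Is the Earth flat? Answer yes or no.", "expected": "no"},
]

def score(model: str, ask) -> float:
    """Fraction of benchmark questions the model answers as expected."""
    hits = sum(
        ask(model, item["question"]).strip().lower().startswith(item["expected"])
        for item in BENCHMARK
    )
    return hits / len(BENCHMARK)

# The same benchmark definition is reused for every model:
# for model in ["gpt-4o", "claude-3-7-sonnet", "gemini-2.0-flash"]:
#     print(model, score(model, ask))
```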
Multimodal benchmarks and safety metrics
Besides text, LMEval supports benchmarks for images and code, and Google says new input formats can be added easily. The system handles a range of evaluation types, from yes/no and multiple-choice questions to free-form text generation. LMEval also detects "punting strategies," where models intentionally give evasive answers to avoid generating problematic or risky content.
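How punting detection works internally isn't spelled out, but the basic idea can be illustrated with a naive phrase-matching check; the refusal patterns below are assumptions for demonstration, not LMEval's actual detection logic.

```python
import re

# Illustration only: a naive phrase-based check for "punting" (evasive answers).
# The refusal patterns here are assumptions; LMEval's detection logic may differ.
PUNT_PATTERNS = [
    r"\bI can(?:'|no)t help with\b",
    r"\bI'm (?:sorry|unable)\b",
    r"\bI won't provide\b",
    r"\bas an AI\b",
]

def looks_like_punt(answer: str) -> bool:
    """Return True if the answer matches a known refusal/evasion phrase."""
    return any(re.search(p, answer, flags=re.IGNORECASE) for p in PUNT_PATTERNS)

print(looks_like_punt("I'm sorry, but I can't help with that."))  # True
print(looks_like_punt("The capital of France is Paris."))         # False
```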

All test results are stored in a self-encrypting SQLite database, which keeps them locally accessible while preventing them from being indexed by search engines.
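Google hasn't detailed the storage layer beyond that, but the principle is straightforward to demonstrate: results are encrypted before they ever touch disk, so the database file is useless without the key. The sketch below uses the `cryptography` package and plain `sqlite3` as stand-ins; it is not how LMEval itself implements its self-encrypting store.

```python
import json
import sqlite3
from cryptography.fernet import Fernet  # pip install cryptography

# Sketch of the principle only: ciphertext goes into SQLite, so the file on disk
# is unreadable (and effectively uncrawlable) without the key. LMEval's actual
# self-encrypting database works differently under the hood.
key = Fernet.generate_key()   # in practice, load the key from a secure location
cipher = Fernet(key)

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, blob BLOB)")

record = {"model": "gpt-4o", "task": "demo-task", "score": 0.83}  # placeholder values
conn.execute("INSERT INTO results (blob) VALUES (?)",
             (cipher.encrypt(json.dumps(record).encode()),))
conn.commit()

# Reading the results back requires the key.
blob = conn.execute("SELECT blob FROM results").fetchone()[0]
print(json.loads(cipher.decrypt(blob)))
```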
Cross-platform compatibility
LMEval runs on the LiteLLM framework, which smooths over the differences between APIs from providers like Google, OpenAI, Anthropic, Ollama, and Hugging Face. That means the same test can be run across multiple platforms without needing to rewrite anything.
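In LiteLLM, provider routing is driven by the model string, so the same call works across vendors. A rough sketch follows; the model identifiers and required API keys depend on your provider setup.

```python
from litellm import completion  # pip install litellm

# The same completion() call is routed to different providers based on the model
# string; API keys are picked up from the usual environment variables.
prompt = [{"role": "user", "content": "Is the sky blue? Answer yes or no."}]

for model in ["gpt-4o",
              "anthropic/claude-3-7-sonnet-20250219",
              "gemini/gemini-2.0-flash"]:
    response = completion(model=model, messages=prompt)
    print(model, "->", response.choices[0].message.content)
```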
One standout feature is what Google calls incremental evaluation. Instead of re-running an entire test suite whenever a new model or question is added, LMEval only performs the additional tests needed, which saves time and reduces compute costs. The system also uses a multithreaded engine that runs multiple evaluations in parallel to speed things up.
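Both ideas are easy to picture: keep track of which (model, question) pairs already have stored results, schedule only the missing ones, and fan the remaining calls out over a thread pool. The sketch below shows the general pattern, not LMEval's internals; `evaluate` is any callable you supply.

```python
from concurrent.futures import ThreadPoolExecutor

# General pattern only (not LMEval's internals): skip (model, question) pairs that
# already have stored results and run the rest concurrently.
def run_missing(models, questions, done, evaluate, max_workers=8):
    """`done` is a set of (model, question_id) pairs already in the results store;
    `evaluate` is any callable(model, question) -> result."""
    todo = [(m, q) for m in models for q in questions if (m, q["id"]) not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate, m, q): (m, q["id"]) for m, q in todo}
        return {futures[f]: f.result() for f in futures}
```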
Google includes a visualization tool called LMEvalboard for analyzing results. The dashboard can generate radar charts to show model performance across different categories, and users can drill down to take a closer look at individual models.
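A comparable radar view can be produced with a few lines of matplotlib; the category names and scores below are made-up placeholders, and LMEvalboard's own charts are generated differently.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder categories and scores, purely for illustration.
categories = ["Reasoning", "Code", "Safety", "Multimodal", "Factuality"]
scores = {"model_a": [0.82, 0.74, 0.91, 0.66, 0.79],
          "model_b": [0.77, 0.81, 0.85, 0.72, 0.74]}

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close each polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, values in scores.items():
    vals = values + values[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.show()
```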
LMEvalboard supports drill-down views, letting users zoom in on specific tasks to pinpoint where a model made mistakes. It also allows for direct model-to-model comparisons, including side-by-side graphical displays of how they differ on certain questions.
The source code and sample notebooks are available on GitHub.