Answer.AI and LightOn announced ModernBERT, a new open-source language model that improves upon Google's BERT in speed, efficiency, and quality.
The encoder-only model processes text up to four times faster than its predecessor while using less memory, according to a blog post from the developers. The team trained ModernBERT on 2 trillion tokens from web documents, programming code, and scientific articles.
ModernBERT can handle texts up to 8,192 tokens long—16 times more than the typical 512-token limit of existing encoder models. It's also the first encoder model trained extensively on programming code. The model scored above 80 on the StackOverflow Q&A dataset, setting a record for encoder-only models.
The developers liken ModernBERT to a Honda Civic tuned for the racetrack: "When you get on the highway, you generally don’t go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit."
Major cost reductions for large-scale text processing
While large language models such as GPT-4 cost several cents per query and take seconds to respond, ModernBERT runs locally and is far faster and cheaper, according to the developers.
For example, filtering 15 trillion tokens in the FineWeb Edu project cost $60,000 using a BERT-based model. The same task would have cost more than $1 million even with Google's Gemini Flash, the cheapest decoder-based option.
The developers say ModernBERT is well-suited for many real-world applications, from retrieval-augmented generation (RAG) systems to code search and content moderation. Unlike GPT-4, which needs specialized hardware, the model runs effectively on consumer-grade gaming GPUs.
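For retrieval use cases such as RAG, the basic mechanics look like the sketch below: encode passages with the model's final hidden states, mean-pool them into vectors, and rank by cosine similarity. This is only an illustration of the pattern, not the developers' reference pipeline; in practice, retrieval quality depends on an embedding-tuned variant, and the model ID used here is the base checkpoint published on Hugging Face, which requires a recent version of the transformers library.

```python
# Illustrative sketch: mean-pooled ModernBERT embeddings for a toy retrieval
# step, the kind of component a RAG system would use. Real systems typically
# rely on a checkpoint fine-tuned for embeddings; this only shows the mechanics.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"  # published base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # zero out padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

docs = ["ModernBERT supports sequences up to 8,192 tokens.",
        "The Honda Civic is a compact car."]
scores = embed(["What is the model's context length?"]) @ embed(docs).T
print(docs[scores.argmax()])  # cosine similarity picks the relevant passage
```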
ModernBERT is available in two versions: a base model with 139 million parameters and a large version with 395 million parameters. Both models are now on Hugging Face with an Apache 2.0 license, and users can drop them in as direct replacements for their current BERT models. The team plans to release a larger version next year but has no plans for multimodal capabilities.
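To give a sense of the drop-in workflow the developers describe, the sketch below loads ModernBERT as a masked-language-model pipeline through the Hugging Face transformers library, exactly where a BERT model ID would previously have gone. The model ID follows the published checkpoints; check the model card for the minimum transformers version.

```python
# Minimal sketch of swapping ModernBERT into an existing BERT fill-mask
# pipeline. Assumes a transformers release recent enough to include
# ModernBERT support.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",  # drop-in where a BERT ID used to be
)

for prediction in fill_mask("Paris is the [MASK] of France."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```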
To encourage new applications, the developers launched a competition that will award $100 and a six-month Hugging Face Pro subscription to the creators of each of the five best demos.
Google introduced BERT (Bidirectional Encoder Representations from Transformers) in 2018, using it primarily for Google Search. The model remains one of the most popular on Hugging Face, with more than 68 million monthly downloads.