AI language models struggle to connect the dots in long texts, study finds

The latest generation of AI language models hits its limits when connecting information across long texts and drawing conclusions, according to new research from LMU Munich, the Munich Center for Machine Learning, and Adobe Research.

The team tested 12 leading models, including GPT-4o, Gemini 1.5 Pro, and Llama-3.3-70B, all of which support context windows of at least 128,000 tokens.

Models fail when word-matching isn't an option

The NOLIMA (No Literal Matching) benchmark tests how well AI models can link information and draw conclusions without relying on matching words. The test uses questions and text passages crafted to avoid shared vocabulary, forcing models to understand concepts and make connections.

Here's how it works: A text might include "Yuki actually lives next to the Semperoper." The related question would be: "Which character has already been to Dresden?" To answer correctly, the model needs to understand that the Semperoper is in Dresden, identifying Yuki as the answer.
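The "no literal matching" idea can be illustrated with a small check. This is a minimal sketch, not the researchers' code: the stop-word list, the function names, and the `place_needle` helper are assumptions made up for illustration. It verifies that the question and the key sentence share no content words, and shows how such a sentence could be placed at a chosen depth in a longer filler text.

```python
import re

# Small, hypothetical stop-word list; a real check would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "to", "in", "of", "has", "is", "been",
    "which", "next", "already", "actually",
}


def content_words(text: str) -> set[str]:
    """Lowercase the text and return its words minus stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def literal_overlap(question: str, needle: str) -> set[str]:
    """Return the content words shared by the question and the key sentence."""
    return content_words(question) & content_words(needle)


def place_needle(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the key sentence at a relative depth (0.0 = start, 1.0 = end)."""
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


needle = "Yuki actually lives next to the Semperoper."
question = "Which character has already been to Dresden?"
# A literal-match variant of the question, for contrast (made up here).
literal_question = "Which character lives next to the Semperoper?"

print(literal_overlap(question, needle))          # no shared content words
print(literal_overlap(literal_question, needle))  # shares "lives" and "semperoper"
```

With the literal variant of the question, simple word matching is enough to locate the sentence. With the article's question, the model has to know that the Semperoper is in Dresden.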

Comparison table: performance of 12 language models, with base scores, effective lengths, and performance at various context lengths. The NOLIMA benchmark results reveal clear differences in performance between language models. GPT-4o impresses with the highest effective context length of 8K, while smaller models drop off sharply with longer sequences. | Image: Modarressi et al.

The results show models struggling as text length increases. Performance drops significantly between 2,000 and 8,000 tokens. At 32,000 tokens, 10 of the 12 models fall below half of the scores they achieve on short texts.
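An "effective length" of this kind can be computed from per-length scores. The sketch below assumes it means the longest tested context at which a model's score stays at or above a fixed fraction of its short-context base score. The 0.85 threshold is an assumption, and the scores are made-up illustrative numbers, not results from the study.

```python
def effective_length(base_score, scores_by_length, threshold_ratio=0.85):
    """Return the longest context length reached before the score first
    drops below threshold_ratio * base_score, or None if even the
    shortest length misses the threshold."""
    threshold = threshold_ratio * base_score
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] >= threshold:
            effective = length
        else:
            break  # stop at the first length that misses the threshold
    return effective


def retained_fraction(base_score, score):
    """Share of the short-context base score that survives at a given length."""
    return score / base_score


# Hypothetical scores (percent accuracy) for one model, keyed by context length.
base = 98.0
scores = {1000: 97.0, 2000: 95.0, 4000: 90.0, 8000: 84.0, 16000: 70.0, 32000: 45.0}

print(effective_length(base, scores))               # longest length above the threshold
print(retained_fraction(base, scores[32000]) < 0.5)  # True if below half the base score
```

Stopping at the first miss is a design choice: it keeps the metric from rewarding a model that happens to recover at a single longer length.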

Even specialized reasoning models fall short

The researchers point to limitations in the models' basic attention mechanism, which gets overwhelmed by longer contexts. Without word-matching clues, models struggle to find and connect relevant information.

Performance drops further when more reasoning steps (latent hops) are needed. The order of information matters too: models perform worse when the answer comes after the key information.

The team also created NOLIMA-Hard, featuring the ten toughest question-answer pairs, to test specialized reasoning models. Even purpose-built systems like o1, o3-mini, and DeepSeek-R1 score below 50 percent with 32,000-token contexts, despite near-perfect performance on shorter texts.

Chain-of-thought (CoT) prompting helps Llama-3.3-70B handle longer contexts better, but doesn't solve the core problem. While word matches make the task easier, they can actually hurt performance when they appear as distractors in irrelevant passages.

Comparison table: clear performance drop for Llama 3.3 and reasoning models as context length increases, with scores below 50 percent marked in red. The performance of all tested models drops dramatically with increasing context length. Even the best model, o1, loses almost 70 percent of its original performance at 32K. Although the chain-of-thought (CoT) method slightly improves the results of Llama-3.3-70B, it cannot prevent the sharp drop in performance. | Image: Modarressi et al.

This weakness could affect real-world applications, such as search engines built on retrieval-augmented generation (RAG). Even when a document contains the right answer, the model might miss it if the wording doesn't exactly match the query, and instead get distracted by surface-level matches in less relevant texts.
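The distraction effect is easy to reproduce with a purely lexical scorer. The following toy sketch is an illustration under stated assumptions, not how any particular search engine or model works: `overlap_score` is a hypothetical scoring function, and the distractor sentence is invented for the example. It ranks two passages by the share of query words they contain.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Fraction of query words that also appear in the passage."""
    query_tokens = tokens(query)
    return len(query_tokens & tokens(passage)) / len(query_tokens)


query = "Which character has already been to Dresden?"
passages = {
    # Contains the answer, but shares almost no words with the query.
    "relevant": "Yuki actually lives next to the Semperoper.",
    # Hypothetical distractor: shares surface words, answers nothing.
    "distractor": "Which travel guide has the best chapter on Dresden?",
}

# Rank passage names from highest to lowest lexical overlap with the query.
ranked = sorted(passages, key=lambda name: overlap_score(query, passages[name]), reverse=True)
print(ranked)
```

The distractor outranks the passage that actually contains the answer, because the scorer only sees shared words. Connecting "Semperoper" to "Dresden" requires knowledge that word overlap cannot capture.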

NOLIMA as the new context window metric?

While recent months haven't seen major breakthroughs in foundation models, companies have focused on improving reasoning capabilities and expanding context windows. Gemini 1.5 Pro currently leads with a two-million-token context window.

As context windows grew, from GPT-3.5's 4,096 tokens to GPT-4's 8,000, models initially struggled even with extracting basic word sequences. They later showed improvement in the needle-in-a-haystack (NIAH) benchmark results published by their makers.

NOLIMA could become a new standard for measuring how effectively models handle large context windows, potentially guiding future LLM development. Previous research suggests there's still significant room for improvement in this area.
