
A new study of how language models evaluate each other has uncovered a troubling pattern: as these systems become more sophisticated, they're increasingly likely to share the same blind spots.


Researchers from institutions in Tübingen, Hyderabad, and Stanford have developed a new measurement tool called CAPA (Chance Adjusted Probabilistic Agreement) to track how language models overlap in their errors beyond what you'd expect from their accuracy rates alone. Their findings suggest language models tend to favor other LLMs that make mistakes similar to their own.
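To give a sense of what "agreement beyond chance" means, here is a minimal kappa-style sketch of a chance-adjusted agreement score for multiple-choice predictions. The function name, the independence assumption, and the toy data are mine; the paper's actual CAPA metric is probabilistic and handles answer options differently, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def chance_adjusted_agreement(preds_a, preds_b, labels, n_options=4):
    """Kappa-style agreement between two models on multiple-choice items,
    adjusted for how often they would agree by chance given their accuracies.
    Illustrative only - not the paper's exact CAPA formulation."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))

    observed = np.mean(preds_a == preds_b)   # raw agreement rate
    acc_a = np.mean(preds_a == labels)       # accuracy of model A
    acc_b = np.mean(preds_b == labels)       # accuracy of model B

    # Expected agreement if errors were independent: both correct, or both
    # wrong and landing on the same wrong option by chance.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b) / (n_options - 1)

    # > 0 means the models share errors beyond what accuracy alone predicts.
    return (observed - expected) / (1 - expected)

# Toy usage: two models with the same accuracy and identical mistakes.
labels  = np.array([0, 1, 2, 3, 0, 1, 2, 3])
model_a = np.array([0, 1, 2, 3, 1, 1, 2, 3])
model_b = np.array([0, 1, 2, 3, 1, 1, 2, 3])
print(chance_adjusted_agreement(model_a, model_b, labels))
```

In this toy case the two models err on the same item in the same way, so the score comes out well above what their accuracies alone would predict.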

Figure: Correlation between model similarity and capability, shown separately for evaluation and training scenarios. Great models think alike; according to the researchers, this makes effective external oversight by other AI systems more difficult. | Image: Goel et al.

When language models were tasked with judging other models' output, they consistently gave better scores to systems that shared their error patterns, even after accounting for actual performance differences. The researchers compare this behavior to "affinity bias" in human hiring, where interviewers unconsciously favor candidates who remind them of themselves.

The team also explored what happens when stronger models learn from content generated by weaker ones. They discovered that greater differences between models led to better learning outcomes, likely because dissimilar models possess complementary knowledge. This finding helps explain why performance gains in "weak-to-strong" training approaches vary across different tasks.
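As a rough illustration of the weak-to-strong setup the researchers study, the sketch below trains a "strong" student model on labels produced by a "weak" supervisor instead of on the ground truth. The scikit-learn models and the synthetic data are stand-ins of my choosing, not the paper's experimental pipeline; the point is only the structure of the setup, where the gap between the supervisor's accuracy and the student's final accuracy is what weak-to-strong studies measure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic task; the "weak" and "strong" models are illustrative stand-ins.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak supervisor: a shallow tree fit on a small labeled subset.
weak = DecisionTreeClassifier(max_depth=2, random_state=0)
weak.fit(X_train[:200], y_train[:200])
weak_labels = weak.predict(X_train)      # imperfect labels for the student

# Strong student: trained only on the weak model's labels, never on y_train.
# In the paper's setting, gains from this kind of setup were larger when
# teacher and student had dissimilar error patterns.
strong = LogisticRegression(max_iter=1000)
strong.fit(X_train, weak_labels)

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("strong student accuracy: ", strong.score(X_test, y_test))
```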


Common blind spots and failure modes

After analyzing more than 130 language models, the researchers identified a concerning pattern: as models become more capable, their errors grow increasingly alike. This trend raises safety concerns, especially as AI systems take on more responsibility for evaluating and controlling other AI systems.

"Our results indicate a risk of common blind-spots and failure modes when using AI oversight, which is concerning for safety," the researchers write.

The research team stresses that model similarity and error diversity deserve attention in their own right, and notes that further work is needed to extend the metric to free-text responses and to the reasoning abilities of large language models.

Summary
  • Researchers developed a new metric called CAPA (Chance Adjusted Probabilistic Agreement) to measure the similarity of errors made by language models, which is crucial when AI systems are used to evaluate and control other AIs.
  • Experiments show that AI models acting as "judges" tend to favor models with similar errors, while more powerful models learn more during training from data provided by dissimilar, weaker models.
  • As language models become more capable, their errors grow more alike, which raises safety concerns when AI systems are used to supervise other AI systems because oversight can inherit the same blind spots.