Large language models show cognitive biases and do not align with human preferences when evaluating text, according to a study.
Understanding bias in large language models (LLMs) is important because they are increasingly used in real-world applications, from recommending content to scoring job applications. When these models are biased, they can make decisions or predictions that are unfair or inaccurate.
Suppose an AI system is used to score job applications. The system uses a large language model to evaluate the quality of the cover letter. But if that model has an inherent bias, such as favoring longer text or certain keywords, it could unfairly favor some applicants over others, even though those applicants are not necessarily more qualified.
Cognitive biases in LLMs
Researchers at the University of Minnesota and Grammarly have now conducted a study to measure cognitive biases in large language models (LLMs) when used to automatically evaluate text quality.
The research team assembled 15 LLMs spanning four different size ranges. Each model was asked to judge the responses of the other models in pairwise comparisons, issuing verdicts such as "System Star is better than System Square".
For this purpose, the researchers introduced the "COgnitive Bias Benchmark for LLMs as EvaluatoRs" (COBBLER), a benchmark for measuring six different cognitive biases in LLM evaluations.
They used 50 question-answer examples from the BIGBENCH and ELI5 datasets, generated responses from each LLM, and asked models to evaluate their own responses and the responses of other models.
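As an illustration, the pairwise setup might look roughly like the sketch below. The prompt wording and the query_model callable are assumptions made for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a pairwise evaluation prompt in the spirit of COBBLER.
# The prompt wording and the query_model() helper are illustrative assumptions,
# not the benchmark's actual code.

def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Ask an evaluator model to pick the better of two anonymized answers."""
    return (
        "You are judging two systems' answers to the same question.\n"
        f"Question: {question}\n\n"
        f"System Star's answer: {answer_a}\n"
        f"System Square's answer: {answer_b}\n\n"
        'Reply with exactly one sentence, either "System Star is better than '
        'System Square" or "System Square is better than System Star".'
    )

def judge(question: str, answer_a: str, answer_b: str, query_model) -> str:
    """Return 'A' if the evaluator prefers answer_a (shown as System Star), else 'B'."""
    reply = query_model(build_pairwise_prompt(question, answer_a, answer_b))
    return "A" if "Star is better" in reply else "B"
```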
Examples of the measured biases include egocentric bias, where a model favors its own responses when scoring, and order bias, where a model favors a response based on its position in the prompt. See the table below for a complete list of measured biases.
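To make order bias concrete, the following sketch builds on the hypothetical judge helper above: it presents the same pair of answers in both orders and counts how often the verdict follows the position rather than the content. This is an illustrative probe, not the benchmark's actual scoring code.

```python
def order_bias_rate(examples, query_model) -> float:
    """Fraction of pairs where the verdict depends on presentation order.

    `examples` is a list of (question, answer_1, answer_2) triples; `judge`
    is the hypothetical helper sketched above. An order-insensitive evaluator
    would simply flip its verdict when the two answers are swapped.
    """
    inconsistent = 0
    for question, ans_1, ans_2 in examples:
        first = judge(question, ans_1, ans_2, query_model)   # ans_1 shown first
        second = judge(question, ans_2, ans_1, query_model)  # ans_2 shown first
        # A consistent evaluator prefers the same underlying answer both times.
        prefers_1_first = first == "A"
        prefers_1_second = second == "B"
        if prefers_1_first != prefers_1_second:
            inconsistent += 1
    return inconsistent / len(examples)
```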
The study shows that LLMs are biased when judging text quality. The researchers also examined how closely machine preferences track human preferences and found only moderate agreement (Rank-Biased Overlap: 49.6%).
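Rank-Biased Overlap (RBO) compares two ranked lists while weighting the top ranks most heavily. The sketch below computes a truncated RBO for two finite rankings; the persistence parameter p = 0.9 and the example system names are assumptions for illustration, not values taken from the paper.

```python
def rank_biased_overlap(ranking_a, ranking_b, p: float = 0.9) -> float:
    """Truncated Rank-Biased Overlap (Webber et al., 2010) for two finite rankings.

    Both arguments are ordered lists of item identifiers (best first). Higher
    values mean closer agreement, with the top of the lists weighted most
    heavily. The persistence parameter p = 0.9 is an assumed default.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b) / d          # agreement at depth d
        score += (p ** (d - 1)) * overlap
    return (1 - p) * score

# Example: a human ranking vs. a machine ranking of five hypothetical systems.
human   = ["system_a", "system_b", "system_c", "system_d", "system_e"]
machine = ["system_b", "system_a", "system_e", "system_c", "system_d"]
print(f"RBO = {rank_biased_overlap(human, machine):.3f}")
```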
LLMs are not reliable automatic text evaluators aligned with human preferences
According to the research team, the results of the study suggest that LLMs should not be used for automatic annotation based on human preferences.
Most of the models tested showed strong signs of cognitive biases that could compromise their credibility as annotators.
Even models that had been instruction-tuned or trained with human feedback exhibited various cognitive biases when used as automatic annotators.
The low correlation between human and machine rankings indicates that machine and human preferences diverge substantially. This raises the question of whether LLMs are capable of giving fair ratings at all.
With evaluation capabilities that include various cognitive biases as well as a low percentage of agreement with human preference, our findings suggest that LLMs are still not suitable as fair and reliable automatic evaluators.
From the paper
Full details of the study are available in the arXiv paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".