
Large language models show cognitive biases and do not align with human preferences when evaluating text, according to a study.

Understanding bias in large language models (LLMs) is important because they are increasingly used in real-world applications, from recommending content to scoring job applications. When these models are biased, they can make decisions or predictions that are unfair or inaccurate.

Suppose an AI system is used to score job applications. The system uses a large language model to evaluate the quality of the cover letter. But if that model has an inherent bias, such as favoring longer text or certain keywords, it could unfairly favor some applicants over others, even if they are not necessarily more qualified.

Cognitive biases in LLMs

Researchers at the University of Minnesota and Grammarly have now conducted a study to measure cognitive biases in large language models (LLMs) when used to automatically evaluate text quality.

The research team assembled 15 LLMs across four size ranges and analyzed their responses. Each model was asked to judge the outputs of other LLMs and return verdicts such as "System Star is better than System Square".

For this purpose, the researchers introduced the "COgnitive Bias Benchmark for LLMs as EvaluatoRs" (COBBLER), a benchmark for measuring six different cognitive biases in LLM evaluations.

They used 50 question-answer examples from the BIGBENCH and ELI5 datasets, generated responses from each LLM, and asked models to evaluate their own responses and the responses of other models.
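In outline, this setup can be pictured as a pairwise evaluation loop like the sketch below. This is not the authors' code: the `generate` helper, the model names, and the sample question are placeholders standing in for a real LLM client and the actual benchmark data.

```python
# Minimal sketch of a COBBLER-style evaluation loop (assumptions, not the
# authors' implementation): `generate` is a hypothetical stand-in for an
# LLM API call, and the model and question lists are placeholders.

from itertools import combinations

QUESTIONS = ["Why is the sky blue?"]        # stand-in for the BIGBENCH/ELI5 samples
MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for the evaluated LLMs

def generate(model: str, prompt: str) -> str:
    # Placeholder: replace with a real chat-completion call for `model`.
    return f"[{model} answer]"

verdicts = []
for question in QUESTIONS:
    # 1. Every model answers the question.
    answers = {m: generate(m, question) for m in MODELS}

    # 2. Every model then judges every pair of answers, including pairs
    #    containing its own answer (this is where egocentric bias shows up).
    for evaluator in MODELS:
        for a, b in combinations(MODELS, 2):
            prompt = (
                f"Question: {question}\n"
                f"Response Star: {answers[a]}\n"
                f"Response Square: {answers[b]}\n"
                "Which response is better, Star or Square?"
            )
            verdicts.append((evaluator, a, b, generate(evaluator, prompt)))
```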

Examples of measured biases include egocentric bias, where a model favors its own results when scoring, and order bias, where a model favors an option based on its position in the list. See the table below for a complete list of measured biases.

The biases investigated in the study. | Image: Koo et al.
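For illustration, order bias in particular can be probed by presenting the same pair of answers in both orders and checking whether the verdict tracks the slot rather than the content. The hedged sketch below follows that idea; the `judge` helper is hypothetical and not taken from the study.

```python
# Hedged sketch of an order-bias probe, in the spirit of the benchmark
# but not copied from it. `judge` is a hypothetical helper that calls the
# evaluator model and parses its verdict.

def judge(evaluator: str, question: str, first: str, second: str) -> str:
    """Ask `evaluator` which answer is better; returns 'first' or 'second'."""
    # Placeholder: call the evaluator model and parse its response.
    return "first"

def shows_order_bias(evaluator: str, question: str,
                     answer_a: str, answer_b: str) -> bool:
    verdict_1 = judge(evaluator, question, answer_a, answer_b)  # A shown first
    verdict_2 = judge(evaluator, question, answer_b, answer_a)  # B shown first
    # A content-driven judge flips its verdict when the order is swapped;
    # picking the same slot both times points to order bias.
    return verdict_1 == verdict_2
```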

The study shows that LLMs are biased when judging text quality. The researchers also examined the correlation between human and machine preferences and found that machine preferences do not closely match human preferences (Rank-Biased Overlap: 49.6%).
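Rank-Biased Overlap compares two rankings while weighting agreement at the top positions more heavily. The sketch below implements the truncated form of the metric (Webber et al., 2010); whether the study uses this exact variant and preprocessing is an assumption.

```python
# Hedged sketch of truncated Rank-Biased Overlap (Webber et al., 2010).
# The study's exact variant and preprocessing are assumptions here.

def rbo_truncated(ranking_a, ranking_b, p: float = 0.9) -> float:
    """Agreement between two rankings, weighted toward the top ranks.

    ranking_a, ranking_b: lists of items, best first.
    p: persistence parameter; higher p gives deeper ranks more weight.
    """
    depth = min(len(ranking_a), len(ranking_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d])) / d
        score += (p ** (d - 1)) * overlap
    return (1 - p) * score

# Example: a machine ranking vs. a human ranking of four system outputs.
print(rbo_truncated(["sys1", "sys2", "sys3", "sys4"],
                    ["sys2", "sys1", "sys4", "sys3"]))
```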

The graphs show the degree of bias in the different models and their alignment with human judgments. | Image: Koo et al.

LLMs are not useful as automatic text evaluators based on human preferences

According to the research team, the results of the study suggest that LLMs should not be used for automatic annotation based on human preferences.

Most of the models tested showed strong signs of cognitive biases that could compromise their credibility as annotators.

Even models that had been instruction-tuned or trained with human feedback showed various cognitive biases when used as automatic annotators.

The individual biases and their contribution to the overall model bias. Meta's Llama 2 has the lowest bias, GPT-4 is merely average. | Image: Koo et al.

The low correlation between human and machine ratings suggests that machine and human preferences generally diverge. This raises the question of whether LLMs are capable of giving fair ratings at all.


"With evaluation capabilities that include various cognitive biases as well as a low percentage of agreement with human preference, our findings suggest that LLMs are still not suitable as fair and reliable automatic evaluators."

From the paper

Full details of the study are available in the arXiv paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".

Summary
  • A study shows that large language models (LLMs) exhibit cognitive biases and fail to match human preferences when evaluating text. This raises concerns about the fairness and accuracy of AI decisions.
  • Researchers at the University of Minnesota and Grammarly developed the COgnitive Bias Benchmark for LLMs as EvaluatoRs (COBBLER) to measure six different cognitive biases in LLM evaluation results.
  • The results of the study suggest that LLMs should not be used for automatic annotation based on human preferences. The 15 models tested show strong evidence of cognitive biases, and their judgments do not closely align with human preferences.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.