A systematic evaluation of leading AI chatbots reveals widespread problems with accuracy and reliability when handling news content.


The study, conducted by the BBC, tested ChatGPT, Microsoft Copilot, Google Gemini, and Perplexity on their ability to accurately report current events.

In December 2024, 45 BBC journalists evaluated how these AI systems handled 100 current news questions. They assessed responses across seven key areas: accuracy, source attribution, impartiality, fact-opinion separation, commentary, context, and proper handling of BBC content. Each response was rated from "no issues" to "significant issues."

Overall, 51 percent of AI responses contained significant issues, ranging from basic factual errors to completely fabricated information. When the systems specifically cited BBC content, 19 percent of responses contained factual errors, while 13 percent included fabricated or misattributed quotes.

Chart: BBC analysis of AI assistants - quality issues by category across ChatGPT, Copilot, Gemini, and Perplexity. Google Gemini had the highest rate of problematic responses at more than 60 percent; accuracy and sourcing left room for improvement across all systems tested. | Image: via BBC

From health advice to current events: AI systems struggle with accuracy

Some of the errors could have real-world consequences. Google Gemini incorrectly claimed that the UK's National Health Service (NHS) advises against vaping, when in fact the health authority recommends e-cigarettes to help people quit smoking. Perplexity AI fabricated details about science journalist Michael Mosley's death, while ChatGPT described a Hamas leader as still active months after he had been killed.

The AI assistants regularly cited outdated information as current news, failed to separate opinions from facts, and dropped crucial context from their reporting. Microsoft Copilot, for instance, presented a 2022 article about Scottish independence as if it were current news.

Chart: Four bar charts comparing the AI assistants on impartiality, fact-opinion separation, editorialization, and context provision. Among the tools tested, Perplexity performed most consistently across these categories. | Image: via BBC

The BBC set a high bar in its evaluation - even small mistakes counted as "significant issues" if they might mislead someone reading the response. And while the standards were tough, the problems the journalists found match what other researchers have already observed about how AI systems stumble when handling news.

Take one of the more striking examples: Microsoft's Bing chatbot got so confused reading court coverage that it accused a journalist of committing the very crimes he was reporting on.

The BBC says it will run this study again in the near future. Adding independent reviewers and comparing how often humans make similar mistakes could make future studies even more useful - it would help show just how big the gap is between human and AI performance.


Scale of AI news distortion remains unknown, BBC warns

The BBC acknowledges that its research, while revealing, only begins to uncover the full scope of the problem, because tracking these errors is inherently difficult. "The scale and scope of errors and the distortion of trusted content is unknown," the BBC report states.

AI assistants can provide answers to an almost unlimited range of questions, and different users might receive entirely different responses when asking the same question. This inconsistency makes systematic evaluation extremely difficult.

The problem extends beyond just users and journalists. Media companies and regulators lack the tools to fully monitor or measure these distortions. Perhaps most concerning, the BBC suggests that even the AI companies themselves may not know the true extent of their systems' errors.

"Regulation may have a key role to play in helping ensure a healthy information ecosystem in the AI age," the BBC writes.

Summary
  • A study conducted by the BBC reveals that AI assistants, including ChatGPT, Microsoft Copilot, Google Gemini, and Perplexity, consistently distort news content when responding to queries.
  • The study involved 45 BBC journalists who analyzed the AI systems' responses to 100 current news questions based on seven criteria, such as accuracy and proper citation of sources.
  • The results showed that 51% of all AI-generated responses contained significant errors, ranging from incorrect facts and inadequate sourcing to a lack of context, with mistakes including inaccurate health recommendations and fabricated quotes.