A systematic evaluation of leading AI chatbots reveals widespread problems with accuracy and reliability when handling news content.
The study, conducted by the BBC, tested ChatGPT, Microsoft Copilot, Google Gemini, and Perplexity on their ability to accurately report current events.
In December 2024, 45 BBC journalists evaluated how these AI systems handled 100 current news questions. They assessed responses across seven areas: accuracy, source attribution, impartiality, separation of fact from opinion, editorializing, context, and proper handling of BBC content. Each response was rated on a scale from "no issues" to "significant issues."
More than half of the AI responses (51 percent) contained significant issues, ranging from basic factual errors to completely fabricated information. When the systems specifically cited BBC content, 19 percent of responses contained factual errors and 13 percent included quotes that were fabricated or misattributed.
![Several diagrams on BBC analysis of AI assistants: quality issues by category and comparison of ChatGPT, Copilot, Gemini and Perplexity.](https://the-decoder.com/wp-content/uploads/2025/02/bbc_study_ai_news-2.png)
From health advice to current events: AI systems struggle with accuracy
Some of the errors could have real-world consequences. Google Gemini incorrectly claimed that the UK's National Health Service (NHS) advises against vaping, when in fact the health authority recommends e-cigarettes to help people quit smoking. Perplexity AI fabricated details about science journalist Michael Mosley's death, while ChatGPT continued to describe a Hamas leader as part of the organization's active leadership months after his death.
The AI assistants regularly cited outdated information as current news, failed to separate opinions from facts, and dropped crucial context from their reporting. Microsoft Copilot, for instance, presented a 2022 article about Scottish independence as if it were current news.
![Four bar charts compare AI assistants in the categories of impartiality, fact-opinion separation, editorialization, and context provision.](https://the-decoder.com/wp-content/uploads/2025/02/bbc_study_ai_news-1.png)
The BBC set a high bar in its evaluation: even small mistakes counted as "significant issues" if they might mislead someone reading the response. And while the standards were strict, the problems the journalists found match what other researchers have already observed about how AI systems stumble when handling news.
Take one of the more striking examples: Microsoft's Bing chatbot got so confused reading court coverage that it accused a journalist of committing the very crimes he was reporting on.
The BBC says it will run this study again in the near future. Adding independent reviewers and measuring how often human journalists make similar mistakes could make future studies even more useful, since it would show how large the gap between human and AI performance really is.
Scale of AI news distortion remains unknown, BBC warns
The BBC acknowledges that its research, while revealing, only begins to uncover the full scope of the problem, and tracking these errors is inherently difficult. "The scale and scope of errors and the distortion of trusted content is unknown," the BBC report states.
AI assistants can provide answers to an almost unlimited range of questions, and different users might receive entirely different responses when asking the same question. This inconsistency makes systematic evaluation extremely difficult.
The problem extends beyond just users and journalists. Media companies and regulators lack the tools to fully monitor or measure these distortions. Perhaps most concerning, the BBC suggests that even the AI companies themselves may not know the true extent of their systems' errors.
"Regulation may have a key role to play in helping ensure a healthy information ecosystem in the AI age," the BBC writes.