A new study reveals that large language models often possess accurate information internally, even when their outputs are incorrect. This finding could pave the way for more dependable AI systems.
Researchers from the Technion, Google, and Apple have shown that large language models are more aware of their own errors than previously believed. Their study, titled "LLMs Know More Than They Show," offers insights into how these models internally encode correct and incorrect information.
The research team developed a method to analyze the models' internal representations in finer detail. They focused in particular on "exact answer tokens" - the specific tokens in a response that carry the actual answer. For instance, when answering "What is the capital of France?", the word "Paris" would be the exact answer token in the response "The capital of France is Paris, a world-renowned city."
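To make the idea concrete, here is a minimal sketch of how such a token might be located programmatically. The function name, the whitespace tokenization, and the punctuation handling are illustrative assumptions, not the researchers' actual code.

```python
# Hypothetical sketch: find the position of the "exact answer token"
# inside a model's response, given the expected answer string.

def find_exact_answer_span(response_tokens: list[str], answer: str):
    """Return (start, end) token indices covering the exact answer, or None."""
    answer_tokens = [t.lower() for t in answer.split()]
    for start in range(len(response_tokens) - len(answer_tokens) + 1):
        window = response_tokens[start:start + len(answer_tokens)]
        # Strip trailing punctuation so "Paris," still matches "Paris".
        if [t.strip(".,").lower() for t in window] == answer_tokens:
            return start, start + len(answer_tokens)
    return None

# Example from the article: "Paris" inside a longer response.
response = "The capital of France is Paris, a world-renowned city.".split()
print(find_exact_answer_span(response, "Paris"))  # -> (5, 6)
```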
The study found that these tokens carry the most information about whether a response is accurate. Strikingly, the models' internal representations sometimes encoded the correct answer even when the generated output was wrong, which suggests the models hold more information than they reveal in their responses.
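In broad strokes, such an internal signal can be read out with a small probing classifier trained on the model's hidden states at the exact answer token. The sketch below uses random placeholder activations and labels purely to illustrate the general probing idea; it is an assumption-based illustration, not the paper's actual experimental setup.

```python
# Minimal probing sketch (assumed setup, not the authors' code):
# X holds one hidden-state vector per response, taken at the exact answer token;
# y marks whether that response was factually correct (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))      # placeholder activations
y = rng.integers(0, 2, size=1000)     # placeholder correctness labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.2f}")
```

In a real experiment, the activations would come from the language model itself (for example via the output_hidden_states option in Hugging Face Transformers), and the labels from checking each generated answer against ground truth.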
New approaches to combat hallucinations?
The researchers also examined whether a model's error detection transfers across different tasks. They found that it transfers best between tasks requiring similar skills, indicating that models develop task-specific ways of handling certain kinds of information rather than a single, universal sense of truthfulness.
According to the research team, these findings could lead to new strategies for enhancing the reliability and accuracy of AI systems. In particular, the fact that models often "know" more internally than they show in their outputs opens up possibilities for improved error detection and correction mechanisms.