Honest LLaMA: New method could make ChatGPT more truthful

A new method significantly increases the truthfulness of large language models and shows that the models know more than they reveal.

Researchers at Harvard University have developed a technique called Inference-Time Intervention (ITI) to improve the truthfulness, or factuality, of large language models, creating what their GitHub repository calls "Honest LLaMA". The work is motivated by the observation that ChatGPT and other chatbots provide correct information in some contexts but hallucinate in others, which suggests the facts are present in the model but sometimes get lost during inference.

The team uses linear probes to identify parts of the neural network whose activations predict, with high accuracy, whether a statement is truthful, using parts of the TruthfulQA benchmark. Once the team has identified these parts in some of the transformer's attention heads, ITI shifts the activations of those heads in the truth-correlated direction during text generation.
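The two steps described above, scoring attention heads with a probe and then shifting the selected heads' activations, can be sketched on toy data. This is a minimal illustration, not the researchers' implementation: the activations are synthetic, the "probe" is a simple difference-of-means classifier standing in for a trained linear probe, and all function and head names are hypothetical.

```python
import random

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_probe(true_acts, false_acts):
    """Stand-in for a linear probe: unit direction from the mean 'false'
    activation to the mean 'true' activation, plus a midpoint threshold."""
    mu_t, mu_f = mean_vec(true_acts), mean_vec(false_acts)
    diff = [t - f for t, f in zip(mu_t, mu_f)]
    norm = dot(diff, diff) ** 0.5
    direction = [d / norm for d in diff]
    threshold = (dot(mu_t, direction) + dot(mu_f, direction)) / 2
    return direction, threshold

def probe_accuracy(direction, threshold, true_acts, false_acts):
    """Fraction of held-out activations the probe classifies correctly."""
    correct = sum(dot(a, direction) > threshold for a in true_acts)
    correct += sum(dot(a, direction) <= threshold for a in false_acts)
    return correct / (len(true_acts) + len(false_acts))

def synthetic_head(rng, signal, n=200, dim=8):
    """Assumption: toy activations where 'true' examples are offset by
    `signal` on the first axis; real activations come from the model."""
    true_acts = [[rng.gauss(signal if i == 0 else 0.0, 1.0) for i in range(dim)]
                 for _ in range(n)]
    false_acts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    return true_acts, false_acts

def select_top_heads(heads, top_k):
    """Fit a probe per head on half the data, score it on the other half,
    and keep the top_k heads by held-out accuracy."""
    scored = []
    for name, (true_acts, false_acts) in heads.items():
        half = len(true_acts) // 2
        direction, threshold = fit_probe(true_acts[:half], false_acts[:half])
        acc = probe_accuracy(direction, threshold,
                             true_acts[half:], false_acts[half:])
        scored.append((acc, name, direction))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

def intervene(activation, direction, strength):
    """Shift one head's activation along its truth-correlated direction."""
    return [a + strength * d for a, d in zip(activation, direction)]

rng = random.Random(0)
heads = {
    "head_with_signal": synthetic_head(rng, signal=3.0),
    "head_noise_only": synthetic_head(rng, signal=0.0),
}
selected = select_top_heads(heads, top_k=1)
best_acc, best_name, best_direction = selected[0]
shifted = intervene([0.0] * 8, best_direction, strength=2.0)
```

In this toy setup only the head whose activations actually separate true from false examples scores well, so it is the one selected for intervention; heads with probe accuracy near chance are left untouched.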

ITI significantly increases Alpaca's truthfulness

The researchers show that with ITI, the accuracy of the open-source Alpaca model on the TruthfulQA benchmark increases from 32.5 to 65.1 percent, with similar jumps for Vicuna and LLaMA. However, shifting the model activations too far can also have negative consequences: the model refuses to answer and thus becomes less useful. This trade-off between factuality and helpfulness can be balanced by adjusting the intervention strength of ITI.
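How an adjustable intervention strength might act on a single head can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the shift is assumed to be a strength factor `alpha` times the standard deviation `sigma` of activations along the direction, and the direction, sample activations, and function names are all made up for the example.

```python
import statistics

def projection(vec, direction):
    """Scalar projection of a vector onto a unit direction."""
    return sum(v * d for v, d in zip(vec, direction))

def shift_along(activation, direction, alpha, sigma):
    """Assumed update rule: activation + alpha * sigma * direction."""
    return [a + alpha * sigma * d for a, d in zip(activation, direction)]

# Hypothetical unit direction (0.6^2 + 0.8^2 = 1) and toy activations.
direction = [0.6, 0.8, 0.0]
samples = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
    [2.0, 1.0, 0.0],
    [-1.0, 0.5, 1.0],
]

# Spread of the activations along the direction, used to scale the shift.
sigma = statistics.pstdev([projection(s, direction) for s in samples])

def sweep(alphas, activation):
    """Return how far each strength moves the activation along the direction."""
    base = projection(activation, direction)
    gains = []
    for alpha in alphas:
        shifted = shift_along(activation, direction, alpha, sigma)
        gains.append(projection(shifted, direction) - base)
    return gains

gains = sweep([0.0, 5.0, 15.0], samples[0])
```

A strength of zero leaves the activation unchanged, and larger values move it proportionally further along the direction, which is the knob the trade-off between factuality and helpfulness would be tuned with.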

The team uses ITI for Honest-LLaMA (blue), which provides more correct answers than LLaMA (red). | Image: Li, Patel et al.

ITI has some overlap with reinforcement learning from human feedback (RLHF), which can also increase factuality. However, RLHF can also encourage misleading behavior as the model tries to match human expectations. According to the researchers, ITI does not have this problem and is also minimally invasive, requiring little training data and computational power.

Studies of large language models could lead to a better understanding of "truth"

The team now wants to understand how the method can be generalized to other datasets in a real-world chat setting, and develop a deeper understanding of the trade-off between factuality and helpfulness. In addition, it may be possible in the future to learn the manually identified network segments in a self-supervised manner to make the method more scalable.

Finally, the researchers point out that the topic could also make a wider contribution: "From a scientific perspective, it would be interesting to better understand the multidimensional geometry of representations of complex attributes such as 'truth'."

The code and more information are available on GitHub.
