Summary

Hallucinations are one of the central problems of large language models. An OpenAI co-founder explains why a TruthGPT is so difficult.

Elon Musk’s X.AI wants to build “TruthGPT”, an honest language model. That honesty refers not only to classic cases of hallucination, where systems like ChatGPT generate false output, but also to reports that ChatGPT favors certain political beliefs.

While the latter could be solved by giving users more control over language models, hallucinations remain a central problem that OpenAI, Google, and, in the future, Musk’s AI company will have to deal with.

In his talk, “RL and Truthfulness – Towards TruthGPT,” OpenAI co-founder and researcher John Schulman discussed these challenges and how they might be addressed.


What causes hallucinations in ChatGPT?

According to Schulman, hallucinations can be roughly divided into two types: (1) “pattern completion behavior,” in which the language model fails to express its own uncertainty, fails to question a premise in a prompt, or continues a mistake it made earlier, and (2) cases in which the model is guessing wrong.

Since the language model represents a kind of knowledge graph with facts from the training data in its own network, fine-tuning can be understood as learning a function that operates on this knowledge graph and outputs token predictions. For example, a fine-tuning data set might contain the question “What is the genre of Star Wars?” and the answer “Sci-Fi”. If this information is already in the original training data, i.e. it is part of the knowledge graph, the model does not learn new information but it learns a behavior – outputting correct answers. Such fine-tuning is also called “behavior cloning”.

The problem: If, for example, the question “What was the name of the spin-off movie about Han Solo?” appears in the fine-tuning dataset, but the answer “Solo” is not part of the original training dataset – and thus not part of the knowledge graph – the network learns to answer even though it does not know the answer. Fine-tuning with answers that are actually correct but not in the knowledge graph thus teaches the network to make up answers – i.e. to hallucinate. Conversely, training with incorrect answers can cause the network to withhold information.
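Schulman’s argument can be sketched with a toy loss function. This is a minimal illustration, not his actual formulation: a dictionary stands in for the knowledge graph, any fact missing from it gets a uniformly low probability, and fine-tuning on such a fact produces a large loss that gradient descent reduces by teaching the model to answer anyway.

```python
import math

# Toy stand-in for the model's "knowledge graph": the probability the
# model assigns to each (question, answer) pair. Numbers are invented
# for illustration only.
model_probs = {
    ("What is the genre of Star Wars?", "Sci-Fi"): 0.9,  # in the knowledge graph
}

def behavior_cloning_loss(dataset):
    """Average negative log-likelihood of the reference answers.
    Fine-tuning minimizes this, i.e. it clones the behavior
    'output the reference answer' whether or not the model knows it."""
    total = 0.0
    for prompt, answer in dataset:
        # Facts outside the knowledge graph get a uniformly low probability.
        p = model_probs.get((prompt, answer), 0.01)
        total -= math.log(p)
    return total / len(dataset)

known = [("What is the genre of Star Wars?", "Sci-Fi")]
unknown = [("What was the name of the spin-off movie about Han Solo?", "Solo")]

# The known fact yields a small loss; the unknown fact yields a large one,
# so the optimizer pushes the model to produce an answer either way --
# the mechanism by which fine-tuning teaches the model to make things up.
print(behavior_cloning_loss(known) < behavior_cloning_loss(unknown))  # True
```

Training on the `unknown` pair does not add the fact to the model; it only rewards confident answering, which is the hallucination behavior the article describes.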

Ideally, behavior cloning should therefore always be based on what the network actually knows. But this knowledge is usually unavailable to the human workers who create or evaluate the datasets, e.g. for instruction tuning. According to Schulman, the problem also exists when other models create fine-tuning datasets, as with the Alpaca formula. A smaller network with a smaller knowledge graph not only learns to give answers and follow instructions using ChatGPT’s output, but also learns to hallucinate more often, he predicts.

How OpenAI aims to combat hallucinations

The good news is that, at least for simple questions, language models seem to be able to estimate whether they know an answer – and could theoretically express their uncertainty. So, Schulman says, a fine-tuning data set needs to include examples in which uncertainty is communicated, a premise is challenged, or an error is admitted. These behaviors could then be taught to the model through behavior cloning.
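Such a dataset might look like the following sketch. All prompts and targets are invented for illustration and are not from OpenAI’s actual training data; they only show the three behaviors Schulman names.

```python
# Illustrative fine-tuning examples (invented, not OpenAI's data)
# covering the three behaviors mentioned in the talk:
examples = [
    # expressing uncertainty
    {"prompt": "Who won the 2040 World Cup?",
     "target": "I don't know; that is outside my training data."},
    # challenging a false premise
    {"prompt": "Why is the Moon made of cheese?",
     "target": "The premise is false: the Moon is made of rock."},
    # admitting an error
    {"prompt": "You claimed earlier that 2 + 2 = 5.",
     "target": "You're right, that was a mistake. 2 + 2 = 4."},
]

# Behavior cloning on such pairs teaches the *form* of these responses;
# as the article goes on to say, RL is still needed to teach *when* to use them.
for ex in examples:
    print(ex["prompt"], "->", ex["target"])
```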


What the model does not learn is when to exhibit this behavior. This is where reinforcement learning comes in, according to Schulman, for example in the form of reinforcement learning from human feedback (RLHF). With RL, he says, the model can learn “behavioral boundaries,” such as when to express uncertainty.
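One way to picture such a behavioral boundary is as a confidence threshold implied by the reward scheme. The payoff values below are assumptions chosen for illustration, not OpenAI’s actual rewards:

```python
# Hedged sketch: the reward values are invented for illustration, not
# OpenAI's actual scheme. Correct answers score +1, hedging 0, and
# confidently wrong answers -2.
REWARDS = {"correct": 1.0, "hedge": 0.0, "wrong": -2.0}

def expected_reward(p_correct, hedge=False):
    """Expected reward of answering outright vs. hedging, given the
    model's own probability of being right."""
    if hedge:
        return REWARDS["hedge"]
    return p_correct * REWARDS["correct"] + (1 - p_correct) * REWARDS["wrong"]

# Under these payoffs, answering beats hedging only when the model is
# at least 2/3 sure of being right -- a "behavioral boundary" that RL
# could push the policy toward.
print(expected_reward(0.9) > expected_reward(0.9, hedge=True))  # True: confident, so answer
print(expected_reward(0.3) < expected_reward(0.3, hedge=True))  # True: unsure, so hedge
```

The exact threshold (here 2/3) follows directly from the chosen payoffs; different reward values would draw the boundary elsewhere.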

Another piece of the puzzle is the ability to retrieve and cite sources, e.g. via mechanisms as shown in WebGPT or recently in the browser plugin for ChatGPT, Schulman says.

Why does ChatGPT hallucinate despite behavior cloning and RLHF?

While the presented approach is relatively straightforward for short questions and answers, other problems arise in the long-form setting common in ChatGPT. For one thing, completely wrong answers are extremely rare; more often, the output mixes correct and incorrect information. In extreme cases, that can mean a single error in 100 lines of code.

In other cases, the information is not wrong in the classical sense, but misleading. In systems like ChatGPT, it is therefore difficult to measure the quality of responses in terms of informativeness or correctness. However, such measurements are important for an RL algorithm that is intended to train complex behavioral boundaries.


Currently, OpenAI’s RLHF relies on a ranking-based reward model that predicts which of two answers it thinks is better, but provides no effective signal for how much better, more informative, or more correct an answer is. It therefore lacks the ability to give the model the feedback needed to learn the fine behavioral boundaries that Schulman sees as a possible solution to the hallucination problem.
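This limitation can be made concrete with the standard pairwise ranking objective (a Bradley–Terry style loss) commonly used to train such reward models; the exact form OpenAI uses is not given in the talk.

```python
import math

def pairwise_loss(r_better, r_worse):
    """-log sigmoid(r_better - r_worse): the ranking objective commonly
    used for RLHF reward models. It depends only on the score margin."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_better - r_worse))))

# Two comparisons with very different absolute quality but the same
# score margin produce identical losses: the reward model learns which
# answer wins, but receives no graded signal for *how much* better one
# answer is than the other.
a = pairwise_loss(2.0, 1.0)   # slightly better answer
b = pairwise_loss(10.0, 9.0)  # much higher absolute scores, same margin
print(abs(a - b) < 1e-12)  # True
```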

In addition, this process is complicated by human error in the RLHF labeling process, such as not having all the relevant information to verify a particular statement.

TruthGPT and knowledge generation: open problems

So while Schulman sees reinforcement learning as an important building block for reducing hallucinations, he believes there are still many open problems. Aside from the aforementioned question of what exactly a reward model needs to look like to encourage correct behavior, RLHF currently relies only on human approval, or “what’s convincing, what sounds right,” as Schulman says.

This could make it difficult to generate knowledge because predictions about the future sometimes lead to statements that are initially unconvincing. For Schulman, however, generating knowledge is the next important step for language models, and he sees theory-building for predicting the future and giving rules of reasoning as open problems that he wants to explore.

One possible solution, he says, is to use other AI models to train language models – an approach OpenAI also believes is important for AI alignment. “It’s all pretty new and I think we still have yet to see really good practical implementations of this stuff, but it’s starting to become necessary because it’s getting really hard for labelers to keep up with the models.”

Watch the full talk on YouTube.

  • On the way to a "TruthGPT", OpenAI and, in the future, Elon Musk are struggling with hallucinations. OpenAI co-founder John Schulman explains how hallucinations occur and how reinforcement learning can reduce them.
  • The recipe: behavior cloning for behavior patterns, reinforcement learning for behavioral boundaries. In practice, however, there are numerous open problems, such as how to create an appropriate reward model.
  • Schulman sees the generation of new knowledge as the next big challenge and believes that other AI models could help train language models, with or without human involvement.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.