A research team at the University of Oxford has introduced a new method called "Semantic Entropy Probes" (SEPs) to efficiently detect uncertainty and hallucinations in large language models. The method could make AI systems safer to use in practice.
Oxford University scientists have developed a new technique that can cost-effectively detect hallucinations and uncertainty in large language models such as GPT-4. Semantic Entropy Probes (SEPs) build on earlier work on hallucination detection in which some of the authors were involved.
In a paper published in Nature, the team demonstrated that the "semantic entropy" of the responses of several large language models can be measured to identify arbitrary or false answers. The method generates multiple possible answers to a question and groups those with similar meanings. High entropy indicates uncertainty and potential errors. In tests, the method distinguished correct from false AI responses in 79 percent of cases, about 10 percent better than previous methods. Integrating it into language models could increase reliability, but would raise costs for providers.
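As a rough illustration of the idea, the following Python sketch clusters sampled answers by meaning and computes an entropy over those clusters. The `answers_share_meaning` check is a deliberately simplified placeholder; the paper compares the meanings of answers with a more sophisticated equivalence test, and none of the names below come from the authors' code.

```python
import math

def answers_share_meaning(a: str, b: str) -> bool:
    # Placeholder equivalence check: simple case-insensitive string matching.
    # The actual method compares meanings, not surface strings.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(sampled_answers: list[str]) -> float:
    # Greedily group sampled answers into clusters of equivalent meaning.
    clusters: list[list[str]] = []
    for ans in sampled_answers:
        for cluster in clusters:
            if answers_share_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    # Entropy over the distribution of meaning clusters: samples spread across
    # many clusters -> high entropy -> the model is likely unsure or confabulating.
    n = len(sampled_answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Example: five sampled answers, two distinct meanings -> non-zero entropy.
print(semantic_entropy(["Paris", "paris", "Paris", "Lyon", "Paris"]))
```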
The new SEP method addresses a central problem of semantic entropy measurement: its high computational cost. Instead of generating multiple model responses for each query, the researchers train linear probes on the hidden states of language models as they answer questions. These hidden states are internal representations that the model produces while processing text. The linear probes are simple mathematical models that learn to predict semantic entropy from these internal states.
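A minimal sketch of what such a probe could look like, assuming hidden states from one model layer and semantic-entropy labels (binarized into high and low uncertainty) have already been collected offline. It uses scikit-learn's logistic regression as the linear probe; all details are illustrative rather than taken from the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins: in a real setup these would be activations recorded from one model
# layer and semantic entropies computed offline with the sampling-based method.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))
semantic_entropies = rng.uniform(0.0, 2.0, size=1000)
# Binarize into high/low uncertainty so a simple classifier can serve as the probe.
labels = (semantic_entropies > np.median(semantic_entropies)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))

# Once trained, scoring a new answer needs only the hidden state of a single response.
new_hidden_state = rng.normal(size=(1, 4096))
print("estimated P(high semantic entropy):", probe.predict_proba(new_hidden_state)[0, 1])
```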
In practice, this means that once trained, SEPs require only a single model response to estimate the model's uncertainty. This significantly reduces the computational effort of uncertainty quantification. The researchers show that SEPs can accurately predict semantic entropy and detect hallucinations in model responses.
"Semantic entropy probes could be further improved with more training
The researchers examined the performance of SEPs across different model architectures, tasks, and model layers. They show that hidden states in the middle to late model layers capture semantic entropy best. SEPs can even predict semantic uncertainty before the model begins to generate a response.
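To show where such hidden states could come from, the sketch below reads out the activation of a chosen layer at the last prompt token, i.e. before any answer tokens are generated, using the Hugging Face transformers library. The model name and layer index are placeholders for illustration, not the models probed in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding layer plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim).
layer_index = 8  # a middle-to-late layer for this 12-layer model
hidden = outputs.hidden_states[layer_index][0, -1]  # last prompt token
print(hidden.shape)  # e.g. torch.Size([768]) -> this vector would feed the probe
```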
While SEPs do not quite match the performance of more computationally intensive methods such as the direct calculation of semantic entropy, the team says they offer a balanced trade-off between accuracy and computational efficiency. This makes them a promising technique for practical use in scenarios where computational resources are limited. In the future, the team wants to further improve the performance of SEPs, for example with larger training datasets.