
A novel prompting method called "Highlighted Chain of Thought" (HoT) helps large language models better explain their reasoning and makes their answers easier for humans to verify.

The approach works in two steps: First, the AI reformulates the original question and marks important facts using XML tags. Then, it generates an answer that references these highlighted facts, creating clear connections between the question and response.
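
To make the two steps concrete, here is a minimal sketch of what a HoT-style exchange could look like. The tag names (`<fact1>`, `<fact2>`, ...) and the example question are illustrative assumptions rather than the authors' exact templates; the regex simply pulls the highlighted spans back out, for instance to render them as color-coded text:

```python
import re

# Step 1: the model restates the question, wrapping key facts in XML-style tags
# (illustrative wording and tag names -- not the paper's verbatim output).
reformulated_question = (
    "A bakery sells <fact1>120 muffins per day</fact1> at "
    "<fact2>$2 per muffin</fact2>. How much revenue does it make in "
    "<fact3>5 days</fact3>?"
)

# Step 2: the answer references the same tags, tying each reasoning step
# back to a highlighted fact in the question.
answer = (
    "Daily revenue is <fact1>120 muffins</fact1> x <fact2>$2</fact2> = $240. "
    "Over <fact3>5 days</fact3>, that is 5 x $240 = $1,200."
)

# Extract the highlighted spans, e.g. to display them as colored highlights.
TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>")

for source, text in [("question", reformulated_question), ("answer", answer)]:
    for tag, span in TAG_PATTERN.findall(text):
        print(f"{source}: {tag} -> {span}")
```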

This structured approach forces models to more carefully consider the facts presented, which may reduce hallucinations, according to the researchers. The color-coded highlights also make it faster for humans to verify the AI's reasoning.

HoT structures math problems by marking key information with XML tags, making the content easier to read and understand. | Image: Nguyen et al.

The research team used 15 human-annotated question-answer pairs as few-shot examples, prompting the models to generate the highlights on their own. Testing shows HoT improves accuracy across a range of tasks, with gains of up to 15 percentage points depending on the model and benchmark.
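
In practice, that few-shot setup amounts to placing the annotated demonstrations in front of the new question so the model imitates the tag format. A rough sketch, with invented example data and prompt wording (not the paper's actual annotations):

```python
# Hypothetical few-shot HoT prompt assembly; the demonstration below is made up
# and only illustrates the structure, not the researchers' real examples.
examples = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many does he have?",
        "hot_answer": (
            "Reformulated: Tom has <fact1>3 apples</fact1> and buys "
            "<fact2>2 more</fact2>. How many does he have?\n"
            "Answer: <fact1>3</fact1> + <fact2>2</fact2> = 5 apples."
        ),
    },
    # ... the paper's setup uses 15 such human-annotated pairs
]

def build_hot_prompt(new_question: str) -> str:
    """Concatenate annotated demonstrations in front of the new question."""
    parts = [f"Q: {ex['question']}\n{ex['hot_answer']}" for ex in examples]
    parts.append(f"Q: {new_question}\nReformulated:")
    return "\n\n".join(parts)

print(build_hot_prompt("A train travels 60 km/h for 2 hours. How far does it go?"))
```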

Compared to traditional chain-of-thought (CoT) prompting, the approach underlying current reasoning models like OpenAI o1, HoT increased accuracy by 1.6 percentage points for arithmetic tasks, 2.58 points for question answering, and 2.53 points for logical reasoning.

Test results show HoT outperforming traditional CoT methods, with the biggest gains on AQUA (+14.64) and StrategyQA (+15.07) benchmarks. | Image: Nguyen et al.

The researchers tested HoT across five AI models: GPT-4o, Gemini-1.5-Pro, Gemini-1.5-Flash, Llama-3.1-70B, and Llama-3.1-405B. They evaluated 17 different task types covering arithmetic, reading comprehension, and logical thinking.

Reasoning models showed little to no benefit from HoT in testing, and in some cases performed worse; DeepSeek-R1, for example, saw a slight drop in performance. The researchers attribute this to the example-based (few-shot) prompting approach, which can degrade results with reasoning models.

Mixed results for human verification

Human testers completed verification tasks 25 percent faster with highlighted answers. However, the highlighting had an unexpected effect on trust: Users became more likely to accept AI answers, even incorrect ones.

With highlighting, humans correctly identified accurate answers 84.5 percent of the time, compared to 78.8 percent without highlighting. However, their ability to spot wrong answers dropped from 72.2 percent to 54.8 percent when highlighting was present. Tests using AI models as verifiers showed no clear improvement.

While HoT speeds up human verification of AI responses, it also makes people more likely to miss AI mistakes. | Image: Nguyen et al.

The researchers remain optimistic about HoT's potential to make AI systems more transparent and comprehensible, though they acknowledge more research is needed on how highlighting affects user trust.

The method also has technical limitations. Smaller models such as Llama-3.1-8B and Qwen-2.5-Coder-32B struggle to follow the tagging instructions, often tagging facts incorrectly or simply repeating the examples. The researchers also found that moving tags to random phrases significantly reduces accuracy, underscoring the importance of consistent tagging between questions and answers.
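
One way to picture that consistency requirement: every fact tag cited in the answer should also appear in the reformulated question. A small, assumed-format check along those lines might look like this:

```python
import re

TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>")

def orphaned_tags(reformulated_question: str, answer: str) -> set[str]:
    """Return fact tags cited in the answer that never appear in the question."""
    question_tags = {tag for tag, _ in TAG_PATTERN.findall(reformulated_question)}
    answer_tags = {tag for tag, _ in TAG_PATTERN.findall(answer)}
    return answer_tags - question_tags

# Example: the answer cites <fact2>, which the question never highlights.
q = "Anna read <fact1>40 pages</fact1> on Monday and 25 on Tuesday. How many in total?"
a = "She read <fact1>40</fact1> plus <fact2>25</fact2> pages, so 65 in total."
print(orphaned_tags(q, a))  # {'fact2'}
```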

Looking ahead, the team plans to train AI models to generate HoT answers directly rather than using prompt examples, which could make the method more effective and widely applicable.

The research paper is available on the preprint server arXiv and on a project page. The researchers have made their code and data available on GitHub.

Summary
  • Researchers have introduced "Highlighted Chain of Thought" (HoT), a prompting technique that gets AI models to mark key facts in their responses with XML tags, making the answers easier for users to follow and verify.
  • Tested on five AI models, including GPT-4o and Gemini-1.5, HoT increased accuracy across various task types by as much as 15 percentage points. Human evaluators reviewed the highlighted responses 25 percent faster, but also placed greater trust in them, even when the answers were incorrect.
  • The method is not yet reliable with smaller models like Llama-3.1-8B, which struggle to follow the tagging instructions accurately. Going forward, the team plans to train AI models to generate HoT responses directly.