Summary

Researchers at the École Polytechnique Fédérale de Lausanne (EPFL) have conducted a thorough analysis comparing in-context learning (ICL) and instruction fine-tuning (IFT) for adapting large language models (LLMs).


The study used the industry-standard MT-Bench benchmark to measure how well models follow instructions. Surprisingly, ICL and IFT performed similarly on the first turn of MT-Bench when only a few training examples (up to 50) were used.

The study authors suggest that when only a few examples are available, ICL with high-quality data could be a viable alternative to IFT.
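To make the ICL side of the comparison concrete: with in-context learning, the examples are not used to update the model's weights but are simply placed in the prompt ahead of the new request. The following is a minimal sketch of that idea. The function name `build_icl_prompt` and the `# Instruction:` / `# Response:` markers are hypothetical and are not the template used in the study.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]], query: str, rules: str = "") -> str:
    """Assemble a few-shot prompt for a base model (illustrative template only).

    demos: (instruction, response) pairs used as in-context examples.
    query: the new instruction the model should answer.
    rules: optional preamble with instruction-following rules.
    """
    parts = []
    if rules:
        parts.append(rules.strip())
    # Each demonstration is rendered in the same format as the final query,
    # so the model can continue the pattern.
    for instruction, response in demos:
        parts.append(f"# Instruction:\n{instruction}\n\n# Response:\n{response}")
    # The prompt ends with an open response slot for the model to fill in.
    parts.append(f"# Instruction:\n{query}\n\n# Response:\n")
    return "\n\n".join(parts)


demos = [
    ("Name a primary color.", "Red is a primary color."),
    ("Translate 'hello' to French.", "'Hello' in French is 'bonjour'."),
]
prompt = build_icl_prompt(demos, "Give one synonym for 'fast'.")
```

The resulting string would be passed unchanged to a base model. Adding or swapping examples only changes the prompt, which is why ICL can be set up so quickly.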

Instruction fine-tuning is better for more complex tasks

Despite the similar results on simple tasks, clear differences emerged between the two methods in more complex scenarios. In multi-turn conversations, IFT significantly outperformed ICL.


The researchers hypothesize that this is because ICL models overfit to the style of the individual examples and therefore struggle with more complex conversations. Even base models were able to outperform ICL on the second turn of the benchmark.

Line chart: MT-Bench comparison of different training methods for Mistral-7B-v0.2, two turns, increasing number of examples.
The graph shows similar performance for ICL and IFT on MT-Bench with a small number of training examples for Mistral-7B-v0.2. However, as the number of examples increases, IFT pulls clearly ahead, especially on the second turn. | Image: Zhao et al.

The study also investigated the URIAL method, which aligns base language models purely through the prompt, using just three in-context examples and instruction-following rules, without updating the model's weights. Although URIAL produced good results, it fell short of models adapted through instruction fine-tuning.

The EPFL researchers improved URIAL's performance to approach that of fine-tuned models by selecting additional optimized examples. This was done using a greedy search, which selects examples that incrementally improve the model's performance the most. The result underscores the general importance of high-quality training data for both ICL and IFT, and even for training the base models.

Table: Performance comparison of URIAL vs. URIAL+greedy search for Mistral-7B-v0.2 and Llama-3.1-8B, various in-context prompts.
With additional optimized examples, URIAL approaches the performance of Instruct models, underscoring the importance of high-quality training data for both approaches. | Image: Zhao et al.
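The greedy search described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the researchers' implementation: the function `greedy_select` and the toy scoring function are hypothetical. In a real setting, `score` would run the model with the candidate prompt and return a benchmark-style quality score.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_select(
    candidates: Sequence[T],
    score: Callable[[List[T]], float],
    budget: int,
    initial: Sequence[T] = (),
) -> List[T]:
    """Add, one at a time, the candidate that raises the score the most."""
    selected = list(initial)
    remaining = list(candidates)
    best = score(selected)
    for _ in range(budget):
        if not remaining:
            break
        # Evaluate every remaining candidate when appended to the current set.
        trials = [(score(selected + [c]), i) for i, c in enumerate(remaining)]
        top_score, top_idx = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # No candidate improves the score, so stop early.
        selected.append(remaining.pop(top_idx))
        best = top_score
    return selected


# Toy stand-in for a real evaluation: count distinct characters covered.
def toy_score(demos: List[str]) -> float:
    return float(len(set("".join(demos))))


chosen = greedy_select(["aab", "abc", "cde", "e"], toy_score, budget=3)
```

Because each step requires scoring every remaining candidate, the cost grows with both the size of the candidate pool and the number of examples selected. With a real model and benchmark in the loop, this makes the search expensive.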

Another finding was the significant impact of decoding parameters on model performance. These parameters, which determine how the model generates text, played a critical role in both base LLMs and models using URIAL. With the right decoding parameters, even base models can follow instructions to some extent, the researchers note.
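The article does not name the specific decoding parameters involved. As a general illustration, the sketch below shows how one common parameter, the sampling temperature, reshapes the probability distribution from which the next token is drawn. The function name and example values are assumptions chosen for illustration only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw token scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
```

A low temperature concentrates probability on the top-scoring token, while a high temperature spreads it more evenly across candidates, making the output more varied and less predictable.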

Implications for practice

The results show that in-context learning can adapt language models quickly and effectively, especially when few training examples are available.


However, fine-tuning remains superior for generalizing to more complex tasks such as multi-turn conversations. In addition, IFT continues to improve with larger datasets, while ICL plateaus after a certain number of examples.

The researchers emphasize that choosing between ICL and IFT depends on various factors, including available resources, data quantity, and specific application requirements. In any case, the study highlights the importance of high-quality training data for both approaches.

The study, titled "Is In-Context Learning Sufficient for Instruction Following in LLMs?", will be presented at NeurIPS 2024. The code is available on GitHub.

The gold standard may still be to first achieve high-quality generation as quickly as possible with examples in the prompt (ICL), which can then be further optimized and stabilized by fine-tuning (IFT).

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Researchers at the École Polytechnique Fédérale de Lausanne (EPFL) have compared In-Context Learning (ICL) and Instruction Fine-Tuning (IFT) for adapting large language models and found that both methods perform similarly with a small set of training examples.
  • However, for more complex tasks, such as multi-turn conversations, IFT performed significantly better than ICL. The researchers suggest that ICL models overfit to the style of individual examples and have difficulty responding to complex conversations.
  • The choice between ICL and IFT depends on several factors, including available resources, amount of data, and specific requirements. High-quality training data is essential for both approaches.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.