Researchers at the École Polytechnique Fédérale de Lausanne (EPFL) have conducted a thorough analysis comparing in-context learning (ICL) and instruction fine-tuning (IFT) for adapting large language models (LLMs).
The study used the industry-standard MT-Bench benchmark to measure how well models follow instructions. Surprisingly, ICL and IFT performed similarly on the first turn of MT-Bench when only a small number of examples (up to 50) were used.
The study authors suggest that when only a few examples are available, ICL with high-quality data could be a viable alternative to IFT.
Instruction fine-tuning is better for more complex tasks
Despite similar performance on simple tasks, clear differences emerged between the two methods in more complex scenarios. In multi-turn conversations, IFT significantly outperformed ICL.
The researchers hypothesize that this is because ICL models overfit to the style of individual examples and struggle with more complex conversations. Even base models outperformed ICL in this second turn.
The study also investigated the URIAL method, which aligns base language models with just three examples and instruction-following rules in the prompt, without any fine-tuning. Although URIAL produced good results, it fell short of models adapted through instruction fine-tuning.
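To illustrate the idea, here is a minimal sketch of a URIAL-style prompt: a short preamble of instruction-following rules followed by three demonstrations, applied to a base model with no weight updates. The rule text, example pairs, and formatting markers are hypothetical stand-ins, not the actual prompt from the URIAL paper.

```python
# Minimal sketch of URIAL-style in-context alignment. All texts below
# are hypothetical placeholders; the actual URIAL prompt differs.

RULES = (
    "You are a helpful assistant. Answer the user's instruction "
    "directly, accurately, and politely."
)

# Exactly three instruction/response demonstrations, as in URIAL.
THREE_DEMOS = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Give a one-line shell command to count files.", "ls | wc -l"),
    ("Explain photosynthesis in one sentence.",
     "Photosynthesis is the process by which plants convert light, water, "
     "and carbon dioxide into sugar and oxygen."),
]

def urial_prompt(query: str) -> str:
    """Build the full prompt: rules, then demos, then the user query."""
    demos = "".join(f"Query: {q}\nAnswer: {a}\n\n" for q, a in THREE_DEMOS)
    return f"{RULES}\n\n{demos}Query: {query}\nAnswer:"

# The resulting string is sent to an *untuned* base model's completion API.
print(urial_prompt("Name three planets."))
```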
The EPFL researchers improved URIAL's performance to approach that of fine-tuned models by selecting additional optimized examples. This was done using a greedy search that, at each step, adds the example yielding the largest performance gain. The result underscores the general importance of high-quality training data for both ICL and IFT, and even for training the base models.
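The selection loop might look like the following sketch, assuming a scoring function (here called `evaluate`) that rates a candidate set of in-context examples, for instance via a benchmark or an LLM judge. The names and the scoring step are illustrative, not the paper's exact procedure.

```python
# Hypothetical sketch of greedy example selection for ICL.

def greedy_select(candidates, evaluate, k=3):
    """Pick k examples, each round adding the candidate that most
    improves the evaluation score of the current selection."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best, best_score = None, float("-inf")
        for cand in remaining:
            score = evaluate(selected + [cand])  # e.g., judge-based score
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Since each round scores every remaining candidate, the cost grows with the pool size times k evaluation calls, which is why such searches are typically run over small, curated pools.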
Another finding was the significant impact of decoding parameters on model performance. These parameters, which determine how the model generates text, played a critical role in both base LLMs and models using URIAL. With the right decoding parameters, even base models can follow instructions to some extent, the researchers note.
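For illustration, the snippet below sets common decoding parameters such as temperature and nucleus sampling via the Hugging Face transformers API; the model name and parameter values are placeholders, not those from the study.

```python
# Sketch of how decoding parameters shape generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Instruction: Name three planets.\nResponse:",
                   return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # lower values -> more deterministic output
    top_p=0.9,               # nucleus sampling over the top 90% of mass
    repetition_penalty=1.2,  # discourage base-model looping
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```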
Implications for practice
The results show that in-context learning can adapt language models quickly and effectively, especially when only a few training examples are available.
However, fine-tuning remains superior for generalizing to more complex tasks such as multi-turn conversations. In addition, IFT continues to improve with larger datasets, while ICL plateaus after a certain number of examples.
The researchers emphasize that choosing between ICL and IFT depends on various factors, including available resources, data quantity, and specific application requirements. In any case, the study highlights the importance of high-quality training data for both approaches.
The study, titled "Is In-Context Learning Sufficient for Instruction Following in LLMs?", will be presented at NeurIPS 2024. The code is available on GitHub.
The gold standard may still be to first achieve high-quality generation as quickly as possible with examples in the prompt (ICL), and then further optimize and stabilize the model through fine-tuning (IFT).
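In practice, such a handoff could look like this minimal sketch: export the prompt examples that proved themselves under ICL as a seed dataset for supervised fine-tuning. The file name and record format are illustrative assumptions.

```python
# Hypothetical sketch: turn validated ICL examples into an IFT dataset.
import json

icl_examples = [
    {"instruction": "Summarize: The meeting was moved to Friday.",
     "response": "The meeting is now on Friday."},
    # ... further pairs that performed well in the prompt ...
]

# Write one JSON record per line; most supervised fine-tuning pipelines
# accept this kind of JSONL format.
with open("ift_seed.jsonl", "w") as f:
    for record in icl_examples:
        f.write(json.dumps(record) + "\n")
```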