Stanford's Alpaca was just the beginning: in recent weeks, several AI models trained on data generated by ChatGPT have been unveiled.
In mid-March, Stanford researchers unveiled the Alpaca language model, a variant of Meta's LLaMA 7B fine-tuned with AI-generated data. The team trained the LLaMA model on 52,000 instruction-following examples generated by OpenAI's GPT-3.5 (text-davinci-003).
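The data-generation step can be sketched roughly as follows: a handful of seed tasks are shown to text-davinci-003, which is asked to produce new instruction-following examples in the same format. The snippet below is a minimal sketch assuming the OpenAI Python SDK of the time (v0.x) and a hypothetical seed_tasks.json file; the actual Alpaca pipeline uses the more elaborate self-instruct prompt templates.

```python
# Minimal sketch of the data-generation step: ask text-davinci-003 to produce
# new instruction-following examples from a handful of seed tasks.
# Assumes the openai Python SDK (v0.x API) and a hypothetical seed_tasks.json;
# the real Alpaca pipeline uses more elaborate self-instruct prompt templates.
import json
import openai

openai.api_key = "sk-..."  # your API key

seed_tasks = json.load(open("seed_tasks.json"))  # hypothetical seed file

prompt = (
    "You are asked to come up with 20 diverse task instructions, "
    "each with an input (which may be empty) and an output.\n\n"
    "Here are some examples:\n"
    + "\n".join(f"Instruction: {t['instruction']}\nOutput: {t['output']}"
                for t in seed_tasks[:3])
    + "\n\nNow generate 20 new tasks in the same format:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=2048,
    temperature=1.0,
)
# The raw text is then parsed into {instruction, input, output} records
# and collected until the dataset reaches the desired size.
print(response["choices"][0]["text"])
```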
In the team's benchmarks, Alpaca achieved results partially comparable to GPT-3.5, and at a total cost of around $600 it was cheap enough that alignment researcher Eliezer Yudkowsky, among others, sees it as a real challenge for companies like OpenAI.
The Alpaca formula is open source, but may not be used commercially
However, the LLaMA model underlying Alpaca has not been released for commercial use, and OpenAI's terms of use prohibit using GPT-3.5 to develop AI models that compete with OpenAI. Stanford has therefore not yet released the fine-tuned model itself, only the training data and the code to generate the data and fine-tune the model.
Shortly after Alpaca, several open-source models were released that build directly on or are heavily inspired by this work. Alpaca-LoRA applies the resource-efficient low-rank adaptation (LoRA) method, which is also widely used for fine-tuning Stable Diffusion, to Meta's LLaMA and achieves results comparable to Alpaca.
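To illustrate the idea, here is a minimal sketch of LoRA fine-tuning with the Hugging Face peft library; the checkpoint name, rank, and target modules are assumptions, not the exact settings of the Alpaca-LoRA repository.

```python
# Minimal sketch of LoRA fine-tuning in the spirit of Alpaca-LoRA, using the
# Hugging Face peft library. Checkpoint name, rank and target modules are
# assumptions; the actual Alpaca-LoRA repo ships its own training script.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

base = "decapoda-research/llama-7b-hf"  # assumed LLaMA 7B checkpoint name
tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# LoRA injects small trainable low-rank matrices into the attention projections,
# so only a tiny fraction of the parameters is updated during fine-tuning.
config = LoraConfig(
    r=8,                                # rank of the update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # typically well under 1% of all weights
# ...then train with a standard Trainer loop on the 52,000 Alpaca examples.
```

Because only the low-rank adapter weights are trained, this kind of fine-tuning fits on a single consumer GPU, which is what makes the Alpaca recipe so cheap to reproduce.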
Startup Nomic AI released GPT4All, a LLaMA variant trained with 430,000 GPT-3.5-turbo outputs selected from a dataset of one million outputs in total.
ChatDoctor, on the other hand, is a LLaMA model specialized for medical chats. The authors of the accompanying paper first trained LLaMA on the 52,000 Alpaca training examples and then fine-tuned it on 5,000 real conversations between physicians and patients.
The startup Databricks relied on EleutherAI's GPT-J-6B instead of LLaMA for its chatbot Dolly, which also used the Alpaca training dataset. "We find that even years-old open source models with much earlier architectures exhibit striking behaviors when fine tuned on a small corpus of instruction training data," Databricks said of training with the GPT-generated data.
ChatGPT clones and data gold
With Alpaca, Stanford has provided a recipe for better open-source models: fine-tune the capable LLaMA models on special datasets generated by ChatGPT. Now that the larger LLaMA models have leaked, we can expect significantly better models, though still without a commercial license for now.
That ChatGPT's output can serve as first-class training data is also suggested by a report that Google employees wanted to use ChatGPT dialogs as training data for Bard. The practice was reportedly stopped after an employee brought it to management's attention.
However, high-quality, human-generated data remains relevant for high-performance models, at least for companies like OpenAI that want to improve their own models. This is evidenced by reports that OpenAI employs many human experts to verify or create specialized data, for example for coding tasks. For the open-source community looking to build a free and efficient alternative to ChatGPT, though, the AI-generated results may be enough for now.