TikTok's parent company ByteDance has had its account suspended by OpenAI after it was revealed that the company secretly used OpenAI's technology to develop a competing AI model called Project Seed.
According to internal ByteDance documents leaked to The Verge editor Alex Heath, ByteDance used OpenAI's API at nearly every stage of Project Seed's development, including training and evaluating the model.
Employees were aware of the implications and discussed on Lark, ByteDance's internal communication platform, how they could obscure the evidence through "data desensitization."
Using output from OpenAI's models to develop competing AI models is a direct violation of OpenAI's terms of service. ByteDance had access to GPT-4 through Microsoft's Azure OpenAI Service, which is subject to the same rules.
Such data sourcing could help competitors obtain high-quality training data, and thus better AI models, much faster. But it also risks passing the errors and biases of the generating model on to other models, degrading the quality of both the generated output and the training data built from it.
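To illustrate the general pattern at issue, here is a minimal sketch of how a competitor might use the OpenAI API to generate synthetic instruction/response pairs for training its own model. This is not ByteDance's actual pipeline, which has not been published; the prompts, file name, and choice of model here are assumptions for illustration only.

```python
# Illustrative only: generating synthetic training data from API responses.
# This is NOT ByteDance's code; prompts, file names, and the model choice
# are assumptions. OpenAI's terms prohibit using output obtained this way
# to develop competing models.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Explain the difference between supervised and unsupervised learning.",
    "Summarize the plot of 'Journey to the West' in three sentences.",
]

with open("synthetic_training_data.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        # Each line becomes an instruction/response pair that could later be
        # used to fine-tune another model -- the practice at issue here.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
```

Any errors or biases in the responses written to such a file would be inherited by a model trained on it, which is the quality risk described above.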
OpenAI investigates possible terms of service violation by ByteDance
OpenAI spokesperson Niko Felix confirmed to Heath that ByteDance's account has been suspended and that the allegations are being investigated. ByteDance has made only minimal use of the API to date, Felix said. If ByteDance's use of the API is found to violate OpenAI's rules, the company will have to make changes or its account will be terminated.
ByteDance spokesperson Jodi Seth told Heath that GPT-generated data was used for annotation early in Project Seed's development and that this data was removed from ByteDance's training data in the middle of the year. ByteDance is a licensed Microsoft partner and uses GPT models in products outside of China, she said.
In Project Seed, ByteDance is developing language models for the Doubao chatbot and a business chatbot that is to be commercialized as a cloud product.
The main goal of Project Seed is to become China's answer to ChatGPT as soon as possible. The team has been tasked with reaching GPT-3.5 performance by the end of this year and GPT-4 performance by mid-2024, Heath reported.
The current Seed model reportedly has 200 billion parameters. GPT-3 had 175 billion parameters, while GPT-4 is estimated to have approximately 1.8 trillion parameters across its combined expert models. However, parameter count alone has become a less reliable indicator of a model's performance since the release of GPT-3.