OpenAI has launched a new initiative called OpenAI Data Partnerships. The goal is to build AI models that deeply understand all subjects, industries, cultures, and languages.
Large AI models learn skills and knowledge about the world from the data they are trained on. To create an AGI that is safe and useful for all of humanity, AI models need a rich training dataset, OpenAI writes.
Trained on more diverse content, AI models could better understand specific domains, which is crucial for their practical use.
Data diversity is crucial
OpenAI is already working with several partners, including the Icelandic government and the non-profit Free Law Project, that want data from their country or sector represented in AI training. The Free Law Project's goal is to improve access to legal knowledge.
OpenAI is particularly interested in large datasets that reflect human society and are not already easily accessible to the public. The data can be text, images, audio, or video. The company especially wants data that expresses human intent, regardless of language, subject, or format.
There are currently two ways to work with OpenAI:
1. Open-source archive: The goal is to create an open-source dataset for training language models that is publicly available to anyone who wants to train AI models. OpenAI says it will also explore how this dataset can be used to safely train other open-source models.
2. Private datasets: For organizations that want to keep their data private but still want AI models to better understand their domain, OpenAI prepares private datasets for training proprietary AI models, including base models and fine-tuned custom models. The company says it handles the data with the level of sensitivity and access controls desired by the partner.