Non-profit AI organization LAION has released DISCO-12M, which it calls the largest publicly available music dataset.
The collection aims to support open audio AI model development and contains 12 million links to YouTube music tracks along with their associated metadata. LAION provides only links to publicly available YouTube videos and their metadata, not the music files themselves, and explicitly disclaims responsibility for the linked content.
LAION describes DISCO-12M as a step up from DISCO-10M. By pulling data directly from YouTube Music instead of Spotify, the organization says it has eliminated the errors that occurred when it had to manually match Spotify metadata to YouTube videos. It has also expanded the dataset's artist selection to 250,516 by analyzing country charts and genre playlists.
LAION suggests the DISCO-12M dataset can help researchers advance several areas, such as building better audio AI models, identifying key musical features, creating content-based music searches, and improving music recommendation systems.
Restricted to academic research
The dataset, released under the Apache 2.0 license, is intended strictly for academic research. LAION specifically discourages industrial applications and commercial product development. This aligns with a recent Hamburg Regional Court ruling that deemed such data collection legal when used for non-commercial scientific research.
Founded in Germany in 2021, LAION promotes open AI development and is known for LAION-5B, a dataset used to train well-known AI models like Stable Diffusion. However, the organization has faced criticism over some datasets containing links to copyrighted material or private content not intended for AI training. In one case, LAION had to remove links to child sexual abuse material from its LAION-5B dataset.