
Non-profit AI organization LAION has released DISCO-12M, which it calls the largest publicly available music dataset.


The collection aims to support open audio AI model development and consists of 12 million links to YouTube music tracks plus their associated metadata. LAION provides only these links and metadata, not the music files themselves, and explicitly disclaims responsibility for the linked content.

LAION describes DISCO-12M as a step up from DISCO-10M. By pulling data directly from YouTube Music instead of Spotify, the organization eliminated the errors that occurred when it had to manually match Spotify metadata to YouTube videos. It also expanded the dataset's artist selection to 250,516 by analyzing country charts and genre playlists.

LAION suggests the DISCO-12M dataset can help researchers advance several areas, such as building better audio AI models, identifying key musical features, creating content-based music searches, and improving music recommendation systems.


Restricted to academic research

The dataset, released under the Apache 2.0 license, is strictly for academic research. LAION specifically discourages industrial applications or commercial product development. This aligns with a recent Hamburg Regional Court ruling that deemed such data collection legal when used for non-commercial scientific research.

Founded in Germany in 2021, LAION promotes open AI development and is known for LAION-5B, a dataset used to train well-known AI models like Stable Diffusion. However, the organization has faced criticism over some datasets containing links to copyrighted material or private content not intended for AI training. In one case, LAION had to remove links to child sexual abuse material from its LAION-5B dataset.

Summary
  • LAION, a non-profit AI organization, has released DISCO-12M, which it describes as the largest publicly available music dataset for training audio AI models. It contains 12 million links to YouTube music tracks with associated metadata.
  • DISCO-12M improves on its predecessor, DISCO-10M, by collecting data directly from YouTube Music, bypassing Spotify, and increasing the number of recorded artists to over 250,000.
  • While LAION is committed to open AI development, the new dataset is intended for academic research only, and commercial use is not recommended due to potential copyright and privacy concerns associated with some of LAION's datasets.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.