OpenProteinSet provides open source training data for structural biology at scale

Aug 14, 2023


OpenProteinSet provides a massive dataset of the same quality as the one used to train AlphaFold 2, which was not made available to the research community.

Proteins are the workhorses of life. Understanding their sequences and structures is key to tackling challenges ranging from designing new enzymes to developing life-saving drugs. In recent years, DeepMind's AlphaFold 2 AI system in particular has revolutionized the field, predicting protein structures with unprecedented accuracy. But according to a new paper from researchers at Harvard University, Harvard Medical School, Columbia University, New York University, and the Flatiron Institute, progress has been hampered by a lack of open training data.

Now, an open-source database called OpenProteinSet aims to change that by providing protein alignment data on a massive scale.

OpenProteinSet provides 16 million multiple sequence alignments

A protein's function is encoded in its amino acid sequence. Through evolution, small changes in these sequences accumulate, while the overall structure and function remain the same. Multiple sequence alignments (MSAs) are sets of evolutionarily related protein sequences aligned by inserting gaps so that matching amino acids end up in the same columns. Analysis of patterns in these MSAs provides rich insights into a protein's structure and function.
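To make the idea of aligned columns concrete, here is a minimal Python sketch. It is not taken from OpenProteinSet or its tooling: the four sequences are invented, and `column_conservation` is a hypothetical helper name. It reports, for each column of a toy MSA, the most common amino acid and how often it occurs among the non-gap entries.

```python
from collections import Counter

# Toy MSA, invented for illustration: every row has the same length,
# and '-' marks a gap inserted so that related residues share a column.
msa = [
    "MKT-AYIA",
    "MKTQAYLA",
    "MRT-AYIA",
    "MKTQSYIA",
]


def column_conservation(alignment):
    """Return (most common residue, frequency among non-gap entries) per column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    result = []
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            # Column consists only of gaps.
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


conservation = column_conservation(msa)
for position, (residue, frequency) in enumerate(conservation, start=1):
    print(f"column {position}: {residue} {frequency:.2f}")
```

Columns where one residue dominates are strongly conserved, which often points to positions that matter for structure or function. Real MSAs contain thousands of sequences, and models such as AlphaFold 2 learn far richer patterns than this simple per-column count.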

MSAs have long been essential to protein research, but their usefulness exploded in 2021 with AlphaFold 2, which predicts protein structures with near-experimental accuracy based on massive amounts of MSA data. While AlphaFold 2 is open source, its training data remained private.

OpenProteinSet now provides 16 million MSAs and associated data, all open source. It includes MSAs for all 140,000 proteins in the Protein Data Bank (PDB), the definitive database of experimentally determined protein structures. It also includes sequences from the UniProt knowledge base, clustered by similarity.

For PDB proteins, OpenProteinSet provides raw MSAs from multiple sequence databases. It also includes structurally similar proteins identified by searching the PDB. Predicted structures from AlphaFold 2 are included for 270,000 different UniProt clusters.

Researchers recreate AlphaFold 2 with open-source dataset

The developers also used OpenProteinSet to train OpenFold, an open recreation of AlphaFold 2. According to them, OpenFold performs on par with the original, which indicates that the open data is sufficient to train a model of that quality.

"With OpenProteinSet, we have greatly increased the quantity and quality of precomputed MSAs available to the molecular machine learning communities," the team said. "The dataset has immediate applications to diverse tasks in structural biology."

OpenProteinSet is hosted on AWS and is openly available.
