
OpenProteinSet provides open-source training data for structural biology at scale

Maximilian Schreiner


OpenProteinSet provides a massive dataset comparable in quality to AlphaFold 2's training data, which was never released to the research community.

Proteins are the workhorses of life. Understanding their sequences and structures is key to tackling challenges ranging from designing new enzymes to developing life-saving drugs. In recent years, DeepMind's AlphaFold 2 has revolutionized the field by predicting protein structures with unprecedented accuracy. But according to a new paper from researchers at Harvard University, Harvard Medical School, Columbia University, New York University, and the Flatiron Institute, progress has been hampered by a lack of open training data.

Now, an open-source database called OpenProteinSet aims to change that by providing protein alignment data on a massive scale.

OpenProteinSet provides 16 million multiple sequence alignments

A protein's function is encoded in its amino acid sequence. Through evolution, small changes accumulate in these sequences, while the overall structure and function often remain largely the same. Multiple sequence alignments (MSAs) are sets of evolutionarily related protein sequences aligned by inserting gaps so that matching amino acids end up in the same columns. Patterns in these MSAs, such as which columns stay constant and which vary, provide rich insights into a protein's structure and function.
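To make the idea of aligned columns concrete, here is a minimal Python sketch that scores how conserved each column of a tiny alignment is. The four sequences are invented for illustration and do not come from OpenProteinSet, and the scoring rule (the share of rows carrying the most common residue) is just one simple choice among many.

```python
from collections import Counter

# Toy alignment with hypothetical sequences; '-' marks a gap inserted
# so that matching amino acids line up in the same column.
msa = [
    "MKT-LLV",
    "MKTALLV",
    "MRT-LIV",
    "MKSALLV",
]

def column_conservation(alignment):
    """For each column, return the fraction of rows that carry the
    most common non-gap residue (0.0 if the column is all gaps)."""
    n_rows = len(alignment)
    scores = []
    for column in zip(*alignment):  # iterate column by column
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_rows)
    return scores

scores = column_conservation(msa)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: {score:.2f}")
```

Columns that score close to 1.0 are positions evolution has left almost untouched, which often hints at structural or functional importance. Models such as AlphaFold 2 extract far richer signals from MSAs, but the column-wise view is the same.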

MSAs have long been essential to protein research, but their usefulness exploded in 2021 with AlphaFold 2, which predicts protein structures with near-experimental accuracy based on massive amounts of MSA data. While AlphaFold 2 is open source, its training data remained private.

OpenProteinSet now provides 16 million MSAs and associated data, all open source. It includes MSAs for all 140,000 proteins in the Protein Data Bank (PDB), the definitive database of experimentally determined protein structures. It also includes sequences from the UniProt knowledge base, clustered by similarity.

For PDB proteins, OpenProteinSet provides raw MSAs from multiple sequence databases. It also includes structurally similar proteins identified by searching the PDB. Predicted structures from AlphaFold 2 are included for 270,000 different UniProt clusters.

Researchers recreate AlphaFold 2 with open-source dataset

The developers also used OpenProteinSet to train OpenFold, an open recreation of AlphaFold 2. According to the team, OpenFold performs on par with the original, which indicates that the open data is sufficient for training a model of this class.

"With OpenProteinSet, we have greatly increased the quantity and quality of precomputed MSAs available to the molecular machine learning communities," the team said. "The dataset has immediate applications to diverse tasks in structural biology."

OpenProteinSet is hosted on AWS and freely available.
