A research team has developed Evo 2, which they describe as the largest AI model yet created for biological applications. The system can generate complete chromosomes and understand complex genetic variations across different forms of life.
Evo 2 builds on an extensive genome atlas containing 9.3 trillion DNA base pairs from bacteria, archaea and eukaryotes - representing more than 100,000 species. Researchers from the Arc Institute, Stanford University, UC Berkeley, UC San Francisco and Nvidia say this broad training allows the model to predict and design biological sequences from molecular to genomic scales across all life forms.
The team developed two versions of Evo 2, with 7 billion and 40 billion parameters respectively. Both can process sequence contexts up to 1 million base pairs long. According to the researchers, the model learns to predict how genetic variants affect function directly from DNA sequences, without any additional task-specific training.
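To illustrate how this kind of zero-shot variant scoring typically works, here is a minimal Python sketch: a variant is scored by the change in sequence log-likelihood under a genomic language model. The scoring interface and the dummy scorer are placeholders for illustration only, not the actual Evo 2 API.

```python
# Minimal sketch of zero-shot variant effect scoring with a genomic language
# model. The score_log_likelihood callable is a placeholder, not the real
# Evo 2 interface: swap in your model's per-sequence log-likelihood function.

from typing import Callable


def apply_snv(sequence: str, position: int, alt_base: str) -> str:
    """Return the sequence with a single-nucleotide variant at `position` (0-based)."""
    return sequence[:position] + alt_base + sequence[position + 1:]


def variant_effect_score(
    score_log_likelihood: Callable[[str], float],
    reference: str,
    position: int,
    alt_base: str,
) -> float:
    """Delta log-likelihood between variant and reference sequence.

    Strongly negative scores suggest the variant disrupts patterns the model
    learned from natural genomes, a common zero-shot proxy for pathogenicity.
    """
    variant = apply_snv(reference, position, alt_base)
    return score_log_likelihood(variant) - score_log_likelihood(reference)


if __name__ == "__main__":
    # Dummy scorer so the sketch runs end to end; replace with a real model call.
    def dummy_scorer(seq: str) -> float:
        return -0.1 * seq.count("T")  # stand-in heuristic, for illustration only

    ref = "ACGTACGTACGT"
    print(variant_effect_score(dummy_scorer, ref, position=3, alt_base="A"))
```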
Model demonstrates ability to generate complex genetic structures
Testing shows that Evo 2 picks up a range of biological features without explicit supervision and can generate complete mitochondrial genomes, prokaryotic genomes, and eukaryotic chromosomes matching the length and complexity of natural ones. When analyzing mutations in the breast cancer gene BRCA1, the system nearly matched the accuracy of the best existing AI models at identifying disease-causing changes.
The researchers found that inference-time search - in which Evo 2 generates many candidate sequences and filters them through an evaluation function - allows precise control over complex epigenomic features such as chromatin accessibility. They describe this as the first demonstration of inference-time compute scaling in biology.
The ability to control chromatin accessibility - how tightly DNA is packed in the cell nucleus - is particularly significant. This packaging determines whether genes can be accessed and activated by cellular proteins or remain silent. By combining generative modeling with inference-time search, Evo 2 can design DNA sequences with specific epigenetic regulatory patterns, precisely defining which regions should be accessible and which should stay inactive.
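As a rough illustration of that search loop, the following Python sketch samples a batch of candidate sequences and keeps the one whose predicted accessibility profile best matches a target pattern. The sequence generator and accessibility predictor here are dummy stand-ins, not Evo 2 or any specific epigenomic model.

```python
# Minimal sketch of best-of-N inference-time search: sample candidate
# sequences from a generative model, score each with an external evaluation
# function (here, a stand-in chromatin accessibility predictor), and keep the
# candidate closest to a target accessibility profile. Both callables are
# placeholders, not actual Evo 2 or epigenomic-model APIs.

import random
from typing import Callable, List


def inference_time_search(
    generate: Callable[[], str],                           # samples one candidate DNA sequence
    predict_accessibility: Callable[[str], List[float]],   # per-window accessibility scores
    target: List[float],                                   # desired accessibility profile
    num_candidates: int = 64,
) -> str:
    """Return the generated sequence whose predicted profile best matches `target`."""
    def mismatch(profile: List[float]) -> float:
        return sum((p - t) ** 2 for p, t in zip(profile, target))

    candidates = [generate() for _ in range(num_candidates)]
    return min(candidates, key=lambda seq: mismatch(predict_accessibility(seq)))


if __name__ == "__main__":
    random.seed(0)

    # Dummy stand-ins so the sketch runs: random sequences and a GC-content
    # based "accessibility" score per 10-bp window. Replace with real models.
    def dummy_generate(length: int = 50) -> str:
        return "".join(random.choice("ACGT") for _ in range(length))

    def dummy_accessibility(seq: str, window: int = 10) -> List[float]:
        return [
            sum(b in "GC" for b in seq[i:i + window]) / window
            for i in range(0, len(seq), window)
        ]

    target_profile = [1.0, 0.0, 1.0, 0.0, 1.0]  # alternating open/closed regions
    best = inference_time_search(dummy_generate, dummy_accessibility, target_profile)
    print(best)
```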
Open source release aims to accelerate biological research
To help advance biological research and design, the team has made Evo 2 completely open source, including model parameters, training and inference code, and the OpenGenome2 dataset. This makes it one of the largest fully open models in the field. Like its predecessor Evo 1, it uses a hybrid architecture from the StripedHyena series.
Evo 2 represents a major leap forward from Evo 1. The new model was trained on roughly 30 times more data and covers a much wider range of life forms by including eukaryotes. Its sequence context expanded from 8,000 to 1 million base pairs, enabled in part by the new "StripedHyena 2" architecture. While Evo 1 could only work with prokaryotes, Evo 2 makes genome-wide predictions across all domains of life with improved accuracy.
Still a lot of work to do
Stanford computational biologist Anshul Kundaje praised the model's technical architecture but questioned whether it truly understands the distant non-coding sequences that regulate gene activity.
Brian Hie of Stanford and the Arc Institute acknowledges that while Evo 2's generated genomes improve on its predecessor's output, they likely would not yet function in living cells. For ethical and safety reasons, the team deliberately excluded pathogens that infect humans and other complex organisms from the training data and ensured the model does not return useful outputs for queries about them.