A team from Together AI and the Arc Institute presents Evo, an AI model for biological research that can interpret DNA, RNA, and proteins and enable generative design at the molecular and genomic level.

Developed by Eric Nguyen, Michael Poli, Matthew Durrant, Patrick Hsu, and Brian Hie, the model represents a milestone in the processing and analysis of biological data. It uses a modified version of the StripedHyena architecture to interpret the fundamental biological "languages" (DNA, RNA, and proteins), make predictions, and enable generative design from the molecular to the genomic level.

The new architecture enables Evo to model long contexts: it can process sequences up to 131 kilobases (131,000 bases) in length and generate sequences of more than 650,000 tokens. This is particularly important for biological AI models because DNA sequences can be extremely long (up to billions of nucleotides), and because understanding the effects of evolution requires sensitivity to changes in a single nucleotide. Evo therefore works at the nucleotide level, recognizing and interpreting the smallest building blocks of DNA and RNA.

"Evo tries to show a path forward toward unified and foundation modeling on biology," says Michael Poli, co-author of Evo and StripedHyena. Like language models, Evo is trained with a next-token prediction objective, in this case at the nucleotide level. "The problem up until now, why this hasn't been done, is that sequences are extremely long if you want to capture meaningful properties about DNA and also learning at high resolution is quite challenging for transformers," says Poli. He is alluding to tokenizers, which convert text into tokens in language models. They usually do not work at the character level, instead merging parts of words or multiple digits into a single token, and are often responsible for problems in LLM performance.
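
To make the idea of single-nucleotide tokens and next-token prediction concrete, here is a minimal Python sketch. It is not Evo's actual tokenizer or training code: the four-letter vocabulary `VOCAB` and the helper names `tokenize` and `next_token_pairs` are assumptions made for illustration only.

```python
# Hypothetical vocabulary: one token ID per nucleotide.
# This is an illustrative assumption, not Evo's real tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to its own token ID (single-nucleotide resolution)."""
    return [VOCAB[base] for base in seq.upper()]


def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """Pair every prefix of the sequence with the token that follows it.

    A next-token prediction objective trains a model to predict the
    second element of each pair from the first.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = tokenize("ACGTTA")
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Because every base is its own token, changing a single nucleotide changes exactly one token, which a subword tokenizer that merges several characters into one token would not guarantee.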

The team was also able to reproduce this difficulty in its own experiments when training Transformer models and other architectures such as Mamba. "Well, the amazing thing is that these deep signal processing architectures seem to scale better," Poli says. "It's not just that they can process these longer sequences and then do about as well as transformers. It's as if they scale better per flop. They're just better architectures, I believe, than transformers."

Evo is a foundation model for biology

Evo was trained on a large database of 2.7 million prokaryotic genomes, a fraction of the publicly available genomic data. The model was trained in two stages: in the first, with a context length of 8,000 base pairs, and in the second, with the context length increased to 131,000 base pairs. This allows the model to recognize patterns and make predictions across much longer DNA sequences than previous methods. The corresponding training dataset, OpenGenome, will be made publicly available shortly.
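
The effect of the two context lengths on how a genome is presented to the model can be sketched as follows. The article does not describe how the training data was actually chunked or batched, so the non-overlapping split in the hypothetical `context_windows` helper is an assumption, and the genome is a synthetic placeholder rather than real data.

```python
def context_windows(genome: str, context_len: int) -> list[str]:
    """Split a sequence into non-overlapping windows of at most context_len bases.

    Non-overlapping chunking is an assumption for illustration; the article
    does not say how the training sequences were prepared.
    """
    return [genome[i:i + context_len] for i in range(0, len(genome), context_len)]


STAGE_1_CONTEXT = 8_000    # first training stage, as reported in the article
STAGE_2_CONTEXT = 131_000  # second training stage, as reported in the article

genome = "ACGT" * 100_000  # synthetic 400,000-base placeholder, not real data

stage1 = context_windows(genome, STAGE_1_CONTEXT)
stage2 = context_windows(genome, STAGE_2_CONTEXT)

print(len(stage1), "windows at stage 1")
print(len(stage2), "windows at stage 2")
```

With the longer context, the same genome is covered by far fewer and much larger windows, so patterns that span tens of thousands of bases can fall inside a single training example.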

Early experiments with Evo show the potential for several applications, including predicting which genes are essential to an organism based on small DNA mutations. This capability could replace traditional laboratory experiments, which the team says can often take months.

Image: Nguyen, Poli, Durrant et al.

In tests, Evo was able to compete with leading protein-specific language models in predicting the effects of mutations on the function of E. coli proteins. Evo can also predict the functional properties of non-coding RNAs (ncRNAs) and infer gene expression from regulatory DNA.
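
The article does not explain how mutation effects are scored. One common approach with sequence models is to compare the log-likelihood the model assigns to the mutated sequence with the log-likelihood of the original. The Python sketch below illustrates that idea under loud assumptions: a tiny bigram model stands in for a trained model such as Evo, the training sequences are made up, and all function names are hypothetical.

```python
import math
from collections import Counter


def fit_bigram(seqs: list[str], alphabet: str = "ACGT") -> dict[tuple[str, str], float]:
    """Estimate P(next base | previous base) with add-one smoothing.

    A toy stand-in for a trained sequence model, used for illustration only.
    """
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    probs = {}
    for a in alphabet:
        total = sum(counts[(a, b)] for b in alphabet) + len(alphabet)
        for b in alphabet:
            probs[(a, b)] = (counts[(a, b)] + 1) / total
    return probs


def log_likelihood(seq: str, probs: dict[tuple[str, str], float]) -> float:
    """Sum the log-probabilities of each base given the base before it."""
    return sum(math.log(probs[pair]) for pair in zip(seq, seq[1:]))


def mutation_score(wild_type: str, position: int, new_base: str,
                   probs: dict[tuple[str, str], float]) -> float:
    """Log-likelihood of the mutant minus that of the wild type.

    Negative scores mean the model finds the mutant less plausible.
    """
    mutant = wild_type[:position] + new_base + wild_type[position + 1:]
    return log_likelihood(mutant, probs) - log_likelihood(wild_type, probs)


training = ["ACGACGACGACG", "ACGACGACG"]  # made-up sequences, not real genomes
model = fit_bigram(training)
score = mutation_score("ACGACGACG", 4, "T", model)
print(f"score for C->T at position 4: {score:.3f}")
```

In this toy setting, a substitution that breaks the pattern seen during training receives a negative score, which is the kind of signal that could then be compared against laboratory measurements.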

In addition, Evo can generate complex molecular systems such as CRISPR-Cas complexes and transposable elements. It can also generate DNA sequences longer than 650 kilobases, an order of magnitude larger than previous methods. And while previous generative models typically focus on a single modality, Evo is capable of designing large functional complexes of proteins and ncRNAs.

Evo is capable of developing generative designs from molecular to genomic scale. | Video: Together AI

Evo raises ethical questions that need to be answered

The Evo team sees their model as a potential milestone in the modeling of biological sequences, with potential applications in fields as diverse as chemistry, materials science, drug discovery, agriculture, and sustainability. However, the practical application of the generated sequences will require further validation, according to the team.

Evo is the first system of its kind that can predict and generate DNA sequences at the level of the entire genome, with single-nucleotide resolution. "Future capabilities that emerge from large-scale DNA models like Evo also require additional work to ensure that these capabilities are deployed safely and for the benefit of humanity," the blog post states.

There are concerns about potential misuse, social and health injustice, and environmental degradation. The team suggests developing comprehensive guidelines for ethical practices, promoting transparency, and supporting international collaborations and partnerships that could contribute to the responsible use and development of tools such as Evo.

Investment in education and capacity building, as well as collaboration with organizations such as the Global Alliance for Genomics and Health (GA4GH), could also contribute to a future in which advances in genetic engineering are consistent with ethical principles and societal values.

The team provides the code and the model via GitHub.

Summary
  • Researchers from Together AI and the Arc Institute released Evo, an AI model for biological research that can interpret DNA, RNA, and proteins and enable generative design at the molecular and genomic scale.
  • Evo can accurately analyze long genetic sequences and has been trained on an extensive database of 2.7 million complete prokaryotic genomes.
  • Potential applications of Evo include the prediction of essential genes, protein functions, and regulatory DNA sequences, and the design of new CRISPR systems for gene editing.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.