Genomics, Proteomics, Bioinformatics

Genome Modeling And Design Across All Domains Of Life With Evo 2

By Keith Cowing
Status Report
biorxiv.org
February 21, 2025
Overview of model architecture, training procedure, datasets, and evaluations for Evo 2. (A) Evo 2 models DNA sequence and enables applications across the central dogma, spanning molecular and cellular scales. (B) Evo 2 is trained on data encompassing trillions of nucleotide sequences from all domains of life. Each UMAP point indicates a single genome. (C) A two-phase training strategy optimizes model performance while expanding up to 1 million base pairs to capture wide ranging biological patterns. (D) Novel data augmentation and weighting approaches prioritize functional genetic elements during pretraining and long-sequence composition during midtraining. (E) The number of tokens used to train Evo 2 40B and 7B, split into the short phase pretraining and the long context midtraining. (F) Schematic of the new multi-hybrid StripedHyena 2 architecture, showing the efficient block layout of short explicit (SE), medium regularized (MR), and long implicit (LI) hyena operators. (G) Comparison of iteration time at 1024 GPU, 40B scale between StripedHyena 2, StripedHyena 1 and Transformers, showing improved throughput. (H) Validation perplexity of Evo 2 midtraining comparing the model size and context length, showing benefits with scale and increasing context length. (I) A modified needle-in-a-haystack task evaluates Evo 2’s long context recall ability up to 1 million sequence length, showing the model performs effective recall at 1 million token context. — biorxiv.org

All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes.

See With Evo 2, AI Can Model And Design The Genetic Code For All Domains Of Life, Nature 7 March 2026

We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning.
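The abstract does not spell out how a variant is scored without finetuning. One common approach for sequence language models, assumed here for illustration, is to compare the model's log-likelihood of the mutated sequence against the reference. The sketch below uses a tiny order-k Markov model as a stand-in for a real genomic model. The names `train_markov`, `sequence_log_likelihood`, and `variant_score` are hypothetical and are not part of any Evo 2 interface.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_markov(seqs, k=2, pseudocount=1.0):
    """Fit a toy order-k Markov model (a stand-in for a genomic language model)."""
    counts = Counter()
    for s in seqs:
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1

    def log_prob(context, base):
        # Smoothed conditional probability of `base` given the previous k bases.
        total = sum(counts[(context, b)] for b in ALPHABET)
        return math.log((counts[(context, base)] + pseudocount)
                        / (total + pseudocount * len(ALPHABET)))

    return log_prob

def sequence_log_likelihood(log_prob, seq, k=2):
    """Sum of per-base log-probabilities, skipping the first k bases."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))

def variant_score(log_prob, ref, pos, alt, k=2):
    """Assumed scoring rule: log-likelihood of mutant minus reference.

    A negative score means the model finds the variant less plausible.
    """
    mutated = ref[:pos] + alt + ref[pos + 1:]
    return (sequence_log_likelihood(log_prob, mutated, k)
            - sequence_log_likelihood(log_prob, ref, k))

# Toy usage: train on a repetitive sequence, then score a substitution.
model = train_markov(["ACGT" * 9])
reference = "ACGTACGTACGT"
disruptive = variant_score(model, reference, pos=5, alt="A")
silent = variant_score(model, reference, pos=5, alt="C")  # same base as reference
```

In this toy setting the substitution that breaks the learned repeat receives a negative score, while re-inserting the reference base scores exactly zero. A real model would replace `log_prob` with its own per-nucleotide probabilities over a much longer context.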

Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods.

Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology.
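As a rough illustration of what inference-time search can mean, the sketch below generates a sequence chunk by chunk, samples several candidate continuations at each step, and keeps only the best-scoring ones (a simple beam search). Everything here is an assumption for illustration: `propose_chunk` draws random bases in place of sampling from a generative model, and `score` uses GC content in place of a predictor of epigenomic structure. Neither reflects the search procedure or scoring function used by the authors.

```python
import random

def propose_chunk(rng, length):
    """Hypothetical stand-in for sampling a chunk from a generative model."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def score(seq, target_gc=0.7):
    """Hypothetical design objective: closeness of GC fraction to a target.

    Returns 0 for a perfect match and a negative value otherwise.
    """
    gc = sum(base in "GC" for base in seq) / len(seq)
    return -abs(gc - target_gc)

def guided_generate(n_chunks, chunk_len, n_candidates, beam_width=2, seed=0):
    """Beam search over chunked generation, guided by `score`."""
    rng = random.Random(seed)
    beams = [""]
    for _ in range(n_chunks):
        # Extend every kept prefix with several sampled continuations.
        candidates = [prefix + propose_chunk(rng, chunk_len)
                      for prefix in beams
                      for _ in range(n_candidates)]
        # Keep the highest-scoring partial sequences for the next round.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy usage: more candidates per step means more inference-time compute.
design = guided_generate(n_chunks=5, chunk_len=20, n_candidates=8)
```

Raising `n_candidates` or `beam_width` spends more compute per generated chunk, which is the knob an inference-time scaling study would vary.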

We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Genome modeling and design across all domains of life with Evo 2, biorxiv.org

Astrobiology, genomics, omics, bioinformatics, evolution

Explorers Club Fellow, ex-NASA Space Station Payload manager/space biologist, Away Teams, Journalist, Lapsed climber, Synaesthete, Na’Vi-Jedi-Freman-Buddhist-mix, ASL, Devon Island and Everest Base Camp veteran, (he/him) 🖖🏻