
Planetary Bioinformatics: Breaking Through Biology’s Data Wall On Earth

By Keith Cowing
Status Report
Basecamp Research
June 27, 2025
The data wall and performance plateau for foundation models in biology – BaseCamp Research

Editor’s note: If we aspire to mount expeditions to new worlds and then characterize and quantify whatever life forms we find, we will need the ability to map and understand whatever metabolic and genomic systems are in operation. We need to know not only how alien biota function, but also how they evolved – what differences and similarities they may have with the origin and evolution of life on Earth. Improving in situ capabilities like these would allow much more preliminary analysis to be done on site – or back on Earth.

And once we have collected this information we will need to find a way to understand the totality of another world’s bioinformatics so as to understand how its biosphere and its component species operate, how they evolved, and how all of this can be compared to what happened on Earth. Is there one basic way for life to occur or many? Learning how to do this on our home world first is the best way to practice for the onslaught of alien genomics information we will eventually collect on the worlds we will one day visit.

The folks at BaseCamp Research are taking that first big step – often seeking out genomic data in places never visited before and then integrating it all. That’s how you start to truly understand a living world.


Abstract from Breaking Through Biology’s Data Wall: Expanding the Known Tree of Life by Over 10x using a Global Biodiscovery Pipeline, BaseCamp Research (open access)

Advancements in the life sciences have always been built upon our collective understanding of life on Earth. Now, the rise of generative biology – the use of AI foundation models to design, generate, and annotate proteins, pathways and therapeutics – is creating unprecedented demand for large, diverse biological sequence datasets. While a limited subset of such data can be generated in clinical or laboratory settings, the vast majority of the training data for unsupervised models must be sourced from the natural world – the product of nearly four billion years of evolutionary history.

However, the public databases that currently supply this data, while foundational to research, were established to aggregate results from academic experiments, not as training datasets for machine learning. Their human-centric data structure limits model performance due to redundancy, taxonomic and geographic bias, limited biological context, and inconsistent provenance. With 68% of all sequence data in the SRA database coming from just 5 species, this is one of the most severe class imbalance problems ever encountered in AI. Legal and infrastructural constraints further exacerbate this bottleneck.

To address these limitations and support scalable model training, we introduce BaseData™: the largest and fastest-growing biological sequence database ever built, and the first purpose-built for training foundation models. As of late 2024, BaseData™ contained 9.8 billion novel genes, representing more than a 10-fold expansion in known protein diversity after accounting for redundancy.

BaseData™ also contains more than 1 million species not represented in other genomic databases. Its partnership-driven data supply chain across 26 countries and autonomous regions enables growth of up to 2 billion novel genes per month, far exceeding public repositories. All data is collected under benefit sharing agreements using standardized protocols and structured using graph-based, ontology-rich metadata that preserves evolutionary context. BaseData™ represents a new, ethically-grounded infrastructure for training biological foundation models, complementing public efforts and enabling the next era of generative biology.

Full paper: Breaking Through Biology’s Data Wall: Expanding the Known Tree of Life by Over 10x using a Global Biodiscovery Pipeline, BaseCamp Research (open access)

Astrobiology, genomics, bioinformatics

Explorers Club Fellow, ex-NASA Space Station Payload manager/space biologist, Away Teams, Journalist, Lapsed climber, Synaesthete, Na’Vi-Jedi-Fremen-Buddhist-mix, ASL, Devon Island and Everest Base Camp veteran, (he/him) 🖖🏻