Life on Earth would not exist as we know it, if not for the protein molecules that enable critical processes from photosynthesis and enzymatic degradation to sight and our immune system. And like most facets of the natural world, humanity has only just begun to discover the multitudes of protein types that actually exist. But rather scour the most inhospitable parts of the planet in search of novel microorganisms that might have a new flavor of organic molecule, Meta researchers have developed a first-of-its-kind metagenomic database, the ESM Metagenomic Atlas, that could accelerate existing protein-folding AI performance by 60x.
Metagenomics is just coincidentally named. It is a relatively new, but very real, scientific discipline that studies “the structure and function of entire nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample.” Often used to identify the bacterial communities living on our skin or in the soil, these techniques are similar in function to gas chromatography, wherein you’re trying to identify what’s present in a given sample system.
Similar databases have been launched by the NCBI, the European Bioinformatics Institute, and Joint Genome Institute, and have already cataloged billions of newly uncovered protein shapes. What Meta is bringing to the table is “a new protein-folding approach that harnesses large language models to create the first comprehensive view of the structures of proteins in a metagenomics database at the scale of hundreds of millions of proteins,” according to a Tuesday release from the company. The problem is that, while advances of genomics have revealed the sequences for slews of novel proteins, just knowing what those sequences are doesn’t actually tell us how they fit together into a functioning molecule and going figuring it out experimentally takes anywhere from a few months to a few years. Per molecule. Ain’t nobody got time for that.