DNA Language Model for metagenomic exploration

Mon, 08 Jun 2026 00:00:00 +0000

Creating functional maps of protein sequences

Microbial communities drive essential global processes, yet much of their functional potential remains unexplored. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing microbial community DNA from environmental samples. However, the exploration of metagenomic sequences is mostly limited to establishing their similarity to curated reference sequences. Language model (LM)-based methods represent a paradigm shift, offering promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two LMs: the pretrained foundation model REMME (Read EMbedder for Metagenomic Exploration), aimed at understanding the DNA context of metagenomic reads, and the fine-tuned REBEAN (Read Embedding-Based Enzyme ANnotator), for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function recognition over gene identification, REBEAN labels gene-encoded molecular functions of both previously explored and new (orphan) sequences. Even though it was not trained to do so, REBEAN identifies the function-relevant parts of genes. It thus expands enzymatic annotation of unassembled metagenomic reads.

Measuring representational uncertainty

Mon, 08 Jun 2026 00:00:00 +0000

Biomolecular embeddings serve as efficient representations of sequence and structure, enabling tasks such as similarity searches, structure and function prediction, and estimation of biophysical properties. However, relying on embeddings without assessing their ability to accurately represent biomolecules is a critical flaw, akin to using a scalpel in surgery without verifying its sharpness. Here we propose a means to evaluate the capacity of protein language models to encode biologically meaningful information. For each protein, representation uncertainty is scored as the fraction of non-biological ‘synthetic’ sequences among its nearest neighbors in latent space. Our analysis reveals that low-quality embeddings often fail to capture meaningful biology, displaying vector properties indistinguishable from those of randomly generated sequences. Our model-agnostic scoring framework is, to our knowledge, the first to quantify protein sequence embedding reliability. It enables embedding screening prior to downstream applications and inferences, significantly improving their reliability. We propose that embedding evaluation should be undertaken for other uses of language models in science as well.
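
The scoring idea described above (the fraction of synthetic sequences among a protein's nearest neighbors in latent space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `synthetic_neighbor_fraction`, the use of Euclidean distance, the neighborhood size `k`, and the toy two-dimensional "embeddings" are all assumptions made for illustration. In practice the embeddings would come from a protein language model, and the synthetic set would be generated separately.

```python
import numpy as np

def synthetic_neighbor_fraction(queries, reference, is_synthetic, k=10):
    """For each query embedding, return the fraction of its k nearest
    reference embeddings that are flagged as synthetic.

    Assumptions (not from the source): Euclidean distance, brute-force search.
    """
    queries = np.asarray(queries, dtype=float)
    reference = np.asarray(reference, dtype=float)
    is_synthetic = np.asarray(is_synthetic, dtype=bool)
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference embeddings")

    # Pairwise squared Euclidean distances, shape (n_queries, n_reference).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.einsum("qrd,qrd->qr", diffs, diffs)

    # Indices of the k closest reference points per query
    # (the order among those k does not matter for a fraction).
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return is_synthetic[nearest].mean(axis=1)

# Toy reference set: three "natural" points near the origin,
# three "synthetic" points near (5, 5). Purely hypothetical data.
reference = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
is_synthetic = [False, False, False, True, True, True]

# One query inside the natural cluster, one inside the synthetic cluster.
queries = [[0.2, 0.2], [5.2, 5.2]]
print(synthetic_neighbor_fraction(queries, reference, is_synthetic, k=3))
```

Under these assumptions, a score near 0 means the embedding sits among natural sequences, while a score near 1 means it is surrounded by synthetic ones and may carry little biological signal. For large embedding sets, the brute-force distance matrix would likely need to be replaced by an approximate nearest-neighbor index.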

prabakaran-ram | BrombergLab
