language model | BrombergLab

ProtTale

Mon, 08 Jun 2026 00:00:00 +0000

Reliability-aware Generative Annotation of Protein Function

Genome sequencing and the corresponding gene/protein discovery vastly outpace functional characterization, leaving much of protein space functionally dark. Generative protein-to-text models annotate sequences with free text but offer no reliability signal, and surface metrics cannot tell whether two descriptions refer to the same molecular function. Here we present ProtTale, which couples sequence-to-text generation with a built-in reliability head, and an LLM-as-judge protocol that scores functional equivalence at the semantic level. On 1,031 unseen Swiss-Prot proteins held out at 40% identity, ProtTale and four baselines reach similar accuracy but cover orthogonal slices, with ProtTale uniquely recovering 60 proteins missed by every other method. The reliability head raises ProtTale’s confident match rate from 26.5% to 44.4% under a discrete filter and to 90% under a continuous score. By providing a per-prediction reliability score, ProtTale enables users to retain only trustworthy annotations, making generative function annotation practically useful even when accuracy saturates.
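
The idea of selectively retaining annotations by a per-prediction reliability score can be illustrated with a minimal sketch. This is not ProtTale's code: the `Prediction` record, the `selective_match_rate` helper, the score range of 0 to 1, and the toy data are all assumptions made for illustration. It only shows how raising a threshold trades coverage for a higher match rate among the retained predictions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    """Hypothetical record for one generated annotation."""
    protein_id: str
    reliability: float  # assumed per-prediction score in [0, 1]
    is_match: bool      # assumed judge verdict: functionally equivalent to the reference


def selective_match_rate(
    preds: List[Prediction], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (coverage, match_rate) for predictions with reliability >= threshold.

    coverage   = fraction of all predictions that are retained
    match_rate = fraction of retained predictions judged a match,
                 or None when nothing is retained
    """
    kept = [p for p in preds if p.reliability >= threshold]
    if not kept:
        return 0.0, None
    coverage = len(kept) / len(preds)
    match_rate = sum(p.is_match for p in kept) / len(kept)
    return coverage, match_rate


# Toy data, invented for illustration only.
preds = [
    Prediction("P1", 0.95, True),
    Prediction("P2", 0.80, True),
    Prediction("P3", 0.60, False),
    Prediction("P4", 0.30, False),
    Prediction("P5", 0.20, True),
]

for t in (0.0, 0.5, 0.7):
    print(t, selective_match_rate(preds, t))
```

On this toy data, no filtering retains everything at a lower match rate, while a threshold of 0.7 retains fewer predictions, all of which match. A discrete filter corresponds to a single fixed cutoff; a continuous score lets the user choose the cutoff to suit the application.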

REBEAN - DNA Language Model for metagenomic exploration

Mon, 08 Jun 2026 00:00:00 +0000

Deciphering enzymatic potential in metagenomic reads through DNA language models

Microbial communities drive essential global processes, yet much of their functional potential remains unexplored. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing the microbial community DNA from environmental samples. However, the exploration of metagenomic sequences is mostly limited to establishing their similarity to curated reference sequences. Language model (LM)-based methods represent a paradigm shift, offering promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two LMs: a pretrained foundation model, REMME (Read EMbedder for Metagenomic Exploration), aimed at understanding the DNA context of metagenomic reads, and the fine-tuned REBEAN (Read Embedding-Based Enzyme ANnotator) for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function recognition over gene identification, REBEAN labels gene-encoded molecular functions of previously explored and new (orphan) sequences. Even though it was not trained to do so, REBEAN identifies the gene’s function-relevant parts. It thus expands enzymatic annotation of unassembled metagenomic reads.

Life & Earth - Deep Transfer Learning

Wed, 17 Apr 2019 00:00:00 +0000

Looking Glass

The network of interactions linking the biosphere and the geosphere is vastly complex. This complexity obscures the biogeochemically relevant features underlying the noisy and diverse microbial communities inhabiting our planet. It thus presents both a significant challenge and an exciting opportunity to apply new computational approaches to modeling microbial interactions with the Earth system. Deep learning techniques are well suited to high-complexity systems. Here I propose to train a deep neural network on all publicly available metagenomes to learn the complex features underlying microbial communities and, using transfer learning, to leverage these features to link environmental microbes to their associated geochemistry and mineralogy. These models will also allow me to validate the presence of key protein motifs under variable geochemical regimes, under both modern and ancient Earth conditions.

To this end, I will introduce a novel data augmentation approach to produce the millions of metagenomic inputs necessary for deep learning to be most effective. I will train the first deep learning model on all publicly available metagenomes, deriving those functionally relevant features underlying all microbial communities. Using transfer learning, I will leverage the complex features learned on this metagenome corpus to predict geochemical and mineralogical compositions from a smaller curated set of metagenomes with known mineralogy and geochemistry. This effort will create models that capture and predict the complexity of the bio-geosphere at a depth that is currently intractable.