Concepts
UMAP visualization
A UMAP compresses an enormous, high-dimensional dataset into two dimensions so you can see patterns visually. Each dot represents a cluster of proteins, positioned using a 2D projection of its biological feature profile. Think of it less as a precise coordinate system and more as a neighborhood map: proximity is meaningful, but exact distance isn’t. Clusters that sit next to each other share similar biological signals, while clusters on opposite ends of the map are likely to be functionally very different. Note that the UMAP shows only the cluster centers of clusters with at least 50 protein members.
What the colors mean
Clusters are colored by the proportion of member proteins that match a known, annotated protein family. Brightly colored clusters, toward the yellow end of the spectrum, tend to contain unknown proteins, while darker clusters contain more proteins from known families.
Searching Atlas
You can explore the Atlas by using the agent. Describe what you’re looking for (e.g., “mitochondrial membrane transporter involved in calcium signaling”) and the Atlas agent will interpret your query, search UniProt, and return candidate proteins for you to explore. When you search for a protein, you are searching against the clustered dataset (clusters with at least 50 members). This is designed to capture most large functional groups and families, but rare functions or edge cases may not be shown in the UI.
Once a protein is retrieved, you’ll see:
Its predicted 3D structure, colored by per-residue confidence (pLDDT)
Its top activated SAE features with interpretable labels (e.g., “Folate cofactor-binding pocket”)
Its cluster membership and a link to explore similar proteins
The option to view its neighborhood of similar SAE features
Understanding SAE Features
What are SAE features?
When ESMC processes a protein sequence, it encodes everything it learns about that protein into a dense numerical representation that captures structural, functional, and evolutionary information all at once. The problem is that this representation is difficult to interpret, since each number reflects a mix of many biological concepts tangled together.
Sparse Autoencoders (SAEs) are simple neural networks trained to untangle that signal. Think of it like separating the instruments in a piece of music. Instead of hearing everything blended together, you can isolate the violin, the cello, the piano. SAEs decompose ESMC’s representation into thousands of individual, interpretable features, each corresponding to a specific biological concept.
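As a rough illustration of the idea, the sketch below shows a toy sparse-autoencoder forward pass: a dense embedding is mapped to non-negative feature activations, most of which are zero, and then approximately reconstructed from them. The ReLU encoder, the linear decoder, the weights, and the dimensions are all assumptions made for illustration; the actual architecture and parameters of the Atlas SAE are not described here.

```python
# Toy sparse-autoencoder forward pass (illustrative only; NOT the Atlas SAE).
# ASSUMPTION: ReLU encoder + linear decoder, with made-up weights and sizes.

def relu(x):
    return x if x > 0.0 else 0.0


def encode(embedding, enc_weights, enc_bias):
    """Map a dense embedding to non-negative feature activations."""
    return [
        relu(sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def decode(features, dec_weights, dec_bias):
    """Reconstruct the embedding as a weighted sum of per-feature directions."""
    return [
        dec_bias[i] + sum(f * dec_weights[j][i] for j, f in enumerate(features))
        for i in range(len(dec_bias))
    ]


# Hypothetical 2-dimensional embedding and 3 features.
enc_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
enc_bias = [0.0, -0.5, 0.0]
dec_weights = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
dec_bias = [0.0, 0.5]

embedding = [2.0, 0.25]
features = encode(embedding, enc_weights, enc_bias)      # only feature 0 is active
reconstruction = decode(features, dec_weights, dec_bias)  # approximate, not exact
```

Only one of the three toy features fires for this input, which mirrors the sparsity that makes individual features easier to interpret than the dense embedding.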
The Atlas uses an SAE with ~16,000 features. Each feature has been mapped to a biological concept by examining which proteins activate it and what those proteins have in common. SAE features capture a wide range of biological concepts, from broad patterns like aromatic residues to specific functional motifs like folate cofactor-binding pockets, and their interpretability varies accordingly.
Examples include:
Folate cofactor-binding pocket
Hydrophobic transmembrane helices
Acidic juxtamembrane segments
Extended polar low-complexity IDRs
What does “activation” mean?
For any given protein, only a small number of features will activate (typically fewer than 1% of the full set). A feature activates when ESMC detects that biological signal in your protein. Higher activation scores indicate stronger, more confident signals.
When you view a protein in the Atlas, the top activated features are shown ranked by activation score. You can click on any feature to see its description and view which residues on the structure are driving the activation. Features are ranked by a normalized activation score where each feature’s activation is scaled by its maximum observed activation across 208 million UniRef90 proteins, then weighted by how rarely and selectively the feature appears. Features that are both strongly activated and rare across the broader protein space are surfaced as the most informative features for interpreting a given protein.
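The ranking described above can be sketched roughly as follows. The exact rarity weighting is not specified, so this example assumes an IDF-style weight, `log(total_proteins / proteins_with_feature)`; that formula, the function name, and all of the numbers are hypothetical and serve only to show how normalization and rarity can combine.

```python
import math


def rank_features(activations, max_activation, proteins_with_feature, total_proteins):
    """Return (feature_id, score) pairs, highest score first."""
    scored = []
    for feature_id, raw in activations.items():
        # Scale by the feature's maximum observed activation (gives 0..1).
        normalized = raw / max_activation[feature_id]
        # ASSUMPTION: IDF-style rarity weight; the Atlas weighting may differ.
        rarity = math.log(total_proteins / proteins_with_feature[feature_id])
        scored.append((feature_id, normalized * rarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical values for one protein with two active features.
activations = {"f_common": 8.0, "f_rare": 4.0}
max_activation = {"f_common": 10.0, "f_rare": 8.0}
proteins_with_feature = {"f_common": 500_000, "f_rare": 1_000}
total_proteins = 1_000_000

ranking = rank_features(activations, max_activation, proteins_with_feature, total_proteins)
```

In this toy case the rare feature ranks first even though its raw activation is lower, because the common feature is penalized for appearing in half of all proteins.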
Are SAE features the same as database annotations?
No, and this distinction is important. Standard database annotations (like UniProt function entries or Pfam domain labels) reflect what scientists have experimentally observed and curated over decades. SAE features, by contrast, emerge from the model learning patterns across billions of sequences. They often align closely with known biology, but they can also detect functional signals that haven’t been formally annotated.
What Do We Mean by “Similarity”?
The difference between homology and feature similarity
Traditional protein similarity relies on sequence homology: two proteins are considered related if their amino acid sequences are similar enough. This works well for well-studied protein families, but it breaks down in three cases: proteins that have arrived at the same function through different evolutionary paths (convergent evolution), the vast regions of protein space that have simply never been studied, and remote homologs, which are sequences that share a common ancestor but have diverged beyond the detection limit of sequence homology.
The Atlas uses a different approach: SAE feature similarity. Two proteins are considered similar if they share a similar set of activated biological features according to the ESMC world model, regardless of whether their sequences or even their overall structures look alike.
How is similarity calculated?
Similarity is measured using cosine similarity. Two proteins score as similar if they activate the same biological features in similar proportions.
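A minimal sketch of cosine similarity over sparse feature vectors is shown below. It assumes each protein is represented as a mapping from feature ID to activation value; that representation, the feature IDs, and the values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {feature_id: activation}."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a protein with no active features is similar to nothing
    return dot / (norm_a * norm_b)


protein_a = {101: 3.0, 205: 4.0}
protein_b = {101: 6.0, 205: 8.0}  # same features, same proportions, larger magnitude
protein_c = {999: 5.0}            # no features in common with protein_a

sim_ab = cosine_similarity(protein_a, protein_b)
sim_ac = cosine_similarity(protein_a, protein_c)
```

Because cosine similarity depends on direction rather than magnitude, `protein_a` and `protein_b` score as maximally similar even though every activation in `protein_b` is twice as large.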
What is a cluster?
The Atlas groups proteins into clusters based on SAE feature similarity. Each cluster is a group of proteins that share a highly similar set of activated biological features, meaning they likely share functional characteristics even if they don’t share obvious sequence identity.
How are clusters calculated?
Clusters are built using a specialized linear-time, hash-based algorithm inspired by Linclust, adapted to operate directly in SAE feature space rather than sequence space. The algorithm uses MinHash signatures and Locality-Sensitive Hashing to efficiently identify candidate protein pairs, which are then verified using exact Jaccard similarity. Proteins are grouped greedily, and each cluster member is guaranteed to share at least 60% feature overlap (Jaccard similarity ≥ 0.6) with its cluster representative. In other words, the number of SAE features active in both the member and the representative is at least 60% of the number of unique features active in either. Each cluster is automatically labeled by a language model, which reads the functional annotations of cluster members and generates a concise 2–5 word description of the shared theme (e.g., “Zinc finger proteins,” “Mitochondrial transporters”).
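The general technique (MinHash signatures, LSH banding to find candidates, exact Jaccard verification, greedy assignment to a representative) can be sketched as below. This is a toy, in-memory illustration rather than the Atlas implementation: the hash family, the band and row counts, the rule for choosing representatives, and the example data are all assumptions, and each protein is assumed to be a non-empty set of integer feature IDs.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # modulus for the toy hash family (a * x + b) % PRIME


def make_hashers(num_hashes, seed=0):
    """Draw (a, b) pairs defining num_hashes random hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(features, hashers):
    """Per hash function, keep the minimum hash over the (non-empty) feature set."""
    return tuple(min((a * f + b) % PRIME for f in features) for a, b in hashers)


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def candidate_neighbors(signatures, bands, rows):
    """LSH banding: proteins that agree on every row of some band become candidates."""
    buckets = defaultdict(set)
    for protein_id, sig in signatures.items():
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(protein_id)
    neighbors = defaultdict(set)
    for members in buckets.values():
        for protein_id in members:
            neighbors[protein_id] |= members - {protein_id}
    return neighbors


def greedy_cluster(feature_sets, threshold=0.6, bands=32, rows=2):
    """Return {protein_id: representative_id}."""
    hashers = make_hashers(bands * rows)
    signatures = {p: minhash_signature(f, hashers) for p, f in feature_sets.items()}
    neighbors = candidate_neighbors(signatures, bands, rows)
    assignment = {}
    # ASSUMPTION: proteins with more active features are tried as representatives first.
    for rep in sorted(feature_sets, key=lambda p: (-len(feature_sets[p]), p)):
        if rep in assignment:
            continue
        assignment[rep] = rep
        for other in neighbors[rep]:
            # Candidates are only accepted after exact Jaccard verification.
            if other not in assignment and jaccard(feature_sets[rep], feature_sets[other]) >= threshold:
                assignment[other] = rep
    return assignment


feature_sets = {
    "P1": {1, 2, 3, 4, 5},
    "P2": {1, 2, 3, 4, 6},  # Jaccard with P1 is 4/6, about 0.67
    "P3": {10, 11, 12},     # shares nothing with the others
    "P4": {1, 2, 3, 4, 5},  # identical to P1
}
assignment = greedy_cluster(feature_sets)
```

Because LSH is probabilistic, a pair above the threshold can occasionally be missed at the candidate stage, but the exact Jaccard check means that no member is ever assigned to a representative it overlaps with by less than the threshold.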
What can I learn from a cluster?
The Cluster Report for each cluster includes:
Number of proteins: how many proteins belong to this cluster globally
Top PFAM domains: known domain annotations found in cluster members, giving a sense of what’s characterized
Taxonomy distribution: which organisms and lineages are represented, useful for understanding evolutionary conservation
Top SAE features: the biological signals most strongly shared across cluster members
What does “partially characterized” mean?
Characterized: the protein matches a known Pfam domain with a defined function
Partially characterized: the protein doesn’t match a characterized domain itself, but shares a cluster with proteins that do, suggesting it may share functional characteristics via convergent evolution
Uncharacterized: the protein belongs to a cluster in which no member has a known functional annotation
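The three categories above amount to a simple decision rule, sketched here. The function name and inputs are hypothetical, and "matches a known Pfam domain with a defined function" is reduced to a single boolean per protein for illustration.

```python
def characterization_status(protein_id, cluster_members, has_pfam_match):
    """Classify a protein from its own annotation and that of its cluster members.

    has_pfam_match maps protein ID -> True if the protein matches a
    characterized Pfam domain (a simplifying assumption for this sketch).
    """
    if has_pfam_match.get(protein_id, False):
        return "characterized"
    if any(has_pfam_match.get(member, False) for member in cluster_members):
        return "partially characterized"
    return "uncharacterized"


# Hypothetical clusters: A is annotated, B shares A's cluster, C is alone.
has_pfam_match = {"A": True, "B": False, "C": False}
cluster_1 = ["A", "B"]
cluster_2 = ["C"]

status_a = characterization_status("A", cluster_1, has_pfam_match)
status_b = characterization_status("B", cluster_1, has_pfam_match)
status_c = characterization_status("C", cluster_2, has_pfam_match)
```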
Interpreting Results
pLDDT (per-residue confidence score)
pLDDT stands for predicted Local Distance Difference Test. It’s a per-residue confidence score that tells you how confident ESMFold2 is in the predicted position of each amino acid in the 3D structure. Scores range from 0 to 100:
| Score | Interpretation |
|---|---|
| 90–100 | Very high confidence; backbone position is highly reliable |
| 70–90 | Confident; good for most structural analyses |
| 50–70 | Low confidence; treat with caution; may reflect genuine disorder |
| < 50 | Very low confidence; region is likely intrinsically disordered |
In the structure viewer, residues are colored by pLDDT. High-confidence regions are shown in blue; low-confidence regions shade toward yellow and orange. Low pLDDT doesn’t always mean the prediction is wrong: many disordered regions are functionally important and are simply not expected to fold into a fixed structure.
pTM (predicted TM-score)
pTM is a single global confidence score for the entire predicted structure, ranging from 0 to 1. A pTM above ~0.5 is generally considered a confident prediction. It reflects the model’s estimate of how well the predicted structure would superimpose with the true structure if it were known.
Mean pLDDT
The average pLDDT across all residues in the protein, giving a quick overall sense of structural confidence. Proteins with very low mean pLDDT are likely intrinsically disordered across most of their length.
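As a small worked example, the sketch below buckets per-residue pLDDT scores into the confidence bands from the pLDDT table and computes the mean pLDDT. The ranges in that table share their endpoints (for example, 90 appears in two rows), so this example assumes a boundary score belongs to the higher band; the function names and the scores are hypothetical.

```python
from collections import Counter


def plddt_band(score):
    """Map one per-residue pLDDT score (0-100) to a confidence band."""
    # ASSUMPTION: boundary values (90, 70, 50) are assigned to the higher band.
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"


def summarize_plddt(scores):
    """Return (mean pLDDT, number of residues per band) for a non-empty list."""
    mean = sum(scores) / len(scores)
    return mean, Counter(plddt_band(s) for s in scores)


# Hypothetical per-residue scores for a five-residue peptide.
scores = [95.0, 88.0, 72.0, 55.0, 40.0]
mean_plddt, band_counts = summarize_plddt(scores)
```

Looking at the band counts alongside the mean helps distinguish a protein that is uniformly mediocre from one with a well-predicted core and a disordered tail.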
SAE Feature Activation Score
Activation scores reflect how strongly a given feature was detected in the protein. Higher scores mean the biological signal associated with that feature is more prominent. Raw numeric ranges aren’t directly comparable across different features, so focus on which features are activated and their relative ranking within your protein rather than on comparing raw scores across features or proteins.
Atlas Data
The full Atlas data is openly available for download.
Explorable dataset (~1.1 billion proteins). All source databases are concatenated and deduplicated, then clustered at 70% sequence identity to reduce redundancy. The clustered dataset is the input to ESMC, and the resulting embeddings are used to compute SAE features. ESMFold2 is used to predict 3D structures.
Full dataset (~6.5 billion proteins). All source databases are concatenated and deduplicated without clustering. ESMC is used to compute SAE features across the full set.
Data sources
| Database | Dataset | Source of sequences |
|---|---|---|
| | | Public nucleic acid databases |
| | | Nonredundant protein sequences from microbial isolate genomes |
| | | Viral genomes, cultivated and uncultivated |
| | | MAG-binned prokaryotic sequences |
| | | MAG-binned eukaryotic sequences |
| | | Metagenomic sequences from various sources |
| | | Gut microbiomes |
| | | Metagenomic sequences from various sources |
What data is available for download?
The Atlas dataset is available for download from AWS S3 at no cost. There are several download options depending on what you need:
| Download | Description | S3 Link | Size |
|---|---|---|---|
| Clusters and metadata | The clustered, explorer-scale subset of cluster representative proteins | | 7.5 M |
| Structures only | Predicted 3D structures for the 1.1 billion proteins in the clustered dataset, generated by ESMFold2 | | 1 B |
| SAE features per residue | Full residue-level SAE activations | | 6 B |
| SAE features per protein | One SAE feature vector per protein, aggregated across residues | | 6 B |
| Protein sequences | | | 6 B |
| HMM results | | | 6 B |
Guardrails
The Atlas has multiple guardrails that detect and restrict queries related to controlled pathogens and toxins. If you query the Atlas agent using keywords (such as a name, accession, or function) or sequences corresponding to these, you will encounter our guardrails and the agent will refuse to continue. If this occurs, refresh the Atlas and begin a new conversation; otherwise, the prior refusal may affect how the agent answers future queries.
We recognize that there are many legitimate reasons to use AI models to understand and model these sequences and proteins. If you are a researcher whose work is impacted by our guardrails, you can request elevated access to our platform here. Elevated access has no additional costs.
If you have feedback on how our guardrails function, please share it with us by filling out our Feedback Form.
Next Steps
Learn how to analyze your own data using ESMC and ESMC Fold. A great place to start is our tutorials page.