pip install esm@git+https://github.com/Biohub/esm.git@main
ESMC
ESMC is the latest in the ESM family of protein language models, establishing a new frontier in representation learning for protein biology. Trained on billions of evolutionary sequences, it learns representations that reflect a mechanistic reduction of protein structure and function.
Get Started
Quickstart Guide
Install the esm Python package
Create an API key
Connect to the Biohub Platform API
from esm.sdk.forge import ESMCForgeInferenceClient
client = ESMCForgeInferenceClient(model="esmc-6b-2024-12", url="https://biohub.ai", token="<your API token>")
Run your inference
Model Tutorials
Explore All Tutorials
Embedding sequences with ESMC
Embed protein sequences and explore how different transformer layers encode structural and functional information.
Zero-shot entropy and mutation analysis
Compute per-position entropy and log-likelihood ratios to identify constrained vs. mutation-tolerant sites.
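As a rough illustration of the two quantities named in this tutorial, the sketch below computes Shannon entropy and a mutant-versus-wild-type log-likelihood ratio from per-position logits. It is a minimal, standard-library example under stated assumptions: the logits are toy values rather than ESMC output, the 20-letter alphabet ordering is arbitrary, and the helper names (`softmax`, `position_entropy`, `log_likelihood_ratio`) are hypothetical and not part of the esm package.

```python
import math

# Assumed alphabet ordering for this sketch; a real tokenizer may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def softmax(logits):
    """Convert one position's raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def position_entropy(probs):
    """Shannon entropy in nats; low values suggest a constrained site."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def log_likelihood_ratio(probs, wild_type, mutant):
    """log p(mutant) - log p(wild type); negative means the mutation is disfavored."""
    p_mut = probs[AMINO_ACIDS.index(mutant)]
    p_wt = probs[AMINO_ACIDS.index(wild_type)]
    return math.log(p_mut) - math.log(p_wt)

# Toy logits for two positions (hypothetical values, not model output).
tolerant = softmax([0.0] * 20)
constrained = softmax([8.0 if aa == "W" else 0.0 for aa in AMINO_ACIDS])

entropy_tolerant = position_entropy(tolerant)        # uniform: log(20) nats
entropy_constrained = position_entropy(constrained)  # sharply peaked on W
llr_w_to_a = log_likelihood_ratio(constrained, "W", "A")
```

In this toy case the uniform position has the maximum possible entropy, while the position dominated by tryptophan has low entropy and a strongly negative ratio for the W to A substitution, which is the pattern one would read as a constrained site.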
Layer sweep for enzyme function classification
Sweep every layer to find which one yields the best embeddings for a downstream task, using enzyme function classification as the example.
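The idea of a layer sweep can be sketched without any model at all: score each layer's embeddings with the same simple classifier and keep the layer with the highest score. The example below is an assumption-laden illustration, not the tutorial's method. It uses toy one-dimensional "embeddings", made-up class labels, and a nearest-centroid classifier with leave-one-out evaluation as a stand-in for whatever probe the tutorial actually trains. All function names are hypothetical.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_accuracy(embeddings, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (query, truth) in enumerate(zip(embeddings, labels)):
        by_class = {}
        for j, (vec, lab) in enumerate(zip(embeddings, labels)):
            if j != i:  # hold out the query itself
                by_class.setdefault(lab, []).append(vec)
        predicted = min(
            by_class, key=lambda lab: sq_dist(query, centroid(by_class[lab]))
        )
        correct += predicted == truth
    return correct / len(labels)

def sweep_layers(per_layer_embeddings, labels):
    """Score every layer and return (best_layer, scores_by_layer)."""
    scores = {
        layer: loo_accuracy(embs, labels)
        for layer, embs in per_layer_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy data: six proteins, two hypothetical enzyme classes, two "layers".
labels = ["EC1", "EC1", "EC1", "EC2", "EC2", "EC2"]
per_layer = {
    0: [[0.0], [0.0], [10.0], [10.0], [10.0], [0.0]],  # classes overlap
    1: [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]],     # classes separate cleanly
}
best_layer, scores = sweep_layers(per_layer, labels)
```

With real embeddings, each entry of `per_layer` would hold one pooled vector per protein for that layer, and a stronger probe such as logistic regression with a held-out split would usually replace the nearest-centroid scorer.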
Understanding proteins with SAE features
Extract and visualize sparse autoencoder features, rank by peak activation, and map activations onto 3D structures.
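Ranking features by peak activation amounts to taking, for each feature, its maximum over residues and sorting on that value. The sketch below shows this on a toy activation matrix. The values are hypothetical, the residues-by-features layout is an assumption about how per-residue SAE activations might be stored, and `rank_features_by_peak` is an illustrative helper rather than an esm function.

```python
def rank_features_by_peak(activations):
    """Rank SAE features by their peak per-residue activation.

    activations: one row per residue, one column per feature.
    Returns (feature_index, peak_value, residue_index) tuples,
    sorted by peak value in descending order.
    """
    ranked = []
    for feature_idx, column in enumerate(zip(*activations)):
        # Index of the residue where this feature fires most strongly.
        peak_residue = max(range(len(column)), key=column.__getitem__)
        ranked.append((feature_idx, column[peak_residue], peak_residue))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Toy per-residue activations: 4 residues x 3 features (hypothetical values).
toy_activations = [
    [0.0, 0.2, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.3, 0.0],
    [0.0, 0.0, 0.5],
]
top_features = rank_features_by_peak(toy_activations)
```

The residue index returned for each feature is what one would carry forward when coloring a 3D structure by activation, for example by writing per-residue values into a structure file's B-factor column.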
Model Details
Model Card
Version
2026-04
Architecture
Transformer
Supported Modalities
Sequence
Training Data
Up to 6 billion proteins
Intended Use
ESMC is designed for protein science research, including structure prediction, function annotation, protein design, and understanding evolutionary relationships between proteins. It can generate novel proteins given partial sequence, structure, or functional constraints.
Limitations & Risks
Outputs should be validated experimentally. The model may generate proteins that are not synthesizable or functional. Not intended for clinical or therapeutic applications without further validation.
Explore the Model
ESM Atlas Data
| Dataset | Description | Size | CLI Command |
|---|---|---|---|
| Sequences | Protein sequences (6.8B proteins) | 2.2 TB | |
| Structures | Protein structures (1B proteins) | 68.9 TB | |
| SAE features | Per-protein and per-residue feature vectors (6.8B proteins) | 306 TB | |
| SAE Clusters | Cluster-level organization based on SAE features (7.5M clusters) | 26 GB | |
| HMM Results | Predicted Pfam and taxonomy (6.8B proteins) | 653 MB | |
| Protein_to_accession | Mapping of protein IDs to accession numbers (6.8B proteins) | 162 GB | |
| Normalization | SAE feature normalization | 192 KB | |
| All Data | Complete set of sequences, structures, features, and clusters | 377 TB | |