Example: Retrieve Protein and Cluster Details
Walk from a single protein’s hash to its full metadata, structure, and cluster context.
See also
Starting from an amino acid sequence? Use Sequence Similarity Search first — it accepts a sequence and returns the hashes of similar Atlas proteins. The steps below pick up from one of those hashes.
Note
Atlas API endpoints are keyed by a 32-character MD5 hash of the protein’s
amino acid sequence (protein_hash). The supported way to obtain a hash is
via Sequence Similarity Search, which accepts a sequence and
returns the hashes of similar Atlas proteins. If you already know the exact
sequence is in the Atlas, you can also derive the hash locally:
import hashlib
protein_hash = hashlib.md5(b"FVNQHLCGSHLVEALYLVCGERGFFYTPKT").hexdigest()
Once you have a hash, follow the steps below to fetch protein details and walk to the protein’s cluster.
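The local derivation above can be wrapped in a small helper. This is a minimal sketch, not part of the Atlas API: the function names are hypothetical, and the normalization step (removing whitespace and upper-casing) is an assumption that should be checked against how the Atlas hashes its sequences.

```python
import hashlib
import re

# 32 lowercase hex characters, the shape of an MD5 hexdigest.
_HASH_RE = re.compile(r"[0-9a-f]{32}")


def sequence_to_hash(sequence: str) -> str:
    """Return the MD5 hex digest of an amino acid sequence.

    Assumption: whitespace is removed and letters are upper-cased before
    hashing. Confirm the Atlas normalization before relying on this.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def looks_like_protein_hash(value: str) -> bool:
    """Check only the format of a hash, not whether the Atlas contains it."""
    return _HASH_RE.fullmatch(value) is not None


protein_hash = sequence_to_hash("FVNQHLCGSHLVEALYLVCGERGFFYTPKT")
print(protein_hash, looks_like_protein_hash(protein_hash))
```

A well-formed hash can still be absent from the Atlas, so the format check only catches mistakes such as passing a raw sequence or an unfilled placeholder.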
Step 1 — Get protein details
Endpoint: GET /esm/protein/api/v1alpha1/proteins/{protein_hash}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| topk_features | int | 10 | Number of top SAE features to return (max 100) |
|  | bool | true | Scale activations by per-feature idf/max for ranking |
|  | int[] | — | Return values for these feature indices instead of the top-K ranking |
|  | bool | true | If the protein has no stored structure, fold it on demand via ESMFold2 |
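One way to catch bad input before sending a request is to build the Step 1 URL through a small validating helper. The sketch below is illustrative only: the helper name is hypothetical, it covers only the topk_features parameter, and both the lower bound of 1 and the lowercase-hex requirement are assumptions rather than documented behavior.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://biohub.ai/esm/protein/api/v1alpha1"
MAX_TOPK = 100  # documented maximum for topk_features


def protein_details_url(protein_hash: str, topk_features: int = 10) -> str:
    """Build the Step 1 URL, rejecting bad input before any request is sent."""
    # Assumption: hashes are lowercase hex, as produced by hashlib's hexdigest().
    if not re.fullmatch(r"[0-9a-f]{32}", protein_hash):
        raise ValueError("protein_hash must be 32 lowercase hex characters")
    # The lower bound of 1 is an assumption; only the maximum is documented.
    if not 1 <= topk_features <= MAX_TOPK:
        raise ValueError(f"topk_features must be between 1 and {MAX_TOPK}")
    query = urlencode({"topk_features": topk_features})
    return f"{BASE_URL}/proteins/{protein_hash}?{query}"


print(protein_details_url("0" * 32, topk_features=5))
```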
curl
curl "https://biohub.ai/esm/protein/api/v1alpha1/proteins/<protein_hash>?topk_features=5"
Python
import httpx
protein_hash = "<protein_hash>" # 32-char hex MD5
resp = httpx.get(
    f"https://biohub.ai/esm/protein/api/v1alpha1/proteins/{protein_hash}",
    params={"topk_features": 5},
)
protein = resp.json()
print(protein["accession"], protein["sequence_length"])
cluster_rep_hash = protein["cluster_rep_protein_hash"]
Response (abbreviated)
{
  "protein_hash": "def456...",
  "accession": "uniprotkb:P01308",
  "source": "uniprotkb",
  "sequence": "MALWMRLLPLL...",
  "sequence_length": 110,
  "ptm": 0.42,
  "mean_plddt": 0.71,
  "residues_plddt": [0.62, 0.71, 0.84],
  "pdb": "REMARK 0 LICENSE\n...",
  "sae_features": [...],
  "protein_activations": {...},
  "per_residue_activations": {...},
  "cluster_rep_protein_hash": "abc123...",
  "folded_on_demand": false
}
The cluster_rep_protein_hash field is the hash of the protein representing
this protein’s cluster. Pass it to the clusters endpoint in Step 2 to get
the rest of the cluster.
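As a rough illustration of reading the Step 1 response, the sketch below reduces a parsed response to a few quick checks. It rests on assumptions not stated above: the function name and the 0.7 cutoff are hypothetical, pLDDT is taken to be on the 0 to 1 scale shown in the sample, and a cluster representative is assumed to list its own hash in cluster_rep_protein_hash.

```python
def summarize_protein(protein: dict, plddt_cutoff: float = 0.7) -> dict:
    """Reduce a Step 1 response to a few quick checks."""
    plddt = protein.get("residues_plddt", [])
    # Zero-based positions whose per-residue confidence is below the cutoff.
    low = [i for i, score in enumerate(plddt) if score < plddt_cutoff]
    return {
        # Assumption: a representative points at itself.
        "is_cluster_representative": (
            protein["protein_hash"] == protein["cluster_rep_protein_hash"]
        ),
        "low_confidence_residues": low,
        "folded_on_demand": protein.get("folded_on_demand", False),
    }


# Trimmed copy of the abbreviated response shown above.
example = {
    "protein_hash": "def456...",
    "cluster_rep_protein_hash": "abc123...",
    "residues_plddt": [0.62, 0.71, 0.84],
    "folded_on_demand": False,
}
print(summarize_protein(example))
```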
Step 2 — Get the cluster
Endpoint: GET /esm/protein/api/v1alpha1/clusters/{cluster_rep_protein_hash}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
|  | int | 10 | Number of top SAE features for the representative (max 100) |
curl
curl "https://biohub.ai/esm/protein/api/v1alpha1/clusters/<cluster_rep_protein_hash>"
Python
resp = httpx.get(
f"https://biohub.ai/esm/protein/api/v1alpha1/clusters/{cluster_rep_hash}",
)
cluster = resp.json()
print(cluster["cluster_size"], "members")
print(cluster["cluster_taxonomy_info"]) # LCA rank + name
for hash_ in cluster["member_protein_hashes"][:5]:
    print(hash_)
Response (abbreviated)
{
  "protein_hash": "abc123...",
  "protein_name": "Insulin",
  "source": "uniprotkb",
  "accession": "uniprotkb:P01308",
  "cluster_size": 12,
  "cluster_pct_characterized": 92,
  "cluster_mean_domain_coverage": 0.71,
  "member_protein_hashes": ["abc123...", "def456...", ...],
  "cluster_top_pfam_domains": {
    "PF00049": {"count": 11, "name": "Insulin"}
  },
  "cluster_representative_features": [...],
  "cluster_taxonomy_info": {"rank": "family", "name": "Hominidae"},
  "top_phyla": {"Chordata": 11, "Arthropoda": 1},
  "mean_plddt": 0.71,
  "ptm": 0.42
}
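The counts in a cluster response can be turned into fractions of the cluster for a quick read on how uniform it is. The following sketch assumes that the counts in top_phyla and cluster_top_pfam_domains are numbers of member proteins, so dividing by cluster_size gives a fraction. The function name is hypothetical and the input is a trimmed copy of the sample response.

```python
def summarize_cluster(cluster: dict) -> dict:
    """Express phylum and Pfam counts as fractions of the cluster."""
    size = cluster["cluster_size"]
    if size <= 0:
        raise ValueError("cluster_size must be positive")
    phyla = cluster.get("top_phyla", {})
    # Phylum with the highest member count, or None if none are listed.
    dominant = max(phyla, key=phyla.get) if phyla else None
    pfam_fraction = {
        pfam_id: info["count"] / size
        for pfam_id, info in cluster.get("cluster_top_pfam_domains", {}).items()
    }
    return {
        "dominant_phylum": dominant,
        "dominant_fraction": phyla[dominant] / size if dominant else None,
        "pfam_fraction": pfam_fraction,
    }


# Trimmed copy of the abbreviated response shown above.
example_cluster = {
    "cluster_size": 12,
    "top_phyla": {"Chordata": 11, "Arthropoda": 1},
    "cluster_top_pfam_domains": {"PF00049": {"count": 11, "name": "Insulin"}},
}
print(summarize_cluster(example_cluster))
```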
Step 3 (optional) — Inspect another cluster member
Loop back to Step 1 with any hash from member_protein_hashes to fetch the
full details of another protein in the same cluster.
for member_hash in cluster["member_protein_hashes"][:3]:
    resp = httpx.get(
        f"https://biohub.ai/esm/protein/api/v1alpha1/proteins/{member_hash}",
        params={"topk_features": 5},
    )
    print(resp.json()["accession"])
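Because member_protein_hashes in the sample response includes the representative and the protein from Step 1, a loop like the one above may refetch records you already have. The sketch below shows one possible way to skip known hashes and cap the number of requests. The helper name and the fake fetch function are hypothetical stand-ins so that the example runs offline.

```python
from typing import Callable, Collection, Iterable


def fetch_members(
    member_hashes: Iterable[str],
    fetch: Callable[[str], dict],
    skip: Collection[str] = (),
    limit: int = 3,
) -> dict:
    """Fetch up to `limit` members, skipping hashes already retrieved."""
    results = {}
    for member_hash in member_hashes:
        if len(results) >= limit:
            break
        if member_hash in skip:
            continue
        results[member_hash] = fetch(member_hash)
    return results


# Stand-in for a real request so the sketch runs without network access.
def fake_fetch(member_hash: str) -> dict:
    return {"protein_hash": member_hash, "accession": f"demo:{member_hash}"}


members = fetch_members(["abc123", "def456", "ghi789"], fake_fetch, skip={"def456"})
print(list(members))
```

In practice, fetch could be a function that wraps the httpx.get call from the loop above and returns resp.json().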