Example: Retrieve Protein and Cluster Details

Walk from a single protein’s hash to its full metadata, structure, and cluster context.

See also

Starting from an amino acid sequence? Use Sequence Similarity Search first — it accepts a sequence and returns the hashes of similar Atlas proteins. The steps below pick up from one of those hashes.

Note

Atlas API endpoints are keyed by protein_hash, the 32-character hexadecimal MD5 digest of the protein’s amino acid sequence. The supported way to obtain a hash is Sequence Similarity Search. If you already know that the exact sequence is in the Atlas, you can also derive the hash locally:

import hashlib

# MD5 of the sequence bytes, rendered as 32 lowercase hex characters
protein_hash = hashlib.md5(b"FVNQHLCGSHLVEALYLVCGERGFFYTPKT").hexdigest()
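
The local derivation can also be wrapped in a small helper that checks its input before hashing. This is a minimal sketch: the name `sequence_to_hash` is hypothetical, and stripping whitespace and upper-casing the sequence are assumptions about how sequences are normalized before hashing, not documented Atlas behavior.

```python
import hashlib


def sequence_to_hash(sequence: str) -> str:
    """Return the 32-character hex MD5 digest of an amino acid sequence."""
    # Assumption: the hash is taken over the bare, upper-case sequence
    # with all whitespace (spaces, newlines) removed.
    cleaned = "".join(sequence.split()).upper()
    if not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError("sequence must be a non-empty string of ASCII letters")
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


protein_hash = sequence_to_hash("FVNQHLCGSHLVEALYLVCGERGFFYTPKT")
print(len(protein_hash))  # 32
```

A locally derived hash only resolves if that exact sequence is in the Atlas, so treat a failed lookup as a cue to fall back to Sequence Similarity Search.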

Once you have a hash, follow the steps below to fetch protein details and walk to the protein’s cluster.

Step 1 — Get protein details

Endpoint: GET /esm/protein/api/v1alpha1/proteins/{protein_hash}

Parameters

Parameter	Type	Default	Description
`topk_features`	int	10	Number of top SAE features to return (max 100)
`normalize_features`	bool	true	Scale activations by per-feature idf/max for ranking
`feature_indices`	int[]	—	Return values for these feature indices instead of the top-K ranking
`fold_on_miss`	bool	true	If the protein has no stored structure, fold it on demand via ESMFold2
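
The parameters above can be assembled and checked before a request is sent. The sketch below is illustrative only: `build_protein_params` is a hypothetical helper, the example feature indices are arbitrary, and two encoding choices are assumptions rather than documented behavior (booleans sent as lowercase strings, and `feature_indices` sent as a repeated query key, which is how httpx serializes a list value).

```python
def build_protein_params(topk_features=10, normalize_features=True,
                         feature_indices=None, fold_on_miss=True):
    """Build a query-parameter dict for the protein details endpoint."""
    if topk_features > 100:
        raise ValueError("topk_features cannot exceed 100")
    params = {
        "topk_features": topk_features,
        # Assumption: the API accepts lowercase "true"/"false".
        "normalize_features": str(normalize_features).lower(),
        "fold_on_miss": str(fold_on_miss).lower(),
    }
    if feature_indices:
        # Assumption: int[] is passed as a repeated key
        # (feature_indices=12&feature_indices=407).
        params["feature_indices"] = [int(i) for i in feature_indices]
    return params


params = build_protein_params(topk_features=5, feature_indices=[12, 407])
print(params)
```

The resulting dict can be passed as `params=` to `httpx.get`, in the same way as the request shown in this step.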

curl

curl "https://biohub.ai/esm/protein/api/v1alpha1/proteins/<protein_hash>?topk_features=5"

Python

import httpx

protein_hash = "<protein_hash>"  # 32-char hex MD5

resp = httpx.get(
    f"https://biohub.ai/esm/protein/api/v1alpha1/proteins/{protein_hash}",
    params={"topk_features": 5},
)
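resp.raise_for_status()  # fail early on a 4xx/5xx instead of parsing an error body
protein = resp.json()

print(protein["accession"], protein["sequence_length"])
cluster_rep_hash = protein["cluster_rep_protein_hash"]  # used in Step 2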

Response (abbreviated)

{
  "protein_hash": "def456...",
  "accession": "uniprotkb:P01308",
  "source": "uniprotkb",
  "sequence": "MALWMRLLPLL...",
  "sequence_length": 110,
  "ptm": 0.42,
  "mean_plddt": 0.71,
  "residues_plddt": [0.62, 0.71, 0.84],
  "pdb": "REMARK   0 LICENSE\n...",
  "sae_features": [...],
  "protein_activations": {...},
  "per_residue_activations": {...},
  "cluster_rep_protein_hash": "abc123...",
  "folded_on_demand": false
}

The cluster_rep_protein_hash field holds the hash of the representative protein of this protein’s cluster. Pass it to the clusters endpoint in Step 2 to retrieve the cluster’s details and member list.
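
The structure-related fields of the protein response can be inspected locally. The following sketch uses the abbreviated response values as sample data, with the truncated hash replaced by a placeholder. The helper names, the 0.7 cutoff, and the reading of `residues_plddt` as one score per residue in sequence order are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def summarize_structure(protein: dict, plddt_cutoff: float = 0.7) -> dict:
    """Collect confidence information from a protein details response."""
    scores = protein.get("residues_plddt") or []
    # Assumption: scores are ordered by residue; positions are 0-based.
    low = [i for i, score in enumerate(scores) if score < plddt_cutoff]
    return {
        "mean_plddt": protein.get("mean_plddt"),
        "n_scored": len(scores),
        "low_confidence_positions": low,
        "folded_on_demand": protein.get("folded_on_demand", False),
    }


def save_pdb(protein: dict, directory: Path) -> Path:
    """Write the PDB text from the response to <protein_hash>.pdb."""
    path = directory / f"{protein['protein_hash']}.pdb"
    path.write_text(protein["pdb"])
    return path


# Sample data modeled on the abbreviated response; the hash is a placeholder.
protein = {
    "protein_hash": "def456",
    "mean_plddt": 0.71,
    "residues_plddt": [0.62, 0.71, 0.84],
    "pdb": "REMARK   0 LICENSE\n",
    "folded_on_demand": False,
}

summary = summarize_structure(protein)
print(summary["low_confidence_positions"])  # [0]
pdb_path = save_pdb(protein, Path(tempfile.mkdtemp()))
```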

Step 2 — Get the cluster

Endpoint: GET /esm/protein/api/v1alpha1/clusters/{cluster_rep_protein_hash}

Parameters

Parameter	Type	Default	Description
`topk_features`	int	10	Number of top SAE features for the representative (max 100)

curl

curl "https://biohub.ai/esm/protein/api/v1alpha1/clusters/<cluster_rep_protein_hash>"

Python

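# Continues from Step 1: reuses httpx and cluster_rep_hash
resp = httpx.get(
    f"https://biohub.ai/esm/protein/api/v1alpha1/clusters/{cluster_rep_hash}",
)
resp.raise_for_status()
cluster = resp.json()

print(cluster["cluster_size"], "members")
print(cluster["cluster_taxonomy_info"])  # lowest common ancestor (LCA) rank and name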
for hash_ in cluster["member_protein_hashes"][:5]:
    print(hash_)

Response (abbreviated)

{
  "protein_hash": "abc123...",
  "protein_name": "Insulin",
  "source": "uniprotkb",
  "accession": "uniprotkb:P01308",
  "cluster_size": 12,
  "cluster_pct_characterized": 92,
  "cluster_mean_domain_coverage": 0.71,
  "member_protein_hashes": ["abc123...", "def456...", ...],
  "cluster_top_pfam_domains": {
    "PF00049": {"count": 11, "name": "Insulin"}
  },
  "cluster_representative_features": [...],
  "cluster_taxonomy_info": {"rank": "family", "name": "Hominidae"},
  "top_phyla": {"Chordata": 11, "Arthropoda": 1},
  "mean_plddt": 0.71,
  "ptm": 0.42
}
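
The count-valued fields in the cluster response lend themselves to a quick summary. This sketch works on the sample values shown above; `summarize_cluster` is a hypothetical helper, and it assumes that the counts in `top_phyla` and `cluster_top_pfam_domains` are numbers of cluster members, so that dividing by `cluster_size` gives a fraction.

```python
def summarize_cluster(cluster: dict) -> dict:
    """Report the dominant phylum and Pfam domain of a cluster."""
    size = cluster["cluster_size"]
    phyla = cluster.get("top_phyla") or {}
    pfam = cluster.get("cluster_top_pfam_domains") or {}

    top_phylum = max(phyla, key=phyla.get) if phyla else None
    top_pfam = max(pfam, key=lambda acc: pfam[acc]["count"]) if pfam else None

    return {
        "size": size,
        "top_phylum": top_phylum,
        # Assumption: counts are member counts, so count / size is a fraction.
        "top_phylum_fraction": phyla[top_phylum] / size if top_phylum and size else None,
        "top_pfam": top_pfam,
        "top_pfam_name": pfam[top_pfam]["name"] if top_pfam else None,
        "top_pfam_fraction": pfam[top_pfam]["count"] / size if top_pfam and size else None,
    }


# Sample values copied from the abbreviated response.
cluster = {
    "cluster_size": 12,
    "cluster_top_pfam_domains": {"PF00049": {"count": 11, "name": "Insulin"}},
    "top_phyla": {"Chordata": 11, "Arthropoda": 1},
}

summary = summarize_cluster(cluster)
print(summary["top_phylum"], summary["top_pfam"])
```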

Step 3 (optional) — Inspect another cluster member

Loop back to Step 1 with any hash from member_protein_hashes to fetch the full details of another protein in the same cluster.

for member_hash in cluster["member_protein_hashes"][:3]:
    resp = httpx.get(
        f"https://biohub.ai/esm/protein/api/v1alpha1/proteins/{member_hash}",
        params={"topk_features": 5},
    )
    resp.raise_for_status()
    print(resp.json()["accession"])
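
The three steps can be combined into one function. The sketch below takes the HTTP call as an injected `fetch` function so that the walk can be exercised offline. `walk_cluster`, `fake_fetch`, and the placeholder hashes and accessions are all hypothetical; only the endpoint paths and field names come from this page.

```python
from typing import Callable

BASE_PATH = "/esm/protein/api/v1alpha1"


def walk_cluster(fetch: Callable[[str], dict], protein_hash: str,
                 max_members: int = 3) -> dict:
    """Fetch a protein, its cluster, and the accessions of a few members."""
    protein = fetch(f"{BASE_PATH}/proteins/{protein_hash}")
    rep_hash = protein["cluster_rep_protein_hash"]
    cluster = fetch(f"{BASE_PATH}/clusters/{rep_hash}")

    accessions = {}
    for member_hash in cluster["member_protein_hashes"][:max_members]:
        if member_hash == protein_hash:
            # Reuse the record already fetched instead of requesting it again.
            accessions[member_hash] = protein["accession"]
        else:
            member = fetch(f"{BASE_PATH}/proteins/{member_hash}")
            accessions[member_hash] = member["accession"]

    return {
        "cluster_rep_protein_hash": rep_hash,
        "cluster_size": cluster["cluster_size"],
        "member_accessions": accessions,
    }


# Offline stand-in for the API; hashes and accessions are made-up placeholders.
FAKE_RESPONSES = {
    f"{BASE_PATH}/proteins/hash-b": {
        "accession": "example:B", "cluster_rep_protein_hash": "hash-a"},
    f"{BASE_PATH}/clusters/hash-a": {
        "cluster_size": 2, "member_protein_hashes": ["hash-a", "hash-b"]},
    f"{BASE_PATH}/proteins/hash-a": {
        "accession": "example:A", "cluster_rep_protein_hash": "hash-a"},
}
calls = []


def fake_fetch(path: str) -> dict:
    calls.append(path)
    return FAKE_RESPONSES[path]


result = walk_cluster(fake_fetch, "hash-b")
print(result["member_accessions"])
```

In real use, `fetch` could be a small function that requests `"https://biohub.ai" + path` with httpx, checks the status, and returns the parsed JSON. Keeping it injectable also makes it straightforward to add retries or caching later without touching the walk itself.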