Example: Sequence Similarity Search

Find proteins in the Atlas that are similar to a query amino acid sequence. Results are ranked by the cosine similarity of their SAE feature embeddings.

Endpoint: GET /esm/protein/api/v1alpha1/similarity-search

Parameters

Parameter	Type	Default	Description
`sequence`	string	required	Amino acid sequence (max 800 residues)
`topk_results`	int	10	Number of similar proteins to return (max 100)
`topk_features`	int	20	Number of common SAE features to summarize across results (max 100)
`min_similarity`	float	—	Drop results below this cosine similarity
`cluster_pct_characterized_max`	int	—	Restrict to clusters whose Pfam-characterized fraction is at or below this percentage
`include_cluster_info`	bool	false	Include cluster size and human-readable cluster name on each result
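
The documented limits can be checked on the client before a request is sent. The sketch below is one way to do that. `build_params` and its constants are hypothetical names, not part of the API, and the lower bound of 1 on the two `topk_*` values is an assumption, since only the maximums are documented.

```python
MAX_SEQUENCE_LENGTH = 800  # documented limit on query length
MAX_TOPK = 100             # documented limit for topk_results and topk_features


def build_params(sequence, topk_results=10, topk_features=20,
                 min_similarity=None, cluster_pct_characterized_max=None,
                 include_cluster_info=False):
    """Hypothetical helper: validate arguments and return a query-parameter dict."""
    if not sequence:
        raise ValueError("sequence is required")
    if len(sequence) > MAX_SEQUENCE_LENGTH:
        raise ValueError(
            f"sequence has {len(sequence)} residues; the maximum is {MAX_SEQUENCE_LENGTH}"
        )
    # The lower bound of 1 is an assumption; only the maximum is documented.
    for name, value in (("topk_results", topk_results), ("topk_features", topk_features)):
        if not 1 <= value <= MAX_TOPK:
            raise ValueError(f"{name} must be between 1 and {MAX_TOPK}")

    params = {
        "sequence": sequence,
        "topk_results": topk_results,
        "topk_features": topk_features,
        # Spelled as in the curl example, independent of how the HTTP client
        # would serialize a Python bool.
        "include_cluster_info": "true" if include_cluster_info else "false",
    }
    # The optional filters have no default, so send them only when set.
    if min_similarity is not None:
        params["min_similarity"] = min_similarity
    if cluster_pct_characterized_max is not None:
        params["cluster_pct_characterized_max"] = cluster_pct_characterized_max
    return params
```

The returned dict can be passed as the `params` argument of an HTTP client call. Validating locally only saves a round trip; the server remains the authority on what it accepts.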

curl

curl "https://biohub.ai/esm/protein/api/v1alpha1/similarity-search?sequence=FVNQHLCGSHLVEALYLVCGERGFFYTPKT&topk_results=5&include_cluster_info=true"

Python

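import httpx

# Request the five most similar proteins, with cluster metadata on each result.
response = httpx.get(
    "https://biohub.ai/esm/protein/api/v1alpha1/similarity-search",
    params={
        "sequence": "FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
        "topk_results": 5,
        "include_cluster_info": True,
    },
)
response.raise_for_status()  # raise on a 4xx/5xx response instead of parsing an error body
data = response.json()

# Print the accession and similarity score of each hit.
for protein in data["similar_proteins"]:
    print(protein["protein_accession"], protein["similarity_score"])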

Response

{
  "query_sequence": "FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
  "protein_hash": "abc123...",
  "similar_proteins": [
    {
      "protein_hash": "def456...",
      "protein_accession": "uniprotkb:P01308",
      "sequence_length": 110,
      "similarity_score": 0.97,
      "pdb": "REMARK   0 LICENSE\n...",
      "ptm": 0.42,
      "mean_plddt": 0.71,
      "residues_plddt": [0.62, 0.71, 0.84],
      "cluster_size": 12,
      "protein_name": "Insulin"
    }
  ],
  "top_features_across_results": [
    {
      "feature_index": 1234,
      "occurrence_count": 4,
      "min_activation": 0.42,
      "max_activation": 0.91,
      "mean_activation": 0.67
    }
  ],
  "restricted_count": 0
}

`protein_hash` at the top level is populated only when the query sequence is already in the Atlas. `cluster_size` and `protein_name` are populated only when `include_cluster_info=true`. `restricted_count` reports how many otherwise-similar proteins were withheld by the biosecurity filter.
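
Because several fields are populated only under certain conditions, client code should read them defensively. The sketch below is one possible approach. `summarize_response` is a hypothetical helper, and it assumes that an unpopulated field is either missing or null, which the description above does not specify.

```python
def summarize_response(data):
    """Hypothetical helper: reduce a similarity-search response to a small summary dict."""
    hits = []
    for protein in data.get("similar_proteins", []):
        hits.append({
            "accession": protein["protein_accession"],
            "score": protein["similarity_score"],
            # Cluster fields are None unless include_cluster_info=true was sent.
            "cluster_size": protein.get("cluster_size"),
            "name": protein.get("protein_name"),
        })

    # Order the shared SAE features by how many results they occur in.
    features = sorted(
        data.get("top_features_across_results", []),
        key=lambda feature: feature["occurrence_count"],
        reverse=True,
    )

    return {
        # A populated top-level protein_hash means the query is already in the Atlas.
        "query_in_atlas": bool(data.get("protein_hash")),
        "hits": hits,
        "feature_indices": [feature["feature_index"] for feature in features],
        # Number of matches withheld by the biosecurity filter.
        "restricted_count": data.get("restricted_count", 0),
    }
```

Calling `summarize_response(response.json())` on a parsed response yields a dict that is safe to use whether or not cluster information was requested. A nonzero `restricted_count` signals that the hit list is incomplete by design.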