
Example: Sequence Similarity Search

Find proteins in the Atlas similar to a query amino acid sequence, ranked by SAE feature embedding similarity.

Endpoint: GET /esm/protein/api/v1alpha1/similarity-search

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| sequence | string | required | Amino acid sequence (max 800 residues) |
| topk_results | int | 10 | Number of similar proteins to return (max 100) |
| topk_features | int | 20 | Number of common SAE features to summarize across results (max 100) |
| min_similarity | float | none | Drop results below this cosine similarity |
| cluster_pct_characterized_max | int | none | Restrict to clusters whose Pfam-characterized fraction is at or below this percentage |
| include_cluster_info | bool | false | Include cluster size and human-readable cluster name on each result |
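The documented limits can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_similarity_url` and the constant names are hypothetical, and serializing booleans as lowercase `true`/`false` is an assumption based on the curl example. It uses only the standard library and does not contact the server.

```python
from urllib.parse import urlencode

BASE_URL = "https://biohub.ai/esm/protein/api/v1alpha1/similarity-search"
MAX_SEQUENCE_LENGTH = 800  # documented limit on `sequence`
MAX_TOPK = 100             # documented limit on topk_results and topk_features


def build_similarity_url(sequence, topk_results=10, topk_features=20,
                         min_similarity=None,
                         cluster_pct_characterized_max=None,
                         include_cluster_info=False):
    """Hypothetical helper: check documented limits, then build the request URL."""
    if not sequence:
        raise ValueError("sequence is required")
    if len(sequence) > MAX_SEQUENCE_LENGTH:
        raise ValueError(f"sequence exceeds {MAX_SEQUENCE_LENGTH} residues")
    for name, value in (("topk_results", topk_results),
                        ("topk_features", topk_features)):
        if value > MAX_TOPK:
            raise ValueError(f"{name} exceeds {MAX_TOPK}")

    params = {
        "sequence": sequence,
        "topk_results": topk_results,
        "topk_features": topk_features,
        # Assumption: lowercase booleans, as in the curl example.
        "include_cluster_info": "true" if include_cluster_info else "false",
    }
    # Optional filters are sent only when the caller sets them.
    if min_similarity is not None:
        params["min_similarity"] = min_similarity
    if cluster_pct_characterized_max is not None:
        params["cluster_pct_characterized_max"] = cluster_pct_characterized_max
    return f"{BASE_URL}?{urlencode(params)}"


url = build_similarity_url("FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
                           topk_results=5, include_cluster_info=True)
print(url)
```

Leaving the optional filters out of the query string when they are unset lets the server apply its own defaults, rather than guessing at a value it would accept.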

curl

curl "https://biohub.ai/esm/protein/api/v1alpha1/similarity-search?sequence=FVNQHLCGSHLVEALYLVCGERGFFYTPKT&topk_results=5&include_cluster_info=true"

Python

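import httpx

# Query the similarity-search endpoint; httpx URL-encodes the parameters.
response = httpx.get(
    "https://biohub.ai/esm/protein/api/v1alpha1/similarity-search",
    params={
        "sequence": "FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
        "topk_results": 5,
        "include_cluster_info": True,
    },
)
# Raise an exception on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()
data = response.json()

# Print the accession and similarity score of each returned protein.
for protein in data["similar_proteins"]:
    print(protein["protein_accession"], protein["similarity_score"])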

Response

{
  "query_sequence": "FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
  "protein_hash": "abc123...",
  "similar_proteins": [
    {
      "protein_hash": "def456...",
      "protein_accession": "uniprotkb:P01308",
      "sequence_length": 110,
      "similarity_score": 0.97,
      "pdb": "REMARK   0 LICENSE\n...",
      "ptm": 0.42,
      "mean_plddt": 0.71,
      "residues_plddt": [0.62, 0.71, 0.84],
      "cluster_size": 12,
      "protein_name": "Insulin"
    }
  ],
  "top_features_across_results": [
    {
      "feature_index": 1234,
      "occurrence_count": 4,
      "min_activation": 0.42,
      "max_activation": 0.91,
      "mean_activation": 0.67
    }
  ],
  "restricted_count": 0
}

protein_hash at the top level is populated only when the query sequence is already in the Atlas. cluster_size and protein_name are populated only when include_cluster_info=true. restricted_count reports how many otherwise-similar proteins were withheld by the biosecurity filter.
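Because several fields are populated only under the conditions described above, client code should not assume they are present. The sketch below shows one way to read a response defensively. `summarize_results` is a hypothetical helper, the sample data is a trimmed copy of the example response, and treating an unpopulated field as either missing or null is an assumption, since the page does not say which form the API uses.

```python
def summarize_results(data):
    """Hypothetical helper: flatten a similarity-search response, tolerating optional fields."""
    rows = []
    for protein in data.get("similar_proteins", []):
        rows.append({
            "accession": protein["protein_accession"],
            "score": protein["similarity_score"],
            # Populated only when include_cluster_info=true; None otherwise.
            "cluster_size": protein.get("cluster_size"),
            "name": protein.get("protein_name"),
        })
    # Order the shared SAE features by how many results they occur in.
    features = sorted(
        data.get("top_features_across_results", []),
        key=lambda feature: feature["occurrence_count"],
        reverse=True,
    )
    return {
        # The top-level protein_hash is set only if the query is already in the Atlas.
        "query_in_atlas": bool(data.get("protein_hash")),
        "restricted_count": data.get("restricted_count", 0),
        "top_feature_indices": [feature["feature_index"] for feature in features],
        "rows": rows,
    }


# Trimmed copy of the example response shown above.
example = {
    "query_sequence": "FVNQHLCGSHLVEALYLVCGERGFFYTPKT",
    "protein_hash": "abc123...",
    "similar_proteins": [
        {
            "protein_accession": "uniprotkb:P01308",
            "similarity_score": 0.97,
            "cluster_size": 12,
            "protein_name": "Insulin",
        }
    ],
    "top_features_across_results": [
        {"feature_index": 1234, "occurrence_count": 4}
    ],
    "restricted_count": 0,
}

summary = summarize_results(example)
print(summary["rows"])
```

A nonzero `restricted_count` in the summary indicates that the list of results is incomplete by design, which is worth surfacing to users rather than silently ignoring.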


© Copyright 2026, Biohub.
