theislab/embpy

embpy

embpy is a Python toolkit for generating biological embeddings with one unified API.

Use it to embed genes, proteins, small molecules, morphology perturbations, and single cells; annotate the resulting objects; and compare embeddings with scverse-friendly plotting and analysis utilities.

Figure: embpy architecture

What embpy Does

  • Embeds biological entities through BioEmbedder.embed(...).
  • Resolves biological identifiers into model-ready inputs, such as gene sequences, protein sequences, SMILES strings, and morphology images.
  • Returns AnnData, tables, or payloads with provenance and canonical IDs.
  • Stores generated embeddings outside .X, using .obsm, .varm, or .uns according to the entity type.
  • Adds real metadata annotations for genes, proteins, molecules, and cell lines.
  • Provides plotting and comparison helpers for embedding quality checks.

Install

Pixi is recommended for development and GPU work:

pixi install -e default
pixi run -e default verify

For a pip install:

pip install embpy

For optional GPU/model extras, see the technical guide.

Quick Start

from embpy import BioEmbedder

embedder = BioEmbedder(device="auto", organism="human")

Embed genes with multiple model families:

gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)

gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()

Embed gene perturbation labels as row-aligned action embeddings:

# pert_adata.obs["perturbation"] contains symbols such as TP53/MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)

pert_adata.obsm["X_pert_esm2_650M"]
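To make "row-aligned" concrete, here is a minimal plain-Python sketch of the idea: one embedding per unique perturbation label is expanded into one row per observation, in the same order as the labels column. The helper `align_to_rows` is hypothetical and not part of embpy, the toy vectors are made up, and the zero-vector fallback for unknown labels is an assumption of this sketch rather than documented embpy behavior.

```python
# Illustrative only: `align_to_rows` is a hypothetical helper, not embpy API.

def align_to_rows(labels, embedding_by_label, dim):
    """Return one embedding row per label, in the same order as `labels`.

    Labels without an embedding get a zero vector so the row count still
    matches the number of observations (an assumption of this sketch).
    """
    zero = [0.0] * dim
    return [list(embedding_by_label.get(label, zero)) for label in labels]


# Toy 2-dimensional embeddings, one per unique perturbation label.
per_gene = {"TP53": [0.1, 0.9], "MYC": [0.7, 0.2]}

# One label per row, in the spirit of pert_adata.obs["perturbation"].
obs_labels = ["TP53", "MYC", "TP53", "control"]

rows = align_to_rows(obs_labels, per_gene, dim=2)
print(len(rows))  # 4, one row per observation
print(rows[0] == rows[2])  # True, both rows carry the TP53 embedding
```

Because the result has exactly one row per observation, it has the shape an `.obsm` entry needs, which is why perturbation embeddings are stored there rather than in `.varm`.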

Embed proteins:

protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)

Embed small molecules:

smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]

molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)

Embed cells from AnnData with model-aware preprocessing:

# `adata` is an existing AnnData object holding the cells to embed.
cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)

cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]

Annotate and plot:

from embpy import tl, pl

molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)

pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)

Tutorials

The tutorials are organized by biological entity.

Each notebook uses real BioEmbedder.embed(...) calls, real annotation APIs, and embpy plotting/comparison utilities.

Model Families

embpy supports models across:

  • DNA and regulatory sequence models
  • protein language and structure models
  • small-molecule fingerprints and chemical language models
  • single-cell foundation models and classical baselines
  • morphology models for HPA and JUMP-style images
  • text models for biological descriptions

Use:

embedder.list_available_models()

for the model keys available in your environment.
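Since the available keys depend on the environment, it can help to check a requested model list against them before calling `embed`. The sketch below assumes that `list_available_models()` yields plain string keys, which this README does not confirm; the hard-coded `available_models` list is only a stand-in for that call, and `split_by_availability` is a hypothetical helper, not part of embpy.

```python
# Illustrative only: `split_by_availability` is a hypothetical helper.

def split_by_availability(requested, available):
    """Split requested model keys into (usable, missing), preserving order."""
    available_set = set(available)
    usable = [key for key in requested if key in available_set]
    missing = [key for key in requested if key not in available_set]
    return usable, missing


# Stand-in for embedder.list_available_models(); assumed to be string keys.
available_models = ["esm2_8M", "morgan_fp", "pca"]

requested_models = ["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"]

usable, missing = split_by_availability(requested_models, available_models)
print(usable)   # ['esm2_8M']
print(missing)  # ['hyenadna_tiny_1k', 'minilm_l6_v2']
```

Passing only `usable` to `model=` and reporting `missing` separately keeps one unavailable optional model from blocking the rest of a multi-model run.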

Output Contract

BioEmbedder.embed(...) follows a scverse-friendly output contract:

  • genes are feature-like and live in .varm by default
  • gene perturbation labels use is_perturbation=True and live in .obsm
  • proteins are feature-like and live in .varm
  • molecules, text, sequences, and cells are observation-like and live in .obsm
  • perturbation/action embeddings can be kept entity-aligned in .uns
  • .X remains expression/count-like data or a sparse placeholder

See the technical guide for the full contract.
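The default placements listed above can be restated as a small lookup. This is only a reading aid under stated assumptions: `default_slot` is a hypothetical function, not part of embpy, and the strings `"text"` and `"sequence"` are assumed entity-type names that this README does not confirm. The optional entity-aligned `.uns` storage is left out of the sketch.

```python
# Illustrative only: a restatement of the contract, not embpy's implementation.

DEFAULT_SLOT = {
    "gene": "varm",      # feature-like
    "protein": "varm",   # feature-like
    "molecule": "obsm",  # observation-like
    "text": "obsm",      # assumed entity-type name
    "sequence": "obsm",  # assumed entity-type name
    "cell": "obsm",      # observation-like
}


def default_slot(entity_type, is_perturbation=False):
    """Return the AnnData attribute where an embedding lands by default."""
    if entity_type == "gene" and is_perturbation:
        # Gene perturbation labels are row-aligned, so they go to .obsm.
        return "obsm"
    if entity_type not in DEFAULT_SLOT:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return DEFAULT_SLOT[entity_type]


print(default_slot("gene"))                        # varm
print(default_slot("gene", is_perturbation=True))  # obsm
print(default_slot("molecule"))                    # obsm
```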

Documentation

Citation

If you use embpy in your work, please cite the repository for now. A formal citation will be added when the package is released.

Contact

For questions, issues, or feature requests, open a GitHub issue or contact the maintainers listed in the package metadata.
