embpy is a Python toolkit for generating biological embeddings with one unified API.
Use it to embed genes, proteins, small molecules, morphology perturbations, and single cells; annotate the resulting objects; and compare embeddings with scverse-friendly plotting and analysis utilities.
- Embeds biological entities through BioEmbedder.embed(...).
- Resolves biological identifiers into model-ready inputs, such as gene sequences, protein sequences, SMILES strings, and morphology images.
- Returns AnnData, tables, or payloads with provenance and canonical IDs.
- Stores generated embeddings outside .X, using .obsm, .varm, or .uns according to the entity type.
- Adds real metadata annotations for genes, proteins, molecules, and cell lines.
- Provides plotting and comparison helpers for embedding quality checks.
Pixi is recommended for development and GPU work:
pixi install -e default
pixi run -e default verify
For a pip install:
pip install embpy
For optional GPU/model extras, see the technical guide.
from embpy import BioEmbedder
embedder = BioEmbedder(device="auto", organism="human")
Embed genes with multiple model families:
gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)
gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()
Embed gene perturbation labels as row-aligned action embeddings:
# pert_adata.obs["perturbation"] contains symbols such as TP53/MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)
pert_adata.obsm["X_pert_esm2_650M"]
Embed proteins:
protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)
Embed small molecules:
smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]
molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)
Embed cells from AnnData with model-aware preprocessing:
cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)
cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]
Annotate and plot:
from embpy import tl, pl
molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)
pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)
The tutorials are organized by biological entity:
Each notebook uses real BioEmbedder.embed(...) calls, real annotation APIs,
and embpy plotting/comparison utilities.
embpy supports models across:
- DNA and regulatory sequence models
- protein language and structure models
- small-molecule fingerprints and chemical language models
- single-cell foundation models and classical baselines
- morphology models for HPA and JUMP-style images
- text models for biological descriptions
Use:
embedder.list_available_models()
for the model keys available in your environment.
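Because the available keys depend on which optional extras are installed, it can help to check a requested model list against them before calling embed. The sketch below assumes that list_available_models() returns an iterable of string keys, which this README does not confirm. split_by_availability is a hypothetical helper, not part of embpy, and the key list is hard-coded so the snippet runs without embpy installed.

```python
from typing import Iterable


def split_by_availability(
    requested: Iterable[str], available: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Split requested model keys into (usable, missing), keeping request order.

    Hypothetical helper for illustration; not part of embpy.
    """
    available_set = set(available)
    usable: list[str] = []
    missing: list[str] = []
    for key in requested:
        if key in available_set:
            usable.append(key)
        else:
            missing.append(key)
    return usable, missing


# Stand-in for embedder.list_available_models(); assumed to yield string keys.
available_keys = ["esm2_8M", "morgan_fp", "pca"]

usable, missing = split_by_availability(
    ["esm2_8M", "esm2_650M", "morgan_fp"], available_keys
)
print("usable:", usable)
print("missing:", missing)
```

Only the usable keys would then be passed as the model argument, and the missing ones point to extras that may still need installing.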
BioEmbedder.embed(...) follows a scverse-friendly output contract:
- genes are feature-like and live in .varm by default
- gene perturbation labels use is_perturbation=True and live in .obsm
- proteins are feature-like and live in .varm
- molecules, text, sequences, and cells are observation-like and live in .obsm
- perturbation/action embeddings can be kept entity-aligned in .uns
- .X remains expression/count-like data or a sparse placeholder
See the technical guide for the full contract.
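As a quick way to reason about the contract, the default placement rules listed above can be written as a small lookup. This is only an illustrative sketch: default_slot is a hypothetical function, not embpy's implementation, and the entity-type spellings "text" and "sequence" are assumptions (only "gene", "protein", "molecule", and "cell" appear in the examples here).

```python
# Entity types grouped as the output contract describes them.
FEATURE_LIKE = {"gene", "protein"}
OBSERVATION_LIKE = {"molecule", "text", "sequence", "cell"}  # "text"/"sequence" spellings assumed


def default_slot(entity_type: str, is_perturbation: bool = False) -> str:
    """Return the AnnData slot name where embeddings land by default.

    Hypothetical illustration of the README's contract; not embpy code.
    """
    # Gene perturbation labels are row-aligned, so they go to .obsm.
    if entity_type == "gene" and is_perturbation:
        return "obsm"
    if entity_type in FEATURE_LIKE:
        return "varm"
    if entity_type in OBSERVATION_LIKE:
        return "obsm"
    raise ValueError(f"unknown entity type: {entity_type!r}")


for entity, pert in [("gene", False), ("gene", True), ("protein", False), ("cell", False)]:
    print(entity, pert, "->", default_slot(entity, is_perturbation=pert))
```

The sketch covers only the default placement; entity-aligned perturbation embeddings kept in .uns are left out because the README does not say when that path is chosen.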
- API reference: per-function reference generated from docstrings
- Technical guide: output contract, install matrix, package layout, and developer notes
- Contributing
- Changelog
If you use embpy in your work, please cite the repository for now. A formal citation will be added when the package is released.
For questions, issues, or feature requests, open a GitHub issue or contact the maintainers listed in the package metadata.
