embpy

embpy is a Python toolkit for generating biological embeddings with one unified API.

Use it to embed genes, proteins, small molecules, morphology perturbations, and single cells; annotate the resulting objects; and compare embeddings with scverse-friendly plotting and analysis utilities.

What embpy Does

Embeds biological entities through BioEmbedder.embed(...).
Resolves biological identifiers into model-ready inputs, such as gene sequences, protein sequences, SMILES strings, and morphology images.
Returns AnnData, tables, or payloads with provenance and canonical IDs.
Stores generated embeddings outside .X, using .obsm, .varm, or .uns according to the entity type.
Adds real metadata annotations for genes, proteins, molecules, and cell lines.
Provides plotting and comparison helpers for embedding quality checks.
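
As a rough illustration of the kind of check a comparison helper might perform, the sketch below computes cosine similarity between toy vectors in plain Python. The function name `cosine_similarity` and the vectors are illustrative assumptions; they are not part of the embpy API, whose comparison utilities are documented in the API reference.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings produced by the same model.
vec_a = [1.0, 0.0, 2.0]
vec_b = [2.0, 0.0, 4.0]  # same direction as vec_a, different length
vec_c = [0.0, 3.0, 0.0]  # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
```

Cosine similarity ignores vector magnitude, so it is only meaningful between vectors of the same dimensionality, typically from the same model.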

Install

Pixi is recommended for development and GPU work:

pixi install -e default
pixi run -e default verify

For a pip install:

pip install embpy

For optional GPU/model extras, see the technical guide.

Quick Start

from embpy import BioEmbedder

embedder = BioEmbedder(device="auto", organism="human")

Embed genes with multiple model families:

gene_adata = embedder.embed(
    ["TP53", "EGFR", "MYC"],
    entity_type="gene",
    id_type="symbol",
    model=["hyenadna_tiny_1k", "esm2_8M", "minilm_l6_v2"],
    output="anndata",
)

gene_adata.varm.keys()
gene_adata.uns["embeddings"].keys()

Embed gene perturbation labels as row-aligned action embeddings:

# pert_adata.obs["perturbation"] contains symbols such as TP53/MYC.
pert_adata = embedder.embed(
    pert_adata,
    entity_type="gene",
    obs_column="perturbation",
    id_type="symbol",
    model="esm2_650M",
    output="anndata",
    is_perturbation=True,
    key="X_pert_esm2_650M",
)

pert_adata.obsm["X_pert_esm2_650M"]
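
To make "row-aligned" concrete, the sketch below shows the general idea in plain Python: each unique label has one embedding, and every row receives the embedding of its label. The helper `align_to_rows` is hypothetical, and the zero-vector fallback for unknown labels is an assumption of this sketch; the README does not state how embpy treats labels it cannot resolve.

```python
def align_to_rows(labels, entity_embeddings, dim):
    """Return one embedding per row, looked up by that row's label.

    Hypothetical helper for illustration; not part of embpy.
    """
    rows = []
    for label in labels:
        # Assumption: labels without an embedding (e.g. controls) get zeros.
        vector = entity_embeddings.get(label, [0.0] * dim)
        rows.append(list(vector))
    return rows


# One toy embedding per unique gene symbol.
entity_embeddings = {"TP53": [0.1, 0.2], "MYC": [0.3, 0.4]}

# One label per cell (row), as it would appear in an obs column.
labels = ["TP53", "MYC", "TP53", "control"]

row_aligned = align_to_rows(labels, entity_embeddings, dim=2)
```

The result has as many rows as there are labels, which is the shape an `.obsm` entry requires.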

Embed proteins:

protein_adata = embedder.embed(
    ["TP53", "EGFR", "BRCA1"],
    entity_type="protein",
    id_type="symbol",
    model="esm2_8M",
    output="anndata",
)

Embed small molecules:

smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
]

molecule_adata = embedder.embed(
    smiles,
    entity_type="molecule",
    id_type="smiles",
    model="morgan_fp",
    output="anndata",
    key="X_morgan_fp",
)

Embed cells from an existing AnnData object (adata below) with model-aware preprocessing:

cell_adata = embedder.embed(
    adata,
    entity_type="cell",
    model="pca",
    preprocessing="auto",
    output="anndata",
    key="X_pca",
)

cell_adata.obsm["X_pca"]
cell_adata.uns["embpy_cell_embeddings"]

Annotate and plot:

from embpy import tl, pl

molecule_adata.obs["smiles"] = molecule_adata.obs_names
molecule_adata = tl.annotate_molecules(
    molecule_adata,
    column="smiles",
    sources=["structural", "bioactivity", "ontology"],
)

pl.plot_embedding_space(
    molecule_adata,
    obsm_key="X_morgan_fp",
    method="pca",
    color="mol_logp",
)

Tutorials

The tutorials are organized by biological entity.

Each notebook uses real BioEmbedder.embed(...) calls, real annotation APIs, and embpy plotting/comparison utilities.

Model Families

embpy supports models across:

DNA and regulatory sequence models
protein language and structure models
small-molecule fingerprints and chemical language models
single-cell foundation models and classical baselines
morphology models for HPA and JUMP-style images
text models for biological descriptions

Use:

embedder.list_available_models()

for the model keys available in your environment.

Output Contract

BioEmbedder.embed(...) follows a scverse-friendly output contract:

genes are feature-like and live in .varm by default
gene perturbation labels use is_perturbation=True and live in .obsm
proteins are feature-like and live in .varm
molecules, text, sequences, and cells are observation-like and live in .obsm
perturbation/action embeddings can be kept entity-aligned in .uns
.X remains expression/count-like data or a sparse placeholder

See the technical guide for the full contract.
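
The default placement rules listed above can be summarized as a small lookup. This is only a reading aid transcribed from that list: `DEFAULT_SLOT` and `expected_slot` are hypothetical names, the `"text"` and `"sequence"` entity-type strings are assumed spellings, and the optional `.uns` placement is left out.

```python
# Default AnnData slot per entity type, transcribed from the list above.
# The "text" and "sequence" keys are assumed spellings.
DEFAULT_SLOT = {
    "gene": "varm",
    "protein": "varm",
    "molecule": "obsm",
    "text": "obsm",
    "sequence": "obsm",
    "cell": "obsm",
}


def expected_slot(entity_type, is_perturbation=False):
    """Return the slot where an embedding is expected by default.

    Hypothetical helper for illustration; not part of embpy.
    """
    if entity_type not in DEFAULT_SLOT:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # Gene perturbation labels are row-aligned, so they move to .obsm.
    if entity_type == "gene" and is_perturbation:
        return "obsm"
    return DEFAULT_SLOT[entity_type]


gene_slot = expected_slot("gene")
pert_slot = expected_slot("gene", is_perturbation=True)
cell_slot = expected_slot("cell")
```

A lookup like this can help when writing downstream code that needs to find an embedding without knowing in advance which call produced it.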

Documentation

API reference: per-function reference generated from docstrings
Technical guide: output contract, install matrix, package layout, and developer notes
Contributing
Changelog

Citation

If you use embpy in your work, please cite the repository for now. A formal citation will be added when the package is released.

Contact

For questions, issues, or feature requests, open a GitHub issue or contact the maintainers listed in the package metadata.
