cdm-data-loaders

Repo for CDM input data loading and wrangling

Environment and python management

The data loader utils package uses uv for python environment and package management. See the installation instructions to set up uv on your system.

Installation

The CDM data loaders run on python 3.13 and above.

Most python code can be run using the command

> uv run <path_to_file.py>

This will automatically launch a virtual environment and install all required dependencies.

To manually set up the virtual environment and install dependencies (including python), run

> uv sync

To activate a virtual environment with these dependencies installed, run

> uv venv
# you will now be prompted to activate the virtual environment
> source .venv/bin/activate

If you are using IDEs like VSCode, they should pick up the creation of the new environment and offer it for executing python code.

Lakehouse and Use with Jupyter notebooks

cdm-data-loaders can be installed on platforms like the KBase Lakehouse using the same installation steps:

cd cdm-data-loaders
uv sync
source .venv/bin/activate

To use the library in a Jupyter notebook, it must be registered as a Jupyter kernel. After performing the three steps above, run the following commands:

uv pip install -e .
uv pip install ipykernel
uv run python -m ipykernel install --user --name cdm-data-loaders --display-name "cdm-data-loaders"

The cdm-data-loaders kernel should now be available from the dropdown list of kernels in the Jupyter notebook interface.

Jupyter Kernel Environment Variables

If you would like to include environment variables for your kernel that are not present in your default environment, you can add them to the kernel.json for your new kernel (e.g., in cdm-data-loaders/.venv/share/jupyter/kernels/python3/kernel.json):

{
  "argv": ["..."],
  "display_name": "cdm-data-loaders",
  "language": "python",
  "env": {
    "MY_CUSTOM_VAR": "...",
    ...
  }
}
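As a rough illustration, the env section can also be edited programmatically rather than by hand. The sketch below is not part of cdm-data-loaders: add_kernel_env is a hypothetical helper, and it assumes the kernel.json file already exists at the path you pass in.

```python
import json
from pathlib import Path


def add_kernel_env(kernel_json: Path, extra_env: dict[str, str]) -> dict:
    """Merge extra_env into the "env" section of a Jupyter kernel.json file.

    Hypothetical helper for illustration; not provided by cdm-data-loaders.
    """
    spec = json.loads(kernel_json.read_text())
    # Keep variables that are already defined; new values win on key collisions.
    spec["env"] = {**spec.get("env", {}), **extra_env}
    kernel_json.write_text(json.dumps(spec, indent=2) + "\n")
    return spec
```

For example, add_kernel_env(Path("path/to/kernel.json"), {"MY_CUSTOM_VAR": "value"}) leaves argv, display_name, and language untouched and only updates env. Restart the kernel afterwards so the new variables are picked up.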

Running import pipelines

The repo provides a Docker container that can be used to run several import pipelines or to run unit tests for the repo. The entrypoint script parses the container run arguments and launches the appropriate functions.

Current endpoints include:

  • test: run the unit tests that do not require external dependencies like Spark
  • uniprot: run the UniProtKB (UniProt protein database) import pipeline; see the UniProtKB pipeline for arguments
  • uniref: run the UniRef import pipeline; see the UniRef pipeline for arguments

Development

Spark and other non-python dependencies

Some parts of this codebase rely on having a Spark instance available. Spark dependencies are pulled in by the berdl-notebook-utils package from BERDataLakehouse/spark_notebook, and the Docker container generated by the same repo should be used for development and testing to mimic the container where code will be run.

Pull the docker image:

> docker pull ghcr.io/berdatalakehouse/spark_notebook:main

Mount the current directory at /tmp/cdm and run the tests:

> docker run --rm -e NB_USER=runner -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main /bin/bash /tmp/cdm/scripts/run_tests.sh

Run the container interactively as the user runner; current directory is mounted at /tmp/cdm:

> docker run --rm -e NB_USER=runner -it -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main

This will launch a bash shell; the contents of the cdm-data-loaders directory are mounted at /tmp/cdm.

Run the container and sleep:

> docker run --rm -e NB_USER=runner -it -v .:/tmp/cdm ghcr.io/berdatalakehouse/spark_notebook:main sleep 100000000

The sleep command will run the container for long enough that you can then connect to it via Docker Desktop or the VSCode Containers extension.

See the BERDataLakehouse/spark_notebook repo for more information on the container and for a full docker-compose set up to mimic the BER Data Lakehouse container infrastructure.

Tests

Tests are categorised using pytest markers so that developers can run some or all of the tests. See pyproject.toml for the markers used.

To run all tests (requires a running Spark instance), execute the command:

> uv run pytest

To run only tests that do not require Spark, run

> uv run pytest -m "not requires_spark"

To generate coverage for the tests, run

> uv run pytest --cov=src --cov-report=xml

Coverage is collected with the standard python coverage package. To generate html or another report format, change the --cov-report value (for example, --cov-report=html).

Integration tests (MinIO + NCBI FTP)

End-to-end integration tests for the NCBI assembly pipeline live in tests/integration/. They exercise the full flow — manifest diffing, FTP download, S3 promote/archive — against a locally running MinIO container and the real NCBI FTP server.

Requirements:

  • Docker (for MinIO)
  • Network access to ftp.ncbi.nlm.nih.gov

Running with Docker Compose (easiest)

The docker-compose.yml at the repo root defines both a MinIO service and the integration test runner. To build the image, start MinIO, and run the integration tests in one command:

docker compose up --build --abort-on-container-exit

Compose will stream test output to the terminal and exit with the pytest exit code. To clean up afterwards:

docker compose down --volumes

Running manually

If you prefer to run the tests directly against a local MinIO instance (e.g. for faster iteration during development), follow the steps below.

1. Start MinIO locally:

docker run -d \
  --name minio \
  -p 9000:9000 \
  -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio:RELEASE.2025-02-28T09-55-16Z server /data --console-address ":9001"

2. Run the integration tests:

> uv run pytest tests/integration/ -m integration -v

Tests are automatically skipped when MinIO is not reachable, so the default uv run pytest will never fail due to a missing MinIO instance.

3. Inspect results:

Buckets are not cleaned up after tests. Browse the MinIO console at http://localhost:9001 (login: minioadmin / minioadmin) to inspect the final state of each test bucket. Each test method creates its own bucket (e.g. integ-test-promote-dry-run).

4. Stop MinIO when done:

docker stop minio && docker rm minio

Note: These tests download real assemblies from NCBI FTP and are inherently slow (~30–60s per assembly). They are also marked slow_test so you can exclude them independently: uv run pytest -m "not slow_test".
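The automatic skip described in step 2 depends on checking whether MinIO is reachable before the tests run. The repo's actual check is not shown here; the sketch below is one minimal way to do it, assuming MinIO listens on localhost:9000 as in the docker run command above. The function name minio_reachable is hypothetical.

```python
import socket


def minio_reachable(host: str = "localhost", port: int = 9000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    Hypothetical helper for illustration; the repo may detect MinIO differently.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result like this could be passed to pytest.mark.skipif, e.g. skipif(not minio_reachable(), reason="MinIO not reachable"), so that the integration tests are reported as skipped rather than failed. A plain TCP check only shows that something is listening on the port, not that it is a healthy MinIO server.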

Loading genomes, contigs, and features

The genome loader can be used to load and integrate data from related GFF and FASTA files. Currently, the loader requires a GFF file and two FASTA files (one for amino acid sequences, one for nucleic acid sequences) for each genome. The list of files to be processed should be specified in the genome paths file, a JSON file keyed by genome identifier, which has the following format:

{
    "FW305-3-2-15-C-TSA1.1": {
        "fna": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_scaffolds.fna",
        "gff": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_genes.gff",
        "protein": "tests/data/FW305-3-2-15-C-TSA1/FW305-3-2-15-C-TSA1_genes.faa"
    },
    "FW305-C-112.1": {
        "fna": "tests/data/FW305-C-112.1/FW305-C-112.1_scaffolds.fna",
        "gff": "tests/data/FW305-C-112.1/FW305-C-112.1_genes.gff",
        "protein": "tests/data/FW305-C-112.1/FW305-C-112.1_genes.faa"
    }
}
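Before running the loader, it can be useful to confirm that every genome entry lists all three file types. The following is a minimal sketch, not the loader's own validation: load_genome_paths and REQUIRED_KEYS are hypothetical names, and the check covers only the presence of the fna, gff, and protein keys shown above, not whether the files exist.

```python
import json
from pathlib import Path

# Keys taken from the genome paths file format shown above.
REQUIRED_KEYS = ("fna", "gff", "protein")


def load_genome_paths(paths_file: Path) -> dict[str, dict[str, str]]:
    """Load a genome paths file and check each genome lists all three file types.

    Hypothetical helper for illustration; not part of the genome loader.
    """
    genomes = json.loads(Path(paths_file).read_text())
    problems = []
    for genome_id, files in genomes.items():
        # Treat an absent key and an empty path the same way.
        missing = [key for key in REQUIRED_KEYS if not files.get(key)]
        if missing:
            problems.append(f"{genome_id}: missing {', '.join(missing)}")
    if problems:
        raise ValueError("Invalid genome paths file:\n" + "\n".join(problems))
    return genomes
```

Collecting all problems before raising means a single run reports every incomplete genome entry, which is convenient when the paths file is generated by a script.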

Running bbmap stats and checkm2 on genome or contigset files

run_tools.sh runs the stats script from bbmap and checkm2 on files with the suffix "fna". These tools can be installed using conda:

conda env create -f env.yml
conda activate genome_loader_env
# download the checkm2 database
checkm2 database --download

Run the stats and checkm2 tools with the following command:

bash scripts/run_tools.sh path/to/genome_paths_file.json output_dir

where path/to/genome_paths_file.json specifies the path to the genome paths file (format specified above) and output_dir is the directory for the results.
