From 770865d788a668109a73ba64748df939d89a5301 Mon Sep 17 00:00:00 2001
From: "marcin p. joachimiak" <4625870+realmarcin@users.noreply.github.com>
Date: Mon, 11 May 2026 21:40:53 -0700
Subject: [PATCH] Selective per-CURIE NCIT/MESH stub-import transform
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new sibling transform `ontologies_stubs` that imports just the
NCIT and MESH terms referenced by mappings/ — with full label, exact
synonyms, and dbxrefs — without loading the rest of those ontologies
(those belong to the sibling kg-microbe-biomedical pipeline).

Today the chemical-mapping consolidator and the BacDive isolation-source
mapper reference 70 NCIT and 92 MESH IDs as canonical xrefs for
ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The
existing STUB_ONTOLOGY_PREFIXES mechanism in
isolation_source_mapping_utils.py was emitting label-only nodes for
these inline from the BacDive transform, but: (a) the
chemical-mapping-driven NCIT/MESH refs were producing dangling node ids
in the merged KG (no node row, label, or xrefs), and (b) even where
stubs existed they carried only the BacDive object_label, no synonyms,
no xrefs.

This transform fixes both. For each NCIT/MESH CURIE referenced anywhere
under mappings/, OAK queries the local SemSQL DB
(data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs,
then writes a labelled stub node to
data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. The DBs
themselves are never loaded into the merged KG.

Components:
- kg_microbe/utils/stub_curie_collection.py (new): collect_stub_curies()
  scans an explicit list of mapping TSVs (unified SSSOM,
  isolation-source, MIM, canonical/*) and returns per-prefix CURIE sets
  with case normalization. Currently surfaces 70 NCIT + 92 MESH CURIEs.
- kg_microbe/transform_utils/ontologies_stubs/ (new): per-CURIE OAK
  SemSQL fetch following the same pattern as the chemical-mapping
  consolidator's enrich_with_chebi_synonyms (label, entity_aliases,
  entity_metadata_map for dbxrefs). Auto-decompresses .db.gz on first run
  if the unzipped .db is missing. Fails loudly with an actionable message
  if neither is present (no silent dangling-xref fallback).
- kg_microbe/transform.py: registers ONTOLOGIES_STUBS in DATA_SOURCES,
  ordered after OntologiesTransform so the SemSQL DBs are present.
- download.yaml: adds ncit.db.gz and mesh.db.gz from
  s3.amazonaws.com/bbop-sqlite (the standard SemSQL distribution; same
  source the OAK `sqlite:obo:` shim hits).
- merge.yaml + merge.no_metatraits.yaml + merge_bakta.yaml: declare the
  new ontologies_stubs source so the merged KG picks up the stub nodes.
  merge.minimal.yaml unchanged (it skips ontology nodes entirely; not a
  target for stub enrichment).
- kg_microbe/transform_utils/bacdive/bacdive.py: BacDive's inline
  stub-emit at lines 2990-3003 now skips NCIT and mesh prefixes (deferred
  to the new transform). Long-tail prefixes (PRIDE, PCO, GENEPIO, FAO,
  BTO, SNOMED — 1-3 IDs each) keep the inline label-only fallback.
  Build-time prefix validator unchanged.
- kg_microbe/utils/isolation_source_mapping_utils.py:
  STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two
  stub-import paths (SemSQL-enriched for NCIT/mesh, inline label-only for
  the long-tail).

Tests:
- tests/test_stub_curie_collection.py: collector unit tests covering
  CURIE discovery, case normalization, missing files, SSSOM YAML header
  skipping, and a smoke test against the real committed mappings.
- tests/test_ontologies_stubs.py: transform unit tests with an in-memory
  fake adapter — round-trips label/synonyms/xrefs, falls back to CURIE
  when label missing, drops alias-equals-label duplicates, writes
  header-only file when no CURIEs, and raises loudly when neither .db nor
  .db.gz is present.
  Plus an integration test (skipped when stub output absent) that asserts
  every collector-discovered CURIE has a corresponding stub-node row.

Verified: 13 unit tests pass; ruff clean.

End-to-end verification (requires `poetry run kg download` to fetch
ncit.db.gz + mesh.db.gz, ~370 MB total, one-time):

  poetry run kg transform -s ontologies_stubs
  wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv
  poetry run pytest tests/test_ontologies_stubs.py -v  # integration test no longer skipped

Plan: /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 download.yaml                                 |  19 ++
 kg_microbe/transform.py                       |   8 +
 kg_microbe/transform_utils/bacdive/bacdive.py |  16 +-
 kg_microbe/transform_utils/constants.py       |   1 +
 .../ontologies_stubs/__init__.py              |   7 +
 .../ontologies_stubs_transform.py             | 253 ++++++++++++++++++
 .../utils/isolation_source_mapping_utils.py   |  30 ++-
 kg_microbe/utils/stub_curie_collection.py     | 138 ++++++++++
 merge.no_metatraits.yaml                      |   8 +
 merge.yaml                                    |  10 +
 merge_bakta.yaml                              |   8 +
 tests/test_ontologies_stubs.py                | 228 ++++++++++++++++
 tests/test_stub_curie_collection.py           |  93 +++++++
 13 files changed, 812 insertions(+), 7 deletions(-)
 create mode 100644 kg_microbe/transform_utils/ontologies_stubs/__init__.py
 create mode 100644 kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py
 create mode 100644 kg_microbe/utils/stub_curie_collection.py
 create mode 100644 tests/test_ontologies_stubs.py
 create mode 100644 tests/test_stub_curie_collection.py

diff --git a/download.yaml b/download.yaml
index c330fd77..991b2700 100644
--- a/download.yaml
+++ b/download.yaml
@@ -423,3 +423,22 @@
 -
   url: https://raw.githubusercontent.com/biolink/kgx/master/docs/kgx_format.md
   local_name: kgx-format.md
+
+
+#
+# **** Selective stub-import ontologies (NCIT, MESH) ****
+#
+# KG-Microbe does NOT load the full NCIT or MESH ontologies — those belong to
+# kg-microbe-biomedical.
But the chemical-mapping consolidator and BacDive +# isolation-source mapper reference ~150 NCIT/MESH IDs as canonical xrefs for +# ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The +# OntologiesStubsTransform queries these SemSQL DBs to harvest just the +# referenced IDs (label + synonyms + xrefs), emitting one labelled stub node +# each. The DBs themselves are never loaded into the merged KG. +# +- + url: https://s3.amazonaws.com/bbop-sqlite/ncit.db.gz + local_name: ncit.db.gz +- + url: https://s3.amazonaws.com/bbop-sqlite/mesh.db.gz + local_name: mesh.db.gz diff --git a/kg_microbe/transform.py b/kg_microbe/transform.py index e03ebd8d..c1c346be 100644 --- a/kg_microbe/transform.py +++ b/kg_microbe/transform.py @@ -20,6 +20,7 @@ METATRAITS, METATRAITS_GTDB, ONTOLOGIES, + ONTOLOGIES_STUBS, RHEAMAPPINGS, ) from kg_microbe.transform_utils.gtdb.gtdb import GTDBTransform @@ -32,6 +33,9 @@ ONTOLOGIES_MAP, OntologiesTransform, ) +from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import ( + OntologiesStubsTransform, +) from kg_microbe.transform_utils.rhea_mappings.rhea_mappings import RheaMappingsTransform DATA_SOURCES = { @@ -44,6 +48,10 @@ # "ProteinAtlasTransform": ProteinAtlasTransform, # "STRINGTransform": STRINGTransform, ONTOLOGIES: OntologiesTransform, + # Run ontologies_stubs after ontologies so the SemSQL DBs are present and + # so the stub-node TSVs land in data/transformed/ontologies_stubs/ before + # the merge step picks them up. 
+ ONTOLOGIES_STUBS: OntologiesStubsTransform, BACDIVE: BacDiveTransform, BAKTA: BaktaTransform, COG: COGTransform, diff --git a/kg_microbe/transform_utils/bacdive/bacdive.py b/kg_microbe/transform_utils/bacdive/bacdive.py index 7b61e0f1..72407576 100644 --- a/kg_microbe/transform_utils/bacdive/bacdive.py +++ b/kg_microbe/transform_utils/bacdive/bacdive.py @@ -2987,8 +2987,22 @@ def run(self, data_file: Union[Optional[Path], Optional[str]] = None, show_statu # emit a thin node row here instead of pulling in the full # ontology. Loaded-ontology targets (UBERON, ENVO, ...) get # their canonical node from the ontologies transform. + # + # NCIT and mesh stub nodes are NOT emitted here — the + # OntologiesStubsTransform (kg_microbe/transform_utils/ + # ontologies_stubs/) writes label+synonym+xref-enriched + # stubs from the SemSQL DBs, which is strictly richer + # than the label-only fallback below. Emitting both + # here and there would produce duplicate node rows + # that the merge would have to dedupe. The PRIDE/PCO/ + # GENEPIO/FAO/BTO/SNOMED prefixes stay on the inline + # path because each has 1-3 IDs in the whole repo — + # not worth a SemSQL fetch. 
stub_prefix = subject_id.split(":", 1)[0] if ":" in subject_id else "" - if stub_prefix in STUB_ONTOLOGY_PREFIXES: + if stub_prefix in STUB_ONTOLOGY_PREFIXES and stub_prefix not in { + "NCIT", + "mesh", + }: node_writer.writerow( self._create_node_row( subject_id, diff --git a/kg_microbe/transform_utils/constants.py b/kg_microbe/transform_utils/constants.py index 6a2ca1cb..c8f35b89 100644 --- a/kg_microbe/transform_utils/constants.py +++ b/kg_microbe/transform_utils/constants.py @@ -13,6 +13,7 @@ KEGG = "kegg" RHEAMAPPINGS = "rhea_mappings" ONTOLOGIES = "ontologies" +ONTOLOGIES_STUBS = "ontologies_stubs" WALLEN_ETAL = "wallen_etal" CTD = "ctd" DISBIOME = "disbiome" diff --git a/kg_microbe/transform_utils/ontologies_stubs/__init__.py b/kg_microbe/transform_utils/ontologies_stubs/__init__.py new file mode 100644 index 00000000..fc2e8c6d --- /dev/null +++ b/kg_microbe/transform_utils/ontologies_stubs/__init__.py @@ -0,0 +1,7 @@ +"""Ontologies-stubs transform package.""" + +from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import ( + OntologiesStubsTransform, +) + +__all__ = ["OntologiesStubsTransform"] diff --git a/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py b/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py new file mode 100644 index 00000000..48f33f47 --- /dev/null +++ b/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py @@ -0,0 +1,253 @@ +""" +Ontologies-stubs transform. + +KG-Microbe deliberately does NOT load the full NCIT or MESH ontologies — those +belong to the sibling ``kg-microbe-biomedical`` pipeline. But the +chemical-mapping consolidator and the BacDive isolation-source mapper reference +~150 NCIT and MESH IDs as canonical xrefs for ingredients (e.g. +``NCIT:C29298 'Oatmeal'``, ``mesh:D011136 'Tween'``). Without this transform +those CURIEs would appear as dangling node ids in the merged KG: edges point at +them but no node row carries the label. 
+ +This transform: + +1. Calls :func:`~kg_microbe.utils.stub_curie_collection.collect_stub_curies` to + discover every NCIT and MESH CURIE referenced anywhere under ``mappings/``. +2. For each CURIE, queries the local SemSQL DB (``data/raw/ncit.db``, + ``data/raw/mesh.db``) via OAK to fetch its ``rdfs:label``, exact synonyms, + and dbxrefs. The same pattern is used by the chemical-mapping consolidator + for ChEBI in ``scripts/consolidate_chemical_mappings.py``. +3. Writes one KGX node TSV per stub ontology to + ``data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv`` carrying + ``id, category, name, synonym, xref, provided_by, knowledge_source``. + No edges file — stubs are isolated nodes; edges arrive from the source + transforms (BacDive, MediaDive ingredients via the chemical-mapping path, + etc.). + +Note for downstream consumers: if a KG built with this transform is ever +merged with a kg-microbe-biomedical KG that loads NCIT/MESH fully, biolink +merge semantics will union nodes — the stub node here is a strict subset of +what the full ontology would emit (label/synonym/xref only; no edges, no +deprecated flag, no parent classes), so the union will simply pick the +fuller record. +""" + +from __future__ import annotations + +import csv +import gzip +import shutil +from pathlib import Path +from typing import Dict, Iterable, List, Optional, Set + +from kg_microbe.transform_utils.constants import ( + CATEGORY_COLUMN, + DEPRECATED_COLUMN, + DESCRIPTION_COLUMN, + ID_COLUMN, + NAME_COLUMN, + PROVIDED_BY_COLUMN, + SAME_AS_COLUMN, + SYNONYM_COLUMN, + XREF_COLUMN, +) +from kg_microbe.transform_utils.transform import Transform +from kg_microbe.utils.isolation_source_mapping_utils import STUB_ONTOLOGY_CATEGORY +from kg_microbe.utils.stub_curie_collection import collect_stub_curies + +# Stub ontologies handled by this transform. 
Each entry maps the canonical +# CURIE prefix (case-sensitive — must match how the prefix appears in +# existing mapping rows) to the local SemSQL DB and the InforES knowledge +# source string. +STUB_ONTOLOGY_SOURCES: Dict[str, Dict[str, str]] = { + "NCIT": { + "db_filename": "ncit.db", + "knowledge_source": "infores:ncit", + }, + "mesh": { + "db_filename": "mesh.db", + "knowledge_source": "infores:mesh", + }, +} + +ONTOLOGIES_STUBS_SOURCE_NAME = "ontologies_stubs" + + +class OntologiesStubsTransform(Transform): + + """Emit one labelled stub node per referenced NCIT / MESH CURIE.""" + + def __init__( + self, + input_dir: Optional[Path] = None, + output_dir: Optional[Path] = None, + ): + """ + Instantiate transform. + + :param input_dir: Where the SemSQL DBs live (defaults to ``data/raw/``). + :param output_dir: Where ``ontologies_stubs/{ncit,mesh}_nodes.tsv`` are + written (defaults to ``data/transformed/``). + """ + super().__init__(ONTOLOGIES_STUBS_SOURCE_NAME, input_dir, output_dir) + + def run(self, data_file=None) -> None: # noqa: D401 — base class signature + """ + Collect stub CURIEs, fetch metadata via OAK, write per-ontology node TSVs. + + :param data_file: Unused (kept for the base-class signature). The + transform discovers its inputs from the mapping TSVs and the + SemSQL DBs in ``input_base_dir``. 
+ """ + prefixes = list(STUB_ONTOLOGY_SOURCES.keys()) + curies_by_prefix = collect_stub_curies(prefixes) + + for prefix, curies in curies_by_prefix.items(): + cfg = STUB_ONTOLOGY_SOURCES[prefix] + db_path = self.input_base_dir / cfg["db_filename"] + output_file = self.output_dir / f"{prefix.lower()}_nodes.tsv" + self._write_stub_nodes( + prefix=prefix, + curies=sorted(curies), + db_path=db_path, + knowledge_source=cfg["knowledge_source"], + output_file=output_file, + ) + + # ------------------------------------------------------------------ + # internal helpers + # ------------------------------------------------------------------ + + def _write_stub_nodes( + self, + prefix: str, + curies: List[str], + db_path: Path, + knowledge_source: str, + output_file: Path, + ) -> None: + """Fetch label/synonyms/xrefs per CURIE and write the node TSV.""" + if not curies: + print(f" [{prefix}] no CURIEs to import; skipping {output_file.name}") + # Write an empty file with header so the merge step doesn't fail + # on a missing file declared in merge.yaml. + self._write_node_file(output_file, []) + return + + adapter = self._open_adapter(prefix, db_path) + if adapter is None: + raise FileNotFoundError( + f"OAK adapter for {prefix} could not be opened (expected SemSQL DB at " + f"{db_path}). Run `poetry run kg download` to fetch it. The stub " + f"transform refuses to silently emit unlabelled nodes — that would " + f"reintroduce the dangling-xref hazard this transform exists to fix." + ) + + rows: List[List[Optional[str]]] = [] + missing: List[str] = [] + for curie in curies: + label, synonyms, xrefs = self._fetch_metadata(adapter, curie) + if not label: + # Last-resort fallback: use the CURIE as the name. Log it so + # curators can chase down obsolete or missing entries upstream. 
+ missing.append(curie) + label = curie + row = [ + curie, # id + STUB_ONTOLOGY_CATEGORY, # category + label, # name + None, # description + _join_pipe(xrefs), # xref + ONTOLOGIES_STUBS_SOURCE_NAME, # provided_by + _join_pipe(synonyms), # synonym + None, # deprecated + None, # same_as + ] + rows.append(row) + + self._write_node_file(output_file, rows) + print( + f" [{prefix}] wrote {len(rows)} stub nodes to {output_file.name} " + f"(knowledge_source={knowledge_source}, missing labels: {len(missing)})" + ) + if missing: + print(f" [{prefix}] CURIEs with no SemSQL label (used CURIE as name): {missing}") + + def _open_adapter(self, prefix: str, db_path: Path): + """ + Open an OAK SemSQL adapter against the local DB; return None on failure. + + OBO Foundry distributes the SemSQL DBs as ``.db.gz`` and ``download.yaml`` + stores the gzipped form. If the unzipped ``.db`` is missing but a sibling + ``.db.gz`` is present, decompress it once (idempotent) and use the result. + """ + if not db_path.is_file(): + gz_path = db_path.with_suffix(db_path.suffix + ".gz") + if gz_path.is_file(): + print(f" [{prefix}] decompressing {gz_path.name} → {db_path.name}") + with gzip.open(gz_path, "rb") as src, db_path.open("wb") as dst: + shutil.copyfileobj(src, dst) + else: + return None + try: + from oaklib import get_adapter + except ImportError as exc: # pragma: no cover — oaklib is a dep + raise RuntimeError( + f"oaklib import failed while opening SemSQL adapter for {prefix}: {exc}" + ) from exc + return get_adapter(f"sqlite:{db_path}") + + def _fetch_metadata(self, adapter, curie: str): + """Return (label, synonyms_set, xrefs_set) for ``curie`` via the OAK adapter.""" + label = "" + synonyms: Set[str] = set() + xrefs: Set[str] = set() + try: + label = adapter.label(curie) or "" + except Exception: # noqa: S110 — obsolete CURIEs are expected to miss + pass + try: + synonyms = {s for s in adapter.entity_aliases(curie) if s} + except Exception: # noqa: S110 + pass + # Drop the canonical 
label out of the synonym set to keep them disjoint. + synonyms.discard(label) + try: + metadata = adapter.entity_metadata_map(curie) or {} + except Exception: # noqa: S110 + metadata = {} + # OAK returns metadata keyed by short-form predicate. dbxref entries + # land under "oio:hasDbXref" (or "oboInOwl:hasDbXref" on older + # adapters). Accept both. + for predicate_key in ("oio:hasDbXref", "oboInOwl:hasDbXref"): + for value in metadata.get(predicate_key, []) or []: + if value: + xrefs.add(str(value)) + return label, sorted(synonyms), sorted(xrefs) + + def _write_node_file(self, path: Path, rows: Iterable[Iterable[Optional[str]]]) -> None: + """Write ``rows`` to ``path`` using the standard Transform node header.""" + path.parent.mkdir(parents=True, exist_ok=True) + # Use the canonical 9-column node header from the Transform base class. + header = [ + ID_COLUMN, + CATEGORY_COLUMN, + NAME_COLUMN, + DESCRIPTION_COLUMN, + XREF_COLUMN, + PROVIDED_BY_COLUMN, + SYNONYM_COLUMN, + DEPRECATED_COLUMN, + SAME_AS_COLUMN, + ] + with path.open("w", newline="", encoding="utf-8") as fh: + writer = csv.writer(fh, delimiter="\t", lineterminator="\n") + writer.writerow(header) + for row in rows: + writer.writerow(["" if cell is None else cell for cell in row]) + + +def _join_pipe(values: Iterable[str]) -> str: + """Pipe-join a sequence; return ``""`` when empty (matches existing TSV convention).""" + items = [v for v in values if v] + return "|".join(items) if items else "" diff --git a/kg_microbe/utils/isolation_source_mapping_utils.py b/kg_microbe/utils/isolation_source_mapping_utils.py index 30028b3c..8c5aa8c7 100644 --- a/kg_microbe/utils/isolation_source_mapping_utils.py +++ b/kg_microbe/utils/isolation_source_mapping_utils.py @@ -86,12 +86,30 @@ # but that are NOT loaded by the ontologies transform (see ONTOLOGIES_MAP in # kg_microbe/transform_utils/ontologies/ontologies_transform.py). 
Each prefix # either has only a tiny number of distinct IDs in use, or its full load is -# impractical (mesh and NCIT are huge clinical thesauri), so the BacDive -# transform writes a thin node row per resolved CURIE using the object_label -# from the mapping TSV. The category is biolink:OntologyClass for all stubs -# because they're typically categorical terms (host body site, microbial -# community, abscess, etc.) rather than specific anatomy / environmental -# features whose canonical metadata would come from a loaded ontology. +# impractical (mesh and NCIT are huge clinical thesauri). +# +# Two stub-import paths exist for these prefixes: +# +# 1. NCIT and mesh: a SemSQL-backed enriched stub source. The +# OntologiesStubsTransform (kg_microbe/transform_utils/ontologies_stubs/) +# queries data/raw/ncit.db and data/raw/mesh.db via OAK to fetch +# rdfs:label, exact synonyms, and dbxrefs for every NCIT/mesh CURIE that +# appears anywhere under mappings/. Output: +# data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. This is the +# preferred path — stubs carry full metadata, not just a label. The +# BacDive inline emit at bacdive.py defers to this transform for these +# two prefixes (see the `not in {"NCIT", "mesh"}` branch there). +# +# 2. The long-tail prefixes (PRIDE, PCO, GENEPIO, FAO, BTO, SNOMED): each +# has 1-3 IDs in the whole repo, so the BacDive transform writes a thin +# label-only node row inline at edge-emit time using the object_label +# from the mapping TSV. Setting up SemSQL DBs for these would be +# overkill. +# +# The category is biolink:OntologyClass for all stubs because they're +# typically categorical terms (host body site, microbial community, +# abscess, etc.) rather than specific anatomy / environmental features +# whose canonical metadata would come from a loaded ontology. 
# # Codex adversarial review #558 found that without stubs for these prefixes # the BacDive transform was emitting edges to dangling node IDs because the diff --git a/kg_microbe/utils/stub_curie_collection.py b/kg_microbe/utils/stub_curie_collection.py new file mode 100644 index 00000000..0abcfeaf --- /dev/null +++ b/kg_microbe/utils/stub_curie_collection.py @@ -0,0 +1,138 @@ +""" +Collect stub-prefix CURIEs referenced anywhere in the mapping TSVs. + +KG-Microbe deliberately does NOT load the full NCIT or MESH ontologies (those +belong to the sibling kg-microbe-biomedical pipeline), but the chemical-mapping +consolidator and the BacDive isolation-source mapper reference a small handful +of NCIT and MESH IDs. This collector finds every such CURIE so that the +downstream :class:`~kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform.OntologiesStubsTransform` +can fetch a labelled stub node for each one. + +It scans a fixed set of mapping files at the repo root (no glob magic — wrong +edits silently change the import set, so the file list is explicit and +auditable): + +* ``mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz`` — unified + chemical/anatomy/environment mappings (object_id, subject_id columns). +* ``mappings/isolation_source_to_ontology.tsv`` — BacDive isolation-source + mappings (object_id column). +* ``mappings/ingredient_mappings.sssom.tsv`` — vendored MIM SSSOM + (object_id, subject_id). +* ``mappings/canonical/*.tsv`` — chemical/enzyme/pathway/phenotype canonical + exports (object_id). + +Returned dict shape: ``{normalized_prefix: {curie, curie, ...}}`` where +``normalized_prefix`` matches the case used in +:data:`~kg_microbe.utils.isolation_source_mapping_utils.STUB_ONTOLOGY_PREFIXES` +(e.g. ``"NCIT"`` is uppercase, ``"mesh"`` is lowercase). Inputs in any case +are accepted and normalized. 
+""" + +from __future__ import annotations + +import csv +import gzip +import re +from pathlib import Path +from typing import Dict, Iterable, Set + +REPO_ROOT = Path(__file__).resolve().parent.parent.parent + +# Files explicitly scanned for stub CURIEs. Adding a new file here is an +# auditable opt-in change; missing files are silently skipped (so removing a +# mapping source from the repo doesn't break the collector). +DEFAULT_MAPPING_PATHS = ( + REPO_ROOT / "mappings" / "kgmicrobe_unified_entity_mappings.sssom.tsv.gz", + REPO_ROOT / "mappings" / "isolation_source_to_ontology.tsv", + REPO_ROOT / "mappings" / "ingredient_mappings.sssom.tsv", + REPO_ROOT / "mappings" / "canonical" / "chemical_mappings.tsv", + REPO_ROOT / "mappings" / "canonical" / "enzyme_mappings.tsv", + REPO_ROOT / "mappings" / "canonical" / "enzyme_name_to_go.tsv", + REPO_ROOT / "mappings" / "canonical" / "metpo_alias_mappings.tsv", + REPO_ROOT / "mappings" / "canonical" / "pathway_mappings.tsv", + REPO_ROOT / "mappings" / "canonical" / "phenotype_mappings.tsv", + REPO_ROOT / "mappings" / "canonical" / "special_chemical_mappings.tsv", +) + +# Columns to scan for CURIEs across all mapping shapes. Any cell whose value +# parses as ``prefix:local`` and matches one of the requested prefixes is +# collected. +_CURIE_COLUMNS = ( + "object_id", + "subject_id", +) + +_CURIE_RE = re.compile(r"^([A-Za-z][A-Za-z0-9._-]*):([A-Za-z0-9_.\-]+)$") + + +def _open_text(path: Path): + """Open a TSV / TSV.GZ for text reading after stripping any SSSOM YAML header.""" + handle = gzip.open(path, "rt", encoding="utf-8") if path.suffix == ".gz" else path.open( + "r", encoding="utf-8" + ) + # SSSOM files prefix a YAML metadata header with `# `. Skip those before + # handing the file to csv.DictReader.
+ while True: + pos = handle.tell() + line = handle.readline() + if not line: + break + if not line.startswith("#"): + handle.seek(pos) + break + return handle + + +def _normalize_prefix(prefix: str, canonical_prefixes: Dict[str, str]) -> str | None: + """Return the canonical-cased prefix string for ``prefix``, or ``None`` if unknown.""" + return canonical_prefixes.get(prefix.lower()) + + +def collect_stub_curies( + prefixes: Iterable[str], + mapping_paths: Iterable[Path] | None = None, +) -> Dict[str, Set[str]]: + """ + Scan the mapping TSVs and return the set of CURIEs that match each requested prefix. + + :param prefixes: Iterable of CURIE prefixes to collect. Case-insensitive on + input; the returned dict's keys preserve the case as given here, so + callers should pass them in the canonical form they want + (``"NCIT"``, ``"mesh"``, ...). + :param mapping_paths: Override the file list (mainly for tests). Defaults + to :data:`DEFAULT_MAPPING_PATHS`. + :returns: ``{canonical_prefix: {curie, ...}}`` for every prefix in + ``prefixes``, with the empty set as default for prefixes that have no + references in any mapping file. 
+ """ + canonical_prefixes: Dict[str, str] = {p.lower(): p for p in prefixes} + result: Dict[str, Set[str]] = {p: set() for p in canonical_prefixes.values()} + + paths = list(mapping_paths) if mapping_paths is not None else list(DEFAULT_MAPPING_PATHS) + + for path in paths: + if not path.is_file(): + continue + with _open_text(path) as handle: + reader = csv.DictReader(handle, delimiter="\t") + for row in reader: + for col in _CURIE_COLUMNS: + value = (row.get(col) or "").strip() + if not value: + continue + match = _CURIE_RE.match(value) + if not match: + continue + raw_prefix, local = match.group(1), match.group(2) + canonical = _normalize_prefix(raw_prefix, canonical_prefixes) + if canonical is None: + continue + result[canonical].add(f"{canonical}:{local}") + + return result + + +__all__ = [ + "DEFAULT_MAPPING_PATHS", + "collect_stub_curies", +] diff --git a/merge.no_metatraits.yaml b/merge.no_metatraits.yaml index 4ffb9a2e..b49599f7 100644 --- a/merge.no_metatraits.yaml +++ b/merge.no_metatraits.yaml @@ -68,6 +68,14 @@ merged_graph: filename: - data/transformed/ontologies/metpo_nodes.tsv - data/transformed/ontologies/metpo_edges.tsv + # Selective per-CURIE stub-import for NCIT and MESH — see merge.yaml. + ontologies_stubs: + name: "ontologies_stubs" + input: + format: tsv + filename: + - data/transformed/ontologies_stubs/ncit_nodes.tsv + - data/transformed/ontologies_stubs/mesh_nodes.tsv bacdive: name: "bacdive" input: diff --git a/merge.yaml b/merge.yaml index 7876cc4c..ec475cf5 100644 --- a/merge.yaml +++ b/merge.yaml @@ -82,6 +82,16 @@ merged_graph: filename: - data/transformed/ontologies/metpo_nodes.tsv - data/transformed/ontologies/metpo_edges.tsv + # Selective per-CURIE stub-import for NCIT and MESH — only the IDs + # referenced by mappings/* are imported (not the full ontologies). + # See kg_microbe/transform_utils/ontologies_stubs/ for the transform. 
+ ontologies_stubs: + name: "ontologies_stubs" + input: + format: tsv + filename: + - data/transformed/ontologies_stubs/ncit_nodes.tsv + - data/transformed/ontologies_stubs/mesh_nodes.tsv bacdive: name: "bacdive" input: diff --git a/merge_bakta.yaml b/merge_bakta.yaml index c2f4431d..da1020f0 100644 --- a/merge_bakta.yaml +++ b/merge_bakta.yaml @@ -84,6 +84,14 @@ merged_graph: filename: - data/transformed/ontologies/metpo_nodes.tsv - data/transformed/ontologies/metpo_edges.tsv + # Selective per-CURIE stub-import for NCIT and MESH — see merge.yaml. + ontologies_stubs: + name: "ontologies_stubs" + input: + format: tsv + filename: + - data/transformed/ontologies_stubs/ncit_nodes.tsv + - data/transformed/ontologies_stubs/mesh_nodes.tsv bacdive: name: "bacdive" input: diff --git a/tests/test_ontologies_stubs.py b/tests/test_ontologies_stubs.py new file mode 100644 index 00000000..6bd108d3 --- /dev/null +++ b/tests/test_ontologies_stubs.py @@ -0,0 +1,228 @@ +"""Tests for the OntologiesStubsTransform.""" + +from __future__ import annotations + +import csv +from pathlib import Path +from typing import Dict, List, Set + +import pytest + +from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import ( + STUB_ONTOLOGY_SOURCES, + OntologiesStubsTransform, +) +from kg_microbe.utils.isolation_source_mapping_utils import ( + STUB_ONTOLOGY_CATEGORY, + STUB_ONTOLOGY_PREFIXES, +) +from kg_microbe.utils.stub_curie_collection import collect_stub_curies + +REPO_ROOT = Path(__file__).resolve().parents[1] + + +# --------------------------------------------------------------------------- +# In-memory fakes +# --------------------------------------------------------------------------- + + +class _FakeAdapter: + + """Minimal stand-in for an OAK SemSQL adapter — enough for the stub transform.""" + + def __init__(self, store: Dict[str, Dict]): + self._store = store # {curie: {"label": str, "aliases": [...], "xrefs": [...]}} + + def label(self, curie: str): + return 
self._store.get(curie, {}).get("label", "")
+
+    def entity_aliases(self, curie: str):
+        return list(self._store.get(curie, {}).get("aliases", []))
+
+    def entity_metadata_map(self, curie: str):
+        xrefs = list(self._store.get(curie, {}).get("xrefs", []))
+        return {"oio:hasDbXref": xrefs} if xrefs else {}
+
+
+class _StubbedTransform(OntologiesStubsTransform):
+
+    """Subclass that swaps in an in-memory adapter so tests don't touch SemSQL DBs on disk."""
+
+    def __init__(self, *, adapters: Dict[str, _FakeAdapter], curies: Dict[str, Set[str]],
+                 input_dir: Path, output_dir: Path):
+        super().__init__(input_dir=input_dir, output_dir=output_dir)
+        self._fake_adapters = adapters
+        self._fake_curies = curies
+
+    def _open_adapter(self, prefix, db_path):  # noqa: D401 — override
+        return self._fake_adapters.get(prefix)
+
+    def run(self, data_file=None):  # noqa: D401 — override
+        # Bypass collect_stub_curies (we inject a curated set instead).
+        for prefix, curies in self._fake_curies.items():
+            if prefix not in STUB_ONTOLOGY_SOURCES:
+                continue
+            cfg = STUB_ONTOLOGY_SOURCES[prefix]
+            output_file = self.output_dir / f"{prefix.lower()}_nodes.tsv"
+            self._write_stub_nodes(
+                prefix=prefix,
+                curies=sorted(curies),
+                db_path=self.input_base_dir / cfg["db_filename"],
+                knowledge_source=cfg["knowledge_source"],
+                output_file=output_file,
+            )
+
+
+def _read_tsv(path: Path) -> List[Dict[str, str]]:
+    with path.open("r", encoding="utf-8") as fh:
+        return list(csv.DictReader(fh, delimiter="\t"))
+
+
+# ---------------------------------------------------------------------------
+# Static / config tests
+# ---------------------------------------------------------------------------
+
+
+def test_stub_ontology_sources_subset_of_stub_prefixes():
+    """Every prefix the new transform handles must be a recognized stub prefix."""
+    assert set(STUB_ONTOLOGY_SOURCES.keys()).issubset(STUB_ONTOLOGY_PREFIXES)
+
+
+def test_stub_ontology_sources_covers_ncit_and_mesh():
+    """NCIT and mesh are the two prefixes that need full enrichment."""
+    assert set(STUB_ONTOLOGY_SOURCES.keys()) == {"NCIT", "mesh"}
+
+
+# ---------------------------------------------------------------------------
+# Transform behaviour with in-memory adapter
+# ---------------------------------------------------------------------------
+
+
+def test_transform_writes_label_synonyms_xrefs(tmp_path):
+    """A CURIE with full metadata in the fake adapter must round-trip into the TSV."""
+    adapters = {
+        "NCIT": _FakeAdapter({
+            "NCIT:C29298": {
+                "label": "Oatmeal",
+                "aliases": ["Avena sativa rolled groats", "Porridge oats"],
+                "xrefs": ["FOODON:00001540", "wikipedia:Oatmeal"],
+            },
+        }),
+    }
+    curies = {"NCIT": {"NCIT:C29298"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert len(rows) == 1
+    row = rows[0]
+    assert row["id"] == "NCIT:C29298"
+    assert row["category"] == STUB_ONTOLOGY_CATEGORY
+    assert row["name"] == "Oatmeal"
+    assert "Avena sativa rolled groats" in row["synonym"].split("|")
+    assert "FOODON:00001540" in row["xref"].split("|")
+
+
+def test_transform_falls_back_to_curie_when_label_missing(tmp_path):
+    """Missing label must NOT produce an empty `name` cell — fall back to the CURIE."""
+    adapters = {"NCIT": _FakeAdapter({})}  # adapter knows nothing
+    curies = {"NCIT": {"NCIT:C99999"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert len(rows) == 1
+    assert rows[0]["name"] == "NCIT:C99999"  # falls back to the CURIE itself
+
+
+def test_transform_writes_empty_tsv_when_no_curies(tmp_path):
+    """No CURIEs → empty file with header (so merge.yaml's filename declaration is satisfied)."""
+    adapters = {"mesh": _FakeAdapter({})}
+    curies = {"NCIT": set(), "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    out = tmp_path / "out" / "ontologies_stubs"
+    # Both files written even with empty inputs (header-only).
+    assert (out / "ncit_nodes.tsv").is_file()
+    assert (out / "mesh_nodes.tsv").is_file()
+    assert _read_tsv(out / "ncit_nodes.tsv") == []
+    assert _read_tsv(out / "mesh_nodes.tsv") == []
+
+
+def test_transform_synonym_does_not_duplicate_label(tmp_path):
+    """If an alias equals the label, drop it from the synonym set (keep them disjoint)."""
+    adapters = {
+        "NCIT": _FakeAdapter({
+            "NCIT:C29298": {
+                "label": "Oatmeal",
+                "aliases": ["Oatmeal", "Porridge oats"],
+                "xrefs": [],
+            },
+        }),
+    }
+    curies = {"NCIT": {"NCIT:C29298"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert rows[0]["name"] == "Oatmeal"
+    assert rows[0]["synonym"] == "Porridge oats"
+
+
+def test_transform_raises_when_db_missing(tmp_path):
+    """If neither the .db nor the .db.gz exists, the transform fails loudly (not silently)."""
+    # Use the real OntologiesStubsTransform here (not the _StubbedTransform) so the
+    # _open_adapter path runs against an empty input dir.
+    transform = OntologiesStubsTransform(
+        input_dir=tmp_path / "raw",  # empty
+        output_dir=tmp_path / "out",
+    )
+    (tmp_path / "raw").mkdir()
+    # Force collection to return at least one CURIE so we exercise the missing-DB branch.
+    object.__setattr__(transform, "run", lambda data_file=None: transform._write_stub_nodes(
+        prefix="NCIT",
+        curies=["NCIT:C29298"],
+        db_path=tmp_path / "raw" / "ncit.db",
+        knowledge_source="infores:ncit",
+        output_file=tmp_path / "out" / "ncit_nodes.tsv",
+    ))
+    with pytest.raises(FileNotFoundError, match="ncit.db"):
+        transform.run()
+
+
+# ---------------------------------------------------------------------------
+# End-to-end assertions against committed mapping data (skipped when stub
+# output absent — i.e. on a fresh checkout where the transform hasn't run).
+# ---------------------------------------------------------------------------
+
+
+_STUB_OUTPUT_DIR = REPO_ROOT / "data" / "transformed" / "ontologies_stubs"
+
+
+def _stub_outputs_present() -> bool:
+    return (_STUB_OUTPUT_DIR / "ncit_nodes.tsv").is_file() and (
+        _STUB_OUTPUT_DIR / "mesh_nodes.tsv"
+    ).is_file()
+
+
+@pytest.mark.skipif(not _stub_outputs_present(), reason="stub transform output not generated yet")
+def test_every_referenced_curie_has_stub_node():
+    """Every NCIT/mesh CURIE referenced under mappings/ must resolve to a stub node row."""
+    expected = collect_stub_curies(["NCIT", "mesh"])
+    for prefix, curies in expected.items():
+        out = _STUB_OUTPUT_DIR / f"{prefix.lower()}_nodes.tsv"
+        rows = _read_tsv(out)
+        ids = {row["id"] for row in rows}
+        missing = curies - ids
+        assert not missing, (
+            f"{prefix} stub TSV missing nodes for: {sorted(missing)} "
+            f"(re-run `poetry run kg transform -s ontologies_stubs`)"
+        )
+        # Every emitted row must carry a non-empty name (no dangling-style placeholders).
+        empty_names = [row["id"] for row in rows if not (row["name"] or "").strip()]
+        assert not empty_names, f"{prefix} stub rows with empty name: {empty_names}"
diff --git a/tests/test_stub_curie_collection.py b/tests/test_stub_curie_collection.py
new file mode 100644
index 00000000..8848a780
--- /dev/null
+++ b/tests/test_stub_curie_collection.py
@@ -0,0 +1,93 @@
+"""Tests for kg_microbe.utils.stub_curie_collection."""
+
+from __future__ import annotations
+
+import csv
+from pathlib import Path
+
+import pytest
+
+from kg_microbe.utils.stub_curie_collection import (
+    DEFAULT_MAPPING_PATHS,
+    collect_stub_curies,
+)
+
+
+def _write_tsv(path: Path, header: list[str], rows: list[list[str]]) -> None:
+    """Write a small TSV with the given header + rows."""
+    with path.open("w", newline="", encoding="utf-8") as fh:
+        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
+        writer.writerow(header)
+        writer.writerows(rows)
+
+
+def test_collect_finds_known_ncit_mesh_curies(tmp_path):
+    """A fixture mapping with one NCIT and one mesh row must surface both CURIEs."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(
+        fixture,
+        ["subject_id", "object_id", "object_label"],
+        [
+            ["kgmicrobe.compound:oatmeal", "NCIT:C29298", "Oatmeal"],
+            ["kgmicrobe.compound:tween", "mesh:D011136", "Tween"],
+            ["kgmicrobe.compound:other", "CHEBI:15377", "water"],  # ignored prefix
+        ],
+    )
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    assert result["NCIT"] == {"NCIT:C29298"}
+    assert result["mesh"] == {"mesh:D011136"}
+
+
+def test_collect_normalizes_curie_case(tmp_path):
+    """``Mesh:D011136`` and ``ncit:C29298`` (wrong case) must collapse to the canonical case."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(
+        fixture,
+        ["subject_id", "object_id", "object_label"],
+        [
+            ["x", "ncit:C29298", "lowercase ncit"],
+            ["x", "Mesh:D011136", "mixed-case mesh"],
+        ],
+    )
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    # Case is normalized to the requested-prefix case.
+    assert result["NCIT"] == {"NCIT:C29298"}
+    assert result["mesh"] == {"mesh:D011136"}
+
+
+def test_collect_returns_empty_set_for_unreferenced_prefix(tmp_path):
+    """A prefix with no references in any file gets an empty set, not a missing key."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(fixture, ["object_id"], [["CHEBI:15377"]])
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    assert result == {"NCIT": set(), "mesh": set()}
+
+
+def test_collect_skips_missing_files_silently(tmp_path):
+    """Removing a mapping source from the repo must not break the collector."""
+    missing = tmp_path / "does-not-exist.tsv"
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[missing])
+    assert result == {"NCIT": set(), "mesh": set()}
+
+
+def test_collect_handles_sssom_yaml_header(tmp_path):
+    """SSSOM-style ``# ...`` YAML metadata header lines must be skipped before the column header."""
+    fixture = tmp_path / "fixture.sssom.tsv"
+    fixture.write_text(
+        "# curie_map:\n"
+        "# NCIT: 'http://purl.obolibrary.org/obo/NCIT_'\n"
+        "subject_id\tobject_id\tobject_label\n"
+        "x\tNCIT:C29298\tOatmeal\n",
+        encoding="utf-8",
+    )
+    result = collect_stub_curies(["NCIT"], mapping_paths=[fixture])
+    assert result["NCIT"] == {"NCIT:C29298"}
+
+
+def test_default_paths_yield_real_curies():
+    """Smoke-test against the committed mapping files: must find ≥1 NCIT and ≥1 mesh CURIE."""
+    if not any(p.is_file() for p in DEFAULT_MAPPING_PATHS):
+        pytest.skip("no mapping files present in this checkout")
+    result = collect_stub_curies(["NCIT", "mesh"])
+    assert len(result["NCIT"]) >= 1, "expected at least one NCIT CURIE in committed mappings"
+    assert len(result["mesh"]) >= 1, "expected at least one mesh CURIE in committed mappings"