From 770865d788a668109a73ba64748df939d89a5301 Mon Sep 17 00:00:00 2001
From: "marcin p. joachimiak" <4625870+realmarcin@users.noreply.github.com>
Date: Mon, 11 May 2026 21:40:53 -0700
Subject: [PATCH] Selective per-CURIE NCIT/MESH stub-import transform
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new sibling transform `ontologies_stubs` that imports just the
NCIT and MESH terms referenced by mappings/ — with full label, exact
synonyms, and dbxrefs — without loading the rest of those ontologies
(those belong to the sibling kg-microbe-biomedical pipeline).

Today the chemical-mapping consolidator and the BacDive isolation-source
mapper reference 70 NCIT and 92 MESH IDs as canonical xrefs for
ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The
existing STUB_ONTOLOGY_PREFIXES mechanism in
isolation_source_mapping_utils.py was emitting label-only nodes for
these inline from the BacDive transform, but: (a) the chemical-mapping-
driven NCIT/MESH refs were producing dangling node ids in the merged
KG (no node row, label, or xrefs), and (b) even where stubs existed
they carried only the BacDive object_label, no synonyms, no xrefs.

This transform fixes both. For each NCIT/MESH CURIE referenced
anywhere under mappings/, OAK queries the local SemSQL DB
(data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs,
then writes a labelled stub node to data/transformed/ontologies_stubs/
{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the
merged KG.
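
As a rough illustration of the per-CURIE fetch described above (a sketch,
not the code in this patch): `FakeAdapter` and `fetch_stub` are
hypothetical names, and the fake only mimics the three adapter calls the
transform relies on (label, entity_aliases, entity_metadata_map). Only
the 'Oatmeal' label comes from this message; the alias and xref values
are invented for the example.

```python
from typing import Dict, List, Tuple


class FakeAdapter:
    """Hypothetical in-memory stand-in for an OAK SemSQL adapter."""

    def __init__(self, store: Dict[str, dict]) -> None:
        self._store = store

    def label(self, curie: str) -> str:
        return self._store.get(curie, {}).get("label", "")

    def entity_aliases(self, curie: str) -> List[str]:
        return list(self._store.get(curie, {}).get("aliases", []))

    def entity_metadata_map(self, curie: str) -> Dict[str, List[str]]:
        xrefs = list(self._store.get(curie, {}).get("xrefs", []))
        return {"oio:hasDbXref": xrefs} if xrefs else {}


def fetch_stub(adapter, curie: str) -> Tuple[str, List[str], List[str]]:
    """Return (name, synonyms, xrefs); the CURIE is the fallback name."""
    label = adapter.label(curie) or ""
    # Keep the synonym set disjoint from the label.
    synonyms = {s for s in adapter.entity_aliases(curie) if s and s != label}
    metadata = adapter.entity_metadata_map(curie) or {}
    xrefs = {str(x) for x in metadata.get("oio:hasDbXref", []) if x}
    return (label or curie), sorted(synonyms), sorted(xrefs)


# Illustrative data: the alias and xref values are made up.
adapter = FakeAdapter(
    {
        "NCIT:C29298": {
            "label": "Oatmeal",
            "aliases": ["Oatmeal", "Oats"],
            "xrefs": ["EXAMPLE:0001"],
        }
    }
)
name, synonyms, xrefs = fetch_stub(adapter, "NCIT:C29298")
```

A CURIE that the store does not know comes back with the CURIE itself as
the name and empty synonym/xref lists, mirroring the last-resort fallback
in the transform.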

Components:

- kg_microbe/utils/stub_curie_collection.py (new): collect_stub_curies()
  scans an explicit list of mapping TSVs (unified SSSOM, isolation-
  source, MIM, canonical/*) and returns per-prefix CURIE sets with case
  normalization. Currently surfaces 70 NCIT + 92 MESH CURIEs.

- kg_microbe/transform_utils/ontologies_stubs/ (new): per-CURIE OAK
  SemSQL fetch following the same pattern as the chemical-mapping
  consolidator's enrich_with_chebi_synonyms (label, entity_aliases,
  entity_metadata_map for dbxrefs). Auto-decompresses .db.gz on first
  run if the unzipped .db is missing. Fails loudly with an actionable
  message if neither is present (no silent dangling-xref fallback).

- kg_microbe/transform.py: registers ONTOLOGIES_STUBS in DATA_SOURCES,
  ordered after OntologiesTransform so the SemSQL DBs are present.

- download.yaml: adds ncit.db.gz and mesh.db.gz from s3.amazonaws.com/
  bbop-sqlite (the standard SemSQL distribution; same source the OAK
  `sqlite:obo:` shim hits).

- merge.yaml + merge.no_metatraits.yaml + merge_bakta.yaml: declare
  the new ontologies_stubs source so the merged KG picks up the stub
  nodes. merge.minimal.yaml unchanged (it skips ontology nodes
  entirely; not a target for stub enrichment).

- kg_microbe/transform_utils/bacdive/bacdive.py: BacDive's inline
  stub-emit at lines 2990-3003 now skips NCIT and mesh prefixes
  (deferred to the new transform). Long-tail prefixes (PRIDE, PCO,
  GENEPIO, FAO, BTO, SNOMED — 1-3 IDs each) keep the inline label-
  only fallback. Build-time prefix validator unchanged.

- kg_microbe/utils/isolation_source_mapping_utils.py:
  STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two
  stub-import paths (SemSQL-enriched for NCIT/mesh, inline label-only
  for the long-tail).
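
The prefix case normalization performed by the collector can be sketched
roughly as follows. `normalize_curies` is a hypothetical helper written
for this message; it works on a plain list of cell values instead of the
mapping TSVs, and the CHEBI value is only there to show that unrequested
prefixes are ignored.

```python
import re
from typing import Dict, Iterable, Set

CURIE_RE = re.compile(r"^([A-Za-z][A-Za-z0-9._-]*):([A-Za-z0-9_.\-]+)$")


def normalize_curies(
    values: Iterable[str], prefixes: Iterable[str]
) -> Dict[str, Set[str]]:
    """Group CURIEs by requested prefix, rewriting the prefix case."""
    # Map lowercase prefix -> canonical spelling as passed by the caller.
    canonical = {p.lower(): p for p in prefixes}
    found: Dict[str, Set[str]] = {p: set() for p in canonical.values()}
    for value in values:
        match = CURIE_RE.match(value.strip())
        if not match:
            continue  # not a <prefix>:<id> cell
        prefix = canonical.get(match.group(1).lower())
        if prefix is not None:
            found[prefix].add(f"{prefix}:{match.group(2)}")
    return found


cells = [
    "ncit:C29298",
    "MESH:D011136",
    "mesh:D011136",
    "CHEBI:15377",
    "not a curie",
]
result = normalize_curies(cells, ["NCIT", "mesh"])
```

Both spellings of the MESH identifier collapse into a single
`mesh:D011136` entry, which is why the caller passes prefixes in the
canonical form it wants back.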

Tests:

- tests/test_stub_curie_collection.py: collector unit tests covering
  CURIE discovery, case normalization, missing files, SSSOM YAML
  header skipping, and a smoke test against the real committed
  mappings.

- tests/test_ontologies_stubs.py: transform unit tests with an
  in-memory fake adapter — round-trips label/synonyms/xrefs, falls
  back to CURIE when label missing, drops alias-equals-label
  duplicates, writes header-only file when no CURIEs, and raises
  loudly when neither .db nor .db.gz is present. Plus an integration
  test (skipped when stub output absent) that asserts every
  collector-discovered CURIE has a corresponding stub-node row.

Verified: 13 unit tests pass; ruff clean.

End-to-end verification (requires `poetry run kg download` to fetch
ncit.db.gz + mesh.db.gz, ~370 MB total, one-time):

  poetry run kg transform -s ontologies_stubs
  wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv
  poetry run pytest tests/test_ontologies_stubs.py -v   # integration test no longer skipped
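
For a quick manual check outside pytest, something along these lines
could compare the collector output against a stub node file. This is a
sketch under assumptions: `missing_stub_ids` is a hypothetical helper,
the first column is assumed to be named `id`, and the sample TSV is an
in-memory stand-in for a real `*_nodes.tsv`.

```python
import csv
import io
from typing import Iterable, Set, TextIO


def missing_stub_ids(expected: Iterable[str], nodes_tsv: TextIO) -> Set[str]:
    """Return the expected CURIEs that have no row in a stub node TSV."""
    reader = csv.DictReader(nodes_tsv, delimiter="\t")
    present = {row["id"] for row in reader}
    return set(expected) - present


# In-memory stand-in for data/transformed/ontologies_stubs/ncit_nodes.tsv.
sample = io.StringIO(
    "id\tcategory\tname\n"
    "NCIT:C29298\tbiolink:OntologyClass\tOatmeal\n"
)
missing = missing_stub_ids({"NCIT:C29298", "mesh:D011136"}, sample)
```

An empty result would mean every expected CURIE has a stub row; here the
MESH identifier is reported because the sample only holds the NCIT row.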

Plan: /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 download.yaml                                 |  19 ++
 kg_microbe/transform.py                       |   8 +
 kg_microbe/transform_utils/bacdive/bacdive.py |  16 +-
 kg_microbe/transform_utils/constants.py       |   1 +
 .../ontologies_stubs/__init__.py              |   7 +
 .../ontologies_stubs_transform.py             | 253 ++++++++++++++++++
 .../utils/isolation_source_mapping_utils.py   |  30 ++-
 kg_microbe/utils/stub_curie_collection.py     | 138 ++++++++++
 merge.no_metatraits.yaml                      |   8 +
 merge.yaml                                    |  10 +
 merge_bakta.yaml                              |   8 +
 tests/test_ontologies_stubs.py                | 228 ++++++++++++++++
 tests/test_stub_curie_collection.py           |  93 +++++++
 13 files changed, 812 insertions(+), 7 deletions(-)
 create mode 100644 kg_microbe/transform_utils/ontologies_stubs/__init__.py
 create mode 100644 kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py
 create mode 100644 kg_microbe/utils/stub_curie_collection.py
 create mode 100644 tests/test_ontologies_stubs.py
 create mode 100644 tests/test_stub_curie_collection.py

diff --git a/download.yaml b/download.yaml
index c330fd77..991b2700 100644
--- a/download.yaml
+++ b/download.yaml
@@ -423,3 +423,22 @@
 -
   url: https://raw.githubusercontent.com/biolink/kgx/master/docs/kgx_format.md
   local_name: kgx-format.md
+
+
+#
+# **** Selective stub-import ontologies (NCIT, MESH) ****
+#
+# KG-Microbe does NOT load the full NCIT or MESH ontologies — those belong to
+# kg-microbe-biomedical. But the chemical-mapping consolidator and BacDive
+# isolation-source mapper reference ~160 NCIT/MESH IDs as canonical xrefs for
+# ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The
+# OntologiesStubsTransform queries these SemSQL DBs to harvest just the
+# referenced IDs (label + synonyms + xrefs), emitting one labelled stub node
+# each. The DBs themselves are never loaded into the merged KG.
+#
+-
+  url: https://s3.amazonaws.com/bbop-sqlite/ncit.db.gz
+  local_name: ncit.db.gz
+-
+  url: https://s3.amazonaws.com/bbop-sqlite/mesh.db.gz
+  local_name: mesh.db.gz
diff --git a/kg_microbe/transform.py b/kg_microbe/transform.py
index e03ebd8d..c1c346be 100644
--- a/kg_microbe/transform.py
+++ b/kg_microbe/transform.py
@@ -20,6 +20,7 @@
     METATRAITS,
     METATRAITS_GTDB,
     ONTOLOGIES,
+    ONTOLOGIES_STUBS,
     RHEAMAPPINGS,
 )
 from kg_microbe.transform_utils.gtdb.gtdb import GTDBTransform
@@ -32,6 +33,9 @@
     ONTOLOGIES_MAP,
     OntologiesTransform,
 )
+from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import (
+    OntologiesStubsTransform,
+)
 from kg_microbe.transform_utils.rhea_mappings.rhea_mappings import RheaMappingsTransform
 
 DATA_SOURCES = {
@@ -44,6 +48,10 @@
     # "ProteinAtlasTransform": ProteinAtlasTransform,
     # "STRINGTransform": STRINGTransform,
     ONTOLOGIES: OntologiesTransform,
+    # Run ontologies_stubs after ontologies so the SemSQL DBs are present and
+    # so the stub-node TSVs land in data/transformed/ontologies_stubs/ before
+    # the merge step picks them up.
+    ONTOLOGIES_STUBS: OntologiesStubsTransform,
     BACDIVE: BacDiveTransform,
     BAKTA: BaktaTransform,
     COG: COGTransform,
diff --git a/kg_microbe/transform_utils/bacdive/bacdive.py b/kg_microbe/transform_utils/bacdive/bacdive.py
index 7b61e0f1..72407576 100644
--- a/kg_microbe/transform_utils/bacdive/bacdive.py
+++ b/kg_microbe/transform_utils/bacdive/bacdive.py
@@ -2987,8 +2987,22 @@ def run(self, data_file: Union[Optional[Path], Optional[str]] = None, show_statu
                             # emit a thin node row here instead of pulling in the full
                             # ontology. Loaded-ontology targets (UBERON, ENVO, ...) get
                             # their canonical node from the ontologies transform.
+                            #
+                            # NCIT and mesh stub nodes are NOT emitted here — the
+                            # OntologiesStubsTransform (kg_microbe/transform_utils/
+                            # ontologies_stubs/) writes label+synonym+xref-enriched
+                            # stubs from the SemSQL DBs, which is strictly richer
+                            # than the label-only fallback below. Emitting both
+                            # here and there would produce duplicate node rows
+                            # that the merge would have to dedupe. The PRIDE/PCO/
+                            # GENEPIO/FAO/BTO/SNOMED prefixes stay on the inline
+                            # path because each has 1-3 IDs in the whole repo —
+                            # not worth a SemSQL fetch.
                             stub_prefix = subject_id.split(":", 1)[0] if ":" in subject_id else ""
-                            if stub_prefix in STUB_ONTOLOGY_PREFIXES:
+                            if stub_prefix in STUB_ONTOLOGY_PREFIXES and stub_prefix not in {
+                                "NCIT",
+                                "mesh",
+                            }:
                                 node_writer.writerow(
                                     self._create_node_row(
                                         subject_id,
diff --git a/kg_microbe/transform_utils/constants.py b/kg_microbe/transform_utils/constants.py
index 6a2ca1cb..c8f35b89 100644
--- a/kg_microbe/transform_utils/constants.py
+++ b/kg_microbe/transform_utils/constants.py
@@ -13,6 +13,7 @@
 KEGG = "kegg"
 RHEAMAPPINGS = "rhea_mappings"
 ONTOLOGIES = "ontologies"
+ONTOLOGIES_STUBS = "ontologies_stubs"
 WALLEN_ETAL = "wallen_etal"
 CTD = "ctd"
 DISBIOME = "disbiome"
diff --git a/kg_microbe/transform_utils/ontologies_stubs/__init__.py b/kg_microbe/transform_utils/ontologies_stubs/__init__.py
new file mode 100644
index 00000000..fc2e8c6d
--- /dev/null
+++ b/kg_microbe/transform_utils/ontologies_stubs/__init__.py
@@ -0,0 +1,7 @@
+"""Ontologies-stubs transform package."""
+
+from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import (
+    OntologiesStubsTransform,
+)
+
+__all__ = ["OntologiesStubsTransform"]
diff --git a/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py b/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py
new file mode 100644
index 00000000..48f33f47
--- /dev/null
+++ b/kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py
@@ -0,0 +1,253 @@
+"""
+Ontologies-stubs transform.
+
+KG-Microbe deliberately does NOT load the full NCIT or MESH ontologies — those
+belong to the sibling ``kg-microbe-biomedical`` pipeline. But the
+chemical-mapping consolidator and the BacDive isolation-source mapper reference
+~160 NCIT and MESH IDs as canonical xrefs for ingredients (e.g.
+``NCIT:C29298 'Oatmeal'``, ``mesh:D011136 'Tween'``). Without this transform
+those CURIEs would appear as dangling node ids in the merged KG: edges point at
+them but no node row carries the label.
+
+This transform:
+
+1. Calls :func:`~kg_microbe.utils.stub_curie_collection.collect_stub_curies` to
+   discover every NCIT and MESH CURIE referenced anywhere under ``mappings/``.
+2. For each CURIE, queries the local SemSQL DB (``data/raw/ncit.db``,
+   ``data/raw/mesh.db``) via OAK to fetch its ``rdfs:label``, exact synonyms,
+   and dbxrefs. The same pattern is used by the chemical-mapping consolidator
+   for ChEBI in ``scripts/consolidate_chemical_mappings.py``.
+3. Writes one KGX node TSV per stub ontology to
+   ``data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv`` carrying
+   ``id, category, name, description, xref, provided_by, synonym, deprecated, same_as``.
+   No edges file — stubs are isolated nodes; edges arrive from the source
+   transforms (BacDive, MediaDive ingredients via the chemical-mapping path,
+   etc.).
+
+Note for downstream consumers: if a KG built with this transform is ever
+merged with a kg-microbe-biomedical KG that loads NCIT/MESH fully, biolink
+merge semantics will union nodes — the stub node here is a strict subset of
+what the full ontology would emit (label/synonym/xref only; no edges, no
+deprecated flag, no parent classes), so the union will simply pick the
+fuller record.
+"""
+
+from __future__ import annotations
+
+import csv
+import gzip
+import shutil
+from pathlib import Path
+from typing import Dict, Iterable, List, Optional, Set
+
+from kg_microbe.transform_utils.constants import (
+    CATEGORY_COLUMN,
+    DEPRECATED_COLUMN,
+    DESCRIPTION_COLUMN,
+    ID_COLUMN,
+    NAME_COLUMN,
+    PROVIDED_BY_COLUMN,
+    SAME_AS_COLUMN,
+    SYNONYM_COLUMN,
+    XREF_COLUMN,
+)
+from kg_microbe.transform_utils.transform import Transform
+from kg_microbe.utils.isolation_source_mapping_utils import STUB_ONTOLOGY_CATEGORY
+from kg_microbe.utils.stub_curie_collection import collect_stub_curies
+
+# Stub ontologies handled by this transform. Each entry maps the canonical
+# CURIE prefix (case-sensitive — must match how the prefix appears in
+# existing mapping rows) to the local SemSQL DB and the InforES knowledge
+# source string.
+STUB_ONTOLOGY_SOURCES: Dict[str, Dict[str, str]] = {
+    "NCIT": {
+        "db_filename": "ncit.db",
+        "knowledge_source": "infores:ncit",
+    },
+    "mesh": {
+        "db_filename": "mesh.db",
+        "knowledge_source": "infores:mesh",
+    },
+}
+
+ONTOLOGIES_STUBS_SOURCE_NAME = "ontologies_stubs"
+
+
+class OntologiesStubsTransform(Transform):
+
+    """Emit one labelled stub node per referenced NCIT / MESH CURIE."""
+
+    def __init__(
+        self,
+        input_dir: Optional[Path] = None,
+        output_dir: Optional[Path] = None,
+    ):
+        """
+        Instantiate transform.
+
+        :param input_dir: Where the SemSQL DBs live (defaults to ``data/raw/``).
+        :param output_dir: Where ``ontologies_stubs/{ncit,mesh}_nodes.tsv`` are
+            written (defaults to ``data/transformed/``).
+        """
+        super().__init__(ONTOLOGIES_STUBS_SOURCE_NAME, input_dir, output_dir)
+
+    def run(self, data_file=None) -> None:  # noqa: D401 — base class signature
+        """
+        Collect stub CURIEs, fetch metadata via OAK, write per-ontology node TSVs.
+
+        :param data_file: Unused (kept for the base-class signature). The
+            transform discovers its inputs from the mapping TSVs and the
+            SemSQL DBs in ``input_base_dir``.
+        """
+        prefixes = list(STUB_ONTOLOGY_SOURCES.keys())
+        curies_by_prefix = collect_stub_curies(prefixes)
+
+        for prefix, curies in curies_by_prefix.items():
+            cfg = STUB_ONTOLOGY_SOURCES[prefix]
+            db_path = self.input_base_dir / cfg["db_filename"]
+            output_file = self.output_dir / f"{prefix.lower()}_nodes.tsv"
+            self._write_stub_nodes(
+                prefix=prefix,
+                curies=sorted(curies),
+                db_path=db_path,
+                knowledge_source=cfg["knowledge_source"],
+                output_file=output_file,
+            )
+
+    # ------------------------------------------------------------------
+    # internal helpers
+    # ------------------------------------------------------------------
+
+    def _write_stub_nodes(
+        self,
+        prefix: str,
+        curies: List[str],
+        db_path: Path,
+        knowledge_source: str,
+        output_file: Path,
+    ) -> None:
+        """Fetch label/synonyms/xrefs per CURIE and write the node TSV."""
+        if not curies:
+            print(f"  [{prefix}] no CURIEs to import; skipping {output_file.name}")
+            # Write an empty file with header so the merge step doesn't fail
+            # on a missing file declared in merge.yaml.
+            self._write_node_file(output_file, [])
+            return
+
+        adapter = self._open_adapter(prefix, db_path)
+        if adapter is None:
+            raise FileNotFoundError(
+                f"OAK adapter for {prefix} could not be opened (expected SemSQL DB at "
+                f"{db_path}). Run `poetry run kg download` to fetch it. The stub "
+                f"transform refuses to silently emit unlabelled nodes — that would "
+                f"reintroduce the dangling-xref hazard this transform exists to fix."
+            )
+
+        rows: List[List[Optional[str]]] = []
+        missing: List[str] = []
+        for curie in curies:
+            label, synonyms, xrefs = self._fetch_metadata(adapter, curie)
+            if not label:
+                # Last-resort fallback: use the CURIE as the name. Log it so
+                # curators can chase down obsolete or missing entries upstream.
+                missing.append(curie)
+                label = curie
+            row = [
+                curie,                      # id
+                STUB_ONTOLOGY_CATEGORY,     # category
+                label,                      # name
+                None,                        # description
+                _join_pipe(xrefs),          # xref
+                ONTOLOGIES_STUBS_SOURCE_NAME,  # provided_by
+                _join_pipe(synonyms),       # synonym
+                None,                        # deprecated
+                None,                        # same_as
+            ]
+            rows.append(row)
+
+        self._write_node_file(output_file, rows)
+        print(
+            f"  [{prefix}] wrote {len(rows)} stub nodes to {output_file.name} "
+            f"(knowledge_source={knowledge_source}, missing labels: {len(missing)})"
+        )
+        if missing:
+            print(f"  [{prefix}] CURIEs with no SemSQL label (used CURIE as name): {missing}")
+
+    def _open_adapter(self, prefix: str, db_path: Path):
+        """
+        Open an OAK SemSQL adapter against the local DB; return None on failure.
+
+        The bbop-sqlite bucket distributes the SemSQL DBs as ``.db.gz`` and ``download.yaml``
+        stores the gzipped form. If the unzipped ``.db`` is missing but a sibling
+        ``.db.gz`` is present, decompress it once (idempotent) and use the result.
+        """
+        if not db_path.is_file():
+            gz_path = db_path.with_suffix(db_path.suffix + ".gz")
+            if gz_path.is_file():
+                print(f"  [{prefix}] decompressing {gz_path.name} → {db_path.name}")
+                with gzip.open(gz_path, "rb") as src, db_path.open("wb") as dst:
+                    shutil.copyfileobj(src, dst)
+            else:
+                return None
+        try:
+            from oaklib import get_adapter
+        except ImportError as exc:  # pragma: no cover — oaklib is a dep
+            raise RuntimeError(
+                f"oaklib import failed while opening SemSQL adapter for {prefix}: {exc}"
+            ) from exc
+        return get_adapter(f"sqlite:{db_path}")
+
+    def _fetch_metadata(self, adapter, curie: str):
+        """Return (label, synonyms_set, xrefs_set) for ``curie`` via the OAK adapter."""
+        label = ""
+        synonyms: Set[str] = set()
+        xrefs: Set[str] = set()
+        try:
+            label = adapter.label(curie) or ""
+        except Exception:  # noqa: S110 — obsolete CURIEs are expected to miss
+            pass
+        try:
+            synonyms = {s for s in adapter.entity_aliases(curie) if s}
+        except Exception:  # noqa: S110
+            pass
+        # Drop the canonical label out of the synonym set to keep them disjoint.
+        synonyms.discard(label)
+        try:
+            metadata = adapter.entity_metadata_map(curie) or {}
+        except Exception:  # noqa: S110
+            metadata = {}
+        # OAK returns metadata keyed by short-form predicate. dbxref entries
+        # land under "oio:hasDbXref" (or "oboInOwl:hasDbXref" on older
+        # adapters). Accept both.
+        for predicate_key in ("oio:hasDbXref", "oboInOwl:hasDbXref"):
+            for value in metadata.get(predicate_key, []) or []:
+                if value:
+                    xrefs.add(str(value))
+        return label, sorted(synonyms), sorted(xrefs)
+
+    def _write_node_file(self, path: Path, rows: Iterable[Iterable[Optional[str]]]) -> None:
+        """Write ``rows`` to ``path`` using the standard Transform node header."""
+        path.parent.mkdir(parents=True, exist_ok=True)
+        # Use the canonical 9-column node header from the Transform base class.
+        header = [
+            ID_COLUMN,
+            CATEGORY_COLUMN,
+            NAME_COLUMN,
+            DESCRIPTION_COLUMN,
+            XREF_COLUMN,
+            PROVIDED_BY_COLUMN,
+            SYNONYM_COLUMN,
+            DEPRECATED_COLUMN,
+            SAME_AS_COLUMN,
+        ]
+        with path.open("w", newline="", encoding="utf-8") as fh:
+            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
+            writer.writerow(header)
+            for row in rows:
+                writer.writerow(["" if cell is None else cell for cell in row])
+
+
+def _join_pipe(values: Iterable[str]) -> str:
+    """Pipe-join a sequence; return ``""`` when empty (matches existing TSV convention)."""
+    items = [v for v in values if v]
+    return "|".join(items) if items else ""
diff --git a/kg_microbe/utils/isolation_source_mapping_utils.py b/kg_microbe/utils/isolation_source_mapping_utils.py
index 30028b3c..8c5aa8c7 100644
--- a/kg_microbe/utils/isolation_source_mapping_utils.py
+++ b/kg_microbe/utils/isolation_source_mapping_utils.py
@@ -86,12 +86,30 @@
 # but that are NOT loaded by the ontologies transform (see ONTOLOGIES_MAP in
 # kg_microbe/transform_utils/ontologies/ontologies_transform.py). Each prefix
 # either has only a tiny number of distinct IDs in use, or its full load is
-# impractical (mesh and NCIT are huge clinical thesauri), so the BacDive
-# transform writes a thin node row per resolved CURIE using the object_label
-# from the mapping TSV. The category is biolink:OntologyClass for all stubs
-# because they're typically categorical terms (host body site, microbial
-# community, abscess, etc.) rather than specific anatomy / environmental
-# features whose canonical metadata would come from a loaded ontology.
+# impractical (mesh and NCIT are huge clinical thesauri).
+#
+# Two stub-import paths exist for these prefixes:
+#
+# 1. NCIT and mesh: a SemSQL-backed enriched stub source. The
+#    OntologiesStubsTransform (kg_microbe/transform_utils/ontologies_stubs/)
+#    queries data/raw/ncit.db and data/raw/mesh.db via OAK to fetch
+#    rdfs:label, exact synonyms, and dbxrefs for every NCIT/mesh CURIE that
+#    appears anywhere under mappings/. Output:
+#    data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. This is the
+#    preferred path — stubs carry full metadata, not just a label. The
+#    BacDive inline emit at bacdive.py defers to this transform for these
+#    two prefixes (see the `not in {"NCIT", "mesh"}` branch there).
+#
+# 2. The long-tail prefixes (PRIDE, PCO, GENEPIO, FAO, BTO, SNOMED): each
+#    has 1-3 IDs in the whole repo, so the BacDive transform writes a thin
+#    label-only node row inline at edge-emit time using the object_label
+#    from the mapping TSV. Setting up SemSQL DBs for these would be
+#    overkill.
+#
+# The category is biolink:OntologyClass for all stubs because they're
+# typically categorical terms (host body site, microbial community,
+# abscess, etc.) rather than specific anatomy / environmental features
+# whose canonical metadata would come from a loaded ontology.
 #
 # Codex adversarial review #558 found that without stubs for these prefixes
 # the BacDive transform was emitting edges to dangling node IDs because the
diff --git a/kg_microbe/utils/stub_curie_collection.py b/kg_microbe/utils/stub_curie_collection.py
new file mode 100644
index 00000000..0abcfeaf
--- /dev/null
+++ b/kg_microbe/utils/stub_curie_collection.py
@@ -0,0 +1,138 @@
+"""
+Collect stub-prefix CURIEs referenced anywhere in the mapping TSVs.
+
+KG-Microbe deliberately does NOT load the full NCIT or MESH ontologies (those
+belong to the sibling kg-microbe-biomedical pipeline), but the chemical-mapping
+consolidator and the BacDive isolation-source mapper reference roughly 160
+NCIT and MESH IDs. This collector finds every such CURIE so that the
+downstream :class:`~kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform.OntologiesStubsTransform`
+can fetch a labelled stub node for each one.
+
+It scans a fixed set of mapping files under ``mappings/`` (no globbing: a stray
+file would otherwise silently change the import set, so the file list is
+explicit and auditable):
+
+* ``mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz`` — unified
+  chemical/anatomy/environment mappings (object_id, subject_id columns).
+* ``mappings/isolation_source_to_ontology.tsv`` — BacDive isolation-source
+  mappings (object_id column).
+* ``mappings/ingredient_mappings.sssom.tsv`` — vendored MIM SSSOM
+  (object_id, subject_id).
+* ``mappings/canonical/*.tsv`` — chemical/enzyme/pathway/phenotype canonical
+  exports (object_id).
+
+Returned dict shape: ``{normalized_prefix: {curie, curie, ...}}`` where
+``normalized_prefix`` matches the case used in
+:data:`~kg_microbe.utils.isolation_source_mapping_utils.STUB_ONTOLOGY_PREFIXES`
+(e.g. ``"NCIT"`` is uppercase, ``"mesh"`` is lowercase). Inputs in any case
+are accepted and normalized.
+"""
+
+from __future__ import annotations
+
+import csv
+import gzip
+import re
+from pathlib import Path
+from typing import Dict, Iterable, Set
+
+REPO_ROOT = Path(__file__).resolve().parent.parent.parent
+
+# Files explicitly scanned for stub CURIEs. Adding a new file here is an
+# auditable opt-in change; missing files are silently skipped (so removing a
+# mapping source from the repo doesn't break the collector).
+DEFAULT_MAPPING_PATHS = (
+    REPO_ROOT / "mappings" / "kgmicrobe_unified_entity_mappings.sssom.tsv.gz",
+    REPO_ROOT / "mappings" / "isolation_source_to_ontology.tsv",
+    REPO_ROOT / "mappings" / "ingredient_mappings.sssom.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "chemical_mappings.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "enzyme_mappings.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "enzyme_name_to_go.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "metpo_alias_mappings.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "pathway_mappings.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "phenotype_mappings.tsv",
+    REPO_ROOT / "mappings" / "canonical" / "special_chemical_mappings.tsv",
+)
+
+# Columns to scan for CURIEs across all mapping shapes. Any cell whose value
+# parses as ``<prefix>:<id>`` and matches one of the requested prefixes is
+# collected.
+_CURIE_COLUMNS = (
+    "object_id",
+    "subject_id",
+)
+
+_CURIE_RE = re.compile(r"^([A-Za-z][A-Za-z0-9._-]*):([A-Za-z0-9_.\-]+)$")
+
+
+def _open_text(path: Path):
+    """Open a TSV / TSV.GZ for text reading after stripping any SSSOM YAML header."""
+    handle = gzip.open(path, "rt", encoding="utf-8") if path.suffix == ".gz" else path.open(
+        "r", encoding="utf-8"
+    )
+    # SSSOM files prefix a YAML metadata header with `# `. Skip those before
+    # handing the file to csv.DictReader.
+    while True:
+        pos = handle.tell()
+        line = handle.readline()
+        if not line:
+            break
+        if not line.startswith("#"):
+            handle.seek(pos)
+            break
+    return handle
+
+
+def _normalize_prefix(prefix: str, canonical_prefixes: Dict[str, str]) -> str | None:
+    """Return the canonical-cased prefix string for ``prefix``, or ``None`` if unknown."""
+    return canonical_prefixes.get(prefix.lower())
+
+
+def collect_stub_curies(
+    prefixes: Iterable[str],
+    mapping_paths: Iterable[Path] | None = None,
+) -> Dict[str, Set[str]]:
+    """
+    Scan the mapping TSVs and return the set of CURIEs that match each requested prefix.
+
+    :param prefixes: Iterable of CURIE prefixes to collect. Case-insensitive on
+        input; the returned dict's keys preserve the case as given here, so
+        callers should pass them in the canonical form they want
+        (``"NCIT"``, ``"mesh"``, ...).
+    :param mapping_paths: Override the file list (mainly for tests). Defaults
+        to :data:`DEFAULT_MAPPING_PATHS`.
+    :returns: ``{canonical_prefix: {curie, ...}}`` for every prefix in
+        ``prefixes``, with the empty set as default for prefixes that have no
+        references in any mapping file.
+    """
+    canonical_prefixes: Dict[str, str] = {p.lower(): p for p in prefixes}
+    result: Dict[str, Set[str]] = {p: set() for p in canonical_prefixes.values()}
+
+    paths = list(mapping_paths) if mapping_paths is not None else list(DEFAULT_MAPPING_PATHS)
+
+    for path in paths:
+        if not path.is_file():
+            continue
+        with _open_text(path) as handle:
+            reader = csv.DictReader(handle, delimiter="\t")
+            for row in reader:
+                for col in _CURIE_COLUMNS:
+                    value = (row.get(col) or "").strip()
+                    if not value:
+                        continue
+                    match = _CURIE_RE.match(value)
+                    if not match:
+                        continue
+                    raw_prefix, local = match.group(1), match.group(2)
+                    canonical = _normalize_prefix(raw_prefix, canonical_prefixes)
+                    if canonical is None:
+                        continue
+                    result[canonical].add(f"{canonical}:{local}")
+
+    return result
+
+
+__all__ = [
+    "DEFAULT_MAPPING_PATHS",
+    "collect_stub_curies",
+]
diff --git a/merge.no_metatraits.yaml b/merge.no_metatraits.yaml
index 4ffb9a2e..b49599f7 100644
--- a/merge.no_metatraits.yaml
+++ b/merge.no_metatraits.yaml
@@ -68,6 +68,14 @@ merged_graph:
         filename:
           - data/transformed/ontologies/metpo_nodes.tsv
           - data/transformed/ontologies/metpo_edges.tsv
+    # Selective per-CURIE stub-import for NCIT and MESH — see merge.yaml.
+    ontologies_stubs:
+      name: "ontologies_stubs"
+      input:
+        format: tsv
+        filename:
+          - data/transformed/ontologies_stubs/ncit_nodes.tsv
+          - data/transformed/ontologies_stubs/mesh_nodes.tsv
     bacdive:
       name: "bacdive"
       input:
diff --git a/merge.yaml b/merge.yaml
index 7876cc4c..ec475cf5 100644
--- a/merge.yaml
+++ b/merge.yaml
@@ -82,6 +82,16 @@ merged_graph:
         filename:
           - data/transformed/ontologies/metpo_nodes.tsv
           - data/transformed/ontologies/metpo_edges.tsv
+    # Selective per-CURIE stub-import for NCIT and MESH — only the IDs
+    # referenced by mappings/* are imported (not the full ontologies).
+    # See kg_microbe/transform_utils/ontologies_stubs/ for the transform.
+    ontologies_stubs:
+      name: "ontologies_stubs"
+      input:
+        format: tsv
+        filename:
+          - data/transformed/ontologies_stubs/ncit_nodes.tsv
+          - data/transformed/ontologies_stubs/mesh_nodes.tsv
     bacdive:
       name: "bacdive"
       input:
diff --git a/merge_bakta.yaml b/merge_bakta.yaml
index c2f4431d..da1020f0 100644
--- a/merge_bakta.yaml
+++ b/merge_bakta.yaml
@@ -84,6 +84,14 @@ merged_graph:
         filename:
           - data/transformed/ontologies/metpo_nodes.tsv
           - data/transformed/ontologies/metpo_edges.tsv
+    # Selective per-CURIE stub-import for NCIT and MESH — see merge.yaml.
+    ontologies_stubs:
+      name: "ontologies_stubs"
+      input:
+        format: tsv
+        filename:
+          - data/transformed/ontologies_stubs/ncit_nodes.tsv
+          - data/transformed/ontologies_stubs/mesh_nodes.tsv
     bacdive:
       name: "bacdive"
       input:
diff --git a/tests/test_ontologies_stubs.py b/tests/test_ontologies_stubs.py
new file mode 100644
index 00000000..6bd108d3
--- /dev/null
+++ b/tests/test_ontologies_stubs.py
@@ -0,0 +1,228 @@
+"""Tests for the OntologiesStubsTransform."""
+
+from __future__ import annotations
+
+import csv
+from pathlib import Path
+from typing import Dict, List, Set
+
+import pytest
+
+from kg_microbe.transform_utils.ontologies_stubs.ontologies_stubs_transform import (
+    STUB_ONTOLOGY_SOURCES,
+    OntologiesStubsTransform,
+)
+from kg_microbe.utils.isolation_source_mapping_utils import (
+    STUB_ONTOLOGY_CATEGORY,
+    STUB_ONTOLOGY_PREFIXES,
+)
+from kg_microbe.utils.stub_curie_collection import collect_stub_curies
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+
+# ---------------------------------------------------------------------------
+# In-memory fakes
+# ---------------------------------------------------------------------------
+
+
+class _FakeAdapter:
+
+    """Minimal stand-in for an OAK SemSQL adapter — enough for the stub transform."""
+
+    def __init__(self, store: Dict[str, Dict]):
+        self._store = store  # {curie: {"label": str, "aliases": [...], "xrefs": [...]}}
+
+    def label(self, curie: str):
+        return self._store.get(curie, {}).get("label", "")
+
+    def entity_aliases(self, curie: str):
+        return list(self._store.get(curie, {}).get("aliases", []))
+
+    def entity_metadata_map(self, curie: str):
+        xrefs = list(self._store.get(curie, {}).get("xrefs", []))
+        return {"oio:hasDbXref": xrefs} if xrefs else {}
+
+
+class _StubbedTransform(OntologiesStubsTransform):
+
+    """Subclass that swaps in an in-memory adapter so tests don't touch SemSQL DBs on disk."""
+
+    def __init__(self, *, adapters: Dict[str, _FakeAdapter], curies: Dict[str, Set[str]],
+                 input_dir: Path, output_dir: Path):
+        super().__init__(input_dir=input_dir, output_dir=output_dir)
+        self._fake_adapters = adapters
+        self._fake_curies = curies
+
+    def _open_adapter(self, prefix, db_path):  # noqa: D401 — override
+        return self._fake_adapters.get(prefix)
+
+    def run(self, data_file=None):  # noqa: D401 — override
+        # Bypass collect_stub_curies (we inject a curated set instead).
+        for prefix, curies in self._fake_curies.items():
+            if prefix not in STUB_ONTOLOGY_SOURCES:
+                continue
+            cfg = STUB_ONTOLOGY_SOURCES[prefix]
+            output_file = self.output_dir / f"{prefix.lower()}_nodes.tsv"
+            self._write_stub_nodes(
+                prefix=prefix,
+                curies=sorted(curies),
+                db_path=self.input_base_dir / cfg["db_filename"],
+                knowledge_source=cfg["knowledge_source"],
+                output_file=output_file,
+            )
+
+
+def _read_tsv(path: Path) -> List[Dict[str, str]]:
+    with path.open("r", encoding="utf-8") as fh:
+        return list(csv.DictReader(fh, delimiter="\t"))
+
+
+# ---------------------------------------------------------------------------
+# Static / config tests
+# ---------------------------------------------------------------------------
+
+
+def test_stub_ontology_sources_subset_of_stub_prefixes():
+    """Every prefix the new transform handles must be a recognized stub prefix."""
+    assert set(STUB_ONTOLOGY_SOURCES.keys()).issubset(STUB_ONTOLOGY_PREFIXES)
+
+
+def test_stub_ontology_sources_covers_ncit_and_mesh():
+    """NCIT and mesh are the two prefixes that need full enrichment."""
+    assert set(STUB_ONTOLOGY_SOURCES.keys()) == {"NCIT", "mesh"}
+
+
+# ---------------------------------------------------------------------------
+# Transform behaviour with in-memory adapter
+# ---------------------------------------------------------------------------
+
+
+def test_transform_writes_label_synonyms_xrefs(tmp_path):
+    """A CURIE with full metadata in the fake adapter must round-trip into the TSV."""
+    adapters = {
+        "NCIT": _FakeAdapter({
+            "NCIT:C29298": {
+                "label": "Oatmeal",
+                "aliases": ["Avena sativa rolled groats", "Porridge oats"],
+                "xrefs": ["FOODON:00001540", "wikipedia:Oatmeal"],
+            },
+        }),
+    }
+    curies = {"NCIT": {"NCIT:C29298"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert len(rows) == 1
+    row = rows[0]
+    assert row["id"] == "NCIT:C29298"
+    assert row["category"] == STUB_ONTOLOGY_CATEGORY
+    assert row["name"] == "Oatmeal"
+    assert "Avena sativa rolled groats" in row["synonym"].split("|")
+    assert "FOODON:00001540" in row["xref"].split("|")
+
+
+def test_transform_falls_back_to_curie_when_label_missing(tmp_path):
+    """Missing label must NOT produce an empty `name` cell — fall back to the CURIE."""
+    adapters = {"NCIT": _FakeAdapter({})}  # adapter knows nothing
+    curies = {"NCIT": {"NCIT:C99999"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert len(rows) == 1
+    assert rows[0]["name"] == "NCIT:C99999"  # falls back to the CURIE itself
+
+
+def test_transform_writes_empty_tsv_when_no_curies(tmp_path):
+    """No CURIEs → header-only file (so merge.yaml's filename declaration is satisfied)."""
+    adapters = {"mesh": _FakeAdapter({})}
+    curies = {"NCIT": set(), "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    out = tmp_path / "out" / "ontologies_stubs"
+    # Both files written even with empty inputs (header-only).
+    assert (out / "ncit_nodes.tsv").is_file()
+    assert (out / "mesh_nodes.tsv").is_file()
+    assert _read_tsv(out / "ncit_nodes.tsv") == []
+    assert _read_tsv(out / "mesh_nodes.tsv") == []
+
+
+def test_transform_synonym_does_not_duplicate_label(tmp_path):
+    """If an alias equals the label, drop it from the synonym set (keep them disjoint)."""
+    adapters = {
+        "NCIT": _FakeAdapter({
+            "NCIT:C29298": {
+                "label": "Oatmeal",
+                "aliases": ["Oatmeal", "Porridge oats"],
+                "xrefs": [],
+            },
+        }),
+    }
+    curies = {"NCIT": {"NCIT:C29298"}, "mesh": set()}
+    t = _StubbedTransform(adapters=adapters, curies=curies,
+                          input_dir=tmp_path / "in", output_dir=tmp_path / "out")
+    (tmp_path / "in").mkdir()
+    t.run()
+    rows = _read_tsv(tmp_path / "out" / "ontologies_stubs" / "ncit_nodes.tsv")
+    assert rows[0]["name"] == "Oatmeal"
+    assert rows[0]["synonym"] == "Porridge oats"
+
+
+def test_transform_raises_when_db_missing(tmp_path):
+    """If neither the .db nor the .db.gz exists, the transform raises FileNotFoundError."""
+    # Use the real OntologiesStubsTransform here (not the _StubbedTransform) so the
+    # _open_adapter path runs against an empty input dir.
+    transform = OntologiesStubsTransform(
+        input_dir=tmp_path / "raw",  # empty
+        output_dir=tmp_path / "out",
+    )
+    (tmp_path / "raw").mkdir()
+    # Call _write_stub_nodes directly with one CURIE (bypassing CURIE collection) so
+    # the missing-DB branch is exercised.
+    with pytest.raises(FileNotFoundError, match=r"ncit\.db"):
+        transform._write_stub_nodes(
+            prefix="NCIT",
+            curies=["NCIT:C29298"],
+            db_path=tmp_path / "raw" / "ncit.db",
+            knowledge_source="infores:ncit",
+            output_file=tmp_path / "out" / "ncit_nodes.tsv",
+        )
+
+
+# ---------------------------------------------------------------------------
+# End-to-end assertions against committed mapping data (skipped when stub
+# output absent — i.e. on a fresh checkout where the transform hasn't run).
+# ---------------------------------------------------------------------------
+
+
+_STUB_OUTPUT_DIR = REPO_ROOT / "data" / "transformed" / "ontologies_stubs"
+
+
+def _stub_outputs_present() -> bool:
+    return (_STUB_OUTPUT_DIR / "ncit_nodes.tsv").is_file() and (
+        _STUB_OUTPUT_DIR / "mesh_nodes.tsv"
+    ).is_file()
+
+
+@pytest.mark.skipif(not _stub_outputs_present(), reason="stub transform output not generated yet")
+def test_every_referenced_curie_has_stub_node():
+    """Every NCIT/mesh CURIE referenced under mappings/ must resolve to a stub node row."""
+    expected = collect_stub_curies(["NCIT", "mesh"])
+    for prefix, curies in expected.items():
+        out = _STUB_OUTPUT_DIR / f"{prefix.lower()}_nodes.tsv"
+        rows = _read_tsv(out)
+        ids = {row["id"] for row in rows}
+        missing = curies - ids
+        assert not missing, (
+            f"{prefix} stub TSV missing nodes for: {sorted(missing)} "
+            f"(re-run `poetry run kg transform -s ontologies_stubs`)"
+        )
+        # Every emitted row needs a non-empty name (the CURIE itself when no label was found).
+        empty_names = [row["id"] for row in rows if not (row["name"] or "").strip()]
+        assert not empty_names, f"{prefix} stub rows with empty name: {empty_names}"
diff --git a/tests/test_stub_curie_collection.py b/tests/test_stub_curie_collection.py
new file mode 100644
index 00000000..8848a780
--- /dev/null
+++ b/tests/test_stub_curie_collection.py
@@ -0,0 +1,93 @@
+"""Tests for kg_microbe.utils.stub_curie_collection."""
+
+from __future__ import annotations
+
+import csv
+from pathlib import Path
+
+import pytest
+
+from kg_microbe.utils.stub_curie_collection import (
+    DEFAULT_MAPPING_PATHS,
+    collect_stub_curies,
+)
+
+
+def _write_tsv(path: Path, header: list[str], rows: list[list[str]]) -> None:
+    """Write a small TSV with the given header + rows."""
+    with path.open("w", newline="", encoding="utf-8") as fh:
+        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
+        writer.writerow(header)
+        writer.writerows(rows)
+
+
+def test_collect_finds_known_ncit_mesh_curies(tmp_path):
+    """A fixture mapping with one NCIT and one mesh row must surface both CURIEs."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(
+        fixture,
+        ["subject_id", "object_id", "object_label"],
+        [
+            ["kgmicrobe.compound:oatmeal", "NCIT:C29298", "Oatmeal"],
+            ["kgmicrobe.compound:tween", "mesh:D011136", "Tween"],
+            ["kgmicrobe.compound:other", "CHEBI:15377", "water"],  # ignored prefix
+        ],
+    )
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    assert result["NCIT"] == {"NCIT:C29298"}
+    assert result["mesh"] == {"mesh:D011136"}
+
+
+def test_collect_normalizes_curie_case(tmp_path):
+    """``Mesh:D011136`` and ``ncit:C29298`` (wrong case) must collapse to the canonical case."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(
+        fixture,
+        ["subject_id", "object_id", "object_label"],
+        [
+            ["x", "ncit:C29298", "lowercase ncit"],
+            ["x", "Mesh:D011136", "mixed-case mesh"],
+        ],
+    )
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    # Case is normalized to the requested-prefix case.
+    assert result["NCIT"] == {"NCIT:C29298"}
+    assert result["mesh"] == {"mesh:D011136"}
+
+
+def test_collect_returns_empty_set_for_unreferenced_prefix(tmp_path):
+    """A prefix with no references in any file gets an empty set, not a missing key."""
+    fixture = tmp_path / "fixture.tsv"
+    _write_tsv(fixture, ["object_id"], [["CHEBI:15377"]])
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[fixture])
+    assert result == {"NCIT": set(), "mesh": set()}
+
+
+def test_collect_skips_missing_files_silently(tmp_path):
+    """Removing a mapping source from the repo must not break the collector."""
+    missing = tmp_path / "does-not-exist.tsv"
+    result = collect_stub_curies(["NCIT", "mesh"], mapping_paths=[missing])
+    assert result == {"NCIT": set(), "mesh": set()}
+
+
+def test_collect_handles_sssom_yaml_header(tmp_path):
+    """SSSOM-style ``# ...`` YAML metadata header lines must be skipped before the column header."""
+    fixture = tmp_path / "fixture.sssom.tsv"
+    fixture.write_text(
+        "# curie_map:\n"
+        "#   NCIT: 'http://purl.obolibrary.org/obo/NCIT_'\n"
+        "subject_id\tobject_id\tobject_label\n"
+        "x\tNCIT:C29298\tOatmeal\n",
+        encoding="utf-8",
+    )
+    result = collect_stub_curies(["NCIT"], mapping_paths=[fixture])
+    assert result["NCIT"] == {"NCIT:C29298"}
+
+
+def test_default_paths_yield_real_curies():
+    """Smoke-test against the committed mapping files: must find ≥1 NCIT and ≥1 mesh CURIE."""
+    if not any(p.is_file() for p in DEFAULT_MAPPING_PATHS):
+        pytest.skip("no mapping files present in this checkout")
+    result = collect_stub_curies(["NCIT", "mesh"])
+    assert len(result["NCIT"]) >= 1, "expected at least one NCIT CURIE in committed mappings"
+    assert len(result["mesh"]) >= 1, "expected at least one mesh CURIE in committed mappings"