Selective per-CURIE NCIT/MESH stub-import transform by realmarcin · Pull Request #565 · Knowledge-Graph-Hub/kg-microbe

realmarcin · 2026-05-12T04:41:54Z

Summary

Adds a new sibling transform ontologies_stubs that imports just the NCIT and MESH terms referenced by mappings/ — with full label, exact synonyms, and dbxrefs — without loading the rest of those ontologies (those belong to the sibling kg-microbe-biomedical pipeline).

Why

The chemical-mapping consolidator and the BacDive isolation-source mapper currently reference 70 NCIT and 92 MESH IDs as canonical xrefs for ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The existing STUB_ONTOLOGY_PREFIXES mechanism in isolation_source_mapping_utils.py was emitting label-only nodes for these inline from the BacDive transform, but:

- Chemical-mapping-driven NCIT/MESH refs (the bulk of the new MIM-curated entries) were producing dangling node IDs in the merged KG: no node row, no label, no xrefs.
- Even where stubs existed, they carried only the BacDive object_label, with no synonyms and no xrefs.

This PR fixes both. For each NCIT/MESH CURIE referenced anywhere under mappings/, OAK queries the local SemSQL DB (data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs, then writes a labelled stub node to data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the merged KG.
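
As a rough illustration of the per-CURIE fetch described above, the sketch below builds one stub row from an adapter-like object. It is not the code in this PR: `FakeAdapter`, `build_stub_row`, `join_pipe`, the `oio:hasDbXref` key, and the placeholder synonym and xref values are all assumptions made for the example. Only the method names `label`, `entity_aliases`, and `entity_metadata_map` come from the PR text.

```python
from typing import Dict, Iterable, List, Optional

# Assumption: dbxrefs are read from this predicate key in the metadata map.
XREF_KEY = "oio:hasDbXref"


class FakeAdapter:
    """In-memory stand-in for an OAK adapter (hypothetical, illustration only)."""

    def __init__(self, terms: Dict[str, dict]):
        self._terms = terms

    def label(self, curie: str) -> Optional[str]:
        return self._terms.get(curie, {}).get("label")

    def entity_aliases(self, curie: str) -> List[str]:
        return list(self._terms.get(curie, {}).get("aliases", []))

    def entity_metadata_map(self, curie: str) -> Dict[str, List[str]]:
        return {XREF_KEY: list(self._terms.get(curie, {}).get("xrefs", []))}


def join_pipe(values: Iterable[str]) -> str:
    """Pipe-join values, dropping empties and duplicates while keeping order."""
    kept: List[str] = []
    for value in values:
        if value and value not in kept:
            kept.append(value)
    return "|".join(kept)


def build_stub_row(adapter, curie: str) -> Dict[str, str]:
    """Return one stub-node row; falls back to the CURIE when no label exists."""
    label = adapter.label(curie) or curie
    # Drop aliases that merely repeat the label.
    aliases = [a for a in adapter.entity_aliases(curie) if a != label]
    xrefs = adapter.entity_metadata_map(curie).get(XREF_KEY, [])
    return {
        "id": curie,
        "name": label,
        "synonym": join_pipe(aliases),
        "xref": join_pipe(xrefs),
    }


# Placeholder data: the synonym and xref values are invented for the example.
adapter = FakeAdapter(
    {
        "NCIT:C29298": {
            "label": "Oatmeal",
            "aliases": ["Oatmeal", "placeholder synonym"],
            "xrefs": ["EXAMPLE:1"],
        }
    }
)
row = build_stub_row(adapter, "NCIT:C29298")
```

With a real SemSQL file, the fake would presumably be replaced by an adapter opened on `data/raw/ncit.db`; the column names in the returned row are likewise illustrative rather than the transform's actual header.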

Architecture

| Component | Purpose |
| --- | --- |
| `kg_microbe/utils/stub_curie_collection.py` | Discovers every NCIT/mesh CURIE referenced under `mappings/`. Explicit file list, case-normalized output. |
| `kg_microbe/transform_utils/ontologies_stubs/` | Per-CURIE OAK SemSQL fetch (label, aliases, xrefs); writes one node TSV per stub ontology. Same pattern as the chemical-mapping consolidator's `enrich_with_chebi_synonyms`. |
| `kg_microbe/transform.py` | Registers `ONTOLOGIES_STUBS` in `DATA_SOURCES` after `OntologiesTransform`. |
| `download.yaml` | Adds `ncit.db.gz` and `mesh.db.gz` from the standard `bbop-sqlite` distribution (~370 MB total, one-time). |
| `merge.yaml` / `merge.no_metatraits.yaml` / `merge_bakta.yaml` | Declare the new `ontologies_stubs` source. `merge.minimal.yaml` unchanged (skips ontology nodes entirely). |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Skips NCIT and mesh in the inline label-only stub emit (deferred to new transform); long-tail prefixes (PRIDE, PCO, GENEPIO, FAO, BTO, SNOMED; 1-3 IDs each) keep the inline path. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | `STUB_ONTOLOGY_PREFIXES` docstring rewritten to document the two stub-import paths. |
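
To make the collector row of the table more concrete, here is a minimal sketch of how NCIT/mesh CURIEs might be gathered from an explicit list of mapping TSVs. It is an assumption-laden illustration, not the PR's `collect_stub_curies`: the function name `collect_curies`, the regex, the choice to normalize only the prefix casing, and the treatment of `#` lines as the SSSOM YAML header are all guesses based on the description.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, Set

# Assumption: canonical prefix casing mirrors the "NCIT" / "mesh" spellings in this PR.
CANONICAL_PREFIX = {"ncit": "NCIT", "mesh": "mesh"}
CURIE_RE = re.compile(r"\b(ncit|mesh):([A-Za-z0-9_]+)", re.IGNORECASE)


def collect_curies(paths: Iterable[Path]) -> Dict[str, Set[str]]:
    """Scan mapping files and return CURIE sets keyed by canonical prefix."""
    found: Dict[str, Set[str]] = {p: set() for p in CANONICAL_PREFIX.values()}
    for path in paths:
        if not path.exists():
            # Missing mapping files are skipped rather than treated as fatal.
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("#"):
                    # SSSOM files carry a commented YAML header; ignore it.
                    continue
                for prefix, local_id in CURIE_RE.findall(line):
                    canonical = CANONICAL_PREFIX[prefix.lower()]
                    found[canonical].add(f"{canonical}:{local_id}")
    return found
```

Returning sets keyed by prefix keeps the output deduplicated, so a CURIE referenced by several mapping files would yield a single stub row.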

Tests

14 new tests: 13 unit tests pass, and 1 integration test is skipped until the real transform has run:

- tests/test_stub_curie_collection.py (6 tests): collector discovery, case normalization, missing files, SSSOM YAML header skipping, smoke test against real mappings.
- tests/test_ontologies_stubs.py (8 tests, 1 skipped pending real run): transform behaviour with an in-memory fake adapter. It round-trips label/synonyms/xrefs, falls back to the CURIE when the label is missing, drops alias-equals-label duplicates, writes a header-only file when there are no CURIEs, and raises loudly when neither .db nor .db.gz is present. Plus an integration test that asserts every collector-discovered CURIE has a corresponding stub-node row (skipped until the real transform run).
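
The "raises loudly when neither .db nor .db.gz is present" behaviour, together with the auto-decompression mentioned in the commit message, could look roughly like the sketch below. The helper name `ensure_semsql_db`, the temporary `.part` file, and the wording of the error are assumptions for illustration; the PR's actual implementation may differ.

```python
import gzip
import shutil
from pathlib import Path


def ensure_semsql_db(raw_dir: Path, name: str) -> Path:
    """Return the path to <name>.db, decompressing <name>.db.gz if needed."""
    db_path = raw_dir / f"{name}.db"
    gz_path = raw_dir / f"{name}.db.gz"
    if db_path.exists():
        return db_path
    if gz_path.exists():
        # Decompress to a temporary file first so an interrupted run
        # does not leave a truncated .db behind.
        tmp_path = db_path.with_name(db_path.name + ".part")
        with gzip.open(gz_path, "rb") as src, tmp_path.open("wb") as dst:
            shutil.copyfileobj(src, dst)
        tmp_path.replace(db_path)
        return db_path
    # No silent fallback: a missing DB would otherwise produce dangling xrefs.
    raise FileNotFoundError(
        f"Neither {db_path} nor {gz_path} exists; "
        "run `poetry run kg download` first."
    )
```

Failing here, rather than writing an empty stub file, keeps a missing download from reintroducing the dangling node IDs this PR is meant to remove.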

ruff clean.

Test plan

- `poetry run kg download`: fetches ncit.db.gz + mesh.db.gz (~370 MB total). One-time.
- `poetry run kg transform -s ontologies_stubs`: should report something like `[NCIT] wrote 70 stub nodes` and `[mesh] wrote 92 stub nodes` on a current checkout.
- `wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv`: expect ≥71 / ≥93 lines (CURIE count + 1 header).
- `poetry run pytest tests/test_stub_curie_collection.py tests/test_ontologies_stubs.py -v`: all 14 tests green (the integration test is no longer skipped after the transform has run).

Spot-check enrichment:

awk -F'\t' '$1=="NCIT:C29298" {print}' data/transformed/ontologies_stubs/ncit_nodes.tsv
# → name=Oatmeal, synonym=…, xref=…

Full merge + dangling-edge check:

poetry run kg merge -y merge.yaml
python3 -c "
import gzip, csv
with gzip.open('data/merged/merged-kg_nodes.tsv.gz', 'rt') as f:
    nodes = {row['id'] for row in csv.DictReader(f, delimiter='\t')}
with gzip.open('data/merged/merged-kg_edges.tsv.gz', 'rt') as f:
    dangling = [r for r in csv.DictReader(f, delimiter='\t')
                if (r['subject'].split(':',1)[0] in {'NCIT','mesh'} and r['subject'] not in nodes)
                or (r['object'].split(':',1)[0] in {'NCIT','mesh'} and r['object'] not in nodes)]
print('Dangling NCIT/mesh edges:', len(dangling))
assert not dangling
"

Plan

The full design plan (with alternatives considered, scope estimate, edge cases) lives at /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md.

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a new ontologies_stubs transform to generate enriched stub nodes for only the NCIT/MESH CURIEs referenced in mappings/, preventing dangling NCIT/mesh references in the merged KG without loading the full ontologies.

Changes:

- Introduces OntologiesStubsTransform (SemSQL/OAK-backed) plus a mapping-CURIE collector to discover referenced NCIT/mesh IDs.
- Updates merge configs to include the new stub-node TSVs and adjusts BacDive to stop emitting inline NCIT/mesh label-only stubs.
- Adds downloads for ncit.db.gz and mesh.db.gz, and adds unit tests for both collection and transform behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `tests/test_stub_curie_collection.py` | New unit tests for collecting stub CURIEs from mapping TSVs (including SSSOM YAML header handling). |
| `tests/test_ontologies_stubs.py` | New unit tests for the stub transform using an in-memory fake OAK adapter plus an optional integration-style assertion. |
| `merge.yaml` | Includes `ontologies_stubs` node TSVs as an additional merge source. |
| `merge.no_metatraits.yaml` | Includes `ontologies_stubs` node TSVs as an additional merge source. |
| `merge_bakta.yaml` | Includes `ontologies_stubs` node TSVs as an additional merge source. |
| `kg_microbe/utils/stub_curie_collection.py` | New collector that scans explicit mapping files for NCIT/mesh CURIE usage. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | Updates stub-prefix documentation to describe the new enriched stub path vs inline stubs. |
| `kg_microbe/transform.py` | Registers the new `ONTOLOGIES_STUBS` data source in the transform pipeline. |
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Implements the SemSQL/OAK-backed stub-node generation for referenced NCIT/mesh terms. |
| `kg_microbe/transform_utils/ontologies_stubs/__init__.py` | Exposes the new transform package. |
| `kg_microbe/transform_utils/constants.py` | Adds the `ONTOLOGIES_STUBS` source-name constant. |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Stops inline stub-node emission for NCIT/mesh (delegates to new transform), keeps long-tail prefixes inline. |
| `download.yaml` | Adds download entries for `ncit.db.gz` and `mesh.db.gz`. |


+    def run(self, data_file=None) -> None:  # noqa: D401 — base class signature
+        """
+        Collect stub CURIEs, fetch metadata via OAK, write per-ontology node TSVs.
+
+        :param data_file: Unused (kept for the base-class signature). The
+            transform discovers its inputs from the mapping TSVs and the
+            SemSQL DBs in ``input_base_dir``.
+        """
+        prefixes = list(STUB_ONTOLOGY_SOURCES.keys())
+        curies_by_prefix = collect_stub_curies(prefixes)
+


+                label,                      # name
+                None,                        # description
+                _join_pipe(xrefs),          # xref
+                ONTOLOGIES_STUBS_SOURCE_NAME,  # provided_by


Adds a new sibling transform `ontologies_stubs` that imports just the NCIT and MESH terms referenced by mappings/, with full label, exact synonyms, and dbxrefs, without loading the rest of those ontologies (those belong to the sibling kg-microbe-biomedical pipeline).

Today the chemical-mapping consolidator and the BacDive isolation-source mapper reference 70 NCIT and 92 MESH IDs as canonical xrefs for ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The existing STUB_ONTOLOGY_PREFIXES mechanism in isolation_source_mapping_utils.py was emitting label-only nodes for these inline from the BacDive transform, but:

(a) the chemical-mapping-driven NCIT/MESH refs were producing dangling node ids in the merged KG (no node row, label, or xrefs), and
(b) even where stubs existed they carried only the BacDive object_label, no synonyms, no xrefs.

This transform fixes both. For each NCIT/MESH CURIE referenced anywhere under mappings/, OAK queries the local SemSQL DB (data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs, then writes a labelled stub node to data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the merged KG.

Components:

- kg_microbe/utils/stub_curie_collection.py (new): collect_stub_curies() scans an explicit list of mapping TSVs (unified SSSOM, isolation-source, MIM, canonical/*) and returns per-prefix CURIE sets with case normalization. Currently surfaces 70 NCIT + 92 MESH CURIEs.
- kg_microbe/transform_utils/ontologies_stubs/ (new): per-CURIE OAK SemSQL fetch following the same pattern as the chemical-mapping consolidator's enrich_with_chebi_synonyms (label, entity_aliases, entity_metadata_map for dbxrefs). Auto-decompresses .db.gz on first run if the unzipped .db is missing. Fails loudly with an actionable message if neither is present (no silent dangling-xref fallback).
- kg_microbe/transform.py: registers ONTOLOGIES_STUBS in DATA_SOURCES, ordered after OntologiesTransform so the SemSQL DBs are present.
- download.yaml: adds ncit.db.gz and mesh.db.gz from s3.amazonaws.com/bbop-sqlite (the standard SemSQL distribution; same source the OAK `sqlite:obo:` shim hits).
- merge.yaml + merge.no_metatraits.yaml + merge_bakta.yaml: declare the new ontologies_stubs source so the merged KG picks up the stub nodes. merge.minimal.yaml unchanged (it skips ontology nodes entirely; not a target for stub enrichment).
- kg_microbe/transform_utils/bacdive/bacdive.py: BacDive's inline stub-emit at lines 2990-3003 now skips NCIT and mesh prefixes (deferred to the new transform). Long-tail prefixes (PRIDE, PCO, GENEPIO, FAO, BTO, SNOMED; 1-3 IDs each) keep the inline label-only fallback. Build-time prefix validator unchanged.
- kg_microbe/utils/isolation_source_mapping_utils.py: STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two stub-import paths (SemSQL-enriched for NCIT/mesh, inline label-only for the long-tail).

Tests:

- tests/test_stub_curie_collection.py: collector unit tests covering CURIE discovery, case normalization, missing files, SSSOM YAML header skipping, and a smoke test against the real committed mappings.
- tests/test_ontologies_stubs.py: transform unit tests with an in-memory fake adapter. They round-trip label/synonyms/xrefs, fall back to the CURIE when the label is missing, drop alias-equals-label duplicates, write a header-only file when there are no CURIEs, and raise loudly when neither .db nor .db.gz is present. Plus an integration test (skipped when stub output absent) that asserts every collector-discovered CURIE has a corresponding stub-node row.

Verified: 13 unit tests pass; ruff clean.

End-to-end verification (requires `poetry run kg download` to fetch ncit.db.gz + mesh.db.gz, ~370 MB total, one-time):

    poetry run kg transform -s ontologies_stubs
    wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv
    poetry run pytest tests/test_ontologies_stubs.py -v   # integration test no longer skipped

Plan: /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 12, 2026 04:41

Copilot started reviewing on behalf of realmarcin May 12, 2026 04:42

Copilot AI reviewed May 12, 2026


realmarcin force-pushed the ncit-mesh-stub-import branch from 7ebf594 to 770865d May 20, 2026 00:54

realmarcin merged commit d2b03ea into master May 20, 2026
3 checks passed

realmarcin deleted the ncit-mesh-stub-import branch May 20, 2026 00:58

realmarcin mentioned this pull request May 20, 2026

Extend stub-import transform to cover BTO #570

Merged

4 tasks

realmarcin merged 1 commit into master from ncit-mesh-stub-import
