Selective per-CURIE NCIT/MESH stub-import transform #565

Merged
realmarcin merged 1 commit into master from ncit-mesh-stub-import
May 20, 2026

@realmarcin
Collaborator

Summary

Adds a new sibling transform ontologies_stubs that imports just the NCIT and MESH terms referenced by mappings/ — with full label, exact synonyms, and dbxrefs — without loading the rest of those ontologies (those belong to the sibling kg-microbe-biomedical pipeline).

Why

The chemical-mapping consolidator and the BacDive isolation-source mapper currently reference 70 NCIT and 92 MESH IDs as canonical xrefs for ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The existing STUB_ONTOLOGY_PREFIXES mechanism in isolation_source_mapping_utils.py was emitting label-only nodes for these inline from the BacDive transform, but:

  1. Chemical-mapping–driven NCIT/MESH refs (the bulk of the new MIM-curated entries) were producing dangling node ids in the merged KG — no node row, no label, no xrefs.
  2. Even where stubs existed they carried only the BacDive object_label, no synonyms, no xrefs.

This PR fixes both. For each NCIT/MESH CURIE referenced anywhere under mappings/, OAK queries the local SemSQL DB (data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs, then writes a labelled stub node to data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the merged KG.
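
The per-CURIE lookup described above might look roughly like the following sketch. It assumes an OAK-style adapter exposing `label()`, `entity_aliases()`, and `entity_metadata_map()`. The `StubMetadata` container, the `fetch_stub_metadata` name, the `oio:hasDbXref` metadata key, and the `FakeAdapter` stand-in are illustrative assumptions rather than code from this PR, and the term data is placeholder only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubMetadata:
    """Metadata for one stub node (hypothetical container, not from the PR)."""

    curie: str
    label: Optional[str]
    aliases: List[str] = field(default_factory=list)
    xrefs: List[str] = field(default_factory=list)


def fetch_stub_metadata(adapter, curie, xref_key="oio:hasDbXref"):
    """Look up label, aliases, and dbxrefs for a single CURIE.

    `adapter` is anything exposing label(), entity_aliases(), and
    entity_metadata_map(), as OAK adapters do. `xref_key` is an assumed
    metadata key for dbxrefs.
    """
    label = adapter.label(curie)
    aliases = sorted(set(adapter.entity_aliases(curie) or []))
    metadata = adapter.entity_metadata_map(curie) or {}
    xrefs = sorted(set(metadata.get(xref_key, [])))
    return StubMetadata(curie, label, aliases, xrefs)


class FakeAdapter:
    """In-memory stand-in so the sketch runs without oaklib or a SemSQL DB."""

    def __init__(self, terms):
        self._terms = terms

    def label(self, curie):
        return self._terms.get(curie, {}).get("label")

    def entity_aliases(self, curie):
        return self._terms.get(curie, {}).get("aliases", [])

    def entity_metadata_map(self, curie):
        return {"oio:hasDbXref": self._terms.get(curie, {}).get("xrefs", [])}


# With oaklib installed, the adapter would presumably come from something like
#   from oaklib import get_adapter
#   adapter = get_adapter("sqlite:data/raw/ncit.db")
# Placeholder data only; these are not real ontology terms.
adapter = FakeAdapter(
    {"EX:0001": {"label": "example term", "aliases": ["sample term"], "xrefs": ["EX2:42"]}}
)
known = fetch_stub_metadata(adapter, "EX:0001")
unknown = fetch_stub_metadata(adapter, "EX:9999")
print(known)
print(unknown)
```

Passing the adapter in as a parameter keeps the lookup testable with a fake, which mirrors the in-memory fake adapter mentioned in the Tests section.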

Architecture

| Component | Purpose |
| --- | --- |
| kg_microbe/utils/stub_curie_collection.py | Discovers every NCIT/mesh CURIE referenced under mappings/. Explicit file list, case-normalized output. |
| kg_microbe/transform_utils/ontologies_stubs/ | Per-CURIE OAK SemSQL fetch (label, aliases, xrefs); writes one node TSV per stub ontology. Same pattern as the chemical-mapping consolidator's enrich_with_chebi_synonyms. |
| kg_microbe/transform.py | Registers ONTOLOGIES_STUBS in DATA_SOURCES after OntologiesTransform. |
| download.yaml | Adds ncit.db.gz and mesh.db.gz from the standard bbop-sqlite distribution (~370 MB total, one-time). |
| merge.yaml / merge.no_metatraits.yaml / merge_bakta.yaml | Declare the new ontologies_stubs source. merge.minimal.yaml unchanged (skips ontology nodes entirely). |
| kg_microbe/transform_utils/bacdive/bacdive.py | Skips NCIT and mesh in the inline label-only stub emit (deferred to new transform); long-tail prefixes (PRIDE, PCO, GENEPIO, FAO, BTO, SNOMED; 1-3 IDs each) keep the inline path. |
| kg_microbe/utils/isolation_source_mapping_utils.py | STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two stub-import paths. |
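
As a rough illustration of the collector row in the table above, the sketch below scans an explicit list of TSV files, skips `#`-prefixed SSSOM YAML header lines, ignores missing files, and normalizes prefix casing. The function name `collect_curies_sketch`, the regular expression, and the canonical casing map (upper-case `NCIT`, lower-case `mesh`, inferred from the examples in this PR) are assumptions; the real `collect_stub_curies()` may differ.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed canonical casing, inferred from the CURIE examples in this PR.
CANONICAL_PREFIX = {"ncit": "NCIT", "mesh": "mesh"}


def collect_curies_sketch(files, prefixes=("NCIT", "mesh")):
    """Return {canonical_prefix: set of CURIEs} found in the given TSV files."""
    wanted = sorted({p.lower() for p in prefixes})
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in wanted) + r"):([A-Za-z0-9_]+)",
        re.IGNORECASE,
    )
    found = defaultdict(set)
    for path in map(Path, files):
        if not path.is_file():
            continue  # missing mapping files are skipped, not fatal
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("#"):
                    continue  # SSSOM YAML header lines are commented with '#'
                for prefix, local_id in pattern.findall(line):
                    canonical = CANONICAL_PREFIX.get(prefix.lower(), prefix)
                    found[canonical].add(f"{canonical}:{local_id}")
    return dict(found)


# Demo on a throwaway SSSOM-like file plus one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    mapping = Path(tmp) / "example.sssom.tsv"
    mapping.write_text(
        "#curie_map:\n"
        "#  NCIT: http://purl.obolibrary.org/obo/NCIT_\n"
        "subject_id\tobject_id\tobject_label\n"
        "EX:1\tncit:C29298\tOatmeal\n"
        "EX:2\tMESH:D011136\tTween\n"
        "EX:3\tNCIT:C29298\tOatmeal\n",
        encoding="utf-8",
    )
    result = collect_curies_sketch([mapping, Path(tmp) / "missing.tsv"])

print(result)
```

Returning sets per prefix deduplicates CURIEs that appear in several mapping files or in different casings.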

Tests

14 new tests: 13 unit tests, all passing, plus 1 integration test that is skipped until the real transform has run:

  • tests/test_stub_curie_collection.py (6 tests): collector discovery, case normalization, missing files, SSSOM YAML header skipping, smoke test against real mappings.
  • tests/test_ontologies_stubs.py (8 tests, 1 skipped pending real run): transform behaviour with in-memory fake adapter — round-trips label/synonyms/xrefs, falls back to CURIE when label missing, drops alias-equals-label duplicates, writes header-only file when no CURIEs, raises loudly when neither .db nor .db.gz is present. Plus an integration test that asserts every collector-discovered CURIE has a corresponding stub-node row (skipped until the real transform run).

ruff clean.
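
The row-level behaviours these tests cover (CURIE fallback for a missing label, dropping aliases that equal the label, header-only output when there are no CURIEs) could be sketched as follows. The column list, the `provided_by` value, and the helper names are assumptions for illustration; the PR's actual column order and its `ONTOLOGIES_STUBS_SOURCE_NAME` value are not shown here.

```python
import csv
import io

# Assumed column layout; the transform's real header may differ.
NODE_COLUMNS = ["id", "name", "description", "xref", "provided_by", "synonym"]


def build_stub_row(curie, label, aliases, xrefs, source="ontologies_stubs"):
    """Assemble one node row, pipe-joining multi-valued fields."""
    name = label or curie  # fall back to the CURIE when no label was found
    synonyms = []
    for alias in aliases:
        # Drop empty aliases, aliases identical to the name, and repeats.
        if alias and alias != name and alias not in synonyms:
            synonyms.append(alias)
    return [curie, name, "", "|".join(xrefs), source, "|".join(synonyms)]


def write_stub_nodes(handle, rows):
    """Write the header plus any rows; an empty row list yields a header-only file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(NODE_COLUMNS)
    writer.writerows(rows)


# Placeholder terms only; these are not real ontology entries.
labelled = build_stub_row(
    "EX:0001", "example term", ["example term", "sample term", "sample term"], ["EX2:42", "EX2:43"]
)
unlabelled = build_stub_row("EX:0002", None, [], [])

empty_buffer = io.StringIO()
write_stub_nodes(empty_buffer, [])

full_buffer = io.StringIO()
write_stub_nodes(full_buffer, [labelled, unlabelled])
print(full_buffer.getvalue())
```

The alias comparison here is an exact string match; whether the real transform compares case-insensitively is not stated in the PR.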

Test plan

  • poetry run kg download — fetches ncit.db.gz + mesh.db.gz (~370 MB total). One-time.
  • poetry run kg transform -s ontologies_stubs — should report something like [NCIT] wrote 70 stub nodes and [mesh] wrote 92 stub nodes on a current checkout.
  • wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv — expect ≥71 / ≥93 lines (CURIE count + 1 header).
  • poetry run pytest tests/test_stub_curie_collection.py tests/test_ontologies_stubs.py -v — all 14 tests green (the integration test is no longer skipped after the transform has run).
  • Spot-check enrichment:
    awk -F'\t' '$1=="NCIT:C29298" {print}' data/transformed/ontologies_stubs/ncit_nodes.tsv
    # → name=Oatmeal, synonym=…, xref=…
  • Full merge + dangling-edge check:
    poetry run kg merge -y merge.yaml
    python3 -c "
    import gzip, csv
    with gzip.open('data/merged/merged-kg_nodes.tsv.gz', 'rt') as f:
        nodes = {row['id'] for row in csv.DictReader(f, delimiter='\t')}
    with gzip.open('data/merged/merged-kg_edges.tsv.gz', 'rt') as f:
        dangling = [r for r in csv.DictReader(f, delimiter='\t')
                    if (r['subject'].split(':',1)[0] in {'NCIT','mesh'} and r['subject'] not in nodes)
                    or (r['object'].split(':',1)[0] in {'NCIT','mesh'} and r['object'] not in nodes)]
    print('Dangling NCIT/mesh edges:', len(dangling))
    assert not dangling
    "

Plan

The full design plan (with alternatives considered, scope estimate, edge cases) lives at /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 12, 2026 04:41
Contributor

Copilot AI left a comment

Pull request overview

Adds a new ontologies_stubs transform to generate enriched stub nodes for only the NCIT/MESH CURIEs referenced in mappings/, preventing dangling NCIT/mesh references in the merged KG without loading the full ontologies.

Changes:

  • Introduces OntologiesStubsTransform (SemSQL/OAK-backed) plus a mapping-CURIE collector to discover referenced NCIT/mesh IDs.
  • Updates merge configs to include the new stub-node TSVs and adjusts BacDive to stop emitting inline NCIT/mesh label-only stubs.
  • Adds downloads for ncit.db.gz and mesh.db.gz, and adds unit tests for both collection and transform behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_stub_curie_collection.py | New unit tests for collecting stub CURIEs from mapping TSVs (including SSSOM YAML header handling). |
| tests/test_ontologies_stubs.py | New unit tests for the stub transform using an in-memory fake OAK adapter plus an optional integration-style assertion. |
| merge.yaml | Includes ontologies_stubs node TSVs as an additional merge source. |
| merge.no_metatraits.yaml | Includes ontologies_stubs node TSVs as an additional merge source. |
| merge_bakta.yaml | Includes ontologies_stubs node TSVs as an additional merge source. |
| kg_microbe/utils/stub_curie_collection.py | New collector that scans explicit mapping files for NCIT/mesh CURIE usage. |
| kg_microbe/utils/isolation_source_mapping_utils.py | Updates stub-prefix documentation to describe the new enriched stub path vs inline stubs. |
| kg_microbe/transform.py | Registers the new ONTOLOGIES_STUBS data source in the transform pipeline. |
| kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py | Implements the SemSQL/OAK-backed stub-node generation for referenced NCIT/mesh terms. |
| kg_microbe/transform_utils/ontologies_stubs/__init__.py | Exposes the new transform package. |
| kg_microbe/transform_utils/constants.py | Adds the ONTOLOGIES_STUBS source-name constant. |
| kg_microbe/transform_utils/bacdive/bacdive.py | Stops inline stub-node emission for NCIT/mesh (delegates to new transform), keeps long-tail prefixes inline. |
| download.yaml | Adds download entries for ncit.db.gz and mesh.db.gz. |

Comment on lines +94 to +104
def run(self, data_file=None) -> None:  # noqa: D401 - base class signature
    """
    Collect stub CURIEs, fetch metadata via OAK, write per-ontology node TSVs.

    :param data_file: Unused (kept for the base-class signature). The
        transform discovers its inputs from the mapping TSVs and the
        SemSQL DBs in ``input_base_dir``.
    """
    prefixes = list(STUB_ONTOLOGY_SOURCES.keys())
    curies_by_prefix = collect_stub_curies(prefixes)

label, # name
None, # description
_join_pipe(xrefs), # xref
ONTOLOGIES_STUBS_SOURCE_NAME, # provided_by
Adds a new sibling transform `ontologies_stubs` that imports just the
NCIT and MESH terms referenced by mappings/ — with full label, exact
synonyms, and dbxrefs — without loading the rest of those ontologies
(those belong to the sibling kg-microbe-biomedical pipeline).

Today the chemical-mapping consolidator and the BacDive isolation-source
mapper reference 70 NCIT and 92 MESH IDs as canonical xrefs for
ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The
existing STUB_ONTOLOGY_PREFIXES mechanism in
isolation_source_mapping_utils.py was emitting label-only nodes for
these inline from the BacDive transform, but: (a) the chemical-mapping-
driven NCIT/MESH refs were producing dangling node ids in the merged
KG (no node row, label, or xrefs), and (b) even where stubs existed
they carried only the BacDive object_label, no synonyms, no xrefs.

This transform fixes both. For each NCIT/MESH CURIE referenced
anywhere under mappings/, OAK queries the local SemSQL DB
(data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs,
then writes a labelled stub node to data/transformed/ontologies_stubs/
{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the
merged KG.

Components:

- kg_microbe/utils/stub_curie_collection.py (new): collect_stub_curies()
  scans an explicit list of mapping TSVs (unified SSSOM, isolation-
  source, MIM, canonical/*) and returns per-prefix CURIE sets with case
  normalization. Currently surfaces 70 NCIT + 92 MESH CURIEs.

- kg_microbe/transform_utils/ontologies_stubs/ (new): per-CURIE OAK
  SemSQL fetch following the same pattern as the chemical-mapping
  consolidator's enrich_with_chebi_synonyms (label, entity_aliases,
  entity_metadata_map for dbxrefs). Auto-decompresses .db.gz on first
  run if the unzipped .db is missing. Fails loudly with an actionable
  message if neither is present (no silent dangling-xref fallback).

- kg_microbe/transform.py: registers ONTOLOGIES_STUBS in DATA_SOURCES,
  ordered after OntologiesTransform so the SemSQL DBs are present.

- download.yaml: adds ncit.db.gz and mesh.db.gz from s3.amazonaws.com/
  bbop-sqlite (the standard SemSQL distribution; same source the OAK
  `sqlite:obo:` shim hits).

- merge.yaml + merge.no_metatraits.yaml + merge_bakta.yaml: declare
  the new ontologies_stubs source so the merged KG picks up the stub
  nodes. merge.minimal.yaml unchanged (it skips ontology nodes
  entirely; not a target for stub enrichment).

- kg_microbe/transform_utils/bacdive/bacdive.py: BacDive's inline
  stub-emit at lines 2990-3003 now skips NCIT and mesh prefixes
  (deferred to the new transform). Long-tail prefixes (PRIDE, PCO,
  GENEPIO, FAO, BTO, SNOMED — 1-3 IDs each) keep the inline label-
  only fallback. Build-time prefix validator unchanged.

- kg_microbe/utils/isolation_source_mapping_utils.py:
  STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two
  stub-import paths (SemSQL-enriched for NCIT/mesh, inline label-only
  for the long-tail).
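
The decompress-or-fail behaviour described for the ontologies_stubs component above might be sketched like this. The helper name `ensure_semsql_db`, the temporary `.partial` file, and the wording of the error message are assumptions, not code taken from this commit.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def ensure_semsql_db(raw_dir, name):
    """Return the path to <name>.db, gunzipping <name>.db.gz if needed."""
    raw_dir = Path(raw_dir)
    db_path = raw_dir / f"{name}.db"
    gz_path = raw_dir / f"{name}.db.gz"
    if db_path.exists():
        return db_path
    if gz_path.exists():
        # Decompress to a temporary name first so an interrupted run
        # does not leave a truncated .db behind.
        partial = raw_dir / f"{name}.db.partial"
        with gzip.open(gz_path, "rb") as src, open(partial, "wb") as dst:
            shutil.copyfileobj(src, dst)
        partial.replace(db_path)
        return db_path
    raise FileNotFoundError(
        f"Neither {db_path} nor {gz_path} exists; run `poetry run kg download` first."
    )


# Demo with a fake archive; the bytes are a placeholder, not a real SQLite file.
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "ncit.db.gz", "wb") as archive:
        archive.write(b"placeholder bytes")
    restored = ensure_semsql_db(tmp, "ncit").read_bytes()
    try:
        ensure_semsql_db(tmp, "mesh")
        missing_error = None
    except FileNotFoundError as err:
        missing_error = str(err)

print(restored, missing_error)
```

Raising instead of returning an empty result matches the stated goal of having no silent dangling-xref fallback.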

Tests:

- tests/test_stub_curie_collection.py: collector unit tests covering
  CURIE discovery, case normalization, missing files, SSSOM YAML
  header skipping, and a smoke test against the real committed
  mappings.

- tests/test_ontologies_stubs.py: transform unit tests with an
  in-memory fake adapter — round-trips label/synonyms/xrefs, falls
  back to CURIE when label missing, drops alias-equals-label
  duplicates, writes header-only file when no CURIEs, and raises
  loudly when neither .db nor .db.gz is present. Plus an integration
  test (skipped when stub output absent) that asserts every
  collector-discovered CURIE has a corresponding stub-node row.
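
The integration assertion described above amounts to a set difference between the collected CURIEs and the `id` column of the stub node TSV. A minimal sketch, assuming the node file has an `id` column and using a hypothetical helper name:

```python
import csv
import io


def missing_stub_curies(expected_curies, nodes_tsv_handle):
    """Return the expected CURIEs that have no row in the stub node TSV."""
    reader = csv.DictReader(nodes_tsv_handle, delimiter="\t")
    present = {row["id"] for row in reader}
    return sorted(set(expected_curies) - present)


# Placeholder node file with a single stub row.
nodes_tsv = io.StringIO("id\tname\nNCIT:C29298\tOatmeal\n")
missing = missing_stub_curies({"NCIT:C29298", "mesh:D011136"}, nodes_tsv)
print(missing)
```

An empty list means every collected CURIE has a stub row; the test would then assert that the list is empty.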

Verified: 13 unit tests pass; ruff clean.

End-to-end verification (requires `poetry run kg download` to fetch
ncit.db.gz + mesh.db.gz, ~370 MB total, one-time):

  poetry run kg transform -s ontologies_stubs
  wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv
  poetry run pytest tests/test_ontologies_stubs.py -v   # integration test no longer skipped

Plan: /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin force-pushed the ncit-mesh-stub-import branch from 7ebf594 to 770865d on May 20, 2026 00:54
realmarcin merged commit d2b03ea into master on May 20, 2026
3 checks passed
realmarcin deleted the ncit-mesh-stub-import branch on May 20, 2026 00:58
2 participants