Selective per-CURIE NCIT/MESH stub-import transform #565
Merged
Pull request overview
Adds a new ontologies_stubs transform to generate enriched stub nodes for only the NCIT/MESH CURIEs referenced in mappings/, preventing dangling NCIT/mesh references in the merged KG without loading the full ontologies.
Changes:
- Introduces `OntologiesStubsTransform` (SemSQL/OAK-backed) plus a mapping-CURIE collector to discover referenced NCIT/mesh IDs.
- Updates merge configs to include the new stub-node TSVs and adjusts BacDive to stop emitting inline NCIT/mesh label-only stubs.
- Adds downloads for `ncit.db.gz` and `mesh.db.gz`, and adds unit tests for both collection and transform behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_stub_curie_collection.py` | New unit tests for collecting stub CURIEs from mapping TSVs (including SSSOM YAML header handling). |
| `tests/test_ontologies_stubs.py` | New unit tests for the stub transform using an in-memory fake OAK adapter plus an optional integration-style assertion. |
| `merge.yaml` | Includes ontologies_stubs node TSVs as an additional merge source. |
| `merge.no_metatraits.yaml` | Includes ontologies_stubs node TSVs as an additional merge source. |
| `merge_bakta.yaml` | Includes ontologies_stubs node TSVs as an additional merge source. |
| `kg_microbe/utils/stub_curie_collection.py` | New collector that scans explicit mapping files for NCIT/mesh CURIE usage. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | Updates stub-prefix documentation to describe the new enriched stub path vs inline stubs. |
| `kg_microbe/transform.py` | Registers the new ONTOLOGIES_STUBS data source in the transform pipeline. |
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Implements the SemSQL/OAK-backed stub-node generation for referenced NCIT/mesh terms. |
| `kg_microbe/transform_utils/ontologies_stubs/__init__.py` | Exposes the new transform package. |
| `kg_microbe/transform_utils/constants.py` | Adds the ONTOLOGIES_STUBS source-name constant. |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Stops inline stub-node emission for NCIT/mesh (delegates to new transform), keeps long-tail prefixes inline. |
| `download.yaml` | Adds download entries for ncit.db.gz and mesh.db.gz. |
Comment on lines +94 to +104

    def run(self, data_file=None) -> None:  # noqa: D401 - base class signature
        """
        Collect stub CURIEs, fetch metadata via OAK, write per-ontology node TSVs.

        :param data_file: Unused (kept for the base-class signature). The
            transform discovers its inputs from the mapping TSVs and the
            SemSQL DBs in ``input_base_dir``.
        """
        prefixes = list(STUB_ONTOLOGY_SOURCES.keys())
        curies_by_prefix = collect_stub_curies(prefixes)

Second commented fragment:

    label,  # name
    None,  # description
    _join_pipe(xrefs),  # xref
    ONTOLOGIES_STUBS_SOURCE_NAME,  # provided_by

Adds a new sibling transform `ontologies_stubs` that imports just the
NCIT and MESH terms referenced by mappings/ — with full label, exact
synonyms, and dbxrefs — without loading the rest of those ontologies
(those belong to the sibling kg-microbe-biomedical pipeline).
Today the chemical-mapping consolidator and the BacDive isolation-source
mapper reference 70 NCIT and 92 MESH IDs as canonical xrefs for
ingredients (e.g. NCIT:C29298 'Oatmeal', mesh:D011136 'Tween'). The
existing STUB_ONTOLOGY_PREFIXES mechanism in
isolation_source_mapping_utils.py was emitting label-only nodes for
these inline from the BacDive transform, but: (a) the chemical-mapping-
driven NCIT/MESH refs were producing dangling node ids in the merged
KG (no node row, label, or xrefs), and (b) even where stubs existed
they carried only the BacDive object_label, no synonyms, no xrefs.
This transform fixes both. For each NCIT/MESH CURIE referenced
anywhere under mappings/, OAK queries the local SemSQL DB
(data/raw/{ncit,mesh}.db) for rdfs:label, exact synonyms, and dbxrefs,
then writes a labelled stub node to data/transformed/ontologies_stubs/
{ncit,mesh}_nodes.tsv. The DBs themselves are never loaded into the
merged KG.
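
The per-CURIE fetch described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_stub_row`, `join_pipe`, `FakeAdapter`, the column order, the `oio:hasDbXref` metadata key, and the `biolink:OntologyClass` category are all assumptions, and the synonym and xref values are made up. Only the three adapter method names (`label`, `entity_aliases`, `entity_metadata_map`) come from the description.

```python
from typing import Dict, Iterable, List, Optional

DBXREF_KEY = "oio:hasDbXref"      # assumed metadata key for dbxrefs
STUB_SOURCE = "ontologies_stubs"  # stand-in for ONTOLOGIES_STUBS_SOURCE_NAME


def join_pipe(values: Iterable[str]) -> Optional[str]:
    """Join unique, non-empty values with '|' in first-seen order; None if empty."""
    unique = list(dict.fromkeys(v for v in values if v))
    return "|".join(unique) if unique else None


def build_stub_row(adapter, curie: str, category: str) -> List[Optional[str]]:
    """Build one row: id, category, name, description, xref, provided_by, synonym."""
    label = adapter.label(curie) or curie  # fall back to the CURIE itself
    # Drop aliases that merely repeat the label (exact match is an assumption).
    aliases = [a for a in (adapter.entity_aliases(curie) or []) if a != label]
    xrefs = (adapter.entity_metadata_map(curie) or {}).get(DBXREF_KEY, [])
    return [curie, category, label, None, join_pipe(xrefs), STUB_SOURCE, join_pipe(aliases)]


class FakeAdapter:
    """In-memory stand-in exposing the three adapter methods used above."""

    def __init__(self, terms: Dict[str, dict]):
        self._terms = terms

    def label(self, curie):
        return self._terms.get(curie, {}).get("label")

    def entity_aliases(self, curie):
        return self._terms.get(curie, {}).get("aliases", [])

    def entity_metadata_map(self, curie):
        return {DBXREF_KEY: self._terms.get(curie, {}).get("xrefs", [])}


# Made-up aliases and xrefs, purely for illustration.
fake = FakeAdapter({
    "NCIT:C29298": {
        "label": "Oatmeal",
        "aliases": ["Oatmeal", "Example alias"],
        "xrefs": ["EXAMPLE:1"],
    },
})
row = build_stub_row(fake, "NCIT:C29298", "biolink:OntologyClass")
unknown = build_stub_row(fake, "NCIT:C0000", "biolink:OntologyClass")
```

Passing the adapter in as a parameter keeps the row-building logic testable without a SemSQL database on disk.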
Components:
- kg_microbe/utils/stub_curie_collection.py (new): collect_stub_curies()
scans an explicit list of mapping TSVs (unified SSSOM, isolation-
source, MIM, canonical/*) and returns per-prefix CURIE sets with case
normalization. Currently surfaces 70 NCIT + 92 MESH CURIEs.
- kg_microbe/transform_utils/ontologies_stubs/ (new): per-CURIE OAK
SemSQL fetch following the same pattern as the chemical-mapping
consolidator's enrich_with_chebi_synonyms (label, entity_aliases,
entity_metadata_map for dbxrefs). Auto-decompresses .db.gz on first
run if the unzipped .db is missing. Fails loudly with an actionable
message if neither is present (no silent dangling-xref fallback).
- kg_microbe/transform.py: registers ONTOLOGIES_STUBS in DATA_SOURCES,
ordered after OntologiesTransform so the SemSQL DBs are present.
- download.yaml: adds ncit.db.gz and mesh.db.gz from s3.amazonaws.com/
bbop-sqlite (the standard SemSQL distribution; same source the OAK
`sqlite:obo:` shim hits).
- merge.yaml + merge.no_metatraits.yaml + merge_bakta.yaml: declare
the new ontologies_stubs source so the merged KG picks up the stub
nodes. merge.minimal.yaml unchanged (it skips ontology nodes
entirely; not a target for stub enrichment).
- kg_microbe/transform_utils/bacdive/bacdive.py: BacDive's inline
stub-emit at lines 2990-3003 now skips NCIT and mesh prefixes
(deferred to the new transform). Long-tail prefixes (PRIDE, PCO,
GENEPIO, FAO, BTO, SNOMED — 1-3 IDs each) keep the inline label-
only fallback. Build-time prefix validator unchanged.
- kg_microbe/utils/isolation_source_mapping_utils.py:
STUB_ONTOLOGY_PREFIXES docstring rewritten to document the two
stub-import paths (SemSQL-enriched for NCIT/mesh, inline label-only
for the long-tail).
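
As a rough picture of what the collector component does, the sketch below scans lines of a mapping TSV for CURIEs with the requested prefixes, skips `#`-prefixed SSSOM header lines, and normalizes prefix casing. It is an assumption-laden illustration: `collect_curies_from_lines` is a hypothetical name, the regex is a guess at what counts as a CURIE, and "case normalization" is interpreted here as rewriting the prefix to the casing the caller passed in.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set


def collect_curies_from_lines(lines: Iterable[str], prefixes: Iterable[str]) -> Dict[str, Set[str]]:
    """Return {canonical_prefix: {CURIE, ...}} for CURIEs found in the given lines."""
    # Map lower-cased prefix -> canonical casing supplied by the caller.
    canonical = {p.lower(): p for p in prefixes}
    if not canonical:
        return {}
    alternatives = "|".join(re.escape(p) for p in canonical)
    pattern = re.compile(rf"\b({alternatives}):([A-Za-z0-9_]+)", re.IGNORECASE)

    found: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("#"):  # SSSOM YAML header lines are commented out
            continue
        for prefix, local_id in pattern.findall(line):
            canon = canonical[prefix.lower()]
            found[canon].add(f"{canon}:{local_id}")
    return dict(found)


sample = [
    "#curie_map:",
    "#  mesh:D000000 appears only in a header comment",
    "subject_id\tobject_id\tobject_label",
    "EXAMPLE:1\tNCIT:C29298\tOatmeal",
    "EXAMPLE:2\tMESH:D011136\tTween",
    "EXAMPLE:3\tmesh:D011136\tTween",
]
result = collect_curies_from_lines(sample, ["NCIT", "mesh"])
```

In this example the two differently cased MESH references collapse into a single `mesh:D011136` entry, and the CURIE in the header comment is ignored.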
Tests:
- tests/test_stub_curie_collection.py: collector unit tests covering
CURIE discovery, case normalization, missing files, SSSOM YAML
header skipping, and a smoke test against the real committed
mappings.
- tests/test_ontologies_stubs.py: transform unit tests with an
in-memory fake adapter — round-trips label/synonyms/xrefs, falls
back to CURIE when label missing, drops alias-equals-label
duplicates, writes header-only file when no CURIEs, and raises
loudly when neither .db nor .db.gz is present. Plus an integration
test (skipped when stub output absent) that asserts every
collector-discovered CURIE has a corresponding stub-node row.
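
The "auto-decompress, otherwise fail loudly" behaviour that these tests exercise could look something like the sketch below. `ensure_semsql_db` is a hypothetical helper name and the error wording is illustrative; the actual transform may structure this differently.

```python
import gzip
import shutil
from pathlib import Path


def ensure_semsql_db(raw_dir, name: str) -> Path:
    """Return the path to <name>.db, decompressing <name>.db.gz if needed."""
    db_path = Path(raw_dir) / f"{name}.db"
    gz_path = Path(raw_dir) / f"{name}.db.gz"

    if db_path.exists():
        return db_path

    if gz_path.exists():
        # Decompress to a temporary file first so an interrupted run
        # does not leave a truncated .db behind.
        tmp_path = db_path.parent / (db_path.name + ".tmp")
        with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        tmp_path.replace(db_path)
        return db_path

    # No silent fallback: tell the user how to get the file.
    raise FileNotFoundError(
        f"Neither {db_path} nor {gz_path} exists; "
        "run `poetry run kg download` first."
    )
```

Raising instead of returning an empty result means a missing download surfaces at transform time rather than as dangling xrefs in the merged KG.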
Verified: 13 unit tests pass; ruff clean.
End-to-end verification (requires `poetry run kg download` to fetch
ncit.db.gz + mesh.db.gz, ~370 MB total, one-time):
    poetry run kg transform -s ontologies_stubs
    wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv
    poetry run pytest tests/test_ontologies_stubs.py -v  # integration test no longer skipped
Plan: /Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7ebf594 to 770865d
Summary
Adds a new sibling transform `ontologies_stubs` that imports just the NCIT and MESH terms referenced by `mappings/`, with full label, exact synonyms, and dbxrefs, without loading the rest of those ontologies (those belong to the sibling `kg-microbe-biomedical` pipeline).
Why
The chemical-mapping consolidator and the BacDive isolation-source mapper currently reference 70 NCIT and 92 MESH IDs as canonical xrefs for ingredients (e.g. `NCIT:C29298` 'Oatmeal', `mesh:D011136` 'Tween'). The existing `STUB_ONTOLOGY_PREFIXES` mechanism in `isolation_source_mapping_utils.py` was emitting label-only nodes for these inline from the BacDive transform, but:
- the chemical-mapping-driven NCIT/MESH refs were producing dangling node IDs in the merged KG (no node row, label, or xrefs), and
- even where stubs existed, they carried only the BacDive `object_label`, with no synonyms and no xrefs.
This PR fixes both. For each NCIT/MESH CURIE referenced anywhere under `mappings/`, OAK queries the local SemSQL DB (`data/raw/{ncit,mesh}.db`) for `rdfs:label`, exact synonyms, and dbxrefs, then writes a labelled stub node to `data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv`. The DBs themselves are never loaded into the merged KG.
Architecture
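| File | Description |
|---|---|
| `kg_microbe/utils/stub_curie_collection.py` | Collects the NCIT/mesh CURIEs referenced under `mappings/`. Explicit file list, case-normalized output. |
| `kg_microbe/transform_utils/ontologies_stubs/` | Per-CURIE OAK SemSQL fetch, following the same pattern as the chemical-mapping consolidator's `enrich_with_chebi_synonyms`. |
| `kg_microbe/transform.py` | Registers `ONTOLOGIES_STUBS` in `DATA_SOURCES` after `OntologiesTransform`. |
| `download.yaml` | Adds `ncit.db.gz` and `mesh.db.gz` from the standard `bbop-sqlite` distribution (~370 MB total, one-time). |
| `merge.yaml` / `merge.no_metatraits.yaml` / `merge_bakta.yaml` | Declare the new `ontologies_stubs` source. `merge.minimal.yaml` unchanged (skips ontology nodes entirely). |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Inline stub emission now skips NCIT and mesh prefixes (deferred to the new transform); long-tail prefixes keep the inline label-only fallback. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | `STUB_ONTOLOGY_PREFIXES` docstring rewritten to document the two stub-import paths. |
Tests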
13 new unit tests, all passing:
- `tests/test_stub_curie_collection.py` (6 tests): collector discovery, case normalization, missing files, SSSOM YAML header skipping, smoke test against real mappings.
- `tests/test_ontologies_stubs.py` (8 tests, 1 skipped pending real run): transform behaviour with an in-memory fake adapter: round-trips label/synonyms/xrefs, falls back to CURIE when label missing, drops alias-equals-label duplicates, writes header-only file when no CURIEs, raises loudly when neither `.db` nor `.db.gz` is present. Plus an integration test that asserts every collector-discovered CURIE has a corresponding stub-node row (skipped until the real transform run).
`ruff` clean.
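
The integration assertion described above (every collector-discovered CURIE has a stub-node row) amounts to a set difference between the collected CURIEs and the `id` column of the node TSV. A minimal sketch, assuming a tab-separated file with a header column named `id`; `missing_stub_ids` is a hypothetical helper, and the sample rows are made up apart from the two CURIEs quoted in this PR:

```python
import csv
import io
from typing import Iterable, List, TextIO


def missing_stub_ids(collected: Iterable[str], nodes_tsv: TextIO) -> List[str]:
    """Return the collected CURIEs that have no row in the node TSV, sorted."""
    reader = csv.DictReader(nodes_tsv, delimiter="\t")
    present = {row["id"] for row in reader}
    return sorted(set(collected) - present)


# Tiny in-memory stand-in for a *_nodes.tsv file.
nodes = io.StringIO(
    "id\tcategory\tname\n"
    "NCIT:C29298\tbiolink:OntologyClass\tOatmeal\n"
)
gaps = missing_stub_ids({"NCIT:C29298", "mesh:D011136"}, nodes)
```

Here `gaps` lists `mesh:D011136`, the one CURIE without a row; a test would assert that the list is empty once the transform has run.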
Test plan
- `poetry run kg download`: fetches `ncit.db.gz` + `mesh.db.gz` (~370 MB total). One-time.
- `poetry run kg transform -s ontologies_stubs`: should report something like `[NCIT] wrote 70 stub nodes` and `[mesh] wrote 92 stub nodes` on a current checkout.
- `wc -l data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv`: expect ≥71 / ≥93 lines (CURIE count + 1 header).
- `poetry run pytest tests/test_stub_curie_collection.py tests/test_ontologies_stubs.py -v`: all 14 tests green (the integration test is no longer skipped after the transform has run).
Plan
The full design plan (with alternatives considered, scope estimate, edge cases) lives at `/Users/marcin/.claude/plans/are-all-of-these-glittery-origami.md`.
🤖 Generated with Claude Code