Extend stub-import transform to cover PO and MICRO #571
Merged
Previously PO (Plant Ontology, ~2,170 nodes) and MICRO (Microbial Conditions Ontology, ~17,600 nodes) were full-loaded by OntologiesTransform even though the merged KG only references a handful of CURIEs from each (6 PO, 34 MICRO). Promote both to the per-CURIE SemSQL stub-import path already used by NCIT, mesh, and BTO so the merged KG carries one labelled stub per referenced CURIE instead of thousands of unrelated nodes.

PO uses the standard SemSQL adapter against po.db (bbop-sqlite, 4.6 MB). MICRO's bbop-sqlite distribution at s3.amazonaws.com/bbop-sqlite/micro.db.gz is a broken 29-byte placeholder, so MICRO uses a new in-house _ObographJsonAdapter that parses the OBO Graph JSON produced by ROBOT from micro.owl. The adapter normalizes both the standard http://purl.obolibrary.org/obo/MICRO_NNNN IRI shape and MicrO's quirky http://purl.obolibrary.org/obo/MicrO.owl/MICRO_NNNN variant to the canonical MICRO:NNNN CURIE (3 of 34 referenced CURIEs use the quirky form).

Changes:

- kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py: add PO + MICRO entries to STUB_ONTOLOGY_SOURCES with a new `source_type` field ("semsql" default for NCIT/mesh/BTO/PO, "obograph_json" for MICRO). Refactor _open_adapter to dispatch on source_type. Add _ObographJsonAdapter class implementing the OAK adapter subset the transform calls (label, entity_aliases, entity_metadata_map). When micro.json is missing but micro.owl is present, _open_adapter calls convert_to_json (ROBOT) once.
- kg_microbe/transform_utils/ontologies/ontologies_transform.py: remove po and micro from ONTOLOGIES_MAP and ONTOLOGY_KNOWLEDGE_SOURCES; replace with a NOTE comment pointing readers to the stub transform.
- kg_microbe/utils/isolation_source_mapping_utils.py: add PO and MICRO to STUB_ONTOLOGY_PREFIXES so BacDive's dangling-target check passes for the 8 isolation_source rows that mapped to PO terms.
- kg_microbe/transform_utils/bacdive/bacdive.py: extend the inline-emit skip-list from {NCIT, mesh, BTO} to {NCIT, mesh, BTO, PO, MICRO} so BacDive defers stub-node emission to the OntologiesStubsTransform for all five prefixes.
- download.yaml: add po.db.gz from s3.amazonaws.com/bbop-sqlite (~4.6 MB). Remove the now-orphaned po.owl entry (replaced by the SemSQL DB; no remaining code path consumes po.owl). Keep micro.owl since it is still required for the obograph-JSON fallback. Update the stub-import section header to mention the new prefixes and document why MICRO uses OWL not SemSQL.
- merge.yaml / merge.no_metatraits.yaml / merge_bakta.yaml: add po_nodes.tsv and micro_nodes.tsv to the ontologies_stubs source filename list in each variant.
- tests/test_ontologies_stubs.py: rename test_stub_ontology_sources_covers_ncit_mesh_bto → covers_ncit_mesh_bto_po_micro; add tests asserting MICRO uses the obograph_json source_type and NCIT/mesh/BTO/PO use semsql. Add a focused _ObographJsonAdapter test exercising both standard and quirky IRI shapes, dropping malformed synonym entries (missing val), and graceful empty-result handling for unknown CURIEs. Extend the integration assertion to cover all five prefixes.

Verified:

- End-to-end stub transform run produced 73 NCIT + 95 mesh + 2 BTO + 6 PO + 34 MICRO nodes, all with non-empty names.
- collect_stub_curies(['PO','MICRO']) finds the 6 + 34 CURIEs from the committed mappings.
- 27 unit tests pass across test_ontologies_stubs.py, test_isolation_source_mapping_utils.py, test_stub_curie_collection.py.
- ruff check kg_microbe/ tests/ clean.

End-to-end (requires `poetry run kg download` to fetch po.db.gz):

    poetry run kg transform -s ontologies_stubs
    # → data/transformed/ontologies_stubs/{ncit,mesh,bto,po,micro}_nodes.tsv

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
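The IRI normalization and the adapter subset described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `iri_to_curie`, `ObographJsonAdapterSketch`, the regex, and the sample graph (IDs, labels, synonyms) are all hypothetical. It assumes the usual OBO Graph JSON layout (`graphs[].nodes[]` with `id`, `lbl`, and `meta.synonyms[].val`) and leaves out `entity_metadata_map`.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern: accepts the standard OBO PURL and the MicrO variant
# that carries an extra "MicrO.owl/" path segment.
_MICRO_IRI = re.compile(
    r"^http://purl\.obolibrary\.org/obo/(?:MicrO\.owl/)?MICRO_(\d+)$"
)


def iri_to_curie(iri: str) -> Optional[str]:
    """Map either MICRO IRI shape to MICRO:NNNN; return None for anything else."""
    match = _MICRO_IRI.match(iri)
    return f"MICRO:{match.group(1)}" if match else None


class ObographJsonAdapterSketch:
    """Illustrative stand-in for the label/alias lookups the transform needs."""

    def __init__(self, obograph: dict) -> None:
        # Index nodes by canonical CURIE so both IRI shapes resolve the same way.
        self._nodes: Dict[str, dict] = {}
        for graph in obograph.get("graphs", []):
            for node in graph.get("nodes", []):
                curie = iri_to_curie(node.get("id", ""))
                if curie is not None:
                    self._nodes[curie] = node

    def label(self, curie: str) -> Optional[str]:
        node = self._nodes.get(curie)
        return node.get("lbl") if node else None

    def entity_aliases(self, curie: str) -> List[str]:
        # Unknown CURIEs yield an empty list; synonyms without "val" are dropped.
        node = self._nodes.get(curie, {})
        synonyms = node.get("meta", {}).get("synonyms", [])
        return [s["val"] for s in synonyms if isinstance(s, dict) and s.get("val")]


# Made-up sample data covering both IRI shapes and one malformed synonym.
sample = {"graphs": [{"nodes": [
    {"id": "http://purl.obolibrary.org/obo/MICRO_0000001", "lbl": "term a",
     "meta": {"synonyms": [{"pred": "hasExactSynonym", "val": "alias a"},
                           {"pred": "hasExactSynonym"}]}},
    {"id": "http://purl.obolibrary.org/obo/MicrO.owl/MICRO_0000002", "lbl": "term b"},
]}]}
adapter = ObographJsonAdapterSketch(sample)
```

Indexing by canonical CURIE at load time keeps each lookup a plain dictionary access, so callers never need to know which IRI shape a term used in the source file.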
Pull request overview
This PR migrates PO (Plant Ontology) and MICRO (Microbial Conditions Ontology) from full ontology loading to the existing per-CURIE “stub import” mechanism, reducing unrelated ontology nodes in the merged KG while still providing labels/synonyms/xrefs for referenced CURIEs.
Changes:
- Extend the ontologies-stubs transform to emit enriched stubs for PO (SemSQL) and MICRO (OBO Graph JSON via a new in-house adapter).
- Update merge configs to include `po_nodes.tsv` and `micro_nodes.tsv` stub outputs and prevent BacDive from double-emitting these stubs inline.
- Update downloads to add `po.db.gz` and remove the now-unused `po.owl` entry.
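The split between SemSQL and OBO Graph JSON sources amounts to a dispatch on a per-prefix `source_type` field. The sketch below illustrates that idea under stated assumptions: `STUB_SOURCES`, `OPENERS`, and `open_adapter` are hypothetical names, the registry entries are stripped down to the one field that matters here, and the openers return strings where real code would return adapter objects.

```python
from typing import Callable, Dict

# Hypothetical registry shaped like the STUB_ONTOLOGY_SOURCES described in the PR.
# Entries without "source_type" fall back to the "semsql" default.
STUB_SOURCES: Dict[str, Dict[str, str]] = {
    "NCIT": {},
    "mesh": {},
    "BTO": {},
    "PO": {"source_type": "semsql"},
    "MICRO": {"source_type": "obograph_json"},
}


def _open_semsql(prefix: str) -> str:
    # Placeholder: real code would open the SemSQL database for this prefix.
    return f"semsql adapter for {prefix}"


def _open_obograph_json(prefix: str) -> str:
    # Placeholder: real code would load the OBO Graph JSON for this prefix.
    return f"obograph_json adapter for {prefix}"


OPENERS: Dict[str, Callable[[str], str]] = {
    "semsql": _open_semsql,
    "obograph_json": _open_obograph_json,
}


def open_adapter(prefix: str) -> str:
    """Pick the opener for a prefix based on its source_type."""
    source_type = STUB_SOURCES[prefix].get("source_type", "semsql")
    try:
        opener = OPENERS[source_type]
    except KeyError:
        raise ValueError(f"unknown source_type {source_type!r} for {prefix}") from None
    return opener(prefix)
```

A table of openers keeps the dispatch in one place, so adding a third source type would mean one new function and one new dictionary entry rather than another branch.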
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_ontologies_stubs.py | Expands tests for PO/MICRO stub sources and adds unit coverage for the new Obograph JSON adapter behavior. |
| merge.yaml | Adds PO and MICRO stub node TSVs to the merged graph inputs. |
| merge.no_metatraits.yaml | Keeps the no-metatraits merge config aligned with new stub outputs. |
| merge_bakta.yaml | Keeps the Bakta merge config aligned with new stub outputs. |
| kg_microbe/utils/isolation_source_mapping_utils.py | Extends stub-prefix configuration/documentation so BacDive validation and inline-emission behavior stays consistent. |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Removes PO/MICRO from full ontology loading and documents the new stub-based approach. |
| kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py | Adds PO/MICRO stub sources, introduces _ObographJsonAdapter, and extends stub generation logic. |
| kg_microbe/transform_utils/bacdive/bacdive.py | Extends the inline stub skip-set so BacDive doesn’t emit duplicate stubs for PO/MICRO. |
| download.yaml | Adds po.db.gz for SemSQL stub import and removes the unused PO OWL download entry; documents MICRO JSON fallback. |
Four review comments addressed, all in ontologies_stubs_transform.py plus tests:

1. Module docstring (line 13): clarify the reference-footprint phrasing to distinguish ~150 *reference rows* from ~34 *distinct MICRO IDs* (avoid implying ~150 unique IDs).
2. Module docstring (line 29-30): drop the stale claim that micro.json is "already-downloaded" / produced by the ontologies transform's OWL→JSON pass. MICRO was removed from ONTOLOGIES_MAP in this PR, so nothing in the full-load path produces micro.json anymore. Rewrite to state explicitly that _open_adapter generates the JSON on demand from data/raw/micro.owl via a ROBOT subprocess on first run.
3. STUB_ONTOLOGY_SOURCES["MICRO"] inline comment (line 125): same stale "already required for legacy reasons by the full-load path" claim; updated to match the new on-demand generation flow.
4. _write_stub_nodes error message (line 204): when MICRO inputs were missing, the FileNotFoundError still said "expected SemSQL DB at..." which is the wrong diagnostic for the obograph_json source. Branch on `STUB_ONTOLOGY_SOURCES[prefix]["source_type"]` and emit a source-specific message: SemSQL prefixes get the .db/.db.gz guidance; obograph_json prefixes get the .json/.owl + ROBOT guidance.

Added a parallel test (test_transform_raises_with_obograph_json_message_when_micro_inputs_missing) that asserts the MICRO branch surfaces "OBO Graph JSON … ROBOT" in the error and not the SemSQL phrasing. Tightened the existing SemSQL test to assert "SemSQL DB" appears.

Verified:

- 12 unit tests pass (was 11; +1 new obograph error-message test).
- ruff check kg_microbe/ tests/ clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
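The source-specific diagnostic from the fourth review comment could look roughly like the sketch below. The function name `missing_input_message`, its parameters, and the exact wording of both messages are assumptions made for illustration; only the branch on `source_type` and the "SemSQL DB" versus "OBO Graph JSON ... ROBOT" distinction come from the commit message.

```python
from pathlib import PurePosixPath


def missing_input_message(prefix: str, source_type: str, raw_dir: PurePosixPath) -> str:
    """Build a missing-input diagnostic that matches the prefix's source type."""
    name = prefix.lower()
    if source_type == "obograph_json":
        # JSON-backed sources need either the JSON itself or the OWL to convert.
        return (
            f"No inputs for {prefix} stubs: expected OBO Graph JSON at "
            f"{raw_dir / (name + '.json')}, or {raw_dir / (name + '.owl')} "
            f"so that ROBOT can generate the JSON."
        )
    # Default branch: SemSQL-backed sources need the database or its gzip.
    return (
        f"No inputs for {prefix} stubs: expected SemSQL DB at "
        f"{raw_dir / (name + '.db')} or {raw_dir / (name + '.db.gz')}."
    )


raw = PurePosixPath("data/raw")
micro_message = missing_input_message("MICRO", "obograph_json", raw)
po_message = missing_input_message("PO", "semsql", raw)
```

Building the message in a small pure function makes it easy to assert on the wording for each source type without setting up missing files on disk.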
Summary
Promote PO (Plant Ontology) and MICRO (Microbial Conditions Ontology) from full-ontology loading to the per-CURIE SemSQL stub-import path already used by NCIT, mesh, and BTO. The merged KG references only 6 PO CURIEs and 34 MICRO CURIEs, so loading the full ontologies (~2,170 PO and ~17,600 MICRO nodes) was contributing thousands of unrelated nodes for no semantic gain.
- PO: `po.db` (bbop-sqlite, 4.6 MB), same path as NCIT/mesh/BTO
- MICRO: `_ObographJsonAdapter`. MICRO's bbop-sqlite distribution is a broken 29-byte placeholder, so the adapter parses `micro.json` (produced by ROBOT from the already-downloaded `micro.owl`) and normalizes both the standard `http://purl.obolibrary.org/obo/MICRO_NNNN` IRI shape and MicrO's quirky `http://purl.obolibrary.org/obo/MicrO.owl/MICRO_NNNN` variant to the canonical `MICRO:NNNN` CURIE.
- Adds `po.db.gz` to `download.yaml`; removes the now-orphaned `po.owl` entry. Keeps `micro.owl` for the obograph-JSON fallback.
- Adds PO and MICRO to `STUB_ONTOLOGY_PREFIXES` so the dangling-target check passes and BacDive doesn't double-emit label-only stubs.

Test plan
- `poetry run kg transform -s ontologies_stubs` produces 73 NCIT + 95 mesh + 2 BTO + 6 PO + 34 MICRO stub nodes, all with non-empty names
- `collect_stub_curies(['PO','MICRO'])` finds the 6 + 34 CURIEs from the committed mappings
- 27 unit tests pass across `test_ontologies_stubs.py`, `test_isolation_source_mapping_utils.py`, `test_stub_curie_collection.py`
- `_ObographJsonAdapter` unit test exercises both standard and quirky IRI shapes plus malformed-synonym handling
- `ruff check kg_microbe/ tests/` clean

🤖 Generated with Claude Code