Extend stub-import transform to cover PO and MICRO #571

Merged
realmarcin merged 2 commits into master from add-po-micro-to-stub-import
May 21, 2026

@realmarcin
Collaborator

Summary

Promote PO (Plant Ontology) and MICRO (MicrO, the Ontology of Prokaryotic Phenotypic and Metabolic Characters) from full-ontology loading to the per-CURIE SemSQL stub-import path already used by NCIT, mesh, and BTO. The merged KG references only 6 PO CURIEs and 34 MICRO CURIEs, so loading the full ontologies (~2,170 PO and ~17,600 MICRO nodes) added thousands of unrelated nodes for no semantic gain.

  • PO: SemSQL via po.db (bbop-sqlite, 4.6 MB) — same path as NCIT/mesh/BTO
  • MICRO: OBO Graph JSON via a new in-house _ObographJsonAdapter — MICRO's bbop-sqlite distribution is a broken 29-byte placeholder, so the adapter parses micro.json (produced by ROBOT from the already-downloaded micro.owl) and normalizes both the standard http://purl.obolibrary.org/obo/MICRO_NNNN IRI shape and MicrO's quirky http://purl.obolibrary.org/obo/MicrO.owl/MICRO_NNNN variant to the canonical MICRO:NNNN CURIE.
  • Adds po.db.gz to download.yaml; removes the now-orphaned po.owl entry. Keeps micro.owl for the obograph-JSON fallback.
  • Extends BacDive's inline-emit skip-list and STUB_ONTOLOGY_PREFIXES so the dangling-target check passes and BacDive doesn't double-emit label-only stubs.
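
The IRI normalization described for MICRO can be sketched as a single regular expression. This is a minimal illustration, assuming the two IRI shapes quoted above are the only ones that occur; the names `_MICRO_IRI` and `micro_iri_to_curie` are hypothetical and are not taken from the PR.

```python
import re
from typing import Optional

# Hypothetical pattern: accepts the standard OBO PURL shape and the
# MicrO.owl/ variant, capturing the numeric local ID in both cases.
_MICRO_IRI = re.compile(
    r"^http://purl\.obolibrary\.org/obo/(?:MicrO\.owl/)?MICRO_(\d+)$"
)


def micro_iri_to_curie(iri: str) -> Optional[str]:
    """Return MICRO:NNNN for either IRI shape, or None for non-MICRO IRIs."""
    match = _MICRO_IRI.match(iri)
    return f"MICRO:{match.group(1)}" if match else None
```

Returning `None` for anything that does not match keeps unrelated nodes (imported terms from other ontologies, for example) out of the CURIE index instead of guessing a prefix for them.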

Test plan

  • poetry run kg transform -s ontologies_stubs produces 73 NCIT + 95 mesh + 2 BTO + 6 PO + 34 MICRO stub nodes, all with non-empty names
  • collect_stub_curies(['PO','MICRO']) finds the 6 + 34 CURIEs from the committed mappings
  • 27 unit tests pass across test_ontologies_stubs.py, test_isolation_source_mapping_utils.py, test_stub_curie_collection.py
  • New _ObographJsonAdapter unit test exercises both standard and quirky IRI shapes plus malformed-synonym handling
  • ruff check kg_microbe/ tests/ clean
  • CI green

🤖 Generated with Claude Code

Previously PO (Plant Ontology, ~2,170 nodes) and MICRO (Ontology of
Prokaryotic Phenotypic and Metabolic Characters, ~17,600 nodes) were
full-loaded by OntologiesTransform
even though the merged KG only references a handful of CURIEs from each
(6 PO, 34 MICRO). Promote both to the per-CURIE SemSQL stub-import path
already used by NCIT, mesh, and BTO so the merged KG carries one labelled
stub per referenced CURIE instead of thousands of unrelated nodes.

PO uses the standard SemSQL adapter against po.db (bbop-sqlite, 4.6 MB).
MICRO's bbop-sqlite distribution at s3.amazonaws.com/bbop-sqlite/micro.db.gz
is a broken 29-byte placeholder, so MICRO uses a new in-house
_ObographJsonAdapter that parses the OBO Graph JSON produced by ROBOT
from micro.owl. The adapter normalizes both the standard
http://purl.obolibrary.org/obo/MICRO_NNNN IRI shape and MicrO's quirky
http://purl.obolibrary.org/obo/MicrO.owl/MICRO_NNNN variant to the
canonical MICRO:NNNN CURIE (3 of 34 referenced CURIEs use the quirky
form).

Changes:

- kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py:
  add PO + MICRO entries to STUB_ONTOLOGY_SOURCES with a new
  `source_type` field ("semsql" default for NCIT/mesh/BTO/PO,
  "obograph_json" for MICRO). Refactor _open_adapter to dispatch on
  source_type. Add _ObographJsonAdapter class implementing the OAK
  adapter subset the transform calls (label, entity_aliases,
  entity_metadata_map). When micro.json is missing but micro.owl is
  present, _open_adapter calls convert_to_json (ROBOT) once.

- kg_microbe/transform_utils/ontologies/ontologies_transform.py:
  remove po and micro from ONTOLOGIES_MAP and ONTOLOGY_KNOWLEDGE_SOURCES;
  replace with a NOTE comment pointing readers to the stub transform.

- kg_microbe/utils/isolation_source_mapping_utils.py: add PO and MICRO
  to STUB_ONTOLOGY_PREFIXES so BacDive's dangling-target check passes
  for the 8 isolation_source rows that mapped to PO terms.

- kg_microbe/transform_utils/bacdive/bacdive.py: extend the inline-emit
  skip-list from {NCIT, mesh, BTO} to {NCIT, mesh, BTO, PO, MICRO} so
  BacDive defers stub-node emission to the OntologiesStubsTransform
  for all five prefixes.

- download.yaml: add po.db.gz from s3.amazonaws.com/bbop-sqlite
  (~4.6 MB). Remove the now-orphaned po.owl entry (replaced by the
  SemSQL DB; no remaining code path consumes po.owl). Keep micro.owl
  since it is still required for the obograph-JSON fallback. Update
  the stub-import section header to mention the new prefixes and
  document why MICRO uses OWL not SemSQL.

- merge.yaml / merge.no_metatraits.yaml / merge_bakta.yaml: add
  po_nodes.tsv and micro_nodes.tsv to the ontologies_stubs source
  filename list in each variant.

- tests/test_ontologies_stubs.py: rename
  test_stub_ontology_sources_covers_ncit_mesh_bto →
  covers_ncit_mesh_bto_po_micro; add tests asserting MICRO uses the
  obograph_json source_type and NCIT/mesh/BTO/PO use semsql. Add a
  focused _ObographJsonAdapter test exercising both standard and
  quirky IRI shapes, dropping malformed synonym entries (missing val),
  and graceful empty-result handling for unknown CURIEs. Extend the
  integration assertion to cover all five prefixes.
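
A rough sketch of what an OBO Graph JSON adapter with the behavior listed above might look like. The class name `ObographJsonSketch` is hypothetical, the method names mirror the OAK calls mentioned in the changes (`label`, `entity_aliases`), and `entity_metadata_map` is omitted for brevity. It assumes the usual OBO Graph JSON layout (`graphs[].nodes[]` with `id`, `lbl`, and `meta.synonyms[].val`); it is not the PR's `_ObographJsonAdapter`.

```python
import re
from typing import Any, Dict, List, Optional

# Both MICRO IRI shapes named in the commit message map to MICRO:NNNN.
_IRI = re.compile(r"^http://purl\.obolibrary\.org/obo/(?:MicrO\.owl/)?MICRO_(\d+)$")


class ObographJsonSketch:
    """Hypothetical stand-in for an OBO Graph JSON backed label/alias lookup."""

    def __init__(self, document: Dict[str, Any]) -> None:
        # Index every MICRO node by its canonical CURIE.
        self._nodes: Dict[str, Dict[str, Any]] = {}
        for graph in document.get("graphs", []):
            for node in graph.get("nodes", []):
                match = _IRI.match(node.get("id", ""))
                if match:
                    self._nodes[f"MICRO:{match.group(1)}"] = node

    def label(self, curie: str) -> Optional[str]:
        """Return the node label, or None for unknown CURIEs."""
        node = self._nodes.get(curie)
        return node.get("lbl") if node else None

    def entity_aliases(self, curie: str) -> List[str]:
        """Return synonym strings, dropping malformed entries with no val."""
        node = self._nodes.get(curie) or {}
        synonyms = (node.get("meta") or {}).get("synonyms") or []
        return [s["val"] for s in synonyms if isinstance(s, dict) and s.get("val")]


# Illustrative document only; the IDs and labels are made up.
example = {
    "graphs": [
        {
            "nodes": [
                {
                    "id": "http://purl.obolibrary.org/obo/MicrO.owl/MICRO_0000001",
                    "lbl": "example term",
                    "meta": {
                        "synonyms": [
                            {"pred": "hasExactSynonym", "val": "example alias"},
                            {"pred": "hasExactSynonym"},  # malformed: no val
                        ]
                    },
                }
            ]
        }
    ]
}
adapter = ObographJsonSketch(example)
```

Unknown CURIEs yield `None` from `label` and an empty list from `entity_aliases`, which matches the graceful empty-result behavior the tests are described as covering.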

Verified:
- End-to-end stub transform run produced 73 NCIT + 95 mesh + 2 BTO +
  6 PO + 34 MICRO nodes, all with non-empty names.
- collect_stub_curies(['PO','MICRO']) finds the 6 + 34 CURIEs from
  the committed mappings.
- 27 unit tests pass across test_ontologies_stubs.py,
  test_isolation_source_mapping_utils.py, test_stub_curie_collection.py.
- ruff check kg_microbe/ tests/ clean.

End-to-end (requires `poetry run kg download` to fetch po.db.gz):

  poetry run kg transform -s ontologies_stubs
  # → data/transformed/ontologies_stubs/{ncit,mesh,bto,po,micro}_nodes.tsv
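
One way to spot-check the "all with non-empty names" claim after the run above. This is a sketch that assumes the node TSVs are tab-separated with `id` and `name` header columns (the usual KGX node layout); the helper names `ids_missing_names` and `check_all` are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List

PREFIXES = ["ncit", "mesh", "bto", "po", "micro"]


def ids_missing_names(tsv_path: Path) -> List[str]:
    """Return the IDs of rows whose name column is empty or whitespace."""
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["id"] for row in reader if not (row.get("name") or "").strip()]


def check_all(base_dir: Path) -> Dict[str, List[str]]:
    """Map each stub file that exists under base_dir to its unnamed IDs."""
    results: Dict[str, List[str]] = {}
    for prefix in PREFIXES:
        path = base_dir / f"{prefix}_nodes.tsv"
        if path.exists():  # skip prefixes that have not been transformed yet
            results[path.name] = ids_missing_names(path)
    return results
```

Calling `check_all(Path("data/transformed/ontologies_stubs"))` should return an empty list for every file if the transform behaved as reported.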

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 20:29
Contributor

Copilot AI left a comment

Pull request overview

This PR migrates PO (Plant Ontology) and MICRO (MicrO, the Ontology of Prokaryotic Phenotypic and Metabolic Characters) from full ontology loading to the existing per-CURIE “stub import” mechanism, reducing unrelated ontology nodes in the merged KG while still providing labels/synonyms/xrefs for referenced CURIEs.

Changes:

  • Extend the ontologies-stubs transform to emit enriched stubs for PO (SemSQL) and MICRO (OBO Graph JSON via a new in-house adapter).
  • Update merge configs to include po_nodes.tsv and micro_nodes.tsv stub outputs and prevent BacDive from double-emitting these stubs inline.
  • Update downloads to add po.db.gz and remove the now-unused po.owl entry.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_ontologies_stubs.py | Expands tests for PO/MICRO stub sources and adds unit coverage for the new Obograph JSON adapter behavior. |
| merge.yaml | Adds PO and MICRO stub node TSVs to the merged graph inputs. |
| merge.no_metatraits.yaml | Keeps the no-metatraits merge config aligned with new stub outputs. |
| merge_bakta.yaml | Keeps the Bakta merge config aligned with new stub outputs. |
| kg_microbe/utils/isolation_source_mapping_utils.py | Extends stub-prefix configuration/documentation so BacDive validation and inline-emission behavior stays consistent. |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Removes PO/MICRO from full ontology loading and documents the new stub-based approach. |
| kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py | Adds PO/MICRO stub sources, introduces _ObographJsonAdapter, and extends stub generation logic. |
| kg_microbe/transform_utils/bacdive/bacdive.py | Extends the inline stub skip-set so BacDive doesn’t emit duplicate stubs for PO/MICRO. |
| download.yaml | Adds po.db.gz for SemSQL stub import and removes the unused PO OWL download entry; documents MICRO JSON fallback. |

Comment thread kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py Outdated
Comment thread kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py Outdated
Comment thread kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py Outdated
Four review comments addressed, all in ontologies_stubs_transform.py
plus tests:

1. Module docstring (line 13): clarify the reference-footprint phrasing
   to distinguish ~150 *reference rows* from ~34 *distinct MICRO IDs*
   (avoid implying ~150 unique IDs).

2. Module docstring (lines 29-30): drop the stale claim that micro.json
   is "already-downloaded" / produced by the ontologies transform's
   OWL→JSON pass. MICRO was removed from ONTOLOGIES_MAP in this PR, so
   nothing in the full-load path produces micro.json anymore. Rewrite
   to state explicitly that _open_adapter generates the JSON on demand
   from data/raw/micro.owl via a ROBOT subprocess on first run.

3. STUB_ONTOLOGY_SOURCES["MICRO"] inline comment (line 125): same stale
   "already required for legacy reasons by the full-load path" claim;
   updated to match the new on-demand generation flow.

4. _write_stub_nodes error message (line 204): when MICRO inputs were
   missing, the FileNotFoundError still said "expected SemSQL DB at..."
   which is the wrong diagnostic for the obograph_json source. Branch
   on `STUB_ONTOLOGY_SOURCES[prefix]["source_type"]` and emit a
   source-specific message: SemSQL prefixes get the .db/.db.gz
   guidance; obograph_json prefixes get the .json/.owl + ROBOT
   guidance. Added a parallel test
   (test_transform_raises_with_obograph_json_message_when_micro_inputs_missing)
   that asserts the MICRO branch surfaces "OBO Graph JSON … ROBOT" in
   the error and not the SemSQL phrasing. Tightened the existing
   SemSQL test to assert "SemSQL DB" appears.
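
The source-specific diagnostic in item 4 could be sketched roughly as below. The registry `SOURCES` and the function `missing_input_message` are hypothetical, and the message wording is illustrative; only the `source_type` values ("semsql", "obograph_json") and the key phrases "SemSQL DB" and "OBO Graph JSON … ROBOT" come from the commit message.

```python
from typing import Dict

# Hypothetical two-entry registry; the real STUB_ONTOLOGY_SOURCES
# carries more fields and more prefixes.
SOURCES: Dict[str, Dict[str, str]] = {
    "PO": {"source_type": "semsql"},
    "MICRO": {"source_type": "obograph_json"},
}


def missing_input_message(prefix: str) -> str:
    """Build a diagnostic that matches how the prefix is actually loaded."""
    # "semsql" is treated as the default when source_type is absent.
    source_type = SOURCES[prefix].get("source_type", "semsql")
    if source_type == "obograph_json":
        return (
            f"{prefix}: expected OBO Graph JSON (.json), or an .owl file "
            "that ROBOT can convert to JSON"
        )
    return f"{prefix}: expected SemSQL DB (.db or .db.gz)"


def require_inputs(prefix: str, found: bool) -> None:
    """Raise FileNotFoundError with the source-specific message when inputs are missing."""
    if not found:
        raise FileNotFoundError(missing_input_message(prefix))
```

Keeping the message construction separate from the raise makes each branch easy to assert on in a unit test without touching the filesystem.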

Verified:
- 12 unit tests pass (was 11; +1 new obograph error-message test).
- ruff check kg_microbe/ tests/ clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin merged commit 2f78ecc into master May 21, 2026
3 checks passed
realmarcin deleted the add-po-micro-to-stub-import branch May 21, 2026 21:10