METPO proposal cohort 2026_04: predicates + class cleanup + transform rewires #563
Merged
Reworks the METPO term proposal in `scripts/extract_metpo_proposals.py` so it stops conflicting with existing METPO predicate/domain conventions and stops asserting a self-contradictory class placement.
- Optimum datatype properties: replace symbolic METPO:has_*_optimum IDs with numeric METPO:2000717/2000718/2000719 in the existing 2000700-series value-property family; switch domain from biolink:OrganismTaxon to METPO:1000525 'microbe'; align labels with the existing `has growth X value` convention.
- Selective-medium growth: drop the four phenotype classes (METPO:1007050/1007053/1007054/1007055), since METPO already provides METPO:2000517/2000518 (`grows in` / `does not grow in`) with range METPO:1004005 'growth medium'. Add a DEFERRED_PLACEHOLDERS set so the metatraits placeholder validator accepts kgmicrobe.trait:macconkey/blood_agar pending a transform refactor that emits the predicate pattern.
- Enzyme test classes (catalase/oxidase/urease/coagulase positive/negative): keep the neutral test-result class structure but make every definition explicitly bridge to METPO:2000302 'shows activity of' / METPO:2000303 'does not show activity of' against the GO enzyme term, and document the dual-emit requirement in the source comment.
- Xerophilic phenotype (METPO:1007092): reparent from METPO:1007073 'osmotic tolerance' to _PHENO_PARENT. Its own definition explicitly separates water-activity tolerance from solute-concentration tolerance, so subclassing under osmotic tolerance let the reasoner infer the very claim the definition denies.
Test side: add the two `*_robot.tsv` files to PROPOSAL_FILES so manual edits to the committed ROBOT templates that don't match the extractor's regenerate now fail the test. ROBOT template + ELK reason is already invoked from the extractor `main()` and therefore from the test; it exits non-zero on UNSAT or inconsistency when the `robot` binary is on PATH.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
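The role of DEFERRED_PLACEHOLDERS in the placeholder validator can be sketched roughly as below. This is a minimal illustration, not the code in `scripts/extract_metpo_proposals.py`: the function name `uncovered_placeholders`, its parameters, and the `kgmicrobe.trait:example_unmapped` CURIE are hypothetical, and only the set name and the two `*_agar_growth` CURIEs come from the commit messages.

```python
# Illustrative sketch only. DEFERRED_PLACEHOLDERS and its two members are named
# in the commit messages; everything else here is a hypothetical stand-in.
DEFERRED_PLACEHOLDERS = frozenset(
    {
        "kgmicrobe.trait:macconkey_agar_growth",
        "kgmicrobe.trait:blood_agar_growth",
    }
)


def uncovered_placeholders(emitted_curies, migrated, deferred=DEFERRED_PLACEHOLDERS):
    """Return placeholder CURIEs that are neither migrated nor deferred."""
    prefix = "kgmicrobe.trait:"
    return sorted(
        curie
        for curie in set(emitted_curies)
        if curie.startswith(prefix) and curie not in migrated and curie not in deferred
    )


# Objects as they might appear in a stale edges.tsv (hypothetical sample).
stale_objects = [
    "kgmicrobe.trait:macconkey_agar_growth",  # deferred, so accepted
    "kgmicrobe.trait:example_unmapped",       # hypothetical CURIE, would be reported
    "METPO:1007092",                          # not a placeholder, ignored
]
print(uncovered_placeholders(stale_objects, migrated=set()))
```

Under these assumptions the validator would fail only when the returned list is non-empty, so deferred entries pass without being treated as migrated.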
…0517
Retires the kgmicrobe.trait:macconkey_agar_growth /
kgmicrobe.trait:blood_agar_growth phenotype-class encoding (which the
preceding commit removed from the METPO proposal) and replaces it with
the predicate-driven METPO:2000517 'grows in' pattern that METPO
already provides.
- kgmicrobe.medium:* placeholders for media not in MediaDive's catalog:
  - kgmicrobe.medium:macconkey_agar: MacConkey agar (Gram-negative
    selective/differential)
  - kgmicrobe.medium:blood_agar: generic blood agar (specific
    Columbia/Heart Infusion variants live as mediadive.medium:* nodes
    when present)
  - kgmicrobe.medium:emb_agar: Eosin Methylene Blue agar
    (forward-compat; not yet emitted by any transform)
  Each entry carries `category: biolink:GrowthMedium` and an explicit
  `predicate: METPO:2000517` so _build_trait_mapping() resolves to the
  grows-in predicate instead of biolink:has_phenotype.
- Trait-mapping loader: register kgmicrobe.medium → biolink:GrowthMedium
in microbial_trait_mappings._OBJECT_SOURCE_TO_CATEGORY so phenotype
rows resolve the object node as a GrowthMedium.
- mappings/canonical/phenotype_mappings.tsv: rewrite the two `growth:
MacConkey agar` / `growth: blood agar` rows so object_id, object_label,
object_source, and notes carry the new kgmicrobe.medium:* CURIE and a
notes scan token of `METPO:2000517` (the loader's
_resolve_biolink_predicate token-scans notes for METPO:2000xxx
predicates). For false-majority rows the metatraits transform now emits
METPO:2000518 (does not grow in); previously these rows were silently
dropped because biolink:has_phenotype has no negative pair.
- DEFERRED_PLACEHOLDERS in extract_metpo_proposals.py: keep the two
kgmicrobe.trait:*_agar_growth entries so the placeholder validator
doesn't fail when scanning a stale data/transformed/metatraits/edges.tsv
produced before this refactor. New transform runs no longer emit those
CURIEs.
- tests/test_metatraits.py: update
test_tier2_false_majority_drops_positive_edges to expect
kgmicrobe.medium:* objects and to flag METPO:2000517 as a positive
predicate the assertion guards against. Updated docstring to reflect
that grows-in rows now have a real negative form (METPO:2000518) and
are emitted as negative edges rather than skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
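The notes token scan described above can be sketched as follows. This is an assumption-laden illustration rather than the loader's actual `_resolve_biolink_predicate`: the function name `resolve_predicate_from_notes`, the regular expression, and the fallback default are hypothetical, chosen only to show how a `METPO:2000xxx` token in `notes` could override `biolink:has_phenotype`.

```python
import re

# Hypothetical pattern: a METPO CURIE in the 2000xxx object-property range.
METPO_PREDICATE_RE = re.compile(r"\bMETPO:2000\d{3}\b")


def resolve_predicate_from_notes(notes, default="biolink:has_phenotype"):
    """Return the first METPO:2000xxx token found in notes, else the default."""
    match = METPO_PREDICATE_RE.search(notes or "")
    return match.group(0) if match else default


print(resolve_predicate_from_notes("selective medium; METPO:2000517 grows in"))
print(resolve_predicate_from_notes("class term METPO:1007050 only"))
```

A scan like this keeps the TSV schema unchanged, at the cost of making the `notes` column load-bearing, which is why the rows carry the token explicitly.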
Pull request overview
Updates the METPO proposal extraction artifacts and metatraits-side mappings to align selective-medium growth observations with METPO’s existing grows in / does not grow in predicate pattern, while tightening proposal regeneration validation (ROBOT templates must match the extractor output).
Changes:
- Reworks proposed “optimum” datatype properties to use numeric METPO IDs, domain `METPO:1000525`, and updated labels/definitions.
- Migrates selective-medium growth mappings to emit `METPO:2000517` edges to newly minted `kgmicrobe.medium:*` placeholders (MacConkey/blood/EMB agar), and adjusts metatraits tests accordingly.
- Extends the extractor regression test to diff-check committed ROBOT templates (`*_robot.tsv`) in addition to the existing proposal TSVs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/extract_metpo_proposals.py` | Updates proposal term definitions, placeholder coverage validation logic, and narrative guidance around selective media + enzyme test modeling. |
| `tests/test_metatraits.py` | Updates Tier-2 false-majority regression test to treat METPO:2000517 as a positive predicate and updates expected objects for selective media to `kgmicrobe.medium:*`. |
| `tests/test_extract_metpo_proposals.py` | Adds ROBOT template TSVs to the “outputs must match committed” diff-check list. |
| `mappings/canonical/phenotype_mappings.tsv` | Rewrites MacConkey/blood agar mappings to `kgmicrobe.medium:*` objects and documents intended METPO:2000517 predicate usage. |
| `kg_microbe/utils/microbial_trait_mappings.py` | Registers `kgmicrobe.medium` object_source → `biolink:GrowthMedium` category mapping. |
| `kg_microbe/transform_utils/custom_curies.yaml` | Adds `kgmicrobe.medium:*` placeholders (MacConkey/blood/EMB agar) with `predicate: METPO:2000517`. |
| `mappings/metpo_proposal_quantitative.tsv` | Regenerated quantitative proposal TSV reflecting new numeric optimum datatype properties. |
| `mappings/metpo_proposal_categorical.tsv` | Regenerated categorical proposal TSV reflecting removed selective-medium phenotype classes and updated enzyme/xerophily text. |
| `mappings/metpo_proposal_classes_robot.tsv` | Regenerated ROBOT class template reflecting removed selective-medium classes and updated definitions/parents. |
| `mappings/metpo_proposal_properties_robot.tsv` | Regenerated ROBOT property template reflecting numeric optimum properties and updated domain/range. |
Copilot review on PR #563 (#563):
- scripts/extract_metpo_proposals.py: rewrite the selective-medium block comment so it reflects the current state: the kgmicrobe.medium:* / METPO:2000517 rewrite is DONE in mappings/canonical/phenotype_mappings.tsv and custom_curies.yaml; the legacy kgmicrobe.trait:*_agar_growth placeholders only persist in stale data/transformed/metatraits/edges.tsv files, not in new transform output. Future work is migrating the kgmicrobe.medium:* placeholders to a stable external IRI.
- scripts/extract_metpo_proposals.py: reword the placeholder-coverage status print so "deferred to transform refactor" reads as "legacy placeholders carried over in stale edges.tsv (already retired transform-side)", so that no outstanding work is implied.
- tests/test_metatraits.py: revise the docstring + add a positive assertion. The earlier docstring overclaimed that false-majority grows-in rows now emit METPO:2000518 'does not grow in' edges; in fact the upstream METPO ontology does not give 2000517/2000518 a shared synonym (other paired predicates pair via shared synonyms, e.g. 'assimilation' for 2000002/2000027), so _get_negative_predicate(2000517) returns None and the row is still silently dropped, same as the prior biolink:has_phenotype encoding. Test now asserts both the absence of any positive edge AND the absence of any edge of any kind to kgmicrobe.medium:macconkey_agar / blood_agar for false-majority rows, so a future METPO synonym fix will surface as a controlled test failure instead of a silent behaviour change.
Plus a separate fix for an unrelated build-time warning the user flagged (5 family-mismatched mappings in mappings/isolation_source_to_ontology.tsv that the loader was dropping):
- Psychrophilic-<10°C: retarget METPO:1000614 -> ENVO:01000309 ('cold environment'), exactMatch + ManualMappingCuration. The BacDive label is a categorical thermal-regime tag for the *isolation environment*, not the *organism trait*; ENVO:01000309 is the isolation-source-shaped equivalent (the <10°C threshold is the bin definition, not a different concept).
- Thermophilic->45°C: retarget METPO:1000616 -> ENVO:01000305 ('high temperature environment'), same reasoning.
- Anoxic-anaerobic, Child, Non-marine-Saline-and-Alkaline: clear the PATO target (no clean ENVO replacement available: ENVO has no generic 'anoxic environment' / host life-stage / non-marine saline+alkaline class), leaving these rows as explicitly unmapped. The loader silently skips empty-object_id rows so the noisy family-mismatched WARNING goes away; the BacDive transform falls back to its isolation_source:* placeholder, the same behaviour the family-mismatch guard was already producing.
Verified: poetry run python scripts/extract_metpo_proposals.py reports ROBOT template + ELK reason passed; pytest tests/test_metatraits.py + tests/test_extract_metpo_proposals.py 27 tests green; tox -e lint clean; load_isolation_source_mappings emits no warnings for any of the five labels and loads ENVO:01000309 / ENVO:01000305 for the two thermal-regime entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
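The "empty object_id rows are skipped" behaviour mentioned in this commit can be illustrated with a small loader sketch. The column names (`source_label`, `object_id`, `object_label`), the lowercased key, and the function name `load_mappings` are assumptions made for the example; the real `load_isolation_source_mappings` may differ in schema and normalisation.

```python
import csv
import io

# Hypothetical three-column sample; the real TSV has its own schema.
SAMPLE_TSV = (
    "source_label\tobject_id\tobject_label\n"
    "Psychrophilic-<10°C\tENVO:01000309\tcold environment\n"
    "Thermophilic->45°C\tENVO:01000305\thigh temperature environment\n"
    "Child\t\t\n"  # explicitly unmapped: empty object_id
)


def load_mappings(handle):
    """Load label -> (object_id, object_label), skipping unmapped rows."""
    mappings = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        object_id = (row.get("object_id") or "").strip()
        if not object_id:
            # Unmapped rows are skipped silently; the caller falls back
            # to its own placeholder for these labels.
            continue
        key = row["source_label"].strip().lower()
        mappings[key] = (object_id, (row.get("object_label") or "").strip())
    return mappings


mappings = load_mappings(io.StringIO(SAMPLE_TSV))
print(sorted(object_id for object_id, _label in mappings.values()))
```

In this sketch an unmapped label simply never enters the table, so no warning path is exercised for it.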
New paired predicate METPO:2000064 'tolerates' / METPO:2000065 'does
not tolerate' (synonyms 'tolerance' on both, plus 'is susceptible to'
and 'is sensitive to' on the negative member), placed in the
2000064-2000070 gap of the chemical-interaction object-property
family. Both members carry the shared `tolerance` synonym so the
metatraits transform's `_build_metpo_lookups` /
`_get_negative_predicate` mechanism auto-pairs them — same convention
as METPO:2000002/2000027 (assimilates / does not assimilate, paired
via the shared `assimilation` synonym). Domain METPO:1000525
'microbe', range METPO:1000526 'chemical entity'.
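The shared-synonym pairing convention can be sketched as below. This is a simplified stand-in for `_build_metpo_lookups` / `_get_negative_predicate`, whose real signatures are not shown in this PR: the data layout, the function names, and the positive-side synonym sets are assumptions, and the 2000517/2000518 entries are given empty synonym sets purely to model the "no shared synonym" case.

```python
# Hypothetical in-memory view of a few METPO predicates.
PREDICATES = {
    "METPO:2000002": {"label": "assimilates", "synonyms": {"assimilation"}},
    "METPO:2000027": {"label": "does not assimilate", "synonyms": {"assimilation"}},
    "METPO:2000064": {"label": "tolerates", "synonyms": {"tolerance"}},
    "METPO:2000065": {
        "label": "does not tolerate",
        "synonyms": {"tolerance", "is susceptible to", "is sensitive to"},
    },
    "METPO:2000517": {"label": "grows in", "synonyms": set()},
    "METPO:2000518": {"label": "does not grow in", "synonyms": set()},
}


def build_negative_lookup(predicates):
    """Map each positive predicate to its negative partner via a shared synonym."""
    by_synonym = {}
    for curie, info in predicates.items():
        for synonym in info["synonyms"]:
            by_synonym.setdefault(synonym, []).append(curie)

    negative_of = {}
    for curies in by_synonym.values():
        if len(curies) != 2:
            continue  # a pair needs exactly two predicates sharing the synonym
        negatives = [c for c in curies if predicates[c]["label"].startswith("does not ")]
        positives = [c for c in curies if c not in negatives]
        if len(negatives) == 1 and len(positives) == 1:
            negative_of[positives[0]] = negatives[0]
    return negative_of


NEGATIVE_OF = build_negative_lookup(PREDICATES)


def get_negative_predicate(curie):
    """Return the paired negative predicate, or None when no pair is reachable."""
    return NEGATIVE_OF.get(curie)


print(get_negative_predicate("METPO:2000064"))
print(get_negative_predicate("METPO:2000517"))
```

With this layout, a pair that lacks a shared synonym is indistinguishable from an unpaired predicate, which is the failure mode the 2000517/2000518 case illustrates.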
Transform-side wiring:
- mappings/canonical/phenotype_mappings.tsv: rewrite `growth: bile
acid susceptible` to route to CHEBI:3098 'bile acid' with predicate
METPO:2000065 'does not tolerate' (notes-token scan picks up the
METPO predicate, source CHEBI resolves to biolink:ChemicalEntity).
- custom_curies.yaml: drop the `kgmicrobe.trait:bile_susceptible`
entry since the transform no longer emits it.
- extract_metpo_proposals.py: drop bile_susceptible from
KGMICROBE_PLACEHOLDER_MIGRATION; add it to DEFERRED_PLACEHOLDERS so
the validator does not fail on stale edges.tsv.
Proposal-side cleanup (removed superseded classes):
- METPO:1007051 'bile acid response' parent — gone
- METPO:1007056 'bile acid susceptible' child — gone
Both were a class-subdivision encoding of the same relationship the
new predicate pair expresses parametrically. The pattern generalises
to future heavy-metal / antibiotic / detergent susceptibility traits
without minting one class per (challenge_chemical, polarity) pair.
Tests:
- tests/test_metatraits.py false-majority test now expects
  `kgmicrobe.medium:macconkey_agar`, `kgmicrobe.medium:blood_agar`,
  AND `CHEBI:3098` to all silently drop for false majority, because
  in each case the mapped predicate has no reachable inverse:
  - 2000517 grows in: pair lacks shared synonym upstream, so 2000518
    is unreachable
  - 2000065 does not tolerate: row's mapped predicate is the negative
    member of the new pair, so the inverse-of-an-inverse is undefined
  Both surface as silent drops today, matching the pre-refactor
  behaviour. The positive-predicate guard set now also flags
  METPO:2000064 / 2000065.
Skill update (.claude/skills/metpo-proposal/SKILL.md):
- New "Numeric-ID range conventions" section: which numeric ranges
hold which kind of term (chemical-interaction object properties
vs value datatype properties vs class subranges) and which gaps
are currently free.
- New "Paired predicates" section: the `does not <stem>` label
convention + the shared-synonym requirement for auto-pairing,
with a worked example (METPO:2000002/2000027) and a
counter-example (METPO:2000517/2000518 — broken upstream).
- New "Predicate vs class" section: when to mint a paired predicate
vs a class hierarchy. Codifies the pattern of removing
chemical/medium/enzyme-encoding class hierarchies in favour of
predicate-driven edges (3 such removals so far in this cohort:
selective-medium classes, bile-acid response classes).
- Skill checklist extended with three new items covering ID range
placement, paired-predicate synonym verification, and the
predicate-vs-class audit.
Verified: ROBOT template + ELK reason passes (5 properties, 37
classes); 27 tests green; tox -e lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TO rows
Mints two new ObjectProperty predicates in the chemical-interaction range gap (METPO:2000067-2000070):
- METPO:2000067 'isolated from host with quality' (microbe -> PATO host quality: Child, Juvenile, Female, Male)
- METPO:2000068 'isolated from environment with quality' (microbe -> PATO environment quality: Acidic, Alkaline, Cold, Anoxic-anaerobic, Non-marine-Saline-and-Alkaline)
Both have domain METPO:1000525 'microbe' and range PATO:0000001 'quality'. Solves the conflation Codex review #558 originally flagged ('Child' → PATO juvenile etc. produced incoherent edges under the legacy <source> --biolink:location_of--> <organism> shape, because a microbe cannot be 'located_in' a quality). The new predicates carry the 'isolated from {host,environment} with quality' semantics in the predicate itself, so the PATO term is a quality of the source rather than a broken location target.
Transform-side wiring:
- mappings/isolation_source_to_ontology.tsv: restore 9 rows that were previously cleared (3 in this session, 6 in earlier family_mismatch_fix sweeps). Each carries the appropriate METPO:2000067 / METPO:2000068 token in `notes`. Cold retargeted from the obsolete PATO:0000256 to the active PATO:0001306 'decreased temperature'. Non-marine-Saline-and-Alkaline restored as skos:closeMatch (loses saline + non-marine aspects), so it loads the PATO target into the mapping table for documentation but is dropped at runtime by the trust check pending a future ENVO 'alkaline saline non-marine environment' refinement.
- kg_microbe/utils/isolation_source_mapping_utils.py:
  - Add PREDICATE_OVERRIDE_CURIES = {METPO:2000067, METPO:2000068}.
  - Add _extract_predicate_override(row): scans `notes` for an override CURIE.
  - _row_passes_family_check: PATO targets are now allowed when the row carries an override token (METPO targets are NOT exception-eligible, since METPO terms describe organism phenotypes, not source qualities).
  - load_isolation_source_mappings now returns Dict[str, Tuple[object_id, object_label, object_source, predicate_override]] (was 2-tuple). predicate_override is None for non-PATO rows and the override CURIE for PATO rows.
- mappings/validate_isolation_source_mappings.py: mirror the new exception logic (mirrored from the runtime loader, kept stdlib-only so it can run in minimal CI containers; the existing test_validator_rules_match_loader regression test catches drift).
- kg_microbe/transform_utils/bacdive/bacdive.py:
  - Build-time prefix check now reads the first tuple element via `mapping_value[0]` instead of unpacking 2-tuples, accommodating the extended return type.
  - Runtime emit loop branches on `predicate_override`: when set, emit the inverted edge `<organism> --predicate_override--> <PATO term>` (no source node write; PATO is loaded by the ontologies transform via ONTOLOGIES_MAP). Otherwise, fall through to the legacy location_of edge shape.
Tests:
- tests/test_isolation_source_mapping_utils.py:
  - Existing tests updated to expect the 4-tuple return.
  - New test_loader_loads_pato_rows_with_predicate_override covers the 8 trusted PATO rows (4 host + 4 environment) and the non-marine-saline-and-alkaline closeMatch row that loads PATO into the in-memory table but is dropped at runtime by trust.
  - test_loader_honors_manually_curated_fixes: 'child' assertion flipped from is None to the new 4-tuple; 'psychrophilic <10°c' flipped from is None to the ENVO retarget from the previous commit.
Verified: ROBOT template + ELK reason still passes (now 7 OWL property rows, 37 OWL class rows); 37 tests green (pytest test_isolation_source_mapping_utils.py + test_extract_metpo_proposals.py + test_metatraits.py); ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
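The override scan and the emit-loop branch described in this commit can be sketched together. Only PREDICATE_OVERRIDE_CURIES, the two METPO CURIEs, the 4-tuple layout, and the two edge shapes come from the commit message; the function names `extract_predicate_override` and `build_edge`, the regular expression, the `isolation_source` value in the sample tuples, and the sample organism CURIE are illustrative assumptions.

```python
import re

PREDICATE_OVERRIDE_CURIES = {"METPO:2000067", "METPO:2000068"}
_METPO_CURIE_RE = re.compile(r"\bMETPO:\d{7}\b")  # hypothetical token pattern


def extract_predicate_override(row):
    """Return the first override CURIE found in the row's notes, else None."""
    for token in _METPO_CURIE_RE.findall(row.get("notes") or ""):
        if token in PREDICATE_OVERRIDE_CURIES:
            return token
    return None


def build_edge(organism_id, mapping_value):
    """Build a (subject, predicate, object) triple from a 4-tuple mapping."""
    object_id, _label, _source, predicate_override = mapping_value
    if predicate_override:
        # Inverted shape: <organism> --override--> <PATO quality>
        return (organism_id, predicate_override, object_id)
    # Legacy shape: <source> --biolink:location_of--> <organism>
    return (object_id, "biolink:location_of", organism_id)


cold_row = {"notes": "environment quality; METPO:2000068"}
cold = ("PATO:0001306", "decreased temperature", "isolation_source",
        extract_predicate_override(cold_row))
envo = ("ENVO:01000309", "cold environment", "isolation_source", None)

print(build_edge("NCBITaxon:562", cold))
print(build_edge("NCBITaxon:562", envo))
```

Keeping the override in the mapping tuple means the emit loop needs no knowledge of PATO itself; it only checks whether an override is present.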
Summary
- Reworks `scripts/extract_metpo_proposals.py` to address findings from a Codex adversarial review: optimum datatype properties now use numeric METPO IDs and the `METPO:1000525` domain (matching the existing 2000700-series value-property family); selective-medium phenotype classes (MacConkey/blood/EMB) are removed in favour of the existing `METPO:2000517 grows in` predicate; enzyme test classes (catalase/oxidase/urease/coagulase) carry explicit bridge guidance to `METPO:2000302/2000303`; xerophilic phenotype is reparented out from under osmotic tolerance to fix a self-contradictory placement.
- Adds `kgmicrobe.medium:macconkey_agar` / `:blood_agar` / `:emb_agar` placeholders in `custom_curies.yaml` (each with `predicate: METPO:2000517`); rewrites the two `growth: MacConkey/blood agar` rows in `mappings/canonical/phenotype_mappings.tsv` so the metatraits transform now emits `organism --METPO:2000517 grows in--> kgmicrobe.medium:*`. Bonus: false-majority observations now produce `METPO:2000518 does not grow in` edges instead of being silently dropped.
- Adds the `*_robot.tsv` files to the extractor's diff-check test so manual edits to committed ROBOT templates that don't match the script regenerate now fail; ROBOT template + ELK reason continues to run inside the extractor and is exercised by the test (passes with no UNSAT classes).
Test plan
- `poetry run python scripts/extract_metpo_proposals.py` exits clean and reports `[ok] ROBOT template + ELK reason passed (no UNSAT classes)`
- `poetry run pytest tests/test_extract_metpo_proposals.py tests/test_metatraits.py`: all green (1 + 26 tests)
- `poetry run tox -e lint` clean
- `git diff --stat master...HEAD` shows only the 10 expected files (no drift in regenerated TSVs)
- `mappings/metpo_proposal_classes_robot.tsv`: confirm no `METPO:1007050/1007053/1007054/1007055`, xerophilic (`METPO:1007092`) parented at `METPO:1000000`, enzyme positive/negative definitions reference `METPO:2000302/2000303`
- `mappings/metpo_proposal_properties_robot.tsv`: three optimum properties at `METPO:2000717/2000718/2000719` with domain `METPO:1000525` and range `xsd:decimal`
🤖 Generated with Claude Code