
METPO proposal cohort 2026_04: predicates + class cleanup + transform rewires #563

Merged
realmarcin merged 5 commits into master from metpo_proposal on May 6, 2026

@realmarcin
Collaborator

Summary

  • Reworks the METPO term proposal in scripts/extract_metpo_proposals.py to address findings from a Codex adversarial review: optimum datatype properties now use numeric METPO IDs and the METPO:1000525 domain (matching the existing 2000700-series value-property family); selective-medium phenotype classes (MacConkey/blood/EMB) are removed in favour of the existing METPO:2000517 grows in predicate; enzyme test classes (catalase/oxidase/urease/coagulase) carry explicit bridge guidance to METPO:2000302/2000303; xerophilic phenotype is reparented out from under osmotic tolerance to fix a self-contradictory placement.
  • Wires the transform-side counterpart: mints kgmicrobe.medium:macconkey_agar / :blood_agar / :emb_agar placeholders in custom_curies.yaml (each with predicate: METPO:2000517); rewrites the two growth: MacConkey/blood agar rows in mappings/canonical/phenotype_mappings.tsv so the metatraits transform now emits organism --METPO:2000517 grows in--> kgmicrobe.medium:*. False-majority observations are still silently dropped: METPO:2000517/2000518 share no synonym upstream, so the transform cannot reach METPO:2000518 does not grow in.
  • Tightens validation: adds the two *_robot.tsv files to the extractor's diff-check test, so manual edits to committed ROBOT templates that don't match the script's regenerated output now fail the test; ROBOT template + ELK reason continues to run inside the extractor and is exercised by the test (passes with no UNSAT classes).

Test plan

  • poetry run python scripts/extract_metpo_proposals.py exits clean and reports [ok] ROBOT template + ELK reason passed (no UNSAT classes)
  • poetry run pytest tests/test_extract_metpo_proposals.py tests/test_metatraits.py — all green (1 + 26 tests)
  • poetry run tox -e lint clean
  • git diff --stat master...HEAD shows only the 10 expected files (no drift in regenerated TSVs)
  • Spot-check mappings/metpo_proposal_classes_robot.tsv — confirm no METPO:1007050/1007053/1007054/1007055, xerophilic (METPO:1007092) parented at METPO:1000000, enzyme positive/negative definitions reference METPO:2000302/2000303
  • Spot-check mappings/metpo_proposal_properties_robot.tsv — three optimum properties at METPO:2000717/2000718/2000719 with domain METPO:1000525 and range xsd:decimal

🤖 Generated with Claude Code

realmarcin and others added 2 commits May 5, 2026 20:46
Reworks the METPO term proposal in `scripts/extract_metpo_proposals.py`
so it stops conflicting with existing METPO predicate/domain conventions
and stops asserting a self-contradictory class placement.

- Optimum datatype properties: replace symbolic METPO:has_*_optimum IDs
  with numeric METPO:2000717/2000718/2000719 in the existing 2000700-
  series value-property family; switch domain from biolink:OrganismTaxon
  to METPO:1000525 'microbe'; align labels with the existing
  `has growth X value` convention.
- Selective-medium growth: drop the four phenotype classes
  (METPO:1007050/1007053/1007054/1007055) — METPO already provides
  METPO:2000517/2000518 (`grows in` / `does not grow in`) with range
  METPO:1004005 'growth medium'. Add a DEFERRED_PLACEHOLDERS set so the
  metatraits placeholder validator accepts kgmicrobe.trait:macconkey/
  blood_agar pending a transform refactor that emits the predicate
  pattern.
- Enzyme test classes (catalase/oxidase/urease/coagulase positive/
  negative): keep the neutral test-result class structure but make
  every definition explicitly bridge to METPO:2000302 'shows activity
  of' / METPO:2000303 'does not show activity of' against the GO
  enzyme term, and document the dual-emit requirement in the source
  comment.
- Xerophilic phenotype (METPO:1007092): reparent from METPO:1007073
  'osmotic tolerance' to _PHENO_PARENT — its own definition explicitly
  separates water-activity tolerance from solute-concentration
  tolerance, so subclassing under osmotic tolerance let the reasoner
  infer the very claim the definition denies.

Test side: add the two `*_robot.tsv` files to PROPOSAL_FILES so manual
edits to the committed ROBOT templates that don't match the
extractor's regenerated output now fail the test. ROBOT template + ELK reason
is already invoked from the extractor `main()` and therefore from the
test; it exits non-zero on UNSAT or inconsistency when the `robot`
binary is on PATH.
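
The diff-check idea described above can be sketched as follows. This is a minimal illustration, not the repository's test: `find_drift`, `demo`, and the file names are hypothetical, and the real test compares the files listed in PROPOSAL_FILES against the extractor's output.

```python
import filecmp
import tempfile
from pathlib import Path


def find_drift(committed_dir: Path, regenerated_dir: Path, names: list[str]) -> list[str]:
    """Return names whose regenerated content differs from the committed copy."""
    drifted = []
    for name in names:
        committed = committed_dir / name
        regenerated = regenerated_dir / name
        # A missing file on either side counts as drift.
        if not (committed.exists() and regenerated.exists()):
            drifted.append(name)
        # shallow=False forces a content comparison instead of a stat check.
        elif not filecmp.cmp(committed, regenerated, shallow=False):
            drifted.append(name)
    return drifted


def demo() -> list[str]:
    """Simulate a hand-edited template next to a freshly regenerated one."""
    with tempfile.TemporaryDirectory() as tmp:
        committed = Path(tmp, "committed")
        regenerated = Path(tmp, "regenerated")
        committed.mkdir()
        regenerated.mkdir()
        # Hypothetical file names and contents, for illustration only.
        (committed / "classes_robot.tsv").write_text("ID\tLabel\nX:1\tmanual edit\n")
        (regenerated / "classes_robot.tsv").write_text("ID\tLabel\nX:1\tgenerated\n")
        (committed / "properties_robot.tsv").write_text("ID\tLabel\nX:2\tsame\n")
        (regenerated / "properties_robot.tsv").write_text("ID\tLabel\nX:2\tsame\n")
        names = ["classes_robot.tsv", "properties_robot.tsv"]
        return find_drift(committed, regenerated, names)


demo_drift = demo()
```

A test built on this would assert that the drift list is empty, so any manual edit to a committed template shows up as a named failure.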

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…0517

Retires the kgmicrobe.trait:macconkey_agar_growth /
kgmicrobe.trait:blood_agar_growth phenotype-class encoding (which the
preceding commit removed from the METPO proposal) and replaces it with
the predicate-driven METPO:2000517 'grows in' pattern that METPO
already provides.

- kgmicrobe.medium:* placeholders for media not in MediaDive's catalog:
  - kgmicrobe.medium:macconkey_agar  — MacConkey agar (Gram-negative
    selective/differential)
  - kgmicrobe.medium:blood_agar       — generic blood agar (specific
    Columbia/Heart Infusion variants live as mediadive.medium:* nodes
    when present)
  - kgmicrobe.medium:emb_agar         — Eosin Methylene Blue agar
    (forward-compat; not yet emitted by any transform)

  Each entry carries `category: biolink:GrowthMedium` and an explicit
  `predicate: METPO:2000517` so _build_trait_mapping() resolves to the
  grows-in predicate instead of biolink:has_phenotype.

- Trait-mapping loader: register kgmicrobe.medium → biolink:GrowthMedium
  in microbial_trait_mappings._OBJECT_SOURCE_TO_CATEGORY so phenotype
  rows resolve the object node as a GrowthMedium.

- mappings/canonical/phenotype_mappings.tsv: rewrite the two `growth:
  MacConkey agar` / `growth: blood agar` rows so object_id, object_label,
  object_source, and notes carry the new kgmicrobe.medium:* CURIE and a
  notes scan token of `METPO:2000517` (the loader's
  _resolve_biolink_predicate token-scans notes for METPO:2000xxx
  predicates). For false-
  majority rows the metatraits transform now emits METPO:2000518 (does
  not grow in) — previously these rows were silently dropped because
  biolink:has_phenotype has no negative pair.

- DEFERRED_PLACEHOLDERS in extract_metpo_proposals.py: keep the two
  kgmicrobe.trait:*_agar_growth entries so the placeholder validator
  doesn't fail when scanning a stale data/transformed/metatraits/edges.tsv
  produced before this refactor. New transform runs no longer emit those
  CURIEs.

- tests/test_metatraits.py: update
  test_tier2_false_majority_drops_positive_edges to expect
  kgmicrobe.medium:* objects and to flag
  METPO:2000517 as a positive predicate the assertion guards against.
  Updated docstring to reflect that grows-in rows now have a real
  negative form (METPO:2000518) and are emitted as negative edges
  rather than skipped.
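
The notes token scan mentioned in the list above might look roughly like this. It is a sketch under assumptions: `resolve_predicate` and the regular expression are hypothetical stand-ins, and the actual loader function may apply different rules.

```python
import re

# Assumption: a predicate token is "METPO:2000" followed by exactly three digits,
# matching the "METPO:2000xxx" shape described in the commit message.
METPO_PREDICATE_RE = re.compile(r"\bMETPO:2000\d{3}\b")
DEFAULT_PREDICATE = "biolink:has_phenotype"


def resolve_predicate(notes: str, default: str = DEFAULT_PREDICATE) -> str:
    """Return the first METPO:2000xxx token found in notes, else the default."""
    match = METPO_PREDICATE_RE.search(notes or "")
    return match.group(0) if match else default


grows_in = resolve_predicate("emit METPO:2000517 grows in against the medium node")
class_only = resolve_predicate("legacy class METPO:1007050, no predicate token")
empty = resolve_predicate("")
```

Class IDs in the 1000000 series do not match the pattern, so rows without a predicate token keep the default predicate.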

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 04:13
Contributor

Copilot AI left a comment

Pull request overview

Updates the METPO proposal extraction artifacts and metatraits-side mappings to align selective-medium growth observations with METPO’s existing grows in / does not grow in predicate pattern, while tightening proposal regeneration validation (ROBOT templates must match the extractor output).

Changes:

  • Reworks proposed “optimum” datatype properties to use numeric METPO IDs, domain METPO:1000525, and updated labels/definitions.
  • Migrates selective-medium growth mappings to emit METPO:2000517 edges to newly minted kgmicrobe.medium:* placeholders (MacConkey/blood/EMB agar), and adjusts metatraits tests accordingly.
  • Extends the extractor regression test to diff-check committed ROBOT templates (*_robot.tsv) in addition to the existing proposal TSVs.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/extract_metpo_proposals.py | Updates proposal term definitions, placeholder coverage validation logic, and narrative guidance around selective media + enzyme test modeling. |
| tests/test_metatraits.py | Updates Tier-2 false-majority regression test to treat METPO:2000517 as a positive predicate and updates expected objects for selective media to kgmicrobe.medium:*. |
| tests/test_extract_metpo_proposals.py | Adds ROBOT template TSVs to the “outputs must match committed” diff-check list. |
| mappings/canonical/phenotype_mappings.tsv | Rewrites MacConkey/blood agar mappings to kgmicrobe.medium:* objects and documents intended METPO:2000517 predicate usage. |
| kg_microbe/utils/microbial_trait_mappings.py | Registers kgmicrobe.medium object_source → biolink:GrowthMedium category mapping. |
| kg_microbe/transform_utils/custom_curies.yaml | Adds kgmicrobe.medium:* placeholders (MacConkey/blood/EMB agar) with predicate: METPO:2000517. |
| mappings/metpo_proposal_quantitative.tsv | Regenerated quantitative proposal TSV reflecting new numeric optimum datatype properties. |
| mappings/metpo_proposal_categorical.tsv | Regenerated categorical proposal TSV reflecting removed selective-medium phenotype classes and updated enzyme/xerophily text. |
| mappings/metpo_proposal_classes_robot.tsv | Regenerated ROBOT class template reflecting removed selective-medium classes and updated definitions/parents. |
| mappings/metpo_proposal_properties_robot.tsv | Regenerated ROBOT property template reflecting numeric optimum properties and updated domain/range. |

Comment thread scripts/extract_metpo_proposals.py Outdated
Comment thread scripts/extract_metpo_proposals.py Outdated
Comment thread tests/test_metatraits.py
realmarcin and others added 3 commits May 5, 2026 21:30
Copilot review on PR #563:

- scripts/extract_metpo_proposals.py: rewrite the selective-medium block
  comment so it reflects the current state — the kgmicrobe.medium:* /
  METPO:2000517 rewrite is DONE in mappings/canonical/phenotype_mappings.tsv
  and custom_curies.yaml; the legacy kgmicrobe.trait:*_agar_growth
  placeholders only persist in stale data/transformed/metatraits/edges.tsv
  files, not in new transform output. Future work is migrating the
  kgmicrobe.medium:* placeholders to a stable external IRI.
- scripts/extract_metpo_proposals.py: reword the placeholder-coverage
  status print so "deferred to transform refactor" reads as "legacy
  placeholders carried over in stale edges.tsv (already retired
  transform-side)" — no outstanding work is implied.
- tests/test_metatraits.py: revise the docstring + add a positive
  assertion. The earlier docstring overclaimed that false-majority
  grows-in rows now emit METPO:2000518 'does not grow in' edges; in
  fact the upstream METPO ontology does not give 2000517/2000518 a
  shared synonym (other paired predicates pair via shared synonyms,
  e.g. 'assimilation' for 2000002/2000027), so
  _get_negative_predicate(2000517) returns None and the row is still
  silently dropped — same as the prior biolink:has_phenotype encoding.
  Test now asserts both the absence of any positive edge AND the
  absence of any edge of any kind to kgmicrobe.medium:macconkey_agar /
  blood_agar for false-majority rows, so a future METPO synonym fix
  will surface as a controlled test failure instead of a silent
  behaviour change.
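
The two-part assertion described in the last item can be sketched like this. The edge tuples, the organism CURIE, and the helper name `split_guarded_edges` are illustrative assumptions, not the contents of tests/test_metatraits.py.

```python
# Each edge is (subject, predicate, object). The organism CURIE is illustrative.
Edge = tuple[str, str, str]

POSITIVE_PREDICATES = {"biolink:has_phenotype", "METPO:2000517"}
GUARDED_OBJECTS = {
    "kgmicrobe.medium:macconkey_agar",
    "kgmicrobe.medium:blood_agar",
}


def split_guarded_edges(edges: list[Edge]) -> tuple[list[Edge], list[Edge]]:
    """Return (positive edges, edges of any kind) that target a guarded medium."""
    any_kind = [e for e in edges if e[2] in GUARDED_OBJECTS]
    positive = [e for e in any_kind if e[1] in POSITIVE_PREDICATES]
    return positive, any_kind


# Simulated transform output for an organism whose MacConkey observations are
# majority-false: today the row is dropped, so no edge reaches the medium node.
emitted: list[Edge] = [
    ("NCBITaxon:562", "biolink:has_phenotype", "METPO:1000000"),
]
positive_edges, all_edges = split_guarded_edges(emitted)

# If an upstream synonym fix made 2000518 reachable, the transform would start
# emitting a negative edge and the second check would catch it.
future: list[Edge] = emitted + [
    ("NCBITaxon:562", "METPO:2000518", "kgmicrobe.medium:macconkey_agar"),
]
future_positive, future_all = split_guarded_edges(future)
```

Checking both lists separates "no false positive was emitted" from "nothing was emitted at all", which is what turns a future behaviour change into a visible test failure.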

Plus a separate fix for an unrelated build-time warning the user
flagged (5 family-mismatched mappings in
mappings/isolation_source_to_ontology.tsv that the loader was dropping):

- `Psychrophilic-<10°C`: retarget METPO:1000614 -> ENVO:01000309
  ('cold environment'), exactMatch + ManualMappingCuration. The
  BacDive label is a categorical thermal-regime tag for the *isolation
  environment*, not the *organism trait*; ENVO:01000309 is the
  isolation-source-shaped equivalent (the <10°C threshold is the bin
  definition, not a different concept).
- `Thermophilic->45°C`: retarget METPO:1000616 -> ENVO:01000305 ('high
  temperature environment'), same reasoning.
- Anoxic-anaerobic, Child, Non-marine-Saline-and-Alkaline: clear the
  PATO target (no clean ENVO replacement available — ENVO has no
  generic 'anoxic environment' / host life-stage / non-marine
  saline+alkaline class), leaving these rows as explicitly unmapped.
  The loader silently skips empty-object_id rows so the noisy
  family-mismatched WARNING goes away; the BacDive transform falls
  back to its isolation_source:* placeholder, the same behaviour the
  family-mismatch guard was already producing.
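
As a rough picture of why clearing the target silences the warning, here is a sketch of a loader that skips rows with an empty object_id. The column names and `load_mappings` are assumptions for illustration; the real load_isolation_source_mappings has more checks.

```python
import csv
import io

# Hypothetical three-column excerpt; the real TSV has more columns.
SAMPLE_TSV = (
    "label\tobject_id\tobject_label\n"
    "psychrophilic <10°c\tENVO:01000309\tcold environment\n"
    "anoxic-anaerobic\t\t\n"
)


def load_mappings(handle) -> dict[str, tuple[str, str]]:
    """Load label -> (object_id, object_label), skipping explicitly unmapped rows."""
    mappings = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        object_id = (row.get("object_id") or "").strip()
        if not object_id:
            # Empty target: the row is documented as unmapped and produces no
            # warning; the caller falls back to its placeholder node.
            continue
        mappings[row["label"].strip().lower()] = (object_id, row["object_label"])
    return mappings


loaded = load_mappings(io.StringIO(SAMPLE_TSV))
```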

Verified: poetry run python scripts/extract_metpo_proposals.py reports
ROBOT template + ELK reason passed; pytest tests/test_metatraits.py +
tests/test_extract_metpo_proposals.py 27 tests green; tox -e lint
clean; load_isolation_source_mappings emits no warnings for any of
the five labels and loads ENVO:01000309 / ENVO:01000305 for the two
thermal-regime entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New paired predicate METPO:2000064 'tolerates' / METPO:2000065 'does
not tolerate' (synonyms 'tolerance' on both, plus 'is susceptible to'
and 'is sensitive to' on the negative member), placed in the
2000064-2000070 gap of the chemical-interaction object-property
family. Both members carry the shared `tolerance` synonym so the
metatraits transform's `_build_metpo_lookups` /
`_get_negative_predicate` mechanism auto-pairs them — same convention
as METPO:2000002/2000027 (assimilates / does not assimilate, paired
via the shared `assimilation` synonym). Domain METPO:1000525
'microbe', range METPO:1000526 'chemical entity'.
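
The shared-synonym pairing described above can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, and the synonym sets for METPO:2000517/2000518 are left empty only to model "no shared synonym", not to reflect their real annotations.

```python
from collections import defaultdict
from typing import Optional

PREDICATES = {
    "METPO:2000002": {"label": "assimilates", "synonyms": {"assimilation"}},
    "METPO:2000027": {"label": "does not assimilate", "synonyms": {"assimilation"}},
    "METPO:2000064": {"label": "tolerates", "synonyms": {"tolerance"}},
    "METPO:2000065": {
        "label": "does not tolerate",
        "synonyms": {"tolerance", "is susceptible to", "is sensitive to"},
    },
    # Illustrative: modelled with no synonyms, so nothing is shared.
    "METPO:2000517": {"label": "grows in", "synonyms": set()},
    "METPO:2000518": {"label": "does not grow in", "synonyms": set()},
}


def build_negative_lookup(predicates: dict) -> dict[str, str]:
    """Pair each positive predicate with the negative one sharing a synonym."""
    by_synonym = defaultdict(list)
    for curie, info in predicates.items():
        for synonym in info["synonyms"]:
            by_synonym[synonym].append(curie)

    lookup = {}
    for curies in by_synonym.values():
        # The "does not <stem>" label convention marks the negative member.
        negatives = [c for c in curies if predicates[c]["label"].startswith("does not ")]
        positives = [c for c in curies if c not in negatives]
        if len(positives) == 1 and len(negatives) == 1:
            lookup[positives[0]] = negatives[0]
    return lookup


NEGATIVE_LOOKUP = build_negative_lookup(PREDICATES)


def get_negative_predicate(curie: str) -> Optional[str]:
    """Return the paired negative predicate, or None when no pair is reachable."""
    return NEGATIVE_LOOKUP.get(curie)
```

Under this scheme a negative member such as METPO:2000065 has no entry of its own, which is consistent with rows mapped directly to it being dropped on a false majority.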

Transform-side wiring:
- mappings/canonical/phenotype_mappings.tsv: rewrite `growth: bile
  acid susceptible` to route to CHEBI:3098 'bile acid' with predicate
  METPO:2000065 'does not tolerate' (notes-token scan picks up the
  METPO predicate, source CHEBI resolves to biolink:ChemicalEntity).
- custom_curies.yaml: drop the `kgmicrobe.trait:bile_susceptible`
  entry since the transform no longer emits it.
- extract_metpo_proposals.py: drop bile_susceptible from
  KGMICROBE_PLACEHOLDER_MIGRATION; add it to DEFERRED_PLACEHOLDERS so
  the validator does not fail on stale edges.tsv.

Proposal-side cleanup (removed superseded classes):
- METPO:1007051 'bile acid response' parent — gone
- METPO:1007056 'bile acid susceptible' child — gone
Both were a class-subdivision encoding of the same relationship the
new predicate pair expresses parametrically. The pattern generalises
to future heavy-metal / antibiotic / detergent susceptibility traits
without minting one class per (challenge_chemical, polarity) pair.

Tests:
- tests/test_metatraits.py false-majority test now expects
  `kgmicrobe.medium:macconkey_agar`, `kgmicrobe.medium:blood_agar`,
  AND `CHEBI:3098` to all silently drop for false majority, because
  the mapped predicate has no reachable inverse:
  - 2000517 grows in: pair lacks shared synonym upstream, so 2000518
    is unreachable
  - 2000065 does not tolerate: row's mapped predicate is the negative
    member of the new pair, so the inverse-of-an-inverse is undefined
  Both surface as silent drops today, matching the pre-refactor
  behaviour. The positive-predicate guard set now also flags
  METPO:2000064 / 2000065.

Skill update (.claude/skills/metpo-proposal/SKILL.md):
- New "Numeric-ID range conventions" section: which numeric ranges
  hold which kind of term (chemical-interaction object properties
  vs value datatype properties vs class subranges) and which gaps
  are currently free.
- New "Paired predicates" section: the `does not <stem>` label
  convention + the shared-synonym requirement for auto-pairing,
  with a worked example (METPO:2000002/2000027) and a
  counter-example (METPO:2000517/2000518 — broken upstream).
- New "Predicate vs class" section: when to mint a paired predicate
  vs a class hierarchy. Codifies the pattern of removing
  chemical/medium/enzyme-encoding class hierarchies in favour of
  predicate-driven edges (3 such removals so far in this cohort:
  selective-medium classes, bile-acid response classes).
- Skill checklist extended with three new items covering ID range
  placement, paired-predicate synonym verification, and the
  predicate-vs-class audit.

Verified: ROBOT template + ELK reason passes (5 properties, 37
classes); 27 tests green; tox -e lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TO rows

Mints two new ObjectProperty predicates in the chemical-interaction
range gap (METPO:2000067-2000070):

- METPO:2000067 'isolated from host with quality'  (microbe -> PATO host
  quality: Child, Juvenile, Female, Male)
- METPO:2000068 'isolated from environment with quality'  (microbe ->
  PATO environment quality: Acidic, Alkaline, Cold, Anoxic-anaerobic,
  Non-marine-Saline-and-Alkaline)

Both have domain METPO:1000525 'microbe' and range PATO:0000001 'quality'.
Solves the conflation Codex review #558 originally flagged ('Child' →
PATO juvenile etc. produced incoherent edges under the legacy <source>
--biolink:location_of--> <organism> shape, because a microbe cannot be
'located_in' a quality). The new predicates carry the 'isolated from
{host,environment} with quality' semantics in the predicate itself, so
the PATO term is a quality of the source rather than a broken location
target.

Transform-side wiring:
- mappings/isolation_source_to_ontology.tsv: restore 9 rows that were
  previously cleared (3 in this session, 6 in earlier family_mismatch_fix
  sweeps). Each carries the appropriate METPO:2000067 / METPO:2000068
  token in `notes`. Cold retargeted from the obsolete PATO:0000256 to
  the active PATO:0001306 'decreased temperature'.
  Non-marine-Saline-and-Alkaline restored as skos:closeMatch (loses
  saline + non-marine aspects), so it loads the PATO target into the
  mapping table for documentation but is dropped at runtime by the
  trust check pending a future ENVO 'alkaline saline non-marine
  environment' refinement.
- kg_microbe/utils/isolation_source_mapping_utils.py:
  - Add PREDICATE_OVERRIDE_CURIES = {METPO:2000067, METPO:2000068}.
  - Add _extract_predicate_override(row): scans `notes` for an override
    CURIE.
  - _row_passes_family_check: PATO targets are now allowed when the
    row carries an override token (METPO targets are NOT exception-
    eligible — METPO terms describe organism phenotypes, not source
    qualities).
  - load_isolation_source_mappings now returns
    Dict[str, Tuple[object_id, object_label, object_source,
    predicate_override]] (was 2-tuple). predicate_override is None for
    non-PATO rows and the override CURIE for PATO rows.
- mappings/validate_isolation_source_mappings.py: mirror the new
  exception logic (mirrored from the runtime loader, kept stdlib-only
  so it can run in minimal CI containers; the existing
  test_validator_rules_match_loader regression test catches drift).
- kg_microbe/transform_utils/bacdive/bacdive.py:
  - Build-time prefix check now reads the first tuple element via
    `mapping_value[0]` instead of unpacking 2-tuples, accommodating
    the extended return type.
  - Runtime emit loop branches on `predicate_override`: when set, emit
    the inverted edge `<organism> --predicate_override--> <PATO term>`
    (no source node write — PATO is loaded by the ontologies transform
    via ONTOLOGIES_MAP). Otherwise, fall through to the legacy
    location_of edge shape.
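
The override flow in the list above can be sketched end to end. The function names mirror the ones in the commit message but the bodies are guesses: the family check is heavily simplified, and the organism CURIE and row contents are illustrative.

```python
import re
from typing import Optional

PREDICATE_OVERRIDE_CURIES = {"METPO:2000067", "METPO:2000068"}

# (object_id, object_label, object_source, predicate_override)
MappingValue = tuple[str, str, str, Optional[str]]


def extract_predicate_override(row: dict) -> Optional[str]:
    """Scan the notes column for one of the override CURIEs."""
    for token in re.findall(r"METPO:\d+", row.get("notes") or ""):
        if token in PREDICATE_OVERRIDE_CURIES:
            return token
    return None


def row_passes_family_check(row: dict) -> bool:
    """Simplified family check: PATO needs an override, METPO is never allowed."""
    prefix = row["object_id"].split(":", 1)[0]
    if prefix == "PATO":
        return extract_predicate_override(row) is not None
    # Assumption: every other prefix is accepted in this sketch.
    return prefix != "METPO"


def build_edge(organism: str, value: MappingValue) -> tuple[str, str, str]:
    """Emit the inverted edge for override rows, else the legacy location_of edge."""
    object_id, _label, _source, predicate_override = value
    if predicate_override:
        return (organism, predicate_override, object_id)
    return (object_id, "biolink:location_of", organism)


cold_row = {"object_id": "PATO:0001306", "notes": "emit METPO:2000068"}
bare_pato_row = {"object_id": "PATO:0001306", "notes": ""}

cold_value: MappingValue = (
    "PATO:0001306", "decreased temperature", "PATO",
    extract_predicate_override(cold_row),
)
envo_value: MappingValue = ("ENVO:01000309", "cold environment", "ENVO", None)

override_edge = build_edge("NCBITaxon:562", cold_value)
legacy_edge = build_edge("NCBITaxon:562", envo_value)
```

Carrying the override inside the mapping value keeps the decision in the mapping table, so the emit loop only needs a single branch.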

Tests:
- tests/test_isolation_source_mapping_utils.py:
  - Existing tests updated to expect the 4-tuple return.
  - New test_loader_loads_pato_rows_with_predicate_override covers
    the 8 trusted PATO rows (4 host + 4 environment) and the
    non-marine-saline-and-alkaline closeMatch row that loads PATO
    into the in-memory table but is dropped at runtime by trust.
  - test_loader_honors_manually_curated_fixes: 'child' assertion
    flipped from is None to the new 4-tuple; 'psychrophilic <10°c'
    flipped from is None to the ENVO retarget from the previous
    commit.

Verified: ROBOT template + ELK reason still passes (now 7 OWL property
rows, 37 OWL class rows); 37 tests green (pytest
test_isolation_source_mapping_utils.py + test_extract_metpo_proposals.py
+ test_metatraits.py); ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin changed the title from "METPO proposal: align with predicate conventions + medium placeholders" to "METPO proposal cohort 2026_04: predicates + class cleanup + transform rewires" May 6, 2026
@realmarcin realmarcin requested a review from Copilot May 6, 2026 17:13
@realmarcin realmarcin merged commit 920cda0 into master May 6, 2026
7 checks passed
@realmarcin realmarcin deleted the metpo_proposal branch May 6, 2026 17:59