Record METPO GC-bin label corrections (berkeleybop/metpo#432) #566

Merged: realmarcin merged 1 commit into master from metpo-issue-432-gc-label-corrections on May 19, 2026

@realmarcin
Collaborator

Summary

Adds a new proposal artifact mappings/metpo_label_corrections.tsv that captures upstream label-fix requests for existing METPO terms whose rdfs:label disagrees with their numeric-threshold synonyms. Each row cites a berkeleybop/metpo issue. This is distinct from metpo_existing_aliases.tsv (which routes a kg-microbe-side label to an existing METPO ID as-is) — corrections request that the upstream record itself be amended.

First cohort: berkeleybop/metpo#432

The four GC-content bin records have labels that don't line up with their numeric-threshold synonyms. The synonyms are the source of truth; when the records are sorted by ascending GC%, three of the four labels are wrong:

| METPO ID | Current label | Corrected label | Threshold synonym |
| --- | --- | --- | --- |
| METPO:1000432 | GC high | GC low | GC_<=42.65 (lowest range) |
| METPO:1000429 | GC low | GC mid1 | GC_42.65_57.0 (second-lowest) |
| METPO:1000431 | GC mid2 | GC mid2 (no change) | GC_57.0_66.3 (already correct) |
| METPO:1000430 | GC mid1 | GC high | GC_>66.3 (highest range) |
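
The relabelling in this table can be reproduced mechanically from the synonyms. The following is a minimal sketch, not the code in scripts/extract_metpo_proposals.py: `RECORDS`, `EXPECTED_ORDER`, `sort_key`, and `propose_corrections` are hypothetical names, and the synonym parsing assumes only the three shapes seen here (`GC_<=x`, `GC_x_y`, `GC_>x`).

```python
# Illustrative only: derive the expected labels by ranking the threshold synonyms.
# (ID, current label, threshold synonym), copied from the table in this PR.
RECORDS = [
    ("METPO:1000429", "GC low", "GC_42.65_57.0"),
    ("METPO:1000430", "GC mid1", "GC_>66.3"),
    ("METPO:1000431", "GC mid2", "GC_57.0_66.3"),
    ("METPO:1000432", "GC high", "GC_<=42.65"),
]

# Labels in ascending GC% order.
EXPECTED_ORDER = ["GC low", "GC mid1", "GC mid2", "GC high"]


def sort_key(synonym: str) -> tuple[float, float]:
    """Turn a threshold synonym into a (lower, upper) pair for sorting."""
    body = synonym.removeprefix("GC_")  # Python 3.9+
    if body.startswith("<="):
        return (float("-inf"), float(body[2:]))
    if body.startswith(">"):
        return (float(body[1:]), float("inf"))
    lower, upper = body.split("_")
    return (float(lower), float(upper))


def propose_corrections(records):
    """Return {metpo_id: (current_label, expected_label)} for mismatched rows."""
    ranked = sorted(records, key=lambda rec: sort_key(rec[2]))
    return {
        metpo_id: (current, expected)
        for (metpo_id, current, _), expected in zip(ranked, EXPECTED_ORDER)
        if current != expected
    }


corrections = propose_corrections(RECORDS)
for metpo_id, (current, expected) in corrections.items():
    print(f"{metpo_id}: {current} -> {expected}")
```

Ranking by the numeric range rather than by METPO ID matters here, because the IDs were not assigned in GC% order.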

Why this works as a separate artifact

KG-Microbe transforms don't directly reference any of METPO:1000429-1000432. The GC-content bins in custom_curies.yaml use independent gc:low/mid1/mid2/high placeholders that are already correctly labeled by numeric range:

gc:low   = "GC content <= 42.65%"     ← matches METPO:1000432 numeric range
gc:mid1  = "GC content 42.65% - 57.0%"
gc:mid2  = "GC content 57.0% - 66.3%"
gc:high  = "GC content > 66.3%"        ← matches METPO:1000430 numeric range

The merged graph is therefore unaffected by the upstream bug. The label-corrections TSV is purely a curator-facing artifact for the upstream PR — once berkeleybop/metpo#432 is fixed, the entries become stale and the freshness check will fail loudly.
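
For illustration, the placeholder ranges can be read as a simple binning rule. This is a hedged sketch rather than the transform's actual code: `gc_bin` is a hypothetical helper, and treating 57.0 as belonging to gc:mid1 is an assumption, since the labels only fix the 42.65 and 66.3 boundaries unambiguously.

```python
def gc_bin(gc_percent: float) -> str:
    """Map a GC percentage to one of the gc:* placeholder CURIEs.

    Assumption: each bin includes its upper bound, which is consistent with
    'GC content <= 42.65%' and 'GC content > 66.3%' but is a guess at 57.0.
    """
    if not 0.0 <= gc_percent <= 100.0:
        raise ValueError(f"GC percentage out of range: {gc_percent}")
    if gc_percent <= 42.65:
        return "gc:low"
    if gc_percent <= 57.0:
        return "gc:mid1"
    if gc_percent <= 66.3:
        return "gc:mid2"
    return "gc:high"


for value in (38.0, 50.8, 61.2, 72.1):
    print(value, gc_bin(value))
```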

Implementation

  • scripts/extract_metpo_proposals.py:
    • new LabelCorrection dataclass + LABEL_CORRECTION_HEADER
    • new METPO_LABEL_CORRECTIONS list with 4 GC-bin entries (3 changes + 1 unchanged-for-completeness)
    • new validate_label_corrections(): reads data/transformed/ontologies/metpo_nodes.tsv and confirms each entry's current_label still matches what METPO actually has — fails if upstream has already fixed a row (stale entry) or the METPO ID is missing entirely
    • new write_label_correction_tsv() writer
    • main() invokes both, after the existing-aliases write
  • tests/test_extract_metpo_proposals.py: adds metpo_label_corrections.tsv to PROPOSAL_FILES so the regenerate-and-diff CI gate catches drift between the script and the committed TSV
  • .claude/skills/metpo-proposal/SKILL.md: documents the new artifact in the artifacts table + adds a skill-checklist item for the freshness check
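
The freshness check described in this list might look roughly like the sketch below. It is not the PR's implementation: `LabelCorrectionSketch`, `load_labels`, and `find_stale_corrections` are hypothetical names, every field other than `current_label` is assumed, and the `id` and `name` column names for metpo_nodes.tsv are an assumption about the snapshot layout.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelCorrectionSketch:
    metpo_id: str
    current_label: str    # what upstream METPO is expected to have today
    corrected_label: str  # what the upstream record should be amended to
    github_ref: str       # e.g. "berkeleybop/metpo#432"


def load_labels(handle) -> dict[str, str]:
    """Read a nodes TSV into {id: label}; column names are assumed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: row["name"] for row in reader}


def find_stale_corrections(corrections, upstream_labels) -> list[str]:
    """Return one message per entry that is missing or no longer matches."""
    problems = []
    for entry in corrections:
        actual = upstream_labels.get(entry.metpo_id)
        if actual is None:
            problems.append(f"{entry.metpo_id}: missing from METPO snapshot")
        elif actual != entry.current_label:
            problems.append(
                f"{entry.metpo_id}: stale entry, snapshot has {actual!r}, "
                f"expected {entry.current_label!r}"
            )
    return problems


# In-memory stand-in for the snapshot: 1000429 already fixed, 1000430 absent.
snapshot = io.StringIO(
    "id\tname\nMETPO:1000432\tGC high\nMETPO:1000429\tGC mid1\n"
)
ref = "berkeleybop/metpo#432"
entries = [
    LabelCorrectionSketch("METPO:1000432", "GC high", "GC low", ref),
    LabelCorrectionSketch("METPO:1000429", "GC low", "GC mid1", ref),
    LabelCorrectionSketch("METPO:1000430", "GC mid1", "GC high", ref),
]
problems = find_stale_corrections(entries, load_labels(snapshot))
for message in problems:
    print(message)
```

Returning a list of messages rather than raising on the first mismatch lets a caller report every stale row in one run and then exit non-zero.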

Verification

$ poetry run python scripts/extract_metpo_proposals.py
…
[ok] label-correction freshness check passed for 4 upstream label-fix request(s)
[ok] metpo_label_corrections.tsv          (4 upstream label-fix request(s); 3 actual label change(s) — see GitHub refs in TSV)
…
[ok] ROBOT template + ELK reason passed (no UNSAT classes)

$ poetry run pytest tests/test_extract_metpo_proposals.py -v
PASSED
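
The regenerate-and-diff gate exercised by this pytest run can be pictured as follows. This is a sketch under stated assumptions, not the test in tests/test_extract_metpo_proposals.py: `HEADER`, `render_tsv`, and `drift` are hypothetical, and the column names are invented for illustration.

```python
import csv
import difflib
import io

# Hypothetical column names; the real LABEL_CORRECTION_HEADER is not shown in this PR page.
HEADER = ["metpo_id", "current_label", "corrected_label", "github_ref"]


def render_tsv(rows) -> str:
    """Serialise rows to TSV text with a fixed header and '\n' line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


def drift(committed: str, regenerated: str) -> list[str]:
    """Return unified-diff lines; an empty list means the file is up to date."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="committed",
            tofile="regenerated",
            lineterm="",
        )
    )


rows = [("METPO:1000432", "GC high", "GC low", "berkeleybop/metpo#432")]
committed_text = render_tsv(rows)

# Simulate someone editing the script without regenerating the TSV.
edited_rows = [("METPO:1000432", "GC high", "GC lowest", "berkeleybop/metpo#432")]
for line in drift(committed_text, render_tsv(edited_rows)):
    print(line)
```

A test built this way would assert that `drift(...)` is empty for each file in PROPOSAL_FILES, so any change to the script's data that is not mirrored in the committed TSV fails CI.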

🤖 Generated with Claude Code

Adds a new proposal artifact `mappings/metpo_label_corrections.tsv` that
captures upstream label-fix requests for EXISTING METPO terms whose
rdfs:label disagrees with their numeric-threshold synonyms. Each entry
cites a berkeleybop/metpo issue so the upstream maintainer has the
audit trail. This is distinct from `metpo_existing_aliases.tsv`:
aliases route a kg-microbe-side label to an existing METPO ID as-is;
corrections request that the upstream record itself be amended.

First cohort: the four GC-content bin records (berkeleybop/metpo#432).
Sorted by ascending GC%, the numeric-threshold synonyms are the source
of truth; three of the four labels are wrong:

  METPO ID       current → corrected   threshold synonym
  METPO:1000432  GC high  → GC low     GC_<=42.65        (lowest range)
  METPO:1000429  GC low   → GC mid1    GC_42.65_57.0     (second-lowest)
  METPO:1000431  GC mid2  → GC mid2    GC_57.0_66.3      (already correct)
  METPO:1000430  GC mid1  → GC high    GC_>66.3          (highest range)

The 'GC mid2' row is listed for completeness with no change — recorded
so curators know all four GC bins have been audited together.

KG-Microbe transforms don't directly reference any of METPO:1000429-432;
they emit their own correctly-labelled `gc:low/mid1/mid2/high`
placeholders from custom_curies.yaml. The merged graph is unaffected
by the upstream bug. The label-corrections TSV is purely a curator-
facing artifact for the upstream PR.

Implementation:

- scripts/extract_metpo_proposals.py:
  - new `LabelCorrection` dataclass + `LABEL_CORRECTION_HEADER`.
  - new `METPO_LABEL_CORRECTIONS` list with the 4 GC-bin entries.
  - new `validate_label_corrections()` freshness check that reads the
    local METPO snapshot and confirms each `current_label` still
    matches what METPO actually has today — fails loudly if upstream
    has already fixed a row (stale entry) or if a METPO ID is missing
    entirely.
  - new `write_label_correction_tsv()` writer.
  - main() invokes both, after the existing-aliases write.

- tests/test_extract_metpo_proposals.py: adds
  `metpo_label_corrections.tsv` to PROPOSAL_FILES so the regenerate-
  and-diff CI gate catches drift between the script and the committed
  TSV.

- .claude/skills/metpo-proposal/SKILL.md: documents the new artifact
  in the artifacts table and adds a skill-checklist item that
  cross-references the freshness check.

Verified: extractor exits clean reporting `[ok] label-correction
freshness check passed for 4 upstream label-fix request(s)` and `[ok]
metpo_label_corrections.tsv (4 upstream label-fix request(s); 3
actual label change(s))`; pytest tests/test_extract_metpo_proposals.py
passes; tox-linted tree clean (the 9 ruff hits in scripts/ predate
this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 00:16
Contributor

Copilot AI left a comment

Pull request overview

Adds a new curator-facing artifact mappings/metpo_label_corrections.tsv that tracks upstream label-fix requests for existing METPO terms whose rdfs:label disagrees with their numeric-threshold synonyms. The first cohort addresses berkeleybop/metpo#432 (four GC-content bin records with mis-ordered labels). A freshness validator reads the local METPO snapshot to ensure each correction's current_label still matches upstream, failing loudly if entries are stale or IDs missing.

Changes:

  • Introduce LabelCorrection dataclass, METPO_LABEL_CORRECTIONS cohort, validate_label_corrections(), and write_label_correction_tsv() in scripts/extract_metpo_proposals.py; wire into main().
  • Commit generated mappings/metpo_label_corrections.tsv and add it to PROPOSAL_FILES so the regenerate-and-diff CI gate catches drift.
  • Document the new artifact and freshness check in .claude/skills/metpo-proposal/SKILL.md.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| scripts/extract_metpo_proposals.py | Adds dataclass, header, cohort list, validator, writer, and main() wiring for label corrections |
| mappings/metpo_label_corrections.tsv | New generated TSV with the four GC-bin label-correction rows |
| tests/test_extract_metpo_proposals.py | Adds new TSV to PROPOSAL_FILES for CI regenerate-and-diff check |
| .claude/skills/metpo-proposal/SKILL.md | Documents the new artifact and adds a checklist item |

@realmarcin realmarcin merged commit 852360d into master May 19, 2026
7 checks passed
@realmarcin realmarcin deleted the metpo-issue-432-gc-label-corrections branch May 19, 2026 00:21
2 participants