Record METPO GC-bin label corrections (berkeleybop/metpo#432) #566
Merged
Adds a new proposal artifact `mappings/metpo_label_corrections.tsv` that captures upstream label-fix requests for EXISTING METPO terms whose rdfs:label disagrees with their numeric-threshold synonyms. Each entry cites a berkeleybop/metpo issue so the upstream maintainer has the audit trail.

This is distinct from `metpo_existing_aliases.tsv`: aliases route a kg-microbe-side label to an existing METPO ID as-is; corrections request that the upstream record itself be amended.

First cohort: the four GC-content bin records (berkeleybop/metpo#432). Sorted by ascending GC%, the numeric-threshold synonyms are the source of truth; three of the four labels are wrong:

| METPO ID | Current label | Corrected label | Threshold synonym |
|---|---|---|---|
| METPO:1000432 | GC high | GC low | GC_<=42.65 (lowest range) |
| METPO:1000429 | GC low | GC mid1 | GC_42.65_57.0 (second-lowest) |
| METPO:1000431 | GC mid2 | GC mid2 | GC_57.0_66.3 (already correct) |
| METPO:1000430 | GC mid1 | GC high | GC_>66.3 (highest range) |

The 'GC mid2' row is listed for completeness with no change, recorded so curators know all four GC bins have been audited together.

KG-Microbe transforms don't directly reference any of METPO:1000429-432; they emit their own correctly-labelled `gc:low/mid1/mid2/high` placeholders from custom_curies.yaml. The merged graph is unaffected by the upstream bug. The label-corrections TSV is purely a curator-facing artifact for the upstream PR.

Implementation:
- scripts/extract_metpo_proposals.py:
  - new `LabelCorrection` dataclass + `LABEL_CORRECTION_HEADER`.
  - new `METPO_LABEL_CORRECTIONS` list with the 4 GC-bin entries.
  - new `validate_label_corrections()` freshness check that reads the local METPO snapshot and confirms each `current_label` still matches what METPO actually has today; it fails loudly if upstream has already fixed a row (stale entry) or if a METPO ID is missing entirely.
  - new `write_label_correction_tsv()` writer.
  - main() invokes both, after the existing-aliases write.
- tests/test_extract_metpo_proposals.py: adds `metpo_label_corrections.tsv` to PROPOSAL_FILES so the regenerate-and-diff CI gate catches drift between the script and the committed TSV.
- .claude/skills/metpo-proposal/SKILL.md: documents the new artifact in the artifacts table and adds a skill-checklist item that cross-references the freshness check.

Verified: extractor exits clean reporting `[ok] label-correction freshness check passed for 4 upstream label-fix request(s)` and `[ok] metpo_label_corrections.tsv (4 upstream label-fix request(s); 3 actual label change(s))`; pytest tests/test_extract_metpo_proposals.py passes; tox-linted tree clean (the 9 ruff hits in scripts/ predate this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
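For readers who want a concrete picture of the cohort, here is a minimal sketch of how the four GC-bin entries could be modelled. It is not the code in `scripts/extract_metpo_proposals.py`: only the names `LabelCorrection`, `METPO_LABEL_CORRECTIONS`, and `current_label` come from this PR, while the remaining field names and the helper `count_actual_changes` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelCorrection:
    """One upstream label-fix request (field names other than current_label are assumed)."""

    metpo_id: str
    current_label: str
    corrected_label: str
    threshold_synonym: str
    upstream_issue: str


# The four GC-bin rows listed in this PR, including the unchanged 'GC mid2' row.
METPO_LABEL_CORRECTIONS = [
    LabelCorrection("METPO:1000432", "GC high", "GC low", "GC_<=42.65", "berkeleybop/metpo#432"),
    LabelCorrection("METPO:1000429", "GC low", "GC mid1", "GC_42.65_57.0", "berkeleybop/metpo#432"),
    LabelCorrection("METPO:1000431", "GC mid2", "GC mid2", "GC_57.0_66.3", "berkeleybop/metpo#432"),
    LabelCorrection("METPO:1000430", "GC mid1", "GC high", "GC_>66.3", "berkeleybop/metpo#432"),
]


def count_actual_changes(corrections):
    """Hypothetical helper: count rows whose corrected label differs from the current one."""
    return sum(1 for c in corrections if c.current_label != c.corrected_label)
```

With this data, the list holds four requests and `count_actual_changes` reports three, which lines up with the "4 upstream label-fix request(s); 3 actual label change(s)" message quoted above.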
Pull request overview
Adds a new curator-facing artifact mappings/metpo_label_corrections.tsv that tracks upstream label-fix requests for existing METPO terms whose rdfs:label disagrees with their numeric-threshold synonyms. The first cohort addresses berkeleybop/metpo#432 (four GC-content bin records with mis-ordered labels). A freshness validator reads the local METPO snapshot to ensure each correction's current_label still matches upstream, failing loudly if entries are stale or IDs missing.
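The mis-ordering can be reproduced mechanically. The sketch below, which is illustrative and not part of the PR, takes the current labels and threshold synonyms reported for berkeleybop/metpo#432, sorts the four terms by the lower bound encoded in each synonym, and compares the result with the expected low/mid1/mid2/high order. The parsing of the synonym strings and all function names are assumptions.

```python
# Current labels and threshold synonyms as reported in this PR.
GC_BINS = {
    "METPO:1000432": ("GC high", "GC_<=42.65"),
    "METPO:1000429": ("GC low", "GC_42.65_57.0"),
    "METPO:1000431": ("GC mid2", "GC_57.0_66.3"),
    "METPO:1000430": ("GC mid1", "GC_>66.3"),
}

# Labels in ascending GC% order.
ORDERED_LABELS = ["GC low", "GC mid1", "GC mid2", "GC high"]


def lower_bound(synonym):
    """Return the lower GC% bound encoded in a synonym (assumed string format)."""
    body = synonym[len("GC_"):]
    if body.startswith("<="):
        return float("-inf")  # open-ended lowest bin
    if body.startswith(">"):
        return float(body[1:])  # open-ended highest bin
    return float(body.split("_")[0])  # closed range "lo_hi"


def expected_labels(bins):
    """Assign labels by ascending lower bound of the threshold synonym."""
    ordered_ids = sorted(bins, key=lambda term: lower_bound(bins[term][1]))
    return dict(zip(ordered_ids, ORDERED_LABELS))


def mislabeled(bins):
    """Map each mislabeled term to (current label, expected label)."""
    expected = expected_labels(bins)
    return {
        term: (bins[term][0], expected[term])
        for term in bins
        if bins[term][0] != expected[term]
    }
```

Calling `mislabeled(GC_BINS)` flags METPO:1000432, METPO:1000429, and METPO:1000430 and leaves METPO:1000431 alone, matching the three corrections requested in the PR.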
Changes:
- Introduce `LabelCorrection` dataclass, `METPO_LABEL_CORRECTIONS` cohort, `validate_label_corrections()`, and `write_label_correction_tsv()` in `scripts/extract_metpo_proposals.py`; wire into `main()`.
- Commit generated `mappings/metpo_label_corrections.tsv` and add it to `PROPOSAL_FILES` so the regenerate-and-diff CI gate catches drift.
- Document the new artifact and freshness check in `.claude/skills/metpo-proposal/SKILL.md`.
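As a rough illustration of what a TSV writer of this kind might look like, the sketch below renders correction rows to tab-separated text with Python's `csv` module. The column names, the sort by METPO ID, and the function name `render_label_correction_tsv` are assumptions; the PR does not show the real header or row order.

```python
import csv
import io

# Assumed column layout; the real LABEL_CORRECTION_HEADER is not shown in this PR.
LABEL_CORRECTION_HEADER = [
    "metpo_id",
    "current_label",
    "corrected_label",
    "threshold_synonym",
    "upstream_issue",
]


def render_label_correction_tsv(rows):
    """Render dict rows as TSV text with a header line.

    Rows are sorted by METPO ID so that regenerating the file from the same
    input always yields byte-identical output, which keeps a diff-based CI
    check stable.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(LABEL_CORRECTION_HEADER)
    for row in sorted(rows, key=lambda r: r["metpo_id"]):
        writer.writerow([row[column] for column in LABEL_CORRECTION_HEADER])
    return buffer.getvalue()
```

Returning a string instead of writing directly to disk keeps the sketch easy to test; a caller could pass the result to `pathlib.Path.write_text`.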
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| scripts/extract_metpo_proposals.py | Adds dataclass, header, cohort list, validator, writer, and main() wiring for label corrections |
| mappings/metpo_label_corrections.tsv | New generated TSV with the four GC-bin label-correction rows |
| tests/test_extract_metpo_proposals.py | Adds new TSV to PROPOSAL_FILES for CI regenerate-and-diff check |
| .claude/skills/metpo-proposal/SKILL.md | Documents the new artifact and adds a checklist item |
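The regenerate-and-diff gate mentioned in the table can be pictured as follows. This is a hedged sketch of the general technique, not the contents of `tests/test_extract_metpo_proposals.py`: the function name `diff_against_committed` and its arguments are hypothetical, and the real test may regenerate and compare the files differently.

```python
import difflib
from pathlib import Path


def diff_against_committed(committed_path, regenerated_text):
    """Return unified-diff lines between a committed file and regenerated text.

    An empty list means the committed artifact is in sync with the generator.
    """
    committed_text = Path(committed_path).read_text(encoding="utf-8")
    return list(
        difflib.unified_diff(
            committed_text.splitlines(keepends=True),
            regenerated_text.splitlines(keepends=True),
            fromfile=str(committed_path),
            tofile="regenerated",
        )
    )
```

A test built on this idea would assert that the returned list is empty for every file in `PROPOSAL_FILES`, so any edit to the script that changes the output without re-committing the TSV fails CI.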
Summary
Adds a new proposal artifact `mappings/metpo_label_corrections.tsv` that captures upstream label-fix requests for existing METPO terms whose `rdfs:label` disagrees with their numeric-threshold synonyms. Each row cites a `berkeleybop/metpo` issue. This is distinct from `metpo_existing_aliases.tsv`, which routes a kg-microbe-side label to an existing METPO ID as-is; corrections request that the upstream record itself be amended.
First cohort: berkeleybop/metpo#432
The four GC-content bin records have labels that don't line up with their numeric-threshold synonyms. Sorted by ascending GC%, the synonyms are the source of truth; three of the four labels are wrong:
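| METPO ID | Current label | Corrected label | Threshold synonym |
|---|---|---|---|
| `METPO:1000432` | GC high | GC low | `GC_<=42.65` (lowest range) |
| `METPO:1000429` | GC low | GC mid1 | `GC_42.65_57.0` (second-lowest) |
| `METPO:1000431` | GC mid2 | GC mid2 (no change) | `GC_57.0_66.3` (already correct) |
| `METPO:1000430` | GC mid1 | GC high | `GC_>66.3` (highest range) |

Why this works as a separate artifact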
KG-Microbe transforms don't directly reference any of `METPO:1000429`-`1000432`. The GC-content bins in custom_curies.yaml use independent `gc:low/mid1/mid2/high` placeholders that are already correctly labeled by numeric range.
The merged graph is therefore unaffected by the upstream bug. The label-corrections TSV is purely a curator-facing artifact for the upstream PR: once berkeleybop/metpo#432 is fixed, the entries become stale and the freshness check will fail loudly.
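The stale-entry behaviour described above can be sketched as a small freshness check. This is an illustration under stated assumptions, not the PR's `validate_label_corrections()`: the column names `id` and `name` for the METPO nodes TSV, the function names, and the error wording are all hypothetical.

```python
import csv


def load_labels(tsv_file, id_col="id", label_col="name"):
    """Read an open nodes TSV into {term_id: label}; column names are assumed."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[id_col]: row[label_col] for row in reader}


def check_freshness(corrections, snapshot_labels):
    """Confirm each (metpo_id, current_label) pair still matches the snapshot.

    Raises ValueError listing every missing ID and every stale entry, so a
    correction that upstream has already applied cannot linger unnoticed.
    """
    problems = []
    for metpo_id, current_label in corrections:
        actual = snapshot_labels.get(metpo_id)
        if actual is None:
            problems.append(f"{metpo_id}: missing from METPO snapshot")
        elif actual != current_label:
            problems.append(
                f"{metpo_id}: stale entry, snapshot has {actual!r}, expected {current_label!r}"
            )
    if problems:
        raise ValueError("label-correction freshness check failed:\n" + "\n".join(problems))
    return len(corrections)
```

In use, `load_labels` would be given an open handle on the local snapshot (the PR names `data/transformed/ontologies/metpo_nodes.tsv`) and `check_freshness` would run before the TSV is written, so the script stops before emitting an out-of-date request.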
Implementation
- `scripts/extract_metpo_proposals.py`:
  - `LabelCorrection` dataclass + `LABEL_CORRECTION_HEADER`
  - `METPO_LABEL_CORRECTIONS` list with 4 GC-bin entries (3 changes + 1 unchanged-for-completeness)
  - `validate_label_corrections()`: reads `data/transformed/ontologies/metpo_nodes.tsv` and confirms each entry's `current_label` still matches what METPO actually has; fails if upstream has already fixed a row (stale entry) or the METPO ID is missing entirely
  - `write_label_correction_tsv()` writer
  - `main()` invokes both, after the existing-aliases write
- `tests/test_extract_metpo_proposals.py`: adds `metpo_label_corrections.tsv` to `PROPOSAL_FILES` so the regenerate-and-diff CI gate catches drift between the script and the committed TSV
- `.claude/skills/metpo-proposal/SKILL.md`: documents the new artifact in the artifacts table + adds a skill-checklist item for the freshness check

Verification
🤖 Generated with Claude Code