Refresh chemical mappings from MIM SSSOM (2026-05-18 republish) #564
Merged
Re-run of scripts/consolidate_chemical_mappings.py after the
MediaIngredientMech sibling repo published its 2026-05-18 republish
('post schema-gap fixes'; MIM commit a231a67), which supersedes the
prior 2026-05-12 republish (a3212b4) that was the basis of the earlier
draft of this PR. Same set of 17 newly canonicalized entities as the
2026-05-12 cohort; the May-18 republish refines synonym/validation-
method fields on existing rows rather than adding new mappings.
Rebased onto master (66315e8), which picks up PR #567's calcium
pantothenate ↔ capsaicin cross-link fix, so CHEBI:31345 retains its
correct canonical 'Calcium pantothenate' and gains 2 synonyms from
the MIM refresh without re-introducing the spurious cross-link rows.
Net change vs master baseline: +98 rows, +17 entities, 0 removed
(598,072 → 598,170 rows; 120,260 → 120,277 entities). 91 existing
records gained 208 cross-CURIE synonyms via xref propagation.
Newly canonicalized (17 entities):
- CHEBI: Dihydrocelastrol (132340), Acetylated xylan (134431),
Naphthalene sulfonic acid (36336)
- FOODON: Soy flour (03302142), Corn meal (03310257)
- ENVO: Salt water (00002010)
- NCIT: Bimuno (C187267), Oatmeal (C29298),
Middlebrook 7H10 agar (C85509)
- mesh: TiCl3 (C039460), Apidaecin IB (C061361), Tween (D011136)
- kgmicrobe.compound / kgmicrobe.ingredient stock solutions:
disodium phosphate heptahydrate (0.02 M),
sodium nitrate (0.70 M), soyton
Notable synonym gains via cross-CURIE propagation (top entries):
- CHEBI:3311 CaCO3 (+3) / CHEBI:31793 MgCO3 (+3)
- kgmicrobe.compound:bacteriocin_isk_1 (+3)
- CHEBI:33118 H3BO3 / CHEBI:33134 Boric Acid (+2 each)
- CHEBI:62946 (NH4)2SO4 / CHEBI:63051 (NH4)2HPO4 /
CHEBI:63038 (NH4)NO3 (+2 each)
- CHEBI:4735 EDTA / CHEBI:31345 Calcium pantothenate /
CHEBI:46756 HEPES / CHEBI:25979 Benzylcyanide (+2 each)
- CHEBI:15930 atrazine / CHEBI:29377 Sodium carbonate (+2 each)
- CHEBI:17439 Cyanocobalamin / CHEBI:17533 N-Acetyl-L-Glutamic
Acid / CHEBI:9754 Tris base / CHEBI:63036 KH2PO4 (+1 each)
Hygiene:
- Vendored mappings/ingredient_mappings.sssom.tsv now matches the
sibling source-of-truth (a231a67) byte-for-byte (sync_mim_sssom).
- MIM input rows: 2,212 (was 2,193 on master baseline).
- Consolidator log: 'Swept stale MIM xrefs: 1 stale, 4 diverged'.
- SSSOM validation passed (598,170 rows, 60 prefixes).
- ChEBI OAK enrichment skipped (pre-existing 'No module named
kg_microbe' import error in OAK adapter init; existing labels
persist).
- Cross-link sanity: 0 spurious CHEBI:3374 ↔ CHEBI:31345 rows;
CHEBI:31345 canonical='Calcium pantothenate'; CHEBI:3374
canonical='capsaicin' (PR #567 fix preserved through the rebase).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
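The byte-for-byte hygiene item in the commit message above can be checked independently of the consolidator. The following is a minimal sketch, not the repo's `sync_mim_sssom` implementation: the helper names `sha256_of` and `is_byte_identical` are hypothetical, and the location of the sibling checkout is left to the caller.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(vendored: Path, sibling: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    # Hypothetical helper: a cheap size check first, then the full digest.
    if vendored.stat().st_size != sibling.stat().st_size:
        return False
    return sha256_of(vendored) == sha256_of(sibling)
```

Called with `Path("mappings/ingredient_mappings.sssom.tsv")` and the corresponding file in a local MediaIngredientMech checkout, a `True` result would correspond to the "matches byte-for-byte" claim. For the gzipped unified file, note that the gzip header normally embeds a modification time, so byte-identical regeneration additionally depends on how the archive is written.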
448203a to 459e4a6
Summary
Re-runs scripts/consolidate_chemical_mappings.py to pick up the MediaIngredientMech 2026-05-12 SSSOM republish (MIM commit a3212b4 + a sequence of high-confidence unmapped-ingredient curations: 9c5223b / 4f85c7f / f68ee6a / 4214c58). The vendored MIM cache (mappings/ingredient_mappings.sssom.tsv) is refreshed from the sibling repo by sync_mim_sssom to match byte-for-byte; the unified file (mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz) is regenerated.
Net change
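Row and entity deltas of the kind quoted in the commit message can be derived by summarizing two snapshots of an SSSOM TSV file. This is a rough sketch under stated assumptions: it treats each distinct `subject_id` as one entity, which may not match how the consolidator counts entities, and `Snapshot`, `summarize`, and `net_change` are hypothetical names.

```python
import csv
from typing import Iterable, NamedTuple


class Snapshot(NamedTuple):
    rows: int
    entities: int


def summarize(lines: Iterable[str], entity_column: str = "subject_id") -> Snapshot:
    """Count data rows and distinct entity CURIEs in an SSSOM TSV stream."""
    # SSSOM TSV files may begin with a '#'-prefixed metadata block; skip it.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = 0
    entities = set()
    for record in reader:
        rows += 1
        # Assumption: one entity per distinct value in `entity_column`.
        entities.add(record[entity_column])
    return Snapshot(rows, len(entities))


def net_change(before: Snapshot, after: Snapshot) -> str:
    """Format signed row and entity deltas between two snapshots."""
    return (
        f"{after.rows - before.rows:+d} rows, "
        f"{after.entities - before.entities:+d} entities"
    )


# Totals quoted in the commit message for the master baseline and this branch.
print(net_change(Snapshot(598_072, 120_260), Snapshot(598_170, 120_277)))
# prints "+98 rows, +17 entities"
```

For the gzipped unified file, the `lines` argument could come from `gzip.open(path, "rt", encoding="utf-8")`, which yields text lines in the same way as an ordinary file handle.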
Newly canonicalized entities (17)
Notable synonym gains (top entries)
CHEBI:3311 CaCO3 (+3), CHEBI:31793 MgCO3 (+3), CHEBI:4735 EDTA acid form (+2), CHEBI:31345 Calcium pantothenate (+2), CHEBI:33134 Boric Acid / CHEBI:33118 H3BO3 (+2 each), CHEBI:46756 HEPES (+2), CHEBI:63051 (NH4)2HPO4 / CHEBI:62946 (NH4)2SO4 / CHEBI:63038 (NH4)NO3 (+2 each), CHEBI:15930 atrazine (+2), CHEBI:3374 capsaicin (+2). A handful of -1 deltas reflect ownership shifts where MIM gave one of two competing entries the canonical xref.
Hygiene
- mappings/ingredient_mappings.sssom.tsv matches sibling source-of-truth byte-for-byte
- 'Swept stale MIM xrefs: 1 stale, 4 diverged' initial + '0 stale, 7 diverged' post-complex_ingredients re-sweep
Test plan
- poetry run python scripts/consolidate_chemical_mappings.py re-run is a no-op (vendored MIM matches sibling, unified file regenerates byte-identical)
- poetry run pytest tests/test_chemical_mapping_utils.py clean
- find_chebi_by_name("Soyton") → kgmicrobe.ingredient:soyton, find_chebi_by_name("Salt water") → ENVO:00002010, find_chebi_by_name("Tween") → mesh:D011136
🤖 Generated with Claude Code
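The lookup spot-checks in the test plan can be expressed as a small table-driven check. The sketch below does not call the repo's `find_chebi_by_name`, whose module and signature are not shown in this PR; instead it uses a hypothetical case-insensitive name index (`build_name_index`, `lookup`, `failed_checks`) as a stand-in, with the expected CURIEs taken from the test plan.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def build_name_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map case-folded names to CURIEs; the first CURIE seen for a name wins."""
    index: Dict[str, str] = {}
    for name, curie in pairs:
        index.setdefault(name.strip().casefold(), curie)
    return index


def lookup(index: Dict[str, str], name: str) -> Optional[str]:
    """Hypothetical stand-in for a name-to-CURIE lookup (case-insensitive)."""
    return index.get(name.strip().casefold())


# Expected results, copied from the test plan.
EXPECTED = {
    "Soyton": "kgmicrobe.ingredient:soyton",
    "Salt water": "ENVO:00002010",
    "Tween": "mesh:D011136",
}


def failed_checks(index: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return the names whose lookup result differs from the expected CURIE."""
    return [name for name, curie in expected.items() if lookup(index, name) != curie]
```

An empty list from `failed_checks` would mean all three spot-checks pass; in the real repo the same table could be fed to the actual lookup function inside a parametrized pytest case.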