Refresh chemical mappings from MIM SSSOM (2026-05-18 republish) #564

Merged
realmarcin merged 1 commit into master from mim-sssom-refresh-20260511 on May 18, 2026
@realmarcin
Collaborator

Summary

Re-runs scripts/consolidate_chemical_mappings.py to pick up the MediaIngredientMech (MIM) 2026-05-12 SSSOM republish (MIM commit a3212b4 plus a sequence of high-confidence unmapped-ingredient curations: 9c5223b / 4f85c7f / f68ee6a / 4214c58). The vendored MIM cache (mappings/ingredient_mappings.sssom.tsv) is refreshed from the sibling repo by sync_mim_sssom so that it matches byte-for-byte, and the unified file (mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz) is regenerated.

Net change

  • +100 rows, +17 entities, 0 removed (598,074 → 598,174 rows; 120,260 → 120,277 entities)
  • 46 existing entries gained synonyms via cross-CURIE propagation
  • MIM SSSOM input rows: 2,193 → 2,212
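
The row and entity totals above can be re-derived from the unified file. The following is a minimal sketch, assuming that an "entity" corresponds to a distinct `subject_id` value; the consolidator may define entities differently, and `sssom_stats` and `delta` are hypothetical helpers, not functions from this repository.

```python
import csv
import gzip


def sssom_stats(path):
    """Return (row_count, distinct_subject_count) for a gzipped SSSOM TSV.

    Assumption: one entity per distinct subject_id. This is an illustrative
    definition, not necessarily the one the consolidator uses.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # SSSOM TSV files may carry a metadata header in '#'-prefixed lines.
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = 0
        subjects = set()
        for record in reader:
            rows += 1
            subjects.add(record["subject_id"])
    return rows, len(subjects)


def delta(before, after):
    """Element-wise difference between two (rows, entities) tuples."""
    return tuple(a - b for b, a in zip(before, after))
```

For the figures quoted in this description, `delta((598074, 120260), (598174, 120277))` gives `(100, 17)`, which matches the "+100 rows, +17 entities" line.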

Newly canonicalized entities (17)

| Prefix | New | Highlights |
| --- | --- | --- |
| CHEBI | 3 | Dihydrocelastrol, Acetylated xylan, Naphthalene sulfonic acid |
| FOODON | 2 | Soy flour, Corn meal |
| ENVO | 1 | Salt water |
| NCIT | 3 | Bimuno, Oatmeal, Middlebrook 7H10 agar |
| mesh | 3 | TiCl3, Apidaecin IB, Tween |
| kgmicrobe.compound / .ingredient | 5 | disodium phosphate heptahydrate (0.02 M), sodium nitrate (0.70 M), soyton |

Notable synonym gains (top entries)

CHEBI:3311 CaCO3 (+3), CHEBI:31793 MgCO3 (+3), CHEBI:4735 EDTA acid form (+2), CHEBI:31345 Calcium pantothenate (+2), CHEBI:33134 Boric Acid / CHEBI:33118 H3BO3 (+2 each), CHEBI:46756 HEPES (+2), CHEBI:63051 (NH4)2HPO4 / CHEBI:62946 (NH4)2SO4 / CHEBI:63038 (NH4)NO3 (+2 each), CHEBI:15930 atrazine (+2), CHEBI:3374 capsaicin (+2). A handful of -1 deltas reflect ownership shifts where MIM gave one of two competing entries the canonical xref.

Hygiene

  • Vendored mappings/ingredient_mappings.sssom.tsv matches sibling source-of-truth byte-for-byte
  • Consolidator log: "Swept stale MIM xrefs": 1 stale, 4 diverged on the initial sweep, then 0 stale, 7 diverged on the post-complex_ingredients re-sweep
  • SSSOM validation passed (598,174 rows, 60 prefixes)
  • ChEBI OAK enrichment skipped (pre-existing import error; existing labels persist)
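
The byte-for-byte claim in the first hygiene item can be checked independently of sync_mim_sssom by comparing file digests. This is a sketch only: `sha256_of` and `is_byte_identical` are hypothetical helpers, and the location of the sibling checkout is not given in this PR, so the caller has to supply it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large mapping files need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(vendored, upstream):
    """True when both files have the same SHA-256 digest."""
    return sha256_of(vendored) == sha256_of(upstream)
```

Call it with mappings/ingredient_mappings.sssom.tsv as `vendored` and the corresponding file in the MIM checkout as `upstream`.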

Test plan

  • poetry run python scripts/consolidate_chemical_mappings.py re-run is a no-op (vendored MIM matches sibling, unified file regenerates byte-identical)
  • poetry run pytest tests/test_chemical_mapping_utils.py clean
  • Spot-check a few canonical lookups in a Python REPL: find_chebi_by_name("Soyton") → kgmicrobe.ingredient:soyton, find_chebi_by_name("Salt water") → ENVO:00002010, find_chebi_by_name("Tween") → mesh:D011136
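
The first test-plan item expects the regenerated .tsv.gz to be byte-identical across runs. A gzip stream normally embeds a modification time in its header, so identical input can still produce different bytes. The sketch below shows one way to get a reproducible stream with the standard library; whether consolidate_chemical_mappings.py writes its output this way is not stated in this PR, and `gzip_deterministic` is a hypothetical helper.

```python
import gzip
import io


def gzip_deterministic(payload: bytes) -> bytes:
    """Compress payload so that repeated calls yield identical bytes."""
    buffer = io.BytesIO()
    # mtime=0 pins the header timestamp; filename="" keeps any file name
    # out of the gzip header.
    with gzip.GzipFile(filename="", fileobj=buffer, mode="wb", mtime=0) as gz:
        gz.write(payload)
    return buffer.getvalue()
```

Two calls with the same payload return the same bytes, which is the property a no-op re-run relies on.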

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 12, 2026 03:37
Contributor

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.


Re-run of scripts/consolidate_chemical_mappings.py after the
MediaIngredientMech sibling repo published its 2026-05-18 republish
('post schema-gap fixes'; MIM commit a231a67), which supersedes the
prior 2026-05-12 republish (a3212b4) that was the basis of the earlier
draft of this PR. Same set of 17 newly canonicalized entities as the
2026-05-12 cohort; the May-18 republish refines synonym/validation-
method fields on existing rows rather than adding new mappings.

Rebased onto master (66315e8 — picks up PR #567's calcium
pantothenate ↔ capsaicin cross-link fix, so CHEBI:31345 retains its
correct canonical 'Calcium pantothenate' and gets +2 synonyms from
the MIM refresh without re-introducing the spurious cross-link rows).

Net change vs master baseline: +98 rows, +17 entities, 0 removed
(598,072 → 598,170 rows; 120,260 → 120,277 entities). 91 existing
records gained 208 cross-CURIE synonyms via xref propagation.

Newly canonicalized (17 entities):
- CHEBI: Dihydrocelastrol (132340), Acetylated xylan (134431),
  Naphthalene sulfonic acid (36336)
- FOODON: Soy flour (03302142), Corn meal (03310257)
- ENVO: Salt water (00002010)
- NCIT: Bimuno (C187267), Oatmeal (C29298),
  Middlebrook 7H10 agar (C85509)
- mesh: TiCl3 (C039460), Apidaecin IB (C061361), Tween (D011136)
- kgmicrobe.compound / kgmicrobe.ingredient stock solutions:
  disodium phosphate heptahydrate (0.02 M),
  sodium nitrate (0.70 M), soyton

Notable synonym gains via cross-CURIE propagation (top entries):
- CHEBI:3311 CaCO3 (+3) / CHEBI:31793 MgCO3 (+3)
- kgmicrobe.compound:bacteriocin_isk_1 (+3)
- CHEBI:33118 H3BO3 / CHEBI:33134 Boric Acid (+2 each)
- CHEBI:62946 (NH4)2SO4 / CHEBI:63051 (NH4)2HPO4 /
  CHEBI:63038 (NH4)NO3 (+2 each)
- CHEBI:4735 EDTA / CHEBI:31345 Calcium pantothenate /
  CHEBI:46756 HEPES / CHEBI:25979 Benzylcyanide (+2 each)
- CHEBI:15930 atrazine / CHEBI:29377 Sodium carbonate (+2 each)
- CHEBI:17439 Cyanocobalamin / CHEBI:17533 N-Acetyl-L-Glutamic
  Acid / CHEBI:9754 Tris base / CHEBI:63036 KH2PO4 (+1 each)

Hygiene:
- Vendored mappings/ingredient_mappings.sssom.tsv now matches the
  sibling source-of-truth (a231a67) byte-for-byte (sync_mim_sssom).
- MIM input rows: 2,212 (was 2,193 on master baseline).
- Consolidator log: 'Swept stale MIM xrefs: 1 stale, 4 diverged'.
- SSSOM validation passed (598,170 rows, 60 prefixes).
- ChEBI OAK enrichment skipped (pre-existing 'No module named
  kg_microbe' import error in OAK adapter init; existing labels
  persist).
- Cross-link sanity: 0 spurious CHEBI:3374 ↔ CHEBI:31345 rows;
  CHEBI:31345 canonical='Calcium pantothenate'; CHEBI:3374
  canonical='capsaicin' (PR #567 fix preserved through the rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
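
The cross-link sanity item in the commit message above (0 spurious CHEBI:3374 ↔ CHEBI:31345 rows) can be expressed as a small check over mapping rows. This is a sketch under the assumption that a cross-link is a row whose `subject_id` and `object_id` are the two CURIEs in either order; `count_cross_links` is a hypothetical helper, and the consolidator's own check may look at other columns.

```python
def count_cross_links(rows, curie_a, curie_b):
    """Count rows that link curie_a and curie_b in either direction.

    rows is any iterable of dict-like records with 'subject_id' and
    'object_id' keys, for example the output of csv.DictReader.
    """
    pair = {curie_a, curie_b}
    return sum(
        1 for row in rows
        if {row["subject_id"], row["object_id"]} == pair
    )
```

A result of 0 for `count_cross_links(rows, "CHEBI:3374", "CHEBI:31345")` corresponds to the state the commit message reports.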
realmarcin force-pushed the mim-sssom-refresh-20260511 branch from 448203a to 459e4a6 on May 18, 2026 20:19
realmarcin changed the title from "Refresh chemical mappings from MIM SSSOM (2026-05-12 republish)" to "Refresh chemical mappings from MIM SSSOM (2026-05-18 republish)" on May 18, 2026
realmarcin requested a review from Copilot on May 18, 2026 20:20
Contributor

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.

realmarcin merged commit 5fac77e into master on May 18, 2026
3 checks passed
realmarcin deleted the mim-sssom-refresh-20260511 branch on May 18, 2026 20:27