Fix Calcium pantothenate ↔ capsaicin cross-link in unified mappings #567
Merged
Root cause: one bad row in `mappings/culturebotai_reviewed_ingredients.tsv`.
Of the 13 'calcium pantothenate' name-family rows, twelve correctly map
to CHEBI:31345; the one with the canonical capitalization had
CHEBI:3374 (capsaicin) in columns 3 and 7 instead of CHEBI:31345:
ingredient_name       chebi_id    cas_rn    kg_microbe_node_id  mim_id       culturemech_term_id
Calcium pantothenate  CHEBI:3374  137-08-6  CHEBI:31345         CHEBI:31345  CHEBI:3374  <- bad
                      ^^^^^^^^^^                                             ^^^^^^^^^^
(CAS 137-08-6 is calcium pantothenate's correct CAS; capsaicin is
CAS 404-86-4.) Fixed in place — both miscoded cells now say CHEBI:31345.
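A check along these lines could flag such a row before it reaches the consolidator. This is a minimal sketch, not part of the repository: the function name `conflicting_families` is hypothetical, the column names are taken from the header shown above, and grouping a "name family" by case and whitespace only is an assumption (the real 13-row family may include other spelling variants).

```python
import csv
import io
from collections import defaultdict


def conflicting_families(tsv_text, name_col="ingredient_name", id_col="chebi_id"):
    """Return {normalized_name: {id: row_count}} for names mapped to more than one ID.

    Hypothetical helper: names are grouped by lowercasing and collapsing whitespace.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = " ".join(row[name_col].lower().split())
        counts[key][row[id_col]] += 1
    return {name: dict(ids) for name, ids in counts.items() if len(ids) > 1}


# Illustrative data only; not the contents of the real TSV.
sample = (
    "ingredient_name\tchebi_id\n"
    "calcium pantothenate\tCHEBI:31345\n"
    "Calcium Pantothenate\tCHEBI:31345\n"
    "Calcium pantothenate\tCHEBI:3374\n"
    "capsaicin\tCHEBI:3374\n"
)
conflicts = conflicting_families(sample)
print(conflicts)
```

A family whose minority ID covers a single row, as here, is a good candidate for manual review.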
Amplification: the chemical-mapping consolidator's
`load_existing_unified` step seeds from the prior unified SSSOM and
treats it as a baseline. Once the culturebotai loader had created a
spurious 'Calcium pantothenate → CHEBI:3374' name-index entry, the
audit_mim_merge step propagated it into CHEBI:31345's xref list (and
vice versa), producing two symmetric `skos:exactMatch` rows in the
unified SSSOM:
CHEBI:3374 --skos:exactMatch--> CHEBI:31345 (capsaicin → Calcium pantothenate)
CHEBI:31345 --skos:exactMatch--> CHEBI:3374 (Calcium pantothenate → capsaicin)
Re-running the consolidator with only the culturebotai fix was NOT
enough — the cross-link survives because `load_existing_unified`
preserves baseline xrefs unless a source positively removes them. The
fix therefore has two parts:
1. Patch `mappings/culturebotai_reviewed_ingredients.tsv`. This commit
tracks the file in git for the first time: it was previously a manual
drop on per-developer disks, documented as the authoritative
priority=10 source in mappings/README.md,
.claude/skills/chemical-mapping/SKILL.md, and docs/MAPPING_AUDIT.md,
but never tracked.
2. Scrub the 2 spurious `skos:exactMatch` rows from
`mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` so the
consolidator's next-run baseline is clean.
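The scrub in part 2 could be done with a short script such as the following. This is a sketch under assumptions, not the procedure actually used for this commit: `scrub_cross_link` is a hypothetical name, the file is assumed to be a gzipped SSSOM TSV with optional `#`-prefixed metadata lines followed by a header containing `subject_id`, `predicate_id` and `object_id`, and the demo rows (including the `CAS:137-08-6` object) are invented for illustration.

```python
import gzip
import os
import tempfile

BAD_PAIR = frozenset({"CHEBI:3374", "CHEBI:31345"})


def scrub_cross_link(src_path, dst_path, bad_pair=BAD_PAIR):
    """Copy a gzipped SSSOM TSV, dropping skos:exactMatch rows that link bad_pair.

    Returns the number of rows dropped. Metadata and header lines pass through.
    """
    dropped = 0
    cols = None
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # commented metadata block
                continue
            fields = line.rstrip("\r\n").split("\t")
            if cols is None:
                cols = {name: i for i, name in enumerate(fields)}  # header row
                dst.write(line)
                continue
            pair = {fields[cols["subject_id"]], fields[cols["object_id"]]}
            if pair == bad_pair and fields[cols["predicate_id"]] == "skos:exactMatch":
                dropped += 1  # skip either direction of the cross-link
                continue
            dst.write(line)
    return dropped


# Demo on a throwaway file with invented rows.
rows = [
    "# illustrative metadata line",
    "subject_id\tpredicate_id\tobject_id",
    "CHEBI:3374\tskos:exactMatch\tCHEBI:31345",
    "CHEBI:31345\tskos:exactMatch\tCHEBI:3374",
    "CHEBI:31345\tskos:exactMatch\tCAS:137-08-6",
]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.sssom.tsv.gz")
    dst = os.path.join(tmp, "out.sssom.tsv.gz")
    with gzip.open(src, "wt", encoding="utf-8", newline="") as fh:
        fh.write("\n".join(rows) + "\n")
    dropped = scrub_cross_link(src, dst)
    with gzip.open(dst, "rt", encoding="utf-8", newline="") as fh:
        kept = fh.read().splitlines()
print(dropped, kept)
```

Comparing the subject/object pair as a set removes both directions of the symmetric link in one pass.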
Together: subsequent consolidator runs will not re-introduce the
cross-link because (a) the culturebotai source no longer asserts it
and (b) the seed baseline no longer carries it forward.
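Why both parts are needed can be illustrated with a toy model of baseline seeding. This is not the consolidator's code: `consolidate` is a hypothetical stand-in that only mimics the behaviour described above (baseline edges survive unless a source explicitly removes them), and the edge tuples are invented for illustration.

```python
def consolidate(baseline, source_asserted, source_removed=frozenset()):
    """Toy merge: keep baseline edges unless explicitly removed, then add source edges."""
    return (set(baseline) - set(source_removed)) | set(source_asserted)


CROSS = ("CHEBI:31345", "CHEBI:3374")   # the spurious cross-link
GOOD = ("CHEBI:31345", "CAS:137-08-6")  # an unrelated, legitimate edge (illustrative)

# Source fixed, baseline not scrubbed: the cross-link is carried forward.
stale = consolidate(baseline={CROSS, GOOD}, source_asserted={GOOD})

# Source fixed and baseline scrubbed: nothing reintroduces the cross-link.
clean = consolidate(baseline={GOOD}, source_asserted={GOOD})

print(CROSS in stale, CROSS in clean)
```

In this model, fixing the source only stops new assertions of the edge; the seed still has to be cleaned separately.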
Verified:
- bad row in culturebotai TSV → fixed (columns 3 and 7 are both
CHEBI:31345); the calcium-pantothenate family is now uniformly
CHEBI:31345 across all 13 rows.
- unified SSSOM scrubbed: 0 rows where {subject_id, object_id} ==
{CHEBI:3374, CHEBI:31345}.
- canonical names preserved: CHEBI:31345 → 'Calcium pantothenate',
CHEBI:3374 → 'capsaicin'.
- row count delta: 598,074 → 598,072 (-2 cross-link rows).
- SSSOM validation passed (60 prefixes).
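The cross-link and row-count checks above could be reproduced with a small counter like this one. It is a sketch, assuming the same SSSOM TSV layout (`#`-prefixed metadata, then a header with `subject_id` and `object_id`); `count_cross_links` is a hypothetical name and the sample lines are invented.

```python
def count_cross_links(lines, a="CHEBI:3374", b="CHEBI:31345"):
    """Return (cross_link_rows, total_data_rows) for an iterable of SSSOM TSV lines."""
    target = {a, b}
    cols = None
    hits = total = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip metadata and blank lines
        fields = line.rstrip("\r\n").split("\t")
        if cols is None:
            cols = {name: i for i, name in enumerate(fields)}  # header row
            continue
        total += 1
        if {fields[cols["subject_id"]], fields[cols["object_id"]]} == target:
            hits += 1
    return hits, total


header = "subject_id\tpredicate_id\tobject_id\n"
before = [
    header,
    "CHEBI:3374\tskos:exactMatch\tCHEBI:31345\n",
    "CHEBI:31345\tskos:exactMatch\tCHEBI:3374\n",
    "CHEBI:31345\tskos:exactMatch\tCAS:137-08-6\n",
]
after = [header, before[3]]

hits_before, total_before = count_cross_links(before)
hits_after, total_after = count_cross_links(after)
print(hits_before, total_before, hits_after, total_after)
```

On the real file the lines could come from `gzip.open(path, "rt", encoding="utf-8")`; the expected result after the scrub would be zero hits and a total two lower than before.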
Notes for reviewers:
- This PR tracks `mappings/culturebotai_reviewed_ingredients.tsv`
in git for the first time (3,854 rows total). The diff against
nothing is the entire file, so reviewers cannot do a line-by-line
diff for the fix; the smoking-gun row is at line 51, and it now
matches the other 12 'calcium pantothenate' family rows that all
correctly point at CHEBI:31345.
- The vendored `mappings/ingredient_mappings.sssom.tsv` is
deliberately unchanged on this branch; the MIM-refresh delta
belongs in PR #564 (`mim-sssom-refresh-20260511`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 18, 2026
Re-run of scripts/consolidate_chemical_mappings.py after the
MediaIngredientMech sibling repo published its 2026-05-18 republish
('post schema-gap fixes'; MIM commit a231a67), which supersedes the
prior 2026-05-12 republish (a3212b4) that was the basis of the earlier
draft of this PR. Same set of 17 newly canonicalized entities as the
2026-05-12 cohort; the May-18 republish refines synonym/validation-
method fields on existing rows rather than adding new mappings.
Rebased onto master (66315e8 — picks up PR #567's calcium
pantothenate ↔ capsaicin cross-link fix, so CHEBI:31345 retains its
correct canonical 'Calcium pantothenate' and gets +2 synonyms from
the MIM refresh without re-introducing the spurious cross-link rows).
Net change vs master baseline: +98 rows, +17 entities, 0 removed
(598,072 → 598,170 rows; 120,260 → 120,277 entities). 91 existing
records gained 208 cross-CURIE synonyms via xref propagation.
Newly canonicalized (17 entities):
- CHEBI: Dihydrocelastrol (132340), Acetylated xylan (134431),
Naphthalene sulfonic acid (36336)
- FOODON: Soy flour (03302142), Corn meal (03310257)
- ENVO: Salt water (00002010)
- NCIT: Bimuno (C187267), Oatmeal (C29298),
Middlebrook 7H10 agar (C85509)
- mesh: TiCl3 (C039460), Apidaecin IB (C061361), Tween (D011136)
- kgmicrobe.compound / kgmicrobe.ingredient stock solutions:
disodium phosphate heptahydrate (0.02 M),
sodium nitrate (0.70 M), soyton
Notable synonym gains via cross-CURIE propagation (top entries):
- CHEBI:3311 CaCO3 (+3) / CHEBI:31793 MgCO3 (+3)
- kgmicrobe.compound:bacteriocin_isk_1 (+3)
- CHEBI:33118 H3BO3 / CHEBI:33134 Boric Acid (+2 each)
- CHEBI:62946 (NH4)2SO4 / CHEBI:63051 (NH4)2HPO4 /
CHEBI:63038 (NH4)NO3 (+2 each)
- CHEBI:4735 EDTA / CHEBI:31345 Calcium pantothenate /
CHEBI:46756 HEPES / CHEBI:25979 Benzylcyanide (+2 each)
- CHEBI:15930 atrazine / CHEBI:29377 Sodium carbonate (+2 each)
- CHEBI:17439 Cyanocobalamin / CHEBI:17533 N-Acetyl-L-Glutamic
Acid / CHEBI:9754 Tris base / CHEBI:63036 KH2PO4 (+1 each)
Hygiene:
- Vendored mappings/ingredient_mappings.sssom.tsv now matches the
sibling source-of-truth (a231a67) byte-for-byte (sync_mim_sssom).
- MIM input rows: 2,212 (was 2,193 on master baseline).
- Consolidator log: 'Swept stale MIM xrefs: 1 stale, 4 diverged'.
- SSSOM validation passed (598,170 rows, 60 prefixes).
- ChEBI OAK enrichment skipped (pre-existing 'No module named
kg_microbe' import error in OAK adapter init; existing labels
persist).
- Cross-link sanity: 0 spurious CHEBI:3374 ↔ CHEBI:31345 rows;
CHEBI:31345 canonical='Calcium pantothenate'; CHEBI:3374
canonical='capsaicin' (PR #567 fix preserved through the rebase).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes a spurious `skos:exactMatch` cross-link between CHEBI:3374 (capsaicin) and CHEBI:31345 (Calcium pantothenate) in `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz`. Two unrelated chemicals were being asserted as the same entity in the unified mapping output.

Root cause
One bad row at line 51 of `mappings/culturebotai_reviewed_ingredients.tsv` had `CHEBI:3374` (capsaicin) in columns 3 and 7 instead of `CHEBI:31345`:

Calcium pantothenate | CHEBI:3374 ❌ | 137-08-6 | CHEBI:3374 ❌

(CAS `137-08-6` is calcium pantothenate's correct CAS; capsaicin is CAS `404-86-4`. The other 12 'calcium pantothenate' family rows in the same file all correctly point at `CHEBI:31345`.)

When the consolidator's `load_culturebotai_reviewed` step ingested this row, it created a name-index entry `Calcium pantothenate → CHEBI:3374`, which the `audit_mim_merge` step then propagated into `CHEBI:31345`'s xref list (and vice versa), producing the symmetric cross-link rows.

Why a single-cell fix wasn't enough
Re-running the consolidator with only the culturebotai fix did not clear the cross-link: `load_existing_unified` seeds from the prior unified SSSOM and preserves baseline xrefs unless a source positively removes them. So this PR has two parts:
1. Patch `mappings/culturebotai_reviewed_ingredients.tsv` (columns 3 and 7: `CHEBI:3374` → `CHEBI:31345`).
2. Scrub the 2 spurious `skos:exactMatch` rows from `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` so the consolidator's next-run baseline is clean.
Together: subsequent consolidator runs will not re-introduce the cross-link.

Verification
Notes for reviewers
- This PR tracks `mappings/culturebotai_reviewed_ingredients.tsv` in git for the first time (3,854 rows / 362 KB). The file is documented as the authoritative priority=10 source in `mappings/README.md`, `.claude/skills/chemical-mapping/SKILL.md`, and `docs/MAPPING_AUDIT.md`, but had been a per-developer manual drop on disk rather than versioned content. Adding it now means the diff against nothing is the entire file; the smoking-gun row is at line 51 and the fix is a single-cell change in two columns of that row.
- The vendored `mappings/ingredient_mappings.sssom.tsv` is deliberately unchanged on this branch; the MIM-refresh delta belongs in PR #564, "Refresh chemical mappings from MIM SSSOM (2026-05-18 republish)" (`mim-sssom-refresh-20260511`).

🤖 Generated with Claude Code