Fix Calcium pantothenate ↔ capsaicin cross-link in unified mappings #567

Merged
realmarcin merged 1 commit into master from fix-calcium-pantothenate-mismapping
May 16, 2026

@realmarcin

Summary

Fixes a spurious skos:exactMatch cross-link between CHEBI:3374 (capsaicin) and CHEBI:31345 (Calcium pantothenate) in mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz. Two unrelated chemicals were being asserted as the same entity in the unified mapping output.

Root cause

One bad row at line 51 of mappings/culturebotai_reviewed_ingredients.tsv had CHEBI:3374 (capsaicin) in columns 3 and 7 instead of CHEBI:31345:

  ingredient_name       chebi_id    cas_rn    culturemech_term_id
  Calcium pantothenate  CHEBI:3374  137-08-6  CHEBI:3374

(CAS 137-08-6 is calcium pantothenate's correct CAS; capsaicin is CAS 404-86-4. The other 12 'calcium pantothenate' family rows in the same file all correctly point at CHEBI:31345.)

When the consolidator's load_culturebotai_reviewed step ingested this row, it created a name-index entry Calcium pantothenate → CHEBI:3374, which the audit_mim_merge step then propagated into CHEBI:31345's xref list (and vice versa), producing the symmetric cross-link rows.
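A bad cell like this one can be caught by comparing each row against the majority mapping of its name family. The following is a minimal sketch, not part of the consolidator: `find_family_outliers` is a hypothetical helper, the family is matched by a simple case-insensitive substring test (an assumption), and the two lowercase/title-case sample rows are invented for illustration. Only the first data row mirrors the bad row described above.

```python
import csv
import io
from collections import Counter


def find_family_outliers(rows, family, id_columns=("chebi_id", "culturemech_term_id")):
    """Return (name, column, value, majority) for cells that disagree with
    the most common value among rows whose name contains `family`."""
    members = [r for r in rows if family.lower() in r["ingredient_name"].lower()]
    outliers = []
    for col in id_columns:
        counts = Counter(r[col] for r in members)
        if len(counts) < 2:
            continue  # the whole family agrees on this column
        majority, _ = counts.most_common(1)[0]
        for r in members:
            if r[col] != majority:
                outliers.append((r["ingredient_name"], col, r[col], majority))
    return outliers


# Illustrative data: row 1 is the bad row; rows 2 and 3 are made-up family members.
SAMPLE = (
    "ingredient_name\tchebi_id\tcas_rn\tculturemech_term_id\n"
    "Calcium pantothenate\tCHEBI:3374\t137-08-6\tCHEBI:3374\n"
    "calcium pantothenate\tCHEBI:31345\t137-08-6\tCHEBI:31345\n"
    "Calcium Pantothenate\tCHEBI:31345\t137-08-6\tCHEBI:31345\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
outliers = find_family_outliers(rows, "calcium pantothenate")
for outlier in outliers:
    print(outlier)
```

A majority vote only works when most of the family is correct, as it was here (12 of 13 rows); it would not flag a family that is uniformly wrong.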

Why a single-cell fix wasn't enough

Re-running the consolidator with only the culturebotai fix did not clear the cross-link — load_existing_unified seeds from the prior unified SSSOM and preserves baseline xrefs unless a source positively removes them. So this PR has two parts:

  1. Patch the bad row in mappings/culturebotai_reviewed_ingredients.tsv (columns 3 and 7: CHEBI:3374 → CHEBI:31345).
  2. Scrub the 2 spurious skos:exactMatch rows from mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz so the consolidator's next-run baseline is clean.

Together: subsequent consolidator runs will not re-introduce the cross-link.
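The need for both parts can be illustrated with a toy model of an add-only merge. This is a sketch under stated assumptions, not the consolidator's code: `consolidate` and `scrub` are hypothetical stand-ins, and the only behavior modeled is the one described above, namely that the run is seeded from the prior baseline and sources only add xrefs.

```python
def consolidate(baseline, sources):
    """Toy merge: copy the baseline xrefs, then add every source assertion.
    Nothing is ever removed, so a bad baseline link survives the run."""
    xrefs = {key: set(values) for key, values in baseline.items()}
    for source in sources:
        for subject, obj in source:
            xrefs.setdefault(subject, set()).add(obj)
            xrefs.setdefault(obj, set()).add(subject)
    return xrefs


def scrub(baseline, a, b):
    """Return a copy of the baseline without the a <-> b link."""
    cleaned = {key: set(values) for key, values in baseline.items()}
    cleaned.get(a, set()).discard(b)
    cleaned.get(b, set()).discard(a)
    return cleaned


CAPSAICIN, CA_PANTOTHENATE = "CHEBI:3374", "CHEBI:31345"

# Baseline left behind by an earlier run that ingested the bad row.
dirty_baseline = {CAPSAICIN: {CA_PANTOTHENATE}, CA_PANTOTHENATE: {CAPSAICIN}}
fixed_source = []  # after the TSV patch, the source no longer asserts the link

source_fix_only = consolidate(dirty_baseline, [fixed_source])
both_fixes = consolidate(scrub(dirty_baseline, CAPSAICIN, CA_PANTOTHENATE), [fixed_source])

print(CA_PANTOTHENATE in source_fix_only[CAPSAICIN])  # True: link survives
print(CA_PANTOTHENATE in both_fixes[CAPSAICIN])       # False: link is gone
```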

Verification

$ grep -nE "^Calcium pantothenate\b" mappings/culturebotai_reviewed_ingredients.tsv
51:Calcium pantothenate	817	CHEBI:31345	137-08-6	CHEBI:31345	CHEBI:31345	CHEBI:31345	MAPPED	…

$ python3 -c "import gzip,csv; …
  CHEBI:3374 ↔ CHEBI:31345 cross-link rows: 0
  CHEBI:31345 canonical: 'Calcium pantothenate'
  CHEBI:3374 canonical: 'capsaicin'
  row count: 598,074 → 598,072  (-2 cross-link rows)
  SSSOM validation passed (60 prefixes)
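The cross-link count could be reproduced along the following lines. This is a hedged sketch and not the elided command above: `count_cross_links` and `count_in_file` are hypothetical names, the file is assumed to be a gzip-compressed SSSOM TSV whose metadata lines start with `#` and whose header includes `subject_id` and `object_id`, and the sample rows are invented.

```python
import csv
import gzip
import io


def count_cross_links(lines, a, b):
    """Count rows whose {subject_id, object_id} equals {a, b}, in either direction."""
    data = (line for line in lines if not line.startswith("#"))  # skip metadata block
    reader = csv.DictReader(data, delimiter="\t")
    return sum(1 for row in reader if {row["subject_id"], row["object_id"]} == {a, b})


def count_in_file(path, a, b):
    """Same check against a gzip-compressed SSSOM TSV on disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return count_cross_links(handle, a, b)


# Illustrative pre-fix sample: two symmetric cross-link rows plus one unrelated row.
SAMPLE = """#curie_map:
#  CHEBI: http://purl.obolibrary.org/obo/CHEBI_
subject_id\tpredicate_id\tobject_id\tmapping_justification
CHEBI:3374\tskos:exactMatch\tCHEBI:31345\tsemapv:ManualMappingCuration
CHEBI:31345\tskos:exactMatch\tCHEBI:3374\tsemapv:ManualMappingCuration
CHEBI:31345\tskos:exactMatch\tCAS:137-08-6\tsemapv:ManualMappingCuration
"""

before = count_cross_links(io.StringIO(SAMPLE), "CHEBI:3374", "CHEBI:31345")
print(before)  # 2
```

Comparing unordered sets rather than ordered pairs catches both symmetric rows with one test, which matches how the commit message states the check.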

Notes for reviewers

  • This PR tracks mappings/culturebotai_reviewed_ingredients.tsv in git for the first time (3,854 rows / 362 KB). The file is documented as the authoritative priority=10 source in mappings/README.md, .claude/skills/chemical-mapping/SKILL.md, and docs/MAPPING_AUDIT.md, but had been a per-developer manual drop on disk rather than versioned content. Adding it now means the diff against nothing is the entire file — the smoking-gun row is at line 51 and the fix is a single-cell change in two columns of that row.
  • The vendored mappings/ingredient_mappings.sssom.tsv is deliberately unchanged on this branch; the MIM-refresh delta belongs in PR #564, "Refresh chemical mappings from MIM SSSOM (2026-05-18 republish)" (mim-sssom-refresh-20260511).
  • Future consolidator runs will not regenerate the bad rows, because both the source (culturebotai) and the baseline (unified SSSOM) are now clean.

🤖 Generated with Claude Code

Root cause: one bad row in `mappings/culturebotai_reviewed_ingredients.tsv`.
Of the 13 'calcium pantothenate' name-family rows, twelve correctly map
to CHEBI:31345; the one with the canonical capitalization had
CHEBI:3374 (capsaicin) in columns 3 and 7 instead of CHEBI:31345:

  ingredient_name       chebi_id    cas_rn      kg_microbe_node_id  mim_id        culturemech_term_id
  Calcium pantothenate  CHEBI:3374  137-08-6    CHEBI:31345         CHEBI:31345   CHEBI:3374     <- bad
                        ^^^^^^^^^^                                                ^^^^^^^^^^

(CAS 137-08-6 is calcium pantothenate's correct CAS; capsaicin is
CAS 404-86-4.) Fixed in place — both miscoded cells now say CHEBI:31345.

Amplification: the chemical-mapping consolidator's
`load_existing_unified` step seeds from the prior unified SSSOM and
treats it as a baseline. Once the culturebotai loader had created a
spurious 'Calcium pantothenate → CHEBI:3374' name-index entry, the
audit_mim_merge step propagated it into CHEBI:31345's xref list (and
vice versa), producing two symmetric `skos:exactMatch` rows in the
unified SSSOM:

  CHEBI:3374  --skos:exactMatch-->  CHEBI:31345   (capsaicin → Calcium pantothenate)
  CHEBI:31345 --skos:exactMatch-->  CHEBI:3374   (Calcium pantothenate → capsaicin)

Re-running the consolidator with only the culturebotai fix was NOT
enough — the cross-link survives because `load_existing_unified`
preserves baseline xrefs unless a source positively removes them. The
fix therefore has two parts:

1. Patch `mappings/culturebotai_reviewed_ingredients.tsv` (this commit
   tracks the file in git for the first time — it was previously a
   manual drop on per-developer disks, documented as the authoritative
   priority=10 source in mappings/README.md, .claude/skills/chemical-
   mapping/SKILL.md, and docs/MAPPING_AUDIT.md, but never tracked).
2. Scrub the 2 spurious `skos:exactMatch` rows from
   `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` so the
   consolidator's next-run baseline is clean.

Together: subsequent consolidator runs will not re-introduce the
cross-link because (a) the culturebotai source no longer asserts it
and (b) the seed baseline no longer carries it forward.

Verified:
  - bad row in culturebotai TSV → fixed (columns 3 and 7 are both
    CHEBI:31345); the calcium-pantothenate family is now uniformly
    CHEBI:31345 across all 13 rows.
  - unified SSSOM scrubbed: 0 rows where {subject_id, object_id} ==
    {CHEBI:3374, CHEBI:31345}.
  - canonical names preserved: CHEBI:31345 → 'Calcium pantothenate',
    CHEBI:3374 → 'capsaicin'.
  - row count delta: 598,074 → 598,072 (-2 cross-link rows).
  - SSSOM validation passed (60 prefixes).

Notes for reviewers:
  - This PR tracks `mappings/culturebotai_reviewed_ingredients.tsv`
    in git for the first time (3,854 rows total). The diff against
    nothing is the entire file, so reviewers cannot do a line-by-line
    diff for the fix; the smoking-gun row is at line 51, and it now
    matches the other 12 'calcium pantothenate' family rows that all
    correctly point at CHEBI:31345.
  - The vendored `mappings/ingredient_mappings.sssom.tsv` is
    deliberately unchanged on this branch; the MIM-refresh delta
    belongs in PR #564 (`mim-sssom-refresh-20260511`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 08:37
Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.


@realmarcin realmarcin requested a review from Copilot May 16, 2026 08:41
Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.

@realmarcin realmarcin merged commit 66315e8 into master May 16, 2026
3 checks passed
@realmarcin realmarcin deleted the fix-calcium-pantothenate-mismapping branch May 16, 2026 08:58
realmarcin added a commit that referenced this pull request May 18, 2026
Re-run of scripts/consolidate_chemical_mappings.py after the
MediaIngredientMech (MIM) sibling repo issued its 2026-05-18 republish
('post schema-gap fixes'; MIM commit a231a67), which supersedes the
2026-05-12 republish (a3212b4) that the earlier draft of this PR was
based on. The set of 17 newly canonicalized entities is the same as in
the 2026-05-12 cohort; the May-18 republish refines synonym and
validation-method fields on existing rows rather than adding new
mappings.

Rebased onto master (66315e8 — picks up PR #567's calcium
pantothenate ↔ capsaicin cross-link fix, so CHEBI:31345 retains its
correct canonical 'Calcium pantothenate' and gets +2 synonyms from
the MIM refresh without re-introducing the spurious cross-link rows).

Net change vs master baseline: +98 rows, +17 entities, 0 removed
(598,072 → 598,170 rows; 120,260 → 120,277 entities). 91 existing
records gained 208 cross-CURIE synonyms via xref propagation.

Newly canonicalized (17 entities):
- CHEBI: Dihydrocelastrol (132340), Acetylated xylan (134431),
  Naphthalene sulfonic acid (36336)
- FOODON: Soy flour (03302142), Corn meal (03310257)
- ENVO: Salt water (00002010)
- NCIT: Bimuno (C187267), Oatmeal (C29298),
  Middlebrook 7H10 agar (C85509)
- mesh: TiCl3 (C039460), Apidaecin IB (C061361), Tween (D011136)
- kgmicrobe.compound / kgmicrobe.ingredient stock solutions:
  disodium phosphate heptahydrate (0.02 M),
  sodium nitrate (0.70 M), soyton

Notable synonym gains via cross-CURIE propagation (top entries):
- CHEBI:3311 CaCO3 (+3) / CHEBI:31793 MgCO3 (+3)
- kgmicrobe.compound:bacteriocin_isk_1 (+3)
- CHEBI:33118 H3BO3 / CHEBI:33134 Boric Acid (+2 each)
- CHEBI:62946 (NH4)2SO4 / CHEBI:63051 (NH4)2HPO4 /
  CHEBI:63038 (NH4)NO3 (+2 each)
- CHEBI:4735 EDTA / CHEBI:31345 Calcium pantothenate /
  CHEBI:46756 HEPES / CHEBI:25979 Benzylcyanide (+2 each)
- CHEBI:15930 atrazine / CHEBI:29377 Sodium carbonate (+2 each)
- CHEBI:17439 Cyanocobalamin / CHEBI:17533 N-Acetyl-L-Glutamic
  Acid / CHEBI:9754 Tris base / CHEBI:63036 KH2PO4 (+1 each)

Hygiene:
- Vendored mappings/ingredient_mappings.sssom.tsv now matches the
  sibling source-of-truth (a231a67) byte-for-byte (sync_mim_sssom).
- MIM input rows: 2,212 (was 2,193 on master baseline).
- Consolidator log: 'Swept stale MIM xrefs: 1 stale, 4 diverged'.
- SSSOM validation passed (598,170 rows, 60 prefixes).
- ChEBI OAK enrichment skipped (pre-existing 'No module named
  kg_microbe' import error in OAK adapter init; existing labels
  persist).
- Cross-link sanity: 0 spurious CHEBI:3374 ↔ CHEBI:31345 rows;
  CHEBI:31345 canonical='Calcium pantothenate'; CHEBI:3374
  canonical='capsaicin' (PR #567 fix preserved through the rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>