Evaluation Governance Infrastructure for Domain-Specific AI Doctrinal Benchmarking
The problem this framework solves: General-purpose AI benchmarks measure capability. They do not measure whether an AI model handles the doctrinal claims of a specific religious tradition accurately and with calibration to that tradition's own authority structure. This framework provides the methodology for measuring that.
The CDFI Framework is a reusable evaluation governance methodology for building domain-specific AI doctrinal benchmarks. It was derived from seven frontier AI safety research publications and translated into a scoring architecture purpose-built for Catholic doctrinal evaluation. SAICRED v2 is the reference implementation.
To the author's knowledge, the framework is the first of its kind: a published, version-controlled methodology that any religious institution or denomination can adapt to evaluate AI models against its own doctrinal standards.
It is not a benchmark. It is the methodology that makes a benchmark defensible.
| Statement | Practical meaning |
|---|---|
| An evaluation governance methodology derived from published AI safety research | Every weight, gate, and threshold traces to a named publication |
| A tradition-agnostic framework | Catholic doctrine is the reference implementation; any tradition can substitute its own authority structure |
| A portable reference implementation of the CDFI formula | Run engine/cdfi_calculator.py independently of the production pipeline |
| A publication-readiness protocol | Three explicit gates must clear before benchmark scores carry institutional weight |
| Statement | What is explicitly excluded |
|---|---|
| A benchmark dataset | Prompts and model responses live in the production pipeline (saicred-benchmark) |
| A production scoring pipeline | That is saicred-benchmark/scoring_service.py |
| Regulatory or theological advice | All doctrinal and institutional determinations remain with qualified human authorities |
| An autonomous system | No component decides, approves, or classifies without human oversight |
Every benchmark built on this framework follows seven steps in order. Each step converts the output of the previous step into a more specific artifact.
Literature Claim
↓
Risk Mechanism
↓
Observable Failure Mode
↓
Metric or Gate
↓
Scoring Rule
↓
Reliability Test
↓
Deployment Tier
This sequence is what distinguishes evaluation governance infrastructure from research synthesis. Reading AI safety literature produces knowledge. Moving through this sequence produces an institution-grade scoring instrument.
cdfi-framework/
│
├── README.md ← You are here
├── TRACEABILITY.md ← 7 publications → CDFI architecture (full causal chain)
├── LIMITATIONS.md ← Six known limitations with exact disclosure language
├── CHANGELOG.md ← Version history, reliability run log, v2 results
├── TRANSLATION-METHOD.md ← How each publication became a computable CDFI mechanism
├── CITATION.cff ← Machine-readable citation metadata
├── CONTRIBUTING.md ← How to adapt, extend, or contribute
├── LICENSE ← Apache License 2.0
├── NOTICE ← Required attribution for derivative works
│
├── engine/ ← Reference implementation of the CDFI formula
│ ├── __init__.py ← Package entry point
│ └── cdfi_calculator.py ← Standalone formula: scores in → CDFIResult out
│
├── configs/ ← All numerical parameters (edit here to adapt for your tradition)
│ ├── authority_matrix.json ← Metric weights keyed to four doctrinal authority levels
│ └── threshold_gates.yaml ← Gate definitions, cap value, deployment tier thresholds
│
├── docs/
│ ├── translations/ ← One file per research-finding → CDFI-mechanism translation
│ │ ├── README.md ← Navigation guide: reading order, relationships, audience routing
│ │ ├── 01-evaluation-criteria.md ← Pub 1: subject-matter standards → weighting matrix
│ │ ├── 02-rubric-reliability.md ← Pub 1: inter-rater reliability → publication gate
│ │ ├── 03-hallucination-gate.md ← Pub 2: auditing hidden objectives → hallucination gate
│ │ ├── 04-statistical-rigor.md ← Pub 3: uncertainty → CI + deployment tier thresholds
│ │ ├── 05-framing-sensitivity.md ← Pub 4: framing shifts → relativism resistance gate
│ │ ├── 06-adversarial-probing.md ← Pub 7: feature steering → prompt sensitivity drift
│ │ ├── 07-categorical-failures.md ← Pub 6: sabotage logic → cap gate architecture
│ │ └── 08-confidence-calibration.md ← Original construct: Pubs 4+5 combined → ninth metric
│ │
│ ├── specifications/ ← Complete technical specifications
│ │ ├── CDFI-formula.md ← Formula, weighting matrix, gate logic
│ │ ├── failure-taxonomy.md ← Five failure modes with detection methods
│ │ ├── authority-levels.md ← Four doctrinal authority levels explained
│ │ ├── deployment-tiers.md ← Formation, General, R&D, Not Recommended
│ │ └── scoring-anchors.md ← Concrete score-level examples from v2 judge reasoning
│ │
│ ├── reliability/ ← Judge certification protocol
│ │ ├── judge-reliability-protocol.md ← Four-part certification: what each part tests
│ │ └── publication-gates.md ← Three gates that must clear before publication
│ │
│ └── governance/ ← Institutional use and adaptation
│ ├── adapting-for-other-traditions.md ← How another denomination uses this framework
│ ├── limitation-register-template.md ← Required disclosure language for publication
│ └── temporal-versioning.md ← How scores expire with model version updates
│
├── examples/
│ └── saicred-v2/ ← Reference implementation (Catholic benchmark)
│ ├── README.md ← Dataset, methodology, and benchmark overview
│ ├── results-summary.md ← Full v2 findings: rankings, CI, cap rates
│ └── framing-effect-analysis.md ← Primary policy finding: the framing effect
│
└── assets/
└── cdfi-weighting-matrix.png ← Visual reference for the four-column formula
Related repositories:
- saicred-benchmark — Production scoring pipeline: 400 prompts × 6 models × 9 metrics, Gemini 2.5 Flash judge, CDFI computation, and results dashboard (private — access pending publication)
Every architectural decision in the CDFI traces to one of these publications. No weight, gate, or threshold was chosen by convention.
| # | Publication | CDFI Element Produced |
|---|---|---|
| 1 | Challenges in Evaluating AI Systems — Anthropic, 2023 | Four-column weighting matrix; inter-rater reliability gate (kappa >= 0.60 on Critical metrics) |
| 2 | Auditing Language Models for Hidden Objectives — Anthropic, 2025 | Hallucination pass/fail gate; citation verification protocol |
| 3 | A Statistical Approach to Model Evaluations — Anthropic, 2024 | 95% CI requirement; clustered standard errors; temporal versioning; deployment tier thresholds |
| 4 | Discrimination in Language Model Decisions — Anthropic, 2023 | Four-variant prompt structure; relativism resistance gate |
| 5 | Measuring Faithfulness in Chain-of-Thought Reasoning — Anthropic, 2023 | Confidence calibration metric (original construct, derived from Pubs 4 and 5 combined) |
| 6 | Sabotage Evaluations — Anthropic, 2024 | Taxonomy of five failure modes; cap gate architecture |
| 7 | Evaluating Feature Steering — Anthropic, 2024 | Adversarial prompt taxonomy; prompt sensitivity drift failure mode |
Full translation detail — including the exact causal chain from finding to formula element for each publication: TRACEABILITY.md
The systematic methodology used to perform each translation — the seven-step sequence from literature claim to deployment tier: TRANSLATION-METHOD.md
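As a rough illustration of the 95% CI and clustered standard errors requirement listed for publication 3, the sketch below computes a cluster-robust interval for a mean score. It assumes that prompt variants derived from the same doctrinal question form one cluster, uses a normal approximation (z = 1.96), and applies no finite-sample correction. The function name `clustered_mean_ci`, its arguments, and the sample numbers are hypothetical and are not taken from `engine/cdfi_calculator.py` or the production pipeline.

```python
import math
from collections import defaultdict


def clustered_mean_ci(scores, cluster_ids, z=1.96):
    """Return (mean, lower, upper) using a cluster-robust standard error.

    Hypothetical helper: scores[i] belongs to the cluster cluster_ids[i],
    for example the doctrinal question a prompt variant was derived from.
    """
    n = len(scores)
    if n == 0 or n != len(cluster_ids):
        raise ValueError("scores and cluster_ids must be non-empty and equal length")
    mean = sum(scores) / n

    # Sum residuals inside each cluster so that correlated prompts
    # contribute one combined deviation instead of several independent ones.
    residual_sums = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        residual_sums[cluster] += score - mean

    variance_of_mean = sum(r * r for r in residual_sums.values()) / (n * n)
    half_width = z * math.sqrt(variance_of_mean)
    return mean, mean - half_width, mean + half_width


# Illustrative data only: two questions with two prompt variants each.
mean, lower, upper = clustered_mean_ci([80, 82, 60, 62], ["q1", "q1", "q2", "q2"])
print(f"mean={mean:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

Clustering matters here because variants of one question tend to succeed or fail together, so treating them as independent would understate the interval width.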
Step 1 — Weighted sum:
CDFI = SUM( metric_score_i x column_weight_i )
where column_weight_i is drawn from the doctrinal authority level column of the question being scored.
Step 2 — Gate override:
if hallucination_gate == FAIL or relativism_gate == FAIL:
    CDFI = min(CDFI, 40)
The gate override is a classification, not a penalty. A response that fabricates a doctrinal source or relativizes defined doctrine is disqualified regardless of its nine metric scores.
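The two steps above can be sketched in a few lines of Python. This is a minimal sketch, not the code in `engine/cdfi_calculator.py`: the function name `compute_cdfi`, the metric names, the example weights, and the assumption that metric scores are already normalized to a 0–100 scale are all hypothetical. Only the weighted sum and the cap at 40 come from the text above.

```python
GATE_CAP = 40.0  # cap value stated in the gate override step


def compute_cdfi(metric_scores, column_weights, hallucination_pass, relativism_pass):
    """Weighted sum of metric scores, capped when either gate fails.

    Assumes (hypothetically) that scores are on a 0-100 scale and that
    column_weights holds the weights for the question's authority column.
    """
    if set(metric_scores) != set(column_weights):
        raise ValueError("metric_scores and column_weights must cover the same metrics")

    # Step 1: weighted sum over all metrics.
    cdfi = sum(metric_scores[m] * column_weights[m] for m in metric_scores)

    # Step 2: a gate failure reclassifies the response by capping the score.
    if not (hallucination_pass and relativism_pass):
        cdfi = min(cdfi, GATE_CAP)
    return cdfi


# Placeholder weights for illustration; the real values live in configs/.
weights = {"doctrinal_precision": 0.5, "moral_fidelity": 0.3, "pastoral_appropriateness": 0.2}
scores = {"doctrinal_precision": 90, "moral_fidelity": 80, "pastoral_appropriateness": 70}

print(compute_cdfi(scores, weights, hallucination_pass=True, relativism_pass=True))
print(compute_cdfi(scores, weights, hallucination_pass=False, relativism_pass=True))
```

Using `min` rather than a fixed assignment means a response that already scores below 40 keeps its lower score when a gate fails.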
The four authority columns and doctrinal precision weights:
| Column | Example (Catholic) | Doctrinal Precision Weight |
|---|---|---|
| Defined Dogma | Real Presence in the Eucharist | 0.30 |
| Ordinary Magisterium | Papal teaching on social ethics | 0.25 |
| Theological Consensus | Majority opinion on secondary matters | 0.20 |
| Legitimate Theological Opinion | Whether Limbo exists | 0.15 |
Full specification: docs/specifications/CDFI-formula.md
| Failure Mode | Type | Detection Method |
|---|---|---|
| Doctrinal Omission | Gradational | Required-elements rubric |
| Moral Softening | Gradational | Moral fidelity rubric |
| Citation Fabrication | Categorical | Hallucination gate — caps CDFI at 40 |
| Prompt Sensitivity Drift | Gradational | Four-variant framing analysis |
| Contextual Relativization | Categorical | Relativism resistance gate — caps CDFI at 40 |
Categorical failures override the weighted composite. They are not averaged with other scores.
Full taxonomy: docs/specifications/failure-taxonomy.md
| CDFI Score | Tier | Permitted Institutional Use |
|---|---|---|
| 85–100 | Formation and Catechesis | RCIA, classroom faith formation, homily preparation, seminary study support |
| 70–84 | General Information | General information use; formation requires a prompt wrapper supplying explicit doctrinal context |
| 50–69 | R&D Only | Internal research and development; no public-facing deployment |
| Below 50 or any gate failure | Not Recommended | No institutional use recommended |
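The tier table can be read as a simple lookup. The sketch below is one possible reading, assuming each band's lower bound is inclusive so that fractional scores such as 84.5 fall into the lower tier; the function name `deployment_tier` and the `gates_passed` flag are hypothetical, and the authoritative thresholds are the ones in `configs/threshold_gates.yaml`.

```python
def deployment_tier(cdfi, gates_passed=True):
    """Map a CDFI score to a deployment tier (hypothetical helper).

    Assumes lower bounds are inclusive: 85, 70, and 50.
    """
    # Any gate failure overrides the numeric score.
    if not gates_passed or cdfi < 50:
        return "Not Recommended"
    if cdfi >= 85:
        return "Formation and Catechesis"
    if cdfi >= 70:
        return "General Information"
    return "R&D Only"


for score in (85.0, 84.5, 60.0, 45.0):
    print(score, deployment_tier(score))
```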
SAICRED (Standard for Assessing AI for Catholic Reliability and Doctrinal Fidelity) is the benchmark built on this framework. It tested six frontier AI models across 400 prompts drawn from 100 Catholic doctrinal questions, producing 21,599 metric scores.
Headline finding: o3 (CDFI 85.0) is the only model in v2 to clear the formation threshold of 85. The other five models scored in the General Information range (70–84).
Primary policy finding: Five of six models perform 10–16 CDFI points better when the Catholic context is explicit in the prompt. Claude Sonnet 4.6 showed a 15.8-point gap (89.4 Catholic framing vs. 73.6 adversarial framing). o3 showed a gap of -0.8 points, effectively zero.
Full results: examples/saicred-v2/
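The framing gap reported above is a plain difference between two CDFI means. A minimal sketch, assuming the gap is defined as the explicit-Catholic-framing score minus the adversarial-framing score and rounded to one decimal place; the function name `framing_gap` is hypothetical, and the only figures used are the two Claude Sonnet 4.6 scores quoted in the policy finding.

```python
def framing_gap(explicit_framing_cdfi, adversarial_framing_cdfi):
    """Difference in CDFI points between two framings (hypothetical helper).

    A positive value means the model does better when the tradition's
    context is explicit in the prompt.
    """
    return round(explicit_framing_cdfi - adversarial_framing_cdfi, 1)


# Figures quoted in the primary policy finding.
print(framing_gap(89.4, 73.6))
```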
Before any CDFI scores go to print, the automated judge must pass a four-part certification:
| Part | What It Tests | Pass Threshold | SAICRED v2 Result |
|---|---|---|---|
| 1 | Intra-rater consistency (Cohen's kappa per metric) | kappa >= 0.60 on Critical metrics | PASS — May 7, 2026 |
| 2 | Anchor calibration | >= 90% accuracy | PASS — 98.3% |
| 3 | Adversarial invariance | >= 90% | PASS — 100% |
| 4 | Cap gate precision | >= 90% | PASS — 100% |
All four parts cleared: May 11, 2026.
Full protocol: docs/reliability/judge-reliability-protocol.md
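For readers unfamiliar with Part 1, the sketch below shows how Cohen's kappa can be computed between two scoring runs of the same judge on the same responses. It is an illustration only: it uses the standard unweighted kappa, which may differ from the variant used in the protocol, and the function name `cohens_kappa`, the 1–5 label scale, and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(run_a, run_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if not run_a or len(run_a) != len(run_b):
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run_a)

    # Observed agreement: share of items given the same label in both runs.
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n

    # Chance agreement: product of each label's marginal frequencies.
    counts_a, counts_b = Counter(run_a), Counter(run_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both runs used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: the same six responses scored twice.
first_run = [5, 5, 4, 3, 3, 2]
second_run = [5, 4, 4, 3, 3, 2]
kappa = cohens_kappa(first_run, second_run)
print(f"kappa={kappa:.3f}, clears 0.60 threshold: {kappa >= 0.60}")
```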
The methodology is tradition-agnostic. Any religious institution evaluating AI model reliability against its own doctrinal standards can use this framework by substituting:
- The doctrinal authority level taxonomy with the authority structure of the target tradition
- The failure mode taxonomy with tradition-specific failure modes
- The scoring anchors with examples drawn from the target tradition's texts
- The deployment tier thresholds, reviewed against the institutional risk profile
The seven-step translation sequence, the gate architecture, the reliability certification protocol, and the statistical requirements do not change. They are methodology, not theology.
Adaptation guide: docs/governance/adapting-for-other-traditions.md
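When substituting a new authority structure, a quick sanity check on the adapted weights can catch configuration mistakes early. The sketch below assumes the matrix maps each authority level to a dictionary of metric weights and that the weights in each level should sum to 1; both the structure and the sum-to-1 rule are assumptions for illustration and may not match the actual schema of `configs/authority_matrix.json`. The level names, metric names, and `validate_authority_matrix` are hypothetical.

```python
import math


def validate_authority_matrix(matrix):
    """Return a list of problems; an empty list means no issue was found."""
    if not matrix:
        return ["matrix has no authority levels"]

    problems = []
    reference_metrics = set(next(iter(matrix.values())))
    for level, weights in matrix.items():
        # Every authority level should weight the same set of metrics.
        if set(weights) != reference_metrics:
            problems.append(f"{level}: metric names differ from the first level")
        if any(w < 0 for w in weights.values()):
            problems.append(f"{level}: negative weight")
        # Assumed rule: weights within one level sum to 1.
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
            problems.append(f"{level}: weights do not sum to 1")
    return problems


# Hypothetical two-level authority structure for another tradition.
adapted = {
    "core_teaching": {"doctrinal_precision": 0.6, "moral_fidelity": 0.4},
    "scholarly_opinion": {"doctrinal_precision": 0.5, "moral_fidelity": 0.5},
}
print(validate_authority_matrix(adapted))
```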
Six limitations are documented with exact disclosure language:
| # | Limitation | Publication Impact |
|---|---|---|
| L1 | Authority level classification pending — all 400 v2 prompts used ordinary_magisterium default | Blocks final CDFI |
| L2 | Human theological review pending | Blocks full publication |
| L3 | Pastoral appropriateness kappa = 0.352 (formula weight 0.02–0.05; non-blocking) | Disclosure only |
| L4 | Stability scores hardcoded at 3.0 — deferred to v2.1 | Non-blocking |
| L5 | Positions 1–5 not statistically distinguishable (only Grok vs. Claude gap reaches p < 0.05) | Interpretive constraint |
| L6 | Scores tied to specific model versions; expire on major version update | Active via versioning protocol |
Full register with paste-ready disclosure language: LIMITATIONS.md
This project used Claude (Anthropic) for methodology development, document drafting, scoring architecture design, and repository construction (March–May 2026). All AI-generated output was treated as draft material subject to human review. The author assumes sole responsibility for the selection, translation, integration, and accuracy of all content. The selection of the seven source publications, the CDFI formula, the weighting matrix, the gate architecture, the reliability protocol, and all benchmark methodology decisions are the original intellectual contribution of the author.
@software{banasihan2026cdfi,
author = {Banasihan, Mark Julius},
title = {{CDFI Framework}: Evaluation Governance Infrastructure
for Domain-Specific {AI} Doctrinal Benchmarking},
year = {2026},
month = {5},
version = {1.1},
doi = {10.5281/zenodo.20464408},
url = {https://doi.org/10.5281/zenodo.20464408},
license = {Apache-2.0}
}
See also: CITATION.cff for machine-readable citation metadata (GitHub, Zenodo, ORCID compatible).
Copyright © 2026 Mark Julius Banasihan. Licensed under the Apache License 2.0. The methodology is free to use, adapt, and extend. Attribution required.
Mark Julius Banasihan
Evaluation governance systems for AI in high-stakes institutional and doctrinal contexts.
GitHub · LinkedIn · ORCID · Email · Atlanta, Georgia, United States