How do you establish mathematical trust in raw datasets before they feed high-stakes models, executive reporting, or regulatory analytics?
I engineered this Python-driven diagnostic engine to act as an automated intake audit for raw data. The system is designed to identify forensic anomalies that traditional descriptive data-quality checks often miss, including hidden null-equivalent strings, dominant-bias patterns, format inconsistencies, and zero-information fields.
The objective is simple: prevent corrupted or structurally weak data from reaching downstream models, dashboards, or executive decision processes.
This project demonstrates a practical data-governance principle:
Data quality should not be a vague technical concern.
It should be converted into a measurable integrity score, a remediation map, and an audit-ready executive narrative.
The executive dashboard translates column-level forensic diagnostics into a leadership-ready view of dataset health, remediation potential, and final audit readiness.
The engine identifies structural data-quality issues that can survive ordinary profiling:
- hidden null-equivalent strings such as "none", "N/A", "???", ".", and blank strings
- high null density
- dominant-value bias
- zero-variance columns
- inconsistent string formats or lengths
These checks are designed to find corruption patterns, not just obvious missing values.
Each column receives a weighted integrity score based on detected issues.
The current scoring framework evaluates:
- density risk
- format inconsistency
- dominant bias
- zero variance
The dataset-level score is calculated as the average of the column integrity scores.
The engine distinguishes between repairable and structural issues.
Repairable issues, such as density and format inconsistencies, are used to estimate a simulated post-remediation score. This gives stakeholders a practical view of how much audit readiness can improve if known defects are addressed.
The engine produces two forms of stakeholder evidence:
- an Excel-based forensic audit ledger for technical owners
- a PowerPoint executive scorecard for decision-makers
This supports the “No Cold Handoffs” standard: executives see the health signal, while developers and data owners see the repair map.
Data science is only as reliable as the data it consumes. This engine acts as a gatekeeper that prevents corrupted, biased, or structurally weak data from reaching downstream analytics.
Standard profiling tools typically report missing values and summary statistics. This engine focuses on forensic failure modes: hidden null-like values, dominant-bias patterns, inconsistent formats, and structurally non-informative fields.
It is far cheaper to fix data at the intake stage than to remediate a model, dashboard, or regulatory analysis after incorrect outputs have already been produced.
The integrity score converts data-quality review into a measurable business signal. Stakeholders receive a clear Go / No-Go metric for whether a dataset is ready for downstream use.
The engine supports both executive and technical reporting:
- executive scorecard for leadership review
- forensic audit ledger for remediation owners
- column-level health chart for visual inspection
Primary source file:
src/forensic_diagnostic_engine.py
The executable core is implemented as a class-based diagnostic engine:
DataForensicEngine
The engine accepts a pandas DataFrame, audits each column, calculates column-level scores, calculates a dataset integrity score, estimates remediation potential, and generates visual / reporting artifacts.
The engine centralizes key controls:
NULL_THRESHOLD = 0.20
FORMAT_DEVIATION_LIMIT = 0.05
BIAS_THRESHOLD = 0.99
These thresholds allow the audit posture to be tuned for different enterprise environments or risk tolerances.
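As a rough illustration of how thresholds like these could gate column-level flags, here is a minimal sketch in plain Python. It is not the engine's actual code: the helper name `flag_column`, the choice of strict versus inclusive comparisons, and the reading of `FORMAT_DEVIATION_LIMIT` as the tolerated share of strings whose length differs from the most common length are all assumptions made for the example.

```python
from collections import Counter

NULL_THRESHOLD = 0.20
FORMAT_DEVIATION_LIMIT = 0.05
BIAS_THRESHOLD = 0.99


def flag_column(values):
    """Return the set of issue names found in one column (hypothetical helper)."""
    flags = set()
    total = len(values)
    if total == 0:
        return flags

    present = [v for v in values if v is not None]

    # Density: share of missing values exceeds the tolerated fraction (assumed strict).
    if (total - len(present)) / total > NULL_THRESHOLD:
        flags.add("Density")
    if not present:
        return flags

    counts = Counter(present)
    top_share = counts.most_common(1)[0][1] / len(present)
    if len(counts) == 1:
        # Zero variance: a single distinct value carries no information.
        flags.add("Zero_Variance")
    elif top_share >= BIAS_THRESHOLD:
        # Dominant bias: one value accounts for nearly every row.
        flags.add("Dominant_Bias")

    # Format inconsistency (assumed definition): share of string values whose
    # length differs from the most common string length.
    strings = [v for v in present if isinstance(v, str)]
    if strings:
        lengths = Counter(len(s) for s in strings)
        deviating = len(strings) - lengths.most_common(1)[0][1]
        if deviating / len(strings) > FORMAT_DEVIATION_LIMIT:
            flags.add("Format_Inconsistency")

    return flags


ids = ["A001", "A002", "A03", "A004"]   # one malformed identifier
sparse = [1, None, None, 4, 5]          # 40% missing
constant = ["US"] * 10                  # single distinct value
biased = ["Y"] * 199 + ["N"]            # 99.5% one value
```

Under these assumptions, a length-based format check suits identifier-like fields; free-text columns vary in length by nature and would need a different rule.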
Column-level deductions are governed through explicit weights:
WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25
}
This creates a transparent scoring methodology rather than an opaque quality label.
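To make the weighting concrete, the sketch below shows one way the deductions could combine into a column score and a dataset score. The function names `column_score` and `dataset_score`, the 0 to 100 scale, and the rule that each detected issue subtracts its full weight are assumptions for illustration; only the weights and the "average of column scores" rule come from this document.

```python
WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25,
}


def column_score(flags):
    """Score one column on an assumed 0-100 scale from its detected issues."""
    deduction = sum(WEIGHTS[name] for name in flags)
    return round(100 * (1 - deduction), 2)


def dataset_score(column_flags):
    """Dataset-level score: plain average of the column scores."""
    scores = [column_score(flags) for flags in column_flags.values()]
    return round(sum(scores) / len(scores), 2)


# Hypothetical audit result: column name -> set of detected issues.
audit = {
    "customer_id": {"Format_Inconsistency"},
    "region": {"Dominant_Bias"},
    "notes": {"Density"},
    "amount": set(),
}

print({name: column_score(flags) for name, flags in audit.items()})
print(dataset_score(audit))
```

Because the four weights sum to 1.0, a column that triggers every check bottoms out at zero under this reading, which keeps the scale bounded without clamping.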
Hidden Null Detection
The engine identifies null-like strings that can bypass standard .isnull() checks:
STRING_NULLS = ['nan', 'null', 'na', 'n/a', 'undefined', '???', '.', '']
This is especially important in enterprise data pipelines where bad inputs may be encoded as strings instead of true nulls.
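The following sketch illustrates the idea with plain Python values rather than a DataFrame. The helper names (`is_true_null`, `is_hidden_null`, `null_profile`) are hypothetical, and trimming whitespace and lowercasing before the lookup is an assumption about normalization, not a statement of what the engine does.

```python
import math

STRING_NULLS = ['nan', 'null', 'na', 'n/a', 'undefined', '???', '.', '']


def is_true_null(value):
    """True for values a standard null check would already catch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_hidden_null(value):
    """True for strings that look populated but carry no information."""
    # Assumed normalization: trim whitespace, compare case-insensitively.
    return isinstance(value, str) and value.strip().lower() in STRING_NULLS


def null_profile(values):
    """Count true and hidden nulls and report the combined null rate."""
    true_nulls = sum(is_true_null(v) for v in values)
    hidden_nulls = sum(is_hidden_null(v) for v in values)
    return {
        "true_nulls": true_nulls,
        "hidden_nulls": hidden_nulls,
        "effective_null_rate": (true_nulls + hidden_nulls) / len(values),
    }


vendor = ["Acme", "N/A", None, "  ", "???", "Globex", "null", "."]
print(null_profile(vendor))
```

In this example a standard null check would report one missing value out of eight, while the combined view shows that six of the eight entries carry no usable information.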
The remediation simulation assumes density and format issues are repairable, while dominant bias and zero variance are treated as structural issues.
This distinction helps teams prioritize engineering effort:
- repairable defects can increase the audit score
- structural defects may require data-source, feature, or business-process review
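The repairable versus structural split can be sketched as a second scoring pass that ignores the repairable issues. As before, this is an illustrative assumption rather than the engine's implementation: the names `REPAIRABLE` and `simulate_remediation`, the 0 to 100 scale, and the idea that a repaired issue simply stops deducting its weight are all hypothetical.

```python
WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25,
}

# Issues the simulation treats as fixable; the rest are structural.
REPAIRABLE = {"Density", "Format_Inconsistency"}


def score(flags):
    """Column score on an assumed 0-100 scale."""
    return 100 * (1 - sum(WEIGHTS[name] for name in flags))


def simulate_remediation(column_flags):
    """Return (current, post-remediation) dataset scores."""
    n = len(column_flags)
    current = sum(score(f) for f in column_flags.values()) / n
    # Drop repairable issues; structural ones keep deducting.
    remediated = sum(score(f - REPAIRABLE) for f in column_flags.values()) / n
    return round(current, 2), round(remediated, 2)


# Hypothetical audit result: column name -> set of detected issues.
audit = {
    "customer_id": {"Format_Inconsistency"},
    "region": {"Dominant_Bias"},
    "notes": {"Density"},
    "amount": set(),
}

current, remediated = simulate_remediation(audit)
print(f"current={current}, simulated post-remediation={remediated}")
```

In this example the gap between the two numbers is the share of the deduction attributable to repairable defects; the dominant-bias deduction on `region` remains after remediation because it is treated as structural.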
The engine can export:
| Output | Purpose |
|---|---|
| Forensic_Audit_Ledger.xlsx | Column-level audit ledger with scores, statuses, alert types, and forensic rationale |
| Executive_Scorecard.pptx | Executive scorecard with dataset integrity score, remediation potential, and visual column-health summary |
Install dependencies:
pip install pandas numpy matplotlib python-pptx openpyxl
Run the demo engine:
python src/forensic_diagnostic_engine.py
The script includes a demo dataset with intentional forensic flaws, including:
- format inconsistency
- dominant bias
- high null density
- hidden null-equivalent strings
- clean control fields
By default, the demo prints the current dataset integrity score and simulated post-remediation score.
To export the Excel audit ledger and PowerPoint scorecard, call or uncomment:
engine.export_to_excel("Andrew_Goad_Forensic_Audit.xlsx")
engine.export_to_pptx("Executive_Data_Scorecard.pptx")
Repository layout:
forensic-data-integrity/
│
├── README.md
│
├── docs/
│   └── executive_dashboard_preview.png
│
└── src/
    └── forensic_diagnostic_engine.py
All data and visual outputs in this repository are generated from synthetic or anonymized datasets to protect proprietary information.
This framework demonstrates methodology for high-stakes enterprise and regulatory environments, but it does not expose real customer data, proprietary data-quality rules, confidential model inputs, or regulated production pipelines.
Important interpretation boundaries:
- The integrity score is a diagnostic metric, not a regulatory certification.
- The remediation potential is a simulated estimate, not a guarantee of production remediation.
- The demo dataset is intentionally flawed to illustrate forensic detection logic.
- The framework is designed for intake review, governance triage, and stakeholder communication.
No Cold Handoffs — engineering zero-defect, audit-ready results so stakeholders internalize the underlying “why.”
This project is designed to ensure that data-quality findings are not trapped inside technical outputs. The goal is to translate forensic diagnostics into clear, defensible, stakeholder-ready evidence before downstream analytics depend on the data.
