Universal Data Integrity: Forensic Gatekeeper Engine

Strategic Intent: Forensic vs. Descriptive Data Quality

How do you establish mathematical trust in raw datasets before they feed high-stakes models, executive reporting, or regulatory analytics?

I engineered this Python-driven diagnostic engine to act as an automated intake audit for raw data. The system is designed to identify forensic anomalies that traditional descriptive data-quality checks often miss, including hidden null-equivalent strings, dominant-bias patterns, format inconsistencies, and zero-information fields.

The objective is simple: prevent corrupted or structurally weak data from reaching downstream models, dashboards, or executive decision processes.

This project demonstrates a practical data-governance principle:

Data quality should not be a vague technical concern.
It should be converted into a measurable integrity score, a remediation map, and an audit-ready executive narrative.

Executive Dashboard

The executive dashboard translates column-level forensic diagnostics into a leadership-ready view of dataset health, remediation potential, and final audit readiness.

The Gatekeeper Framework

1. Forensic Anomaly Detection

The engine identifies structural data-quality issues that can survive ordinary profiling:

hidden null-equivalent strings such as null, N/A, undefined, ???, ., and blank strings
high null density
dominant-value bias
zero-variance columns
inconsistent string formats or lengths

These checks are designed to find corruption patterns, not just obvious missing values.

2. Multi-Dimensional Quality Scoring

Each column receives a weighted integrity score based on detected issues.

The current scoring framework evaluates:

density risk
format inconsistency
dominant bias
zero variance

The dataset-level score is calculated as the average of the column integrity scores.
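As a minimal sketch of that aggregation: the snippet below averages per-column scores into one dataset score. The 0 to 100 scale, the column names, and the helper name `dataset_integrity_score` are illustrative assumptions, not taken from the engine source.

```python
from statistics import mean

# Hypothetical per-column integrity scores (0-100 scale assumed for illustration).
column_scores = {
    "customer_id": 100.0,
    "region": 75.0,
    "signup_date": 70.0,
    "status": 55.0,
}


def dataset_integrity_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the column integrity scores."""
    if not scores:
        raise ValueError("at least one column score is required")
    return mean(scores.values())


dataset_score = dataset_integrity_score(column_scores)
print(f"Dataset integrity score: {dataset_score:.1f}")  # prints 75.0
```

Because the mean is unweighted, one badly damaged column in a wide dataset moves the headline score only slightly, so the column-level scores are still worth reading alongside it.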

3. Simulated Remediation Potential

The engine distinguishes between repairable and structural issues.

Repairable issues, such as density and format inconsistencies, are used to estimate a simulated post-remediation score. This gives stakeholders a practical view of how much audit readiness can improve if known defects are addressed.

4. Audit-Ready Output

The engine produces two forms of stakeholder evidence:

an Excel-based forensic audit ledger for technical owners
a PowerPoint executive scorecard for decision-makers

This supports the “No Cold Handoffs” standard: executives see the health signal, while developers and data owners see the repair map.

Executive Interpretation

Garbage-In, Garbage-Out Risk

Data science is only as reliable as the data it consumes. This engine acts as a gatekeeper that prevents corrupted, biased, or structurally weak data from reaching downstream analytics.

Forensic vs. Descriptive Quality

Standard profiling tools often identify missing values and summary statistics. This engine focuses on forensic failure modes: hidden null-like values, dominant-bias patterns, inconsistent formats, and structurally non-informative fields.

Reducing Technical Debt at Intake

It is far cheaper to fix data at the intake stage than to remediate a model, dashboard, or regulatory analysis after incorrect outputs have already been produced.

Quantifiable Integrity Score

The integrity score converts data-quality review into a measurable business signal. Stakeholders receive a single metric that supports a clear Go / No-Go decision on whether a dataset is ready for downstream use.

Multi-Channel Reporting

The engine supports both executive and technical reporting:

executive scorecard for leadership review
forensic audit ledger for remediation owners
column-level health chart for visual inspection

Technical Architecture

Core Engine

Primary source file:

src/forensic_diagnostic_engine.py

The executable core is implemented as a class-based diagnostic engine:

DataForensicEngine

The engine accepts a pandas DataFrame, audits each column, calculates column-level scores, calculates a dataset integrity score, estimates remediation potential, and generates visual / reporting artifacts.

Configurable Thresholds

The engine centralizes key controls:

NULL_THRESHOLD = 0.20
FORMAT_DEVIATION_LIMIT = 0.05
BIAS_THRESHOLD = 0.99

These thresholds allow the audit posture to be tuned for different enterprise environments or risk tolerances.
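The sketch below shows one way these three thresholds could be applied to a single column. The comparison rules are assumptions made for illustration: null density is the share of missing values, format deviation is the share of values whose string length differs from the most common length, and dominant bias is the share of the most frequent value. The function name `threshold_flags` is hypothetical and the engine's actual rules may differ.

```python
import pandas as pd

NULL_THRESHOLD = 0.20          # max tolerated share of missing values
FORMAT_DEVIATION_LIMIT = 0.05  # max tolerated share of off-pattern string lengths
BIAS_THRESHOLD = 0.99          # share at which one value is treated as dominant


def threshold_flags(series: pd.Series) -> dict[str, bool]:
    """Illustrative threshold checks; not the engine's own implementation."""
    non_null = series.dropna()
    null_share = series.isna().mean() if len(series) else 0.0

    if len(non_null) == 0:
        return {
            "Density": bool(null_share > NULL_THRESHOLD),
            "Format_Inconsistency": False,
            "Dominant_Bias": False,
        }

    # Share of the single most frequent non-null value.
    top_share = non_null.value_counts(normalize=True).iloc[0]

    # Share of values whose string length differs from the modal length.
    lengths = non_null.astype(str).str.len()
    deviation = (lengths != lengths.mode().iloc[0]).mean()

    return {
        "Density": bool(null_share > NULL_THRESHOLD),
        "Format_Inconsistency": bool(deviation > FORMAT_DEVIATION_LIMIT),
        "Dominant_Bias": bool(top_share >= BIAS_THRESHOLD),
    }


codes = pd.Series(["AB-100", "AB-101", "AB-1", None, None, "AB-103"])
flags = threshold_flags(codes)
print(flags)
```

In this example two of six values are missing (about 33%, above the 20% limit) and one of the four remaining codes has an off-pattern length (25%, above the 5% limit), so both repairable flags are raised while dominant bias is not.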

Weighted Scoring Model

Column-level deductions are governed through explicit weights:

WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25
}

This creates a transparent scoring methodology rather than an opaque quality label.
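To illustrate how explicit weights keep the score traceable, here is a minimal sketch. It assumes that each triggered issue deducts its full weight from a starting score of 100. The all-or-nothing deduction, the 0 to 100 scale, and the function name `column_integrity_score` are assumptions for illustration only.

```python
WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25,
}


def column_integrity_score(flags: dict[str, bool]) -> float:
    """Start at 100 and subtract the weight of every triggered issue."""
    deduction = sum(WEIGHTS[name] for name, hit in flags.items() if hit)
    return round(100.0 * (1.0 - deduction), 2)


score = column_integrity_score(
    {
        "Density": True,
        "Format_Inconsistency": False,
        "Dominant_Bias": True,
        "Zero_Variance": False,
    }
)
print(score)  # 0.30 + 0.20 deducted, prints 50.0
```

Since the weights sum to 1.00, a column that triggers every issue bottoms out at 0 under this assumption, and each point lost can be traced back to a named issue.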

Hidden Null Detection

The engine identifies null-like strings that can bypass standard .isnull() checks:

STRING_NULLS = ['nan', 'null', 'na', 'n/a', 'undefined', '???', '.', '']

This is especially important in enterprise data pipelines where bad inputs may be encoded as strings instead of true nulls.
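A short sketch of the idea, using the list above: string values are stripped and lowercased before being compared against `STRING_NULLS`, and matches are converted to true nulls. The strip-and-lowercase matching rule and the helper names `is_hidden_null` and `normalize_hidden_nulls` are assumptions, not the engine's confirmed behavior.

```python
import pandas as pd

STRING_NULLS = ['nan', 'null', 'na', 'n/a', 'undefined', '???', '.', '']


def is_hidden_null(value) -> bool:
    """True for strings that match a null-like token after strip + lowercase."""
    return isinstance(value, str) and value.strip().lower() in STRING_NULLS


def normalize_hidden_nulls(series: pd.Series) -> pd.Series:
    """Return a copy of the series with null-like strings replaced by NaN."""
    hidden = series.map(is_hidden_null).astype(bool)
    return series.mask(hidden)


raw = pd.Series(["Alice", " N/A ", "???", None, "", "Bob", "."])
cleaned = normalize_hidden_nulls(raw)

print(int(raw.isna().sum()))      # true nulls visible to .isnull(): 1
print(int(cleaned.isna().sum()))  # nulls after normalization: 5
```

Here a plain `.isnull()` check reports one missing value out of seven, while the normalized column shows five, which is the gap a density check would otherwise miss.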

Remediation Potential

The remediation simulation assumes density and format issues are repairable, while dominant bias and zero variance are treated as structural issues.

This distinction helps teams prioritize engineering effort:

repairable defects can increase the audit score
structural defects may require data-source, feature, or business-process review
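The repairable versus structural split can be sketched as a second scoring pass that ignores repairable deductions. The deduction rule (full weight per triggered issue, 0 to 100 scale), the example columns, and the names `REPAIRABLE`, `column_score`, and `dataset_score` are assumptions for illustration.

```python
WEIGHTS = {
    "Density": 0.30,
    "Format_Inconsistency": 0.25,
    "Dominant_Bias": 0.20,
    "Zero_Variance": 0.25,
}
# Issues assumed fixable at intake; the rest are treated as structural.
REPAIRABLE = frozenset({"Density", "Format_Inconsistency"})


def column_score(issues: set[str], ignore: frozenset = frozenset()) -> float:
    """Score a column from its triggered issues, skipping any ignored ones."""
    deduction = sum(WEIGHTS[name] for name in issues if name not in ignore)
    return round(100.0 * (1.0 - deduction), 2)


def dataset_score(columns: dict[str, set[str]], ignore: frozenset = frozenset()) -> float:
    """Average the column scores across the dataset."""
    scores = [column_score(issues, ignore) for issues in columns.values()]
    return round(sum(scores) / len(scores), 2)


# Hypothetical audit result: column name -> triggered issues.
column_issues = {
    "email": {"Density", "Format_Inconsistency"},
    "country": {"Dominant_Bias"},
    "customer_id": set(),
}

current = dataset_score(column_issues)
simulated = dataset_score(column_issues, ignore=REPAIRABLE)
print(current, simulated)  # prints 75.0 93.33
```

The gap between the two numbers is the remediation potential. The `country` column keeps its deduction in both passes because dominant bias is treated as structural.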

Reporting Outputs

The engine can export:

Output	Purpose
`Forensic_Audit_Ledger.xlsx`	Column-level audit ledger with scores, statuses, alert types, and forensic rationale
`Executive_Scorecard.pptx`	Executive scorecard with dataset integrity score, remediation potential, and visual column-health summary

How to Run

Install dependencies:

pip install pandas numpy matplotlib python-pptx openpyxl

Run the demo engine:

python src/forensic_diagnostic_engine.py

The script includes a demo dataset with intentional forensic flaws, including:

format inconsistency
dominant bias
high null density
hidden null-equivalent strings
clean control fields

By default, the demo prints the current dataset integrity score and simulated post-remediation score.

To export the Excel audit ledger and PowerPoint scorecard, call or uncomment:

engine.export_to_excel("Andrew_Goad_Forensic_Audit.xlsx")
engine.export_to_pptx("Executive_Data_Scorecard.pptx")

Repository Structure

forensic-data-integrity/
│
├── README.md
│
├── docs/
│   └── executive_dashboard_preview.png
│
└── src/
    └── forensic_diagnostic_engine.py

Data Privacy and Interpretation Boundaries

All data and visual outputs in this repository are generated from synthetic or anonymized datasets to protect proprietary information.

This framework demonstrates methodology for high-stakes enterprise and regulatory environments, but it does not expose real customer data, proprietary data-quality rules, confidential model inputs, or regulated production pipelines.

Important interpretation boundaries:

The integrity score is a diagnostic metric, not a regulatory certification.
The remediation potential is a simulated estimate, not a guarantee of production remediation.
The demo dataset is intentionally flawed to illustrate forensic detection logic.
The framework is designed for intake review, governance triage, and stakeholder communication.

Portfolio Philosophy

No Cold Handoffs: engineer zero-defect, audit-ready results so that stakeholders understand the underlying “why.”

This project is designed to ensure that data-quality findings are not trapped inside technical outputs. The goal is to translate forensic diagnostics into clear, defensible, stakeholder-ready evidence before downstream analytics depend on the data.
