Database-Backed Metadata Storage Design

Overview

This document describes the design for persistent metadata storage to support the Oraculus-DI-Auditor system. The database will store document metadata, provenance records, anomaly detections, and reference graphs for efficient querying and auditing.

Goals

Persistent Storage: Replace in-memory storage with durable database backend
Query Performance: Enable fast lookups by document ID, jurisdiction, date range
Audit Trail: Maintain complete history of all analyses and detections
Scalability: Support corpus of 50,000+ documents (USC, CFR, state codes)
Provenance: Track full lineage from source files to analysis results

Database Options

Option 1: SQLite (Recommended for v1.0)

Pros:

Zero configuration, file-based
Built into Python standard library
Sufficient for single-node deployments
Easy backup (copy .db file)

Cons:

Limited concurrency
Not suitable for distributed systems

Option 2: PostgreSQL (Recommended for Production)

Pros:

Full ACID compliance
Advanced indexing (GiST, GIN)
JSON/JSONB support for flexible schema
Excellent concurrent read/write performance
Replication and high availability

Cons:

Requires separate server installation
More complex setup and maintenance

Option 3: Hybrid Approach

SQLite for development and small deployments
PostgreSQL for production and large-scale deployments
Abstract database layer (SQLAlchemy) for compatibility

Schema Design

Core Tables

1. `documents`

Stores normalized document metadata.

CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id VARCHAR(255) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    document_type VARCHAR(50) NOT NULL,
    jurisdiction VARCHAR(100),
    authority VARCHAR(255),
    version_date DATE,
    signatory VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata_json TEXT  -- JSON blob for extensibility
);

CREATE INDEX idx_documents_type ON documents(document_type);
CREATE INDEX idx_documents_jurisdiction ON documents(jurisdiction);
CREATE INDEX idx_documents_version_date ON documents(version_date);

2. `provenance`

Tracks document provenance and integrity.

CREATE TABLE provenance (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id VARCHAR(255) NOT NULL,
    source_path TEXT NOT NULL,
    hash VARCHAR(64) NOT NULL,  -- SHA-256
    verified_on TIMESTAMP NOT NULL,
    file_size_bytes INTEGER,
    format VARCHAR(20),  -- json, txt, pdf, xml
    FOREIGN KEY (document_id) REFERENCES documents(document_id)
);

CREATE INDEX idx_provenance_doc_id ON provenance(document_id);
CREATE INDEX idx_provenance_hash ON provenance(hash);

3. `sections`

Stores document sections for full-text search.

CREATE TABLE sections (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id VARCHAR(255) NOT NULL,
    section_id VARCHAR(100) NOT NULL,
    content TEXT NOT NULL,
    order_index INTEGER,
    FOREIGN KEY (document_id) REFERENCES documents(document_id)
);

CREATE INDEX idx_sections_doc_id ON sections(document_id);
CREATE UNIQUE INDEX idx_sections_composite ON sections(document_id, section_id);
CREATE VIRTUAL TABLE sections_fts USING fts5(document_id, section_id, content);

4. `references`

Tracks citations and cross-references.

CREATE TABLE references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_document_id VARCHAR(255) NOT NULL,
    target_document_id VARCHAR(255),  -- NULL if external reference
    reference_text TEXT NOT NULL,
    reference_type VARCHAR(50),  -- usc, cfr, case, statute, etc.
    FOREIGN KEY (source_document_id) REFERENCES documents(document_id),
    FOREIGN KEY (target_document_id) REFERENCES documents(document_id)
);

CREATE INDEX idx_references_source ON references(source_document_id);
CREATE INDEX idx_references_target ON references(target_document_id);
CREATE INDEX idx_references_type ON references(reference_type);

5. `analyses`

Stores analysis results from audit engine.

CREATE TABLE analyses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id VARCHAR(255) NOT NULL,
    analysis_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    anomaly_count INTEGER NOT NULL,
    scalar_score REAL NOT NULL,  -- 0.0 to 1.0
    engine_version VARCHAR(20),
    FOREIGN KEY (document_id) REFERENCES documents(document_id)
);

CREATE INDEX idx_analyses_doc_id ON analyses(document_id);
CREATE INDEX idx_analyses_timestamp ON analyses(analysis_timestamp);
CREATE INDEX idx_analyses_score ON analyses(scalar_score);

6. `anomalies`

Stores detected anomalies with full details.

CREATE TABLE anomalies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    analysis_id INTEGER NOT NULL,
    anomaly_id VARCHAR(255) NOT NULL,  -- e.g., fiscal:missing-provenance-hash
    issue TEXT NOT NULL,
    severity VARCHAR(20) NOT NULL,  -- low, medium, high
    layer VARCHAR(50) NOT NULL,  -- fiscal, constitutional, surveillance, etc.
    details_json TEXT,  -- JSON blob with structured details
    FOREIGN KEY (analysis_id) REFERENCES analyses(id)
);

CREATE INDEX idx_anomalies_analysis ON anomalies(analysis_id);
CREATE INDEX idx_anomalies_severity ON anomalies(severity);
CREATE INDEX idx_anomalies_layer ON anomalies(layer);

7. `embeddings`

Stores vector embeddings for semantic search.

CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id VARCHAR(255) NOT NULL,
    embedding_vector BLOB NOT NULL,  -- Serialized numpy array
    model_name VARCHAR(100) NOT NULL,  -- e.g., tfidf-local
    dimension INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES documents(document_id)
);

CREATE INDEX idx_embeddings_doc_id ON embeddings(document_id);
CREATE INDEX idx_embeddings_model ON embeddings(model_name);

Implementation Plan

Phase 1: Database Abstraction Layer

Create a database interface using SQLAlchemy ORM:

# src/oraculus_di_auditor/storage/database.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

class DatabaseManager:
    def __init__(self, db_url: str):
        self.engine = create_engine(db_url)
        self.Session = sessionmaker(bind=self.engine)

    def store_document(self, doc: dict) -> int:
        """Store normalized document and return DB ID."""
        pass

    def get_document(self, document_id: str) -> dict:
        """Retrieve document by ID."""
        pass

    def store_analysis(self, document_id: str, result: dict) -> int:
        """Store analysis result."""
        pass

    def query_anomalies(
        self, severity: str = None, layer: str = None
    ) -> list[dict]:
        """Query anomalies by filters."""
        pass

Phase 2: Migration from In-Memory to Database

Replace in-memory storage in:

provenance_tracker.py → Use provenance and references tables
retriever.py → Use embeddings table with optional vector database
reporter.py → Query from analyses and anomalies tables

Phase 3: Query API

Add query endpoints to REST API:

@app.get("/api/v1/documents/{document_id}")
async def get_document(document_id: str):
    """Retrieve document by ID."""
    pass

@app.get("/api/v1/anomalies")
async def query_anomalies(
    severity: str = None,
    layer: str = None,
    limit: int = 100
):
    """Query anomalies with filters."""
    pass

@app.get("/api/v1/documents/{document_id}/references")
async def get_references(document_id: str):
    """Get all references for a document."""
    pass

Vector Database Integration (Optional)

For large-scale semantic search, consider specialized vector databases:

Option 1: FAISS (Facebook AI Similarity Search)

High-performance in-memory vector search
Good for read-heavy workloads
Requires periodic index rebuilds

Option 2: Qdrant

Purpose-built vector database
REST API and Python client
Persistent storage with fast search

Option 3: Chroma

Lightweight, embedded vector DB
Easy integration with Python
Good for development and small deployments

Migration Strategy

Step 1: Create Schema

python scripts/create_schema.py --db-url sqlite:///data/oraculus.db

Step 2: Import Existing Data

python scripts/import_legacy_data.py \
  --source data/cases \
  --db-url sqlite:///data/oraculus.db

Step 3: Update Configuration

# config.py
DATABASE_URL = os.getenv(
    "ORACULUS_DB_URL",
    "sqlite:///data/oraculus.db"
)

Step 4: Gradual Rollout

Add database writes alongside existing in-memory storage
Validate consistency
Switch reads to database
Remove in-memory storage code

Performance Considerations

Indexing Strategy

Index foreign keys for join performance
Index commonly queried fields (document_type, jurisdiction, severity)
Full-text search index for sections (FTS5 in SQLite, GIN in PostgreSQL)

Batch Operations

Use bulk inserts for initial data load
Commit in batches of 1000 documents

Caching

Cache frequently accessed documents in Redis
Cache analysis results for 1 hour
Invalidate cache on document updates

Partitioning (PostgreSQL Only)

Partition analyses by month
Partition anomalies by severity and date

Backup and Recovery

SQLite

# Backup
sqlite3 data/oraculus.db ".backup data/oraculus_backup.db"

# Restore
cp data/oraculus_backup.db data/oraculus.db

PostgreSQL

# Backup
pg_dump oraculus > oraculus_backup.sql

# Restore
psql oraculus < oraculus_backup.sql

Security Considerations

Connection Security: Use SSL/TLS for PostgreSQL connections
Access Control: Implement role-based access control (RBAC)
Encryption at Rest: Encrypt database files (LUKS, BitLocker, or database-level)
Audit Logging: Enable database audit logs for compliance
Input Validation: Sanitize all inputs to prevent SQL injection (use parameterized queries)

Testing Strategy

Unit Tests

Test database operations (CRUD)
Test query filters and joins
Test transaction rollback

Integration Tests

Test full pipeline: ingest → analyze → store → query
Test concurrent access (write conflicts)
Test backup/restore procedures

Performance Tests

Benchmark query performance with 10K, 50K, 100K documents
Profile slow queries with EXPLAIN
Test embedding search latency

Future Enhancements

GraphQL API: Alternative to REST for flexible queries
Time-Series Analysis: Track anomaly trends over time
Data Lake Integration: Export to Parquet for analytics
Multi-Tenancy: Isolate data by organization/project
Streaming Updates: Real-time notifications via WebSockets

References

SQLAlchemy Documentation: https://docs.sqlalchemy.org/
PostgreSQL Full-Text Search: https://www.postgresql.org/docs/current/textsearch.html
FAISS: https://github.com/facebookresearch/faiss
Qdrant: https://qdrant.tech/
Chroma: https://www.trychroma.com/

Conclusion

The database-backed metadata storage will provide a robust foundation for scaling the Oraculus-DI-Auditor system to production workloads. By using SQLAlchemy for database abstraction, we maintain flexibility to support both SQLite (development) and PostgreSQL (production) deployments.

Recommended Next Steps:

Implement database abstraction layer with SQLAlchemy
Create schema migration scripts
Add database configuration to config.py
Update existing modules to use database storage
Add query API endpoints
Write integration tests

FilesExpand file tree

database-design.md

Latest commit

History