The Oraculus Decimus Intellect Analyst is a foundational architecture for legal document ingestion, normalization, embedding, and anomaly auditing. It supports large-scale, chronological, and cross-referenced auditing of statutes, cases, and contracts.
Raw Documents → Ingest → Normalize → Embed → Analyze → Report
- Purpose: Read raw files into normalized JSON
- Inputs: Plain-text files, JSON documents, PDFs
- Outputs: Normalized JSON with standard schema
- Features:
- Multi-format support (TXT, JSON, PDF)
- Automatic file discovery
- Batch processing
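As a rough illustration of the ingest stage, the sketch below discovers TXT and JSON files under a directory and reads each into a dictionary. The names `discover_files`, `ingest_file`, and `ingest_directory`, and the default values filled in for missing fields, are assumptions made for this example rather than the project's actual API. PDF handling is left out because it would need a third-party parser.

```python
import json
from pathlib import Path

# Hypothetical: only the formats readable with the standard library.
SUPPORTED_SUFFIXES = {".txt", ".json"}


def discover_files(source_dir):
    """Return supported files under source_dir, sorted for a stable batch order."""
    root = Path(source_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def ingest_file(path):
    """Read one file into a dict carrying the basic document fields."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8")
    if path.suffix.lower() == ".json":
        doc = json.loads(raw)
        if not isinstance(doc, dict):
            raise ValueError(f"{path} does not contain a JSON object")
    else:
        doc = {"text": raw}
    # Fill in fields the raw file did not supply (assumed defaults).
    doc.setdefault("id", path.stem)
    doc.setdefault("title", path.stem)
    doc.setdefault("source", str(path))
    doc.setdefault("date", None)
    doc.setdefault("text", "")
    doc.setdefault("citations", [])
    return doc


def ingest_directory(source_dir):
    """Batch-process every supported file found under source_dir."""
    return [ingest_file(p) for p in discover_files(source_dir)]
```

Sorting the discovered paths keeps batch order stable between runs, which helps when comparing audit outputs.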
- Purpose: Transform to canonical schema with text chunking
- Process:
- Break long text into overlapping chunks (default: 512 characters with a 64-character overlap)
- Generate chunk metadata (position, length)
- Standardize document structure
- Schema Fields:
- id: Unique document identifier
- title: Document title
- source: Source information
- date: Document date
- text: Full text content
- citations: Array of referenced documents
- chunks: Text chunks with metadata
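The overlapping chunking used in the normalize stage can be sketched as a sliding window. This is a minimal example under stated assumptions: the function name `chunk_text` and the metadata keys `index`, `position`, and `length` are hypothetical, while the defaults of 512 and 64 come from the description above.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into overlapping character windows with position metadata."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "index": len(chunks),
            "position": start,      # character offset in the full text
            "length": len(piece),
            "text": piece,
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, consecutive chunks start 448 characters apart, so the last 64 characters of one chunk repeat as the first 64 of the next.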
- Purpose: Vectorize textual chunks for retrieval
- Modes:
- Local (default): Hash-based deterministic embeddings
- Model: Pluggable interface for transformer models (future)
- Features:
- Deterministic for reproducible tests
- No external dependencies in local mode
- 128-dimensional vectors by default
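The source does not say how the local mode derives vectors from hashes. One plausible approach is feature hashing, sketched below with only the standard library. The function name `embed_text`, whitespace tokenization, and the use of SHA-256 for bucketing are all assumptions made for this example.

```python
import hashlib
import math


def embed_text(text, dim=128):
    """Map text to a unit-length vector via signed feature hashing (assumed scheme)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which dimension to update
        sign = 1.0 if digest[4] % 2 == 0 else -1.0         # spreads collisions around zero
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Text with no tokens yields the zero vector, which is left as-is.
    return [v / norm for v in vec] if norm > 0 else vec
```

Because the output depends only on the input text, repeated runs give identical vectors, which is what makes this kind of embedding convenient for reproducible tests.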
- Purpose: Store vectors and perform similarity search
- Storage: NumPy .npy files for vectors and metadata
- Search: Cosine similarity with configurable top-k
- Features:
- Persistent storage
- Metadata tracking
- Fast similarity search
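A minimal sketch of .npy-backed storage with cosine top-k search might look like the following. The file names `vectors.npy` and `metadata.npy`, the function names, and the choice to store metadata as JSON strings (which avoids pickled object arrays) are assumptions for this example, not the project's confirmed layout.

```python
import json
from pathlib import Path

import numpy as np


def save_store(directory, vectors, metadata):
    """Persist vectors and per-vector metadata as two .npy files."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / "vectors.npy", np.asarray(vectors, dtype=np.float32))
    # Metadata is serialized to JSON strings so no pickling is needed on load.
    np.save(directory / "metadata.npy",
            np.asarray([json.dumps(m) for m in metadata], dtype=str))


def load_store(directory):
    """Load the vectors array and the decoded metadata list."""
    directory = Path(directory)
    vectors = np.load(directory / "vectors.npy")
    metadata = [json.loads(s) for s in np.load(directory / "metadata.npy")]
    return vectors, metadata


def search(vectors, query, top_k=5):
    """Return (row_index, cosine_similarity) pairs for the top_k closest rows."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    norms[norms == 0] = 1.0  # zero vectors score 0 instead of producing NaN
    scores = (vectors @ query) / norms
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

The returned row indices can be used to look up the matching entries in the metadata list.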
- Purpose: Rule-based detectors (with ML detectors planned) to find anomalies and inconsistencies
- Detectors:
- Long Sentence: Sentences exceeding 1000 characters
- Cross-Reference Mismatch: Citations in text not in citation array
- Contradictory Dates: Dates mentioned that don't match document date
- Output: Structured findings with severity levels
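The three rule-based detectors can be sketched with regular expressions. Everything specific here is an assumption for illustration: the U.S.C.-style citation pattern, the ISO `YYYY-MM-DD` date pattern, the naive sentence splitter, the severity labels, and the shape of each finding. A real implementation would need patterns matched to the corpus.

```python
import re

# Hypothetical patterns; real citation and date formats vary widely.
CITATION_RE = re.compile(r"\b\d+\s+U\.S\.C\.\s+§\s*\d+")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive: also splits after abbreviations


def finding(detector, severity, detail):
    """Build one structured finding record."""
    return {"detector": detector, "severity": severity, "detail": detail}


def detect_long_sentences(doc, max_chars=1000):
    """Flag sentences longer than max_chars characters."""
    sentences = SENTENCE_SPLIT_RE.split(doc.get("text", ""))
    return [
        finding("long_sentence", "low", f"sentence of {len(s)} characters")
        for s in sentences if len(s) > max_chars
    ]


def detect_citation_mismatch(doc):
    """Flag citations found in the text but absent from the citations array."""
    listed = set(doc.get("citations", []))
    found = set(CITATION_RE.findall(doc.get("text", "")))
    return [
        finding("cross_reference_mismatch", "medium", c)
        for c in sorted(found - listed)
    ]


def detect_date_mismatch(doc):
    """Flag dates in the text that differ from the document date."""
    doc_date = doc.get("date")
    if doc_date is None:
        return []  # nothing to compare against
    found = set(ISO_DATE_RE.findall(doc.get("text", "")))
    return [
        finding("contradictory_date", "medium", d)
        for d in sorted(found) if d != doc_date
    ]


def analyze(doc):
    """Run all detectors and return their findings in a single list."""
    return (detect_long_sentences(doc)
            + detect_citation_mismatch(doc)
            + detect_date_mismatch(doc))
```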
- Purpose: Human-readable audit artifacts
- Formats:
- JSON: Machine-readable with full provenance
- CSV: Spreadsheet-compatible summary
- Features:
- Timestamped reports
- Provenance metadata
- Flattened CSV for analysis
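Report generation along these lines could be sketched as below: a timestamped JSON document with provenance, plus a flattened CSV with one row per finding. The report keys, CSV columns, and function names are assumptions for this example; the finding fields (`detector`, `severity`, `detail`) are likewise hypothetical.

```python
import csv
import json
from datetime import datetime, timezone

CSV_FIELDS = ["generated_at", "source", "detector", "severity", "detail"]


def build_report(findings, source):
    """Wrap findings with a UTC timestamp and provenance metadata."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "provenance": {"source": source, "finding_count": len(findings)},
        "findings": findings,
    }


def write_json_report(report, path):
    """Write the full report, including provenance, as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2, ensure_ascii=False)


def write_csv_report(report, path):
    """Write one flattened row per finding for spreadsheet use."""
    with open(path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=CSV_FIELDS)
        writer.writeheader()
        for item in report["findings"]:
            writer.writerow({
                "generated_at": report["generated_at"],
                "source": report["provenance"]["source"],
                "detector": item.get("detector", ""),
                "severity": item.get("severity", ""),
                "detail": item.get("detail", ""),
            })
```

Repeating the timestamp and source on every CSV row keeps each row self-describing once it is filtered or sorted in a spreadsheet.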
The config.py module defines standard paths:
- REPO_ROOT: Repository root directory
- DATA_DIR: Data storage directory
- CASES_DIR: Normalized case documents
- STATUTES_DIR: Statute documents
- SOURCES_DIR: Raw source files (gitignored)
- VECTORS_DIR: Embedding vectors
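A path module of this kind is typically a handful of `pathlib` constants. In the sketch below, only the constant names and the `data/sources` location (which matches the CLI example) come from this document; the other directory names and the use of the current working directory as the root are assumptions.

```python
from pathlib import Path

# Assumption: a real config.py would more likely derive the root from __file__.
REPO_ROOT = Path.cwd()
DATA_DIR = REPO_ROOT / "data"

# Directory names below are guesses, except "sources" (seen in the CLI example).
CASES_DIR = DATA_DIR / "cases"
STATUTES_DIR = DATA_DIR / "statutes"
SOURCES_DIR = DATA_DIR / "sources"
VECTORS_DIR = DATA_DIR / "vectors"

ALL_DIRS = (DATA_DIR, CASES_DIR, STATUTES_DIR, SOURCES_DIR, VECTORS_DIR)


def ensure_dirs():
    """Create the standard directories if they do not exist yet."""
    for directory in ALL_DIRS:
        directory.mkdir(parents=True, exist_ok=True)
```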
The cli.py module provides command-line access:
python -m oraculus_di_auditor.cli ingest --source data/sources
- Semantic Analysis: Integrate transformer models for deep semantic understanding
- Database Backend: Replace file storage with PostgreSQL for scale
- Visualization: Graph visualization of document relationships
- ML Anomaly Detection: Train custom models on audit data
- Multi-language Support: Extend to non-English documents
- No External API Calls: All processing is local
- Data Privacy: Source files never leave local environment
- Gitignore Protection: Sensitive data directories are excluded
- Hash Verification: SHA-256 hashing for data integrity
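SHA-256 integrity checking can be done with the standard library's `hashlib`. The sketch below hashes a file in fixed-size blocks so large documents are never loaded whole; the function names and the block size are assumptions for this example.

```python
import hashlib


def sha256_file(path, block_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Check a file against a previously recorded digest."""
    return sha256_file(path) == expected_hex.lower()
```

Recording the digest at ingest time and re-checking it before analysis would reveal whether a source file changed in between.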
- Unit Tests: Each module has comprehensive unit tests
- Integration Tests: End-to-end pipeline testing
- Fixtures: Sample data for reproducible testing
- Coverage Target: ≥ 90%
- Batch Processing: Process multiple documents efficiently
- Chunking: Configurable chunk size for memory management
- Vector Storage: NumPy for fast numerical operations
- Lazy Loading: Load data only when needed
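Batch processing and lazy loading combine naturally through generators, as in the sketch below. The function names are hypothetical, and this is only one way to keep memory bounded: file contents are read when a batch is requested, not up front.

```python
from itertools import islice
from pathlib import Path


def iter_texts(paths):
    """Lazily yield (path, text) pairs; each file is read only when reached."""
    for path in paths:
        yield path, Path(path).read_text(encoding="utf-8")


def batched(iterable, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Iterating over `batched(iter_texts(paths), 32)` holds at most 32 documents in memory at a time.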