
Docudactyl Roadmap

Current Status: v0.4.0 (97% complete)

The Chapel HPC engine, Zig FFI layer, Idris2 ABI proofs, and all subsystems are implemented. The only remaining blocker is multi-locale testing on an HPC cluster.

Completed Milestones

v0.1.0 — Foundation

  • ✓ Julia extraction engine (now legacy)

  • ✓ OCaml Scheme transformer

  • ✓ Ada terminal UI

  • ✓ RSR-compliant repository structure

v0.2.0 — Chapel HPC Engine

  • ✓ Chapel distributed processing (Config, ManifestLoader, FaultHandler, ProgressReporter, ShardedOutput, ResultAggregator, Checkpoint)

  • ✓ Zig FFI layer with multi-format parser dispatch (PDF, Image, Audio, Video, EPUB, GeoSpatial)

  • ✓ Idris2 ABI proofs (Types, Layout, Foreign)

  • ✓ Generated C header (51 functions)

  • ✓ Integration tests

v0.3.0 — Performance Subsystems

  • ✓ 20 processing stages with Cap’n Proto binary output

  • ✓ NDJSON enriched manifests (eliminates 170M stat() calls)

  • ✓ Two-level caching: L1 LMDB per-locale + L2 Dragonfly cross-locale

  • ✓ Conduit preprocessing pipeline (magic-byte detection, SHA-256, validation)

  • ✓ I/O prefetcher (io_uring + posix_fadvise fallback)

  • ✓ Hardware crypto acceleration (SHA-NI, AVX2, AVX-512, AES-NI, ARM SHA2)

  • ✓ Checkpoint and resume
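The Conduit preprocessing stage above routes each file to a parser via magic-byte detection before hashing. A minimal sketch of that technique follows; the byte signatures are the standard file magic numbers, but the function and table names are illustrative, not Docudactyl's actual API:

```python
# Minimal magic-byte format detection for parser dispatch.
# Signatures are standard file magic numbers; names are illustrative.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),   # PNG
    (b"\xff\xd8\xff", "image"),        # JPEG
    (b"ID3", "audio"),                 # MP3 with ID3 tag
    (b"OggS", "audio"),                # Ogg container
    (b"PK\x03\x04", "epub"),           # ZIP container (EPUB is zipped)
]

def detect_format(header: bytes) -> str:
    """Return a coarse format label from the first bytes of a file."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

In a pipeline like the one described, the first few KiB of each file can be read once and used for both detection and the SHA-256 digest in the same pass.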

v0.4.0 — ML & GPU Integration (current)

  • ✓ GPU OCR coprocessor (PaddleOCR/Tesseract CUDA/CPU via dlopen)

  • ✓ ML inference engine (ONNX Runtime: NER, Whisper, ImageClassify, Layout, Handwriting)

  • ✓ Handle attachment pattern (ML + GPU OCR wired into parse path)

  • ✓ 40+ Zig integration tests covering all subsystem APIs

  • ✓ Containerfile (Wolfi runtime) and Slurm job script

  • ✓ Full Idris2 ABI coverage (14 types, 5 struct proofs, 51 FFI declarations)

  • ✓ Checkpoint protocol compliance (all 6 SCM files populated)

  • ✓ Author/copyright/license cleanup across all files
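Loading OCR backends via dlopen, as the GPU OCR coprocessor item describes, lets the binary run without hard-linking against PaddleOCR or Tesseract. A sketch of that pattern using Python's ctypes (library names here are illustrative candidates, not Docudactyl's actual dependencies):

```python
import ctypes

def load_ocr_backend(candidates=("libpaddle_ocr.so", "libtesseract.so.5")):
    """Probe optional OCR backends in preference order via dlopen.

    Returns (name, handle) for the first library found, or
    ("cpu-fallback", None) if none are present, so the caller can
    degrade gracefully instead of failing at link time.
    """
    for name in candidates:
        try:
            return name, ctypes.CDLL(name)  # dlopen under the hood
        except OSError:
            continue  # backend not installed; try the next one
    return "cpu-fallback", None
```

The design choice mirrors the roadmap's "CUDA/CPU via dlopen" wording: the GPU path is an opportunistic upgrade discovered at runtime, never a hard dependency.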

Remaining: v0.4.1

  • ❏ Multi-locale testing on HPC cluster (GASNet/IBV, 4+ nodes)

  • ❏ End-to-end benchmark with British Library sample dataset

Future: v1.0.0 — Production Release

  • ❏ Multi-locale validation at scale (64-512 nodes)

  • ❏ British Library pilot (170M items)

  • ❏ Performance tuning (target: 100 docs/s/node)

  • ❏ Production monitoring and alerting

  • ❏ Formal security audit of Zig FFI layer

Scale Estimates (British Library)

  Metric                       Estimate
  Items                        170,000,000
  Locales                      64-512 nodes
  Cold run (256 nodes + GPU)   ~3.7 hours
  Warm run (L1+L2 cache)       ~4.4 minutes
  Incremental (5% new)         ~8 minutes
  Output                       ~1.7 TB total
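As a sanity check on the cold-run figure: it is consistent with an effective throughput of roughly 50 docs/s/node, about half the 100 docs/s/node v1.0.0 target, which is plausible with GPU OCR in the parse path. The 50 docs/s/node figure is an assumption inferred from the table, not stated in the roadmap:

```python
items = 170_000_000
nodes = 256
docs_per_sec_per_node = 50  # assumed effective cold-run throughput, not stated above

hours = items / (nodes * docs_per_sec_per_node) / 3600
print(round(hours, 1))  # ~3.7 hours, matching the table
```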