The largest open corpus of .docx files (~800K documents) for document processing research. Built by SuperDoc — DOCX editing and tooling.
This is a data pipeline monorepo with two runtimes:
- TypeScript (Bun) — infrastructure: scraping, extraction, embedding
- Python — data science: classification, export, publishing
- `apps/cli/` → `corpus <command>` (scrape, extract, embed, status)
- `apps/cdx-filter/` → AWS Lambda for Common Crawl CDX filtering
- `packages/shared/` → DB client (Bun.sql), R2 storage, UI helpers
- `packages/scraper/` → downloads .docx from Common Crawl WARC archives
- `packages/extractor/` → text extraction via Docling
- `packages/embedder/` → embeddings via Google `gemini-embedding-001`
- `scripts/classification/` → ML classification pipeline (Python)
- `db/` → PostgreSQL schema + migrations
Each stage writes to the same PostgreSQL database (`documents` table):

- Scrape (TS) — Common Crawl → .docx files in R2 (`status = 'uploaded'`)
- Extract (TS) — Docling → text in R2 (`extracted_at`, `word_count`, `language`)
- Embed (TS) — Google API → pgvector (`embedding`, `embedded_at`)
- Classify (Python) — ModernBERT → labels (`document_type`, `document_topic`)
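Pipeline progress can be read straight off those columns with one aggregate query. A minimal sketch using Bun's built-in `sql` client (the query itself is illustrative, not part of the CLI):

```ts
import { sql } from "bun";

// One row of per-stage counts; column names are from the pipeline table above.
const [counts] = await sql`
  SELECT
    count(*) FILTER (WHERE status = 'uploaded')       AS uploaded,
    count(*) FILTER (WHERE extracted_at IS NOT NULL)  AS extracted,
    count(*) FILTER (WHERE embedded_at IS NOT NULL)   AS embedded,
    count(*) FILTER (WHERE document_type IS NOT NULL) AS classified
  FROM documents
`;
console.log(counts);
```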
The scraper maintains exact parity between CDX URLs and database records: every URL in a crawl's CDX files has exactly one record in the `documents` table under that `crawl_id`.
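This invariant invites a cheap spot check: the row count under a `crawl_id` should equal the crawl's CDX URL count. A sketch, assuming a `crawl_id` column and that `cdxUrlCount` is computed elsewhere from the CDX files:

```ts
import { sql } from "bun";

// Hypothetical parity check; cdxUrlCount comes from counting the crawl's CDX entries.
async function assertParity(crawlId: string, cdxUrlCount: number): Promise<void> {
  const [{ n }] = await sql`
    SELECT count(*)::int AS n FROM documents WHERE crawl_id = ${crawlId}
  `;
  if (n !== cdxUrlCount) {
    throw new Error(`parity violated for ${crawlId}: ${n} records vs ${cdxUrlCount} CDX URLs`);
  }
}
```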
Each record's `status` is one of:

- `uploaded` — valid .docx saved to R2; ID is `{contentHash}`
- `failed` — WARC download failed or content is invalid docx; ID is `failed-{urlHash}` (download error) or `{contentHash}` (validation error)
- `duplicate` — same content already exists under a different URL; ID is `dup-{urlHash}`
IDs are content-addressed for storage mapping (`documents/{id}.docx`):

| Scenario | ID | Reason |
|---|---|---|
| Uploaded | `{sha256(content)}` | Maps to R2 storage key |
| Download failed | `failed-{sha256(url)}` | No content available, use URL hash |
| Validation failed | `{sha256(content)}` | Content exists but isn't valid docx |
| Content duplicate | `dup-{sha256(url+crawlId)}` | Scoped per crawl so each crawl keeps its own dup record |
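As a sketch of how those four ID shapes derive, using `node:crypto` (which Bun supports); the `documentId` helper is hypothetical, not the scraper's actual code:

```ts
import { createHash } from "node:crypto";

const sha256 = (data: string | Uint8Array) =>
  createHash("sha256").update(data).digest("hex");

type IdScenario =
  | { kind: "uploaded" | "validationFailed"; content: Uint8Array }
  | { kind: "downloadFailed"; url: string }
  | { kind: "duplicate"; url: string; crawlId: string };

// Mirrors the table above: content hash when bytes exist, URL hash otherwise.
function documentId(s: IdScenario): string {
  switch (s.kind) {
    case "uploaded":
    case "validationFailed":
      return sha256(s.content);                  // also the R2 key: documents/{id}.docx
    case "downloadFailed":
      return `failed-${sha256(s.url)}`;          // no bytes to hash, use the URL
    case "duplicate":
      return `dup-${sha256(s.url + s.crawlId)}`; // scoped per crawl
  }
}
```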
The scraper handles three dedup scenarios, in order:

1. URL-dedup (instant, no download) — URL already in the `processedHashes` Set (md5 hashes loaded from all crawls at startup). Includes uploaded, duplicate, AND failed URLs by default. If the URL exists under a different `crawl_id`, creates a cross-crawl `duplicate` record under the current crawl. If already under the current crawl, silently skips.
2. Content-dedup (requires WARC download) — URL is new but the content hash matches an existing document. Creates a `duplicate` record pointing to the original.
3. Same-URL retry (within the same crawl) — the same URL appears multiple times in CDX files (different WARC captures). After a successful WARC download, the URL is added to `processedHashes` so subsequent entries are skipped.
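Put together, the per-URL flow looks roughly like the sketch below; `downloadFromWarc`, `findByContentHash`, and the record writers are hypothetical stand-ins for the scraper's internals:

```ts
import { createHash } from "node:crypto";

const md5 = (s: string) => createHash("md5").update(s).digest("hex");
const sha256 = (b: Uint8Array) => createHash("sha256").update(b).digest("hex");

// md5(url) for every known URL, loaded from all crawls at startup.
const processedHashes = new Set<string>();

// Hypothetical stand-ins for the scraper's internals:
declare function downloadFromWarc(url: string): Promise<Uint8Array>;
declare function findByContentHash(hash: string): Promise<{ id: string } | null>;
declare function recordDuplicate(url: string, crawlId: string, originalId: string): Promise<void>;
declare function uploadAndRecord(hash: string, content: Uint8Array, url: string, crawlId: string): Promise<void>;

async function handleCdxUrl(url: string, crawlId: string): Promise<void> {
  // 1. URL-dedup: known URL, nothing to download. Cross-crawl hits get a
  //    `duplicate` record under the current crawl; same-crawl hits skip silently.
  if (processedHashes.has(md5(url))) return;

  const content = await downloadFromWarc(url);
  const hash = sha256(content);

  // 2. Content-dedup: new URL, known bytes -> `duplicate` pointing at the original.
  const original = await findByContentHash(hash);
  if (original) {
    await recordDuplicate(url, crawlId, original.id);
  } else {
    await uploadAndRecord(hash, content, url, crawlId);
  }

  // 3. Same-URL retry: later CDX captures of this URL now stop at step 1.
  processedHashes.add(md5(url));
}
```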
When a WARC download succeeds, the scraper deletes any previous `failed-{urlHash}` record for that URL. This prevents duplicate records when a URL fails on one attempt but succeeds on a later retry (since the failed and successful records have different IDs).
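A minimal sketch of that cleanup, assuming Bun's `sql` client and the ID shapes above; deleting a missing row is a no-op, so it can run after every successful upload:

```ts
import { sql } from "bun";
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Drop the stale failure record (if any) once the same URL has succeeded.
async function clearFailedRecord(url: string): Promise<void> {
  await sql`DELETE FROM documents WHERE id = ${`failed-${sha256(url)}`}`;
}
```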
Running the scraper on the same crawl again is safe:
- `--force` — re-downloads everything from scratch
- `--retry-failed` — re-downloads only previously failed URLs
- Default — all known URLs (uploaded + duplicate + failed) are skipped instantly
Single `documents` table in PostgreSQL (NeonDB) with pgvector. All pipeline stages write to this table.

- Connection: `DATABASE_URL` env var (Bun.sql for TS, psycopg2 for Python)
- Schema: `db/schema.sql` (canonical), `db/migrations/` (incremental)
- Key columns: `id` (SHA-256 hash), `status`, `extracted_at`, `embedded_at`, `document_type`, `document_topic`
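A minimal lookup sketch from the TypeScript side, assuming only the key columns listed above; Bun.sql picks up the connection string from `DATABASE_URL`:

```ts
import { sql } from "bun";

// Look up one document by its SHA-256 content hash.
async function getDocument(id: string) {
  const [doc] = await sql`
    SELECT id, status, extracted_at, embedded_at, document_type, document_topic
    FROM documents
    WHERE id = ${id}
  `;
  return doc ?? null;
}
```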
Documents and extracted text live in Cloudflare R2:
- `documents/{hash}.docx` — original files
- `extracted/{hash}.txt` — extracted text
Text is also available at https://docxcorp.us/extracted/{id}.txt.
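Given a document ID, fetching its extracted text is a plain HTTP GET against that URL pattern; a small sketch:

```ts
// id is a document's SHA-256 content hash (the `id` column).
async function fetchExtractedText(id: string): Promise<string | null> {
  const res = await fetch(`https://docxcorp.us/extracted/${id}.txt`);
  return res.ok ? res.text() : null;
}
```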
Quick start:

- `bun install` — install TS dependencies
- `bun run corpus scrape --crawl 3` — scrape from Common Crawl
- `bun run corpus extract` — extract text
- `bun run corpus embed` — generate embeddings
- `bun run corpus status` — show pipeline stats

Conventions:

- Use `bun` for all TS tooling (not node/npm/pnpm)
- DB client is in `packages/shared/db.ts` — all pipeline stages use `DbClient`
- Storage abstraction in `packages/shared/storage.ts` — R2 or local
- Environment: `.env` at project root (gitignored), see `.env.example`
- Python scripts manage their own deps via `pyproject.toml`