feat(tpcdi): add TPC-DI benchmark port with six engine implementations #86

Draft: tomz wants to merge 4 commits into microsoft:main from tomz:pr4-tpcdi

tomz commented May 29, 2026

Part 4/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Ports the TPC-DI data-integration benchmark with implementations across the six supported engines.

  • TPC-DI benchmark module + per-engine implementations.
  • Shared _load_resource_with_fallback helper added to benchmarks/base.py.
  • FinWire fixed-width record parsing with unit tests.
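
As a rough illustration of the fixed-width parsing mentioned above, here is a minimal sketch. It is not the code from this PR: `FieldSpec`, `parse_fixed_width`, and `FINWIRE_HEADER` are hypothetical names, and the header layout (a 15-character posting timestamp followed by a 3-character record type) is an assumption based on the TPC-DI specification's description of FinWire files.

```python
from typing import NamedTuple


class FieldSpec(NamedTuple):
    name: str
    width: int


# Assumed header shared by all FinWire records (hypothetical constant):
# PTS is a posting timestamp "YYYYMMDD-HHMMSS", RecType is CMP / SEC / FIN.
FINWIRE_HEADER = [FieldSpec("PTS", 15), FieldSpec("RecType", 3)]


def parse_fixed_width(line: str, fields: list[FieldSpec]) -> dict[str, str]:
    """Slice `line` into consecutive fixed-width fields, trimming padding."""
    record = {}
    offset = 0
    for field in fields:
        record[field.name] = line[offset : offset + field.width].strip()
        offset += field.width
    return record


header = parse_fixed_width("19670401-065923CMPExample Corp", FINWIRE_HEADER)
print(header)  # {'PTS': '19670401-065923', 'RecType': 'CMP'}
```

A real parser would presumably dispatch on `RecType` to a per-record field list; the sketch only shows the slicing step.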

Why

Extends LakeBench beyond TPC-DS/H-style workloads to a data-integration benchmark. Cleanly isolated from the rest of the codebase.

Notes for reviewers

Largest PR in the stack (roughly a whole benchmark). It's self-contained and conceptually independent of the CLI, so it can be reviewed on its own thread, unhurried.

Tests

FinWire parsing unit-tested; suite green (178 passed, 1 skipped).

tomz added 4 commits May 29, 2026 12:32
Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI
lint job (lint + format both enforcing). Reformat the existing tree with
`ruff format` and replace ad-hoc print() diagnostics with module-level
loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff
(import order, bare except -> except Exception, dead assignment).

Drop Python 3.8 support and move pyarrow to base dependencies (the core
results/timing modules import it unconditionally). Gitignore scratch/ for
workspace-specific scratchpads.

W291/W293 stay globally ignored because trailing whitespace inside multi-line
SQL string literals is intentional and not touched by `ruff format`.

Add remote/cloud engines that talk to managed Spark via protocol:

- Livy   — Fabric / Synapse / HDInsight via the Livy REST API, with
           session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect — Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark — workspace-tagged Spark subclasses.

Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults,
  overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that
  correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity
  catalog.schema): builds quoted identifier chains via sqlglot, preserves the
  caller's catalog, and leaves CTE references untouched. Adds 9 multi-part
  tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when
active — silently rewriting columns to match non-spec data hurts benchmark
reproducibility and can mask data-prep bugs.
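
The opt-in behaviour described above could look roughly like the sketch below. Only the `auto_remap_columns` flag name comes from this commit message; `remap_columns`, its parameters, and the warning text are hypothetical and do not reflect the actual engine code.

```python
import warnings


def remap_columns(
    columns: list[str],
    mapping: dict[str, str],
    auto_remap_columns: bool = False,
) -> list[str]:
    """Return columns unchanged unless remapping is explicitly enabled."""
    if not auto_remap_columns:
        return list(columns)
    renamed = {c: mapping[c] for c in columns if c in mapping}
    if renamed:
        # Warn loudly: rewritten columns make runs harder to reproduce.
        warnings.warn(
            f"auto_remap_columns is active; rewriting columns: {renamed}. "
            "Results may not be comparable with spec-conformant runs.",
            UserWarning,
            stacklevel=2,
        )
    return [mapping.get(c, c) for c in columns]
```

With the default `auto_remap_columns=False` the call is a no-op, so a mismatch between the data and the spec surfaces as a query error instead of being papered over.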

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.

…iles)

A purely additive command-line surface over the existing Python API.
Library consumers are unaffected.

- cli/ package: argparse plumbing, override merge (-E/--conf), output
  formatting (human/table/json/csv/yaml), dry-run/print-config, verbosity
  and meaningful exit codes.
- config.py: profile loader (~/.lakebench.json + ./lakebench.json), env-var
  expansion, `extends:` composition, validation, lazy engine/benchmark
  registries.

  resolve_engine() handles *_env credential references by honoring the engine
  signature: engines that accept the env-var NAME (Databricks, Livy) get it
  untouched and resolve the secret themselves; engines that accept the bare
  key (or **kwargs) get the resolved value. This avoids silently dropping the
  credential. Covered by tests/test_config.py.
- results.py / reporting.py / discover.py as before.
- Expose console_script entry point and livy/spark_connect extras + Fabric/
  Synapse/HDInsight aliases.
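
One way to picture the signature-driven handling of `*_env` keys described for `resolve_engine()` is the sketch below. It is an illustration under assumptions, not the code in config.py: `resolve_env_credentials`, `NameEngine`, and `ValueEngine` are hypothetical names.

```python
import inspect
import os


def resolve_env_credentials(engine_cls, config: dict) -> dict:
    """Map `*_env` config keys onto whatever the engine constructor accepts."""
    params = inspect.signature(engine_cls).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    resolved = {}
    for key, value in config.items():
        if not key.endswith("_env"):
            resolved[key] = value
            continue
        bare_key = key[: -len("_env")]
        if key in params:
            # Engine wants the env-var NAME and reads the secret itself.
            resolved[key] = value
        elif bare_key in params or accepts_kwargs:
            # Engine wants the secret value under the bare key.
            resolved[bare_key] = os.environ[value]
        else:
            # Fail loudly instead of silently dropping the credential.
            raise TypeError(
                f"{engine_cls.__name__} accepts neither {key!r} nor {bare_key!r}"
            )
    return resolved


class NameEngine:  # hypothetical engine that takes the env-var name
    def __init__(self, host: str, token_env: str):
        self.host, self.token_env = host, token_env


class ValueEngine:  # hypothetical engine that takes the secret itself
    def __init__(self, host: str, token: str):
        self.host, self.token = host, token


os.environ["DEMO_TOKEN"] = "s3cret"
cfg = {"host": "example.invalid", "token_env": "DEMO_TOKEN"}
print(resolve_env_credentials(NameEngine, cfg))
# {'host': 'example.invalid', 'token_env': 'DEMO_TOKEN'}
print(resolve_env_credentials(ValueEngine, cfg))
# {'host': 'example.invalid', 'token': 's3cret'}
```

Inspecting the constructor signature keeps the config format uniform across engines while leaving each engine free to decide where the secret is actually read.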

Docs: cli-quickstart, cli-reference, architecture, development.

A whole new benchmark following LakeBench's plug-in pattern: per-engine ETL
classes (DuckDB, Spark, Sail, Polars, Daft) plus canonical/duckdb DDL and an
audit-validation query.

- Extract a shared DDL-load fallback helper in benchmarks/base.py (also
  simplifies elt_bench).
- FinWire fixed-width parser + 5 unit tests; un-skip the CLI tpcdi mode test.
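
A DDL-load fallback of the kind mentioned above might be sketched as follows. The real `_load_resource_with_fallback` signature is not shown in this PR description, so `load_resource_with_fallback`, its parameters, and the directory layout are all assumptions made for illustration.

```python
from pathlib import Path


def load_resource_with_fallback(
    resource_dir: Path, variants: list[str], filename: str
) -> str:
    """Return the text of the first `<resource_dir>/<variant>/<filename>` found."""
    tried = []
    for variant in variants:
        candidate = resource_dir / variant / filename
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8")
        tried.append(str(candidate))
    # No variant matched: report every path that was checked.
    raise FileNotFoundError(f"no resource found; tried: {tried}")
```

A caller would pass the engine-specific variant first and the shared one last, for example `["duckdb", "canonical"]`, so engines without their own DDL fall back to the canonical file.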

Full unit suite: 178 passing.

tomz commented May 29, 2026

Part 4/5 of the stack in #82, following #85. Diff note: targets main (fork-only branches), so the diff currently includes #83+#84+#85; it narrows to just the TPC-DI work after those merge and I rebase. Merge order: #83 → #84 → #85 → #86 → #87.
