feat(tpcdi): add TPC-DI benchmark port with six engine implementations by tomz · Pull Request #86 · microsoft/LakeBench

tomz · 2026-05-29T21:13:44Z

Part 4/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Ports the TPC-DI data-integration benchmark with implementations across the six supported engines.

TPC-DI benchmark module + per-engine implementations.
Shared _load_resource_with_fallback helper added to benchmarks/base.py.
FinWire fixed-width record parsing with unit tests.
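
For readers unfamiliar with FinWire, a minimal sketch of fixed-width record slicing follows. It is not the parser from this PR: `slice_fields` and `parse_finwire_line` are hypothetical names, and the field widths (a 15-character posting timestamp, a 3-character record type, then the first few CMP fields) are assumptions based on the TPC-DI FinWire layout that should be checked against the specification.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# (field name, width) pairs. The widths are assumptions taken from the
# TPC-DI FinWire description; verify them against the specification.
HEADER_LAYOUT: List[Tuple[str, int]] = [("pts", 15), ("rec_type", 3)]
CMP_LAYOUT: List[Tuple[str, int]] = HEADER_LAYOUT + [
    ("company_name", 60),
    ("cik", 10),
    ("status", 4),
]


def slice_fields(line: str, layout: List[Tuple[str, int]]) -> Dict[str, str]:
    """Cut a fixed-width line into named, whitespace-trimmed fields."""
    record: Dict[str, str] = {}
    offset = 0
    for name, width in layout:
        record[name] = line[offset:offset + width].strip()
        offset += width
    return record


def parse_finwire_line(line: str) -> dict:
    """Read the shared header first, then apply the record-type layout."""
    header = slice_fields(line, HEADER_LAYOUT)
    # Only the CMP prefix is sketched here; other types keep just the header.
    layout = CMP_LAYOUT if header["rec_type"] == "CMP" else HEADER_LAYOUT
    record: dict = slice_fields(line, layout)
    record["pts"] = datetime.strptime(record["pts"], "%Y%m%d-%H%M%S")
    return record


# Hypothetical sample record, padded to the assumed widths.
sample = (
    "19670401-120000" + "CMP" + "Example Corp".ljust(60) + "0000012345" + "ACTV"
)
parsed = parse_finwire_line(sample)
```

Driving the slicing from a layout table keeps the offsets in one place, which makes each record type easy to unit-test in isolation.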

Why

Extends LakeBench beyond TPC-DS/H-style workloads to a data-integration benchmark. Cleanly isolated from the rest of the codebase.

Notes for reviewers

Largest PR in the stack (~a whole benchmark). It's self-contained and conceptually independent of the CLI — fine to review on its own thread, unhurried.

Tests

FinWire parsing unit-tested; suite green (178 passed, 1 skipped).

Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI lint job (lint + format both enforcing). Reformat the existing tree with `ruff format` and replace ad-hoc print() diagnostics with module-level loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff (import order, bare except -> except Exception, dead assignment). Drop Python 3.8 support and move pyarrow to base dependencies (the core results/timing modules import it unconditionally). Gitignore scratch/ for workspace-specific scratchpads. W291/W293 stay globally ignored because trailing whitespace inside multi-line SQL string literals is intentional and not touched by `ruff format`.
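
A small sketch of the two code-level changes described above (module-level loggers instead of print(), and `except Exception` instead of a bare `except`). `generate_table` is a hypothetical stand-in, not a function from the LakeBench tree.

```python
import logging

# One logger per module, named after the module, instead of print().
logger = logging.getLogger(__name__)


def generate_table(name: str, rows: int) -> int:
    """Hypothetical datagen step used only to illustrate the pattern."""
    logger.info("generating %d rows for table %s", rows, name)
    try:
        if rows < 0:
            raise ValueError("rows must be non-negative")
        return rows
    except Exception:
        # Unlike a bare `except:`, this does not swallow KeyboardInterrupt
        # or SystemExit; the error is logged with its traceback and re-raised.
        logger.exception("datagen failed for table %s", name)
        raise
```

Callers then control verbosity through the standard logging configuration rather than by editing the library.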

Add remote/cloud engines that talk to managed Spark via protocol:
- Livy: Fabric / Synapse / HDInsight via the Livy REST API, with session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect: Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark: workspace-tagged Spark subclasses.

Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults, overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity catalog.schema): builds quoted identifier chains via sqlglot, preserves the caller's catalog, and leaves CTE references untouched. Adds 9 multi-part tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when active: silently rewriting columns to match non-spec data hurts benchmark reproducibility and can mask data-prep bugs.

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.
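
The opt-in auto-remap behaviour can be pictured roughly as follows. Only the `auto_remap_columns=False` default and the loud warning come from the commit message; the function name `resolve_columns` and the case-insensitive matching rule are assumptions made for illustration, and the actual remap logic may differ.

```python
import warnings
from typing import List


def resolve_columns(
    expected: List[str], actual: List[str], auto_remap_columns: bool = False
) -> List[str]:
    """Map spec column names onto the names present in the data.

    Hypothetical rule: a remap is a case-insensitive match. Without the
    opt-in flag, any mismatch is an error rather than a silent rewrite.
    """
    actual_by_lower = {name.lower(): name for name in actual}
    resolved: List[str] = []
    for name in expected:
        if name in actual:
            resolved.append(name)
        elif auto_remap_columns and name.lower() in actual_by_lower:
            remapped = actual_by_lower[name.lower()]
            warnings.warn(
                f"auto-remapping column {name!r} to {remapped!r}; "
                "results may not be reproducible against spec data",
                UserWarning,
                stacklevel=2,
            )
            resolved.append(remapped)
        else:
            raise KeyError(f"column {name!r} not found in data")
    return resolved
```

Failing by default keeps a data-prep mistake visible; the flag exists for cases where the caller knowingly runs against non-spec data.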

…iles)

A purely additive command-line surface over the existing Python API. Library consumers are unaffected.
- cli/ package: argparse plumbing, override merge (-E/--conf), output formatting (human/table/json/csv/yaml), dry-run/print-config, verbosity and meaningful exit codes.
- config.py: profile loader (~/.lakebench.json + ./lakebench.json), env-var expansion, `extends:` composition, validation, lazy engine/benchmark registries. resolve_engine() handles *_env credential references by honoring the engine signature: engines that accept the env-var NAME (Databricks, Livy) get it untouched and resolve the secret themselves; engines that accept the bare key (or **kwargs) get the resolved value. This avoids silently dropping the credential. Covered by tests/test_config.py.
- results.py / reporting.py / discover.py as before.
- Expose console_script entry point and livy/spark_connect extras + Fabric/Synapse/HDInsight aliases.

Docs: cli-quickstart, cli-reference, architecture, development.
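
The signature-driven handling of `*_env` keys described for resolve_engine() could look something like the sketch below. It is not the code from config.py: `resolve_credentials`, `NameEngine` and `ValueEngine` are hypothetical, and the real function may validate or merge options differently.

```python
import inspect
import os
from typing import Any, Dict, Mapping, Optional


def resolve_credentials(
    engine_cls: type,
    options: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Decide per `*_env` key whether to pass the env-var name or its value."""
    environ = os.environ if environ is None else environ
    params = inspect.signature(engine_cls).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    resolved: Dict[str, Any] = {}
    for key, value in options.items():
        if not key.endswith("_env"):
            resolved[key] = value
            continue
        bare_key = key[: -len("_env")]
        if key in params:
            # The engine takes the env-var NAME and reads the secret itself.
            resolved[key] = value
        elif bare_key in params or accepts_kwargs:
            # The engine takes the secret value, so resolve it here.
            resolved[bare_key] = environ[value]
        else:
            # Refuse to drop the credential silently.
            raise TypeError(f"{engine_cls.__name__} accepts neither {key} nor {bare_key}")
    return resolved


class NameEngine:  # hypothetical engine that resolves the secret itself
    def __init__(self, host: str, token_env: str) -> None:
        self.host, self.token_env = host, token_env


class ValueEngine:  # hypothetical engine that expects the secret value
    def __init__(self, host: str, token: str) -> None:
        self.host, self.token = host, token


fake_env = {"MY_TOKEN": "s3cret"}
opts = {"host": "example", "token_env": "MY_TOKEN"}
by_name = resolve_credentials(NameEngine, opts, fake_env)
by_value = resolve_credentials(ValueEngine, opts, fake_env)
```

Inspecting the constructor signature means a profile can use the same `token_env` key for every engine without the loader needing a per-engine table.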

A whole new benchmark following LakeBench's plug-in pattern: per-engine ETL classes (DuckDB, Spark, Sail, Polars, Daft) plus canonical/duckdb DDL and an audit-validation query.
- Extract a shared DDL-load fallback helper in benchmarks/base.py (also simplifies elt_bench).
- FinWire fixed-width parser + 5 unit tests; un-skip the CLI tpcdi mode test.

Full unit suite: 178 passing.
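
One way to picture a DDL-load fallback is a lookup that tries an engine-specific resource first and then a canonical one. The sketch below is an assumption about the general idea only: `load_resource_with_fallback`, its parameters, and the directory layout are hypothetical and do not reflect the actual signature of the helper in benchmarks/base.py.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def load_resource_with_fallback(
    root: Path, filename: str, candidates: Sequence[str]
) -> str:
    """Return the first `root/<candidate>/<filename>` that exists."""
    for subdir in candidates:
        path = root / subdir / filename
        if path.is_file():
            return path.read_text(encoding="utf-8")
    raise FileNotFoundError(f"{filename} not found under any of {list(candidates)}")


# Demo with a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for subdir, ddl in [
        ("canonical", "CREATE TABLE t (id INT);"),
        ("duckdb", "CREATE TABLE t (id INTEGER);"),
    ]:
        (root / subdir).mkdir()
        (root / subdir / "ddl.sql").write_text(ddl, encoding="utf-8")

    # DuckDB has its own DDL; the other engine falls back to canonical.
    duckdb_ddl = load_resource_with_fallback(root, "ddl.sql", ["duckdb", "canonical"])
    polars_ddl = load_resource_with_fallback(root, "ddl.sql", ["polars", "canonical"])
```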

tomz · 2026-05-29T21:14:10Z

Part 4/5 of the stack in #82, following #85. Diff note: targets main (fork-only branches), so the diff currently includes #83+#84+#85; it narrows to just the TPC-DI work after those merge and I rebase. Merge order: #83 → #84 → #85 → #86 → #87.

tomz added 4 commits May 29, 2026 12:32

This was referenced May 29, 2026

feat(engines): add cloud Spark engines + multi-part catalog name support #84

Draft

feat(cli): add lakebench CLI (run/results/report/discover/doctor/profiles) #85

Draft

tomz mentioned this pull request May 29, 2026

feat(engines): add Databricks Connect engine #87

Draft

tomz wants to merge 4 commits into microsoft:main from tomz:pr4-tpcdi
