feat(engines): add cloud Spark engines + multi-part catalog name support by tomz · Pull Request #84 · microsoft/LakeBench

tomz · 2026-05-29T21:13:37Z

Part 2/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Adds cloud Spark execution engines (Livy, Spark Connect) and rewrites table-name qualification to handle multi-part (catalog.schema.table) names correctly.

- New cloud Spark engines (Livy submit, Spark Connect).
- Rewrote transpile_and_qualify_query on the sqlglot AST instead of string munging. It now correctly handles 3- and 4-part names and the catalog + dotted-schema case, is CTE-safe, and applies Spark-family quoting consistently.
- Restored get_table_name_from_ddl (was inadvertently removed).
- Column remapping is now opt-in. The previous silent fuzzy column auto-remap (Levenshtein similarity > 0.85) violated benchmark reproducibility: it could quietly rewrite a query against a mismatched schema. It's now gated behind auto_remap_columns: bool = False and warns loudly when it fires.

Why

The original multi-part qualification path was untested and buggy (it silently dropped the catalog in the catalog + dotted-schema case and quoted inconsistently). The AST rewrite fixes that class of bug. Making column remap opt-in keeps benchmark runs faithful to the query as written.
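
The opt-in gate can be illustrated with a small sketch. This is an assumption-laden stand-in, not the PR's implementation: `resolve_columns` is a hypothetical name, and `difflib.SequenceMatcher` from the standard library is swapped in for the Levenshtein similarity the PR mentions. Only the 0.85 threshold and the `auto_remap_columns` default come from the description above.

```python
import difflib
import warnings


def resolve_columns(query_cols, table_cols, auto_remap_columns=False, threshold=0.85):
    """Hypothetical helper: map column names used in a query onto a table's columns.

    With auto_remap_columns=False (the default) every name is returned unchanged,
    so the query runs exactly as written.
    """
    mapping = {}
    for col in query_cols:
        mapping[col] = col
        if col in table_cols or not auto_remap_columns:
            continue
        # Stand-in similarity measure; the PR describes a Levenshtein-based score.
        scored = [(difflib.SequenceMatcher(None, col, c).ratio(), c) for c in table_cols]
        if not scored:
            continue
        score, best = max(scored)
        if score > threshold:
            warnings.warn(
                f"auto_remap_columns: rewriting column {col!r} to {best!r} "
                f"(similarity {score:.2f}); results may not match the benchmark spec"
            )
            mapping[col] = best
    return mapping


# Default: no rewriting, even though a near match exists.
print(resolve_columns(["l_orderkey"], ["l_order_key"]))
# Opt-in: the near match is used and a warning is emitted.
print(resolve_columns(["l_orderkey"], ["l_order_key"], auto_remap_columns=True))
```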

Tests

Expanded tests/test_query_utils.py to 16 cases covering the multi-part / quoting / CTE scenarios. Suite green (32 passed).

Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI lint job (lint + format both enforcing). Reformat the existing tree with `ruff format` and replace ad-hoc print() diagnostics with module-level loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff (import order, bare except -> except Exception, dead assignment).

Drop Python 3.8 support and move pyarrow to base dependencies (the core results/timing modules import it unconditionally). Gitignore scratch/ for workspace-specific scratchpads.

W291/W293 stay globally ignored because trailing whitespace inside multi-line SQL string literals is intentional and not touched by `ruff format`.
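
As an illustration of the print()-to-logger and bare-except changes described above, a timing helper might end up looking roughly like this. `time_step` is a hypothetical name and not taken from the LakeBench tree.

```python
import logging
import time

# Module-level logger: callers control verbosity through logging configuration
# instead of having diagnostics forced onto stdout by print().
logger = logging.getLogger(__name__)


def time_step(name, fn):
    """Hypothetical timing helper: run fn(), log its duration, return its result."""
    start = time.perf_counter()
    try:
        result = fn()
    except Exception:
        # `except Exception` (not a bare `except:`) lets KeyboardInterrupt and
        # SystemExit propagate without being logged as step failures.
        logger.exception("step %s failed", name)
        raise
    logger.info("step %s finished in %.3fs", name, time.perf_counter() - start)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    time_step("example", lambda: sum(range(1000)))
```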

Add remote/cloud engines that talk to managed Spark via protocol:
- Livy: Fabric / Synapse / HDInsight via the Livy REST API, with session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect: Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark: workspace-tagged Spark subclasses.

Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults, overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity catalog.schema): builds quoted identifier chains via sqlglot, preserves the caller's catalog, and leaves CTE references untouched. Adds 9 multi-part tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when active; silently rewriting columns to match non-spec data hurts benchmark reproducibility and can mask data-prep bugs.

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.
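
The per-query timeout for the Livy engine could be implemented as a polling loop along these lines. This is a sketch under assumptions, not the engine's actual code: `wait_for_statement`, `StatementTimeout`, and the injected `get_statement` callable are hypothetical. The statement states (`waiting`, `running`, `available`, `error`, `cancelled`) are those reported by the Livy REST API's statement resource.

```python
import time


class StatementTimeout(Exception):
    """Raised when a statement does not finish within the allowed time."""


def wait_for_statement(get_statement, timeout_seconds, poll_interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a Livy statement until it completes or the timeout elapses.

    get_statement is a hypothetical zero-argument callable returning the parsed
    JSON of GET /sessions/{sessionId}/statements/{statementId}.
    """
    deadline = clock() + timeout_seconds
    while True:
        stmt = get_statement()
        state = stmt.get("state")
        if state == "available":
            return stmt.get("output")
        if state in ("error", "cancelled"):
            raise RuntimeError(f"statement ended in state {state!r}")
        if clock() >= deadline:
            raise StatementTimeout(f"no result after {timeout_seconds}s")
        sleep(poll_interval)


# Usage with a fake transport, so the sketch runs without a Livy endpoint.
responses = iter([
    {"state": "waiting"},
    {"state": "running"},
    {"state": "available", "output": {"status": "ok"}},
])
print(wait_for_statement(lambda: next(responses), timeout_seconds=5, sleep=lambda s: None))
```

Injecting `clock` and `sleep` keeps the loop testable without real delays. A real engine would presumably also cancel the statement on timeout and recreate a dead session, neither of which is shown here.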

tomz · 2026-05-29T21:14:06Z

Part 2/5 of the stack in #82, following #83. Diff note: because the stack branches live on my fork, GitHub can only target main, so this PR's diff currently also includes #83's changes. Once #83 merges I'll rebase this onto main and the diff will narrow to just the cloud-engine work. Merge order: #83 → #84 → #85 → #86 → #87.

tomz added 2 commits May 29, 2026 12:32

This was referenced May 29, 2026

feat(cli): add lakebench CLI (run/results/report/discover/doctor/profiles) #85

Draft

feat(tpcdi): add TPC-DI benchmark port with six engine implementations #86

Draft

feat(engines): add Databricks Connect engine #87

Draft

tomz wants to merge 2 commits into microsoft:main from tomz:pr2-cloud-engines
