feat(engines): add cloud Spark engines + multi-part catalog name support #84

Draft
tomz wants to merge 2 commits into microsoft:main from tomz:pr2-cloud-engines

tomz commented May 29, 2026

Part 2/5 of a stack: #1 (lint) → #2 (cloud engines) → #3 (cli) → #4 (tpcdi) → #5 (databricks). Best reviewed in order; each builds on the previous.

What

Adds cloud Spark execution engines (Livy, Spark Connect) and rewrites table-name qualification to handle multi-part (catalog.schema.table) names correctly.

  • New cloud Spark engines (Livy submit, Spark Connect).
  • Rewrote transpile_and_qualify_query on the sqlglot AST instead of string munging. It now correctly handles 3- and 4-part names and the catalog + dotted-schema case, leaves CTE references untouched, and applies Spark-family quoting consistently.
  • Restored get_table_name_from_ddl (was inadvertently removed).
  • Column remapping is now opt-in. The previous fuzzy column auto-remap (Levenshtein similarity > 0.85) ran silently, which violated benchmark reproducibility: it could quietly rewrite a query against a mismatched schema. It's now gated behind auto_remap_columns: bool = False and warns loudly when it fires.
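
For reviewers, here is a minimal plain-Python sketch of the identifier chain the qualifier is expected to produce in the catalog + dotted-schema case. It does not use sqlglot and is not the code in this PR. `quote_ident` and `qualify_table` are hypothetical names, and backtick quoting with doubled embedded backticks is assumed for the Spark family.

```python
from typing import Iterable, Optional


def quote_ident(part: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + part.replace("`", "``") + "`"


def qualify_table(
    table: str,
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
    cte_names: Iterable[str] = (),
) -> str:
    """Build a quoted multi-part name; hypothetical helper for illustration."""
    # A reference to a CTE is not a physical table, so leave it as written.
    if table.lower() in {name.lower() for name in cte_names}:
        return table

    parts = []
    if catalog:
        parts.append(catalog)
    if schema:
        # A dotted schema such as "lakehouse.dbo" contributes two parts.
        parts.extend(schema.split("."))
    parts.append(table)
    return ".".join(quote_ident(p) for p in parts)


# Catalog + dotted schema: the catalog must survive, giving a 4-part name.
print(qualify_table("orders", catalog="ws", schema="lakehouse.dbo"))
# CTE reference: returned unchanged.
print(qualify_table("recent", catalog="ws", schema="dbo", cte_names=["recent"]))
```

The first call is the case the old string-based path got wrong by dropping the catalog.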

Why

The original multi-part qualification path was untested and buggy (it silently dropped the catalog in the catalog + dotted-schema case and quoted inconsistently). The AST rewrite fixes that class of bug. Making column remap opt-in keeps benchmark runs faithful to the query as written.

Tests

Expanded tests/test_query_utils.py to 16 cases covering the multi-part / quoting / CTE scenarios. Suite green (32 passed).

tomz added 2 commits May 29, 2026 12:32
Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI
lint job (lint + format both enforcing). Reformat the existing tree with
`ruff format` and replace ad-hoc print() diagnostics with module-level
loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff
(import order, bare except -> except Exception, dead assignment).

Drop Python 3.8 support and move pyarrow to base dependencies (the core
results/timing modules import it unconditionally). Gitignore scratch/ for
workspace-specific scratchpads.

W291/W293 stay globally ignored because trailing whitespace inside multi-line
SQL string literals is intentional and not touched by `ruff format`.

Add remote/cloud engines that talk to managed Spark via protocol:

- Livy   — Fabric / Synapse / HDInsight via the Livy REST API, with
           session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect — Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark — workspace-tagged Spark subclasses.
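
The per-query timeout could look roughly like the polling loop below. This is a sketch, not the engine code in this commit. `wait_for_statement`, `QueryTimeout`, and the injected `get_state` callable are hypothetical; `get_state` stands in for a GET on the Livy statement resource. The state names are the statement states documented for the Livy REST API.

```python
import time
from typing import Callable


class QueryTimeout(Exception):
    """Raised when a statement does not finish within the allowed time."""


def wait_for_statement(
    get_state: Callable[[], str],
    timeout_seconds: float,
    poll_interval: float = 0.5,
) -> str:
    """Poll a statement's state until it finishes, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state == "available":
            return state  # result is ready to fetch
        if state in ("error", "cancelling", "cancelled"):
            raise RuntimeError(f"statement ended in state {state!r}")
        # Still "waiting" or "running": give up once the deadline has passed.
        if time.monotonic() >= deadline:
            raise QueryTimeout(f"no result after {timeout_seconds}s")
        time.sleep(poll_interval)


# Usage with a fake state source in place of real HTTP calls.
fake_states = iter(["waiting", "running", "available"])
print(wait_for_statement(lambda: next(fake_states), timeout_seconds=5, poll_interval=0))
```

Injecting `get_state` keeps the loop testable without a live session; a real engine would presumably also cancel the statement when the timeout fires.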

Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults,
  overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that
  correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity
  catalog.schema): builds quoted identifier chains via sqlglot, preserves the
  caller's catalog, and leaves CTE references untouched. Adds 9 multi-part
  tests (previously untested).

Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when
active — silently rewriting columns to match non-spec data hurts benchmark
reproducibility and can mask data-prep bugs.
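
A sketch of the opt-in gate, under stated assumptions: `remap_columns` is a hypothetical helper, and `difflib.SequenceMatcher` from the standard library stands in for the Levenshtein-based score, so exact scores will differ from the real implementation. Only the `auto_remap_columns` flag and the 0.85 threshold come from this change.

```python
import difflib
import logging

logger = logging.getLogger(__name__)


def remap_columns(query_columns, table_columns, auto_remap_columns=False, threshold=0.85):
    """Return {query_column: table_column} for fuzzy matches; empty unless opted in."""
    if not auto_remap_columns:
        return {}  # default: run the query exactly as written

    remapped = {}
    for col in query_columns:
        if col in table_columns:
            continue  # exact match, nothing to remap
        # Score every candidate; SequenceMatcher is a stand-in similarity measure.
        scored = [
            (difflib.SequenceMatcher(None, col.lower(), cand.lower()).ratio(), cand)
            for cand in table_columns
        ]
        if not scored:
            continue
        score, best = max(scored)
        if score > threshold:
            # Warn loudly so a remap never goes unnoticed in a benchmark run.
            logger.warning("auto-remapping column %r -> %r (score %.2f)", col, best, score)
            remapped[col] = best
    return remapped


table = ["l_orderkey", "l_ship_date"]
print(remap_columns(["l_orderkey", "l_shipdate"], table))
print(remap_columns(["l_orderkey", "l_shipdate"], table, auto_remap_columns=True))
```

With the flag off the helper returns an empty mapping, so a schema mismatch surfaces as a query error instead of being papered over.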

Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.

tomz commented May 29, 2026

Part 2/5 of the stack in #82, following #83. Diff note: because the stack branches live on my fork, GitHub can only target main, so this PR's diff currently also includes #83's changes. Once #83 merges I'll rebase this onto main and the diff will narrow to just the cloud-engine work. Merge order: #83 → #84 → #85 → #86 → #87.
