feat(engines): add cloud Spark engines + multi-part catalog name support #84
Draft
tomz wants to merge 2 commits into
Add a conservative ruff config (E/F/I/W), a pre-commit hook set, and a CI lint job (lint + format both enforcing). Reformat the existing tree with `ruff format` and replace ad-hoc print() diagnostics with module-level loggers across datagen and timing helpers. Fix obvious nits surfaced by ruff (import order, bare except -> except Exception, dead assignment). Drop Python 3.8 support and move pyarrow to base dependencies (the core results/timing modules import it unconditionally). Gitignore scratch/ for workspace-specific scratchpads. W291/W293 stay globally ignored because trailing whitespace inside multi-line SQL string literals is intentional and not touched by `ruff format`.
Add remote/cloud engines that talk to managed Spark via protocol:
- Livy — Fabric / Synapse / HDInsight via the Livy REST API, with
session auto-recovery, per-query timeout, multi-part SHOW TABLES.
- SparkConnect — Spark Connect gRPC client.
- FabricSpark / SynapseSpark / HDISpark — workspace-tagged Spark subclasses.
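A minimal sketch of how a per-query timeout could sit on top of Livy's statement polling. It is not the code in this PR. `wait_for_statement`, `QueryTimeout`, and the injected `poll_state` callable are hypothetical names; `poll_state` stands in for a `GET /sessions/{id}/statements/{statement_id}` call that returns the statement's `state` field.

```python
import time
from typing import Callable

# Statement states after which Livy will not make further progress.
TERMINAL_STATES = {"available", "error", "cancelled"}


class QueryTimeout(Exception):
    """Raised when a statement does not finish within the allowed time."""


def wait_for_statement(
    poll_state: Callable[[], str],
    timeout_seconds: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the statement reaches a terminal state or the deadline passes.

    poll_state is a hypothetical callable wrapping the Livy statement GET;
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = poll_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise QueryTimeout(
                f"statement still {state!r} after {timeout_seconds}s"
            )
        sleep(poll_interval)
```

In a real engine the caller would probably also cancel the statement on timeout. That step is left out here because the PR text does not say how it is handled.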
Catalog plumbing shared by all engines:
- BaseEngine.list_databases() / list_tables() / get_table_columns() defaults,
overridden for the Spark family, Livy and DuckDB.
- query_timeout_seconds attribute.
- transpile_and_qualify_query() rewritten with AST-based qualification that
correctly handles 3-/4-part names (workspace.lakehouse.schema, Unity
catalog.schema): builds quoted identifier chains via sqlglot, preserves the
caller's catalog, and leaves CTE references untouched. Adds 9 multi-part
tests (previously untested).
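The catalog defaults and the `query_timeout_seconds` attribute listed above might look roughly like the sketch below. Only the method and attribute names come from this PR. The signatures, the empty-list defaults, the meaning of `None` for the timeout, and the `DictCatalogEngine` subclass are all assumptions made for illustration.

```python
from typing import Optional


class BaseEngine:
    """Hypothetical base class: names follow the PR text, bodies are assumed."""

    # Assumption: None means "no per-query timeout".
    query_timeout_seconds: Optional[float] = None

    def list_databases(self) -> list[str]:
        # Assumed default: an engine without catalog support reports nothing.
        return []

    def list_tables(self, database: str) -> list[str]:
        return []

    def get_table_columns(self, database: str, table: str) -> list[str]:
        return []


class DictCatalogEngine(BaseEngine):
    """Illustrative override backed by an in-memory dict, not a real engine."""

    def __init__(
        self,
        catalog: dict[str, dict[str, list[str]]],
        query_timeout_seconds: Optional[float] = None,
    ) -> None:
        self._catalog = catalog
        self.query_timeout_seconds = query_timeout_seconds

    def list_databases(self) -> list[str]:
        return sorted(self._catalog)

    def list_tables(self, database: str) -> list[str]:
        return sorted(self._catalog.get(database, {}))

    def get_table_columns(self, database: str, table: str) -> list[str]:
        # Copy so callers cannot mutate the stored column list.
        return list(self._catalog.get(database, {}).get(table, []))
```

Keeping the defaults harmless means an engine that cannot introspect its catalog still satisfies the interface, and only the Spark family, Livy, and DuckDB need real implementations.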
Column auto-remap is opt-in (auto_remap_columns=False) and warns loudly when
active — silently rewriting columns to match non-spec data hurts benchmark
reproducibility and can mask data-prep bugs.
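A minimal sketch of the opt-in behaviour, assuming the remap is a simple name-to-name mapping. `apply_column_remap` and the logger name are hypothetical; only the `auto_remap_columns=False` default and the loud warning come from the description above.

```python
import logging

logger = logging.getLogger("column_remap_sketch")


def apply_column_remap(columns, mapping, auto_remap_columns=False):
    """Return the column names, remapped only when explicitly enabled."""
    if not auto_remap_columns:
        # Default: leave the query's columns exactly as written.
        return list(columns)

    remapped = [mapping.get(name, name) for name in columns]
    changed = {old: new for old, new in zip(columns, remapped) if old != new}
    if changed:
        # Warn loudly so a non-spec dataset cannot go unnoticed.
        logger.warning(
            "auto_remap_columns is active; rewrote columns %s. "
            "Results may not be comparable with spec-conformant runs.",
            changed,
        )
    return remapped
```

With the default, `apply_column_remap(["l_orderkey"], {"l_orderkey": "orderkey"})` returns the input unchanged; passing `auto_remap_columns=True` applies the mapping and logs a warning.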
Register Livy as a generic engine for TPC-H / TPC-DS / ClickBench.
Part 2/5 of the stack in #82, following #83. Diff note: because the stack branches live on my fork, GitHub can only target |
This was referenced May 29, 2026
What
Adds cloud Spark execution engines (Livy, Spark Connect) and rewrites table-name qualification to handle multi-part (catalog.schema.table) names correctly.
- Rewrites `transpile_and_qualify_query` on the sqlglot AST instead of string munging. It correctly handles 3- and 4-part names and the catalog + dotted-schema case, is CTE-safe, and applies Spark-family quoting consistently.
- Restores `get_table_name_from_ddl` (was inadvertently removed).
- Makes column remap opt-in via `auto_remap_columns: bool = False`; it warns loudly when it fires.
Why
The original multi-part qualification path was untested and buggy (it silently dropped the catalog in the catalog + dotted-schema case and quoted inconsistently). The AST rewrite fixes that class of bug. Making column remap opt-in keeps benchmark runs faithful to the query as written.
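The catalog + dotted-schema case can be illustrated with a small stdlib-only sketch. It is not the sqlglot AST rewrite from this PR and does not cover CTE handling. `quote_ident` and `qualify_table` are hypothetical helpers, and splitting the schema on `.` assumes no schema part itself contains a dot.

```python
from typing import Optional


def quote_ident(part: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + part.replace("`", "``") + "`"


def qualify_table(table: str, schema: str, catalog: Optional[str] = None) -> str:
    """Build a fully quoted name from catalog, a possibly dotted schema, and table."""
    parts = []
    if catalog:
        # Keep the caller's catalog instead of silently dropping it.
        parts.append(catalog)
    # A dotted schema such as "lakehouse.dbo" contributes one part per segment.
    parts.extend(schema.split("."))
    parts.append(table)
    # Quote every part the same way so the output is consistent.
    return ".".join(quote_ident(p) for p in parts)
```

For example, `qualify_table("lineitem", "lakehouse.dbo", catalog="workspace")` yields a 4-part name with every part quoted, while `qualify_table("lineitem", "tpch")` yields a plain 2-part name.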
Tests
Expanded `tests/test_query_utils.py` to 16 cases covering the multi-part / quoting / CTE scenarios. Suite green (32 passed).