feat(gfql): native vectorized Polars engine for hop/chain (PR1 of 3)#1648
Open
lmeyerov wants to merge 3 commits into
Add Engine.POLARS as an opt-in (engine='polars') native execution lane for the core GFQL traversals hop() and chain(), dispatched at the engine boundary so the production pandas/cuDF internals stay untouched. engine='auto' with polars input still coerces to pandas (no behavior change for existing users).

Implementation (graphistry/compute/gfql/engine_polars/):
- hop.py: vectorized BFS via semi/anti joins; forward/reverse/undirected, hops/to_fixed_point, edge_match/source/destination_node_match, target_wave_front, return_as_wave_front seed semantics, endpoint materialization.
- chain.py: forward/backward/combine orchestration + node/edge alias names, reusing the polars hop; single-hop edges (directed multi-hop chains supported).
- predicates.py: filter_by_dict lowered to polars expressions (operator-identity dispatch), single-column pandas fallback for exotic predicates.

Deferred (explicit NotImplementedError): variable-length/multi-hop edges, undirected edges in multi-edge chains, hop labels, node query=.

Validated by differential parity vs the pandas engine (hop + chain suites and a randomized fuzzer) and benchmarked (benchmarks/gfql/pandas_vs_polars.py): polars wins at scale (up to ~2.5x on multi-edge chains at millions of edges; crossover ~50-100k rows). No pandas/cuDF regression (additive dispatch only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
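The "vectorized BFS via semi/anti joins" idea in hop.py can be illustrated with a minimal pure-Python analogue. This is a sketch under stated assumptions, not the PR's code: `semi_join`, `anti_join`, and `hop_forward` are hypothetical names, sets stand in for Polars frames, and the engine's seed semantics (return_as_wave_front, target_wave_front) are not modeled.

```python
from typing import Iterable, List, Optional, Set, Tuple

Edge = Tuple[int, int]  # (source, destination)


def semi_join(edges: Iterable[Edge], keys: Set[int]) -> List[Edge]:
    """Semi-join analogue: keep edges whose source matches a key, adding no columns."""
    return [edge for edge in edges if edge[0] in keys]


def anti_join(nodes: Iterable[int], seen: Set[int]) -> Set[int]:
    """Anti-join analogue: keep only the nodes that do NOT match `seen`."""
    return {node for node in nodes if node not in seen}


def hop_forward(edges: List[Edge], seeds: Set[int], hops: Optional[int] = None) -> Set[int]:
    """Forward traversal from `seeds`; hops=None runs to a fixed point.

    Returns the seeds plus every node reached within `hops` steps.
    """
    visited = set(seeds)
    frontier = set(seeds)
    step = 0
    while frontier and (hops is None or step < hops):
        # Semi join: edges leaving the current wave front.
        out_edges = semi_join(edges, frontier)
        candidates = {dst for _, dst in out_edges}
        # Anti join: drop nodes already visited, so cycles terminate.
        frontier = anti_join(candidates, visited)
        visited |= frontier
        step += 1
    return visited


if __name__ == "__main__":
    demo_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]
    print(sorted(hop_forward(demo_edges, {1}, hops=1)))  # [1, 2]
    print(sorted(hop_forward(demo_edges, {1})))          # [1, 2, 3, 4]
```

Each loop iteration handles a whole wave front at once rather than one node at a time, which is the property that would let a dataframe engine express the step as joins.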
Correctness:
- Native node materialization (ensure_nodes_polars) instead of the pandas-idiom materialize_nodes path (drop_duplicates/reset_index); fixes crash on edges-only graphs (hop + chain).
- Coerce pandas start_nodes to polars in chain_polars (was AttributeError).
- Align join-key dtypes (cast endpoints/ids to the node-id dtype) so int/float node-id vs edge-endpoint graphs match pandas instead of raising SchemaError.
- _apply_node_names now uses the backward-PRUNED steps for alias participation (was the forward, un-pruned frames); fixes silently-wrong alias columns on multi-step / reverse / mid-filtered chains.
- Guard target_wave_front-without-nodes (mirror pandas ValueError).

Perf: reuse an existing edge-id binding in hop_polars (e.g. chain's __gfql_edge_index__) instead of synthesizing a second row index; defer visited_edges concat to a single post-loop unique.

mypy: narrow Optional[str] node/source/destination bindings.

Tests: parametrize all deferred-param NotImplementedError guards; add a committed randomized fuzzer; cover Between/IsIn/ge/le/eq/ne/contains/startswith/endswith predicates and the exotic-predicate pandas fallback; add empty-graph, duplicate-edge multiplicity, edges-only, dtype-mismatch, and pandas-start_nodes cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
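The randomized differential fuzzer mentioned under Tests can be sketched roughly as follows. This is an illustration under assumptions, not the committed fuzzer: `reference_reach`, `candidate_reach`, and `fuzz_parity` are hypothetical names, and two pure-Python reachability routines stand in for the pandas and polars engines.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # (source, destination)


def reference_reach(edges: List[Edge], seeds: Set[int], hops: int) -> Set[int]:
    """Trusted but slow: rescan every edge on each hop."""
    reached = set(seeds)
    for _ in range(hops):
        reached |= {dst for src, dst in edges if src in reached}
    return reached


def candidate_reach(edges: List[Edge], seeds: Set[int], hops: int) -> Set[int]:
    """Implementation under test: adjacency index plus wave-front expansion."""
    adjacency: Dict[int, Set[int]] = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        next_nodes: Set[int] = set()
        for node in frontier:
            next_nodes |= adjacency.get(node, set())
        frontier = next_nodes - visited
        if not frontier:
            break
        visited |= frontier
    return visited


def fuzz_parity(trials: int = 200, seed: int = 0) -> int:
    """Compare both implementations on small random graphs; return trials run."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for trial in range(trials):
        n_nodes = rng.randint(1, 8)
        # Small id ranges make self-loops, cycles, and duplicate edges likely.
        edges = [
            (rng.randrange(n_nodes), rng.randrange(n_nodes))
            for _ in range(rng.randint(0, 16))
        ]
        seeds = {rng.randrange(n_nodes) for _ in range(rng.randint(0, 3))}
        hops = rng.randint(0, 4)
        expected = reference_reach(edges, seeds, hops)
        actual = candidate_reach(edges, seeds, hops)
        assert actual == expected, (
            f"trial {trial}: edges={edges} seeds={seeds} hops={hops} "
            f"expected={expected} actual={actual}"
        )
    return trials


if __name__ == "__main__":
    print(fuzz_parity())
```

Keeping the graphs tiny makes a failing case easy to read straight from the assertion message, and the fixed seed means the same case can be replayed.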
…es-only test

- test-polars CI job now emits a coverage artifact on py3.12 and the changed-line-coverage gate combines it, so the native polars engine lines (only exercised under engine='polars') are covered by the gate.
- test_polars_chain_edges_only_runs: drop the pandas comparison; pandas itself raises in its concat internals on this degenerate edges-only/no-binding input on newer pandas (passed only on 3.10's older pandas). Assert the polars engine runs and returns the sensible materialized result instead.
- Remove unused lazy_polars_import (dead code; engine imports polars directly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
First PR in a 3-PR stack adding a native Polars execution engine to GFQL. This PR covers the core traversals: hop() and chain().

- Engine.POLARS added as opt-in (engine='polars'). engine='auto' with polars input still coerces to pandas, so existing users are unaffected.
- graphistry/compute/gfql/engine_polars/ is dispatched at the hop()/chain() boundary; the production pandas/cuDF internals are untouched (additive dispatch only).

Covered
forward/reverse/undirected single-hop traversal · directed multi-hop chains · node/edge filter dicts + predicates · edge_match/source_node_match/destination_node_match · target_wave_front · alias names.

Deferred (explicit NotImplementedError → use engine='pandas')
variable-length/multi-hop edges · undirected edges in multi-edge chains · hop labels · node query=.

Correctness
Differential parity vs the pandas engine is the gate:
- test_engine_polars_hop.py (133 cases) + test_engine_polars_chain.py (18 cases): green on dgx-spark (CPU).
- test-polars CI job via bin/test-polars.sh.

Performance (benchmarks/gfql/pandas_vs_polars.py, dgx-spark GB10, CPU)
Polars wins at scale vs pandas; crossover ~50–100k rows.
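As a rough illustration of the opt-in dispatch and the explicit NotImplementedError guards described in the Summary and Deferred sections, a minimal sketch might look like the following. All names here (`resolve_engine`, `dispatch_hop`, `DEFERRED_POLARS_PARAMS`, and the `hop_labels` key) are hypothetical; only engine='polars', engine='auto', and node query= come from the description, and cuDF handling is left out.

```python
from enum import Enum
from typing import Any

class Engine(Enum):
    PANDAS = "pandas"
    POLARS = "polars"


# Hypothetical parameter names standing in for the deferred features.
DEFERRED_POLARS_PARAMS = ("query", "hop_labels")


def resolve_engine(engine: str) -> Engine:
    """Resolve the requested engine; 'auto' never selects polars.

    Per the description, engine='auto' with polars input still coerces
    to pandas, so polars is reachable only by asking for it explicitly.
    """
    if engine == "auto":
        return Engine.PANDAS
    return Engine(engine)  # raises ValueError for unknown engine names


def dispatch_hop(engine: str = "auto", **params: Any) -> str:
    """Pick the execution lane at the boundary and return its name.

    A real dispatcher would call the lane's implementation; returning
    the name keeps this sketch self-contained.
    """
    resolved = resolve_engine(engine)
    if resolved is Engine.POLARS:
        used = [name for name in DEFERRED_POLARS_PARAMS if params.get(name) is not None]
        if used:
            # Fail loudly rather than silently falling back to another engine.
            raise NotImplementedError(
                f"engine='polars' does not yet support {used}; use engine='pandas'"
            )
        return "polars"
    return "pandas"


if __name__ == "__main__":
    print(dispatch_hop())                 # pandas
    print(dispatch_hop(engine="polars"))  # polars
```

Raising instead of falling back keeps the choice of engine visible to the caller, at the cost of requiring an explicit engine='pandas' for the deferred features.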
Stack
🤖 Generated with Claude Code