feat(gfql): native vectorized Polars engine for hop/chain (PR1 of 3) by lmeyerov · Pull Request #1648 · graphistry/pygraphistry

lmeyerov · 2026-06-25T03:51:43Z

Summary

First PR in a 3-PR stack adding a native Polars execution engine to GFQL.
This PR covers the core traversals — hop() and chain().

Engine.POLARS added as opt-in (engine='polars'). engine='auto' with polars input still coerces to pandas, so existing users are unaffected.
New, isolated module graphistry/compute/gfql/engine_polars/ dispatched at the hop()/chain() boundary — the production pandas/cuDF internals are untouched (additive dispatch only).
Vectorization-first: BFS advances via semi/anti joins; predicates lower to polars expressions; no per-row Python work.

Covered

forward/reverse/undirected single-hop traversal · directed multi-hop chains · node/edge filter dicts + predicates · edge_match/source_node_match/destination_node_match · target_wave_front · alias names.

Deferred (explicit `NotImplementedError` → use `engine='pandas'`)

variable-length/multi-hop edges · undirected edges in multi-edge chains · hop labels · node query=.
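A caller that wants to try the Polars lane first and drop back to pandas on a deferred feature could wrap the call as below. This is a generic sketch: `run_with_fallback` and `fake_query` are hypothetical names, and the real call would be whatever hop/chain invocation accepts an `engine=` argument.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def run_with_fallback(
    run_query: Callable[[str], T],
    engines: Sequence[str] = ("polars", "pandas"),
) -> Tuple[str, T]:
    """Try each engine in order; move on only when one raises NotImplementedError."""
    last_error = None
    for engine in engines:
        try:
            return engine, run_query(engine)
        except NotImplementedError as exc:
            # A deferred feature: remember the error and try the next engine.
            last_error = exc
    raise NotImplementedError("no engine supports this query") from last_error


def fake_query(engine: str) -> str:
    # Stand-in for a real query; pretends the feature is deferred on polars.
    if engine == "polars":
        raise NotImplementedError("hop labels are deferred")
    return "result"


used_engine, result = run_with_fallback(fake_query)
```

Only `NotImplementedError` triggers the fallback, so genuine bugs in either engine still surface instead of being masked.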

Correctness

Differential parity vs the pandas engine is the gate:

test_engine_polars_hop.py (133 cases) + test_engine_polars_chain.py (18 cases) — green on dgx-spark (CPU).
Randomized fuzzer: 348/348 random (graph, chain) seeds on the supported surface.
Wired into the existing test-polars CI job via bin/test-polars.sh.
No pandas/cuDF regression (verified by interleaved old-vs-new pandas benchmark).
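The shape of a seeded differential fuzzer can be shown with a toy example. Everything here is hypothetical and stdlib-only: two independent one-hop implementations stand in for the pandas and Polars engines, and the committed fuzzer in the PR is not reproduced.

```python
import random
from collections import defaultdict


def random_case(seed: int, n_nodes: int = 8, n_edges: int = 12):
    """Build a reproducible random (edges, seeds) pair from an integer seed."""
    rng = random.Random(seed)
    edges = [(rng.randrange(n_nodes), rng.randrange(n_nodes)) for _ in range(n_edges)]
    seeds = {rng.randrange(n_nodes) for _ in range(2)}
    return edges, seeds


def hop_reference(edges, seeds):
    """Reference engine: scan every edge and test its source."""
    return {dst for src, dst in edges if src in seeds}


def hop_candidate(edges, seeds):
    """Candidate engine: build an adjacency index, then look up each seed."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
    out = set()
    for seed in seeds:
        out |= adjacency.get(seed, set())
    return out


def failing_seeds(seeds_to_try):
    """Return the seeds on which the two engines disagree."""
    failures = []
    for seed in seeds_to_try:
        edges, seeds = random_case(seed)
        if hop_reference(edges, seeds) != hop_candidate(edges, seeds):
            failures.append(seed)
    return failures


failures = failing_seeds(range(100))
```

Reporting the failing seed rather than the failing graph keeps each counterexample reproducible from a single integer.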

Performance (`benchmarks/gfql/pandas_vs_polars.py`, dgx-spark GB10, CPU)

Polars wins at scale vs pandas — crossover ~50–100k rows:

| workload @500k/2.5M | pandas_ms | polars_ms | speedup |
| --- | --- | --- | --- |
| chain 2-edge | ~1410 | 567 | 2.49x |
| chain n-e-n | ~725 | 444 | 1.63x |
| hop1 | 347 | 280 | 1.24x |
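The speedup column is pandas_ms divided by polars_ms. The short check below recomputes it from the timings in the table, treating the approximate pandas values (~1410, ~725) as exact for the arithmetic.

```python
# Timings in milliseconds, copied from the table: (pandas_ms, polars_ms).
timings = {
    "chain 2-edge": (1410, 567),
    "chain n-e-n": (725, 444),
    "hop1": (347, 280),
}

# Speedup = pandas time / polars time, rounded to two decimals as in the table.
speedups = {name: round(pandas_ms / polars_ms, 2)
            for name, (pandas_ms, polars_ms) in timings.items()}

for name, value in speedups.items():
    print(f"{name}: {value}x")
```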

Stack

PR1 (this) — core traversals (hop/chain) + benchmarking.
PR2 — path/rows + Cypher-only features (stacked).
PR3 — pandas/polars/cuDF out-of-the-box benchmark comparison + optimization pass (stacked).

🤖 Generated with Claude Code

Add Engine.POLARS as an opt-in (engine='polars') native execution lane for the core GFQL traversals hop() and chain(), dispatched at the engine boundary so the production pandas/cuDF internals stay untouched. engine='auto' with polars input still coerces to pandas (no behavior change for existing users).

Implementation (graphistry/compute/gfql/engine_polars/):
- hop.py: vectorized BFS via semi/anti joins; forward/reverse/undirected, hops/to_fixed_point, edge_match/source/destination_node_match, target_wave_front, return_as_wave_front seed semantics, endpoint materialization.
- chain.py: forward/backward/combine orchestration + node/edge alias names, reusing the polars hop; single-hop edges (directed multi-hop chains supported).
- predicates.py: filter_by_dict lowered to polars expressions (operator-identity dispatch), single-column pandas fallback for exotic predicates.

Deferred (explicit NotImplementedError): variable-length/multi-hop edges, undirected edges in multi-edge chains, hop labels, node query=.

Validated by differential parity vs the pandas engine (hop + chain suites and a randomized fuzzer) and benchmarked (benchmarks/gfql/pandas_vs_polars.py): polars wins at scale (up to ~2.5x on multi-edge chains at millions of edges; crossover ~50-100k rows). No pandas/cuDF regression (additive dispatch only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Correctness:
- Native node materialization (ensure_nodes_polars) instead of the pandas-idiom materialize_nodes path (drop_duplicates/reset_index); fixes a crash on edges-only graphs (hop + chain).
- Coerce pandas start_nodes to polars in chain_polars (was AttributeError).
- Align join-key dtypes (cast endpoints/ids to the node-id dtype) so int/float node-id vs edge-endpoint graphs match pandas instead of raising SchemaError.
- _apply_node_names now uses the backward-PRUNED steps for alias participation (was the forward, un-pruned frames); fixes silently wrong alias columns on multi-step / reverse / mid-filtered chains.
- Guard target_wave_front-without-nodes (mirror pandas ValueError).

Perf: reuse an existing edge-id binding in hop_polars (e.g. chain's __gfql_edge_index__) instead of synthesizing a second row index; defer visited_edges concat to a single post-loop unique.

mypy: narrow Optional[str] node/source/destination bindings.

Tests: parametrize all deferred-param NotImplementedError guards; add a committed randomized fuzzer; cover Between/IsIn/ge/le/eq/ne/contains/startswith/endswith predicates and the exotic-predicate pandas fallback; add empty-graph, duplicate-edge multiplicity, edges-only, dtype-mismatch, and pandas-start_nodes cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…es-only test

- test-polars CI job now emits a coverage artifact on py3.12 and the changed-line-coverage gate combines it, so the native polars engine lines (only exercised under engine='polars') are covered by the gate.
- test_polars_chain_edges_only_runs: drop the pandas comparison, because pandas itself raises in its concat internals on this degenerate edges-only/no-binding input on newer pandas (passed only on 3.10's older pandas). Assert the polars engine runs and returns the sensible materialized result instead.
- Remove unused lazy_polars_import (dead code; engine imports polars directly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov and others added 3 commits June 24, 2026 20:51

lmeyerov wants to merge 3 commits into master from dev/gfql-polars-engine
