feat(gfql): native vectorized Polars engine for hop/chain (PR1 of 3) #1648

Open
lmeyerov wants to merge 3 commits into master from dev/gfql-polars-engine

@lmeyerov

Summary

First PR in a 3-PR stack adding a native Polars execution engine to GFQL.
This PR covers the core traversals hop() and chain().

  • Engine.POLARS added as opt-in (engine='polars'). engine='auto' with polars input still coerces to pandas, so existing users are unaffected.
  • New, isolated module graphistry/compute/gfql/engine_polars/ dispatched at the hop()/chain() boundary — the production pandas/cuDF internals are untouched (additive dispatch only).
  • Vectorization-first: BFS advances via semi/anti joins; predicates lower to polars expressions; no per-row Python work.

Covered

forward/reverse/undirected single-hop traversal · directed multi-hop chains · node/edge filter dicts + predicates · edge_match/source_node_match/destination_node_match · target_wave_front · alias names.

Deferred (explicit NotImplementedError → use engine='pandas')

variable-length/multi-hop edges · undirected edges in multi-edge chains · hop labels · node query=.
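
Callers who want the deferred features to keep working could wrap the call in a fallback. The sketch below uses only the standard library; `run_with_fallback` and the stub `fake_hop` are hypothetical stand-ins, and in real use the callable would wrap a hop() or chain() call with the given engine.

```python
def run_with_fallback(run, engines=("polars", "pandas")):
    """Try each engine in order, moving on only when one raises NotImplementedError."""
    if not engines:
        raise ValueError("at least one engine is required")
    last_error = None
    for engine in engines:
        try:
            return engine, run(engine)
        except NotImplementedError as err:
            last_error = err  # deferred feature: try the next engine
    raise last_error


def fake_hop(engine, use_hop_labels):
    """Stub standing in for a traversal call; hop labels are deferred on polars."""
    if engine == "polars" and use_hop_labels:
        raise NotImplementedError("hop labels: use engine='pandas'")
    return f"result from {engine}"


supported = run_with_fallback(lambda e: fake_hop(e, use_hop_labels=False))
deferred = run_with_fallback(lambda e: fake_hop(e, use_hop_labels=True))
```

Only NotImplementedError triggers the fallback, so genuine bugs in either engine still surface.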

Correctness

Differential parity vs the pandas engine is the gate:

  • test_engine_polars_hop.py (133 cases) + test_engine_polars_chain.py (18 cases) — green on dgx-spark (CPU).
  • Randomized fuzzer: 348/348 random (graph, chain) seeds on the supported surface.
  • Wired into the existing test-polars CI job via bin/test-polars.sh.
  • No pandas/cuDF regression (verified by interleaved old-vs-new pandas benchmark).
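
The differential-parity idea can be sketched without either engine: run a reference and a candidate implementation on the same seeded random inputs and record any seed where they disagree. The code below is a toy stand-in for the committed fuzzer, not a copy of it; both hop functions and the graph generator are assumptions for illustration.

```python
import random


def hop_reference(edges, seeds):
    """Reference one-hop forward traversal: scan every edge."""
    return {d for s, d in edges if s in seeds}


def hop_candidate(edges, seeds):
    """Candidate implementation: build an adjacency index, then look up seeds."""
    by_src = {}
    for s, d in edges:
        by_src.setdefault(s, set()).add(d)
    out = set()
    for s in seeds:
        out |= by_src.get(s, set())
    return out


def random_case(seed, n_nodes=8, n_edges=12):
    """Build a reproducible random (graph, seeds) pair from an integer seed."""
    rng = random.Random(seed)
    edges = [(rng.randrange(n_nodes), rng.randrange(n_nodes)) for _ in range(n_edges)]
    seeds = {rng.randrange(n_nodes) for _ in range(rng.randint(0, 3))}
    return edges, seeds


def fuzz(n_seeds):
    """Return the list of seeds on which the two implementations disagree."""
    failures = []
    for seed in range(n_seeds):
        edges, seeds = random_case(seed)
        if hop_reference(edges, seeds) != hop_candidate(edges, seeds):
            failures.append(seed)
    return failures


failing_seeds = fuzz(50)
```

Reporting the failing seeds, rather than a bare pass/fail, lets any mismatch be replayed deterministically.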

Performance (benchmarks/gfql/pandas_vs_polars.py, dgx-spark GB10, CPU)

Polars wins at scale vs pandas — crossover ~50–100k rows:

| workload @500k/2.5M | pandas_ms | polars_ms | speedup |
|---|---|---|---|
| chain 2-edge | ~1410 | 567 | 2.49x |
| chain n-e-n | ~725 | 444 | 1.63x |
| hop1 | 347 | 280 | 1.24x |
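
The speedup column is pandas_ms divided by polars_ms, rounded to two decimals. The short check below recomputes it from the timings in the table; the first two pandas timings are marked approximate in the table, so their ratios are approximate too.

```python
# (workload, pandas_ms, polars_ms) copied from the table above.
rows = [
    ("chain 2-edge", 1410, 567),
    ("chain n-e-n", 725, 444),
    ("hop1", 347, 280),
]

# speedup = pandas time / polars time
speedups = {name: round(pandas_ms / polars_ms, 2) for name, pandas_ms, polars_ms in rows}
```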

Stack

  • PR1 (this) — core traversals (hop/chain) + benchmarking.
  • PR2 — path/rows + Cypher-only features (stacked).
  • PR3 — pandas/polars/cuDF out-of-the-box benchmark comparison + optimization pass (stacked).

🤖 Generated with Claude Code

lmeyerov and others added 3 commits June 24, 2026 20:51
Add Engine.POLARS as an opt-in (engine='polars') native execution lane for the
core GFQL traversals hop() and chain(), dispatched at the engine boundary so the
production pandas/cuDF internals stay untouched. engine='auto' with polars input
still coerces to pandas (no behavior change for existing users).

Implementation (graphistry/compute/gfql/engine_polars/):
- hop.py: vectorized BFS via semi/anti joins; forward/reverse/undirected, hops/
  to_fixed_point, edge_match/source/destination_node_match, target_wave_front,
  return_as_wave_front seed semantics, endpoint materialization.
- chain.py: forward/backward/combine orchestration + node/edge alias names,
  reusing the polars hop; single-hop edges (directed multi-hop chains supported).
- predicates.py: filter_by_dict lowered to polars expressions (operator-identity
  dispatch), single-column pandas fallback for exotic predicates.

Deferred (explicit NotImplementedError): variable-length/multi-hop edges,
undirected edges in multi-edge chains, hop labels, node query=.

Validated by differential parity vs the pandas engine (hop + chain suites and a
randomized fuzzer) and benchmarked (benchmarks/gfql/pandas_vs_polars.py): polars
wins at scale (up to ~2.5x on multi-edge chains at millions of edges; crossover
~50-100k rows). No pandas/cuDF regression (additive dispatch only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Correctness:
- Native node materialization (ensure_nodes_polars) instead of the pandas-idiom
  materialize_nodes path (drop_duplicates/reset_index) — fixes crash on
  edges-only graphs (hop + chain).
- Coerce pandas start_nodes to polars in chain_polars (was AttributeError).
- Align join-key dtypes (cast endpoints/ids to the node-id dtype) so int/float
  node-id vs edge-endpoint graphs match pandas instead of raising SchemaError.
- _apply_node_names now uses the backward-PRUNED steps for alias participation
  (was the forward, un-pruned frames) — fixes silently-wrong alias columns on
  multi-step / reverse / mid-filtered chains.
- Guard target_wave_front-without-nodes (mirror pandas ValueError).

Perf: reuse an existing edge-id binding in hop_polars (e.g. chain's
__gfql_edge_index__) instead of synthesizing a second row index; defer
visited_edges concat to a single post-loop unique.

mypy: narrow Optional[str] node/source/destination bindings.

Tests: parametrize all deferred-param NotImplementedError guards; add a
committed randomized fuzzer; cover Between/IsIn/ge/le/eq/ne/contains/startswith/
endswith predicates and the exotic-predicate pandas fallback; add empty-graph,
duplicate-edge multiplicity, edges-only, dtype-mismatch, and pandas-start_nodes
cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es-only test

- test-polars CI job now emits a coverage artifact on py3.12 and the
  changed-line-coverage gate combines it, so the native polars engine lines
  (only exercised under engine='polars') are covered by the gate.
- test_polars_chain_edges_only_runs: drop the pandas comparison — pandas itself
  raises in its concat internals on this degenerate edges-only/no-binding input
  on newer pandas (passed only on 3.10's older pandas). Assert the polars engine
  runs and returns the sensible materialized result instead.
- Remove unused lazy_polars_import (dead code; engine imports polars directly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>