Implement operation extractor#26
Open
takaebato wants to merge 99 commits into
Open
Conversation
Codecov Report❌ Patch coverage is ❌ Your project check has failed because the head coverage (78.93%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## feat-operation-extractor #26 +/- ##
=============================================================
- Coverage 96.96% 78.93% -18.03%
=============================================================
Files 8 23 +15
Lines 428 2108 +1680
=============================================================
+ Hits 415 1664 +1249
- Misses 13 444 +431 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Preserve the prior conventions and add: - Architecture: resolver scope arena feeding into extractor consumers, with catalog scoped to relation-level enrichment. - Vocabulary: TableRole / TableUsage / StatementKind as three distinct axes. - Inline comment policy: default to none; write only when the *why* is non-obvious. - Layered test helper convention (extract / extract_with / extract_with_catalog).
Adds ScopeKind { Body, Predicate } stamped on each pushed scope, and a
flow-extraction pass that walks bindings through the scope arena to
emit per-statement TableFlow edges.
- A scope's kind is read from RelationResolver.pending_scope_kind at
push time. Clause walkers (WHERE, HAVING, JOIN ON, QUALIFY, MERGE
ON, MERGE clause predicates, DELETE/UPDATE selection, CONNECT BY,
AsOf match_condition) wrap their child walks in a with_scope_kind
guard so nested subqueries are stamped Predicate. The guard is a
scoped mem::replace, so no visit_* signatures change.
- RelationResolution gains feeding_read_tables / write_target_tables
helpers. feeding_read_tables filters out Read bindings whose scope
chain contains any Predicate ancestor, so `INSERT INTO t SELECT FROM
s WHERE id IN (SELECT id FROM x)` exposes `s` but not `x` as a flow
source. `x` remains visible via table_operations.
- operation_extractor emits flows only for data-moving statements
(INSERT / UPDATE / MERGE / CREATE TABLE AS / CREATE VIEW); DELETE /
DROP / TRUNCATE / ALTER / bare SELECT produce no flows.
- CTE bodies are Body scopes, so a Read in a CTE body still feeds the
outer write target (`WITH cte AS (SELECT FROM s) INSERT INTO t SELECT
FROM cte` emits `s → t`). Deeper transitivity (recursive CTEs,
multi-hop indirection) is intentionally out of scope.
- TableFlow gains Hash so downstream consumers can dedup via HashSet.
- 15 new tests cover the emit / predicate-block / non-emit matrix.
Restructures StatementTableOperations around three parallel surfaces
(reads, writes, flows) and migrates the legacy CRUD extractor onto the
operation extractor.
- StatementTableOperations now exposes `reads: Vec<TableRead>`,
`writes: Vec<TableWrite>`, and `flows: Vec<TableFlow>` instead of a
single `table_operations` list keyed by a `TableRole` enum. A
multi-role table (e.g. `DELETE t1 FROM t1` — t1 is both deletion
target and row source) appears in both lists.
- `TableOperation`, `TableRole`, and the `primary_role` collapse are
removed from the public API. `TableRole` survives as a `pub(crate)`
resolver-internal enum (binding metadata); the public surface speaks
reads/writes directly.
- `crud_table_extractor` becomes a thin shim over
`TableOperationExtractor::extract_from_statement`, bucketing
`reads` / `writes` into CRUD positions. The only AST inspection that
remains is MERGE clause classification (target placement depends on
which WHEN actions appear); a follow-up may surface those via
`merge_actions` if a second consumer needs them.
- Behavioral diffs from the legacy CRUD impl:
- UPDATE with JOIN — joined tables move from `update_tables` to
`read_tables` (only the head of `update.table` is the actual
write target).
- DELETE FROM t USING ... — the FROM target no longer appears in
`read_tables`. Both diffs match the SQL semantics; legacy quirks
are intentionally dropped.
- Drops dead code uncovered by the migration: `extractor/helper.rs`
(alias resolution / set diff helpers), `TableExtractor::
extract_from_table_node`, `RelationResolver::resolve_table_node`,
`RelationResolution::table_bindings`, and the `TableBinding`
view struct.
- Net: -545 lines (843 deleted, 298 added). 184 tests pass.
`TableReference` previously conflated physical identity
(catalog.schema.name) with use-site decoration (the alias given at
this occurrence). That made `Eq` / `Hash` use-site-dependent — `t1`
and `t1 AS x` hashed differently and would not dedup in a `HashSet`,
which is the wrong behavior for lineage aggregation and
cross-statement comparison.
`TableReference` is now identity-only:
```rust
pub struct TableReference {
pub catalog: Option<Ident>,
pub schema: Option<Ident>,
pub name: Ident,
}
```
- Resolver bindings carry alias as a separate `Option<Ident>` field
on `RelationBinding::Table`, used for name resolution but not
surfaced via the public API.
- `bind_base_table` takes alias as a separate parameter; new helpers
`TableReference::from_insert_with_alias` and `from_table_factor_with_alias`
return the identity / alias pair when constructing from sqlparser AST.
- `lookup_table_schema` no longer needs to alias-strip before the
catalog lookup — `TableReference` is already a valid catalog key.
- `Display` drops the `AS <alias>` suffix (e.g. `c1.s1.t1, t2`
instead of `c1.s1.t1 AS a1, t2`).
- `extract_tables` / `extract_crud_tables` / `extract_table_operations`
outputs no longer carry alias info. Test expectations simplified.
Net: -95 lines (200 deleted, 105 added). 184 tests pass.
`UnsupportedTableFactor`, `AmbiguousColumn`, `CatalogRequired`, and `DynamicSql` were aspirational placeholders; nothing currently emits them. The enum is `#[non_exhaustive]`, so they can be re-added later without a breaking change for downstream matchers.
Earlier wording collapsed both into a single "default to none" rule, which conflicted with the Rust convention of writing rustdoc on public items. Split into two: - `///` / `//!` rustdoc on public items is encouraged — it's the published API surface (cargo doc / docs.rs / IDE hovers). - Inline `//` comments should be concise and well-structured; an example is welcome when it clarifies. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stub-only first cut of `extract_column_operations` — establishes the public API shape so consumers can start integrating while later phases fill in the actual column tracking. - New `column_operation_extractor` module with `StatementColumnOperations` mirroring `StatementTableOperations` (`reads` / `writes` / `flows` + `diagnostics`). - `ColumnReference` is identity-only (`table: Option<TableReference>`, `name`), mirroring the table-level identity-vs-use-site split. - `ColumnTarget` enum distinguishes a `Persisted` column (INSERT / UPDATE / MERGE / CTAS / CREATE VIEW target) from a `QueryOutput` (transient projection result) so anonymous outputs from computed expressions can be identified by position. - `ColumnFlow` carries `source` / `target` / `kind`. MVP `kind` variants are `Passthrough` and `Computed`; the full predicate- influence set (`Filter`, `Join`, `GroupBy`, …) lands as the classification tightens in later phases. - Extractor currently returns empty `reads` / `writes` / `flows` for every statement and emits an `UnsupportedStatement` diagnostic for statement kinds outside the operation-extraction scope. - `classify_statement` is now `pub(super)` so both extractors share it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends RelationResolver to collect raw column references during its single AST walk, and uses that output in the column extractor for qualified reads. INSERT explicit column lists and UPDATE SET targets become writes scoped to the INSERT/UPDATE target. - RelationResolution gains `column_refs: Vec<RawColumnRef>` — identifier parts plus the scope_id where they appeared. Scope-chain resolution (for unqualified refs) is deferred. - Column extractor filters resolver output down to qualified refs, parses parts to TableReference + name, and emits ColumnRead. Writes come from statement-specific AST inspection (INSERT.columns, UPDATE.assignments) and stay scoped to the persistent target table. - flows stays empty; flow construction lands once reads/writes are rich enough. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds RelationResolution::resolve_unqualified_column with standard SQL inner-shadows-outer semantics: walks innermost-first and stops at the first scope with any candidate. Returns the owning table when exactly one binding could carry the column (real Table or synthesized TableReference for Cte / DerivedTable / TableFunction); 0 or 2+ candidates leave the column with table: None. "Could carry" is a single-rule filter: Known schemas must list the column, Unknown schemas always qualify. The two cases give the strictness gradient the catalog promises — without a catalog Table schemas stay Unknown and single-table scopes resolve unconditionally (best-effort, matches catalog: None's implicit promise); with a catalog Table schemas come back Known and false positives like a typo'd `count` are filtered out. Column extractor's collect_reads now routes parts.len() == 1 through the resolver. Unresolved refs surface as ColumnRead with table: None so the column name stays visible. Tests cover the unqualified resolution surface (single binding, multi-binding ambiguity, CTE/derived synthesized table, alias binding, inner shadowing) plus a small `catalog_strict` module that demonstrates the catalog-driven false-positive elimination end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inverts the projection-edge construction direction. visit_select
collects per-projection-item facts (source_refs, name, bare flag)
into a ProjectionGroup and pushes it to a per-query buffer; the
ResolvedQuery returned by resolve_query owns the collected groups.
Callers decide what edges to emit:
- visit_insert pairs each group's items positionally with the
INSERT target columns and emits Persisted edges. UNION sources
surface as multiple groups, so every branch pairs against the
same target columns — INSERT INTO t (a, b) SELECT x, y FROM s1
UNION ALL SELECT p, q FROM s2 now correctly emits four edges.
- resolve_query_emitting_query_output (a thin wrapper) handles
the default case: top-level Statement::Query, CTE bodies,
derived tables, scalar / predicate / function subqueries,
pipe-operator queries, CTAS / CREATE VIEW / ALTER VIEW source
queries. Each emits QueryOutput edges for its projections.
- SetExpr::Query bubbles the inner Query's projections into the
enclosing buffer via extend_projections so a parenthesized
INSERT source still pairs correctly.
The InsertTargetOverride state machine is gone — no more
install / take / restore dance, no anchor checks. visit_select no
longer touches flow_edges at all; only visit_insert, visit_update,
and emit_query_output_edges produce them.
with_branch_scope wraps each branch of SetExpr::SetOperation in
its own scope so name resolution doesn't see sibling branches'
FROM bindings — matching SQL's per-SELECT name resolution. A
single SELECT continues to bind to the enclosing query scope, so
existing table-extraction ordering is preserved.
Adds a regression test for UNION INSERT positional pairing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Statements that reference a CTE or derived table now surface
end-to-end column flows — references substitute through the
intermediate's body projections recursively until the chain bottoms
out at a base table. Reads apply the same lens: references whose
owning binding is synthetic (Cte / DerivedTable / TableFunction) are
dropped, leaving only references to real storage.
- Cte and DerivedTable bindings carry `body_projections` captured
from the body's `ResolvedQuery`. CTE / derived bodies now call raw
`resolve_query` (not the QueryOutput-emitting wrapper), so no
intermediate edges land in `flow_edges`; the body's projections
are stored on the binding for composition to consume.
- `RawColumnRef` gains walk-time `resolved` (owning table) and
`synthetic` (binding kind) fields. Resolution runs while scope
state is still authoritative, which matters for multi-CTE chains
where later bindings would otherwise ambify earlier inner refs.
- Two resolver post-passes on `into_relation_resolution`:
- `composed_flow_edges` substitutes synthetic-owned sources via
`body_projections`, AND'ing the outer edge's `bare` flag with
the body item's so `passthrough through computed` becomes
`computed`. Bounded by `MAX_COMPOSITION_DEPTH` as a cycle guard.
- `real_column_refs` filters `column_refs` by `raw.synthetic`,
dropping references that point at intermediates.
- The extractor's `resolve_raw_ref` collapses to a 1:1 mapping of
`(raw.resolved, raw.parts.last())` — no scope-chain walk at
extract time.
- `FROM cte AS c` re-binds the alias-bound Cte with the original
CTE's `body_projections` (via `cte_body_projections`) so
composition reaches through the alias too.
- Recursive CTEs keep their empty `body_projections` (fixpoint
capture deferred); composition falls back to leaving the ref
pointing at the recursive binding, which the reads filter then
drops.
Tests added: passthrough composition, computed-kind propagation,
INSERT end-to-end, CTE chain (qualified outer), repeated CTE
reference, derived-table composition, recursive-CTE no-crash, plus
two flipped expectations confirming synthetic refs no longer surface
in reads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each ColumnRead now carries `kinds: Vec<ReadKind>` recording the syntactic clause(s) the reference appeared in. First two variants: - Projection — SELECT list, UPDATE SET RHS, INSERT VALUES expr, INSERT source SELECT projection, scalar subquery's projection. - Filter — WHERE, HAVING, QUALIFY, JOIN ON, AsOf match condition, MERGE ON, CONNECT BY / START WITH, pipe-operator `|> WHERE`. ReadKind is `#[non_exhaustive]`; future variants (GroupBy / Sort / Window) land in later 5.6 sub-phases. `kinds` is `Vec` to make room for USING / NATURAL JOIN merged columns (one ref → multiple roles) without an API break. Mechanics: the resolver gains `pending_read_kind: ReadKind` (default Projection), set by clause-walking helpers and consumed by `record_column_ref`. `with_read_kind(kind, f)` saves / restores the field; `with_filter_clause(f)` is the convenience that combines `Filter` read-kind with the existing `Predicate` scope-kind — every former `with_scope_kind(Predicate, ...)` callsite (WHERE / HAVING / QUALIFY / JOIN ON / AsOf / MERGE ON / CONNECT BY / pipe `|> WHERE`) becomes `with_filter_clause(...)`. `resolve_query` saves / restores `pending_read_kind` (reset to Projection on entry), so a predicate subquery inherits Filter at its boundary but its own body's projection refs stay Projection. `RawColumnRef` mirrors `kinds`, captured at walk time. The extractor's `collect_reads` copies `raw.kinds` into `ColumnRead`. Tests: existing tests with WHERE / JOIN ON updated to expect `filter_read(...)` for those refs. New tests cover same column in projection + WHERE (two entries, two kinds), subquery WHERE refs stay Filter without leaking outer Projection, and MERGE ON refs classify as Filter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small cleanups, no behavior change: - Rename `pending_scope_kind` / `pending_read_kind` to `current_scope_kind` / `current_read_kind`. The fields are context-level defaults that get stamped onto *every* subsequent scope push / column-ref record while in effect, not single-use queued values — `current_*` reflects that better, and matches the existing `current_projections` field's naming. - Drop `mem::replace` for Copy-field save/restore. The original `let prev = mem::replace(&mut self.field, new); ... self.field = prev;` pattern is the textbook move for owning fields; for Copy fields it produces identical code as `let prev = self.field; self.field = new;` and reads cleaner. The remaining `mem::take` / `mem::replace` calls in `resolve_query` operate on `current_projections: Vec<…>` where moving out of the field genuinely requires it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ReadKind with `GroupBy` and `Sort`. Walking sites that now stamp these kinds: - visit_select: GROUP BY clause body (incl. ROLLUP / CUBE / GROUPING SETS modifiers) → GroupBy. - visit_select: CLUSTER BY / DISTRIBUTE BY (Hive / Spark partitioning directives) → GroupBy. They decide row clustering across shuffle, conceptually closer to GROUP BY than to value flow. - visit_select: SORT BY clause → Sort. - resolve_query: top-level ORDER BY → Sort. - visit_pipe_operator: |> ORDER BY → Sort. - visit_pipe_operator: |> AGGREGATE's group_by_expr → GroupBy (the aggregate args stay Projection since they're value-producing). Tests cover bare GROUP BY, ORDER BY, GROUP BY + HAVING combo (each clause carries its own kind), ROLLUP modifier, and a subquery in GROUP BY (whose own projection refs reset to Projection — the resolve_query boundary stops kind inheritance, consistent with the filter-subquery handling in 5.6a). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ReadKind with Window for refs inside an OVER (...) spec — PARTITION BY, the window's ORDER BY, and frame bound expressions. The aggregate's own argument (the `x` in `SUM(x) OVER (...)`) stays Projection since it's value-producing. `visit_window_spec` wraps its body in `with_read_kind(Window, ...)`, so both inline `OVER (...)` clauses on aggregate calls and named windows (`WINDOW w AS (...)`) get classified through the same path. Tests cover PARTITION BY only, ORDER BY only, and the combined form, plus implicit verification that the aggregate arg keeps Projection kind. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ColumnFlowKind with `Aggregation`, distinguishing aggregate function projections (`SUM(a)`, `COUNT(DISTINCT b)`, `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)`, ...) from generic expression projections. Walk-time `bare: bool` is replaced with a 3-state `kind: ColumnFlowKind` on `ProjectionItem` and `FlowEdge`, so the classification travels through composition uniformly. Detection uses two complementary signals so the maintenance burden of a hard-coded name list is bounded: 1. SQL-spec structural markers, attached only to aggregates per the standard — `FILTER (WHERE ...)` clause, `WITHIN GROUP (...)` clause, and `DISTINCT` inside the function arg list. These catch dialect-specific aggregates not in our name list and never misfire on scalar functions. 2. Name match (case-insensitive, last name segment) against a union list of common SQL aggregates across ANSI / Postgres / MySQL / BigQuery / Snowflake / Redshift. Covers the bare form `SUM(x)` etc. that carries no structural marker. Window-only functions (ROW_NUMBER / RANK / LAG / LEAD / NTILE / FIRST_VALUE / LAST_VALUE) are intentionally excluded — they meaningfully aggregate only within a window. The top-level expression check fires only for a bare aggregate call; `SUM(a) + 1` is `BinaryOp` at the top, so it stays Computed. Composition (`compose_flow_kinds`): Aggregation dominates either side; Passthrough survives only when both sides are Passthrough; otherwise Computed. A CTE that aggregates surfaces as Aggregation even when the outer projection forwards or further computes the result. UPDATE SET RHS uses the same classifier; UPDATE rarely allows top-level aggregates, but the kind field is now consistent across all flow-emission paths. Tests cover bare aggregate, aliased aggregate, aggregate wrapped in BinaryOp (Computed fallback), INSERT-SELECT propagation, CTE composition (inner-Aggregation + outer-Passthrough / outer-Computed → both Aggregation), and the two structural-marker paths (DISTINCT args, FILTER clause). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Postgres ON CONFLICT and MySQL ON DUPLICATE KEY UPDATE both live in `Insert.on: Option<OnInsert>`, which visit_insert previously didn't touch. So conflict-action SET targets never surfaced as writes and EXCLUDED.<col> refs were silently dropped from flows. Wire it up: - New `visit_insert_on` dispatches by `OnInsert` variant and reuses the existing `emit_assignment_flows` helper that UPDATE / MERGE WHEN MATCHED already use, so the per-assignment semantics stay identical across all three sites. - Postgres: bind `EXCLUDED` as a synthetic derived-table with the INSERT target's effective columns as its schema. Refs through it filter out of `reads` (synthetic-binding filter) but still emit Persisted flow edges into the target. EXCLUDED's source role surfaces in flow sources as Some(EXCLUDED) — composition into the source projection is deferred. - MySQL: no EXCLUDED binding. `VALUES(<col>)` is parsed as a regular function call, so the inner ref resolves to the INSERT target naturally; binding EXCLUDED would make those refs ambiguous. - DoUpdate's optional WHERE walks via `with_filter_clause` so refs inside it get `ReadKind::Filter`. - `collect_writes` extended: SET targets in DoUpdate or DuplicateKeyUpdate add ColumnWrites on the INSERT target table. Existing pinned-down on_conflict tests rewritten to reflect the new behavior; one new test for the DO UPDATE WHERE path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`emit_persisted_to_created`'s inferred-name fallback used the current ProjectionGroup's item names, so each UNION branch contributed its own name. Result: `CREATE TABLE dst AS SELECT a FROM t1 UNION SELECT b FROM t2` produced two writes (dst.a and dst.b) and two flows into different targets — diverging from the SQL standard, which says the result schema follows the left branch only. Pre-compute the left branch's item names from the first group and pair every branch's items against those same names by position. Same shape as INSERT-SELECT-UNION's positional pairing. Behavior for non-UNION sources is unchanged (the first group is the only group). Existing pinned-down test for the anomaly is rewritten to the correct (post-fix) shape: `writes = [dst.a]`, both branches' source flows feed `Persisted(dst.a)`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build EXCLUDED's synthetic binding with body_projections derived from
the INSERT source's projections — same shape CTEs use for flow
composition. Each source ProjectionGroup is cloned and its items
renamed positionally to the INSERT target column names, so
substitute_source's name-match lookup finds them when an
`EXCLUDED.<col>` ref appears in DO UPDATE SET.
Net effect: `EXCLUDED.<col>` flow sources compose all the way to the
source SELECT's base table refs instead of stopping at the synthetic
EXCLUDED binding. For example
INSERT INTO t (a, b) SELECT x, y FROM s
ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b
previously emitted `EXCLUDED.b → Persisted(t.b)`; now emits
`s.y → Persisted(t.b)` — the same base-table source as the
INSERT-pairing flow.
Behavior preserved for INSERT VALUES (no source projections →
empty body_projections → composition bottoms out at EXCLUDED). Two
new tests cover the composition: UNION source (EXCLUDED fans out
to each branch's position-N item) and aggregate source (composed
flow kind stays Aggregation via the dominant rule).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Postgres / Sqlite RETURNING projects from the target table's affected rows, but the resolver previously didn't walk it — refs in RETURNING were silently dropped. New `visit_returning` helper builds one `ProjectionGroup` from the items (reusing `build_projection_item`) and emits a `QueryOutput` flow per item, same shape as a top-level SELECT projection. Called from `visit_insert` / `visit_update` / `visit_delete` after their main walk completes. For INSERT, RETURNING walks BEFORE the on-clause so any EXCLUDED binding isn't yet in scope — RETURNING projects from the target table, never from the would-be-inserted pseudo-row, and an in-scope EXCLUDED would ambify unqualified refs that collide with INSERT column names. Seven tests covering: INSERT VALUES + RETURNING basic / aliased / computed / wildcard-suppressed; UPDATE / DELETE with RETURNING; INSERT SELECT with RETURNING (source scope already popped by then). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The resolver already handles `Expr::GroupingSets / Cube / Rollup` uniformly inside a `with_read_kind(GroupBy)` wrapper, so refs in either modifier carry `ReadKind::GroupBy` automatically. Existing test coverage was ROLLUP only; add three more to lock in the same behavior for CUBE, GROUPING SETS (including the empty-set member), and a mixed GROUP BY with plain exprs alongside ROLLUP. GROUPING SETS surfaces the same column more than once in `reads` when it appears in multiple sets — that's faithful to walk order and reflects the SQL meaning that each set is its own grouping. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`collect_writes` previously left ALTER TABLE empty on the column-op
surface — only the table itself appeared in table-level writes.
Inspect the operation list and emit a ColumnWrite per affected
column for the six column-naming variants:
- AddColumn → write(target, column_def.name)
- DropColumn → write(target, name) for each
- RenameColumn → BOTH old and new (rename moves data; both names
useful for column-history consumers)
- ChangeColumn → old + new (same as rename), or just one when
the names are unchanged
- ModifyColumn → write(target, col_name)
- AlterColumn → write(target, column_name)
Schema-level operations (constraints, partitions, RENAME TABLE)
contribute no column writes — they still surface as a table-level
write target via the existing resolver bind.
`AlterTableOperation` is `#[non_exhaustive]`, so the wildcard arm
treats new variants as "no column contribution" until explicitly
modeled. Six tests cover the supported variants plus the
constraint case.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three tests covering the catalog/schema/name qualifier path: - `c1.s1.t1.col` resolves to a TableReference with all three qualifier fields populated. - A bare column against a 3-part-qualified table inherits the full qualifier in the resolved ColumnReference. - 5+ part refs (e.g. `extra.c.s.t.col`) hit the qualifier decoder's 3-part cap and surface as table: None — struct-field access on a fully qualified column isn't modeled. No resolver changes — the existing TableReference::try_from_name already accepted up to 3 qualifier parts. Tests pin the behavior down so a future qualifier-decoder change (4-part / catalog-vs-db distinction / etc.) produces a clear diff. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
VALUES can stand in for a row-source in three non-INSERT positions: SELECT FROM (VALUES …) AS t(x, y) (derived), WITH cte(x, y) AS (VALUES …) (CTE body), and rows containing column refs that resolve via scope-chain walk-up. Pinned-down behavior: VALUES contributes no ProjectionItems (its literal-only rows have no source refs to capture), so flow sources bottom out at the synthetic binding name (`t.x`, `cte.id`, `v.x`) with no further composition. Reads stay empty because the synthetic-binding filter drops refs through derived / CTE bindings. A column ref inside a VALUES row still gets walked, and the resolver's permissive scope-chain rule lets it pick up outer FROM tables — useful for lineage from this rare-but-valid construct. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WITH cte AS (...) INSERT/UPDATE/DELETE/MERGE parses as a top-level
Statement::Query wrapping a SetExpr::{Insert|Update|Delete|Merge}.
Three fixes make the column / table operation surfaces correct for
this shape:
- classify_statement and collect_writes unwrap the SetExpr DML
wrapper and reclassify against the inner statement, so the
StatementKind is the verb the user wrote (not Select) and writes
follow the inner DML.
- A `FROM cte` reference now re-binds the CTE under its use-site
name in the current scope (carrying schema + body_projections),
not just for the aliased `FROM cte AS c` case. This gives
unqualified refs a single in-scope candidate instead of walking
up and ambifying against an outer-bound DML target, while keeping
catalog-aware strictness (a Known schema still rejects unknown
columns) and flow composition through the CTE body.
- visit_set_expr runs the wrapped DML in its own branch scope so the
DML target binding doesn't share the enclosing query's scope with
the CTEs — `DELETE FROM t WHERE id IN (SELECT id FROM cte)` now
resolves the predicate `id` unambiguously to t.
New `cte_schema` accessor mirrors `cte_body_projections`.
Also adds two window-frame tests (literal `ROWS BETWEEN 3 PRECEDING
AND CURRENT ROW` and `UNBOUNDED PRECEDING/FOLLOWING`) confirming
frame bounds with no column refs add nothing while staying inside
the Window read-kind wrapper.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four test-only additions pinning down standard-SQL shapes the resolver already handles: - Scalar subquery in projection: `SELECT a, (SELECT max(x) FROM s) AS m FROM t` emits the subquery's own QueryOutput edge (Aggregation from max) plus the outer projection's pairing — the latter is Computed, not Aggregation, because from the outer scope the item is a subquery expression and kind composition only merges through CTE / derived bindings, not a projection-level scalar subquery. - Simple CASE with a column WHEN pattern (`CASE x WHEN y THEN a ELSE b END`): both operand `x` and pattern `y` carry Conditional; results stay plain Projection. - Set-op trailing ORDER BY: the column resolves to None (Sort kind) because the order-by runs in the outer query scope after both branch scopes pop — it references a UNION output, not a base table. Trailing LIMIT literal adds nothing. - IS NULL / IS NOT NULL predicates: the column is a Filter read. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collapse the column-operation surfaces to the minimal distinctions
that are standalone-actionable, recovering everything else
structurally instead of stamping it during the walk.
- ColumnFlowKind drops to {Passthrough, Transformation}. The old
Aggregation / Computed split was lossy at the edges (window
aggregates, value-preserving STRING_AGG) and unnecessary for the
dependency / impact-analysis core. Composition keeps the chain a
Transformation whenever any step transforms; the helper now reads
"both Passthrough -> Passthrough, else Transformation".
- reads / writes become plain `Vec<ColumnReference>` (occurrence
based, duplicates kept — a column read twice appears twice). The
ReadKind / ColumnRead / ColumnWrite types are deleted: the
syntactic clause tag was never standalone-actionable and several
shapes (PIVOT, function args) had no honest classification. The
value-vs-filter distinction is now structural — a value
contributor is a `flows` source, a filter-only column is in
`reads` but not `flows`.
- Nested subqueries (scalar / EXISTS / IN / derived / PIVOT) resolve
raw via resolve_query, so no intermediate QueryOutput edge survives
for them; only CTE / derived bindings compose end-to-end through
stored body projections.
- VisitContext shrinks to `scope_kind` only (the structural bit that
gates table-flow exclusion); read_kind / in_case_condition and the
with_read_kind / with_case_condition helpers are gone. The
aggregate-name / structural-marker classification machinery in
projection.rs is removed with them — Transformation no longer
sub-classifies.
Docs (crate-level, README, example) and the whole-value test suite
are rewritten to the new model; test helpers reduce to read / write /
unresolved / col returning ColumnReference and flow_passthrough /
flow_transformation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TableRead / TableWrite were single-field newtypes wrapping a TableReference, reserved for "future positional / usage enrichment (FROM vs Predicate vs Join)". That distinction is the value-vs-filter one the model now recovers structurally — a table is a `flows` source if it feeds the write target, a predicate-only table is in `reads` but not `flows` — so the wrapper guards a slot that will not be filled. They also weren't `#[non_exhaustive]`, so they bought no SemVer headroom for adding fields later. Drop both structs; `reads` / `writes` become `Vec<TableReference>`, mirroring the column surfaces (`Vec<ColumnReference>`). Construction collapses to `resolution.read_tables()` / `write_tables()` directly, and the crud extractor stops unwrapping `.table` off each entry. Tests fold the `read` / `write` helper aliases into the existing `table(...)` builder. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adopt data-lineage terminology for the value-movement surface and shed the redundant `Statement` prefix on the per-statement result types, aligning the public API with the project's dependency / lineage framing. - `flows` surface → `lineage`. The directed `source → target` edges are column / table data lineage; "lineage" is the standard, discoverable term. Edge types follow: `ColumnFlow` → `ColumnLineageEdge`, `TableFlow` → `TableLineageEdge` (an edge is a lineage *edge*, not a lineage); `ColumnFlowKind` → `ColumnLineageKind`. - `StatementTableOperations` → `TableOperation`, `StatementColumnOperations` → `ColumnOperation`. The `Statement` prefix added nothing — every extractor returns `Vec<Result<X>>` (one element per statement), so per-statement-ness is already in the return shape. Singular matches the per-statement result unit (cf. the existing `TableExtraction`). The container stays `*Operation`, not `*Lineage`: it also carries `reads` / `writes` / `statement_kind` and is non-empty for statements with no lineage at all (DROP / TRUNCATE / DELETE). - Internal emit / extract helpers and tests rename in step (`extract_table_flows` → `extract_table_lineage`, etc.). The resolver-internal `FlowEdge` / `FlowTargetSpec` keep their names — they are not part of the public surface. - `ColumnTarget` / `Persisted` / `QueryOutput` and the `extract_*_operations` function names are intentionally unchanged. Docs (crate-level, README, module headers, the WildcardSuppressed diagnostic message) move to the lineage wording. CLAUDE.md is also brought up to date with the prior column-simplification work it had fallen behind on (ReadKind / ColumnRead / TableRead / Aggregation / Computed references, the dropped VisitContext fields, and the removed aggregate-classification note). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the single `Diagnostic` / `DiagnosticKind` with per-granularity
types so a result can only represent the conditions its extraction can
actually produce.
- `TableLevelDiagnostic` / `TableLevelDiagnosticKind { UnsupportedStatement }`
for the table-level surfaces; `ColumnLevelDiagnostic` /
`ColumnLevelDiagnosticKind { UnsupportedStatement, WildcardSuppressed,
AmbiguousColumn, UnresolvedColumn }` for `extract_column_operations`.
The split is by type, so a table-level result literally cannot carry a
column-only condition (e.g. a suppressed wildcard, which leaves column
lineage incomplete but doesn't affect the table set).
- The resolver produces the column-level superset; table-level surfaces
project it down through `ColumnLevelDiagnostic::to_table_level`, an
exhaustive match that drops the column-resolution kinds and forces a
decision when a new kind is added. This fixes the prior leak where a
catalog-backed `extract_table_operations` surfaced `AmbiguousColumn` /
`UnresolvedColumn` and every `SELECT *` surfaced `WildcardSuppressed`
at table granularity, where neither is meaningful.
- `ColumnLevelDiagnosticKind` documents the tool-side (coverage) vs
input-side (resolution) split: `UnsupportedStatement` / `Wildcard` are
gaps on our side; `Ambiguous` / `Unresolved` are gaps in the input
(a real engine would also reject them) — not lint, just an annotation
of why a reference was left `table: None`.
- `CrudTables` gains a `diagnostics: Vec<TableLevelDiagnostic>` field,
forwarded from the underlying table-level extraction (it previously
dropped them).
Also drop `#[non_exhaustive]` from `StatementKind` / `ColumnLineageKind`
(the diagnostic enums are introduced without it): while the crate is
pre-1.0, adding a variant should be a visible breaking change so
consumers re-acknowledge the new case instead of silently routing it to
a wildcard arm. Internal matches stay exhaustive. Re-add at the 1.0
freeze (removing it later is non-breaking; adding it is breaking, so the
1.0 boundary is the place).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The crate docs and README explained that ambiguous / unresolved column diagnostics are suppressed without a catalog, but not the consequence: unqualified columns across multiple in-scope tables (`SELECT x FROM a JOIN b`) resolve to `table: None`, so column lineage degrades catalog-free while table-level extraction stays robust. Spell that out, and note the suppression is to avoid flooding the output with noise (every `Unknown` schema could contain anything). Also drop the stale README "Aggregate detection" limitation — flow kinds collapsed to Passthrough / Transformation, so there is no aggregate name-list classification to misfire anymore. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`Persisted` was slightly wrong for views — a view doesn't persist data, only its definition. The model treats a view's columns identically to a table's (a `table`-qualified `ColumnReference`), so every non-`QueryOutput` target is uniformly "a column of a named relation" (base table or view). `Relation` names that accurately and matches the crate's existing relation vocabulary (`RelationSchema`, the `relation` module, `TableReference`); `QueryOutput` continues to carve out the transient top-level-SELECT case. Public-only rename: the resolver-internal `FlowTargetSpec::Persisted` keeps its name (not part of the public surface) and still maps to the public `ColumnTarget::Relation` at extraction time. Docs, the example, and tests follow; the test helper `persisted(...)` becomes `relation(...)`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier the public surface moved to lineage / `Relation` vocabulary but the resolver internals still spoke "flow" / "Persisted", so a reader greps one name and finds the other. Bring the internals in line (all `pub(crate)` / private — no public API change): - `FlowEdge` → `LineageEdge`, `FlowTargetSpec` → `LineageTargetSpec`, and its `Persisted` variant → `Relation` (mirrors the public `ColumnTarget::Relation`; the extractor still maps internal → public at conversion time). - `flow_edges` → `lineage_edges`, `push_flow_edge` → `push_lineage_edge`, `composed_flow_edges` → `composed_lineage_edges`, `compose_flow_kinds` → `compose_lineage_kinds`, `persisted_target_writes` → `relation_target_writes`, `emit_persisted_to_created` → `emit_relation_to_created`. - Module `resolver/flow.rs` → `resolver/lineage.rs` (`mod flow` → `mod lineage`); doc-link `[`flow`]` updated. - Doc / comment prose follows (lineage edges, relation targets); the public crate-doc limitation heading "Flow kind" → "Lineage kind". Also fixes a stale "Aggregation dominates" note left from the flow-kind collapse. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rename the test builders to match the lineage / kind vocabulary: column-level `flow_passthrough` / `flow_transformation` → `passthrough` / `transformation` (the kind is the distinguishing part), and the table-level `flow(src, tgt)` builder → `edge(src, tgt)` (`lineage(...)` would collide with the `lineage:` field name). Test-only; no behavior change. Descriptive test *function* names that use "flow" as plain English (e.g. `predicate_subquery_does_not_feed_flow`, `..._emits_no_flow`) are left as-is — there "flow" reads as the behavior, not a type. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Purge "flow" from test function names for vocabulary consistency: noun uses → "lineage" / "lineage edge" (e.g. emits_no_flow → emits_no_lineage, emits_one_flow_per_branch → emits_one_lineage_edge_per_branch); verb uses rephrased (cte_data_flows_through_to_write_target → cte_data_reaches_write_target, scalar_subquery..._flows_only_to_outer → ..._feeds_only_outer, ..._refs_surface_and_flow_as_transformation → ..._refs_surface_and_transform). Also the invariants helper flow_relation_table → edge_relation_table and the test relation_flow_targets... → relation_lineage_targets.... Test-only; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Finish the lineage-vocabulary alignment: rename loop-variable / doc-
example bindings `flow` → `edge` (they iterate `ops.lineage`), and
reword comment prose off "flow" — noun uses → "lineage edge" /
"lineage", verb uses reworded ("flows into" → "feeds into", "data
flows through" → "data moves through", "Composition flows past" →
"Composition reaches past"). Examples and the crate-doc example are
included. The only "flow" left is sqlparser's `ListAggOnOverflow`
(unrelated — over*flow*). No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the last stale "flow" wording from the README prose (the code
examples already use the current API): the column feature/usage notes
now say `lineage` edges "carry a kind" rather than a "flow-kind", and
the recursive-CTE limitation says "lineage composition". Fixes a minor
grammar slip ("`lineage` form" → "forms"). No example changes — those
already match the renamed `lineage` field and `ColumnLevelDiagnosticKind`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`ResolvedQuery.scope_id` was never read (only stored), so drop the field — `resolve_query` keeps the `push_query_scope` call for its scope-stack side effect but no longer binds the returned id. Removing it also clears the last genuinely-needed `#[allow(dead_code)]`. The other six allows (on ScopeKind / RelationSchema / Column / Binding / Scope / Resolution) were stale — nothing under them is actually dead — so drop all seven. Now the resolver carries no dead-code suppressions, so a future unused field/variant surfaces immediately instead of hiding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two internal one-field types that no longer earn their keep:
- `VisitContext` had shrunk to a single `scope_kind` field after the
read_kind / in_case_condition removal. Inline it as a `scope_kind`
field on `Resolver` and drop the generic `with_context` save/restore;
`with_filter_clause` now saves/restores `scope_kind` directly and
`with_branch_scope` reads it directly. The useful scoped helpers stay.
- `Column { name: Ident }` was a pure newtype over `Ident`. Use `Ident`
directly in `RelationSchema::Known(Vec<Ident>)` — the resolver only
needs the column's identity, and the richer `ColumnSchema` already
exists on the catalog side for callers that have types.
Net -60 lines, no behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`ColumnReference` is a fundamental identity type (it even wraps `Option<TableReference>`), not column-extraction-specific — but it lived in `column_operation_extractor.rs` while its sibling `TableReference` sat in `relation.rs`. Move it so both identity-only reference types share the relation-model module (which now lives up to its "relation model types" doc). The lineage-specific types (`ColumnTarget` / `ColumnLineageEdge` / `ColumnLineageKind`) stay with the column extractor. Public paths are unchanged — both are re-exported at the crate root — so `sql_insight::ColumnReference` is unaffected. Also unifies the now-imported `Ident` path in relation.rs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`TableReference` / `ColumnReference` are not relations — a relation is a tuple set, and these aren't even schemas (no attributes, no domains). They are *qualified names* that denote a table / column in a catalog: pure identity. Rename the module `relation` → `reference` so it matches the `*Reference` types it holds and what they actually are; update the module doc accordingly. (`RelationSchema` and `ColumnTarget::Relation` keep "relation" — those genuinely concern a relation's schema / a named relation.) Internal move plus the public module path `sql_insight::relation::*` → `sql_insight::reference::*`; the crate-root re-exports (`sql_insight::TableReference` etc.) are unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`BindingKey` modelled quoting as a separate namespace (`Quoted` vs `Unquoted` variants), so `"id"` and unquoted `id` never matched even though they denote the same column. That conflated quoting with identity. Quoting actually controls only whether the name is case-folded: fold unquoted names (lowercase, PostgreSQL / MySQL convention), keep quoted ones exact, then compare the normalized strings. Now `"id"` == `id` and `"ID"` != `id`, matching SQL semantics. Collapse the enum to a `BindingKey(String)` newtype holding the normalized form. All call sites already went through `from_ident` + equality / map-key, so the change is localized. (Fold *direction* is still a dialect approximation — orthogonal, unchanged.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous comment justified the rule as "PostgreSQL / MySQL convention", which is imprecise (MySQL doesn't fold — it compares case-insensitively, and treats quoting as escaping). Reframe around the actual basis after surveying the supported dialects: - unquoted → lowercase yields case-insensitive matching, the common denominator across every supported dialect except ClickHouse (which is over-matched, soundly); fold direction only matters at the quoted/unquoted edge, and lowercase follows the popular majority. - quoted → exact is the ANSI / PostgreSQL behavior; the MySQL / BigQuery / SQLite family treat quoting as escaping, so this is stricter for quoted names — accepted as rare. Also notes this is one fixed rule (not varied by dialect or table-vs-column) and that faithful per-dialect resolution is deferred. Doc-only; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A qualified ref like `u.a` over `FROM t1 AS u` previously surfaced the alias `u` as its table; it now canonicalizes to the binding's alias-free real table `t1`, matching how unqualified refs already resolve. Synthetic bindings (CTE / derived / table function) keep the qualifier verbatim so lineage composition can re-find them by name, and multi-segment (schema/catalog-qualified) names pass through untouched. Also bundles two related tidy-ups: - Change `ColumnSchema.name` from `Ident` to `String`. A catalog provides column identities and matching is case-insensitive by default, so the public Catalog type no longer needs sqlparser's quote-style / span. - Document on `Catalog::columns` that identifier case-folding is the implementation's responsibility — the resolver passes table names as written and does not normalize them. Adds tests for alias canonicalization and for case / quote matching at the catalog and CTE-composition surfaces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two clarifications to the public Catalog docs: - The catalog is open-world: a table it returns no columns for means "schema unknown", not "nonexistent", so a misspelled / unknown table name is never flagged — it surfaces as an ordinary read / write with an unknown schema. Column-level strictness only applies where a known schema is in scope. - Identifier case-folding is the implementation's responsibility only for the table lookup itself; the returned column names are matched against SQL column references by the resolver's own fixed rule (unquoted folds to lowercase, quoted is exact), independent of the implementation and the dialect. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Newer stable clippy flags the `if ct.query.is_some()` inside the `Statement::CreateTable` arm (collapsible_match). Move the condition into a match guard. Behaviour is unchanged: a plain `CREATE TABLE` (no query) now falls through to the trailing no-op arm, exactly as the empty `if` did before — only CTAS emits writes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.