Skip to content

Implement operation extractor#26

Open
takaebato wants to merge 99 commits into
feat-operation-extractorfrom
scope-resolver
Open

Implement operation extractor#26
takaebato wants to merge 99 commits into
feat-operation-extractorfrom
scope-resolver

Conversation

@takaebato
Copy link
Copy Markdown
Owner

No description provided.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 74.84922% with 417 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.93%. Comparing base (84b673e) to head (c2fc96a).

Files with missing lines Patch % Lines
sql-insight/src/resolver/expr.rs 38.73% 223 Missing ⚠️
sql-insight/src/resolver/table.rs 46.66% 120 Missing ⚠️
sql-insight/src/resolver/statement.rs 89.27% 28 Missing ⚠️
sql-insight/src/resolver/query.rs 87.13% 22 Missing ⚠️
...insight/src/extractor/table_operation_extractor.rs 89.55% 7 Missing ⚠️
sql-insight/src/resolver/binding.rs 97.12% 6 Missing ⚠️
sql-insight/src/resolver/composition.rs 91.22% 5 Missing ⚠️
sql-insight/src/reference.rs 95.23% 2 Missing ⚠️
sql-insight/src/resolver/projection.rs 91.66% 2 Missing ⚠️
sql-insight/src/resolver/column_ref.rs 99.01% 1 Missing ⚠️
... and 1 more

❌ Your project check has failed because the head coverage (78.93%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@                      Coverage Diff                      @@
##           feat-operation-extractor      #26       +/-   ##
=============================================================
- Coverage                     96.96%   78.93%   -18.03%     
=============================================================
  Files                             8       23       +15     
  Lines                           428     2108     +1680     
=============================================================
+ Hits                            415     1664     +1249     
- Misses                           13      444      +431     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@takaebato takaebato changed the base branch from master to feat-operation-extractor May 16, 2026 12:42
takaebato and others added 19 commits May 16, 2026 21:59
Preserve the prior conventions and add:
- Architecture: resolver scope arena feeding into extractor consumers,
  with catalog scoped to relation-level enrichment.
- Vocabulary: TableRole / TableUsage / StatementKind as three distinct
  axes.
- Inline comment policy: default to none; write only when the *why* is
  non-obvious.
- Layered test helper convention (extract / extract_with /
  extract_with_catalog).
Adds ScopeKind { Body, Predicate } stamped on each pushed scope, and a
flow-extraction pass that walks bindings through the scope arena to
emit per-statement TableFlow edges.

- A scope's kind is read from RelationResolver.pending_scope_kind at
  push time. Clause walkers (WHERE, HAVING, JOIN ON, QUALIFY, MERGE
  ON, MERGE clause predicates, DELETE/UPDATE selection, CONNECT BY,
  AsOf match_condition) wrap their child walks in a with_scope_kind
  guard so nested subqueries are stamped Predicate. The guard is a
  scoped mem::replace, so no visit_* signatures change.
- RelationResolution gains feeding_read_tables / write_target_tables
  helpers. feeding_read_tables filters out Read bindings whose scope
  chain contains any Predicate ancestor, so `INSERT INTO t SELECT FROM
  s WHERE id IN (SELECT id FROM x)` exposes `s` but not `x` as a flow
  source. `x` remains visible via table_operations.
- operation_extractor emits flows only for data-moving statements
  (INSERT / UPDATE / MERGE / CREATE TABLE AS / CREATE VIEW); DELETE /
  DROP / TRUNCATE / ALTER / bare SELECT produce no flows.
- CTE bodies are Body scopes, so a Read in a CTE body still feeds the
  outer write target (`WITH cte AS (SELECT FROM s) INSERT INTO t SELECT
  FROM cte` emits `s → t`). Deeper transitivity (recursive CTEs,
  multi-hop indirection) is intentionally out of scope.
- TableFlow gains Hash so downstream consumers can dedup via HashSet.
- 15 new tests cover the emit / predicate-block / non-emit matrix.
Restructures StatementTableOperations around three parallel surfaces
(reads, writes, flows) and migrates the legacy CRUD extractor onto the
operation extractor.

- StatementTableOperations now exposes `reads: Vec<TableRead>`,
  `writes: Vec<TableWrite>`, and `flows: Vec<TableFlow>` instead of a
  single `table_operations` list keyed by a `TableRole` enum. A
  multi-role table (e.g. `DELETE t1 FROM t1` — t1 is both deletion
  target and row source) appears in both lists.
- `TableOperation`, `TableRole`, and the `primary_role` collapse are
  removed from the public API. `TableRole` survives as a `pub(crate)`
  resolver-internal enum (binding metadata); the public surface speaks
  reads/writes directly.
- `crud_table_extractor` becomes a thin shim over
  `TableOperationExtractor::extract_from_statement`, bucketing
  `reads` / `writes` into CRUD positions. The only AST inspection that
  remains is MERGE clause classification (target placement depends on
  which WHEN actions appear); a follow-up may surface those via
  `merge_actions` if a second consumer needs them.
- Behavioral diffs from the legacy CRUD impl:
  - UPDATE with JOIN — joined tables move from `update_tables` to
    `read_tables` (only the head of `update.table` is the actual
    write target).
  - DELETE FROM t USING ... — the FROM target no longer appears in
    `read_tables`. Both diffs match the SQL semantics; legacy quirks
    are intentionally dropped.
- Drops dead code uncovered by the migration: `extractor/helper.rs`
  (alias resolution / set diff helpers), `TableExtractor::
  extract_from_table_node`, `RelationResolver::resolve_table_node`,
  `RelationResolution::table_bindings`, and the `TableBinding`
  view struct.
- Net: -545 lines (843 deleted, 298 added). 184 tests pass.
`TableReference` previously conflated physical identity
(catalog.schema.name) with use-site decoration (the alias given at
this occurrence). That made `Eq` / `Hash` use-site-dependent — `t1`
and `t1 AS x` hashed differently and would not dedup in a `HashSet`,
which is the wrong behavior for lineage aggregation and
cross-statement comparison.

`TableReference` is now identity-only:

```rust
pub struct TableReference {
    pub catalog: Option<Ident>,
    pub schema: Option<Ident>,
    pub name: Ident,
}
```

- Resolver bindings carry alias as a separate `Option<Ident>` field
  on `RelationBinding::Table`, used for name resolution but not
  surfaced via the public API.
- `bind_base_table` takes alias as a separate parameter; new helpers
  `TableReference::from_insert_with_alias` and `from_table_factor_with_alias`
  return the identity / alias pair when constructing from sqlparser AST.
- `lookup_table_schema` no longer needs to alias-strip before the
  catalog lookup — `TableReference` is already a valid catalog key.
- `Display` drops the `AS <alias>` suffix (e.g. `c1.s1.t1, t2`
  instead of `c1.s1.t1 AS a1, t2`).
- `extract_tables` / `extract_crud_tables` / `extract_table_operations`
  outputs no longer carry alias info. Test expectations simplified.

Net: -95 lines (200 deleted, 105 added). 184 tests pass.
`UnsupportedTableFactor`, `AmbiguousColumn`, `CatalogRequired`, and
`DynamicSql` were aspirational placeholders; nothing currently emits
them. The enum is `#[non_exhaustive]`, so they can be re-added later
without a breaking change for downstream matchers.
Earlier wording collapsed both into a single "default to none" rule,
which conflicted with the Rust convention of writing rustdoc on public
items. Split into two:

- `///` / `//!` rustdoc on public items is encouraged — it's the
  published API surface (cargo doc / docs.rs / IDE hovers).
- Inline `//` comments should be concise and well-structured; an
  example is welcome when it clarifies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stub-only first cut of `extract_column_operations` — establishes the
public API shape so consumers can start integrating while later phases
fill in the actual column tracking.

- New `column_operation_extractor` module with
  `StatementColumnOperations` mirroring `StatementTableOperations`
  (`reads` / `writes` / `flows` + `diagnostics`).
- `ColumnReference` is identity-only (`table: Option<TableReference>`,
  `name`), mirroring the table-level identity-vs-use-site split.
- `ColumnTarget` enum distinguishes a `Persisted` column (INSERT /
  UPDATE / MERGE / CTAS / CREATE VIEW target) from a `QueryOutput`
  (transient projection result) so anonymous outputs from computed
  expressions can be identified by position.
- `ColumnFlow` carries `source` / `target` / `kind`. MVP `kind`
  variants are `Passthrough` and `Computed`; the full predicate-
  influence set (`Filter`, `Join`, `GroupBy`, …) lands as the
  classification tightens in later phases.
- Extractor currently returns empty `reads` / `writes` / `flows` for
  every statement and emits an `UnsupportedStatement` diagnostic for
  statement kinds outside the operation-extraction scope.
- `classify_statement` is now `pub(super)` so both extractors share it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends RelationResolver to collect raw column references during its
single AST walk, and uses that output in the column extractor for
qualified reads. INSERT explicit column lists and UPDATE SET targets
become writes scoped to the INSERT/UPDATE target.

- RelationResolution gains `column_refs: Vec<RawColumnRef>` —
  identifier parts plus the scope_id where they appeared. Scope-chain
  resolution (for unqualified refs) is deferred.
- Column extractor filters resolver output down to qualified refs,
  parses parts to TableReference + name, and emits ColumnRead. Writes
  come from statement-specific AST inspection (INSERT.columns,
  UPDATE.assignments) and stay scoped to the persistent target table.
- flows stays empty; flow construction lands once reads/writes are
  rich enough.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds RelationResolution::resolve_unqualified_column with standard SQL
inner-shadows-outer semantics: walks innermost-first and stops at the
first scope with any candidate. Returns the owning table when exactly
one binding could carry the column (real Table or synthesized
TableReference for Cte / DerivedTable / TableFunction); 0 or 2+
candidates leave the column with table: None.

"Could carry" is a single-rule filter: Known schemas must list the
column, Unknown schemas always qualify. The two cases give the
strictness gradient the catalog promises — without a catalog Table
schemas stay Unknown and single-table scopes resolve unconditionally
(best-effort, matches catalog: None's implicit promise); with a
catalog Table schemas come back Known and false positives like a
typo'd `count` are filtered out.

Column extractor's collect_reads now routes parts.len() == 1 through
the resolver. Unresolved refs surface as ColumnRead with table: None
so the column name stays visible.

Tests cover the unqualified resolution surface (single binding,
multi-binding ambiguity, CTE/derived synthesized table, alias
binding, inner shadowing) plus a small `catalog_strict` module that
demonstrates the catalog-driven false-positive elimination end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inverts the projection-edge construction direction. visit_select
collects per-projection-item facts (source_refs, name, bare flag)
into a ProjectionGroup and pushes it to a per-query buffer; the
ResolvedQuery returned by resolve_query owns the collected groups.
Callers decide what edges to emit:

  - visit_insert pairs each group's items positionally with the
    INSERT target columns and emits Persisted edges. UNION sources
    surface as multiple groups, so every branch pairs against the
    same target columns — INSERT INTO t (a, b) SELECT x, y FROM s1
    UNION ALL SELECT p, q FROM s2 now correctly emits four edges.
  - resolve_query_emitting_query_output (a thin wrapper) handles
    the default case: top-level Statement::Query, CTE bodies,
    derived tables, scalar / predicate / function subqueries,
    pipe-operator queries, CTAS / CREATE VIEW / ALTER VIEW source
    queries. Each emits QueryOutput edges for its projections.
  - SetExpr::Query bubbles the inner Query's projections into the
    enclosing buffer via extend_projections so a parenthesized
    INSERT source still pairs correctly.

The InsertTargetOverride state machine is gone — no more
install / take / restore dance, no anchor checks. visit_select no
longer touches flow_edges at all; only visit_insert, visit_update,
and emit_query_output_edges produce them.

with_branch_scope wraps each branch of SetExpr::SetOperation in
its own scope so name resolution doesn't see sibling branches'
FROM bindings — matching SQL's per-SELECT name resolution. A
single SELECT continues to bind to the enclosing query scope, so
existing table-extraction ordering is preserved.

Adds a regression test for UNION INSERT positional pairing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Statements that reference a CTE or derived table now surface
end-to-end column flows — references substitute through the
intermediate's body projections recursively until the chain bottoms
out at a base table. Reads apply the same lens: references whose
owning binding is synthetic (Cte / DerivedTable / TableFunction) are
dropped, leaving only references to real storage.

- Cte and DerivedTable bindings carry `body_projections` captured
  from the body's `ResolvedQuery`. CTE / derived bodies now call raw
  `resolve_query` (not the QueryOutput-emitting wrapper), so no
  intermediate edges land in `flow_edges`; the body's projections
  are stored on the binding for composition to consume.
- `RawColumnRef` gains walk-time `resolved` (owning table) and
  `synthetic` (binding kind) fields. Resolution runs while scope
  state is still authoritative, which matters for multi-CTE chains
  where later bindings would otherwise ambify earlier inner refs.
- Two resolver post-passes on `into_relation_resolution`:
  - `composed_flow_edges` substitutes synthetic-owned sources via
    `body_projections`, AND'ing the outer edge's `bare` flag with
    the body item's so `passthrough through computed` becomes
    `computed`. Bounded by `MAX_COMPOSITION_DEPTH` as a cycle guard.
  - `real_column_refs` filters `column_refs` by `raw.synthetic`,
    dropping references that point at intermediates.
- The extractor's `resolve_raw_ref` collapses to a 1:1 mapping of
  `(raw.resolved, raw.parts.last())` — no scope-chain walk at
  extract time.
- `FROM cte AS c` re-binds the alias-bound Cte with the original
  CTE's `body_projections` (via `cte_body_projections`) so
  composition reaches through the alias too.
- Recursive CTEs keep their empty `body_projections` (fixpoint
  capture deferred); composition falls back to leaving the ref
  pointing at the recursive binding, which the reads filter then
  drops.

Tests added: passthrough composition, computed-kind propagation,
INSERT end-to-end, CTE chain (qualified outer), repeated CTE
reference, derived-table composition, recursive-CTE no-crash, plus
two flipped expectations confirming synthetic refs no longer surface
in reads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each ColumnRead now carries `kinds: Vec<ReadKind>` recording the
syntactic clause(s) the reference appeared in. First two variants:

- Projection — SELECT list, UPDATE SET RHS, INSERT VALUES expr,
  INSERT source SELECT projection, scalar subquery's projection.
- Filter — WHERE, HAVING, QUALIFY, JOIN ON, AsOf match condition,
  MERGE ON, CONNECT BY / START WITH, pipe-operator `|> WHERE`.

ReadKind is `#[non_exhaustive]`; future variants (GroupBy / Sort /
Window) land in later 5.6 sub-phases. `kinds` is `Vec` to make
room for USING / NATURAL JOIN merged columns (one ref → multiple
roles) without an API break.

Mechanics: the resolver gains `pending_read_kind: ReadKind`
(default Projection), set by clause-walking helpers and consumed by
`record_column_ref`. `with_read_kind(kind, f)` saves / restores the
field; `with_filter_clause(f)` is the convenience that combines
`Filter` read-kind with the existing `Predicate` scope-kind — every
former `with_scope_kind(Predicate, ...)` callsite (WHERE / HAVING /
QUALIFY / JOIN ON / AsOf / MERGE ON / CONNECT BY / pipe `|> WHERE`)
becomes `with_filter_clause(...)`.

`resolve_query` saves / restores `pending_read_kind` (reset to
Projection on entry), so a predicate subquery inherits Filter at
its boundary but its own body's projection refs stay Projection.

`RawColumnRef` mirrors `kinds`, captured at walk time. The
extractor's `collect_reads` copies `raw.kinds` into `ColumnRead`.

Tests: existing tests with WHERE / JOIN ON updated to expect
`filter_read(...)` for those refs. New tests cover same column in
projection + WHERE (two entries, two kinds), subquery WHERE refs
stay Filter without leaking outer Projection, and MERGE ON refs
classify as Filter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small cleanups, no behavior change:

- Rename `pending_scope_kind` / `pending_read_kind` to
  `current_scope_kind` / `current_read_kind`. The fields are
  context-level defaults that get stamped onto *every* subsequent
  scope push / column-ref record while in effect, not single-use
  queued values — `current_*` reflects that better, and matches the
  existing `current_projections` field's naming.

- Drop `mem::replace` for Copy-field save/restore. The original
  `let prev = mem::replace(&mut self.field, new); ... self.field = prev;`
  pattern is the textbook move for owning fields; for Copy fields
  it produces identical code as `let prev = self.field; self.field = new;`
  and reads cleaner. The remaining `mem::take` / `mem::replace` calls
  in `resolve_query` operate on `current_projections: Vec<…>` where
  moving out of the field genuinely requires it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ReadKind with `GroupBy` and `Sort`. Walking sites that now
stamp these kinds:

- visit_select: GROUP BY clause body (incl. ROLLUP / CUBE / GROUPING
  SETS modifiers) → GroupBy.
- visit_select: CLUSTER BY / DISTRIBUTE BY (Hive / Spark partitioning
  directives) → GroupBy. They decide row clustering across shuffle,
  conceptually closer to GROUP BY than to value flow.
- visit_select: SORT BY clause → Sort.
- resolve_query: top-level ORDER BY → Sort.
- visit_pipe_operator: |> ORDER BY → Sort.
- visit_pipe_operator: |> AGGREGATE's group_by_expr → GroupBy (the
  aggregate args stay Projection since they're value-producing).

Tests cover bare GROUP BY, ORDER BY, GROUP BY + HAVING combo (each
clause carries its own kind), ROLLUP modifier, and a subquery in
GROUP BY (whose own projection refs reset to Projection — the
resolve_query boundary stops kind inheritance, consistent with the
filter-subquery handling in 5.6a).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ReadKind with Window for refs inside an OVER (...) spec —
PARTITION BY, the window's ORDER BY, and frame bound expressions.
The aggregate's own argument (the `x` in `SUM(x) OVER (...)`) stays
Projection since it's value-producing.

`visit_window_spec` wraps its body in `with_read_kind(Window, ...)`,
so both inline `OVER (...)` clauses on aggregate calls and named
windows (`WINDOW w AS (...)`) get classified through the same path.

Tests cover PARTITION BY only, ORDER BY only, and the combined
form, plus implicit verification that the aggregate arg keeps
Projection kind.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends ColumnFlowKind with `Aggregation`, distinguishing aggregate
function projections (`SUM(a)`, `COUNT(DISTINCT b)`,
`PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)`, ...) from generic
expression projections. Walk-time `bare: bool` is replaced with a
3-state `kind: ColumnFlowKind` on `ProjectionItem` and `FlowEdge`,
so the classification travels through composition uniformly.

Detection uses two complementary signals so the maintenance burden
of a hard-coded name list is bounded:

1. SQL-spec structural markers, attached only to aggregates per the
   standard — `FILTER (WHERE ...)` clause, `WITHIN GROUP (...)`
   clause, and `DISTINCT` inside the function arg list. These catch
   dialect-specific aggregates not in our name list and never
   misfire on scalar functions.
2. Name match (case-insensitive, last name segment) against a union
   list of common SQL aggregates across ANSI / Postgres / MySQL /
   BigQuery / Snowflake / Redshift. Covers the bare form `SUM(x)`
   etc. that carries no structural marker. Window-only functions
   (ROW_NUMBER / RANK / LAG / LEAD / NTILE / FIRST_VALUE /
   LAST_VALUE) are intentionally excluded — they meaningfully
   aggregate only within a window.

The top-level expression check fires only for a bare aggregate
call; `SUM(a) + 1` is `BinaryOp` at the top, so it stays Computed.

Composition (`compose_flow_kinds`): Aggregation dominates either
side; Passthrough survives only when both sides are Passthrough;
otherwise Computed. A CTE that aggregates surfaces as Aggregation
even when the outer projection forwards or further computes the
result.

UPDATE SET RHS uses the same classifier; UPDATE rarely allows
top-level aggregates, but the kind field is now consistent across
all flow-emission paths.

Tests cover bare aggregate, aliased aggregate, aggregate wrapped in
BinaryOp (Computed fallback), INSERT-SELECT propagation, CTE
composition (inner-Aggregation + outer-Passthrough / outer-Computed
→ both Aggregation), and the two structural-marker paths
(DISTINCT args, FILTER clause).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
takaebato and others added 30 commits May 19, 2026 23:13
Postgres ON CONFLICT and MySQL ON DUPLICATE KEY UPDATE both live in
`Insert.on: Option<OnInsert>`, which visit_insert previously didn't
touch. So conflict-action SET targets never surfaced as writes and
EXCLUDED.<col> refs were silently dropped from flows. Wire it up:

- New `visit_insert_on` dispatches by `OnInsert` variant and reuses
  the existing `emit_assignment_flows` helper that UPDATE / MERGE
  WHEN MATCHED already use, so the per-assignment semantics stay
  identical across all three sites.
- Postgres: bind `EXCLUDED` as a synthetic derived-table with the
  INSERT target's effective columns as its schema. Refs through it
  filter out of `reads` (synthetic-binding filter) but still emit
  Persisted flow edges into the target. EXCLUDED's source role
  surfaces in flow sources as Some(EXCLUDED) — composition into the
  source projection is deferred.
- MySQL: no EXCLUDED binding. `VALUES(<col>)` is parsed as a regular
  function call, so the inner ref resolves to the INSERT target
  naturally; binding EXCLUDED would make those refs ambiguous.
- DoUpdate's optional WHERE walks via `with_filter_clause` so refs
  inside it get `ReadKind::Filter`.
- `collect_writes` extended: SET targets in DoUpdate or
  DuplicateKeyUpdate add ColumnWrites on the INSERT target table.

Existing pinned-down on_conflict tests rewritten to reflect the new
behavior; one new test for the DO UPDATE WHERE path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`emit_persisted_to_created`'s inferred-name fallback used the
current ProjectionGroup's item names, so each UNION branch
contributed its own name. Result: `CREATE TABLE dst AS SELECT a
FROM t1 UNION SELECT b FROM t2` produced two writes (dst.a and
dst.b) and two flows into different targets — diverging from the
SQL standard, which says the result schema follows the left branch
only.

Pre-compute the left branch's item names from the first group and
pair every branch's items against those same names by position.
Same shape as INSERT-SELECT-UNION's positional pairing. Behavior
for non-UNION sources is unchanged (the first group is the only
group).

Existing pinned-down test for the anomaly is rewritten to the
correct (post-fix) shape: `writes = [dst.a]`, both branches' source
flows feed `Persisted(dst.a)`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build EXCLUDED's synthetic binding with body_projections derived from
the INSERT source's projections — same shape CTEs use for flow
composition. Each source ProjectionGroup is cloned and its items
renamed positionally to the INSERT target column names, so
substitute_source's name-match lookup finds them when an
`EXCLUDED.<col>` ref appears in DO UPDATE SET.

Net effect: `EXCLUDED.<col>` flow sources compose all the way to the
source SELECT's base table refs instead of stopping at the synthetic
EXCLUDED binding. For example
  INSERT INTO t (a, b) SELECT x, y FROM s
    ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b
previously emitted `EXCLUDED.b → Persisted(t.b)`; now emits
`s.y → Persisted(t.b)` — the same base-table source as the
INSERT-pairing flow.

Behavior preserved for INSERT VALUES (no source projections →
empty body_projections → composition bottoms out at EXCLUDED). Two
new tests cover the composition: UNION source (EXCLUDED fans out
to each branch's position-N item) and aggregate source (composed
flow kind stays Aggregation via the dominant rule).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Postgres / Sqlite RETURNING projects from the target table's affected
rows, but the resolver previously didn't walk it — refs in RETURNING
were silently dropped.

New `visit_returning` helper builds one `ProjectionGroup` from the
items (reusing `build_projection_item`) and emits a `QueryOutput`
flow per item, same shape as a top-level SELECT projection. Called
from `visit_insert` / `visit_update` / `visit_delete` after their
main walk completes.

For INSERT, RETURNING walks BEFORE the on-clause so any EXCLUDED
binding isn't yet in scope — RETURNING projects from the target
table, never from the would-be-inserted pseudo-row, and an
in-scope EXCLUDED would ambify unqualified refs that collide with
INSERT column names.

Seven tests covering: INSERT VALUES + RETURNING basic / aliased /
computed / wildcard-suppressed; UPDATE / DELETE with RETURNING;
INSERT SELECT with RETURNING (source scope already popped by then).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The resolver already handles `Expr::GroupingSets / Cube / Rollup`
uniformly inside a `with_read_kind(GroupBy)` wrapper, so refs in
either modifier carry `ReadKind::GroupBy` automatically. Existing
test coverage was ROLLUP only; add three more to lock in the same
behavior for CUBE, GROUPING SETS (including the empty-set member),
and a mixed GROUP BY with plain exprs alongside ROLLUP.

GROUPING SETS surfaces the same column more than once in `reads`
when it appears in multiple sets — that's faithful to walk order
and reflects the SQL meaning that each set is its own grouping.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`collect_writes` previously left ALTER TABLE empty on the column-op
surface — only the table itself appeared in table-level writes.
Inspect the operation list and emit a ColumnWrite per affected
column for the six column-naming variants:

- AddColumn      → write(target, column_def.name)
- DropColumn     → write(target, name) for each
- RenameColumn   → BOTH old and new (rename moves data; both names
                   useful for column-history consumers)
- ChangeColumn   → old + new (same as rename), or just one when
                   the names are unchanged
- ModifyColumn   → write(target, col_name)
- AlterColumn    → write(target, column_name)

Schema-level operations (constraints, partitions, RENAME TABLE)
contribute no column writes — they still surface as a table-level
write target via the existing resolver bind.

`AlterTableOperation` is `#[non_exhaustive]`, so the wildcard arm
treats new variants as "no column contribution" until explicitly
modeled. Six tests cover the supported variants plus the
constraint case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three tests covering the catalog/schema/name qualifier path:
- `c1.s1.t1.col` resolves to a TableReference with all three
  qualifier fields populated.
- A bare column against a 3-part-qualified table inherits the full
  qualifier in the resolved ColumnReference.
- 5+ part refs (e.g. `extra.c.s.t.col`) hit the qualifier decoder's
  3-part cap and surface as table: None — struct-field access on a
  fully qualified column isn't modeled.

No resolver changes — the existing TableReference::try_from_name
already accepted up to 3 qualifier parts. Tests pin the behavior
down so a future qualifier-decoder change (4-part / catalog-vs-db
distinction / etc.) produces a clear diff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
VALUES can stand in for a row-source in three non-INSERT positions:
SELECT FROM (VALUES …) AS t(x, y) (derived), WITH cte(x, y) AS
(VALUES …) (CTE body), and rows containing column refs that resolve
via scope-chain walk-up.

Pinned-down behavior: VALUES contributes no ProjectionItems (its
literal-only rows have no source refs to capture), so flow sources
bottom out at the synthetic binding name (`t.x`, `cte.id`, `v.x`)
with no further composition. Reads stay empty because the
synthetic-binding filter drops refs through derived / CTE
bindings. A column ref inside a VALUES row still gets walked, and
the resolver's permissive scope-chain rule lets it pick up outer
FROM tables — useful for lineage from this rare-but-valid construct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WITH cte AS (...) INSERT/UPDATE/DELETE/MERGE parses as a top-level
Statement::Query wrapping a SetExpr::{Insert|Update|Delete|Merge}.
Three fixes make the column / table operation surfaces correct for
this shape:

- classify_statement and collect_writes unwrap the SetExpr DML
  wrapper and reclassify against the inner statement, so the
  StatementKind is the verb the user wrote (not Select) and writes
  follow the inner DML.
- A `FROM cte` reference now re-binds the CTE under its use-site
  name in the current scope (carrying schema + body_projections),
  not just for the aliased `FROM cte AS c` case. This gives
  unqualified refs a single in-scope candidate instead of walking
  up and ambifying against an outer-bound DML target, while keeping
  catalog-aware strictness (a Known schema still rejects unknown
  columns) and flow composition through the CTE body.
- visit_set_expr runs the wrapped DML in its own branch scope so the
  DML target binding doesn't share the enclosing query's scope with
  the CTEs — `DELETE FROM t WHERE id IN (SELECT id FROM cte)` now
  resolves the predicate `id` unambiguously to t.

New `cte_schema` accessor mirrors `cte_body_projections`.

Also adds two window-frame tests (literal `ROWS BETWEEN 3 PRECEDING
AND CURRENT ROW` and `UNBOUNDED PRECEDING/FOLLOWING`) confirming
frame bounds with no column refs add nothing while staying inside
the Window read-kind wrapper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four test-only additions pinning down standard-SQL shapes the
resolver already handles:

- Scalar subquery in projection: `SELECT a, (SELECT max(x) FROM s)
  AS m FROM t` emits the subquery's own QueryOutput edge (Aggregation
  from max) plus the outer projection's pairing — the latter is
  Computed, not Aggregation, because from the outer scope the item
  is a subquery expression and kind composition only merges through
  CTE / derived bindings, not a projection-level scalar subquery.
- Simple CASE with a column WHEN pattern (`CASE x WHEN y THEN a ELSE
  b END`): both operand `x` and pattern `y` carry Conditional;
  results stay plain Projection.
- Set-op trailing ORDER BY: the column resolves to None (Sort kind)
  because the order-by runs in the outer query scope after both
  branch scopes pop — it references a UNION output, not a base
  table. Trailing LIMIT literal adds nothing.
- IS NULL / IS NOT NULL predicates: the column is a Filter read.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collapse the column-operation surfaces to the minimal distinctions
that are standalone-actionable, recovering everything else
structurally instead of stamping it during the walk.

- ColumnFlowKind drops to {Passthrough, Transformation}. The old
  Aggregation / Computed split was lossy at the edges (window
  aggregates, value-preserving STRING_AGG) and unnecessary for the
  dependency / impact-analysis core. Composition keeps the chain a
  Transformation whenever any step transforms; the helper now reads
  "both Passthrough -> Passthrough, else Transformation".
- reads / writes become plain `Vec<ColumnReference>` (occurrence
  based, duplicates kept — a column read twice appears twice). The
  ReadKind / ColumnRead / ColumnWrite types are deleted: the
  syntactic clause tag was never standalone-actionable and several
  shapes (PIVOT, function args) had no honest classification. The
  value-vs-filter distinction is now structural — a value
  contributor is a `flows` source, a filter-only column is in
  `reads` but not `flows`.
- Nested subqueries (scalar / EXISTS / IN / derived / PIVOT) resolve
  raw via resolve_query, so no intermediate QueryOutput edge survives
  for them; only CTE / derived bindings compose end-to-end through
  stored body projections.
- VisitContext shrinks to `scope_kind` only (the structural bit that
  gates table-flow exclusion); read_kind / in_case_condition and the
  with_read_kind / with_case_condition helpers are gone. The
  aggregate-name / structural-marker classification machinery in
  projection.rs is removed with them — Transformation no longer
  sub-classifies.

Docs (crate-level, README, example) and the whole-value test suite
are rewritten to the new model; test helpers reduce to read / write /
unresolved / col returning ColumnReference and flow_passthrough /
flow_transformation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TableRead / TableWrite were single-field newtypes wrapping a
TableReference, reserved for "future positional / usage enrichment
(FROM vs Predicate vs Join)". That distinction is the value-vs-filter
one the model now recovers structurally — a table is a `flows` source
if it feeds the write target, a predicate-only table is in `reads`
but not `flows` — so the wrapper guards a slot that will not be
filled. They also weren't `#[non_exhaustive]`, so they bought no
SemVer headroom for adding fields later.

Drop both structs; `reads` / `writes` become `Vec<TableReference>`,
mirroring the column surfaces (`Vec<ColumnReference>`). Construction
collapses to `resolution.read_tables()` / `write_tables()` directly,
and the crud extractor stops unwrapping `.table` off each entry.
Tests fold the `read` / `write` helper aliases into the existing
`table(...)` builder.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adopt data-lineage terminology for the value-movement surface and
shed the redundant `Statement` prefix on the per-statement result
types, aligning the public API with the project's dependency /
lineage framing.

- `flows` surface → `lineage`. The directed `source → target` edges
  are column / table data lineage; "lineage" is the standard,
  discoverable term. Edge types follow: `ColumnFlow` →
  `ColumnLineageEdge`, `TableFlow` → `TableLineageEdge` (an edge is a
  lineage *edge*, not a lineage); `ColumnFlowKind` →
  `ColumnLineageKind`.
- `StatementTableOperations` → `TableOperation`,
  `StatementColumnOperations` → `ColumnOperation`. The `Statement`
  prefix added nothing — every extractor returns `Vec<Result<X>>`
  (one element per statement), so per-statement-ness is already in
  the return shape. Singular matches the per-statement result unit
  (cf. the existing `TableExtraction`). The container stays
  `*Operation`, not `*Lineage`: it also carries `reads` / `writes` /
  `statement_kind` and is non-empty for statements with no lineage
  at all (DROP / TRUNCATE / DELETE).
- Internal emit / extract helpers and tests rename in step
  (`extract_table_flows` → `extract_table_lineage`, etc.). The
  resolver-internal `FlowEdge` / `FlowTargetSpec` keep their names —
  they are not part of the public surface.
- `ColumnTarget` / `Persisted` / `QueryOutput` and the
  `extract_*_operations` function names are intentionally unchanged.

Docs (crate-level, README, module headers, the WildcardSuppressed
diagnostic message) move to the lineage wording. CLAUDE.md is also
brought up to date with the prior column-simplification work it had
fallen behind on (ReadKind / ColumnRead / TableRead / Aggregation /
Computed references, the dropped VisitContext fields, and the removed
aggregate-classification note).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the single `Diagnostic` / `DiagnosticKind` with per-granularity
types so a result can only represent the conditions its extraction can
actually produce.

- `TableLevelDiagnostic` / `TableLevelDiagnosticKind { UnsupportedStatement }`
  for the table-level surfaces; `ColumnLevelDiagnostic` /
  `ColumnLevelDiagnosticKind { UnsupportedStatement, WildcardSuppressed,
  AmbiguousColumn, UnresolvedColumn }` for `extract_column_operations`.
  The split is by type, so a table-level result literally cannot carry a
  column-only condition (e.g. a suppressed wildcard, which leaves column
  lineage incomplete but doesn't affect the table set).
- The resolver produces the column-level superset; table-level surfaces
  project it down through `ColumnLevelDiagnostic::to_table_level`, an
  exhaustive match that drops the column-resolution kinds and forces a
  decision when a new kind is added. This fixes the prior leak where a
  catalog-backed `extract_table_operations` surfaced `AmbiguousColumn` /
  `UnresolvedColumn` and every `SELECT *` surfaced `WildcardSuppressed`
  at table granularity, where neither is meaningful.
- `ColumnLevelDiagnosticKind` documents the tool-side (coverage) vs
  input-side (resolution) split: `UnsupportedStatement` / `Wildcard` are
  gaps on our side; `Ambiguous` / `Unresolved` are gaps in the input
  (a real engine would also reject them) — not lint, just an annotation
  of why a reference was left `table: None`.
- `CrudTables` gains a `diagnostics: Vec<TableLevelDiagnostic>` field,
  forwarded from the underlying table-level extraction (it previously
  dropped them).

Also drop `#[non_exhaustive]` from `StatementKind` / `ColumnLineageKind`
(the diagnostic enums are introduced without it): while the crate is
pre-1.0, adding a variant should be a visible breaking change so
consumers re-acknowledge the new case instead of silently routing it to
a wildcard arm. Internal matches stay exhaustive. Re-add at the 1.0
freeze (removing it later is non-breaking; adding it is breaking, so the
1.0 boundary is the place).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The crate docs and README explained that ambiguous / unresolved
column diagnostics are suppressed without a catalog, but not the
consequence: unqualified columns across multiple in-scope tables
(`SELECT x FROM a JOIN b`) resolve to `table: None`, so column
lineage degrades catalog-free while table-level extraction stays
robust. Spell that out, and note the suppression is to avoid flooding
the output with noise (every `Unknown` schema could contain anything).

Also drop the stale README "Aggregate detection" limitation — flow
kinds collapsed to Passthrough / Transformation, so there is no
aggregate name-list classification to misfire anymore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`Persisted` was slightly wrong for views — a view doesn't persist
data, only its definition. The model treats a view's columns
identically to a table's (a `table`-qualified `ColumnReference`), so
every non-`QueryOutput` target is uniformly "a column of a named
relation" (base table or view). `Relation` names that accurately and
matches the crate's existing relation vocabulary (`RelationSchema`,
the `relation` module, `TableReference`); `QueryOutput` continues to
carve out the transient top-level-SELECT case.

Public-only rename: the resolver-internal `FlowTargetSpec::Persisted`
keeps its name (not part of the public surface) and still maps to the
public `ColumnTarget::Relation` at extraction time. Docs, the example,
and tests follow; the test helper `persisted(...)` becomes
`relation(...)`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier the public surface moved to lineage / `Relation` vocabulary
but the resolver internals still spoke "flow" / "Persisted", so a
reader greps one name and finds the other. Bring the internals in
line (all `pub(crate)` / private — no public API change):

- `FlowEdge` → `LineageEdge`, `FlowTargetSpec` → `LineageTargetSpec`,
  and its `Persisted` variant → `Relation` (mirrors the public
  `ColumnTarget::Relation`; the extractor still maps internal → public
  at conversion time).
- `flow_edges` → `lineage_edges`, `push_flow_edge` →
  `push_lineage_edge`, `composed_flow_edges` → `composed_lineage_edges`,
  `compose_flow_kinds` → `compose_lineage_kinds`,
  `persisted_target_writes` → `relation_target_writes`,
  `emit_persisted_to_created` → `emit_relation_to_created`.
- Module `resolver/flow.rs` → `resolver/lineage.rs` (`mod flow` →
  `mod lineage`); doc-link `[`flow`]` updated.
- Doc / comment prose follows (lineage edges, relation targets); the
  public crate-doc limitation heading "Flow kind" → "Lineage kind".
  Also fixes a stale "Aggregation dominates" note left from the
  flow-kind collapse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rename the test builders to match the lineage / kind vocabulary:
column-level `flow_passthrough` / `flow_transformation` →
`passthrough` / `transformation` (the kind is the distinguishing
part), and the table-level `flow(src, tgt)` builder → `edge(src, tgt)`
(`lineage(...)` would collide with the `lineage:` field name).
Test-only; no behavior change.

Descriptive test *function* names that use "flow" as plain English
(e.g. `predicate_subquery_does_not_feed_flow`, `..._emits_no_flow`)
are left as-is — there "flow" reads as the behavior, not a type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Purge "flow" from test function names for vocabulary consistency:
noun uses → "lineage" / "lineage edge" (e.g. emits_no_flow →
emits_no_lineage, emits_one_flow_per_branch →
emits_one_lineage_edge_per_branch); verb uses rephrased
(cte_data_flows_through_to_write_target → cte_data_reaches_write_target,
scalar_subquery..._flows_only_to_outer → ..._feeds_only_outer,
..._refs_surface_and_flow_as_transformation → ..._refs_surface_and_transform).
Also the invariants helper flow_relation_table → edge_relation_table
and the test relation_flow_targets... → relation_lineage_targets....
Test-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Finish the lineage-vocabulary alignment: rename loop-variable / doc-
example bindings `flow` → `edge` (they iterate `ops.lineage`), and
reword comment prose off "flow" — noun uses → "lineage edge" /
"lineage", verb uses reworded ("flows into" → "feeds into", "data
flows through" → "data moves through", "Composition flows past" →
"Composition reaches past"). Examples and the crate-doc example are
included. The only "flow" left is sqlparser's `ListAggOnOverflow`
(unrelated — over*flow*). No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the last stale "flow" wording from the README prose (the code
examples already use the current API): the column feature/usage notes
now say `lineage` edges "carry a kind" rather than a "flow-kind", and
the recursive-CTE limitation says "lineage composition". Fixes a minor
grammar slip ("`lineage` form" → "forms"). No example changes — those
already match the renamed `lineage` field and `ColumnLevelDiagnosticKind`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`ResolvedQuery.scope_id` was never read (only stored), so drop the
field — `resolve_query` keeps the `push_query_scope` call for its
scope-stack side effect but no longer binds the returned id.

Removing it also clears the last genuinely-needed `#[allow(dead_code)]`.
The other six allows (on ScopeKind / RelationSchema / Column / Binding /
Scope / Resolution) were stale — nothing under them is actually dead —
so drop all seven. Now the resolver carries no dead-code suppressions,
so a future unused field/variant surfaces immediately instead of hiding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two internal one-field types that no longer earn their keep:

- `VisitContext` had shrunk to a single `scope_kind` field after the
  read_kind / in_case_condition removal. Inline it as a `scope_kind`
  field on `Resolver` and drop the generic `with_context` save/restore;
  `with_filter_clause` now saves/restores `scope_kind` directly and
  `with_branch_scope` reads it directly. The useful scoped helpers stay.
- `Column { name: Ident }` was a pure newtype over `Ident`. Use `Ident`
  directly in `RelationSchema::Known(Vec<Ident>)` — the resolver only
  needs the column's identity, and the richer `ColumnSchema` already
  exists on the catalog side for callers that have types.

Net -60 lines, no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`ColumnReference` is a fundamental identity type (it even wraps
`Option<TableReference>`), not column-extraction-specific — but it
lived in `column_operation_extractor.rs` while its sibling
`TableReference` sat in `relation.rs`. Move it so both identity-only
reference types share the relation-model module (which now lives up to
its "relation model types" doc). The lineage-specific types
(`ColumnTarget` / `ColumnLineageEdge` / `ColumnLineageKind`) stay with
the column extractor.

Public paths are unchanged — both are re-exported at the crate root —
so `sql_insight::ColumnReference` is unaffected. Also unifies the
now-imported `Ident` path in relation.rs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`TableReference` / `ColumnReference` are not relations — a relation is
a tuple set, and these aren't even schemas (no attributes, no domains).
They are *qualified names* that denote a table / column in a catalog:
pure identity. Rename the module `relation` → `reference` so it matches
the `*Reference` types it holds and what they actually are; update the
module doc accordingly. (`RelationSchema` and `ColumnTarget::Relation`
keep "relation" — those genuinely concern a relation's schema / a named
relation.)

Internal move plus the public module path `sql_insight::relation::*` →
`sql_insight::reference::*`; the crate-root re-exports
(`sql_insight::TableReference` etc.) are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`BindingKey` modelled quoting as a separate namespace (`Quoted` vs
`Unquoted` variants), so `"id"` and unquoted `id` never matched even
though they denote the same column. That conflated quoting with
identity. Quoting actually controls only whether the name is
case-folded: fold unquoted names (lowercase, PostgreSQL / MySQL
convention), keep quoted ones exact, then compare the normalized
strings. Now `"id"` == `id` and `"ID"` != `id`, matching SQL semantics.

Collapse the enum to a `BindingKey(String)` newtype holding the
normalized form. All call sites already went through `from_ident` +
equality / map-key, so the change is localized. (Fold *direction* is
still a dialect approximation — orthogonal, unchanged.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous comment justified the rule as "PostgreSQL / MySQL
convention", which is imprecise (MySQL doesn't fold — it compares
case-insensitively, and treats quoting as escaping). Reframe around the
actual basis after surveying the supported dialects:

- unquoted → lowercase yields case-insensitive matching, the common
  denominator across every supported dialect except ClickHouse (which
  is over-matched, soundly); fold direction only matters at the
  quoted/unquoted edge, and lowercase follows the popular majority.
- quoted → exact is the ANSI / PostgreSQL behavior; the MySQL /
  BigQuery / SQLite family treat quoting as escaping, so this is
  stricter for quoted names — accepted as rare.

Also notes this is one fixed rule (not varied by dialect or
table-vs-column) and that faithful per-dialect resolution is deferred.
Doc-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A qualified ref like `u.a` over `FROM t1 AS u` previously surfaced the
alias `u` as its table; it now canonicalizes to the binding's alias-free
real table `t1`, matching how unqualified refs already resolve. Synthetic
bindings (CTE / derived / table function) keep the qualifier verbatim so
lineage composition can re-find them by name, and multi-segment
(schema/catalog-qualified) names pass through untouched.

Also bundles two related tidy-ups:

- Change `ColumnSchema.name` from `Ident` to `String`. A catalog provides
  column identities and matching is case-insensitive by default, so the
  public Catalog type no longer needs sqlparser's quote-style / span.
- Document on `Catalog::columns` that identifier case-folding is the
  implementation's responsibility — the resolver passes table names as
  written and does not normalize them.

Adds tests for alias canonicalization and for case / quote matching at
the catalog and CTE-composition surfaces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two clarifications to the public Catalog docs:

- The catalog is open-world: a table it returns no columns for means
  "schema unknown", not "nonexistent", so a misspelled / unknown table
  name is never flagged — it surfaces as an ordinary read / write with an
  unknown schema. Column-level strictness only applies where a known
  schema is in scope.
- Identifier case-folding is the implementation's responsibility only for
  the table lookup itself; the returned column names are matched against
  SQL column references by the resolver's own fixed rule (unquoted folds
  to lowercase, quoted is exact), independent of the implementation and
  the dialect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Newer stable clippy flags the `if ct.query.is_some()` inside the
`Statement::CreateTable` arm (collapsible_match). Move the condition into
a match guard. Behaviour is unchanged: a plain `CREATE TABLE` (no query)
now falls through to the trailing no-op arm, exactly as the empty `if`
did before — only CTAS emits writes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant