[Hackathon] feat: permission-gated data cleaning agent for Texera #5101
Open
eugenegujing wants to merge 14 commits into
Conversation
First checkpoint for the DataGuard data-cleaning agent: adds the shared types contract (DataQualityIssue, FixProposal, DecisionLogEntry, AutoAllowRule, plus supporting unions) and the read-only profile_dataset scanner with four heuristics (missing, placeholder, duplicate-ID, out-of-range). DataGuard auto-launches when a dataset is added to the workflow and asks Claude-Code-style approval before applying each fix; see README_DataGuard_Texera.md.
Finishes the agent-side DataGuard MVP: LLM-driven suggest_fix, mutating apply_fix, the requestApproval/awaitDecision/resolveDecision gating layer on TexeraAgent (pendingApproval step + WS decision message + auto-allow "remember" rules), write_decision_log (RFC-4180 CSV), bias-check (per-group retention), and a ~50-row polluted diabetes demo CSV. 122 new test cases, all four DataGuard tools registered into createTools().
Adds the chat-panel UX: a standalone PermissionPromptComponent that renders inline on any ReActStep with pendingApproval (Allow / Deny / Modify / Allow & remember), AgentService.sendDecision wiring the new WS message, and DataGuardAutoTriggerService that fires when a dataset-reading operator (CSVFileScan, TableFileScan, JSONFileScan, ParallelCSVFileScan) is added to the workflow. AgentPanelComponent subscribes and surfaces a notification — full agent-creation flow remains a follow-up.
The dataguard/ folder had 8 source + 8 test files in one directory and was getting hard to scan. Move the seven DataGuard-specific test files into a __tests__/ subdirectory and update their relative imports (one extra ../). The types/dataguard.test.ts stays put — it's in src/types/, not under dataguard/, and that folder isn't crowded. bun test still auto-discovers; all 159 tests still pass.
…nd-to-end MVP polish
Major iteration on top of the four committed DataGuard MVP commits. Ships the
auto-trigger checklist as the primary UX (chat flow stays wired for any future
DataGuard-via-LLM path but isn't used by the user-facing flow).
Detector model: five categories (was six). The z-score outlier detector was
dropped — clustered legitimate extremes (e.g. sustained high glucose readings)
were being flagged en masse with no good way for the user to say "those are
real." The old `out_of_range` was renamed `outlier` and keeps its
validRanges-based definition. `duplicate_id` now auto-infers the ID column
from name patterns (`id`, `*_id`, `*Id`, `id_*`, `uid`) so the auto-trigger's
empty-body `/scan` still catches duplicates without UI configuration.
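The name-pattern inference described above might look roughly like this (a hedged sketch; the real `inferIdColumn` lives in the agent-service and its exact signature is an assumption):

```typescript
// Illustrative sketch of the id-name heuristic: "id", "uid", "*_id",
// "id_*", and camelCase "*Id" all count as ID columns.
function inferIdColumn(columns: string[]): string | undefined {
  const isIdLike = (name: string): boolean => {
    const n = name.toLowerCase();
    return (
      n === "id" ||            // exact "id"
      n === "uid" ||           // exact "uid"
      n.endsWith("_id") ||     // "sample_id"
      n.startsWith("id_") ||   // "id_card"
      /[a-z]Id$/.test(name)    // camelCase "userId"
    );
  };
  return columns.find(isIdLike); // first match wins; undefined if none
}
```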
Fix-operation model: `flag` removed entirely (was a no-op against the data;
caused LakeFS "No changes detected" errors after Apply). `warning` added to
RiskTier — concrete fix, always prompts, no "remember." `replace_value` now
supports `rowIndices` targeting (deterministic; wins over value-based `match`),
which fixes a class of silent no-ops where LLM-rounded match values didn't
equal the actual cells. Suggest-fix prompt explicitly passes
`affectedRowIndices` and instructs `rowIndices`-based replace for outliers.
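A minimal illustration of the `rowIndices`-wins-over-`match` precedence (the interface and function shape here are hypothetical, not the actual applier):

```typescript
// Sketch: deterministic rowIndices targeting takes precedence over
// value-based match, avoiding the silent no-op where an LLM-rounded
// match value never equals the actual cell contents.
interface ReplaceValueFix {
  column: string;
  newValue: string;
  rowIndices?: number[]; // deterministic path, wins when present
  match?: string;        // value-equality fallback
}

function applyReplaceValue(rows: Record<string, string>[], fix: ReplaceValueFix): number {
  let affected = 0;
  if (fix.rowIndices !== undefined && fix.rowIndices.length > 0) {
    for (const i of fix.rowIndices) {
      if (rows[i] !== undefined) {
        rows[i][fix.column] = fix.newValue;
        affected++;
      }
    }
  } else if (fix.match !== undefined) {
    // Fallback: only cells byte-equal to `match` are touched, which is
    // exactly where rounded LLM values used to cause silent no-ops.
    for (const row of rows) {
      if (row[fix.column] === fix.match) {
        row[fix.column] = fix.newValue;
        affected++;
      }
    }
  }
  return affected;
}
```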
Permission contract: `/apply-batch` body schema is `verdict: "allow" | "deny"`
with `additionalProperties: false`. The Elysia app is built with
`normalize: false` so unknown legacy fields aren't silently stripped before
validation. Global onError converts `code === "VALIDATION"` to HTTP 400.
Runtime check rejects `{verdict: "deny", remember: true}`. WS decision handler
narrowed the same way. `modifiedAction` removed everywhere, including the
decision-log CSV header (9 columns now).
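The same contract rules, sketched framework-free (the shipped version uses Elysia's TypeBox body schema with `additionalProperties: false` plus a runtime check; this standalone validator only illustrates the rejection rules, and its name and return shape are assumptions):

```typescript
type Verdict = "allow" | "deny";
interface DecisionBody { verdict: Verdict; remember?: boolean; }

// Mirrors additionalProperties:false plus the rule the schema alone can't
// express: "remember" is contradictory with a deny verdict.
function validateDecision(
  raw: unknown
): { ok: true; body: DecisionBody } | { ok: false; status: 400; error: string } {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, status: 400, error: "body must be an object" };
  }
  const obj = raw as Record<string, unknown>;
  const allowed = new Set(["verdict", "remember"]);
  for (const k of Object.keys(obj)) {
    // Rejects legacy fields like modifiedAction instead of stripping them.
    if (!allowed.has(k)) return { ok: false, status: 400, error: `unknown field "${k}"` };
  }
  if (obj.verdict !== "allow" && obj.verdict !== "deny") {
    return { ok: false, status: 400, error: "verdict must be allow | deny" };
  }
  if (obj.remember !== undefined && typeof obj.remember !== "boolean") {
    return { ok: false, status: 400, error: "remember must be a boolean" };
  }
  if (obj.verdict === "deny" && obj.remember === true) {
    return { ok: false, status: 400, error: "deny + remember is contradictory" };
  }
  return { ok: true, body: { verdict: obj.verdict, remember: obj.remember as boolean | undefined } };
}
```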
Checklist UX:
- Auto-trigger CSV-only (`CSVFileScan`, `ParallelCSVFileScan`). JSON/Parquet
were trigger-set members but `loadFromOperatorFile` blindly Papa-parsed,
producing garbage.
- Floating, draggable panel via Angular CDK; cdkDragBoundary="body" so it
can't get lost behind the toolbar.
- 📍 locate button per row: cycles through `affectedRowIndices` on repeat
clicks (per-row cursor in a `Map<issueId, number>`, wraps at length).
Tooltip previews the next click position.
- Result-panel integration via `DataGuardRowNavigatorService` (ReplaySubject
with 500ms TTL for cold-mount). `ResultTableFrameComponent` adds a private
`pageRendered$` Subject so the highlight only fires after the new page
actually renders, not on an arbitrary 100ms timer. Cross-operator race,
viewport-resize during page swap, columnDef-vs-header drift, destroyed-view
NG0911 — all guarded.
- After-Apply auto-rescan: the panel re-runs `/scan` against the new dataset
version so users see real residue instead of stale entries.
- Floating reopen icon when panel is closed & shield is ON. Click → fresh
scan. Pipeline concurrency is gated by `currentPipeline: Promise<void> |
null`: auto-trigger drops silently on slot conflict (spam suppression);
user-initiated awaits the in-flight pipeline (with a "queuing your scan…"
toast) — at most one `/scan` POST in flight at a time.
- Toolbar 🛡 shield (per-workflow ON/OFF, localStorage).
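The `currentPipeline` gating described above can be sketched as a small class (names and structure are illustrative, not the actual service):

```typescript
// Single-slot scan pipeline: auto-triggers drop on conflict, user-initiated
// scans queue behind the in-flight pipeline. At most one scan runs at a time.
class ScanPipelineGate {
  private currentPipeline: Promise<void> | null = null;

  /** Auto-trigger path: drop silently if the slot is taken (spam suppression). */
  tryRunAuto(scan: () => Promise<void>): boolean {
    if (this.currentPipeline !== null) return false;
    this.run(scan);
    return true;
  }

  /** User-initiated path: await the in-flight pipeline, then run. */
  async runUserInitiated(scan: () => Promise<void>): Promise<void> {
    while (this.currentPipeline !== null) {
      await this.currentPipeline; // a "queuing your scan…" toast would fire here
    }
    await this.run(scan);
  }

  private run(scan: () => Promise<void>): Promise<void> {
    const p = scan().finally(() => {
      if (this.currentPipeline === p) this.currentPipeline = null; // release slot
    });
    this.currentPipeline = p;
    return p;
  }
}
```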
Backend: server normalize:false; `/dataguard/load-demo-dataset` deleted as
dead code (frontend never called it). DataGuardSession's `flaggedRows` field
plus the post-apply `acknowledgedIssues` split removed alongside `flag`.
`missing-detection.ts` is the single source of truth for missing/placeholder
checks; `applyFix` threads `ScanOptions.missingTokens` through to `impute`.
With-approval gates `warning` identically to `high` (never auto-allows, even
with a remembered rule).
Frontend service ownership: the auto-trigger orchestration subscription moved
off `AgentPanelComponent` onto `DataGuardChecklistComponent` (the natural
consumer of its output). `selectedCount`/`deniedCount`/`pendingCount` are
cached on each state push instead of being three filter walks per CD tick.
README_DataGuard.md is the consolidated feature spec, kept in repo root.
Tests: 199 pass / 0 fail (419 expect calls), agent-service typecheck clean,
frontend `tsc --noEmit` clean. New regression-locks include: `replace_value`
with `rowIndices` (no more byte-identical export), `inferIdColumn` for all
id-name patterns, "clustered large readings are NOT auto-outliers", warning
tier prompts even with remembered rule, modify-verdict + modifiedAction +
`{deny, remember:true}` rejection, pipeline serialization (two user-initiated
rescans never invoke `/scan` concurrently), per-row locate cycle math.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The demo CSVs (diabetes_messy + the six single-category fixtures) and their README are kept on disk for local hand-testing but shouldn't live in upstream. git rm --cached preserves the files locally; new .gitignore prevents accidental re-staging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removed the LiteLLM proxy / Python venv / bun-install setup instructions from README_DataGuard.md. The flow walkthrough that lived under §14.3 is preserved as the new §14 "End-to-end flow" so the doc still tells a user what to expect after they click around. Downstream section numbers are unchanged (§15 Testing → §19 Post-MVP follow-ups). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… enforcement

Substantial follow-up iteration on the locate UX, file-type coverage, and contract hardening. All work on top of feat/dataguard-mvp commit 5cd0160.

## File-type coverage (auto-trigger dispatcher)

`loadFromOperatorFile` refactored into a parser-dispatcher (`PARSERS` map + `DatasetParser` type). Three operator types are now in the trigger set:

- CSVFileScan / CSVOldFileScan → parseCsv (CSVOld threads customDelimiter from operator properties so `;` / `\t` / etc. are honored)
- JSONLFileScan → parseJsonl (new module: nested objects flatten to dot-notation columns; arrays are stringified as single cells; non-object lines warn-and-skip; CRLF tolerated; collision rule: nested-owned paths always win over literal-dotted top-level keys regardless of JSON source order, via a two-pass scan)

Format-aware write-back: server-side GET /dataguard/export-jsonl serializes the in-memory session back to JSONL (canonical column key order; `undefined` → `null` for a lossless round-trip). Frontend writeBackAsNewVersion branches on the source operator type to pull the right export endpoint.

ParallelCSVFileScan dropped from the trigger set — Texera disables it in the operator registry (LogicalOp.scala:171 commented out). One-line re-add if/when re-enabled.

## Locate feature — split path

CSV operators (`CSVFileScan`, `CSVOldFileScan`) use the original simple synchronous index-based locate. Texera runs them with a single worker so display order matches file-byte source order; the index DataGuard computed against parseCsv's output is correct as-is. The cursor advances synchronously; navigate is fire-and-forget.

JSONLFileScan uses a new fingerprint + flash-confirmed path because Texera parallelizes JSONL scans and shuffles display order:

- `rowFingerprint(row, columns)` is byte-identical on agent-service and frontend: alphabetical column sort + per-cell `JSON.stringify(String(v))` + empty-string concat. The `String()` step before stringify is critical — Texera widens the schema to string when a column has mixed types, so `JSON.stringify(45)` vs `JSON.stringify("45")` would mismatch without it.
- DataQualityIssue carries `affectedRowKeys[]` 1:1 aligned with `affectedRowIndices[]`. The profiler emits both. Frontend `findRowByKey` scans display rows for the matching fingerprint, paginates up to 10 pages, and falls back silently to the index path if not found.
- Result-table-frame `currentLocateToken` cancellation kills the rapid-click race — every async resumption checks the captured token and bails (emitting flashResult: false exactly once so the awaiter's Promise resolves rather than hanging).
- `navigate()` returns Promise<boolean>; it subscribes to flashResult$ via firstValueFrom(race(filtered, timer(36s))) BEFORE publishing to nav$, so synchronous fast-path emissions are caught.
- The checklist component awaits the Promise and only advances `locateCursors` on flashed === true. Empty clicks (timeout / supersede / out-of-bounds) leave the cursor put, eliminating the "skipped a row then jumped back" UX. The per-row cursor (`Map<issueId, number>`) survives benign state re-emits via `purgeStaleCursors` (only ids not in the live entry set are evicted — never a wholesale clear, which previously corrupted cursors on benign setState patches between clicks).

## Detector + ID-inference improvements

The `inferIdColumn` heuristic for the auto-trigger's empty-body /scan now recognizes dotted-notation names produced by the JSONL flatten (`user.id`, `customer.uid`, `nested.user.id`, `Account.ID` case-insensitive) in addition to underscore variants (`sample_id`, `userId`, `id_card`, etc.). Without this, dup-ID detection silently no-op'd on JSONL data.

## Contract enforcement

Apply-batch body schema strictly rejects:

- verdict: "modify" (cut by #11a — silently lied to users)
- the modifiedAction field
- {verdict: "deny", remember: true}

Elysia is built with `normalize: false` so unknown legacy fields aren't silently stripped before validation. Global onError converts VALIDATION to HTTP 400. The WS decision handler is narrowed similarly.

## Test counts

- agent-service: typecheck clean / 217 pass / 0 fail / 457 expect
- frontend: tsc --noEmit clean
- frontend specs: 36 navigator + 15 jsonl + 2 checklist + 7 auto-trigger

## CONTRIBUTING compliance

- ✅ sbt scalafixAll --check
- ✅ sbt scalafmtCheckAll
- ✅ yarn format:ci (after yarn format:fix on 15 files)
- ✅ Apache license headers on all new files (.ts/.html/.scss)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
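Assuming plain JS objects for rows, the fingerprint contract above reduces to a few lines (a sketch of the rule as stated, not the shipped code):

```typescript
// Cross-service fingerprint: alphabetical column sort, per-cell
// JSON.stringify(String(v)), empty-string concat. Must be byte-identical
// on agent-service and frontend for findRowByKey to match.
function rowFingerprint(row: Record<string, unknown>, columns: string[]): string {
  return [...columns]
    .sort() // both sides agree on order regardless of schema order
    .map((col) => JSON.stringify(String(row[col]))) // String() first: Texera
    // widens mixed-type columns to string, so 45 and "45" must match
    .join("");
}
```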
…s, no-op fix guard

- profile-dataset: gate auto-IQR behind `enableOutlierDetection` (default false). validRanges still fires unconditionally as a per-column override. The IQR detector code is preserved so re-enabling is a single-flag flip once the convergence behavior is fully refined.
- apply-fix replace_value: skip writes when the proposed replacement equals the current cell (cellEquals guard, on both the rowIndices and value-match branches). Fixes iterative-cleanup convergence — without the guard, v3 onwards re-applied byte-identical CSVs and LakeFS aborted the commit with "No changes detected."
- locate cycle: add `rowKeyOccurrence` + `findNthRowByKey` so clicking "locate" on duplicate-fingerprint rows cycles through all matches rather than always pinning to the first occurrence. A cumulative match counter surfaces "2 of 4" in the result panel.
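The `findNthRowByKey` occurrence-cycling might be sketched like this (the signature is assumed; only the name comes from the commit message):

```typescript
// When several display rows share one fingerprint (duplicate-ID clusters),
// target the Nth occurrence instead of always pinning the first.
function findNthRowByKey(
  displayFingerprints: string[], // precomputed fingerprints of on-screen rows
  key: string,
  occurrence: number             // 0-based: 0 → first match, 1 → second, …
): number {
  let seen = 0;
  for (let i = 0; i < displayFingerprints.length; i++) {
    if (displayFingerprints[i] === key) {
      if (seen === occurrence) return i;
      seen++;
    }
  }
  return -1; // fewer matches than requested: caller falls back or toasts
}
```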
Texera's JSONLScanSourceOpExec parses through Jackson; `JsonNullNode.asText()`
returns the literal string `"null"` (4 chars), not Java null. The result-panel
row for a source `{"score": null}` therefore arrives at the frontend as
`{score: "null"}` while the profiler preserved real JS null — fingerprints
diverged, `findRowByKey` returned -1 for every null cell, and the silent
byte-order index fallback in result-table-frame landed on a worker-shuffled
display row (e.g., U037 score=88 instead of U007 Grace score=null).
Fix: normalize all missing forms to a single bare `null` fingerprint token
on both sides — real null, undefined, NaN, empty/whitespace strings, the
literal string `"null"`, and the standard missing-token set (`na`, `n/a`,
`nan`, `none`), all case- and trim-insensitive. The profiler routes through
the shared `isCellMissing` predicate; the frontend mirrors with an inline
duplicate (no shared module across services).
Round-1..5 locate invariants intact: `String(v)` numeric coercion preserved
on the non-missing path, `rowKeyOccurrence` cycling untouched, CSV path
unaffected.
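A sketch of the shared missing-cell normalization (the token set and predicate name come from the commit message; the module layout is an assumption):

```typescript
// Every missing form collapses to one bare "null" fingerprint token, so
// real JS null on the profiler side matches Jackson's literal "null"
// string on the result-panel side.
const MISSING_TOKENS = new Set(["", "null", "na", "n/a", "nan", "none"]);

function isCellMissing(v: unknown): boolean {
  if (v === null || v === undefined) return true;
  if (typeof v === "number" && Number.isNaN(v)) return true;
  const s = String(v).trim().toLowerCase(); // case- and trim-insensitive
  return MISSING_TOKENS.has(s);
}

function fingerprintCell(v: unknown): string {
  // Non-missing path keeps the String(v) numeric coercion from earlier rounds.
  return isCellMissing(v) ? "null" : JSON.stringify(String(v));
}
```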
When the fingerprint walk exhausts every page without finding the target
row, result-table-frame used to silently fall back to highlighting the
file-byte-order index — which is right for single-worker CSV but wrong
for multi-worker JSONL (worker-shuffle lands an unrelated row at that
position). Earlier versions toasted on every miss and were too noisy,
but that was because the pre-`isCellMissing`-fingerprint contract had
been mismatching 100% of the time on null cells; now that round-6
normalises null/"null"/missing forms across the wire, a real miss only
happens when the data has drifted from the scan.
The table frame now emits `flashed: false` on walk exhaustion; the
checklist caller toasts ("data may have changed since the scan — try
Scan again") only when a rowKey was sent, so the CSV no-rowKey path
stays silent and rapid-double-click supersession doesn't spam.
Tests: extend the existing "JSONL cursor stays put on false" spec to
assert the toast fires, and add a CSV-path test that asserts it does
NOT fire when no rowKey was sent.
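The toast rule reduces to a tiny decision function (illustrative names; the real caller lives in the checklist component):

```typescript
// Toast only when a rowKey was actually sent (the JSONL fingerprint path);
// the CSV no-rowKey path and rapid-click supersession stay silent, and the
// cursor only advances on a confirmed flash.
function handleLocateResult(
  flashed: boolean,
  rowKeySent: boolean,
  toast: (msg: string) => void
): "advance" | "hold" {
  if (flashed) return "advance"; // cursor moves to the next affected row
  if (rowKeySent) {
    toast("data may have changed since the scan — try Scan again");
  }
  return "hold"; // empty click: cursor stays put
}
```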
After iterative Apply rounds normalize a string column to its canonical
form (e.g., `region: "South"` everywhere), the inconsistent_label detector
can keep re-flagging the column whenever the LLM proposes a mapping that
includes the canonical entry itself (`{south: "South"}` against rows that
are already `"South"`). The standardize branch was incrementing
`rowsAffected` on every mapping-key hit regardless of whether
`mapping[v] === v`, so the frontend pushed a byte-identical CSV/JSONL to
LakeFS and got "No changes detected in dataset. Version creation aborted"
— the same convergence failure the round-yesterday `cellEquals` guard
fixed for replace_value.
Add the same guard to `case "standardize"`. `affected` increments only
when the cell genuinely changes.
trim_whitespace already guards (`trimmed !== v`); impute is safe by
construction in the normal path (only missing cells are visited); the
all-missing-column edge case is a known follow-up and not in scope here.
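The standardize guard in isolation (a sketch; `applyStandardize` is a hypothetical stand-in for the real `case "standardize"` branch):

```typescript
// Convergence guard: a mapping hit where mapping[v] === v is a no-op and
// must not count as a change, or the frontend pushes a byte-identical file
// and LakeFS aborts with "No changes detected".
function applyStandardize(
  rows: Record<string, string>[],
  column: string,
  mapping: Record<string, string>
): number {
  let affected = 0;
  for (const row of rows) {
    const v = row[column];
    const mapped = mapping[v];
    if (mapped !== undefined && mapped !== v) { // skip identity mappings
      row[column] = mapped;
      affected++; // only genuine changes are counted
    }
  }
  return affected;
}
```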
Codecov Report — patch coverage and impacted files:
@@ Coverage Diff @@
## main #5101 +/- ##
============================================
+ Coverage 42.85% 43.75% +0.90%
- Complexity 2207 2208 +1
============================================
Files 1045 1054 +9
Lines 40146 41374 +1228
Branches 4240 4242 +2
============================================
+ Hits 17203 18103 +900
- Misses 21878 22198 +320
- Partials 1065 1073 +8
This pull request uses carry forward flags.
- agent-service: prettier --write across 21 DataGuard files (CI was running format:check, not just format). Backend tests unchanged at 231/0/500. - frontend: add Apache 2.0 headers to the two permission-prompt component partials that skywalking-eyes flagged.
DataGuard — permission-gated data cleaning for Texera
Introduction
DataGuard brings the permission UX to data instead of code. Drop a CSVFileScan, CSVOldFileScan, or JSONLFileScan onto the canvas and a floating checklist slides in: each detected issue gets one row, one risk-tier badge, one concrete proposed fix, and one Allow/Skip control. Nothing mutates the dataset until you click Fix N & run. Approved fixes are written back as a new dataset version through LakeFS, the operator is repointed at the cleaned data, the workflow auto-runs, and DataGuard immediately re-scans so you can chase the next round.

Four detectors run on every scan: missing values, placeholder sentinels, duplicate IDs, inconsistent label spellings.
The problem
Data cleaning fails today in two opposite, equally bad ways:
DataGuard's bet is that the interaction model is the missing piece, not the algorithms. Treat each cleaning decision the way Claude Code treats a file edit: explain the evidence, propose the action, ask permission, log the answer.
What ships in this PR
Four detectors, four enabled by default
- Missing values (na, n/a, null, none, nan, case-insensitive)
- Placeholder sentinels: numeric (999, -1), string sentinels (unknown)
- Duplicate IDs (id, *_id, *Id, dotted paths like user.id)
- Inconsistent labels: values whose trim + lowercase keys collide (Male/male/MALE)

Every fix asks before it lands
Every proposal carries a risk tier (low/medium/high/warning) that drives both the UI affordance and the permission gate. low is pre-checked; medium is pending; high and warning can never be auto-approved, even when an "always allow" rule exists for the issue type — some decisions should never be automated away. The legacy "modify" verdict (free-text override) is rejected at both HTTP and WebSocket boundaries; it executed the original action while pretending to honor the user's edit, so we cut it rather than ship a lie.

Auto-trigger is the onboarding
The user does nothing special — drop a dataset operator and the checklist appears. The scan runs the deterministic profiler and the LLM-backed proposal step server-side, bypassing the agent's ReAct loop, so the LLM can't decide to call a destructive tool and vaporize the workflow.
Iterative cleanup loop
Click Fix N & run and the loop is v1 → Apply → v2 → Scan → Apply → v3 → …, with the panel auto-rescanning the cleaned data after each round. A toolbar shield toggles DataGuard per workflow; when the panel is closed but the shield is on, a small floating icon hangs on the canvas as a one-click rescan.

Locate-cycle for multi-row issues
Each issue row has a pin. Click it and the Result Panel scrolls to the affected row and flashes the cell. If the issue affects multiple rows — a duplicate-ID cluster, say, with four offending rows — the pin cycles: first click 1 of 4, second 2 of 4, then 3 of 4, 4 of 4, and the fifth wraps back to 1 of 4. The cursor is per-issue, so two different issues don't fight over the focus, and the tooltip previews the next position before you click.
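The wrap-around cursor arithmetic is simple modular math (a sketch with assumed names):

```typescript
// Per-issue cursor in a Map, so two issues never fight over focus.
// Each click returns the current target row and advances the cursor,
// wrapping at the end of affectedRowIndices.
const locateCursors = new Map<string, number>();

function nextLocateIndex(issueId: string, affectedRowIndices: number[]): number {
  const cursor = locateCursors.get(issueId) ?? 0;
  locateCursors.set(issueId, (cursor + 1) % affectedRowIndices.length); // wrap
  return affectedRowIndices[cursor];
}
```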
What got fixed on this branch (the unglamorous moments)
JsonNullNode.asText() returns the literal string "null" — not Java null. Before this fix, locating a row with any null cell silently fell through to a byte-order index path and flashed whatever shuffled display row happened to sit at that position. Both sides now canonicalize every missing form (null, undefined, NaN, "", the literal string "null", and na/n/a/nan/none case-insensitive) to a single fingerprint token before comparison.

No-op guards on replace_value and standardize. When the LLM re-proposes a mapping whose target equals the cell that's already there (e.g., {south: "South"} after round 2 already standardized it), the applier now skips the no-op write. Without this, LakeFS aborts the version commit with "No changes detected in dataset" mid-iteration and the Apply loop dies on the second pass.

What's next
FileScan, and TextInput.

How was this PR tested?
Automated
- cd agent-service && bun test → 231 pass / 0 fail / 500 expect calls across 20 files (profiler, applier, permission gate with-approval, decision log, apply-batch end-to-end, JSONL parser, fingerprint contract, dataguard tool surface).
- cd agent-service && bun run typecheck → exit 0.
- cd frontend && npx tsc --noEmit → exit 0.
- cd frontend && npx ng test --watch=false scoped to DataGuard specs → 60+ tests pass across dataguard-checklist, data-guard-row-navigator, and data-guard-results.service. Two unrelated specs in this directory (data-guard-jsonl.spec.ts, data-guard-auto-trigger.service.spec.ts) trip on a pre-existing vitest / jsdom Blob.text() infrastructure gap; this is environment plumbing, not a regression introduced by this PR (the underlying code paths are covered on the agent-service side).
- sbt "scalafixAll --check" and sbt scalafmtCheckAll → exit 0 (this PR touches no Scala).

Manual end-to-end
- CSVFileScan on a dirty dataset with known missing values, an age = 999 placeholder, duplicate IDs, and mixed-case labels → the checklist auto-opens; all four enabled detectors fire with the right risk tiers.
- JSONLFileScan on a file with nested objects and explicit null cells in a numeric column → the flatten policy produces user.id-style columns; clicking the pin on a null-cell row lands on the correct row (not the wrong worker-shuffled row).
- validRanges payload via the scan options → the outlier detector activates and flags out-of-range values as WARNING.
Yes. Generated-by: Anthropic Claude Opus 4.7 via Claude Code CLI. AI was used throughout the development cycle — design exploration, implementation, test authoring, code review. Every decision was reviewed and approved by a human author before being committed; the AI did not autonomously merge or push.
DataGuard-720p.mp4