[Hackathon] feat: permission-gated data cleaning agent for Texera #5101
Open
eugenegujing wants to merge 14 commits into
Conversation
First checkpoint for the DataGuard data-cleaning agent: adds the shared types contract (DataQualityIssue, FixProposal, DecisionLogEntry, AutoAllowRule, plus supporting unions) and the read-only profile_dataset scanner with four heuristics (missing, placeholder, duplicate-ID, out-of-range). DataGuard auto-launches when a dataset is added to the workflow and asks Claude-Code-style approval before applying each fix; see README_DataGuard_Texera.md.
Finishes the agent-side DataGuard MVP: LLM-driven suggest_fix, mutating apply_fix, the requestApproval/awaitDecision/resolveDecision gating layer on TexeraAgent (pendingApproval step + WS decision message + auto-allow "remember" rules), write_decision_log (RFC-4180 CSV), bias-check (per-group retention), and a ~50-row polluted diabetes demo CSV. 122 new test cases, all four DataGuard tools registered into createTools().
Adds the chat-panel UX: a standalone PermissionPromptComponent that renders inline on any ReActStep with pendingApproval (Allow / Deny / Modify / Allow & remember), AgentService.sendDecision wiring the new WS message, and DataGuardAutoTriggerService that fires when a dataset-reading operator (CSVFileScan, TableFileScan, JSONFileScan, ParallelCSVFileScan) is added to the workflow. AgentPanelComponent subscribes and surfaces a notification — full agent-creation flow remains a follow-up.
The dataguard/ folder had 8 source + 8 test files in one directory and was getting hard to scan. Move the seven DataGuard-specific test files into a __tests__/ subdirectory and update their relative imports (one extra ../). The types/dataguard.test.ts stays put — it's in src/types/, not under dataguard/, and that folder isn't crowded. bun test still auto-discovers; all 159 tests still pass.
…nd-to-end MVP polish
Major iteration on top of the four committed DataGuard MVP commits. Ships the
auto-trigger checklist as the primary UX (chat flow stays wired for any future
DataGuard-via-LLM path but isn't used by the user-facing flow).
Detector model: five categories (was six). The z-score outlier detector was
dropped — clustered legitimate extremes (e.g. sustained high glucose readings)
were being flagged en masse with no good way for the user to say "those are
real." The old `out_of_range` was renamed `outlier` and keeps its
validRanges-based definition. `duplicate_id` now auto-infers the ID column
from name patterns (`id`, `*_id`, `*Id`, `id_*`, `uid`) so the auto-trigger's
empty-body `/scan` still catches duplicates without UI configuration.
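The name-pattern inference described above might look roughly like this (a hedged sketch; the real `inferIdColumn` lives in the agent-service and its exact signature is an assumption):

```typescript
// Illustrative sketch of the id-name heuristic: "id", "uid", "*_id",
// "id_*", and camelCase "*Id" all count as ID columns.
function inferIdColumn(columns: string[]): string | undefined {
  const isIdLike = (name: string): boolean => {
    const n = name.toLowerCase();
    return (
      n === "id" ||            // exact "id"
      n === "uid" ||           // exact "uid"
      n.endsWith("_id") ||     // "sample_id"
      n.startsWith("id_") ||   // "id_card"
      /[a-z]Id$/.test(name)    // camelCase "userId"
    );
  };
  return columns.find(isIdLike); // first match wins; undefined if none
}
```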
Fix-operation model: `flag` removed entirely (was a no-op against the data;
caused LakeFS "No changes detected" errors after Apply). `warning` added to
RiskTier — concrete fix, always prompts, no "remember." `replace_value` now
supports `rowIndices` targeting (deterministic; wins over value-based `match`),
which fixes a class of silent no-ops where LLM-rounded match values didn't
equal the actual cells. Suggest-fix prompt explicitly passes
`affectedRowIndices` and instructs `rowIndices`-based replace for outliers.
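A minimal illustration of the `rowIndices`-wins-over-`match` precedence (the interface and function shape here are hypothetical, not the actual applier):

```typescript
// Sketch: deterministic rowIndices targeting takes precedence over
// value-based match, avoiding the silent no-op where an LLM-rounded
// match value never equals the actual cell contents.
interface ReplaceValueFix {
  column: string;
  newValue: string;
  rowIndices?: number[]; // deterministic path, wins when present
  match?: string;        // value-equality fallback
}

function applyReplaceValue(rows: Record<string, string>[], fix: ReplaceValueFix): number {
  let affected = 0;
  if (fix.rowIndices !== undefined && fix.rowIndices.length > 0) {
    for (const i of fix.rowIndices) {
      if (rows[i] !== undefined) {
        rows[i][fix.column] = fix.newValue;
        affected++;
      }
    }
  } else if (fix.match !== undefined) {
    // Fallback: only cells byte-equal to `match` are touched, which is
    // exactly where rounded LLM values used to cause silent no-ops.
    for (const row of rows) {
      if (row[fix.column] === fix.match) {
        row[fix.column] = fix.newValue;
        affected++;
      }
    }
  }
  return affected;
}
```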
Permission contract: `/apply-batch` body schema is `verdict: "allow" | "deny"`
with `additionalProperties: false`. The Elysia app is built with
`normalize: false` so unknown legacy fields aren't silently stripped before
validation. Global onError converts `code === "VALIDATION"` to HTTP 400.
Runtime check rejects `{verdict: "deny", remember: true}`. WS decision handler
narrowed the same way. `modifiedAction` removed everywhere, including the
decision-log CSV header (9 columns now).
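The same contract rules, sketched framework-free (the shipped version uses Elysia's TypeBox body schema with `additionalProperties: false` plus a runtime check; this standalone validator only illustrates the rejection rules, and its name and return shape are assumptions):

```typescript
type Verdict = "allow" | "deny";
interface DecisionBody { verdict: Verdict; remember?: boolean; }

// Mirrors additionalProperties:false plus the rule the schema alone can't
// express: "remember" is contradictory with a deny verdict.
function validateDecision(
  raw: unknown
): { ok: true; body: DecisionBody } | { ok: false; status: 400; error: string } {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, status: 400, error: "body must be an object" };
  }
  const obj = raw as Record<string, unknown>;
  const allowed = new Set(["verdict", "remember"]);
  for (const k of Object.keys(obj)) {
    // Rejects legacy fields like modifiedAction instead of stripping them.
    if (!allowed.has(k)) return { ok: false, status: 400, error: `unknown field "${k}"` };
  }
  if (obj.verdict !== "allow" && obj.verdict !== "deny") {
    return { ok: false, status: 400, error: "verdict must be allow | deny" };
  }
  if (obj.remember !== undefined && typeof obj.remember !== "boolean") {
    return { ok: false, status: 400, error: "remember must be a boolean" };
  }
  if (obj.verdict === "deny" && obj.remember === true) {
    return { ok: false, status: 400, error: "deny + remember is contradictory" };
  }
  return { ok: true, body: { verdict: obj.verdict, remember: obj.remember as boolean | undefined } };
}
```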
Checklist UX:
- Auto-trigger CSV-only (`CSVFileScan`, `ParallelCSVFileScan`). JSON/Parquet
were trigger-set members but `loadFromOperatorFile` blindly Papa-parsed,
producing garbage.
- Floating, draggable panel via Angular CDK; cdkDragBoundary="body" so it
can't get lost behind the toolbar.
- 📍 locate button per row: cycles through `affectedRowIndices` on repeat
clicks (per-row cursor in a `Map<issueId, number>`, wraps at length).
Tooltip previews the next click position.
- Result-panel integration via `DataGuardRowNavigatorService` (ReplaySubject
with 500ms TTL for cold-mount). `ResultTableFrameComponent` adds a private
`pageRendered$` Subject so the highlight only fires after the new page
actually renders, not on an arbitrary 100ms timer. Cross-operator race,
viewport-resize during page swap, columnDef-vs-header drift, destroyed-view
NG0911 — all guarded.
- After-Apply auto-rescan: the panel re-runs `/scan` against the new dataset
version so users see real residue instead of stale entries.
- Floating reopen icon when panel is closed & shield is ON. Click → fresh
scan. Pipeline concurrency is gated by `currentPipeline: Promise<void> |
null`: auto-trigger drops silently on slot conflict (spam suppression);
user-initiated awaits the in-flight pipeline (with a "queuing your scan…"
toast) — at most one `/scan` POST in flight at a time.
- Toolbar 🛡 shield (per-workflow ON/OFF, localStorage).
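The `currentPipeline` gating described above can be sketched as a small class (names and structure are illustrative, not the actual service):

```typescript
// Single-slot scan pipeline: auto-triggers drop on conflict, user-initiated
// scans queue behind the in-flight pipeline. At most one scan runs at a time.
class ScanPipelineGate {
  private currentPipeline: Promise<void> | null = null;

  /** Auto-trigger path: drop silently if the slot is taken (spam suppression). */
  tryRunAuto(scan: () => Promise<void>): boolean {
    if (this.currentPipeline !== null) return false;
    this.run(scan);
    return true;
  }

  /** User-initiated path: await the in-flight pipeline, then run. */
  async runUserInitiated(scan: () => Promise<void>): Promise<void> {
    while (this.currentPipeline !== null) {
      await this.currentPipeline; // a "queuing your scan…" toast would fire here
    }
    await this.run(scan);
  }

  private run(scan: () => Promise<void>): Promise<void> {
    const p = scan().finally(() => {
      if (this.currentPipeline === p) this.currentPipeline = null; // release slot
    });
    this.currentPipeline = p;
    return p;
  }
}
```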
Backend: server normalize:false; `/dataguard/load-demo-dataset` deleted as
dead code (frontend never called it). DataGuardSession's `flaggedRows` field
plus the post-apply `acknowledgedIssues` split removed alongside `flag`.
`missing-detection.ts` is the single source of truth for missing/placeholder
checks; `applyFix` threads `ScanOptions.missingTokens` through to `impute`.
With-approval gates `warning` identically to `high` (never auto-allows, even
with a remembered rule).
Frontend service ownership: the auto-trigger orchestration subscription moved
off `AgentPanelComponent` onto `DataGuardChecklistComponent` (the natural
consumer of its output). `selectedCount`/`deniedCount`/`pendingCount` are
cached on each state push instead of being three filter walks per CD tick.
README_DataGuard.md is the consolidated feature spec, kept in repo root.
Tests: 199 pass / 0 fail (419 expect calls), agent-service typecheck clean,
frontend `tsc --noEmit` clean. New regression-locks include: `replace_value`
with `rowIndices` (no more byte-identical export), `inferIdColumn` for all
id-name patterns, "clustered large readings are NOT auto-outliers", warning
tier prompts even with remembered rule, modify-verdict + modifiedAction +
`{deny, remember:true}` rejection, pipeline serialization (two user-initiated
rescans never invoke `/scan` concurrently), per-row locate cycle math.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The demo CSVs (diabetes_messy + the six single-category fixtures) and their README are kept on disk for local hand-testing but shouldn't live in upstream. git rm --cached preserves the files locally; new .gitignore prevents accidental re-staging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removed the LiteLLM proxy / Python venv / bun-install setup instructions from README_DataGuard.md. The flow walkthrough that lived under §14.3 is preserved as the new §14 "End-to-end flow" so the doc still tells a user what to expect after they click around. Downstream section numbers are unchanged (§15 Testing → §19 Post-MVP follow-ups). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… enforcement

Substantial follow-up iteration on the locate UX, file-type coverage, and contract hardening. All work on top of feat/dataguard-mvp commit 5cd0160.

## File-type coverage (auto-trigger dispatcher)

`loadFromOperatorFile` refactored into a parser-dispatcher (`PARSERS` map + `DatasetParser` type). Three operator types are now in the trigger set:

- CSVFileScan / CSVOldFileScan → parseCsv (CSVOld threads customDelimiter from operator properties so `;` / `\t` / etc. are honored)
- JSONLFileScan → parseJsonl (new module: nested objects flatten to dot-notation columns; arrays are stringified as single cells; non-object lines warn-and-skip; CRLF tolerated; collision rule: nested-owned paths always win over literal-dotted top-level keys regardless of JSON source order, via a two-pass scan)

Format-aware write-back: server-side GET /dataguard/export-jsonl serializes the in-memory session back to JSONL (canonical column key order; `undefined` → `null` for a lossless round-trip). Frontend writeBackAsNewVersion branches on the source operator type to pull the right export endpoint.

ParallelCSVFileScan dropped from the trigger set — Texera disables it in the operator registry (LogicalOp.scala:171 commented out). One-line re-add if/when re-enabled.

## Locate feature — split path

CSV operators (`CSVFileScan`, `CSVOldFileScan`) use the original simple synchronous index-based locate. Texera runs them with a single worker so display order matches file-byte source order; the index DataGuard computed against parseCsv's output is correct as-is. The cursor advances synchronously; navigate is fire-and-forget.

JSONLFileScan uses a new fingerprint + flash-confirmed path because Texera parallelizes JSONL scans and shuffles display order:

- `rowFingerprint(row, columns)` is byte-identical on agent-service and frontend: alphabetical column sort + per-cell `JSON.stringify(String(v))` + empty-string concat. The `String()` step before stringify is critical — Texera widens the schema to string when a column has mixed types, so `JSON.stringify(45)` vs `JSON.stringify("45")` would mismatch without it.
- DataQualityIssue carries `affectedRowKeys[]` 1:1 aligned with `affectedRowIndices[]`. The profiler emits both. Frontend `findRowByKey` scans display rows for the matching fingerprint, paginates up to 10 pages, and falls back silently to the index path if not found.
- Result-table-frame `currentLocateToken` cancellation kills the rapid-click race — every async resumption checks the captured token and bails (emitting flashResult: false exactly once so the awaiter's Promise resolves rather than hanging).
- `navigate()` returns Promise<boolean>; it subscribes to flashResult$ via firstValueFrom(race(filtered, timer(36s))) BEFORE publishing to nav$, so synchronous fast-path emissions are caught.
- The checklist component awaits the Promise and only advances `locateCursors` on flashed === true. Empty clicks (timeout / supersede / out-of-bounds) leave the cursor put, eliminating the "skipped a row then jumped back" UX. The per-row cursor (`Map<issueId, number>`) survives benign state re-emits via `purgeStaleCursors` (only ids not in the live entry set are evicted — never a wholesale clear, which previously corrupted cursors on benign setState patches between clicks).

## Detector + ID-inference improvements

The `inferIdColumn` heuristic for the auto-trigger's empty-body /scan now recognizes dotted-notation names produced by the JSONL flatten (`user.id`, `customer.uid`, `nested.user.id`, `Account.ID` case-insensitive) in addition to underscore variants (`sample_id`, `userId`, `id_card`, etc.). Without this, dup-ID detection silently no-op'd on JSONL data.

## Contract enforcement

Apply-batch body schema strictly rejects:

- verdict: "modify" (cut by #11a — silently lied to users)
- the modifiedAction field
- {verdict: "deny", remember: true}

Elysia is built with `normalize: false` so unknown legacy fields aren't silently stripped before validation. Global onError converts VALIDATION to HTTP 400. The WS decision handler is narrowed similarly.

## Test counts

- agent-service: typecheck clean / 217 pass / 0 fail / 457 expect
- frontend: tsc --noEmit clean
- frontend specs: 36 navigator + 15 jsonl + 2 checklist + 7 auto-trigger

## CONTRIBUTING compliance

- ✅ sbt scalafixAll --check
- ✅ sbt scalafmtCheckAll
- ✅ yarn format:ci (after yarn format:fix on 15 files)
- ✅ Apache license headers on all new files (.ts/.html/.scss)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
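Assuming plain JS objects for rows, the fingerprint contract above reduces to a few lines (a sketch of the rule as stated, not the shipped code):

```typescript
// Cross-service fingerprint: alphabetical column sort, per-cell
// JSON.stringify(String(v)), empty-string concat. Must be byte-identical
// on agent-service and frontend for findRowByKey to match.
function rowFingerprint(row: Record<string, unknown>, columns: string[]): string {
  return [...columns]
    .sort() // both sides agree on order regardless of schema order
    .map((col) => JSON.stringify(String(row[col]))) // String() first: Texera
    // widens mixed-type columns to string, so 45 and "45" must match
    .join("");
}
```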
…s, no-op fix guard

- profile-dataset: gate auto-IQR behind `enableOutlierDetection` (default false). validRanges still fires unconditionally as a per-column override. The IQR detector code is preserved so re-enabling is a single-flag flip once the convergence behavior is fully refined.
- apply-fix replace_value: skip writes when the proposed replacement equals the current cell (cellEquals guard, on both the rowIndices and value-match branches). Fixes iterative-cleanup convergence — without the guard, v3 onwards re-applied byte-identical CSVs and LakeFS aborted the commit with "No changes detected."
- locate cycle: add `rowKeyOccurrence` + `findNthRowByKey` so clicking "locate" on duplicate-fingerprint rows cycles through all matches rather than always pinning to the first occurrence. A cumulative match counter surfaces "2 of 4" in the result panel.
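The `findNthRowByKey` occurrence-cycling might be sketched like this (the signature is assumed; only the name comes from the commit message):

```typescript
// When several display rows share one fingerprint (duplicate-ID clusters),
// target the Nth occurrence instead of always pinning the first.
function findNthRowByKey(
  displayFingerprints: string[], // precomputed fingerprints of on-screen rows
  key: string,
  occurrence: number             // 0-based: 0 → first match, 1 → second, …
): number {
  let seen = 0;
  for (let i = 0; i < displayFingerprints.length; i++) {
    if (displayFingerprints[i] === key) {
      if (seen === occurrence) return i;
      seen++;
    }
  }
  return -1; // fewer matches than requested: caller falls back or toasts
}
```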
Texera's JSONLScanSourceOpExec parses through Jackson; `JsonNullNode.asText()`
returns the literal string `"null"` (4 chars), not Java null. The result-panel
row for a source `{"score": null}` therefore arrives at the frontend as
`{score: "null"}` while the profiler preserved real JS null — fingerprints
diverged, `findRowByKey` returned -1 for every null cell, and the silent
byte-order index fallback in result-table-frame landed on a worker-shuffled
display row (e.g., U037 score=88 instead of U007 Grace score=null).
Fix: normalize all missing forms to a single bare `null` fingerprint token
on both sides — real null, undefined, NaN, empty/whitespace strings, the
literal string `"null"`, and the standard missing-token set (`na`, `n/a`,
`nan`, `none`), all case- and trim-insensitive. The profiler routes through
the shared `isCellMissing` predicate; the frontend mirrors with an inline
duplicate (no shared module across services).
Round-1..5 locate invariants intact: `String(v)` numeric coercion preserved
on the non-missing path, `rowKeyOccurrence` cycling untouched, CSV path
unaffected.
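A sketch of the shared missing-cell normalization (the token set and predicate name come from the commit message; the module layout is an assumption):

```typescript
// Every missing form collapses to one bare "null" fingerprint token, so
// real JS null on the profiler side matches Jackson's literal "null"
// string on the result-panel side.
const MISSING_TOKENS = new Set(["", "null", "na", "n/a", "nan", "none"]);

function isCellMissing(v: unknown): boolean {
  if (v === null || v === undefined) return true;
  if (typeof v === "number" && Number.isNaN(v)) return true;
  const s = String(v).trim().toLowerCase(); // case- and trim-insensitive
  return MISSING_TOKENS.has(s);
}

function fingerprintCell(v: unknown): string {
  // Non-missing path keeps the String(v) numeric coercion from earlier rounds.
  return isCellMissing(v) ? "null" : JSON.stringify(String(v));
}
```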
When the fingerprint walk exhausts every page without finding the target
row, result-table-frame used to silently fall back to highlighting the
file-byte-order index — which is right for single-worker CSV but wrong
for multi-worker JSONL (worker-shuffle lands an unrelated row at that
position). Earlier versions toasted on every miss and were too noisy,
but that was because the pre-`isCellMissing`-fingerprint contract had
been mismatching 100% of the time on null cells; now that round-6
normalises null/"null"/missing forms across the wire, a real miss only
happens when the data has drifted from the scan.
The table frame now emits `flashed: false` on walk exhaustion; the
checklist caller toasts ("data may have changed since the scan — try
Scan again") only when a rowKey was sent, so the CSV no-rowKey path
stays silent and rapid-double-click supersession doesn't spam.
Tests: extend the existing "JSONL cursor stays put on false" spec to
assert the toast fires, and add a CSV-path test that asserts it does
NOT fire when no rowKey was sent.
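The toast rule reduces to a tiny decision function (illustrative names; the real caller lives in the checklist component):

```typescript
// Toast only when a rowKey was actually sent (the JSONL fingerprint path);
// the CSV no-rowKey path and rapid-click supersession stay silent, and the
// cursor only advances on a confirmed flash.
function handleLocateResult(
  flashed: boolean,
  rowKeySent: boolean,
  toast: (msg: string) => void
): "advance" | "hold" {
  if (flashed) return "advance"; // cursor moves to the next affected row
  if (rowKeySent) {
    toast("data may have changed since the scan — try Scan again");
  }
  return "hold"; // empty click: cursor stays put
}
```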
After iterative Apply rounds normalize a string column to its canonical
form (e.g., `region: "South"` everywhere), the inconsistent_label detector
can keep re-flagging the column whenever the LLM proposes a mapping that
includes the canonical entry itself (`{south: "South"}` against rows that
are already `"South"`). The standardize branch was incrementing
`rowsAffected` on every mapping-key hit regardless of whether
`mapping[v] === v`, so the frontend pushed a byte-identical CSV/JSONL to
LakeFS and got "No changes detected in dataset. Version creation aborted"
— the same convergence failure the round-yesterday `cellEquals` guard
fixed for replace_value.
Add the same guard to `case "standardize"`. `affected` increments only
when the cell genuinely changes.
trim_whitespace already guards (`trimmed !== v`); impute is safe by
construction in the normal path (only missing cells are visited); the
all-missing-column edge case is a known follow-up and not in scope here.
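The standardize guard in isolation (a sketch; `applyStandardize` is a hypothetical stand-in for the real `case "standardize"` branch):

```typescript
// Convergence guard: a mapping hit where mapping[v] === v is a no-op and
// must not count as a change, or the frontend pushes a byte-identical file
// and LakeFS aborts with "No changes detected".
function applyStandardize(
  rows: Record<string, string>[],
  column: string,
  mapping: Record<string, string>
): number {
  let affected = 0;
  for (const row of rows) {
    const v = row[column];
    const mapped = mapping[v];
    if (mapped !== undefined && mapped !== v) { // skip identity mappings
      row[column] = mapped;
      affected++; // only genuine changes are counted
    }
  }
  return affected;
}
```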
Codecov Report — patch coverage and impacted files:
@@ Coverage Diff @@
## main #5101 +/- ##
============================================
+ Coverage 42.85% 43.75% +0.90%
- Complexity 2207 2208 +1
============================================
Files 1045 1054 +9
Lines 40146 41374 +1228
Branches 4240 4242 +2
============================================
+ Hits 17203 18103 +900
- Misses 21878 22198 +320
- Partials 1065 1073 +8
This pull request uses carry forward flags.
- agent-service: prettier --write across 21 DataGuard files (CI was running format:check, not just format). Backend tests unchanged at 231/0/500. - frontend: add Apache 2.0 headers to the two permission-prompt component partials that skywalking-eyes flagged.
DataGuard — permission-gated data cleaning for Texera
Introduction
DataGuard brings the permission UX to data instead of code. Drop a CSVFileScan, CSVOldFileScan, or JSONLFileScan onto the canvas and a floating checklist slides in: each detected issue gets one row, one risk-tier badge, one concrete proposed fix, and one Allow/Skip control. Nothing mutates the dataset until you click Fix N & run. Approved fixes are written back as a new dataset version through LakeFS, the operator is repointed at the cleaned data, the workflow auto-runs, and DataGuard immediately re-scans so you can chase the next round.

Four detectors run on every scan: missing values, placeholder sentinels, duplicate IDs, inconsistent label spellings.
The problem
Data cleaning fails today in two opposite, equally bad ways:
DataGuard's bet is that the interaction model is the missing piece, not the algorithms. Treat each cleaning decision the way Claude Code treats a file edit: explain the evidence, propose the action, ask permission, log the answer.
What ships in this PR
Four detectors, four enabled by default
- Missing values (na, n/a, null, none, nan, case-insensitive)
- Placeholder sentinels: numeric (999, -1), string sentinels (unknown)
- Duplicate IDs (id, *_id, *Id, dotted paths like user.id)
- Inconsistent labels: values whose trim + lowercase keys collide (Male/male/MALE)

Every fix asks before it lands
Every proposal carries a risk tier (low/medium/high/warning) that drives both the UI affordance and the permission gate. low is pre-checked; medium is pending; high and warning can never be auto-approved, even when an "always allow" rule exists for the issue type — some decisions should never be automated away. The legacy "modify" verdict (free-text override) is rejected at both HTTP and WebSocket boundaries; it executed the original action while pretending to honor the user's edit, so we cut it rather than ship a lie.

Auto-trigger is the onboarding
The user does nothing special — drop a dataset operator and the checklist appears. The scan runs the deterministic profiler and the LLM-backed proposal step server-side, bypassing the agent's ReAct loop, so the LLM can't decide to call a destructive tool and vaporize the workflow.
Iterative cleanup loop
Click Fix N & run and the loop is v1 → Apply → v2 → Scan → Apply → v3 → …, with the panel auto-rescanning the cleaned data after each round. A toolbar shield toggles DataGuard per workflow; when the panel is closed but the shield is on, a small floating icon hangs on the canvas as a one-click rescan.

Locate-cycle for multi-row issues
Each issue row has a pin. Click it and the Result Panel scrolls to the affected row and flashes the cell. If the issue affects multiple rows — a duplicate-ID cluster, say, with four offending rows — the pin cycles: first click 1 of 4, second 2 of 4, then 3 of 4, 4 of 4, and the fifth wraps back to 1 of 4. The cursor is per-issue, so two different issues don't fight over the focus, and the tooltip previews the next position before you click.
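The wrap-around cursor arithmetic is simple modular math (a sketch with assumed names):

```typescript
// Per-issue cursor in a Map, so two issues never fight over focus.
// Each click returns the current target row and advances the cursor,
// wrapping at the end of affectedRowIndices.
const locateCursors = new Map<string, number>();

function nextLocateIndex(issueId: string, affectedRowIndices: number[]): number {
  const cursor = locateCursors.get(issueId) ?? 0;
  locateCursors.set(issueId, (cursor + 1) % affectedRowIndices.length); // wrap
  return affectedRowIndices[cursor];
}
```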
What got fixed on this branch (the unglamorous moments)
JsonNullNode.asText() returns the literal string "null" — not Java null. Before this fix, locating a row with any null cell silently fell through to a byte-order index path and flashed whatever shuffled display row happened to sit at that position. Both sides now canonicalize every missing form (null, undefined, NaN, "", the literal string "null", and na/n/a/nan/none case-insensitive) to a single fingerprint token before comparison.

No-op guards on replace_value and standardize. When the LLM re-proposes a mapping whose target equals the cell that's already there (e.g., {south: "South"} after round 2 already standardized it), the applier now skips the no-op write. Without this, LakeFS aborts the version commit with "No changes detected in dataset" mid-iteration and the Apply loop dies on the second pass.

What's next
FileScan, and TextInput.

How was this PR tested?
Automated
- cd agent-service && bun test → 231 pass / 0 fail / 500 expect calls across 20 files (profiler, applier, permission gate with-approval, decision log, apply-batch end-to-end, JSONL parser, fingerprint contract, dataguard tool surface).
- cd agent-service && bun run typecheck → exit 0.
- cd frontend && npx tsc --noEmit → exit 0.
- cd frontend && npx ng test --watch=false scoped to DataGuard specs → 60+ tests pass across dataguard-checklist, data-guard-row-navigator, and data-guard-results.service. Two unrelated specs in this directory (data-guard-jsonl.spec.ts, data-guard-auto-trigger.service.spec.ts) trip on a pre-existing vitest / jsdom Blob.text() infrastructure gap; this is environment plumbing, not a regression introduced by this PR (the underlying code paths are covered on the agent-service side).
- sbt "scalafixAll --check" and sbt scalafmtCheckAll → exit 0 (this PR touches no Scala).

Manual end-to-end
- CSVFileScan on a dirty dataset with known missing values, an age = 999 placeholder, duplicate IDs, and mixed-case labels → the checklist auto-opens; all four enabled detectors fire with the right risk tiers.
- JSONLFileScan on a file with nested objects and explicit null cells in a numeric column → the flatten policy produces user.id-style columns; clicking the pin on a null-cell row lands on the correct row (not the wrong worker-shuffled row).
- validRanges payload via the scan options → the outlier detector activates and flags out-of-range values as WARNING.
Yes. Generated-by: Anthropic Claude Opus 4.7 via Claude Code CLI. AI was used throughout the development cycle — design exploration, implementation, test authoring, code review. Every decision was reviewed and approved by a human author before being committed; the AI did not autonomously merge or push.
DataGuard-720p.mp4