
[Hackathon] feat: permission-gated data cleaning agent for Texera #5101

Open
eugenegujing wants to merge 14 commits into apache:main from eugenegujing:feat/dataguard-mvp

Conversation


@eugenegujing eugenegujing commented May 16, 2026

DataGuard — permission-gated data cleaning for Texera

Introduction

DataGuard brings the permission UX to data instead of code. Drop a CSVFileScan, CSVOldFileScan, or JSONLFileScan onto the canvas and a floating checklist slides in: each detected issue gets one row, one risk-tier badge, one concrete proposed fix, and one Allow/Skip control. Nothing mutates the dataset until you click Fix N & run. Approved fixes are written back as a new dataset version through LakeFS, the operator is repointed at the cleaned data, the workflow auto-runs, and DataGuard immediately re-scans so you can chase the next round.

Four detectors run on every scan: missing values, placeholder sentinels, duplicate IDs, inconsistent label spellings.

The problem

Data cleaning fails today in two opposite, equally bad ways:

  • Manual pandas in a notebook — opaque, unauditable, no provenance, doesn't survive the person who wrote it.
  • One-click auto-clean tools — black-box decisions, no explanation, no human in the loop for the high-impact moves (drop rows, resolve conflicting IDs, clamp a value that might be a real rare case).

DataGuard's bet is that the interaction model is the missing piece, not the algorithms. Treat each cleaning decision the way Claude Code treats a file edit: explain the evidence, propose the action, ask permission, log the answer.

What ships in this PR

Four detectors, all enabled by default

| Detector | Default | Notes |
| --- | --- | --- |
| Missing values | on | nulls, empties, configurable tokens (`na`, `n/a`, `null`, `none`, `nan`, case-insensitive) |
| Placeholder values | on | numeric sentinels (999, -1), string sentinels (unknown) |
| Duplicate IDs | on | honors an explicit ID column; otherwise infers from name patterns (`id`, `*_id`, `*Id`, dotted paths like `user.id`) |
| Inconsistent labels | on | low-cardinality string columns where trim + lowercase keys collide (Male / male / MALE) |
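The inconsistent-labels heuristic can be sketched in a few lines — group the distinct spellings in a column by their trim + lowercase key and report any key that collapses more than one raw form. This is an illustrative sketch with hypothetical names, not DataGuard's actual implementation:

```typescript
// Hypothetical sketch of the inconsistent-label collision check: a key
// that maps to more than one distinct raw spelling is a candidate issue.
function findLabelCollisions(values: string[]): Map<string, string[]> {
  const groups = new Map<string, Set<string>>();
  for (const v of values) {
    const key = v.trim().toLowerCase();
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key)!.add(v);
  }
  const collisions = new Map<string, string[]>();
  for (const [key, spellings] of groups) {
    if (spellings.size > 1) collisions.set(key, [...spellings]);
  }
  return collisions;
}

// findLabelCollisions(["Male", "male", "MALE", "Female"])
// → Map { "male" → ["Male", "male", "MALE"] }
```

A proposed fix would then map every colliding spelling to one canonical form, which is what the standardize operation applies.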

Every fix asks before it lands

Every proposal carries a risk tier (low / medium / high / warning) that drives both the UI affordance and the permission gate. low is pre-checked; medium is pending; high and warning can never be auto-approved, even when an "always allow" rule exists for the issue type — some decisions should never be automated away. The legacy "modify" verdict (free-text override) is rejected at both HTTP and WebSocket boundaries; it executed the original action while pretending to honor the user's edit, so we cut it rather than ship a lie.

Auto-trigger is the onboarding

The user does nothing special — drop a dataset operator and the checklist appears. The scan runs the deterministic profiler and the LLM-backed proposal step server-side, bypassing the agent's ReAct loop, so the LLM can't decide to call a destructive tool and vaporize the workflow.

Iterative cleanup loop

Click Fix N & run and the loop is v1 → Apply → v2 → Scan → Apply → v3 → …, with the panel auto-rescanning the cleaned data after each round. A toolbar shield toggles DataGuard per workflow; when the panel is closed but the shield is on, a small floating icon hangs on the canvas as a one-click rescan.

Locate-cycle for multi-row issues

Each issue row has a pin. Click it and the Result Panel scrolls to the affected row and flashes the cell. If the issue affects multiple rows — a duplicate-ID cluster, say, with four offending rows — the pin cycles: first click 1 of 4, second 2 of 4, then 3 of 4, 4 of 4, and the fifth wraps back to 1 of 4. The cursor is per-issue, so two different issues don't fight over the focus, and the tooltip previews the next position before you click.
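The per-issue cursor arithmetic is simple modulo cycling; a minimal sketch (class and method names are illustrative, the display string is the 0-based position plus one):

```typescript
// Hypothetical sketch of the per-issue locate cursor: each issue keeps its
// own position, a click highlights the current row and advances with wrap.
class LocateCursors {
  private cursors = new Map<string, number>();

  // Returns the 0-based position to highlight for this click, then advances.
  next(issueId: string, rowCount: number): number {
    const pos = this.cursors.get(issueId) ?? 0;
    this.cursors.set(issueId, (pos + 1) % rowCount);
    return pos;
  }

  // Tooltip preview: the position the NEXT click will land on.
  peek(issueId: string): number {
    return this.cursors.get(issueId) ?? 0;
  }
}
```

Because the map is keyed by issue id, clicking the pin on one issue never disturbs another issue's cursor.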

What got fixed on this branch (the unglamorous moments)

  • Null-cell fingerprint normalization. Texera's JSONL scan parses through Jackson, and JsonNullNode.asText() returns the literal string "null" — not Java null. Before this fix, locating a row with any null cell silently fell through to a byte-order index path and flashed whatever shuffled display row happened to sit at that position. Both sides now canonicalize every missing form (null, undefined, NaN, "", the literal string "null", and na / n/a / nan / none case-insensitive) to a single fingerprint token before comparison.
  • Silent fingerprint fallback → "row not found" toast. When the fingerprint walk legitimately exhausts (drift, post-Apply schema change), the result panel used to land on the wrong row without saying anything. A silent miss-into-wrong-row is strictly worse than no flash, so it now surfaces a toast instead.
  • No-op write guard for replace_value and standardize. When the LLM re-proposes a mapping whose target equals the cell that's already there (e.g., {south: "South"} after round 2 already standardized it), the applier now skips the no-op write. Without this, LakeFS aborts the version commit with "No changes detected in dataset" mid-iteration and the Apply loop dies on the second pass.
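The null-cell canonicalization described above amounts to one shared predicate applied on both sides before fingerprints are compared. A hedged sketch, using the token set from this PR description (function and constant names are illustrative):

```typescript
// Illustrative sketch: every "missing" spelling collapses to one token
// before fingerprint comparison, so a backend JS null and a frontend
// literal "null" string (Jackson's JsonNullNode.asText()) agree.
const MISSING_TOKENS = new Set(["", "null", "undefined", "nan", "na", "n/a", "none"]);

function isCellMissing(v: unknown): boolean {
  if (v === null || v === undefined) return true;
  if (typeof v === "number" && Number.isNaN(v)) return true;
  return MISSING_TOKENS.has(String(v).trim().toLowerCase());
}

// Canonical cell form used when building a row fingerprint.
function canonicalCell(v: unknown): string {
  return isCellMissing(v) ? "\u0000missing" : String(v);
}
```

With this in place, `{"score": null}` on the profiler side and `{score: "null"}` on the result-panel side produce identical fingerprint cells.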

What's next

  • Broader operator coverage — add support for Arrow, generic FileScan, and TextInput.

How was this PR tested?

Automated

  • cd agent-service && bun test → 231 pass / 0 fail / 500 expect calls across 20 files (profiler, applier, permission gate with-approval, decision log, apply-batch end-to-end, JSONL parser, fingerprint contract, dataguard tool surface).
  • cd agent-service && bun run typecheck → exit 0.
  • cd frontend && npx tsc --noEmit → exit 0.
  • cd frontend && npx ng test --watch=false scoped to DataGuard specs → 60+ tests pass across dataguard-checklist, data-guard-row-navigator, and data-guard-results.service. Two unrelated specs in this directory (data-guard-jsonl.spec.ts, data-guard-auto-trigger.service.spec.ts) trip on a pre-existing vitest / jsdom Blob.text() infrastructure gap; this is environment plumbing, not a regression introduced by this PR (the underlying code paths are covered by the agent-service side).
  • sbt "scalafixAll --check" and sbt scalafmtCheckAll → exit 0 (this PR touches no Scala).
  • Prettier check on all DataGuard frontend files → clean.

Manual end-to-end

  1. Drop CSVFileScan on a dirty dataset with known missing values, an age = 999 placeholder, duplicate IDs, and mixed-case labels → checklist auto-opens; all four enabled detectors fire with the right risk tiers.
  2. Drop JSONLFileScan on a file with nested objects and explicit null cells in a numeric column → flatten policy produces user.id-style columns; clicking the pin on a null-cell row lands on the correct row (not the wrong worker-shuffled row).
  3. Click Fix N & run with a mix of approved and skipped rows → new dataset version is created, operator is repointed, workflow re-runs, panel auto-rescans showing only residuals.
  4. Click the pin on a duplicate-ID cluster repeatedly → highlight cycles 1 → 2 → 3 → 4 → 1.
  5. Toggle the toolbar shield off → auto-trigger goes silent. Toggle on, click the floating icon → fresh scan starts even mid-flight (queued, never concurrent).
  6. Trigger a no-op standardize / replace_value path (already-canonical column) → apply succeeds without a LakeFS "no changes detected" error.
  7. Supply a validRanges payload via the scan options → outlier detector activates and flags out-of-range values as WARNING.
  8. Confirm HIGH / WARNING risk tiers never auto-approve, even with "always allow" toggled on the issue type.

Was this PR authored or co-authored using generative AI tooling?

Yes. Generated-by: Anthropic Claude Opus 4.7 via Claude Code CLI. AI was used throughout the development cycle — design exploration, implementation, test authoring, code review. Team Feature is implemented. Every decision was reviewed and approved by a human author before being committed; the AI did not autonomously merge or push.

DataGuard-720p.mp4

eugenegujing and others added 13 commits May 14, 2026 19:07
First checkpoint for the DataGuard data-cleaning agent: adds the shared
types contract (DataQualityIssue, FixProposal, DecisionLogEntry,
AutoAllowRule, plus supporting unions) and the read-only profile_dataset
scanner with four heuristics (missing, placeholder, duplicate-ID,
out-of-range). DataGuard auto-launches when a dataset is added to the
workflow and asks Claude-Code-style approval before applying each fix;
see README_DataGuard_Texera.md.
Finishes the agent-side DataGuard MVP: LLM-driven suggest_fix, mutating
apply_fix, the requestApproval/awaitDecision/resolveDecision gating layer
on TexeraAgent (pendingApproval step + WS decision message + auto-allow
"remember" rules), write_decision_log (RFC-4180 CSV), bias-check
(per-group retention), and a ~50-row polluted diabetes demo CSV. 122 new
test cases, all four DataGuard tools registered into createTools().
Adds the chat-panel UX: a standalone PermissionPromptComponent that renders
inline on any ReActStep with pendingApproval (Allow / Deny / Modify /
Allow & remember), AgentService.sendDecision wiring the new WS message,
and DataGuardAutoTriggerService that fires when a dataset-reading
operator (CSVFileScan, TableFileScan, JSONFileScan, ParallelCSVFileScan)
is added to the workflow. AgentPanelComponent subscribes and surfaces a
notification — full agent-creation flow remains a follow-up.
The dataguard/ folder had 8 source + 8 test files in one directory and was
getting hard to scan. Move the seven DataGuard-specific test files into a
__tests__/ subdirectory and update their relative imports (one extra ../).
The types/dataguard.test.ts stays put — it's in src/types/, not under
dataguard/, and that folder isn't crowded. bun test still auto-discovers;
all 159 tests still pass.
…nd-to-end MVP polish

Major iteration on top of the four committed DataGuard MVP commits. Ships the
auto-trigger checklist as the primary UX (chat flow stays wired for any future
DataGuard-via-LLM path but isn't used by the user-facing flow).

Detector model: five categories (was six). The z-score outlier detector was
dropped — clustered legitimate extremes (e.g. sustained high glucose readings)
were being flagged en masse with no good way for the user to say "those are
real." The old `out_of_range` was renamed `outlier` and keeps its
validRanges-based definition. `duplicate_id` now auto-infers the ID column
from name patterns (`id`, `*_id`, `*Id`, `id_*`, `uid`) so the auto-trigger's
empty-body `/scan` still catches duplicates without UI configuration.

Fix-operation model: `flag` removed entirely (was a no-op against the data;
caused LakeFS "No changes detected" errors after Apply). `warning` added to
RiskTier — concrete fix, always prompts, no "remember." `replace_value` now
supports `rowIndices` targeting (deterministic; wins over value-based `match`),
which fixes a class of silent no-ops where LLM-rounded match values didn't
equal the actual cells. Suggest-fix prompt explicitly passes
`affectedRowIndices` and instructs `rowIndices`-based replace for outliers.

Permission contract: `/apply-batch` body schema is `verdict: "allow" | "deny"`
with `additionalProperties: false`. The Elysia app is built with
`normalize: false` so unknown legacy fields aren't silently stripped before
validation. Global onError converts `code === "VALIDATION"` to HTTP 400.
Runtime check rejects `{verdict: "deny", remember: true}`. WS decision handler
narrowed the same way. `modifiedAction` removed everywhere, including the
decision-log CSV header (9 columns now).

Checklist UX:
  - Auto-trigger CSV-only (`CSVFileScan`, `ParallelCSVFileScan`). JSON/Parquet
    were trigger-set members but `loadFromOperatorFile` blindly Papa-parsed,
    producing garbage.
  - Floating, draggable panel via Angular CDK; cdkDragBoundary="body" so it
    can't get lost behind the toolbar.
  - 📍 locate button per row: cycles through `affectedRowIndices` on repeat
    clicks (per-row cursor in a `Map<issueId, number>`, wraps at length).
    Tooltip previews the next click position.
  - Result-panel integration via `DataGuardRowNavigatorService` (ReplaySubject
    with 500ms TTL for cold-mount). `ResultTableFrameComponent` adds a private
    `pageRendered$ Subject` so the highlight only fires after the new page
    actually renders, not on an arbitrary 100ms timer. Cross-operator race,
    viewport-resize during page swap, columnDef-vs-header drift, destroyed-view
    NG0911 — all guarded.
  - After-Apply auto-rescan: the panel re-runs `/scan` against the new dataset
    version so users see real residue instead of stale entries.
  - Floating reopen icon when panel is closed & shield is ON. Click → fresh
    scan. Pipeline concurrency is gated by `currentPipeline: Promise<void> |
    null`: auto-trigger drops silently on slot conflict (spam suppression);
    user-initiated awaits the in-flight pipeline (with a "queuing your scan…"
    toast) — at most one `/scan` POST in flight at a time.
  - Toolbar 🛡 shield (per-workflow ON/OFF, localStorage).
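The single-flight pipeline gate described in the bullets above — auto-triggered scans drop on conflict, user-initiated scans queue behind the in-flight one — can be sketched as a promise slot. This is an assumption-laden illustration, not the production class:

```typescript
// Hypothetical sketch of the scan concurrency gate: at most one scan runs.
class ScanGate {
  private current: Promise<void> | null = null;

  // Returns true if the scan ran, false if an auto-trigger was dropped.
  async request(runScan: () => Promise<void>, userInitiated: boolean): Promise<boolean> {
    if (this.current && !userInitiated) return false;        // spam suppression
    while (this.current) await this.current.catch(() => {}); // queue user scans
    const run = runScan().finally(() => { this.current = null; });
    this.current = run;
    await run;
    return true;
  }
}
```

The `while` loop (rather than a single `await`) matters: if two user scans are queued, the second re-checks the slot after waking and waits again instead of starting concurrently.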

Backend: server normalize:false; `/dataguard/load-demo-dataset` deleted as
dead code (frontend never called it). DataGuardSession's `flaggedRows` field
plus the post-apply `acknowledgedIssues` split removed alongside `flag`.
`missing-detection.ts` is the single source of truth for missing/placeholder
checks; `applyFix` threads `ScanOptions.missingTokens` through to `impute`.
With-approval gates `warning` identically to `high` (never auto-allows, even
with a remembered rule).

Frontend service ownership: the auto-trigger orchestration subscription moved
off `AgentPanelComponent` onto `DataGuardChecklistComponent` (the natural
consumer of its output). `selectedCount`/`deniedCount`/`pendingCount` are
cached on each state push instead of being three filter walks per CD tick.

README_DataGuard.md is the consolidated feature spec, kept in repo root.

Tests: 199 pass / 0 fail (419 expect calls), agent-service typecheck clean,
frontend `tsc --noEmit` clean. New regression-locks include: `replace_value`
with `rowIndices` (no more byte-identical export), `inferIdColumn` for all
id-name patterns, "clustered large readings are NOT auto-outliers", warning
tier prompts even with remembered rule, modify-verdict + modifiedAction +
`{deny, remember:true}` rejection, pipeline serialization (two user-initiated
rescans never invoke `/scan` concurrently), per-row locate cycle math.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The demo CSVs (diabetes_messy + the six single-category fixtures) and their
README are kept on disk for local hand-testing but shouldn't live in upstream.
git rm --cached preserves the files locally; new .gitignore prevents
accidental re-staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removed the LiteLLM proxy / Python venv / bun-install setup instructions
from README_DataGuard.md. The flow walkthrough that lived under §14.3 is
preserved as the new §14 "End-to-end flow" so the doc still tells a user
what to expect after they click around. Downstream section numbers are
unchanged (§15 Testing → §19 Post-MVP follow-ups).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… enforcement

Substantial follow-up iteration on the locate UX, file-type coverage, and
contract hardening. All work on top of feat/dataguard-mvp commit 5cd0160.

## File-type coverage (auto-trigger dispatcher)

`loadFromOperatorFile` refactored into a parser-dispatcher (`PARSERS` map +
`DatasetParser` type). Three operator types now in the trigger set:
  - CSVFileScan / CSVOldFileScan → parseCsv (CSVOld threads customDelimiter
    from operator properties so `;` / `\t` / etc. are honored)
  - JSONLFileScan → parseJsonl (new module, nested-object flatten to
    dot-notation columns; arrays stringified as single cells; non-object
    lines warn-and-skip; CRLF tolerated; collision rule: nested-owned paths
    always win over literal-dotted top-level keys regardless of JSON source
    order, via two-pass scan)
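The flatten policy for a single JSONL line can be sketched as a short recursive walk — nested objects become dot-notation columns, arrays are stringified into one cell. This is an illustrative single-pass sketch; the two-pass collision rule from the bullet above (nested-owned paths winning over literal-dotted keys) is omitted for brevity:

```typescript
// Hypothetical sketch of the JSONL flatten described above.
function flattenJsonlLine(
  obj: Record<string, unknown>,
  prefix = "",
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [k, v] of Object.entries(obj)) {
    const key = prefix ? `${prefix}.${k}` : k;
    if (v !== null && typeof v === "object" && !Array.isArray(v)) {
      // Nested object → recurse into dot-notation columns.
      Object.assign(out, flattenJsonlLine(v as Record<string, unknown>, key));
    } else if (Array.isArray(v)) {
      out[key] = JSON.stringify(v); // arrays become one stringified cell
    } else {
      out[key] = v; // scalars (including null) pass through
    }
  }
  return out;
}

// flattenJsonlLine({user: {id: 7, tags: ["a"]}, score: null})
// → { "user.id": 7, "user.tags": '["a"]', score: null }
```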

Format-aware write-back: server-side GET /dataguard/export-jsonl serializes
the in-memory session back to JSONL (canonical column key order; `undefined`
→ `null` for lossless round-trip). Frontend writeBackAsNewVersion branches
on source operator type to pull the right export endpoint.

ParallelCSVFileScan dropped from the trigger set — Texera disables it in
the operator registry (LogicalOp.scala:171 commented out). One-line re-add
if/when re-enabled.

## Locate feature — split path

CSV operators (`CSVFileScan`, `CSVOldFileScan`) use the original simple
synchronous index-based locate. Texera runs them with a single worker so
display order matches file-byte source order; the index DataGuard computed
against parseCsv's output is correct as-is. Cursor advances synchronously,
navigate is fire-and-forget.

JSONLFileScan uses a new fingerprint + flash-confirmed path because Texera
parallelizes JSONL scans and shuffles display order:
  - `rowFingerprint(row, columns)` is byte-identical on agent-service and
    frontend: alphabetical column sort + per-cell `JSON.stringify(String(v))`
    + empty-string concat. The `String()` step before stringify is critical
    — Texera widens schema to string when a column has mixed types, so
    `JSON.stringify(45)` vs `JSON.stringify("45")` would mismatch without it.
  - DataQualityIssue carries `affectedRowKeys[]` 1:1 aligned with
    `affectedRowIndices[]`. Profiler emits both. Frontend `findRowByKey`
    scans display rows for the matching fingerprint, paginates up to 10
    pages, falls back silently to the index path if not found.
  - Result-table-frame `currentLocateToken` cancellation kills the rapid-
    click race — every async resumption checks the captured token and bails
    (emitting flashResult: false exactly once so the awaiter's Promise
    resolves rather than hanging).
  - `navigate()` returns Promise<boolean>; subscribes to flashResult$ via
    firstValueFrom(race(filtered, timer(36s))) BEFORE publishing to nav$,
    so synchronous fast-path emissions are caught.
  - Checklist component awaits the Promise and only advances `locateCursors`
    on flashed===true. Empty clicks (timeout / supersede / out-of-bounds)
    leave the cursor put, eliminating "skipped a row then jumped back" UX.
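The fingerprint contract from the first bullet above is small enough to sketch directly — alphabetical column sort, `String()` widening before `JSON.stringify`, empty-string concat. Function name and shape are illustrative:

```typescript
// Hypothetical sketch of the cross-service row fingerprint. The String()
// step is the critical part: Texera widens a mixed-type column to string,
// so 45 and "45" must fingerprint identically on both sides.
function rowFingerprint(row: Record<string, unknown>, columns: string[]): string {
  return [...columns]
    .sort()
    .map((c) => JSON.stringify(String(row[c])))
    .join("");
}

// rowFingerprint({score: 45, id: "U007"}, ["id", "score"]) ===
// rowFingerprint({score: "45", id: "U007"}, ["score", "id"])  // → true
```

Copying `columns` before sorting keeps the caller's column order intact, which matters when the same array also drives table rendering.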

Per-row cursor (`Map<issueId, number>`) survives benign state re-emits via
`purgeStaleCursors` (only ids not in the live entry set get evicted —
never wholesale clear, which previously corrupted cursors on benign setState
patches between clicks).

## Detector + ID-inference improvements

`inferIdColumn` heuristic for the auto-trigger's empty-body /scan now
recognizes dotted-notation names produced by JSONL flatten (`user.id`,
`customer.uid`, `nested.user.id`, `Account.ID` case-insensitive) in addition
to underscore variants (`sample_id`, `userId`, `id_card`, etc.). Without
this, dup-ID detection silently no-op'd on JSONL data.
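A hedged sketch of that inference: take the last segment of a dotted path, then accept bare `id`/`uid`, underscore variants, and camelCase `…Id`. The regexes below are illustrative, not the exact production patterns — note the camelCase branch is deliberately case-sensitive so words like `grid` or `valid` don't match:

```typescript
// Hypothetical sketch of inferIdColumn: returns the first ID-like column.
function inferIdColumn(columns: string[]): string | undefined {
  const isIdLike = (name: string): boolean => {
    const last = name.split(".").pop() ?? name; // user.id → id
    return (
      /^(id|uid)$/i.test(last) ||   // id, ID, uid, Account.ID
      /_id$/i.test(last) ||         // sample_id
      /^id_/i.test(last) ||         // id_card
      /[a-z0-9]Id$/.test(last)      // userId (camelCase, NOT "grid")
    );
  };
  return columns.find(isIdLike);
}
```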

## Contract enforcement

Apply-batch body schema strictly rejects:
  - verdict: "modify" (cut by #11a — silently lied to users)
  - modifiedAction field
  - {verdict: "deny", remember: true}
Elysia built with `normalize: false` so unknown legacy fields aren't silently
stripped before validation. Global onError converts VALIDATION to HTTP 400.
WS decision handler narrowed similarly.

## Test counts

  agent-service: typecheck clean / 217 pass / 0 fail / 457 expect
  frontend: tsc --noEmit clean
  frontend specs: 36 navigator + 15 jsonl + 2 checklist + 7 auto-trigger

## CONTRIBUTING compliance

  ✅ sbt scalafixAll --check
  ✅ sbt scalafmtCheckAll
  ✅ yarn format:ci (after yarn format:fix on 15 files)
  ✅ Apache license headers on all new files (.ts/.html/.scss)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s, no-op fix guard

- profile-dataset: gate auto-IQR behind `enableOutlierDetection` (default
  false). validRanges still fires unconditionally as a per-column override.
  IQR detector code preserved so re-enable is a single-flag flip after
  convergence behavior is fully refined.
- apply-fix replace_value: skip writes when the proposed replacement equals
  the current cell (cellEquals guard, both rowIndices and value-match
  branches). Fixes iterative-cleanup convergence — without the guard,
  v3 onwards re-applied byte-identical CSVs and LakeFS aborted the commit
  with "No changes detected."
- locate cycle: add `rowKeyOccurrence` + `findNthRowByKey` so clicking
  "locate" on duplicate-fingerprint rows cycles through all matches rather
  than always pinning to the first occurrence. Cumulative match counter
  surfaces "2 of 4" in the result panel.
Texera's JSONLScanSourceOpExec parses through Jackson; `JsonNullNode.asText()`
returns the literal string `"null"` (4 chars), not Java null. The result-panel
row for a source `{"score": null}` therefore arrives at the frontend as
`{score: "null"}` while the profiler preserved real JS null — fingerprints
diverged, `findRowByKey` returned -1 for every null cell, and the silent
byte-order index fallback in result-table-frame landed on a worker-shuffled
display row (e.g., U037 score=88 instead of U007 Grace score=null).

Fix: normalize all missing forms to a single bare `null` fingerprint token
on both sides — real null, undefined, NaN, empty/whitespace strings, the
literal string `"null"`, and the standard missing-token set (`na`, `n/a`,
`nan`, `none`), all case- and trim-insensitive. The profiler routes through
the shared `isCellMissing` predicate; the frontend mirrors with an inline
duplicate (no shared module across services).

Round-1..5 locate invariants intact: `String(v)` numeric coercion preserved
on the non-missing path, `rowKeyOccurrence` cycling untouched, CSV path
unaffected.
When the fingerprint walk exhausts every page without finding the target
row, result-table-frame used to silently fall back to highlighting the
file-byte-order index — which is right for single-worker CSV but wrong
for multi-worker JSONL (worker-shuffle lands an unrelated row at that
position). Earlier versions toasted on every miss and were too noisy,
but that was because the pre-`isCellMissing`-fingerprint contract had
been mismatching 100% of the time on null cells; now that round-6
normalizes null/"null"/missing forms across the wire, a real miss only
happens when the data has drifted from the scan.

The table frame now emits `flashed: false` on walk exhaustion; the
checklist caller toasts ("data may have changed since the scan — try
Scan again") only when a rowKey was sent, so the CSV no-rowKey path
stays silent and rapid-double-click supersession doesn't spam.

Tests: extend the existing "JSONL cursor stays put on false" spec to
assert the toast fires, and add a CSV-path test that asserts it does
NOT fire when no rowKey was sent.
After iterative Apply rounds normalize a string column to its canonical
form (e.g., `region: "South"` everywhere), the inconsistent_label detector
can keep re-flagging the column whenever the LLM proposes a mapping that
includes the canonical entry itself (`{south: "South"}` against rows that
are already `"South"`). The standardize branch was incrementing
`rowsAffected` on every mapping-key hit regardless of whether
`mapping[v] === v`, so the frontend pushed a byte-identical CSV/JSONL to
LakeFS and got "No changes detected in dataset. Version creation aborted"
— the same convergence failure the round-yesterday `cellEquals` guard
fixed for replace_value.

Add the same guard to `case "standardize"`. `affected` increments only
when the cell genuinely changes.

trim_whitespace already guards (`trimmed !== v`); impute is safe by
construction in the normal path (only missing cells are visited); the
all-missing-column edge case is a known follow-up and not in scope here.
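The standardize guard described above can be sketched as follows — a mapping entry only counts (and writes) when the mapped value actually differs, so an already-canonical column yields zero affected rows and the caller can skip the LakeFS version write entirely. Names are illustrative:

```typescript
// Hypothetical sketch of the standardize no-op guard: mutates rows in
// place and returns how many cells genuinely changed.
function applyStandardize(
  rows: Record<string, string>[],
  column: string,
  mapping: Record<string, string>,
): number {
  let affected = 0;
  for (const row of rows) {
    const v = row[column];
    const target = mapping[v];
    if (target !== undefined && target !== v) { // cellEquals-style guard
      row[column] = target;
      affected++;
    }
  }
  return affected;
}
```

A second pass with the same mapping (e.g. `{south: "South", South: "South"}` after round one canonicalized everything) returns 0, which is the signal to skip the version commit instead of pushing a byte-identical file.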
@github-actions github-actions Bot added feature frontend Changes related to the frontend GUI docs Changes related to documentations common agent-service labels May 16, 2026
@codecov-commenter

codecov-commenter commented May 16, 2026

Codecov Report

❌ Patch coverage is 73.10575% with 323 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.75%. Comparing base (7bb102e) to head (ad43d2e).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| agent-service/src/server.ts | 41.15% | 153 Missing ⚠️ |
| agent-service/src/agent/texera-agent.ts | 23.86% | 67 Missing ⚠️ |
| ...rvice/src/agent/tools/dataguard/dataguard-tools.ts | 46.77% | 66 Missing ⚠️ |
| ...ent-service/src/agent/tools/dataguard/apply-fix.ts | 90.90% | 14 Missing ⚠️ |
| ...nt-service/src/agent/tools/dataguard/bias-check.ts | 87.17% | 10 Missing ⚠️ |
| ...-service/src/agent/tools/dataguard/decision-log.ts | 87.23% | 6 Missing ⚠️ |
| ...ice/src/agent/tools/dataguard/dataguard-session.ts | 94.44% | 4 Missing ⚠️ |
| ...rvice/src/agent/tools/dataguard/profile-dataset.ts | 98.91% | 3 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5101      +/-   ##
============================================
+ Coverage     42.85%   43.75%   +0.90%     
- Complexity     2207     2208       +1     
============================================
  Files          1045     1054       +9     
  Lines         40146    41374    +1228     
  Branches       4240     4242       +2     
============================================
+ Hits          17203    18103     +900     
- Misses        21878    22198     +320     
- Partials       1065     1073       +8     
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| access-control-service | 39.53% <ø> (ø) | |
| agent-service | 44.03% <73.10%> (+10.30%) ⬆️ | |
| amber | 43.79% <ø> (+0.05%) ⬆️ | |
| computing-unit-managing-service | 0.00% <ø> (ø) | |
| config-service | 0.00% <ø> (ø) | |
| file-service | 32.18% <ø> (ø) | |
| frontend | 33.92% <ø> (-0.02%) ⬇️ | Carriedforward from 87c8744 |
| python | 88.99% <ø> (ø) | Carriedforward from 87c8744 |
| workflow-compiling-service | 56.81% <ø> (+9.09%) ⬆️ | |


- agent-service: prettier --write across 21 DataGuard files (CI was running
  format:check, not just format). Backend tests unchanged at 231/0/500.
- frontend: add Apache 2.0 headers to the two permission-prompt component
  partials that skywalking-eyes flagged.
@eugenegujing eugenegujing changed the title [Hackathon] feat: add DataGuard for automatic Data Cleaning feat(dataguard): permission-gated data cleaning agent for Texera May 16, 2026
@eugenegujing eugenegujing changed the title feat(dataguard): permission-gated data cleaning agent for Texera [Hackathon] feat: permission-gated data cleaning agent for Texera May 16, 2026