
[Hackathon] feat: Dataset Bank + Dataset Search Agent + Results Dashboard#5097

Open
EmilySun621 wants to merge 13 commits into apache:main from EmilySun621:hackathon/dataset-results

Conversation

@EmilySun621

The Story
A biomedical researcher wants to study diabetes. She opens Texera and thinks: "Where do I even find the right dataset?" She opens a new tab, googles around, downloads a CSV from UCI, uploads it to Texera, configures the file path manually. Twenty minutes gone before she's even started analyzing.
After she builds her workflow and runs it, the AI agent gives her a detailed model comparison — accuracy, F1 scores, key insights. But it's all buried in a long chat message she has to scroll through. She can't easily reference it, copy it, or share it with her advisor.
We fixed both problems.
What We Built

  1. Dataset Bank — Browse and Import Public Datasets Without Leaving Texera
    A new page in the sidebar where users browse a curated catalog of public datasets from UCI, Kaggle, and dkNET — searchable, categorized, and importable with one click.
    How it works:

Open "Dataset Bank" in the sidebar → see a grid of dataset cards
Search by name, description, or tag (e.g., "diabetes", "classification", "healthcare")
Filter by category: Biomedical, NLP, Computer Vision, Finance, Social Science, Time Series, Tabular
Every card shows: name, source badge (UCI/Kaggle/dkNET), description, row/column counts, file size, tags
Three actions per card:

🔗 View on source — opens the original dataset page so users can verify before importing
↓ Download — saves the file locally
☁ Import — imports directly into Texera's dataset system. One click, no manual upload, no file path configuration. The dataset immediately appears in "Your Datasets" and is ready for any workflow.

Backend: A server-side proxy (/api/databank/import-from-url) fetches the file and uploads it through Texera's existing dataset pipeline — bypassing browser CORS restrictions.
  2. Dataset Search Agent Tool — "Find Me a Diabetes Dataset"
    The AI agent can now search for datasets on your behalf during a conversation.

User asks: "find me a diabetes dataset" → agent calls search_datasets tool
Searches dkNET, UCI, and Kaggle in parallel, returns top results
Agent also knows your existing Texera datasets (injected into system prompt as "Your Datasets" section)
User says "use my iris dataset" → agent knows the exact file path and configures the CSV Source automatically

  3. Results Dashboard — Analysis Reports Outside the Chat
    When the AI agent produces a workflow analysis (model comparison, metrics, key findings), it now appears in a dedicated Results Dashboard panel instead of being buried in chat.

Agent wraps analysis in report markers → chat shows a compact card: "📊 Results ready · View Report →"
Clicking opens a floating Results Dashboard panel alongside the canvas
Dashboard renders formatted markdown: tables, headers, bold metrics, key insights
Copy button to clipboard, Export button to download as markdown
Timestamped so users know when the analysis was generated
Auto-updates when agent sends new analysis
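
The marker handling can be captured in a small extraction step. A sketch — the marker strings follow the `<!-- REPORT_START -->` / `<!-- REPORT_END -->` convention described in the commits below; the function name is hypothetical:

```typescript
const REPORT_RE = /<!--\s*REPORT_START\s*-->([\s\S]*?)<!--\s*REPORT_END\s*-->/;

// Split an agent message into the report body (rendered in the dashboard)
// and the remaining chat text (shown inline with a compact card).
function extractReport(message: string): { report: string | null; chatText: string } {
  const match = message.match(REPORT_RE);
  if (!match) return { report: null, chatText: message };
  return {
    report: match[1].trim(),
    chatText: message.replace(REPORT_RE, "📊 Results ready · View Report →").trim(),
  };
}
```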

The experience: Canvas on the left showing your DAG, Results Dashboard on the right showing your analysis, chat in between for interaction. Everything visible at once — no tab switching, no scrolling through messages.
Demo Scenario

Open Dataset Bank → search "diabetes" → filter "Biomedical" → see Pima Indians Diabetes dataset
Click "🔗 View on UCI" to verify → click "☁ Import" → dataset appears in Your Datasets
Open a workflow → ask the Diabetes Agent: "Build a classification workflow using my diabetes dataset"
Agent generates the workflow on canvas
Ask: "Run it and give me a comparison report"
Agent runs workflow, produces analysis → "📊 Results ready · View Report →"
Click → Results Dashboard opens with formatted model comparison table, winner, key insights
Copy the report to share with advisor

Files Changed
Dataset Bank (Frontend)

dashboard/component/user/dataset-bank/ — DatasetBankComponent (page, search, categories, cards)
dashboard/component/user/dataset-bank/dataset-bank.seed.ts — Curated seed of 20+ popular datasets
dashboard/service/dataset-bank/dataset-bank.service.ts — Fetch, filter, import logic

Dataset Search (Agent Service)

agent-service/src/agent/tools/dataset-search-tool.ts — search_datasets tool (dkNET + UCI + Kaggle)
agent-service/src/api/user-datasets-api.ts — Fetches user's existing datasets for prompt injection
agent-service/src/agent/prompts.ts — "Your Datasets" section in system prompt

Dataset Import Proxy (Agent Service)

agent-service/src/api/dataset-import-api.ts — Server-side fetch + Texera dataset upload pipeline
agent-service/src/server.ts — /api/databank router mount

Results Dashboard (Frontend + Agent Service)

workspace/component/results-dashboard-panel/ — Floating panel with markdown rendering, copy, export
workspace/service/agent-report/agent-report.service.ts — Report pub/sub between chat and panel
agent-service/src/agent/prompts.ts — Report marker convention instructions

Configuration

proxy.config.json — Dev proxy for /api/databank → agent-service
agent-service/src/config/env.ts — TEXERA_FILE_SERVICE_ENDPOINT for dataset operations

Testing

Angular build: clean ✅
agent-service typecheck: clean ✅
Dataset Import: tested with UCI Iris dataset — end-to-end success ✅
Results Dashboard: tested with agent-generated report — renders correctly ✅
Dataset search tool: registered and callable by agent ✅

Emily Sun and others added 10 commits May 15, 2026 21:55
This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
  chip-style guardrails, model selector). Each custom agent now carries a
  LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
  agent-service so different agents can use different models.

- Conversation history is scoped per (workflowId, agentId): switching
  agent or workflow yields a different conversation list. localStorage
  key: texera.workflowConversations.v1.{workflowId}.{agentId}.
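
The key format above can be captured in a tiny helper (a sketch; the real code may build the key inline):

```typescript
// Builds the per-(workflow, agent) localStorage key. The "v1" segment acts
// as a schema version: bumping it invalidates previously stored entries.
function conversationStorageKey(workflowId: string, agentId: string): string {
  return `texera.workflowConversations.v1.${workflowId}.${agentId}`;
}
```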

- Time machine: workflow snapshot list, revert, and agent-tagged
  checkpoints. New workflow-history-tool in agent-service backs the
  "undo my last change" flow; amber gains a WorkflowSnapshotResource;
  sql/updates/23.sql adds the snapshot table.

- Operator-aware custom-agent prompts: the system prompt now injects the
  full operator catalog with a "prefer built-in operators over Python
  UDFs" rule, sourced from WorkflowSystemMetadata at request time.

- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
  and gpt-5-mini in bin/litellm-config.yaml.

- Agent panel rewritten around the (conversation list / chat) two-view
  model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rompt

Adds a new agent tool that queries dkNET, UCI ML Repository, and Kaggle in
parallel and returns up to 5 results per source. Failures from individual
sources degrade gracefully so the rest still return. Kaggle is skipped when
KAGGLE_USERNAME / KAGGLE_KEY are not set.

Also fetches the user's accessible datasets via /api/dataset/list when an
agent is bound to a workflow (delegate config), and renders them in a "Your
Datasets" section of the system prompt with the path prefix a File Scan
operator would use to reference files in each one.
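
Rendering that prompt section might look roughly like this. Note the path-prefix format shown is a hypothetical illustration — the commit does not specify the exact prefix a File Scan operator uses:

```typescript
type UserDataset = { name: string };

// Renders the user's accessible datasets into a "Your Datasets" system-prompt
// section. The "/{owner}/{dataset}/v1/" prefix is an assumed illustration.
function renderYourDatasets(owner: string, datasets: UserDataset[]): string {
  if (datasets.length === 0) return "";
  const lines = datasets.map(
    (d) => `- ${d.name} (file path prefix: /${owner}/${d.name}/v1/)`,
  );
  return ["## Your Datasets", ...lines].join("\n");
}
```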

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new floating right-side panel that displays the most recent
agent-generated analysis report (model comparison tables, key metrics,
winner/recommendation, train-vs-test). The agent system prompt now instructs
the model to wrap structured result summaries in `<!-- REPORT_START -->` /
`<!-- REPORT_END -->` markers.

Flow:
- agent emits content wrapped in the markers
- agent-chat strips the marker block from inline rendering and shows a
  compact "Results ready — View Report" card in its place
- card click and new-report arrival both surface the Results Dashboard panel
- panel renders the markdown via ngx-markdown, with copy-to-clipboard and
  export-as-markdown buttons plus the generation timestamp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New page that lists popular public datasets from dkNET, UCI, and Kaggle with
search and category filters. Backed by a hardcoded seed of ~30 well-known
datasets (Iris, Titanic, MNIST, COCO, TCGA, …) so the page always has
content even when the live catalog APIs are unavailable.

DatasetBankService:
- BehaviorSubjects for search query and active category, plus a combined
  filteredDatasets$ stream the component subscribes to.
- Best-effort live refresh from dkNET + UCI on first visit; results merge
  with the seed (Kaggle is skipped in the browser — CORS/auth).
- Hour-long localStorage cache for the merged list.

DatasetBankComponent:
- Standalone component with title/subtitle, full-width search bar,
  horizontal category chips (All / Biomedical / NLP / CV / Finance /
  Social Science / Time Series / Tabular), and a responsive card grid
  with name, source badge, description, rows/cols/size/format stats,
  tags, and an Import button.
- Import currently opens the source download / catalog page in a new tab
  and surfaces a toast — backend wiring to copy the file into the user's
  Texera datasets is left as a stretch goal.

Route registered at /dashboard/user/dataset-bank with a "Dataset Bank"
sidebar link under Your Work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pload

Card actions now show two equal-width buttons side by side:

- **Download** (left, outline): opens the bank entry's direct download URL
  (or source catalog page) in a new tab — the previous behavior.
- **Import** (right, primary): fetches the file in-browser and registers it
  as a Texera dataset under the current user.

Import goes through the existing user-dataset upload pipeline:
  DatasetService.createDataset()          → new dataset metadata
  DatasetService.multipartUpload()        → stages the file via LakeFS
  DatasetService.createDatasetVersion()   → publishes as v1
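
That three-step flow, sketched as a single helper — the service method names come from the listing above, but the signatures shown are assumptions:

```typescript
// Sketch of the create → upload → publish pipeline. The injected `svc`
// stands in for DatasetService; exact parameter shapes are assumed.
async function importToTexera(
  svc: {
    createDataset(name: string, description: string): Promise<{ did: number }>;
    multipartUpload(did: number, fileName: string, file: Blob): Promise<void>;
    createDatasetVersion(did: number, versionName: string): Promise<void>;
  },
  name: string,
  description: string,
  fileName: string,
  file: Blob,
): Promise<number> {
  const { did } = await svc.createDataset(name, description); // new dataset metadata
  await svc.multipartUpload(did, fileName, file);             // stage the file via LakeFS
  await svc.createDatasetVersion(did, "v1");                  // publish as v1
  return did;
}
```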

The button reflects state per card: idle → "Importing…" (loading) →
"Imported" (disabled, ✓). Failures (most commonly CORS on the source fetch)
re-enable the button so the user can retry, and surface a clear toast
suggesting the Download fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POST /api/dataset-bank/import-from-url to agent-service. The endpoint
takes { url, name, description } plus a bearer token; server-side fetches
the source file (no browser CORS), then drives the existing dashboard
endpoints with the caller's token:

  /api/dataset/create
  /api/dataset/multipart-upload?type=init   → returns missingParts[]
  /api/dataset/multipart-upload/part        (per chunk, 5 MB)
  /api/dataset/multipart-upload?type=finish
  /api/dataset/{did}/version/create  body "v1"

Returns { did, datasetName, fileName, fileSize }.
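
The 5 MB chunking step can be sketched as a pure helper (hypothetical name; the real endpoint drives the upload parts through the dashboard endpoints listed above):

```typescript
const PART_SIZE = 5 * 1024 * 1024; // 5 MB per multipart-upload part

// Split a server-side fetched file into upload parts. Part numbers are
// 1-based, matching common multipart-upload conventions.
function splitIntoParts(data: Uint8Array): { partNumber: number; chunk: Uint8Array }[] {
  const parts: { partNumber: number; chunk: Uint8Array }[] = [];
  for (let offset = 0; offset < data.length; offset += PART_SIZE) {
    parts.push({
      partNumber: parts.length + 1,
      chunk: data.subarray(offset, offset + PART_SIZE),
    });
  }
  return parts;
}
```

`subarray` creates views rather than copies, so a large download is not duplicated in memory while being chunked.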

DatasetBankService.importToTexera() now calls this proxy instead of
doing the browser-side fetch + multipart upload itself; the per-card
Import button flow on the Dataset Bank page is unchanged from the user's
perspective (idle → Importing… → ✓ Imported), but actually succeeds for
catalogs that don't send CORS headers (UCI, Kaggle direct downloads, etc.).

The Angular proxy.config.json routes /api/dataset-bank/* to localhost:3001
in dev so the existing relative-URL pattern keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/dataset/* routes live in file-service (port 9092 per
file-service-web-config.yaml), not amber/dashboard (8080). The new
dataset-bank import proxy and the Feature-1 user-dataset list fetch
were both hitting 8080 and getting 404.

Adds TEXERA_FILE_SERVICE_ENDPOINT to env (default http://localhost:9092)
and exposes it on BackendConfig.fileServiceEndpoint. Both call sites now
read from there.
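
A minimal sketch of that endpoint resolution, illustrating the documented default (the helper name is hypothetical):

```typescript
// /api/dataset/* routes live on file-service (9092), not the dashboard
// service (8080), so the agent-service reads a dedicated endpoint variable.
function fileServiceEndpoint(env: Record<string, string | undefined>): string {
  return env.TEXERA_FILE_SERVICE_ENDPOINT ?? "http://localhost:9092";
}
```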

Also logs the exact URL + upstream status/body at every step of the import
pipeline so future endpoint drift is obvious from the agent-service logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Card footer now has three actions side by side:

  [🔗 View on UCI]   [↓ Download]   [☁ Import]

The View link is a plain anchor (always visible, not on hover) opening
the dataset's catalog page in a new tab so users can verify the source
before importing. Layout is a 1.4fr / 1fr / 1fr grid that gives the
"View" pill room for the longer label without crowding Download or Import.

Also fixed the Human Protein Atlas seed entry to actually point at the
dkNET catalog (RRID:SCR_006710) instead of proteinatlas.org.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the engine, ddl-change, frontend, dev, common, and agent-service labels on May 16, 2026
Emily Sun and others added 3 commits May 16, 2026 19:28
PubMed (live search) — when the user types a query of 3+ characters in
the Dataset Bank search box, DatasetBankService debounces 400ms then hits
NCBI eSearch + eFetch directly (NCBI sends CORS headers). Each returned
paper appears as a card with title, abstract, authors, journal, year.
Source badge is green "PubMed". Importing a paper sends its PMID to the
backend proxy, which re-fetches via eFetch server-side and emits a 1-row
CSV with columns (pmid, title, abstract, authors, journal, year).

WHO Global Health Observatory — 5 hardcoded seed entries with real GHO
indicator codes:
  - Life Expectancy at Birth      (WHOSIS_000001)
  - HIV Prevalence Adults 15-49   (HIV_0000000001)
  - Tuberculosis Incidence        (MDG_0000000020)
  - Malaria Estimated Deaths      (MALARIA_EST_DEATHS)
  - Under-Five Mortality Rate     (MDG_0000000007)

Source badge is geekblue "WHO". Import fetches the GHO indicator API
(https://ghoapi.azureedge.net/api/<indicator>) server-side and converts
the rows into a (country, year, sex, numeric_value, value) CSV.
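
The row-to-CSV conversion might look like this — the field names follow the GHO OData response shape as an assumption, and quoting is omitted for brevity:

```typescript
type GhoRow = {
  SpatialDim: string;    // country code
  TimeDim: number;       // year
  Dim1?: string;         // sex dimension, when present
  NumericValue: number;
  Value: string;
};

// Converts GHO indicator API rows into the (country, year, sex,
// numeric_value, value) CSV described above.
function ghoRowsToCsv(rows: GhoRow[]): string {
  const header = "country,year,sex,numeric_value,value";
  const lines = rows.map(
    (r) => `${r.SpatialDim},${r.TimeDim},${r.Dim1 ?? ""},${r.NumericValue},${r.Value}`,
  );
  return [header, ...lines].join("\n");
}
```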

New "Public Health" category chip groups WHO entries (and biomedical
seeds that touch population health) for filtering.

Backend proxy refactor: the existing /api/dataset-bank/import-from-url
now accepts a sourceType discriminator. "url" (default) keeps the
existing fetch-arbitrary-URL behavior. "pubmed" and "who" each fetch
their canonical API server-side, build a CSV, then feed the shared
createDataset → multipart-upload → createDatasetVersion pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Angular dev-server proxy was misrouting requests to /api/dataset-bank/*
because the catch-all /api/dataset rule (file-service, port 9092) shares a
common string prefix with /api/dataset-bank and was winning the proxy match
race despite the more-specific rule being declared first.

Avoid the collision by giving the agent-service endpoint a distinct path.
Component/directory names remain `dataset-bank` (it's the user-facing page
identity); only the HTTP path changes:

  proxy.config.json:   "/api/databank" → http://localhost:3001
  agent-service:       new Elysia({ prefix: "/databank" })
  frontend service:    POST /api/databank/import-from-url
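
For illustration, the proxy rules after the rename might look roughly like this (option set abbreviated; webpack-dev-server matches on path prefixes, which is why the shared `/api/dataset` prefix collided before the rename):

```json
{
  "/api/databank": { "target": "http://localhost:3001" },
  "/api/dataset": { "target": "http://localhost:9092" }
}
```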

A dev-server restart is required when proxy.config.json changes, since
webpack-dev-server does not hot-reload it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last comment site that still mentioned the pre-rename /api/dataset-bank
path. No behavior change — the actual http.post() call already used the
new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>