
[Hackathon] feat: Dataset Bank + Dataset Search Agent + Results Dashboard#5097

Open
EmilySun621 wants to merge 13 commits into apache:main from EmilySun621:hackathon/dataset-results

Conversation

@EmilySun621

The Story
A biomedical researcher wants to study diabetes. She opens Texera and thinks: "Where do I even find the right dataset?" She opens a new tab, googles around, downloads a CSV from UCI, uploads it to Texera, configures the file path manually. Twenty minutes gone before she's even started analyzing.
After she builds her workflow and runs it, the AI agent gives her a detailed model comparison — accuracy, F1 scores, key insights. But it's all buried in a long chat message she has to scroll through. She can't easily reference it, copy it, or share it with her advisor.
We fixed both problems.
What We Built

  1. Dataset Bank — Browse and Import Public Datasets Without Leaving Texera
    A new page in the sidebar where users browse a curated catalog of public datasets from UCI, Kaggle, and dkNET — searchable, categorized, and importable with one click.
    How it works:

Open "Dataset Bank" in the sidebar → see a grid of dataset cards
Search by name, description, or tag (e.g., "diabetes", "classification", "healthcare")
Filter by category: Biomedical, NLP, Computer Vision, Finance, Social Science, Time Series, Tabular
Every card shows: name, source badge (UCI/Kaggle/dkNET), description, row/column counts, file size, tags
Three actions per card:

🔗 View on source — opens the original dataset page so users can verify before importing
↓ Download — saves the file locally
☁ Import — imports directly into Texera's dataset system. One click, no manual upload, no file path configuration. The dataset immediately appears in "Your Datasets" and is ready for any workflow.

Backend: A server-side proxy (/api/databank/import-from-url) fetches the file and uploads it through Texera's existing dataset pipeline — bypassing browser CORS restrictions.
  2. Dataset Search Agent Tool — "Find Me a Diabetes Dataset"
    The AI agent can now search for datasets on your behalf during a conversation.

User asks: "find me a diabetes dataset" → agent calls search_datasets tool
Searches dkNET, UCI, and Kaggle in parallel, returns top results
Agent also knows your existing Texera datasets (injected into system prompt as "Your Datasets" section)
User says "use my iris dataset" → agent knows the exact file path and configures the CSV Source automatically

  3. Results Dashboard — Analysis Reports Outside the Chat
    When the AI agent produces a workflow analysis (model comparison, metrics, key findings), it now appears in a dedicated Results Dashboard panel instead of being buried in chat.

Agent wraps analysis in report markers → chat shows a compact card: "📊 Results ready · View Report →"
Clicking opens a floating Results Dashboard panel alongside the canvas
Dashboard renders formatted markdown: tables, headers, bold metrics, key insights
Copy button to clipboard, Export button to download as markdown
Timestamped so users know when the analysis was generated
Auto-updates when agent sends new analysis
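
The marker handling can be captured in a small extraction step. A sketch — the marker strings follow the `<!-- REPORT_START -->` / `<!-- REPORT_END -->` convention described in the commits below; the function name is hypothetical:

```typescript
const REPORT_RE = /<!--\s*REPORT_START\s*-->([\s\S]*?)<!--\s*REPORT_END\s*-->/;

// Split an agent message into the report body (rendered in the dashboard)
// and the remaining chat text (shown inline with a compact card).
function extractReport(message: string): { report: string | null; chatText: string } {
  const match = message.match(REPORT_RE);
  if (!match) return { report: null, chatText: message };
  return {
    report: match[1].trim(),
    chatText: message.replace(REPORT_RE, "📊 Results ready · View Report →").trim(),
  };
}
```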

The experience: Canvas on the left showing your DAG, Results Dashboard on the right showing your analysis, chat in between for interaction. Everything visible at once — no tab switching, no scrolling through messages.
Demo Scenario

Open Dataset Bank → search "diabetes" → filter "Biomedical" → see Pima Indians Diabetes dataset
Click "🔗 View on UCI" to verify → click "☁ Import" → dataset appears in Your Datasets
Open a workflow → ask the Diabetes Agent: "Build a classification workflow using my diabetes dataset"
Agent generates the workflow on canvas
Ask: "Run it and give me a comparison report"
Agent runs workflow, produces analysis → "📊 Results ready · View Report →"
Click → Results Dashboard opens with formatted model comparison table, winner, key insights
Copy the report to share with advisor

Files Changed
Dataset Bank (Frontend)

dashboard/component/user/dataset-bank/ — DatasetBankComponent (page, search, categories, cards)
dashboard/component/user/dataset-bank/dataset-bank.seed.ts — Curated seed of 20+ popular datasets
dashboard/service/dataset-bank/dataset-bank.service.ts — Fetch, filter, import logic

Dataset Search (Agent Service)

agent-service/src/agent/tools/dataset-search-tool.ts — search_datasets tool (dkNET + UCI + Kaggle)
agent-service/src/api/user-datasets-api.ts — Fetches user's existing datasets for prompt injection
agent-service/src/agent/prompts.ts — "Your Datasets" section in system prompt

Dataset Import Proxy (Agent Service)

agent-service/src/api/dataset-import-api.ts — Server-side fetch + Texera dataset upload pipeline
agent-service/src/server.ts — /api/databank router mount

Results Dashboard (Frontend + Agent Service)

workspace/component/results-dashboard-panel/ — Floating panel with markdown rendering, copy, export
workspace/service/agent-report/agent-report.service.ts — Report pub/sub between chat and panel
agent-service/src/agent/prompts.ts — Report marker convention instructions

Configuration

proxy.config.json — Dev proxy for /api/databank → agent-service
agent-service/src/config/env.ts — TEXERA_FILE_SERVICE_ENDPOINT for dataset operations

Testing

Angular build: clean ✅
agent-service typecheck: clean ✅
Dataset Import: tested with UCI Iris dataset — end-to-end success ✅
Results Dashboard: tested with agent-generated report — renders correctly ✅
Dataset search tool: registered and callable by agent ✅

Emily Sun and others added 10 commits May 15, 2026 21:55
This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
  chip-style guardrails, model selector). Each custom agent now carries a
  LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
  agent-service so different agents can use different models.

- Conversation history is scoped per (workflowId, agentId): switching
  agent or workflow yields a different conversation list. localStorage
  key: texera.workflowConversations.v1.{workflowId}.{agentId}.
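
The key format above can be captured in a tiny helper (a sketch; the real code may build the key inline):

```typescript
// Builds the per-(workflow, agent) localStorage key. The "v1" segment acts
// as a schema version: bumping it invalidates previously stored entries.
function conversationStorageKey(workflowId: string, agentId: string): string {
  return `texera.workflowConversations.v1.${workflowId}.${agentId}`;
}
```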

- Time machine: workflow snapshot list, revert, and agent-tagged
  checkpoints. New workflow-history-tool in agent-service backs the
  "undo my last change" flow; amber gains a WorkflowSnapshotResource;
  sql/updates/23.sql adds the snapshot table.

- Operator-aware custom-agent prompts: the system prompt now injects the
  full operator catalog with a "prefer built-in operators over Python
  UDFs" rule, sourced from WorkflowSystemMetadata at request time.

- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
  and gpt-5-mini in bin/litellm-config.yaml.

- Agent panel rewritten around the (conversation list / chat) two-view
  model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rompt

Adds a new agent tool that queries dkNET, UCI ML Repository, and Kaggle in
parallel and returns up to 5 results per source. Failures from individual
sources degrade gracefully so the rest still return. Kaggle is skipped when
KAGGLE_USERNAME / KAGGLE_KEY are not set.

Also fetches the user's accessible datasets via /api/dataset/list when an
agent is bound to a workflow (delegate config), and renders them in a "Your
Datasets" section of the system prompt with the path prefix a File Scan
operator would use to reference files in each one.
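
Rendering that prompt section might look roughly like this. Note the path-prefix format shown is a hypothetical illustration — the commit does not specify the exact prefix a File Scan operator uses:

```typescript
type UserDataset = { name: string };

// Renders the user's accessible datasets into a "Your Datasets" system-prompt
// section. The "/{owner}/{dataset}/v1/" prefix is an assumed illustration.
function renderYourDatasets(owner: string, datasets: UserDataset[]): string {
  if (datasets.length === 0) return "";
  const lines = datasets.map(
    (d) => `- ${d.name} (file path prefix: /${owner}/${d.name}/v1/)`,
  );
  return ["## Your Datasets", ...lines].join("\n");
}
```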

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new floating right-side panel that displays the most recent
agent-generated analysis report (model comparison tables, key metrics,
winner/recommendation, train-vs-test). The agent system prompt now instructs
the model to wrap structured result summaries in `<!-- REPORT_START -->` /
`<!-- REPORT_END -->` markers.

Flow:
- agent emits content wrapped in the markers
- agent-chat strips the marker block from inline rendering and shows a
  compact "Results ready — View Report" card in its place
- card click and new-report arrival both surface the Results Dashboard panel
- panel renders the markdown via ngx-markdown, with copy-to-clipboard and
  export-as-markdown buttons plus the generation timestamp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New page that lists popular public datasets from dkNET, UCI, and Kaggle with
search and category filters. Backed by a hardcoded seed of ~30 well-known
datasets (Iris, Titanic, MNIST, COCO, TCGA, …) so the page always has
content even when the live catalog APIs are unavailable.

DatasetBankService:
- BehaviorSubjects for search query and active category, plus a combined
  filteredDatasets$ stream the component subscribes to.
- Best-effort live refresh from dkNET + UCI on first visit; results merge
  with the seed (Kaggle is skipped in the browser — CORS/auth).
- Hour-long localStorage cache for the merged list.

DatasetBankComponent:
- Standalone component with title/subtitle, full-width search bar,
  horizontal category chips (All / Biomedical / NLP / CV / Finance /
  Social Science / Time Series / Tabular), and a responsive card grid
  with name, source badge, description, rows/cols/size/format stats,
  tags, and an Import button.
- Import currently opens the source download / catalog page in a new tab
  and surfaces a toast — backend wiring to copy the file into the user's
  Texera datasets is left as a stretch goal.

Route registered at /dashboard/user/dataset-bank with a "Dataset Bank"
sidebar link under Your Work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pload

Card actions now show two equal-width buttons side by side:

- **Download** (left, outline): opens the bank entry's direct download URL
  (or source catalog page) in a new tab — the previous behavior.
- **Import** (right, primary): fetches the file in-browser and registers it
  as a Texera dataset under the current user.

Import goes through the existing user-dataset upload pipeline:
  DatasetService.createDataset()          → new dataset metadata
  DatasetService.multipartUpload()        → stages the file via LakeFS
  DatasetService.createDatasetVersion()   → publishes as v1
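
That three-step flow, sketched as a single helper — the service method names come from the listing above, but the signatures shown are assumptions:

```typescript
// Sketch of the create → upload → publish pipeline. The injected `svc`
// stands in for DatasetService; exact parameter shapes are assumed.
async function importToTexera(
  svc: {
    createDataset(name: string, description: string): Promise<{ did: number }>;
    multipartUpload(did: number, fileName: string, file: Blob): Promise<void>;
    createDatasetVersion(did: number, versionName: string): Promise<void>;
  },
  name: string,
  description: string,
  fileName: string,
  file: Blob,
): Promise<number> {
  const { did } = await svc.createDataset(name, description); // new dataset metadata
  await svc.multipartUpload(did, fileName, file);             // stage the file via LakeFS
  await svc.createDatasetVersion(did, "v1");                  // publish as v1
  return did;
}
```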

The button reflects state per card: idle → "Importing…" (loading) →
"Imported" (disabled, ✓). Failures (most commonly CORS on the source fetch)
re-enable the button so the user can retry, and surface a clear toast
suggesting the Download fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POST /api/dataset-bank/import-from-url to agent-service. The endpoint
takes { url, name, description } plus a bearer token; server-side fetches
the source file (no browser CORS), then drives the existing dashboard
endpoints with the caller's token:

  /api/dataset/create
  /api/dataset/multipart-upload?type=init   → returns missingParts[]
  /api/dataset/multipart-upload/part        (per chunk, 5 MB)
  /api/dataset/multipart-upload?type=finish
  /api/dataset/{did}/version/create  body "v1"

Returns { did, datasetName, fileName, fileSize }.
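
The 5 MB chunking step can be sketched as a pure helper (hypothetical name; the real endpoint drives the upload parts through the dashboard endpoints listed above):

```typescript
const PART_SIZE = 5 * 1024 * 1024; // 5 MB per multipart-upload part

// Split a server-side fetched file into upload parts. Part numbers are
// 1-based, matching common multipart-upload conventions.
function splitIntoParts(data: Uint8Array): { partNumber: number; chunk: Uint8Array }[] {
  const parts: { partNumber: number; chunk: Uint8Array }[] = [];
  for (let offset = 0; offset < data.length; offset += PART_SIZE) {
    parts.push({
      partNumber: parts.length + 1,
      chunk: data.subarray(offset, offset + PART_SIZE),
    });
  }
  return parts;
}
```

`subarray` creates views rather than copies, so a large download is not duplicated in memory while being chunked.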

DatasetBankService.importToTexera() now calls this proxy instead of
doing the browser-side fetch + multipart upload itself; the per-card
Import button flow on the Dataset Bank page is unchanged from the user's
perspective (idle → Importing… → ✓ Imported), but actually succeeds for
catalogs that don't send CORS headers (UCI, Kaggle direct downloads, etc.).

The Angular proxy.config.json routes /api/dataset-bank/* to localhost:3001
in dev so the existing relative-URL pattern keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/dataset/* routes live in file-service (port 9092 per
file-service-web-config.yaml), not amber/dashboard (8080). The new
dataset-bank import proxy and the Feature-1 user-dataset list fetch
were both hitting 8080 and getting 404.

Adds TEXERA_FILE_SERVICE_ENDPOINT to env (default http://localhost:9092)
and exposes it on BackendConfig.fileServiceEndpoint. Both call sites now
read from there.
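
A minimal sketch of that endpoint resolution, illustrating the documented default (the helper name is hypothetical):

```typescript
// /api/dataset/* routes live on file-service (9092), not the dashboard
// service (8080), so the agent-service reads a dedicated endpoint variable.
function fileServiceEndpoint(env: Record<string, string | undefined>): string {
  return env.TEXERA_FILE_SERVICE_ENDPOINT ?? "http://localhost:9092";
}
```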

Also logs the exact URL + upstream status/body at every step of the import
pipeline so future endpoint drift is obvious from the agent-service logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Card footer now has three actions side by side:

  [🔗 View on UCI]   [↓ Download]   [☁ Import]

The View link is a plain anchor (always visible, not on hover) opening
the dataset's catalog page in a new tab so users can verify the source
before importing. Layout is a 1.4fr / 1fr / 1fr grid that gives the
"View" pill room for the longer label without crowding Download or Import.

Also fixed the Human Protein Atlas seed entry to actually point at the
dkNET catalog (RRID:SCR_006710) instead of proteinatlas.org.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the engine, ddl-change, frontend, dev, common, and agent-service labels on May 16, 2026
Emily Sun and others added 3 commits May 16, 2026 19:28
PubMed (live search) — when the user types a query of 3+ characters in
the Dataset Bank search box, DatasetBankService debounces 400ms then hits
NCBI eSearch + eFetch directly (NCBI sends CORS headers). Each returned
paper appears as a card with title, abstract, authors, journal, year.
Source badge is green "PubMed". Importing a paper sends its PMID to the
backend proxy, which re-fetches via eFetch server-side and emits a 1-row
CSV with columns (pmid, title, abstract, authors, journal, year).

WHO Global Health Observatory — 5 hardcoded seed entries with real GHO
indicator codes:
  - Life Expectancy at Birth      (WHOSIS_000001)
  - HIV Prevalence Adults 15-49   (HIV_0000000001)
  - Tuberculosis Incidence        (MDG_0000000020)
  - Malaria Estimated Deaths      (MALARIA_EST_DEATHS)
  - Under-Five Mortality Rate     (MDG_0000000007)

Source badge is geekblue "WHO". Import fetches the GHO indicator API
(https://ghoapi.azureedge.net/api/<indicator>) server-side and converts
the rows into a (country, year, sex, numeric_value, value) CSV.
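
The row-to-CSV conversion might look like this — the field names follow the GHO OData response shape as an assumption, and quoting is omitted for brevity:

```typescript
type GhoRow = {
  SpatialDim: string;    // country code
  TimeDim: number;       // year
  Dim1?: string;         // sex dimension, when present
  NumericValue: number;
  Value: string;
};

// Converts GHO indicator API rows into the (country, year, sex,
// numeric_value, value) CSV described above.
function ghoRowsToCsv(rows: GhoRow[]): string {
  const header = "country,year,sex,numeric_value,value";
  const lines = rows.map(
    (r) => `${r.SpatialDim},${r.TimeDim},${r.Dim1 ?? ""},${r.NumericValue},${r.Value}`,
  );
  return [header, ...lines].join("\n");
}
```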

New "Public Health" category chip groups WHO entries (and biomedical
seeds that touch population health) for filtering.

Backend proxy refactor: the existing /api/dataset-bank/import-from-url
now accepts a sourceType discriminator. "url" (default) keeps the
existing fetch-arbitrary-URL behavior. "pubmed" and "who" each fetch
their canonical API server-side, build a CSV, then feed the shared
createDataset → multipart-upload → createDatasetVersion pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Angular dev-server proxy was misrouting requests to /api/dataset-bank/*
because the catch-all /api/dataset rule (file-service, port 9092) shares a
common string prefix with /api/dataset-bank and was winning the proxy match
race despite the more-specific rule being declared first.

Avoid the collision by giving the agent-service endpoint a distinct path.
Component/directory names remain `dataset-bank` (it's the user-facing page
identity); only the HTTP path changes:

  proxy.config.json:   "/api/databank" → http://localhost:3001
  agent-service:       new Elysia({ prefix: "/databank" })
  frontend service:    POST /api/databank/import-from-url
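
For illustration, the proxy rules after the rename might look roughly like this (option set abbreviated; webpack-dev-server matches on path prefixes, which is why the shared `/api/dataset` prefix collided before the rename):

```json
{
  "/api/databank": { "target": "http://localhost:3001" },
  "/api/dataset": { "target": "http://localhost:9092" }
}
```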

A dev-server restart is required when proxy.config.json changes, since
webpack-dev-server does not hot-reload it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last comment site that still mentioned the pre-rename /api/dataset-bank
path. No behavior change — the actual http.post() call already used the
new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>