[Hackathon] feat: Dataset Bank + Dataset Search Agent + Results Dashboard#5097
Open
EmilySun621 wants to merge 13 commits into
Conversation
This bundles the feature work that built up on this branch:
- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
chip-style guardrails, model selector). Each custom agent now carries a
LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
agent-service so different agents can use different models.
- Conversation history is scoped per (workflowId, agentId): switching
agent or workflow yields a different conversation list. localStorage
key: texera.workflowConversations.v1.{workflowId}.{agentId}.
- Time machine: workflow snapshot list, revert, and agent-tagged
checkpoints. New workflow-history-tool in agent-service backs the
"undo my last change" flow; amber gains a WorkflowSnapshotResource;
sql/updates/23.sql adds the snapshot table.
- Operator-aware custom-agent prompts: the system prompt now injects the
full operator catalog with a "prefer built-in operators over Python
UDFs" rule, sourced from WorkflowSystemMetadata at request time.
- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
and gpt-5-mini in bin/litellm-config.yaml.
- Agent panel rewritten around the (conversation list / chat) two-view
model with subscription-managed list reloads and per-step persistence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
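The per-(workflowId, agentId) conversation scoping above can be sketched as a small key helper (names are illustrative, not the actual service code; only the key scheme comes from the commit):

```typescript
// Hypothetical helper mirroring the storage key scheme described above:
// texera.workflowConversations.v1.{workflowId}.{agentId}
const STORAGE_PREFIX = "texera.workflowConversations.v1";

function conversationKey(workflowId: number, agentId: string): string {
  return `${STORAGE_PREFIX}.${workflowId}.${agentId}`;
}

// Switching agent or workflow yields a different key, and therefore
// a different conversation list in localStorage.
```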
…rompt

Adds a new agent tool that queries dkNET, UCI ML Repository, and Kaggle in parallel and returns up to 5 results per source. Failures from individual sources degrade gracefully so the rest still return. Kaggle is skipped when KAGGLE_USERNAME / KAGGLE_KEY are not set.

Also fetches the user's accessible datasets via /api/dataset/list when an agent is bound to a workflow (delegate config), and renders them in a "Your Datasets" section of the system prompt with the path prefix a File Scan operator would use to reference files in each one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
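The parallel, failure-tolerant fan-out described above can be sketched with Promise.allSettled (the per-source fetcher type and function names here are hypothetical placeholders, not the tool's real API):

```typescript
// Sketch: query several catalog sources in parallel; a source that throws
// contributes nothing while the others still return (graceful degradation).
type DatasetHit = { source: string; name: string; url: string };
type SourceFetcher = (query: string) => Promise<DatasetHit[]>;

async function searchAllSources(
  query: string,
  sources: SourceFetcher[],
  perSourceLimit = 5 // "up to 5 results per source"
): Promise<DatasetHit[]> {
  const settled = await Promise.allSettled(sources.map(fetch => fetch(query)));
  return settled.flatMap(result =>
    result.status === "fulfilled" ? result.value.slice(0, perSourceLimit) : []
  );
}
```

A source such as Kaggle can simply be left out of the sources array when its credentials are absent, matching the skip-when-unset behavior above.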
Adds a new floating right-side panel that displays the most recent agent-generated analysis report (model comparison tables, key metrics, winner/recommendation, train-vs-test). The agent system prompt now instructs the model to wrap structured result summaries in `<!-- REPORT_START -->` / `<!-- REPORT_END -->` markers.

Flow:
- agent emits content wrapped in the markers
- agent-chat strips the marker block from inline rendering and shows a compact "Results ready — View Report" card in its place
- card click and new-report arrival both surface the Results Dashboard panel
- panel renders the markdown via ngx-markdown, with copy-to-clipboard and export-as-markdown buttons plus the generation timestamp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
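The marker stripping on the chat side can be sketched like this (the regex and function names are illustrative, not the component's actual code):

```typescript
// Pull a <!-- REPORT_START --> ... <!-- REPORT_END --> block out of an agent
// message: the chat renders `inline` plus a "View Report" card in place of the
// block, while the panel renders `report` as markdown.
const REPORT_RE = /<!-- REPORT_START -->([\s\S]*?)<!-- REPORT_END -->/;

function extractReport(message: string): { inline: string; report: string | null } {
  const match = message.match(REPORT_RE);
  if (match === null) return { inline: message, report: null };
  return {
    inline: message.replace(REPORT_RE, "").trim(),
    report: match[1].trim(),
  };
}
```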
This reverts commit 76e87ed.
New page that lists popular public datasets from dkNET, UCI, and Kaggle with search and category filters. Backed by a hardcoded seed of ~30 well-known datasets (Iris, Titanic, MNIST, COCO, TCGA, …) so the page always has content even when the live catalog APIs are unavailable.

DatasetBankService:
- BehaviorSubjects for search query and active category, plus a combined filteredDatasets$ stream the component subscribes to.
- Best-effort live refresh from dkNET + UCI on first visit; results merge with the seed (Kaggle is skipped in the browser — CORS/auth).
- Hour-long localStorage cache for the merged list.

DatasetBankComponent:
- Standalone component with title/subtitle, full-width search bar, horizontal category chips (All / Biomedical / NLP / CV / Finance / Social Science / Time Series / Tabular), and a responsive card grid with name, source badge, description, rows/cols/size/format stats, tags, and an Import button.
- Import currently opens the source download / catalog page in a new tab and surfaces a toast — backend wiring to copy the file into the user's Texera datasets is left as a stretch goal.

Route registered at /dashboard/user/dataset-bank with a "Dataset Bank" sidebar link under Your Work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
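The combined query-plus-category filtering that the filteredDatasets$ stream emits can be sketched as a plain function (the entry shape and names are illustrative, not the service's actual model):

```typescript
// Sketch of the Dataset Bank filter: an entry passes when it matches the
// active category chip AND the search query (against name, description, tags).
interface BankEntry {
  name: string;
  description: string;
  category: string;
  tags: string[];
}

function filterDatasets(entries: BankEntry[], query: string, category: string): BankEntry[] {
  const q = query.trim().toLowerCase();
  return entries.filter(
    e =>
      (category === "All" || e.category === category) &&
      (q === "" ||
        e.name.toLowerCase().includes(q) ||
        e.description.toLowerCase().includes(q) ||
        e.tags.some(t => t.toLowerCase().includes(q)))
  );
}
```

In the service itself this would sit behind the two BehaviorSubjects, recomputing whenever either the query or the category changes.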
…pload

Card actions now show two equal-width buttons side by side:
- **Download** (left, outline): opens the bank entry's direct download URL (or source catalog page) in a new tab — the previous behavior.
- **Import** (right, primary): fetches the file in-browser and registers it as a Texera dataset under the current user.

Import goes through the existing user-dataset upload pipeline:
DatasetService.createDataset() → new dataset metadata
DatasetService.multipartUpload() → stages the file via LakeFS
DatasetService.createDatasetVersion() → publishes as v1

The button reflects state per card: idle → "Importing…" (loading) → "Imported" (disabled, ✓). Failures (most commonly CORS on the source fetch) re-enable the button so the user can retry, and surface a clear toast suggesting the Download fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POST /api/dataset-bank/import-from-url to agent-service. The endpoint
takes { url, name, description } plus a bearer token; server-side fetches
the source file (no browser CORS), then drives the existing dashboard
endpoints with the caller's token:
/api/dataset/create
/api/dataset/multipart-upload?type=init → returns missingParts[]
/api/dataset/multipart-upload/part (per chunk, 5 MB)
/api/dataset/multipart-upload?type=finish
/api/dataset/{did}/version/create body "v1"
Returns { did, datasetName, fileName, fileSize }.
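The per-chunk step above implies slicing the fetched file into 5 MB parts before driving /api/dataset/multipart-upload/part. A sketch of that slicing (the helper name is hypothetical; only the 5 MB part size comes from the endpoint description):

```typescript
// Split a fetched file into fixed-size parts for the multipart-upload loop.
const PART_SIZE = 5 * 1024 * 1024; // 5 MB per part, as in the pipeline above

function splitIntoParts(data: Uint8Array, partSize: number = PART_SIZE): Uint8Array[] {
  const parts: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += partSize) {
    // subarray creates a view, so no copy is made per part
    parts.push(data.subarray(offset, Math.min(offset + partSize, data.length)));
  }
  return parts;
}
```

Each part would then be POSTed in order, with the init call's missingParts[] telling the server which indices still need uploading before the finish call.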
DatasetBankService.importToTexera() now calls this proxy instead of
doing the browser-side fetch + multipart upload itself; the per-card
Import button flow on the Dataset Bank page is unchanged from the user's
perspective (idle → Importing… → ✓ Imported), but actually succeeds for
catalogs that don't send CORS headers (UCI, Kaggle direct downloads, etc.).
The Angular proxy.config.json routes /api/dataset-bank/* to localhost:3001
in dev so the existing relative-URL pattern keeps working.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/dataset/* routes live in file-service (port 9092 per file-service-web-config.yaml), not amber/dashboard (8080). The new dataset-bank import proxy and the Feature-1 user-dataset list fetch were both hitting 8080 and getting 404.

Adds TEXERA_FILE_SERVICE_ENDPOINT to env (default http://localhost:9092) and exposes it on BackendConfig.fileServiceEndpoint. Both call sites now read from there.

Also logs the exact URL + upstream status/body at every step of the import pipeline so future endpoint drift is obvious from the agent-service logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Card footer now has three actions side by side: [🔗 View on UCI] [↓ Download] [☁ Import]

The View link is a plain anchor (always visible, not on hover) opening the dataset's catalog page in a new tab so users can verify the source before importing. Layout is a 1.4fr / 1fr / 1fr grid that gives the "View" pill room for the longer label without crowding Download or Import.

Also fixed the Human Protein Atlas seed entry to actually point at the dkNET catalog (RRID:SCR_006710) instead of proteinatlas.org.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PubMed (live search) — when the user types a query of 3+ characters in the Dataset Bank search box, DatasetBankService debounces 400ms then hits NCBI eSearch + eFetch directly (NCBI sends CORS headers). Each returned paper appears as a card with title, abstract, authors, journal, year. Source badge is green "PubMed". Importing a paper sends its PMID to the backend proxy, which re-fetches via eFetch server-side and emits a 1-row CSV with columns (pmid, title, abstract, authors, journal, year).

WHO Global Health Observatory — 5 hardcoded seed entries with real GHO indicator codes:
- Life Expectancy at Birth (WHOSIS_000001)
- HIV Prevalence Adults 15-49 (HIV_0000000001)
- Tuberculosis Incidence (MDG_0000000020)
- Malaria Estimated Deaths (MALARIA_EST_DEATHS)
- Under-Five Mortality Rate (MDG_0000000007)

Source badge is geekblue "WHO". Import fetches the GHO indicator API (https://ghoapi.azureedge.net/api/<indicator>) server-side and converts the rows into a (country, year, sex, numeric_value, value) CSV.

New "Public Health" category chip groups WHO entries (and biomedical seeds that touch population health) for filtering.

Backend proxy refactor: the existing /api/dataset-bank/import-from-url now accepts a sourceType discriminator. "url" (default) keeps the existing fetch-arbitrary-URL behavior. "pubmed" and "who" each fetch their canonical API server-side, build a CSV, then feed the shared createDataset → multipart-upload → createDatasetVersion pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
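The WHO row-to-CSV conversion can be sketched as follows. The field names assume the GHO OData response shape (SpatialDim, TimeDim, Dim1, NumericValue, Value) and should be verified against the live API; the output columns come from the commit above.

```typescript
// Convert GHO indicator rows into the (country, year, sex, numeric_value, value)
// CSV the import pipeline uploads as a Texera dataset.
interface GhoRow {
  SpatialDim: string;   // country code
  TimeDim: number;      // year
  Dim1: string;         // sex dimension (e.g. BTSX / MLE / FMLE)
  NumericValue: number;
  Value: string;        // display value, quoted since it may contain commas
}

function ghoRowsToCsv(rows: GhoRow[]): string {
  const header = "country,year,sex,numeric_value,value";
  const lines = rows.map(r =>
    [r.SpatialDim, r.TimeDim, r.Dim1, r.NumericValue, JSON.stringify(r.Value)].join(",")
  );
  return [header, ...lines].join("\n");
}
```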
The Angular dev-server proxy was misrouting requests to /api/dataset-bank/*
because the catch-all /api/dataset rule (file-service, port 9092) shares a
common string prefix with /api/dataset-bank and was winning the proxy match
race despite the more-specific rule being declared first.
Avoid the collision by giving the agent-service endpoint a distinct path.
Component/directory names remain `dataset-bank` (it's the user-facing page
identity); only the HTTP path changes:
proxy.config.json: "/api/databank" → http://localhost:3001
agent-service: new Elysia({ prefix: "/databank" })
frontend service: POST /api/databank/import-from-url
A dev-server restart is required when proxy.config.json changes, since
webpack-dev-server does not hot-reload it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
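Under that rename, the dev-proxy entry would look roughly like this (a sketch of the Angular proxy.config.json shape; only the path and target come from the commit, the `secure` flag is illustrative):

```json
{
  "/api/databank": {
    "target": "http://localhost:3001",
    "secure": false
  }
}
```

Because webpack-dev-server reads this file once at startup, the dev server has to be restarted for the entry to take effect, as noted above.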
Last comment site that still mentioned the pre-rename /api/dataset-bank path. No behavior change — the actual http.post() call already used the new path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Story
A biomedical researcher wants to study diabetes. She opens Texera and thinks: "Where do I even find the right dataset?" She opens a new tab, googles around, downloads a CSV from UCI, uploads it to Texera, configures the file path manually. Twenty minutes gone before she's even started analyzing.
After she builds her workflow and runs it, the AI agent gives her a detailed model comparison — accuracy, F1 scores, key insights. But it's all buried in a long chat message she has to scroll through. She can't easily reference it, copy it, or share it with her advisor.
We fixed both problems.
What We Built
1. Dataset Bank
A new page in the sidebar where users browse a curated catalog of public datasets from UCI, Kaggle, and dkNET — searchable, categorized, and importable with one click.
How it works:
Open "Dataset Bank" in the sidebar → see a grid of dataset cards
Search by name, description, or tag (e.g., "diabetes", "classification", "healthcare")
Filter by category: Biomedical, NLP, Computer Vision, Finance, Social Science, Time Series, Tabular
Every card shows: name, source badge (UCI/Kaggle/dkNET), description, row/column counts, file size, tags
Three actions per card:
🔗 View on source — opens the original dataset page so users can verify before importing
↓ Download — saves the file locally
☁ Import — imports directly into Texera's dataset system. One click, no manual upload, no file path configuration. The dataset immediately appears in "Your Datasets" and is ready for any workflow.
Backend: Server-side proxy (/api/databank/import-from-url; the HTTP path was renamed from /api/dataset-bank to avoid a dev-proxy prefix collision with /api/dataset) fetches the file and uploads it through Texera's existing dataset pipeline — bypassing browser CORS restrictions.
2. Dataset Search Agent Tool — "Find Me a Diabetes Dataset"
The AI agent can now search for datasets on your behalf during a conversation.
User asks: "find me a diabetes dataset" → agent calls search_datasets tool
Searches dkNET, UCI, and Kaggle in parallel, returns top results
Agent also knows your existing Texera datasets (injected into system prompt as "Your Datasets" section)
User says "use my iris dataset" → agent knows the exact file path and configures the CSV Source automatically
3. Results Dashboard
When the AI agent produces a workflow analysis (model comparison, metrics, key findings), it now appears in a dedicated Results Dashboard panel instead of being buried in chat.
Agent wraps analysis in report markers → chat shows a compact card: "📊 Results ready · View Report →"
Clicking opens a floating Results Dashboard panel alongside the canvas
Dashboard renders formatted markdown: tables, headers, bold metrics, key insights
Copy button to clipboard, Export button to download as markdown
Timestamped so users know when the analysis was generated
Auto-updates when agent sends new analysis
The experience: Canvas on the left showing your DAG, Results Dashboard on the right showing your analysis, chat in between for interaction. Everything visible at once — no tab switching, no scrolling through messages.
Demo Scenario
Open Dataset Bank → search "diabetes" → filter "Biomedical" → see Pima Indians Diabetes dataset
Click "🔗 View on UCI" to verify → click "☁ Import" → dataset appears in Your Datasets
Open a workflow → ask the Diabetes Agent: "Build a classification workflow using my diabetes dataset"
Agent generates the workflow on canvas
Ask: "Run it and give me a comparison report"
Agent runs workflow, produces analysis → "📊 Results ready · View Report →"
Click → Results Dashboard opens with formatted model comparison table, winner, key insights
Copy the report to share with advisor
Files Changed
Dataset Bank (Frontend)
dashboard/component/user/dataset-bank/ — DatasetBankComponent (page, search, categories, cards)
dashboard/component/user/dataset-bank/dataset-bank.seed.ts — Curated seed of 20+ popular datasets
dashboard/service/dataset-bank/dataset-bank.service.ts — Fetch, filter, import logic
Dataset Search (Agent Service)
agent-service/src/agent/tools/dataset-search-tool.ts — search_datasets tool (dkNET + UCI + Kaggle)
agent-service/src/api/user-datasets-api.ts — Fetches user's existing datasets for prompt injection
agent-service/src/agent/prompts.ts — "Your Datasets" section in system prompt
Dataset Import Proxy (Agent Service)
agent-service/src/api/dataset-import-api.ts — Server-side fetch + Texera dataset upload pipeline
agent-service/src/server.ts — /databank router mount (HTTP path renamed from /api/dataset-bank; component/directory names keep dataset-bank)
Results Dashboard (Frontend + Agent Service)
workspace/component/results-dashboard-panel/ — Floating panel with markdown rendering, copy, export
workspace/service/agent-report/agent-report.service.ts — Report pub/sub between chat and panel
agent-service/src/agent/prompts.ts — Report marker convention instructions
Configuration
proxy.config.json — Dev proxy for /api/databank → agent-service
agent-service/src/config/env.ts — TEXERA_FILE_SERVICE_ENDPOINT for dataset operations
Testing
Angular build: clean ✅
agent-service typecheck: clean ✅
Dataset Import: tested with UCI Iris dataset — end-to-end success ✅
Results Dashboard: tested with agent-generated report — renders correctly ✅
Dataset search tool: registered and callable by agent ✅