[Hackathon] MachineUDF operator + per-host machine-manager#5086
Open
aicam wants to merge 33 commits into
…ent via public cluster services
- Add CloudMapperSourceOpDesc, ReferenceGenome, ReferenceGenomeEnum operator classes
- Add FileResolver.resolveDirectory for resolving dataset directories by path
- Add DatasetFileDocument directory mode: downloads all files as a zip via LakeFS/FileService
- Add DocumentFactory.openReadonlyDocument isDirectory parameter
- Add ENV_FILE_SERVICE_LIST_DIRECTORY_OBJECTS_ENDPOINT env var
- Add Kubernetes Helm chart and PVC for the cloudmapper service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ntend integration
- Add ClusterResource, ClusterCallbackResource, ClusterServiceClient, ClusterUtils backend API for managing EC2 clusters
- Add cluster dashboard component with launch/stop/terminate/start actions and management modal
- Add ClusterSelectionComponent and ClusterAutoCompleteComponent for operator property panel
- Add DirectoryPathInput and DirectorySelection components for dataset directory selection
- Add cluster route in app-routing, cluster declarations in app.module
- Add cluster_enabled feature flag to gui-config, dashboard sidebar, and admin settings
- Add clusterautocomplete and directorypathinput formly field types
- Register cluster/directoryName/fastQFiles/fastAFiles/gtfFile fields in operator property editor
- Add SQL schema for cluster and cluster_activity tables
- Add dknet logo, CloudBioMapper operator icon, and sequence-alignment workflow assets
- Add DatasetDirectoryDocument and PathUtils storage utilities
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Feat/cloudbiomapper
Revert "Feat/cloudbiomapper"
…ution
Adds a "Machines" abstraction so workflows can dispatch per-tuple Python to a user's own host:
- machine-manager: FastAPI service (port 5555) exposing /exec, /python, /deploy-code, /upload-to-dataset. Token-optional bearer auth.
- MachineUDFOpDesc/OpExec: Python-only map operator that POSTs each tuple as JSON to a configured machine URL and merges script output. Pins HTTP/1.1 to avoid h2c upgrade dropping the request body.
- MachineResource (JAX-RS, JWT-gated) + `machine` DDL table for CRUD.
- Dashboard "Machines" tab (standalone component + service + types).
- Agent tools: createMachine/run-on-machine/listDatasets/uploadFileToDataset/getDatasetFile. Upload returns the canonical /<ownerEmail>/<datasetName>/latest/<file> path the LLM should paste into CSVFileScan.
- FileResolver: accept `latest` as a version token (and 3-segment paths) by resolving to the dataset's newest version.
- Agent prompts: MACHINE_TOOLS_INSTRUCTIONS section with hard rules against guessing datasetId from name + worked example.
- gui.conf: enable local login + copilot for local dev; default user texera/texera.
- llm.conf: point at cherry00 LiteLLM gateway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the agent build a multi-operator Sklearn regression workflow on the
canvas (CSVFileScan → Split → 3× SklearnTraining → 3× SklearnPrediction
→ 3× MachineUDF) instead of stuffing all ML into one fat MachineUDF.
- MachineUDF: new `batchMode` property. When true, the exec class buffers
every input tuple and emits one /python call with `tuple_in` as a list
of dicts on onFinish; the script's stdout JSON lines become output
rows. Switches the base class to OperatorExecutor so blocking semantics
work. Per-tuple mode (default) is unchanged.
- machine-manager: `tuple_in` schema widened to Any so batch-mode payloads
(list of dicts) deserialize. New MACHINE_MANAGER_PYTHON env var picks
the python interpreter used for /python; run.sh auto-selects the
texera .venv (sklearn/pandas/matplotlib).
- agent-service prompt: regression demo recipe replaces the old "single
MachineUDF does everything" path. Hard rules added against fabricating
machineUrl, re-running already-COMPLETED workflows, embedding raw
newlines in single/double-quoted Python strings, and forgetting
`batchMode: true` on whole-table MachineUDF scripts.
- agent-service allowedOperatorTypes: enable Sklearn{LinearRegression,
Ridge, SVR, Prediction} trainers, Split, PythonTableReducer,
ImageVisualizer, ScatterMatrixChart.
- agent-service tool surface: added (then de-registered) a
`runPythonOnMachine` helper to keep the workflow as the only path to
results, per the "showcase Texera" requirement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
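The batch-mode behaviour in this commit (buffer every tuple, make one call on finish, turn stdout JSON lines into rows) can be sketched roughly as follows. This is an illustrative Python model, not the Scala `OpExec` from the PR: `BatchBuffer`, `parse_stdout_rows`, and the `send` callable are hypothetical names, and the sketch assumes every non-empty stdout line is valid JSON.

```python
import json


def parse_stdout_rows(stdout):
    """Turn script stdout into output rows: one JSON value per non-empty line."""
    rows = []
    for line in stdout.splitlines():
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


class BatchBuffer:
    """Hypothetical model of a blocking, batch-mode operator."""

    def __init__(self, send):
        # `send` is an assumed callable: list of dicts -> stdout string.
        self._send = send
        self._tuples = []

    def on_tuple(self, tuple_in):
        # Blocking semantics: buffer the tuple and emit nothing yet.
        self._tuples.append(dict(tuple_in))
        return []

    def on_finish(self):
        # A single call carries the whole table as a list of dicts.
        stdout = self._send(self._tuples)
        return parse_stdout_rows(stdout)
```

In per-tuple mode no buffering is needed, since each tuple is sent as soon as it arrives.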
- machine-manager `/python` now `compile()`s the script before running. SyntaxErrors come back instantly with `exit_code=2` and a hint about triple-quoted strings, instead of after spinning up Python.
- Agent prompt rule #9 expanded to spell out f-strings explicitly (LLMs keep generating `f"...{x}...|` followed by a raw newline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
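A minimal sketch of the `compile()` pre-check described in this commit, using only Python's built-in `compile`. Only `exit_code=2` comes from the commit message. The function name and the `stderr` and `hint` keys are assumptions for illustration, not the actual machine-manager response schema.

```python
def precheck_script(source):
    """Return None if `source` compiles, else an error dict with exit_code=2."""
    try:
        # Compile only; nothing is executed and no interpreter is spawned.
        compile(source, "<machine-udf>", "exec")
    except SyntaxError as err:
        return {
            "exit_code": 2,
            # `stderr` and `hint` are hypothetical field names.
            "stderr": f"SyntaxError at line {err.lineno}: {err.msg}",
            "hint": "strings that span lines need triple quotes",
        }
    return None
```

A single-quoted f-string with a raw newline inside it, the failure the prompt rule targets, is rejected at this step rather than after the script starts.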
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5086 +/- ##
============================================
+ Coverage 42.85% 42.90% +0.05%
+ Complexity 2207 2206 -1
============================================
Files 1045 1048 +3
Lines 40146 40365 +219
Branches 4240 4267 +27
============================================
+ Hits 17203 17318 +115
- Misses 21878 21970 +92
- Partials 1065 1077 +12
2026-05-15.17-37-04.mp4
TL;DR
A new Texera operator, `MachineUDF`, that runs user Python on a user-registered remote machine (or just the user's own laptop) rather than on a Texera computing unit. The machine runs a tiny `machine-manager` HTTP service we ship; the operator POSTs each tuple — or, in batch mode, the entire upstream table — to that service, gets back the result(s), and emits them as output tuples just like `PythonUDFV2` would. Datasets in LakeFS now also accept a literal `latest` segment in the file path so workflows don't break every time a new dataset version lands.
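The two request shapes can be sketched in Python as below. `tuple_in` is the field named in this PR. The `script` key and the function name are assumptions made for illustration, and the real request body may carry other fields.

```python
import json


def build_python_payload(script, tuples, batch_mode=False):
    """Build a JSON body for a POST to machine-manager's /python endpoint.

    Per-tuple mode sends one dict as `tuple_in`.
    Batch mode sends the whole upstream table as a list of dicts.
    """
    if batch_mode:
        tuple_in = [dict(t) for t in tuples]
    else:
        if len(tuples) != 1:
            raise ValueError("per-tuple mode sends exactly one tuple per request")
        tuple_in = dict(tuples[0])
    # "script" is a hypothetical key name.
    return json.dumps({"script": script, "tuple_in": tuple_in})
```

This also illustrates why a service-side schema for `tuple_in` has to accept both a dict and a list once batch mode exists.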
Use cases this unlocks:
End-to-end demo that shipped on this branch: an LLM agent reads `/home/ali/UCI/hackathon/diabetes.csv` on the laptop, builds a workflow on the canvas (`CSVFileScan → MachineUDF[batch]`), trains LinearRegression / Ridge / SVR predicting `target`, saves three prediction-vs-actual PNGs and a `report.md` back into `/home/ali/UCI/hackathon/`, and surfaces the metrics in the Texera result table.
Architecture
The user registers a machine (URL + optional bearer token) via the new Machines dashboard tab; the operator looks up the URL from the `machine` table at runtime.
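As a rough illustration of "URL + optional bearer token", the sketch below builds (but does not send) a request to a registered machine with Python's standard `urllib.request`. The `machine` dict and its `url` and `token` keys are hypothetical stand-ins for a row of the `machine` table, whose actual columns are not shown here.

```python
import urllib.request


def build_request(machine, path, body):
    """Build a POST request to a registered machine without sending it."""
    headers = {"Content-Type": "application/json"}
    # The token is optional: attach the Authorization header only when set.
    if machine.get("token"):
        headers["Authorization"] = f"Bearer {machine['token']}"
    url = machine["url"].rstrip("/") + path
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

For example, `build_request({"url": "http://localhost:5555", "token": None}, "/python", b"{}")` targets the service's default port with no `Authorization` header.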
What's in the PR
1. `MachineUDF` operator (Scala) — `common/workflow-operator/.../udf/machine/`
Registered in `LogicalOp.scala` as the `MachineUDF` JSON-type, in the `PYTHON_GROUP` operator group.
2. `machine-manager` Python service — `machine-manager/` (new directory)
FastAPI service the operator POSTs to. Endpoints:
Highlights:
3. Machine CRUD — `amber/.../resource/MachineResource.scala` + `sql/texera_ddl.sql` + frontend
4. `FileResolver` — `latest` version sentinel
`common/workflow-core/.../FileResolver.scala` now accepts the literal segment `latest` in dataset paths:
The "latest" form is resolved at execution time by querying `dataset_version` ordered by `dvid DESC LIMIT 1`. This is the path `uploadFileToDataset` returns to callers, which means once an LLM (or a human) wires a scan operator with a `latest` path, subsequent uploads to that dataset don't break the workflow.
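The lookup described above can be modelled with a small, self-contained Python sketch against any DB-API connection (for instance an in-memory SQLite database). Only the `dataset_version` table, the `dvid` column, and the `ORDER BY dvid DESC LIMIT 1` ordering come from this PR. The `did` and `name` columns and the function name are assumptions, and the real `FileResolver` is Scala.

```python
import sqlite3


def resolve_version(conn, did, version_token):
    """Map the literal token `latest` to the newest version of dataset `did`.

    Any other token is treated as an explicit version name and returned as-is.
    """
    if version_token != "latest":
        return version_token
    row = conn.execute(
        # `did` and `name` are hypothetical column names.
        "SELECT name FROM dataset_version "
        "WHERE did = ? ORDER BY dvid DESC LIMIT 1",
        (did,),
    ).fetchone()
    if row is None:
        raise LookupError(f"dataset {did} has no versions")
    return row[0]
```

Because the token is resolved on every execution, a workflow that stores a `latest` path picks up new uploads without being edited.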
5. Agent-service integration
The `agent-service` (LLM-driven workflow builder) has been taught about all of the above:
A worked example for the diabetes regression demo is embedded in the prompt verbatim, including a per-branch script template.
How it was tested
End-to-end demo, repeatedly: agent prompt → `runOnMachine` to verify the local CSV → `uploadFileToDataset` → `addOperator` `CSVFileScan` + batch `MachineUDF` → workflow execution. Result table populated with 3 rows (LinearRegression / Ridge / SVR metrics), 3 PNGs + `report.md` written to the laptop. Example output:
Failure modes we encountered and have either fixed or surfaced clearly:
Known limitations / things to discuss
🤖 Generated with Claude Code