Skip to content

[Hackathon] MachineUDF operator + per-host machine-manager#5086

Open
aicam wants to merge 33 commits into
apache:mainfrom
aicam:hackathon
Open

[Hackathon] MachineUDF operator + per-host machine-manager#5086
aicam wants to merge 33 commits into
apache:mainfrom
aicam:hackathon

Conversation

@aicam
Copy link
Copy Markdown
Contributor

@aicam aicam commented May 16, 2026

2026-05-15.17-37-04.mp4

TL;DR

A new Texera operator, `MachineUDF`, that runs user Python on a user-registered remote machine (or just the user's own laptop) rather than on a Texera computing unit. The machine runs a tiny `machine-manager` HTTP service we ship; the operator POSTs each tuple — or, in batch mode, the entire upstream table — to that service, gets back the result(s), and emits them as output tuples just like `PythonUDFV2` would. Datasets in LakeFS now also accept a literal `latest` segment in the file path so workflows don't break every time a new dataset version lands.

Use cases this unlocks:

  • Data already on the user's laptop? Read it without uploading via S3 first.
  • Results need to land on the user's laptop? Plots, reports, fine-tuned model files, JSON output — written directly to a path on the user's machine.
  • Per-tuple side effects on a known host (e.g. write each row as a JSONL file).
  • Whole-table analytics that need a Python ML stack the computing units don't necessarily have (sklearn, matplotlib, ...).

End-to-end demo that shipped on this branch: an LLM agent reads `/home/ali/UCI/hackathon/diabetes.csv` on the laptop, builds a workflow on the canvas (`CSVFileScan → MachineUDF[batch]`), trains LinearRegression / Ridge / SVR predicting `target`, saves three prediction-vs-actual PNGs and a `report.md` back into `/home/ali/UCI/hackathon/`, and surfaces the metrics in the Texera result table.

Architecture

  ┌──────────────┐         ┌─────────────────────────────┐
  │  Texera UI   │  HTTP   │  Texera backend services    │
  │  (Angular)   │ ──────► │  (amber, file-service, …)   │
  └──────┬───────┘         └──────────────┬──────────────┘
         │ JWT                            │ JDBC
         │                                ▼
         │                        ┌───────────────────┐
         │                        │   Postgres        │
         │                        │   + `machine` tbl │
         │                        └───────────────────┘
         │
         │  CRUD over /api/machines (new JAX-RS resource)
         │
  ┌──────▼───────────────────────────────────────────┐
  │  ComputingUnitMaster                             │
  │   • MachineUDFOpExec (new)                       │
  │     ── HTTP/1.1 POST /python ──┐                 │
  └────────────────────────────────│─────────────────┘
                                   │
                                   ▼
                ┌─────────────────────────────────────┐
                │   machine-manager  (FastAPI)        │
                │     /healthz                        │
                │     /exec        (bash, allowlist)  │
                │     /python      (run script)       │
                │     /deploy-code (sandboxed write)  │
                │     /upload-to-dataset              │
                └─────────────────────────────────────┘
                  runs on the user's own laptop
                  (or any registered host)

The user registers a machine (URL + optional bearer token) via the new Machines dashboard tab; the operator looks up the URL from the `machine` table at runtime.

What's in the PR

1. `MachineUDF` operator (Scala) — `common/workflow-operator/.../udf/machine/`

  • `MachineUDFOpDesc` — operator descriptor with properties `machineUrl`, `machineToken`, `code`, `timeoutSeconds`, `batchMode`, `retainInputColumns`, `outputColumns`.
  • `MachineUDFOpExec` — extends `OperatorExecutor` directly (not `MapOpExec`) so it can run either per-tuple or in batch mode.
    • Per-tuple mode (default): one `POST /python` per row, `tuple_in` is a single dict; last JSON line on stdout becomes the output tuple.
    • Batch mode (`batchMode: true`): buffer all tuples; on `onFinish`, ONE `/python` call with `tuple_in` as a list of dicts; each JSON object line on stdout becomes one output row, projected onto the declared `outputColumns`.
  • HTTP client pinned to HTTP/1.1 — we hit a real bug where the JDK's default `HttpClient` does h2c upgrade and uvicorn drops the request body. The pinning eliminates that whole class of "POST sent but server received empty body" failures.
  • Output schema is preserved tuple-by-tuple in per-tuple mode (re-uses the original tuple's typed values for retained columns so e.g. Int stays Int instead of becoming Long after a JSON round-trip).

Registered in `LogicalOp.scala` as the `MachineUDF` JSON-type, in the `PYTHON_GROUP` operator group.

2. `machine-manager` Python service — `machine-manager/` (new directory)

FastAPI service the operator POSTs to. Endpoints:

Endpoint Purpose
`GET /healthz` Liveness + reports which Python interpreter `/python` uses
`POST /exec` Run a shell command (`bash -c`), capture stdout/stderr/exit_code
`POST /python` Run a Python script with `tuple_in` injected as a global
`POST /deploy-code` Write a file under a sandbox dir (path-escape protected)
`POST /upload-to-dataset` Read a local file and push it into a Texera dataset (creating a new version)

Highlights:

  • Token-optional bearer auth via `MACHINE_MANAGER_TOKEN`. Skipped in dev when unset.
  • Configurable interpreter for `/python` via `MACHINE_MANAGER_PYTHON`. Defaults to whatever Python the service itself runs on, but the included `bin/run.sh` autodetects a Texera data-science venv (sklearn/pandas/matplotlib/numpy) and picks that so analysis scripts work out of the box.
  • Pre-flight syntax check on every `/python` call: we `compile()` the user script before spawning a subprocess. Caller gets `exit_code=2` + a one-line hint (about raw newlines in string literals being the typical cause). Saves a real round-trip when LLM-generated scripts are malformed.
  • `/upload-to-dataset` handles file-service's "No changes detected" 400 as a no-op and falls back to the latest version metadata, so re-uploads of identical content don't error.
  • `tuple_in` schema is `Any` — dict for per-tuple mode, list-of-dicts for batch mode, `null` for one-shot diagnostic calls.

3. Machine CRUD — `amber/.../resource/MachineResource.scala` + `sql/texera_ddl.sql` + frontend

  • New `machine` table (`mid, uid, name, url, token, creation_time`) — owner-scoped, JWT-gated.
  • New JAX-RS resource at `/api/machines` (list/create/update/delete).
  • New Machines dashboard tab — standalone Angular component with `NzTable` and an add/edit modal — lives next to Datasets in the sidebar.
  • Service + types under `frontend/src/app/common/service/machine/` and `frontend/src/app/common/type/machine.ts`.

4. `FileResolver` — `latest` version sentinel

`common/workflow-core/.../FileResolver.scala` now accepts the literal segment `latest` in dataset paths:

/<ownerEmail>/<datasetName>/latest/<filePath>      ← always resolves to the newest version
/<ownerEmail>/<datasetName>/v3/<filePath>          ← still works, explicit version
/<ownerEmail>/<datasetName>/<filePath>             ← 3-segment form; latest implied

The "latest" form is resolved at execution time by querying `dataset_version` ordered by `dvid DESC LIMIT 1`. This is the path `uploadFileToDataset` returns to callers, which means once an LLM (or a human) wires a scan operator with a `latest` path, subsequent uploads to that dataset don't break the workflow.

5. Agent-service integration

The `agent-service` (LLM-driven workflow builder) has been taught about all of the above:

  • New tools: `runOnMachine` (`/exec`), `runPythonOnMachine` (`/python`, currently de-registered to keep workflows as the only answer path), `listDatasets`, `uploadFileToDataset`, `getDatasetFile`. The upload tool returns a verbatim `fileName_for_scan_operator` string so the LLM doesn't have to construct the canonical dataset path itself.
  • Prompt rules added to prevent the common LLM failure modes we observed: guessing `datasetId` from a name, fabricating `machineUrl` hostnames, ignoring tool result paths, embedding raw newlines in single-/double-quoted Python strings (including f-strings), forgetting `batchMode: true` on whole-table MachineUDF scripts, re-running a workflow that already returned `COMPLETED`.
  • `MachineUDF`, the relevant Sklearn operators (`SklearnTrainingLinearRegression`, `SklearnTrainingRidge`, `SVRTrainer`, `SklearnPrediction`, `Split`), and `PythonTableReducer` are added to `DEFAULT_AGENT_SETTINGS.allowedOperatorTypes` so the agent can actually pick them.

A worked example for the diabetes regression demo is embedded in the prompt verbatim, including a per-branch script template.

How it was tested

End-to-end demo, repeatedly: agent prompt → `runOnMachine` to verify the local CSV → `uploadFileToDataset` → `addOperator` `CSVFileScan` + batch `MachineUDF` → workflow execution. Result table populated with 3 rows (LinearRegression / Ridge / SVR metrics), 3 PNGs + `report.md` written to the laptop. Example output:

Model MSE
LinearRegression 0.4753 3375.74
Ridge 0.4838 3320.47
SVR -0.2073 7766.64

Failure modes we encountered and have either fixed or surfaced clearly:

  • JDK `HttpClient` h2c upgrade dropping the POST body → pin HTTP/1.1.
  • `MachineUDF` in batch mode being treated as per-tuple because the agent forgot `batchMode: true` → loud prompt rule + clear malformed-output signature.
  • LLM-generated Python scripts with raw newlines inside f-strings → server-side `compile()` pre-flight + explicit prompt rule.
  • LLM hallucinating `http://:5555` instead of using the URL returned by `runOnMachine` → hard prompt rule.

Known limitations / things to discuss

  • The Apache license header is missing on the new `machine-manager/` Python files. Easy to add before any cleanup pass.
  • `machine-manager` has no rate limiting / process isolation beyond what FastAPI gives you. The `/exec` endpoint is a literal "run this bash command on the user's host" — useful for inspection from the canvas, but the threat model has to be "the user already trusts this Texera instance with their laptop".
  • `MachineUDF` operator has only one input port. Three-branch ML demos that want to fan three SklearnPrediction outputs into one collector currently need a separate `MachineUDF` per branch (or to standardize column names and Union upstream). A multi-port variant would be a nice follow-up.
  • The frontend changes include some sibling work (ML-model component, MLFlow operator schema) that would need to be split out for a real merge.
  • The agent-service prompt is currently authored for a specific demo flow (regression on diabetes.csv). A real merge would generalize the worked example.

🤖 Generated with Claude Code

aicam and others added 30 commits February 24, 2026 13:58
…ent via public cluster services

- Add CloudMapperSourceOpDesc, ReferenceGenome, ReferenceGenomeEnum operator classes
- Add FileResolver.resolveDirectory for resolving dataset directories by path
- Add DatasetFileDocument directory mode: downloads all files as a zip via LakeFS/FileService
- Add DocumentFactory.openReadonlyDocument isDirectory parameter
- Add ENV_FILE_SERVICE_LIST_DIRECTORY_OBJECTS_ENDPOINT env var
- Add Kubernetes Helm chart and PVC for the cloudmapper service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ntend integration

- Add ClusterResource, ClusterCallbackResource, ClusterServiceClient, ClusterUtils backend API for managing EC2 clusters
- Add cluster dashboard component with launch/stop/terminate/start actions and management modal
- Add ClusterSelectionComponent and ClusterAutoCompleteComponent for operator property panel
- Add DirectoryPathInput and DirectorySelection components for dataset directory selection
- Add cluster route in app-routing, cluster declarations in app.module
- Add cluster_enabled feature flag to gui-config, dashboard sidebar, and admin settings
- Add clusterautocomplete and directorypathinput formly field types
- Register cluster/directoryName/fastQFiles/fastAFiles/gtfFile fields in operator property editor
- Add SQL schema for cluster and cluster_activity tables
- Add dknet logo, CloudBioMapper operator icon, and sequence-alignment workflow assets
- Add DatasetDirectoryDocument and PathUtils storage utilities

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
aicam and others added 3 commits May 15, 2026 14:49
…ution

Adds a "Machines" abstraction so workflows can dispatch per-tuple Python
to a user's own host:

- machine-manager: FastAPI service (port 5555) exposing /exec, /python,
  /deploy-code, /upload-to-dataset. Token-optional bearer auth.
- MachineUDFOpDesc/OpExec: Python-only map operator that POSTs each
  tuple as JSON to a configured machine URL and merges script output.
  Pins HTTP/1.1 to avoid h2c upgrade dropping the request body.
- MachineResource (JAX-RS, JWT-gated) + `machine` DDL table for CRUD.
- Dashboard "Machines" tab (standalone component + service + types).
- Agent tools: createMachine/run-on-machine/listDatasets/
  uploadFileToDataset/getDatasetFile. Upload returns the canonical
  /<ownerEmail>/<datasetName>/latest/<file> path the LLM should paste
  into CSVFileScan.
- FileResolver: accept `latest` as a version token (and 3-segment
  paths) by resolving to the dataset's newest version.
- Agent prompts: MACHINE_TOOLS_INSTRUCTIONS section with hard rules
  against guessing datasetId from name + worked example.
- gui.conf: enable local login + copilot for local dev; default user
  texera/texera.
- llm.conf: point at cherry00 LiteLLM gateway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the agent build a multi-operator Sklearn regression workflow on the
canvas (CSVFileScan → Split → 3× SklearnTraining → 3× SklearnPrediction
→ 3× MachineUDF) instead of stuffing all ML into one fat MachineUDF.

- MachineUDF: new `batchMode` property. When true, the exec class buffers
  every input tuple and emits one /python call with `tuple_in` as a list
  of dicts on onFinish; the script's stdout JSON lines become output
  rows. Switches the base class to OperatorExecutor so blocking semantics
  work. Per-tuple mode (default) is unchanged.
- machine-manager: `tuple_in` schema widened to Any so batch-mode payloads
  (list of dicts) deserialize. New MACHINE_MANAGER_PYTHON env var picks
  the python interpreter used for /python; run.sh auto-selects the
  texera .venv (sklearn/pandas/matplotlib).
- agent-service prompt: regression demo recipe replaces the old "single
  MachineUDF does everything" path. Hard rules added against fabricating
  machineUrl, re-running already-COMPLETED workflows, embedding raw
  newlines in single/double-quoted Python strings, and forgetting
  `batchMode: true` on whole-table MachineUDF scripts.
- agent-service allowedOperatorTypes: enable Sklearn{LinearRegression,
  Ridge, SVR, Prediction} trainers, Split, PythonTableReducer,
  ImageVisualizer, ScatterMatrixChart.
- agent-service tool surface: added (then de-registered) a
  `runPythonOnMachine` helper to keep the workflow as the only path to
  results, per the "showcase Texera" requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- machine-manager `/python` now `compile()`s the script before running.
  SyntaxErrors come back instantly with `exit_code=2` and a hint about
  triple-quoted strings, instead of after spinning up Python.
- Agent prompt rule apache#9 expanded to spell out f-strings explicitly
  (LLMs keep generating `f"...{x}...|` followed by a raw newline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added engine ddl-change Changes to the TexeraDB DDL python frontend Changes related to the frontend GUI common agent-service labels May 16, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 16, 2026

Codecov Report

❌ Patch coverage is 4.02685% with 143 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.90%. Comparing base (7bb102e) to head (9653441).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
.../amber/operator/udf/machine/MachineUDFOpExec.scala 0.00% 84 Missing ⚠️
...g/apache/texera/web/resource/MachineResource.scala 0.00% 30 Missing ⚠️
.../amber/operator/udf/machine/MachineUDFOpDesc.scala 0.00% 24 Missing ⚠️
...pache/texera/amber/core/storage/FileResolver.scala 60.00% 0 Missing and 4 partials ⚠️
...a/org/apache/texera/web/TexeraWebApplication.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5086      +/-   ##
============================================
+ Coverage     42.85%   42.90%   +0.05%     
+ Complexity     2207     2206       -1     
============================================
  Files          1045     1048       +3     
  Lines         40146    40365     +219     
  Branches       4240     4267      +27     
============================================
+ Hits          17203    17318     +115     
- Misses        21878    21970      +92     
- Partials       1065     1077      +12     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from 7bb102e
amber 43.37% <4.02%> (-0.38%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 33.93% <ø> (ø) Carriedforward from 7bb102e
python 90.42% <ø> (+1.42%) ⬆️
workflow-compiling-service 47.72% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@aicam aicam changed the title feat: MachineUDF operator + per-host machine-manager (hackathon demo) [Hackathon] MachineUDF operator + per-host machine-manager May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-service common ddl-change Changes to the TexeraDB DDL engine frontend Changes related to the frontend GUI python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants