
Fast source operator stays orange (RUNNING) after the workflow completes #6010

Description

@Yicong-Huang

What happened?

A fast source operator (e.g. Text Input with a few rows) stays orange (RUNNING) in the editor after the run has finished and results are shown. The operator never turns green (COMPLETED).


Root cause: physical timestamps are used to order state that is causally ordered. A worker is the single writer of its own state, and its transitions (READY → RUNNING → COMPLETED) have a strict causal order. The controller, however, reconstructs that state from three channels that have no ordering guarantee relative to each other:

| State | How it reaches the controller |
| --- | --- |
| source RUNNING | startWorker response snapshot |
| non-source RUNNING | workerStateUpdated push |
| COMPLETED | queryStatistics response snapshot (no dedicated push) |

WorkerExecution.update resolves conflicts with a last-writer-wins rule keyed on System.nanoTime(): the report with the latest timestamp wins. For a tiny source the whole run finishes almost instantly, so the startWorker response (carrying the stale RUNNING it sampled at launch) can arrive at the controller after COMPLETED was already recorded. Because its receipt timestamp is later, the stale RUNNING overwrites COMPLETED:

Before:  start(RUNNING)──────────────(late)──────────▶ ts=30  ⇒ RUNNING wins ✗
         portCompleted/execCompleted ▶ COMPLETED ts=20
After:   RUNNING carries version 2  <  COMPLETED version 3   ⇒ COMPLETED stays ✓
         + terminal state is absorbing
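
The "Before" half of the diagram can be replayed with a minimal Python sketch. This is not the project's Scala code: `TimestampedExecution` and `replay` are hypothetical names, and the integer timestamps stand in for the System.nanoTime() values taken when each report is received.

```python
from dataclasses import dataclass
from enum import Enum


class WorkerState(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"


@dataclass
class TimestampedExecution:
    """Hypothetical controller-side record: the latest receipt timestamp wins."""
    state: WorkerState = WorkerState.READY
    last_update_ns: int = 0

    def update(self, receipt_ns: int, state: WorkerState) -> None:
        # A report is applied whenever it was *received* later than the
        # previous one, regardless of when the worker produced it.
        if receipt_ns > self.last_update_ns:
            self.state = state
            self.last_update_ns = receipt_ns


def replay(reports):
    """Feed (receipt_ns, state) pairs to a fresh record; return the final state."""
    execution = TimestampedExecution()
    for receipt_ns, state in reports:
        execution.update(receipt_ns, state)
    return execution.state


# The worker went RUNNING and then COMPLETED, but the stale RUNNING snapshot
# reaches the controller last (ts=30), after COMPLETED was recorded (ts=20).
racy_arrival = [(20, WorkerState.COMPLETED), (30, WorkerState.RUNNING)]
in_order_arrival = [(10, WorkerState.RUNNING), (20, WorkerState.COMPLETED)]

print(replay(racy_arrival).name)
print(replay(in_order_arrival).name)
```

With the racy arrival order the record ends in RUNNING even though the worker finished, which matches the stuck orange border.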

The result data uses a separate path, so results render correctly while the border is stuck.

Introduced in #3557 (the timestamp-based update).

How to reproduce?

  1. New workflow with a single fast source operator (Text Input, a few lines).
  2. Run it. Results appear in the Result panel.
  3. Operator border stays orange/RUNNING instead of green/COMPLETED.

In the browser WS frames, the last OperatorStatisticsUpdateEvent for the operator carries operatorState: "Running". In other words, the backend sends the wrong state and the frontend renders it faithfully. The bug is intermittent because it is a race, but it is very likely to occur for tiny sources.

Fix

Order worker state causally, not by wall clock:

  • Per-worker logical version: WorkerStateManager increments a monotonic counter on every transitTo, and the counter is carried on every state report (WorkerStateResponse, WorkerStateUpdatedRequest, WorkerMetrics). The controller applies a reported state only if its version is newer than the one already recorded. Each worker is the only source of its own versions, so there is no cross-process clock-sync concern.
  • Terminal-state absorption: once COMPLETED/TERMINATED, a worker cannot be moved back by any later report.
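
The two rules above can be sketched together as follows. This is an illustration in Python under stated assumptions, not the actual patch: `WorkerSide`, `ControllerSide`, and `transit_to` are hypothetical names, and starting the counter at 1 for READY is an assumption chosen only so that RUNNING carries version 2 and COMPLETED carries version 3, as in the diagram.

```python
from dataclasses import dataclass
from enum import Enum


class WorkerState(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    TERMINATED = "TERMINATED"


TERMINAL_STATES = {WorkerState.COMPLETED, WorkerState.TERMINATED}


class WorkerSide:
    """Hypothetical worker-side holder: the single writer of its own state."""

    def __init__(self):
        self.state = WorkerState.READY
        self.version = 1  # assumption: READY is version 1

    def transit_to(self, new_state):
        # Every transition bumps the logical version; the pair is what a
        # state report would carry to the controller.
        self.version += 1
        self.state = new_state
        return (self.version, self.state)


@dataclass
class ControllerSide:
    """Hypothetical controller-side record ordered by version, not by clock."""
    state: WorkerState = WorkerState.READY
    version: int = 0

    def update(self, version, state):
        # Rule 2: a terminal state is absorbing; nothing moves it back.
        if self.state in TERMINAL_STATES:
            return False
        # Rule 1: apply a report only if its version is strictly newer.
        if version <= self.version:
            return False
        self.state, self.version = state, version
        return True


worker = WorkerSide()
running = worker.transit_to(WorkerState.RUNNING)
completed = worker.transit_to(WorkerState.COMPLETED)

controller = ControllerSide()
applied_completed = controller.update(*completed)  # arrives first
applied_running = controller.update(*running)      # stale report arrives late
print(controller.state.name, applied_completed, applied_running)
```

Either rule alone rejects the late RUNNING report in this scenario. The version check also covers non-terminal states, and absorption guards terminal states against any report that carries a higher version.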

Stats keep timestamp ordering (monotonic snapshots within one state).

Version/Branch

1.3.0-incubating-SNAPSHOT (main)

Commit Hash (Optional)

4d05ab2

What browsers are you seeing the problem on?

No response

Relevant log output

No response
