fix: use task-stream index instead of wall clock in get_task_stream context manager (#9253) by MohammadYusif · Pull Request #9282 · dask/distributed

MohammadYusif · 2026-05-28T18:14:27Z

Tests added / passed
Passes pre-commit run --all-files

The get_task_stream context manager bounded the tasks it collected with a
wall-clock timestamp (time() - 0.1), and collect() bisects the buffer by
comparing that boundary against each task's recorded stop time. With latency
or clock skew between the client and the workers, a task that finished inside
the block can carry a stop time earlier than the client's start boundary
and be silently dropped, so get_task_stream() returns no tasks. This matches
the existing code comment "# FIXME ... We should query TaskStreamPlugin.index instead."

This records the scheduler's monotonic task-stream append index on entry and
collects everything appended after it on exit, removing the dependency on
synchronized clocks.
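
To make the failure mode concrete, here is a minimal sketch. It is not the TaskStreamPlugin code from this PR: StreamBuffer, collect_by_time and collect_by_index are hypothetical names, and each record is simplified to a single stop field. It assumes the buffer may be a bounded deque, so the global append index has to be translated into a position among the records still retained.

```python
from bisect import bisect_left
from collections import deque


class StreamBuffer:
    """Toy stand-in for a task-stream buffer (hypothetical, not dask's class)."""

    def __init__(self, maxlen=None):
        self.buffer = deque(maxlen=maxlen)
        self.index = 0  # total records ever appended; never decreases

    def append(self, record):
        self.buffer.append(record)
        self.index += 1

    def collect_by_time(self, start):
        # Bisect on stop time: anything that stopped before `start` is cut off.
        stops = [rec["stop"] for rec in self.buffer]
        return list(self.buffer)[bisect_left(stops, start):]

    def collect_by_index(self, start_index):
        # Translate the global append index into a position in the deque,
        # allowing for old records that a bounded deque has already evicted.
        first_retained = self.index - len(self.buffer)
        offset = max(start_index - first_retained, 0)
        return list(self.buffer)[offset:]


stream = StreamBuffer()
stream.append({"key": "before", "stop": 90.0})

client_start = 100.0        # client clock when the block is entered
start_index = stream.index  # append index at the same moment

# The worker clock lags, so this task's stop time predates client_start
# even though the task ran inside the block.
stream.append({"key": "inside", "stop": 99.5})

by_time = [r["key"] for r in stream.collect_by_time(client_start)]
by_index = [r["key"] for r in stream.collect_by_index(start_index)]
print(by_time, by_index)
```

In this toy setup the time-based path returns nothing, while the index-based path returns the task that ran inside the block, because it never looks at the clocks.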

distributed/diagnostics/task_stream.py: collect() gains a start_index
path that selects records by append position instead of timestamp.
distributed/scheduler.py: new get_task_stream_index RPC + start_index
passthrough on get_task_stream. Refactored the plugin-init guard into
_task_stream_plugin() helper used by both methods.
distributed/client.py: get_task_stream/_get_task_stream forward
start_index; the context manager snapshots the index on enter and collects
from it on exit (sync and async), removing the FIXME comment.
distributed/diagnostics/tests/test_task_stream.py: tests for the index
semantics and the clock-skew regression.
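
The enter/exit flow described in the list above can be sketched as follows. This is an illustration only: FakeScheduler and task_stream_block are hypothetical in-process stand-ins, and only the names get_task_stream_index and start_index are taken from the description. The real change goes through scheduler RPCs, which are not modeled here.

```python
class FakeScheduler:
    """Hypothetical in-process stand-in for the scheduler side."""

    def __init__(self):
        self.records = []

    def get_task_stream_index(self):
        # Number of records appended so far; only ever grows.
        return len(self.records)

    def get_task_stream(self, start_index=0):
        # Everything appended at or after the given position.
        return self.records[start_index:]


class task_stream_block:
    """Collect whatever the scheduler records while the block is active."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.data = []

    def __enter__(self):
        # Snapshot the append index instead of reading a wall clock.
        self._start_index = self.scheduler.get_task_stream_index()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.data.extend(
            self.scheduler.get_task_stream(start_index=self._start_index)
        )
        return False  # never swallow exceptions raised inside the block


scheduler = FakeScheduler()
scheduler.records.append({"key": "earlier"})

with task_stream_block(scheduler) as ts:
    scheduler.records.append({"key": "x-1"})
    scheduler.records.append({"key": "x-2"})

keys = [r["key"] for r in ts.data]
print(keys)
```

Only the records appended between enter and exit end up in ts.data; the record that existed before the block is excluded by position, not by timestamp.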

…ontext manager (dask#9253)

The get_task_stream context manager bounded the collected tasks with a wall-clock timestamp (time() - 0.1). collect() then bisected the buffer by comparing that boundary against each task's recorded stop time. When there was latency or clock skew between the client and the workers, a task that finished inside the block could carry a stop time earlier than the client's start boundary and be silently dropped, so get_task_stream() returned no tasks.

Record the scheduler's monotonic task-stream append index on entry and collect everything appended after it on exit. This removes the dependency on synchronized clocks entirely, as the maintainers' FIXME suggested.

Adds a get_task_stream_index scheduler RPC and a start_index path through collect()/get_task_stream(), with tests covering the index semantics and the clock-skew regression.

github-actions · 2026-05-28T18:59:33Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

31 files ±0 | 31 suites ±0 | 10h 49m 37s ⏱️ +14m 21s
4 082 tests +2 | 3 966 ✅ -1 | 113 💤 +1 | 3 ❌ +2
59 249 runs +32 | 56 643 ✅ +28 | 2 603 💤 +2 | 3 ❌ +2

For more details on these failures, see this check.

Results for commit 2d46d3f. ± Comparison against base commit bcad953.

This pull request skips 1 test.

distributed.deploy.tests.test_ssh ‑ test_defer_to_old

crusaderky

Some minor critiques

crusaderky · 2026-06-04T18:09:19Z

        plot=False,
        filename="task-stream.html",
        bokeh_resources=None,
+        start_index=None,


Suggested change

start_index=None,

this should not be exposed in the public API

crusaderky · 2026-06-04T18:09:33Z

            plot=plot,
            filename=filename,
            bokeh_resources=bokeh_resources,
+            start_index=start_index,


Suggested change

start_index=start_index,

crusaderky · 2026-06-04T18:11:04Z

        return self

    async def __aexit__(self, exc_type, exc_value, traceback):
        L = await self.client.get_task_stream(


Suggested change

L = await self.client.get_task_stream(

L = await self.client._get_task_stream(

crusaderky · 2026-06-04T18:11:43Z

        return self

    def __exit__(self, exc_type, exc_value, traceback):
        L = self.client.get_task_stream(


Suggested change

L = self.client.get_task_stream(

L = self.client.sync(self.client._get_task_stream(

crusaderky · 2026-06-04T18:11:49Z

        L = self.client.get_task_stream(
-            start=self.start, plot=self._plot, filename=self._filename
+            start_index=self._start_index, plot=self._plot, filename=self._filename
        )


Suggested change

)

))
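
Taken together, the suggestions above keep start_index out of the public get_task_stream signature and have the context manager call the private coroutine directly. A rough sketch of that layering follows. MiniClient and StreamBlock are hypothetical, and MiniClient.sync is a simplified stand-in that takes a coroutine object and runs it with asyncio.run; it is not dask's Client.sync and may differ from it in signature and behavior.

```python
import asyncio


class MiniClient:
    """Hypothetical client; only the method layering mirrors the suggestions."""

    def __init__(self, records):
        self._records = records

    def sync(self, coro):
        # Stand-in for running a coroutine to completion from blocking code.
        return asyncio.run(coro)

    async def _get_task_stream(self, start_index=None):
        # Private coroutine: the only place that knows about start_index.
        await asyncio.sleep(0)
        return self._records[start_index or 0:]

    def get_task_stream(self):
        # Public method: no start_index parameter is exposed.
        return self.sync(self._get_task_stream())


class StreamBlock:
    """Hypothetical context manager usable with both `with` and `async with`."""

    def __init__(self, client):
        self.client = client
        self.data = []

    def __enter__(self):
        self._start_index = len(self.client._records)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        L = self.client.sync(
            self.client._get_task_stream(start_index=self._start_index)
        )
        self.data.extend(L)

    async def __aenter__(self):
        self._start_index = len(self.client._records)
        return self

    async def __aexit__(self, exc_type, exc_value, traceback):
        L = await self.client._get_task_stream(start_index=self._start_index)
        self.data.extend(L)


records = ["a"]
client = MiniClient(records)

with StreamBlock(client) as block:
    records.append("b")
sync_result = block.data


async def main():
    async with StreamBlock(client) as ablock:
        records.append("c")
    return ablock.data


async_result = asyncio.run(main())
print(sync_result, async_result)
```

The design point is that callers of the public method cannot pass an index, while both exit paths still share one private implementation.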

crusaderky · 2026-06-04T18:16:39Z

+def test_collect_start_index_ignores_clock():
+    # When the worker clock lags the client clock (or there is latency), a task
+    # can finish with a recorded stop time that is earlier than the client's
+    # ``start`` boundary. The time-based collection then drops the task, which
+    # is the latency/clock-skew failure from the original bug report. The
+    # index-based path must still return it.
+    plugin = TaskStreamPlugin.__new__(TaskStreamPlugin)
+    plugin.buffer = deque()
+    plugin.index = 0
+
+    now = time()
+    plugin.buffer.append({"key": "task", "startstops": [{"stop": now - 100}]})
+    plugin.index += 1
+
+    # Time-based collection misses the task because its stop time is in the past.
+    assert plugin.collect(start=now) == []
+    # Index-based collection captures it regardless of the clock.
+    assert len(plugin.collect(start_index=0)) == 1
+
+


This test is very hacky and IMHO over-engineered; I'd rather stick to the public API whenever possible.
It should just be removed.

MohammadYusif requested review from fjetter and jacobtomlinson as code owners May 28, 2026 18:14

crusaderky requested changes Jun 4, 2026


crusaderky mentioned this pull request Jun 5, 2026

Release 2026.6.0 dask/community#444

Open

5 tasks

MohammadYusif wants to merge 1 commit into dask:main from MohammadYusif:fix/issue-9253


2 participants
