fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race by Ma77Ball · Pull Request #5915 · apache/texera

Ma77Ball · 2026-06-23T22:09:58Z

What changes were proposed in this PR?

Makes ReconfigurationIntegrationSpec deterministic instead of flaky.

The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:

too little data: the workflow finishes before pause lands, so PAUSED is never reached and the test times out;
too much data (the medium CSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.

Python throughput varies run to run, so either way it was flaky.

The fix: replace the data-volume race with a slow source. slowRegionSourceOpDesc emits only 30 rows, 0.25s apart (~7.5s of steady production):

the workflow is provably still running for ~7.5s, so pause lands reliably (an explicit running window instead of racing throughput);
only 30 tiny tuples flow, so completion is near-instant and the timeout is never stressed;
the Python source generator suspends on pause and resumes the same frame, so the remaining rows flow through the reconfigured UDF(s) after resume and the assertions still hold.

Also kept from earlier: a TextInput -> Python UDF warmup in beforeAll (speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto the Region column.

Any related issues, documentation, discussions?

The reconfiguration mechanism itself is also covered by the pure-Scala ReconfigurationSpec (Java operators, stable amber job); this spec adds the Python-executor path.

How was this PR tested?

DAO/compile + WorkflowExecutionService/Test/compile succeed; scalafmt clean.
The amber-integration CI job runs ReconfigurationIntegrationSpec end to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

… a pause is generated leading to a failed ci

github-actions · 2026-06-23T22:10:15Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @shengquan-ni
You can notify them by mentioning @shengquan-ni in a comment.

github-actions · 2026-06-23T22:15:38Z

⚠️ Benchmark changes need a look

🟢 4 better · 🔴 2 worse · ⚪ 9 noise (<±5%) · 0 without baseline

Compared against main baca3f9 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	388	0.237	25,654/33,508/33,508 us	🔴 +13.9% / 🔴 +126.7%
🟢	bs=100 sw=10 sl=64	826	0.504	121,778/135,384/135,384 us	🟢 -8.2% / 🔴 +26.7%
⚪	bs=1000 sw=10 sl=64	923	0.563	1,085,953/1,122,596/1,122,596 us	⚪ within ±5% / 🔴 +10.4%

Baseline details

Latest main baca3f9 from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	388 tuples/sec	409 tuples/sec	786.27 tuples/sec	-5.1%	-50.7%
bs=10 sw=10 sl=64	MB/s	0.237 MB/s	0.249 MB/s	0.48 MB/s	-4.8%	-50.6%
bs=10 sw=10 sl=64	p50	25,654 us	22,517 us	12,495 us	+13.9%	+105.3%
bs=10 sw=10 sl=64	p95	33,508 us	36,587 us	14,784 us	-8.4%	+126.7%
bs=10 sw=10 sl=64	p99	33,508 us	36,587 us	18,468 us	-8.4%	+81.4%
bs=100 sw=10 sl=64	throughput	826 tuples/sec	827 tuples/sec	991.49 tuples/sec	-0.1%	-16.7%
bs=100 sw=10 sl=64	MB/s	0.504 MB/s	0.505 MB/s	0.605 MB/s	-0.2%	-16.7%
bs=100 sw=10 sl=64	p50	121,778 us	119,742 us	100,929 us	+1.7%	+20.7%
bs=100 sw=10 sl=64	p95	135,384 us	147,453 us	106,894 us	-8.2%	+26.7%
bs=100 sw=10 sl=64	p99	135,384 us	147,453 us	114,085 us	-8.2%	+18.7%
bs=1000 sw=10 sl=64	throughput	923 tuples/sec	929 tuples/sec	1,023 tuples/sec	-0.6%	-9.8%
bs=1000 sw=10 sl=64	MB/s	0.563 MB/s	0.567 MB/s	0.624 MB/s	-0.7%	-9.8%
bs=1000 sw=10 sl=64	p50	1,085,953 us	1,074,639 us	983,835 us	+1.1%	+10.4%
bs=1000 sw=10 sl=64	p95	1,122,596 us	1,106,653 us	1,023,777 us	+1.4%	+9.7%
bs=1000 sw=10 sl=64	p99	1,122,596 us	1,106,653 us	1,053,883 us	+1.4%	+6.5%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,515.31,200,128000,388,0.237,25654.31,33508.03,33508.03
1,100,10,64,20,2421.78,2000,1280000,826,0.504,121778.18,135383.88,135383.88
2,1000,10,64,20,21679.88,20000,12800000,923,0.563,1085953.39,1122595.78,1122595.78

codecov-commenter · 2026-06-23T22:19:07Z

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.89%. Comparing base (baca3f9) to head (38f532f).

Files with missing lines	Patch %	Lines
...g/apache/texera/amber/operator/TestOperators.scala	0.00%	7 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5915      +/-   ##
============================================
- Coverage     56.92%   56.89%   -0.04%     
+ Complexity     3029     3023       -6     
============================================
  Files          1129     1129              
  Lines         43794    43801       +7     
  Branches       4743     4743              
============================================
- Hits          24931    24921      -10     
- Misses        17388    17400      +12     
- Partials       1475     1480       +5

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.59% <ø> (ø)`		Carriedforward from baca3f9
amber	`58.70% <0.00%> (-0.09%)`	⬇️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (ø)`
file-service	`62.81% <ø> (ø)`
frontend	`50.09% <ø> (ø)`		Carriedforward from baca3f9
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.20% <ø> (ø)`		Carriedforward from baca3f9
python	`90.76% <ø> (ø)`		Carriedforward from baca3f9
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Ma77Ball · 2026-06-27T00:20:53Z

/request-review @aglinxinyuan

Yicong-Huang · 2026-06-28T22:57:22Z

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Copilot

Pull request overview

This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.

Changes:

Introduces a helper (boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source.
Switches the two CSV-sourced reconfiguration tests to use the new helper instead of smallCsvScanOpDesc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Ma77Ball · 2026-06-30T04:21:21Z

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This pr should help mitigate the issue to avoid it blocking PRs in the meantime.

Yicong-Huang

OK. Approved for a short term fix. Please keep the issue open or create a new issue to track for long term solution.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>

Yicong-Huang · 2026-06-30T22:59:51Z

maybe we need manually backport? I removed the v1.2 label

The test that keeps failing: it starts a workflow, pauses it, swaps some code, then resumes it. To pause, it asks each worker "please pause" and waits 5 seconds for a reply. The problem: some workers run Python. Starting Python is slow (it has to boot the interpreter and load libraries). If a worker is still busy starting up Python, it can't reply "paused" within 5 seconds → the test gives up → CI fails. The test that uses two Python workers is slowest to start, so it fails most.

Ma77Ball · 2026-07-01T18:52:36Z

Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.

Keep the deterministic rewrite
@ignore + tracking issue, rely on ReconfigurationSpec
Remove the spec

@Yicong-Huang

Yicong-Huang · 2026-07-01T20:45:06Z

It is definitely worth to keep. Reconfiguration and all interactive features are mainly used in Python UDFs.

Let's improve the test for long term stability.

increase amount of data for csv scan to avoid engine finishing before…

291e31c

… a pause is generated leading to a failed ci

github-actions Bot assigned Ma77Ball Jun 23, 2026

github-actions Bot added fix amber-integration labels Jun 23, 2026

Ma77Ball added 3 commits June 23, 2026 15:20

Merge branch 'main' into fix/reconfiguration-spec-flaky

94a697b

Merge branch 'main' into fix/reconfiguration-spec-flaky

740e963

bound the csv source spec to 10000

a0841b4

Ma77Ball marked this pull request as ready for review June 26, 2026 23:39

github-actions Bot requested a review from aglinxinyuan June 27, 2026 00:20

aglinxinyuan requested a review from Copilot June 28, 2026 23:47

aglinxinyuan approved these changes Jun 28, 2026

View reviewed changes

Copilot started reviewing on behalf of aglinxinyuan June 28, 2026 23:48 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Comment thread ...src/test/integration/org/apache/texera/amber/engine/e2e/ReconfigurationIntegrationSpec.scala Outdated

xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026

Yicong-Huang approved these changes Jun 30, 2026

View reviewed changes

Ma77Ball and others added 2 commits June 30, 2026 15:35

Potential fix for pull request finding

b1728be

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>

Merge branch 'main' into fix/reconfiguration-spec-flaky

d4e85e7

Yicong-Huang removed the release/v1.2 back porting to release/v1.2 label Jun 30, 2026

github-actions Bot added the engine label Jul 1, 2026

add slow source to avoid ci failure

12e0a6e

Ma77Ball force-pushed the fix/reconfiguration-spec-flaky branch from d5da5a6 to 12e0a6e Compare July 1, 2026 18:50

github-actions Bot added the common label Jul 1, 2026

Merge branch 'main' into fix/reconfiguration-spec-flaky

38f532f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
Ma77Ball wants to merge 9 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky

Ma77Ball commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 23, 2026 •

edited

Loading

Uh oh!

Ma77Ball commented Jun 27, 2026

Uh oh!

Yicong-Huang commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Ma77Ball commented Jun 30, 2026

Uh oh!

Yicong-Huang left a comment •

edited

Loading

Uh oh!

Yicong-Huang commented Jun 30, 2026

Uh oh!

Ma77Ball commented Jul 1, 2026 •

edited

Loading

Uh oh!

Yicong-Huang commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Uh oh!

Conversation

Ma77Ball commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Automated Reviewer Suggestions

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

codecov-commenter commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Ma77Ball commented Jun 27, 2026

Uh oh!

Yicong-Huang commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Ma77Ball commented Jun 30, 2026

Uh oh!

Yicong-Huang left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yicong-Huang commented Jun 30, 2026

Uh oh!

Ma77Ball commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Yicong-Huang commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Ma77Ball commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

codecov-commenter commented Jun 23, 2026 •

edited

Loading

Yicong-Huang left a comment •

edited

Loading

Ma77Ball commented Jul 1, 2026 •

edited

Loading