Skip to content

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915

Open
Ma77Ball wants to merge 9 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky
Open

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
Ma77Ball wants to merge 9 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Makes ReconfigurationIntegrationSpec deterministic instead of flaky.

The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:

  • too little data: the workflow finishes before pause lands, so PAUSED is never reached and the test times out;
  • too much data (the medium CSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.

Python throughput varies run to run, so either way it was flaky.

The fix: replace the data-volume race with a slow source. slowRegionSourceOpDesc emits only 30 rows, 0.25s apart (~7.5s of steady production):

  • the workflow is provably still running for ~7.5s, so pause lands reliably (an explicit running window instead of racing throughput);
  • only 30 tiny tuples flow, so completion is near-instant and the timeout is never stressed;
  • the Python source generator suspends on pause and resumes the same frame, so the remaining rows flow through the reconfigured UDF(s) after resume and the assertions still hold.

Also kept from earlier: a TextInput -> Python UDF warmup in beforeAll (speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto the Region column.

Any related issues, documentation, discussions?

The reconfiguration mechanism itself is also covered by the pure-Scala ReconfigurationSpec (Java operators, stable amber job); this spec adds the Python-executor path.

How was this PR tested?

  • DAO/compile + WorkflowExecutionService/Test/compile succeed; scalafmt clean.
  • The amber-integration CI job runs ReconfigurationIntegrationSpec end to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

… a pause is generated leading to a failed ci
@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @shengquan-ni
    You can notify them by mentioning @shengquan-ni in a comment.

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 4 better · 🔴 2 worse · ⚪ 9 noise (<±5%) · 0 without baseline

Compared against main baca3f9 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 388 0.237 25,654/33,508/33,508 us 🔴 +13.9% / 🔴 +126.7%
🟢 bs=100 sw=10 sl=64 826 0.504 121,778/135,384/135,384 us 🟢 -8.2% / 🔴 +26.7%
bs=1000 sw=10 sl=64 923 0.563 1,085,953/1,122,596/1,122,596 us ⚪ within ±5% / 🔴 +10.4%
Baseline details

Latest main baca3f9 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 388 tuples/sec 409 tuples/sec 786.27 tuples/sec -5.1% -50.7%
bs=10 sw=10 sl=64 MB/s 0.237 MB/s 0.249 MB/s 0.48 MB/s -4.8% -50.6%
bs=10 sw=10 sl=64 p50 25,654 us 22,517 us 12,495 us +13.9% +105.3%
bs=10 sw=10 sl=64 p95 33,508 us 36,587 us 14,784 us -8.4% +126.7%
bs=10 sw=10 sl=64 p99 33,508 us 36,587 us 18,468 us -8.4% +81.4%
bs=100 sw=10 sl=64 throughput 826 tuples/sec 827 tuples/sec 991.49 tuples/sec -0.1% -16.7%
bs=100 sw=10 sl=64 MB/s 0.504 MB/s 0.505 MB/s 0.605 MB/s -0.2% -16.7%
bs=100 sw=10 sl=64 p50 121,778 us 119,742 us 100,929 us +1.7% +20.7%
bs=100 sw=10 sl=64 p95 135,384 us 147,453 us 106,894 us -8.2% +26.7%
bs=100 sw=10 sl=64 p99 135,384 us 147,453 us 114,085 us -8.2% +18.7%
bs=1000 sw=10 sl=64 throughput 923 tuples/sec 929 tuples/sec 1,023 tuples/sec -0.6% -9.8%
bs=1000 sw=10 sl=64 MB/s 0.563 MB/s 0.567 MB/s 0.624 MB/s -0.7% -9.8%
bs=1000 sw=10 sl=64 p50 1,085,953 us 1,074,639 us 983,835 us +1.1% +10.4%
bs=1000 sw=10 sl=64 p95 1,122,596 us 1,106,653 us 1,023,777 us +1.4% +9.7%
bs=1000 sw=10 sl=64 p99 1,122,596 us 1,106,653 us 1,053,883 us +1.4% +6.5%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,515.31,200,128000,388,0.237,25654.31,33508.03,33508.03
1,100,10,64,20,2421.78,2000,1280000,826,0.504,121778.18,135383.88,135383.88
2,1000,10,64,20,21679.88,20000,12800000,923,0.563,1085953.39,1122595.78,1122595.78

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.89%. Comparing base (baca3f9) to head (38f532f).

Files with missing lines Patch % Lines
...g/apache/texera/amber/operator/TestOperators.scala 0.00% 7 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5915      +/-   ##
============================================
- Coverage     56.92%   56.89%   -0.04%     
+ Complexity     3029     3023       -6     
============================================
  Files          1129     1129              
  Lines         43794    43801       +7     
  Branches       4743     4743              
============================================
- Hits          24931    24921      -10     
- Misses        17388    17400      +12     
- Partials       1475     1480       +5     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.59% <ø> (ø) Carriedforward from baca3f9
amber 58.70% <0.00%> (-0.09%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (ø)
file-service 62.81% <ø> (ø)
frontend 50.09% <ø> (ø) Carriedforward from baca3f9
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.20% <ø> (ø) Carriedforward from baca3f9
python 90.76% <ø> (ø) Carriedforward from baca3f9
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Ma77Ball Ma77Ball marked this pull request as ready for review June 26, 2026 23:39
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan

@github-actions github-actions Bot requested a review from aglinxinyuan June 27, 2026 00:20
@Yicong-Huang

Copy link
Copy Markdown
Contributor

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

@aglinxinyuan aglinxinyuan requested a review from Copilot June 28, 2026 23:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.

Changes:

  • Introduces a helper (boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source.
  • Switches the two CSV-sourced reconfiguration tests to use the new helper instead of smallCsvScanOpDesc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This pr should help mitigate the issue to avoid it blocking PRs in the meantime.

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Approved for a short term fix. Please keep the issue open or create a new issue to track for long term solution.

Ma77Ball and others added 2 commits June 30, 2026 15:35
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Matthew B. <mgball@uci.edu>
@Yicong-Huang

Copy link
Copy Markdown
Contributor

maybe we need manually backport? I removed the v1.2 label

@Yicong-Huang Yicong-Huang removed the release/v1.2 back porting to release/v1.2 label Jun 30, 2026
The test that keeps failing: it starts a workflow,
  pauses it, swaps some code, then resumes it. To
  pause, it asks each worker "please pause" and waits
  5 seconds for a reply.

  The problem: some workers run Python. Starting
  Python is slow (it has to boot the interpreter and
  load libraries). If a worker is still busy starting
  up Python, it can't reply "paused" within 5 seconds
  → the test gives up → CI fails. The test that uses
  two Python workers is slowest to start, so it fails
  most.
@github-actions github-actions Bot added the engine label Jul 1, 2026
@Ma77Ball Ma77Ball force-pushed the fix/reconfiguration-spec-flaky branch from d5da5a6 to 12e0a6e Compare July 1, 2026 18:50
@github-actions github-actions Bot added the common label Jul 1, 2026
@Ma77Ball

Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.

  1. Keep the deterministic rewrite
  2. @ignore + tracking issue, rely on ReconfigurationSpec
  3. Remove the spec

@Yicong-Huang

@Yicong-Huang

Copy link
Copy Markdown
Contributor

It is definitely worth to keep. Reconfiguration and all interactive features are mainly used in Python UDFs.

Let's improve the test for long term stability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants