fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race #5915
Ma77Ball wants to merge 9 commits into
… a pause is generated leading to a failed ci
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 388 | 0.237 | 25,654/33,508/33,508 us | 🔴 +13.9% / 🔴 +126.7% |
| 🟢 | bs=100 sw=10 sl=64 | 826 | 0.504 | 121,778/135,384/135,384 us | 🟢 -8.2% / 🔴 +26.7% |
| ⚪ | bs=1000 sw=10 sl=64 | 923 | 0.563 | 1,085,953/1,122,596/1,122,596 us | ⚪ within ±5% / 🔴 +10.4% |
Baseline details
Latest main baca3f9 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 388 tuples/sec | 409 tuples/sec | 786.27 tuples/sec | -5.1% | -50.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.237 MB/s | 0.249 MB/s | 0.48 MB/s | -4.8% | -50.6% |
| bs=10 sw=10 sl=64 | p50 | 25,654 us | 22,517 us | 12,495 us | +13.9% | +105.3% |
| bs=10 sw=10 sl=64 | p95 | 33,508 us | 36,587 us | 14,784 us | -8.4% | +126.7% |
| bs=10 sw=10 sl=64 | p99 | 33,508 us | 36,587 us | 18,468 us | -8.4% | +81.4% |
| bs=100 sw=10 sl=64 | throughput | 826 tuples/sec | 827 tuples/sec | 991.49 tuples/sec | -0.1% | -16.7% |
| bs=100 sw=10 sl=64 | MB/s | 0.504 MB/s | 0.505 MB/s | 0.605 MB/s | -0.2% | -16.7% |
| bs=100 sw=10 sl=64 | p50 | 121,778 us | 119,742 us | 100,929 us | +1.7% | +20.7% |
| bs=100 sw=10 sl=64 | p95 | 135,384 us | 147,453 us | 106,894 us | -8.2% | +26.7% |
| bs=100 sw=10 sl=64 | p99 | 135,384 us | 147,453 us | 114,085 us | -8.2% | +18.7% |
| bs=1000 sw=10 sl=64 | throughput | 923 tuples/sec | 929 tuples/sec | 1,023 tuples/sec | -0.6% | -9.8% |
| bs=1000 sw=10 sl=64 | MB/s | 0.563 MB/s | 0.567 MB/s | 0.624 MB/s | -0.7% | -9.8% |
| bs=1000 sw=10 sl=64 | p50 | 1,085,953 us | 1,074,639 us | 983,835 us | +1.1% | +10.4% |
| bs=1000 sw=10 sl=64 | p95 | 1,122,596 us | 1,106,653 us | 1,023,777 us | +1.4% | +9.7% |
| bs=1000 sw=10 sl=64 | p99 | 1,122,596 us | 1,106,653 us | 1,053,883 us | +1.4% | +6.5% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,515.31,200,128000,388,0.237,25654.31,33508.03,33508.03
1,100,10,64,20,2421.78,2000,1280000,826,0.504,121778.18,135383.88,135383.88
2,1000,10,64,20,21679.88,20000,12800000,923,0.563,1085953.39,1122595.78,1122595.78
Codecov Report
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #5915 +/- ##
============================================
- Coverage 56.92% 56.89% -0.04%
+ Complexity 3029 3023 -6
============================================
Files 1129 1129
Lines 43794 43801 +7
Branches 4743 4743
============================================
- Hits 24931 24921 -10
- Misses 17388 17400 +12
- Partials 1475 1480 +5
/request-review @aglinxinyuan
That's not a final fix, right? If the medium dataset is also processed fast enough, the same issue could happen?
Pull request overview
This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.
Changes:
- Introduces a helper (`boundedCsvSource`) to create a CSV scan operator descriptor based on the medium CSV source.
- Switches the two CSV-sourced reconfiguration tests to use the new helper instead of `smallCsvScanOpDesc`.
Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This PR should help mitigate the issue so that it stops blocking PRs in the meantime.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>
Maybe we need to backport manually? I removed the v1.2 label.
The test that keeps failing: it starts a workflow, pauses it, swaps some code, then resumes it. To pause, it asks each worker "please pause" and waits 5 seconds for a reply. The problem: some workers run Python. Starting Python is slow (it has to boot the interpreter and load libraries). If a worker is still busy starting up Python, it can't reply "paused" within 5 seconds → the test gives up → CI fails. The test that uses two Python workers is slowest to start, so it fails most.
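The failure mode described above (a reply that arrives later than the wait allows) can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: `Worker`, `requestPause`, and `pauseWithin` are hypothetical names invented here, not the project's actual control-command API, and the startup delay is simulated with a sleeping thread.

```scala
import scala.concurrent.{Await, Future, Promise, TimeoutException}
import scala.concurrent.duration._
import scala.util.Try

object PauseTimeoutSketch {
  // Hypothetical stand-in for a worker: it can only acknowledge a pause
  // request once its (simulated) startup has finished.
  final class Worker(startupDelay: FiniteDuration) {
    def requestPause(): Future[String] = {
      val reply = Promise[String]()
      val t = new Thread(() => {
        Thread.sleep(startupDelay.toMillis) // e.g. a Python interpreter still booting
        reply.success("paused")
      })
      t.setDaemon(true)
      t.start()
      reply.future
    }
  }

  // Waits at most `timeout` for the reply; a slow worker yields a Left
  // instead of the "paused" acknowledgement.
  def pauseWithin(worker: Worker, timeout: FiniteDuration): Either[String, String] =
    Try(Await.result(worker.requestPause(), timeout)).toEither.left.map {
      case _: TimeoutException => s"no reply within $timeout"
      case other               => String.valueOf(other.getMessage)
    }
}
```

With a 5-second timeout, a worker whose startup takes longer than 5 seconds always produces the `Left` branch, which is the shape of the CI failure described in the comment.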
d5da5a6 to 12e0a6e
Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.
It is definitely worth keeping. Reconfiguration and all interactive features are mainly used in Python UDFs. Let's improve the test for long-term stability.
What changes were proposed in this PR?
Makes `ReconfigurationIntegrationSpec` deterministic instead of flaky.
The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:
- too little data: the workflow finishes before `pause` lands, so PAUSED is never reached and the test times out;
- too much data (the `medium` CSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.
Python throughput varies run to run, so either way it was flaky.
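The two-sided constraint above can be written out as simple arithmetic. The sketch below is a simplification with hypothetical names and inputs: only the 1-minute completion wait comes from the description, the throughput and pause-latency figures a caller passes in are assumptions, and the time spent paused is ignored.

```scala
object DataVolumeRaceSketch {
  // pauseLatencySec: how long until a pause request takes effect (assumed input).
  // completionWaitSec: how long the test waits for COMPLETED (1 minute in the PR text).
  final case class Bounds(pauseLatencySec: Double, completionWaitSec: Double)

  // Time a workflow needs to process `rows` at a given throughput.
  def runTimeSec(rows: Long, rowsPerSec: Double): Double = rows / rowsPerSec

  // A row count is safe only if the workflow outlives the pause latency even at
  // the fastest observed throughput AND still finishes within the completion
  // wait at the slowest observed throughput.
  def isSafe(rows: Long, minRowsPerSec: Double, maxRowsPerSec: Double, b: Bounds): Boolean =
    runTimeSec(rows, maxRowsPerSec) > b.pauseLatencySec &&
      runTimeSec(rows, minRowsPerSec) < b.completionWaitSec
}
```

When throughput varies widely between runs, the range of row counts that satisfies both inequalities shrinks and can disappear, which is why tuning the dataset size alone cannot make the test reliable.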
The fix: replace the data-volume race with a slow source. `slowRegionSourceOpDesc` emits only 30 rows, 0.25s apart (~7.5s of steady production):
- `pause` lands reliably (an explicit running window instead of racing throughput).
Also kept from earlier: a `TextInput -> Python UDF` warmup in `beforeAll` (speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto the `Region` column.
Any related issues, documentation, discussions?
The reconfiguration mechanism itself is also covered by the pure-Scala `ReconfigurationSpec` (Java operators, stable `amber` job); this spec adds the Python-executor path.
How was this PR tested?
- `DAO/compile` + `WorkflowExecutionService/Test/compile` succeed; scalafmt clean.
- `amber-integration` CI job runs `ReconfigurationIntegrationSpec` end to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.
Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF
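As an illustration of the slow-source idea in the description, here is a minimal sketch of a rate-limited row producer. It is not the PR's `slowRegionSourceOpDesc`; the names `slowRows` and `productionWindow` are hypothetical, and the sleep function is injectable only so the timing can be checked without actually waiting.

```scala
import scala.concurrent.duration._

object SlowSourceSketch {
  // Lazily yields the rows one at a time, sleeping `interval` before each,
  // so the running window depends on the row count and interval rather than
  // on how fast downstream operators happen to be.
  def slowRows[A](
      rows: Seq[A],
      interval: FiniteDuration,
      sleep: FiniteDuration => Unit = d => Thread.sleep(d.toMillis)
  ): Iterator[A] =
    rows.iterator.map { row =>
      sleep(interval)
      row
    }

  // Total time the source spends producing: rowCount * interval.
  def productionWindow(rowCount: Int, interval: FiniteDuration): FiniteDuration =
    interval * rowCount
}
```

For the figures quoted in the description, `productionWindow(30, 250.millis)` comes to 7.5 seconds, matching the "~7.5s of steady production" estimate.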