Skip to content

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037

Open
Ma77Ball wants to merge 2 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port
Open

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037
Ma77Ball wants to merge 2 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • WorkerSpec's input-port AssignPortRequest now passes an empty storageUris (List()) matching its empty partitionings, satisfying InputManager.addPort's size invariant, so the input port registers and the three worker tests pass deterministically.
  • AsyncRPCServer.invokeMethod now re-throws Errors (e.g. failed assertions) after replying to the sender, instead of swallowing them; normal Exceptions are still returned to the sender unchanged, so genuine invariant violations surface loudly rather than degrading into flaky timeouts.

Any related issues, documentation, discussions?

Closes: #6036

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly *WorkerSpec"; expect Tests: succeeded 3, failed 0.
  • Diagnostic: grep the run log for assertion failed. Before this fix the count is 3 (the assertion fired and was swallowed every run); after it is 0. Verified locally under Java 17.
  • The full amber suite is heavy and flaky to run locally, so the engine-wide AsyncRPCServer change is left to the amber CI job to exercise across all RPC handlers.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang
    You can notify them by mentioning @Yicong-Huang in a comment.

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 4 worse · ⚪ 9 noise (<±5%) · 0 without baseline

Compared against main 1073b22 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 373 0.228 26,220/36,930/36,930 us 🔴 +9.0% / 🔴 +145.1%
🔴 bs=100 sw=10 sl=64 784 0.478 127,638/151,215/151,215 us 🔴 +5.1% / 🔴 +40.5%
🟢 bs=1000 sw=10 sl=64 919 0.561 1,092,826/1,125,265/1,125,265 us 🟢 -7.4% / 🔴 +10.7%
Baseline details

Latest main 1073b22 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 373 tuples/sec 408 tuples/sec 777.62 tuples/sec -8.6% -52.0%
bs=10 sw=10 sl=64 MB/s 0.228 MB/s 0.249 MB/s 0.475 MB/s -8.4% -52.0%
bs=10 sw=10 sl=64 p50 26,220 us 24,060 us 12,612 us +9.0% +107.9%
bs=10 sw=10 sl=64 p95 36,930 us 37,749 us 15,070 us -2.2% +145.1%
bs=10 sw=10 sl=64 p99 36,930 us 37,749 us 18,360 us -2.2% +101.1%
bs=100 sw=10 sl=64 throughput 784 tuples/sec 799 tuples/sec 988.31 tuples/sec -1.9% -20.7%
bs=100 sw=10 sl=64 MB/s 0.478 MB/s 0.488 MB/s 0.603 MB/s -2.0% -20.8%
bs=100 sw=10 sl=64 p50 127,638 us 121,435 us 101,066 us +5.1% +26.3%
bs=100 sw=10 sl=64 p95 151,215 us 154,061 us 107,594 us -1.8% +40.5%
bs=100 sw=10 sl=64 p99 151,215 us 154,061 us 115,830 us -1.8% +30.5%
bs=1000 sw=10 sl=64 throughput 919 tuples/sec 918 tuples/sec 1,019 tuples/sec +0.1% -9.9%
bs=1000 sw=10 sl=64 MB/s 0.561 MB/s 0.56 MB/s 0.622 MB/s +0.2% -9.8%
bs=1000 sw=10 sl=64 p50 1,092,826 us 1,085,200 us 986,982 us +0.7% +10.7%
bs=1000 sw=10 sl=64 p95 1,125,265 us 1,215,807 us 1,028,491 us -7.4% +9.4%
bs=1000 sw=10 sl=64 p99 1,125,265 us 1,215,807 us 1,058,493 us -7.4% +6.3%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,535.81,200,128000,373,0.228,26219.98,36930.48,36930.48
1,100,10,64,20,2551.66,2000,1280000,784,0.478,127637.86,151214.59,151214.59
2,1000,10,64,20,21772.61,20000,12800000,919,0.561,1092826.24,1125264.82,1125264.82

@codecov-commenter

codecov-commenter commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 55.37%. Comparing base (c3161f7) to head (12c1107).
⚠️ Report is 67 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...exera/amber/engine/common/rpc/AsyncRPCServer.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6037      +/-   ##
============================================
+ Coverage     54.50%   55.37%   +0.86%     
- Complexity     2915     3002      +87     
============================================
  Files          1108     1111       +3     
  Lines         42807    42914     +107     
  Branches       4604     4614      +10     
============================================
+ Hits          23332    23763     +431     
+ Misses        18119    17762     -357     
- Partials       1356     1389      +33     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø) Carriedforward from 4485492
agent-service 34.36% <ø> (ø) Carriedforward from 4485492
amber 58.73% <0.00%> (+2.21%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from 4485492
config-service 57.35% <ø> (ø) Carriedforward from 4485492
file-service 58.59% <ø> (ø) Carriedforward from 4485492
frontend 48.27% <ø> (ø) Carriedforward from 4485492
pyamber 90.20% <ø> (ø) Carriedforward from 4485492
python 90.76% <ø> (ø) Carriedforward from 4485492
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 4485492

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Updated error handling comment for clarity.

Signed-off-by: Matthew B. <mgball@uci.edu>
@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine fix release/v1.2 back porting to release/v1.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky WorkerSpec: input-port AssignPortRequest violates addPort size invariant

3 participants