
sparse_strips: Use store_slice wherever possible #1616

Merged
LaurenzV merged 1 commit into main from laurenz/store
May 11, 2026
Conversation

Collaborator

@LaurenzV LaurenzV commented May 3, 2026

A while ago, we added the store_slice method to fearless_simd after realizing that copy_from_slice doesn't always compile to efficient code on x86. Using store_slice in vello should therefore improve performance, or at the very least not regress it. I measured this on NEON and I'm not seeing any regressions.

@LaurenzV LaurenzV requested a review from grebmeg May 3, 2026 20:28
Collaborator

grebmeg commented May 7, 2026

I believe this is the PR: linebender/fearless_simd#181?

Collaborator

@grebmeg grebmeg left a comment


I haven’t dug into store_slice much, but could you elaborate on why it might be more efficient on x86? Also, if NEON performance looks good, could we benchmark it on other platforms as well? I’m a bit hesitant to approve this without stronger evidence or proof.

Collaborator Author

LaurenzV commented May 7, 2026

Fair enough! We determined copy_from_slice to be slower because it often wasn't optimized into the best store intrinsics for the given SIMD level. It's been a while, but there is some additional discussion in smu160/PhastFT#58 and linebender/fearless_simd#185. Anyway, it's a fair concern, and I will see whether I can pull up the old vello bench repo to get the timings on x86 as well as WASM.
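To make the distinction concrete, here is a minimal sketch of the two store strategies. Everything in it is an assumption for illustration: `U8x16`, `store_via_copy` and `store_explicit` are hypothetical names, not fearless_simd's actual types or its store_slice implementation. The first method copies the lanes through copy_from_slice and leaves the choice of instruction to the optimizer; the second issues an explicit unaligned store intrinsic on x86_64 and falls back to a plain copy elsewhere.

```rust
/// Hypothetical 16-lane byte vector, used only for illustration.
/// This is NOT a fearless_simd type.
#[derive(Clone, Copy)]
#[repr(C, align(16))]
struct U8x16([u8; 16]);

impl U8x16 {
    /// Store by copying the lanes as a plain array. Whether this becomes
    /// a single vector store is left entirely to the optimizer.
    fn store_via_copy(self, dst: &mut [u8]) {
        dst[..16].copy_from_slice(&self.0);
    }

    /// Store with an explicit unaligned store intrinsic.
    #[cfg(target_arch = "x86_64")]
    fn store_explicit(self, dst: &mut [u8]) {
        use std::arch::x86_64::{__m128i, _mm_load_si128, _mm_storeu_si128};
        assert!(dst.len() >= 16, "destination too short");
        // SAFETY: SSE2 is part of the x86_64 baseline. `self` is 16-byte
        // aligned (see `repr(align(16))`), so the aligned load is valid,
        // and the length check above guarantees 16 writable bytes in `dst`.
        unsafe {
            let v = _mm_load_si128(self.0.as_ptr() as *const __m128i);
            _mm_storeu_si128(dst.as_mut_ptr() as *mut __m128i, v);
        }
    }

    /// Portable fallback for targets where the sketch has no intrinsic path.
    #[cfg(not(target_arch = "x86_64"))]
    fn store_explicit(self, dst: &mut [u8]) {
        assert!(dst.len() >= 16, "destination too short");
        dst[..16].copy_from_slice(&self.0);
    }
}
```

Both methods write the same 16 bytes; the only difference is whether the store instruction is chosen explicitly or inferred by the compiler, which is the property in question here.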

Collaborator Author

LaurenzV commented May 8, 2026

@grebmeg Here are my results from running in Chrome using WASM, no changes observed:

[image: WASM benchmark results]

Collaborator Author

LaurenzV commented May 8, 2026

Same for raw NEON.

[image: NEON benchmark results]

Will try to run the benchmarks on my AVX2 laptop now.

Collaborator Author

LaurenzV commented May 8, 2026

Hmm, so to be honest, I wasn't really able to measure a speed boost on AVX2. Sometimes it's a bit faster, sometimes a bit slower, but it mostly seems like noise (see below).

Anyway, since there at least don't seem to be any regressions, I would personally still be in favor of merging this. Now that fearless_simd provides an API for storing vectors, it's better to use it than to just hope that copy_from_slice optimizes the way we want. But it's up to you, just let me know how you feel!

fine/fill/opaque_short_u8_avx2
                        time:   [6.5817 ns 6.9296 ns 7.2834 ns]
                        change: [-8.8687% -3.6346% +1.8572%] (p = 0.19 > 0.05)
                        No change in performance detected.
 
fine/fill/opaque_long_u8_avx2
                        time:   [22.981 ns 26.268 ns 29.743 ns]
                        change: [-18.986% -7.9699% +4.7343%] (p = 0.22 > 0.05)
                        No change in performance detected.
 
fine/fill/transparent_short_u8_avx2
                        time:   [13.110 ns 13.139 ns 13.170 ns]
                        change: [-2.6401% -1.8585% -1.1454%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
 
fine/fill/transparent_long_u8_avx2
                        time:   [96.356 ns 96.650 ns 96.951 ns]
                        change: [-2.0641% -1.2450% -0.5130%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
 
fine/strip/solid_short_u8_avx2
                        time:   [11.733 ns 11.765 ns 11.801 ns]
                        change: [-0.6357% -0.1145% +0.4128%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
 
fine/strip/solid_long_u8_avx2
                        time:   [78.872 ns 79.010 ns 79.163 ns]
                        change: [-1.9532% -1.4405% -0.9604%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
 
fine/pack/pack_block_u8_avx2
                        time:   [90.445 ns 90.575 ns 90.726 ns]
                        change: [-1.2955% -0.8414% -0.4573%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
 
fine/pack/pack_regular_u8_avx2
                        time:   [134.62 ns 134.95 ns 135.29 ns]
                        change: [-0.7959% -0.4100% -0.0110%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
 
fine/pack/unpack_block_u8_avx2
                        time:   [91.827 ns 92.035 ns 92.244 ns]
                        change: [+0.4345% +0.8964% +1.4966%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
 
fine/pack/unpack_regular_u8_avx2
                        time:   [176.34 ns 176.80 ns 177.24 ns]
                        change: [-2.9782% -2.1356% -0.7432%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
 
fine/gradient/linear/opaque_u8_avx2
                        time:   [415.74 ns 441.66 ns 472.66 ns]
                        change: [+3.0365% +7.0960% +12.665%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high severe
 
fine/gradient/radial/opaque_u8_avx2
                        time:   [565.00 ns 577.77 ns 599.93 ns]
                        change: [-3.3081% +0.5814% +4.6626%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
 
fine/gradient/radial/opaque_conical_u8_avx2
                        time:   [655.38 ns 689.39 ns 725.05 ns]
                        change: [+2.4261% +7.3872% +12.451%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  20 (20.00%) high mild
  1 (1.00%) high severe
 
fine/gradient/sweep/opaque_u8_avx2
                        time:   [897.99 ns 917.25 ns 940.90 ns]
                        change: [+0.8496% +3.6072% +6.4875%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  6 (6.00%) high mild
  13 (13.00%) high severe
 
fine/gradient/extend/pad_u8_avx2
                        time:   [415.68 ns 435.54 ns 461.08 ns]
                        change: [-3.5682% +2.0580% +8.4721%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
 
fine/gradient/extend/repeat_u8_avx2
                        time:   [494.69 ns 512.33 ns 536.46 ns]
                        change: [-1.4447% +3.0115% +7.5000%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
 
fine/gradient/extend/reflect_u8_avx2
                        time:   [562.51 ns 574.79 ns 591.80 ns]
                        change: [-6.4356% -2.2666% +2.1999%] (p = 0.31 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
 
fine/gradient/many_stops_u8_avx2
                        time:   [762.38 ns 767.54 ns 773.92 ns]
                        change: [-0.8385% +0.0220% +1.0074%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
 
fine/gradient/transparent_u8_avx2
                        time:   [673.63 ns 675.59 ns 677.69 ns]
                        change: [-2.0569% -1.1367% +0.0059%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
 
fine/image/quality/low_u8_avx2
                        time:   [457.79 ns 459.96 ns 462.80 ns]
                        change: [-0.7154% -0.1406% +0.4472%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
 
fine/image/quality/medium_u8_avx2
                        time:   [2.6238 µs 2.6434 µs 2.6673 µs]
                        change: [+1.1172% +1.7327% +2.4433%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe
 
fine/image/quality/high_u8_avx2
                        time:   [90.028 µs 90.370 µs 90.764 µs]
                        change: [+0.5259% +1.0082% +1.5054%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
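
For reading the verdict lines in the output above, the following is a simplified sketch of how those verdicts appear to be derived. It assumes Criterion's default settings of a 0.05 significance level and a 1% noise threshold, and it is a re-statement for illustration rather than Criterion's own code; `Verdict` and `classify` are hypothetical names.

```rust
/// The four verdict lines that appear in the benchmark output.
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,    // "No change in performance detected."
    WithinNoise, // "Change within noise threshold."
    Improved,    // "Performance has improved."
    Regressed,   // "Performance has regressed."
}

/// Classify one benchmark from its p-value and the lower/upper bounds of
/// the reported change interval, both given in percent.
fn classify(p_value: f64, ci_low_pct: f64, ci_high_pct: f64) -> Verdict {
    // Assumed defaults; both can be configured in Criterion.
    const SIGNIFICANCE: f64 = 0.05;
    const NOISE_PCT: f64 = 1.0;

    if p_value >= SIGNIFICANCE {
        // The difference is not statistically significant.
        return Verdict::NoChange;
    }
    if ci_low_pct < -NOISE_PCT && ci_high_pct < -NOISE_PCT {
        // The whole interval lies below -1%.
        Verdict::Improved
    } else if ci_low_pct > NOISE_PCT && ci_high_pct > NOISE_PCT {
        // The whole interval lies above +1%.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the +/-1% band.
        Verdict::WithinNoise
    }
}
```

Under these assumptions, `classify(0.00, 0.5259, 1.5054)` (the image/quality/high figures) yields `Verdict::WithinNoise`, while `classify(0.00, 1.1172, 2.4433)` (image/quality/medium) yields `Verdict::Regressed`, matching the lines printed above.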

@LaurenzV LaurenzV requested a review from grebmeg May 8, 2026 10:22
Collaborator

@grebmeg grebmeg left a comment


Thanks for the measurements, @LaurenzV! It's good to see there's no perf regression. I think I roughly see the minor benefits here, but I'm still a bit skeptical. That said, I likely just don't have a clear picture of this yet, so I'm happy to defer to your intuition here.

@LaurenzV LaurenzV added this pull request to the merge queue May 11, 2026
Merged via the queue into main with commit 958cd19 May 11, 2026
17 checks passed
@LaurenzV LaurenzV deleted the laurenz/store branch May 11, 2026 09:54