
[ET-VK][custom_ops] Probe-then-scale adaptive dispatch stacking in test framework #19569

Merged

meta-codesync[bot] merged 1 commit into gh/SS-JIA/533/base from gh/SS-JIA/533/head on May 14, 2026
Conversation

@SS-JIA
Contributor

SS-JIA commented May 13, 2026

Stack from ghstack (oldest at bottom):

Add Google-Benchmark-style adaptive dispatch stacking to the prototyping test framework. The framework now automatically determines how many op invocations to chain into a single graph.execute() call by probing each test case once, then computing a per-execute dispatch count N that targets a fixed wall-clock budget. No per-binary tuning required.

Motivation: under tight-loop microbenchmark patterns, the Adreno DCVS governor pins the GPU clock at a low step (e.g. 220 MHz on Adreno 740, whose boost clock is 719 MHz), which inflates per-op latency by ~3x and makes per-device perf comparisons misleading. The fix is to drive sustained GPU activity by stacking many dispatches per execute. Previously, each test author had to hand-tune N for each op shape; this commit eliminates that work.

Algorithm (Google Benchmark style):

  1. For each TestCase, run a sample execute() at N=1 to measure per-op wall time (probe_us).
  2. Compute N = clamp(target_us / probe_us, 1, 1000), with target_us = 50000 us by default (see the sketch after this list). The generous target mitigates the fact that the probe itself runs at the pinned governor clock: the resulting N is somewhat under-sized, but it still drives enough sustained activity for the governor to escalate during the measurement loop.
  3. Build a new graph at the computed N and run warmup + benchmark as usual.
  4. Per-execute CPU wall time and aggregate GPU time are divided by N before being reported, so the user sees per-invocation latency. Per-shader samples remain undivided: each querypool entry is one dispatch, so at N>1 the aggregator naturally averages over N samples per execute × benchmark_runs executes.
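
To make steps 2 and 4 concrete, here is a minimal C++ sketch of the adaptive-N computation and the per-invocation reporting; the identifiers (compute_chained_dispatches, kTargetExecuteTimeUs, kMaxChainedDispatches, per_invocation_us) are illustrative stand-ins, not the framework's actual names:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative constants mirroring the defaults described above.
constexpr int64_t kTargetExecuteTimeUs = 50000;  // target wall-clock budget per execute()
constexpr int64_t kMaxChainedDispatches = 1000;  // cap on N

// probe_us: per-op wall time measured by one execute() at N = 1.
int64_t compute_chained_dispatches(int64_t probe_us) {
  if (probe_us <= 0) {
    return 1;  // degenerate probe; fall back to a single dispatch
  }
  return std::clamp<int64_t>(kTargetExecuteTimeUs / probe_us, 1, kMaxChainedDispatches);
}

// Step 4: per-execute wall time is divided by N so the reported number
// is per-invocation latency.
double per_invocation_us(double execute_wall_us, int64_t n) {
  return execute_wall_us / static_cast<double>(n);
}
```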

API surface:

  • execute_test_case gains an int chained_dispatches = 1 parameter. It is a pure 'run-at-N' primitive — no internal probe, no orchestration.
  • execute_test_cases (the orchestrator) does the probe-then-scale: for each test case, calls execute_test_case(tc, 1, 1, 1) to probe, computes N, prints a single '[probe] {name}: probe_us={us}, N={N}' line, then calls execute_test_case(tc, warmup_runs, benchmark_runs, N) for the real measurement.
  • setup_compute_graph gains an int op_invocations_per_execute = 1 parameter; it loops the opFn call that many times.
  • TestCase gains set_op_invocations_per_execute(int n) (manual override; 0 = adaptive, default) and set_target_execute_time_us(int us) (default 50000) plus getters.
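
Putting these pieces together, a rough sketch of the per-case orchestration inside execute_test_cases could look like the following. This builds on the sketch above; the assumption that execute_test_case returns the measured per-op wall time, and the accessor names on TestCase, are illustrative rather than the framework's exact signatures:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical per-case driver inside execute_test_cases. TestCase,
// execute_test_case, and compute_chained_dispatches are assumed to be
// the framework type, the run-at-N primitive, and the helper sketched
// above, respectively.
void run_test_case(TestCase& tc, int warmup_runs, int benchmark_runs) {
  int64_t n = tc.op_invocations_per_execute();  // 0 = adaptive (default)
  if (n == 0) {
    // Probe: a single execute() at N = 1.
    const int64_t probe_us =
        execute_test_case(tc, /*warmup_runs=*/1, /*benchmark_runs=*/1,
                          /*chained_dispatches=*/1);
    n = compute_chained_dispatches(probe_us);
    std::printf("[probe] %s: probe_us=%lld, N=%lld\n", tc.name().c_str(),
                static_cast<long long>(probe_us), static_cast<long long>(n));
  }
  // Real measurement: setup_compute_graph loops the opFn n times, so one
  // execute() issues n chained dispatches.
  execute_test_case(tc, warmup_runs, benchmark_runs, n);
}
```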

On-device validation (Samsung Galaxy S24, Adreno 750, test_q8ta_conv2d, 181 test cases): 84/84 ACCU passed. probe_us ranged from 125 to 9336 us; computed N ranged from 5 to 400; no case hit the N=1000 cap or the N=1 floor. The probe-to-measurement latency ratio reached 15.93x on small ACCU shapes, i.e. the governor visibly escalated from the pinned clock to the boost clock between the probe and the measurement loop, confirming the mitigation does real work. Heavy PERF shapes showed probe-to-measured ratios of only 1.04-1.12x because they already hold the GPU at a high clock from the first dispatch.

Test binaries do NOT need to be modified; the new behavior is automatic. Manual override via set_op_invocations_per_execute(N) remains available for advanced cases, e.g. ops whose numerical sensitivity requires N=1 (see the snippet below).
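
For such cases, the override might be used like this (a hypothetical snippet; TestCase construction and shape setup are elided):

```cpp
TestCase tc;  // construction/shape setup elided
tc.set_op_invocations_per_execute(1);  // pin N = 1 (0 = adaptive, the default)
// Alternatively, keep adaptive mode but shrink the per-execute budget:
// tc.set_target_execute_time_us(10000);
```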

Differential Revision: D105059943

@pytorch-bot
pytorch-bot commented May 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19569

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ You can merge normally! (3 Unrelated Failures, 2 Unclassified Failures)

As of commit 1821aa1 with merge base 1992bdd:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

  • Check Labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
  • periodic (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label May 13, 2026
meta-codesync bot merged commit 11ce070 into gh/SS-JIA/533/base May 14, 2026
171 of 185 checks passed
meta-codesync bot deleted the gh/SS-JIA/533/head branch May 14, 2026 01:56
meta-codesync bot temporarily deployed to cherry-pick-bot May 14, 2026 01:56
SS-JIA pushed a commit that referenced this pull request May 14, 2026

Labels

CLA Signed, fb-exported, meta-exported
