Skip to content

[ExecuTorch][WebGPU] Dynamic resize hooks for sigmoid and select_copy#20578

Open
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/70/basefrom
gh/JulianCloudNTH/70/head
Open

[ExecuTorch][WebGPU] Dynamic resize hooks for sigmoid and select_copy#20578
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/70/basefrom
gh/JulianCloudNTH/70/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Make sigmoid and select_copy serve any live shape from one graph; fix select's last-token index under dynamic shapes.

Problem: Both ops baked their dispatch/params/output shape at build() for the max shape. select_copy was worse: a negative index (e.g. -1 for the last token) was normalized against the build-time MAX dim, so at a smaller live S it selected a stale/zero position past the live data — producing wrong (often zero) output.

Solution:

  • sigmoid (generic add_unary_op): a resize hook recomputes num_elements/dispatch and sets the output cur_dims (shape-preserving).
  • select_copy: KEEP the raw (possibly negative) index at build; a resize hook re-resolves it against the LIVE dim, recomputes the output dims (= input minus dim), rebuilds the out/in TensorMeta UBOs and the dispatch.
  • Both keep their uniform buffer(s) alive via own_uniform_buffer.

Implementation:

  • The select out/in meta is rebuilt from synthetic WebGPUTensor{dims} via fill_tensor_meta (reads only .dims).
  • Mirrors Vulkan per-op resize_*_node.

Constraints: Behavior-neutral on static graphs (hooks fire only when an input's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906095

Differential Revision: D109906095

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20578

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 46fcce5 with merge base db5f1ba (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review and check for any areas or opportunities for modularization

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 2m 47s —— View job


Review: Dynamic resize hooks for sigmoid and select_copy

  • Read changed files & surrounding WebGPU runtime infrastructure
  • Analyze correctness of resize hooks
  • Identify modularization opportunities

The change is well-scoped and matches the established per-op resize-hook pattern (add, mul). The select_copy negative-index fix is correct and is a real bug fix: at build the index was normalized against the max dim, so -1 pointed past live data at smaller S; keeping raw_index and re-resolving against cur_dims(in_id)[dim] in the hook is the right fix. Both ops correctly switch to own_uniform_buffer so the hook can rewrite the UBOs.

Correctness — looks good, two minor notes

  1. Select.cpp — guards not re-checked on resize (acceptable). The build path validates nbytes == numel*4 (fp32) and the bind-group buffer size is pinned to the max nbytes. The hook only rewrites meta/params/workgroup count, never buffer sizes — correct, since dtype and allocation are static and only live sizes shrink. Worth a one-line comment in the hook noting the bind-group sizes intentionally stay at max so it doesn't read as an omission.

  2. UnaryOp.cpp — build vs. hook compute numel from different tensors. Build uses out_tensor.nbytes/sizeof(float); the hook uses numel_of(cur_dims(in_id)). They're guaranteed equal (the op asserts in.nbytes == out.nbytes), so this is fine, but reading in in one place and out in the other is a small inconsistency. Computing both from the input would read more uniformly.

No issues with the lambda captures (all captured values are used; handles copied before own_uniform_buffer transfers ownership).

Modularization — the more substantive feedback

This PR is behavior-correct, but it lands on top of a growing amount of near-duplicate boilerplate across ops/. Three concrete, increasingly valuable extractions:

1. A meta-UBO rewrite helper (highest value, lowest risk). The "synthesize a WebGPUTensor from dims → fill_tensor_metawgpuQueueWriteBuffer" sequence now appears in select (Select.cpp:194-201) and mul (mul/BinaryOp.cpp:187-197), and is the kind of thing every future shape-changing op will copy. A small helper in TensorMeta.h would absorb it:

inline void write_meta_from_dims(WebGPUGraph& g, WGPUBuffer buf,
                                 const std::vector<int64_t>& dims) {
  WebGPUTensor t; t.dims = dims;
  TensorMeta m; fill_tensor_meta(t, &m);
  wgpuQueueWriteBuffer(g.queue(), buf, 0, &m, sizeof(m));
}

Select's hook then drops to write_meta_from_dims(g, out_meta_buf, od); write_meta_from_dims(g, in_meta_buf, ind);. (A _broadcast overload covers mul.)

2. A set_workgroup_1d helper on the resize path. Every resize hook ends with the same g.dispatch_at(idx).workgroup_count_x = utils::compute_1d_workgroup_count(g.device(), numel, wg_size, name); (Select.cpp:206, UnaryOp.cpp:152, add/mul). A thin utils::set_dispatch_1d(g, dispatch_idx, numel, wg_size, name) removes the repeated dispatch_at(...).workgroup_count_x = plumbing and keeps the device/queue access in one place.

3. The 5-binding gather scaffold shared by slice and select. Select.cpp and Slice.cpp are ~110 lines of identical bind-group-layout / pipeline-layout / pipeline / bind-group setup for the canonical [in, out(rw), out_meta, in_meta, params] kernel — only the *Params struct and WGSL differ. This is the biggest duplication and the one most likely to drift. Consider a builder that returns the created objects:

struct GatherPipeline { WGPUComputePipeline pipeline; WGPUBindGroup bind_group;
                        WGPUBuffer out_meta_buf, in_meta_buf, params_buf; };
GatherPipeline build_gather_op(graph, in_tensor, out_tensor, wgsl,
                               wg_size, const void* params, size_t params_size);

The get_or_create_bgl cache already exists in WebGPUGraph, so the layout could also be keyed/shared rather than rebuilt per op. This one is a larger refactor and out of scope for this PR — but it's the place where per-op boilerplate is accumulating fastest, and select/slice are the natural first consolidation. Start the refactor →

Given CLAUDE.md's "no trivial single-use helpers" rule, #1 and #2 clear that bar because they're now used by 2+ ops and will be reused by every shape-changing op; #3 is a real refactor worth a follow-up PR rather than expanding this one.

Recommendation

Ship the fix as-is (correct and minimal). Suggest doing #1 either here or as the immediate next stacked diff while the pattern is fresh, and tracking #3 as a follow-up.
· branch gh/JulianCloudNTH/70/head

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants