Add Vision Transformer sample with attention visualization #569
Open
lyonsno wants to merge 1 commit into
Runs DeiT-Tiny (5.7M params) inference entirely in WebGPU compute shaders to classify images, and visualizes attention maps as interactive heatmap overlays showing which image patches the model focuses on. The sample is organized around the transformer compute stages: patch embedding, layer normalization, multi-head attention, MLP with GELU, and residual connections. Each compute shader is self-contained with at most 7 bindings per bind group. Model weights are int8 quantized (5.8MB). Third-party attribution is in sample/visionTransformer/THIRD_PARTY_NOTICES.md. Related to webgpu#350. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs DeiT-Tiny (5.7M params) inference entirely in WebGPU compute shaders to classify images, and visualizes attention maps as interactive heatmap overlays showing which image patches the model focuses on.
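To illustrate the heatmap step, here is a minimal sketch of how one row of attention weights could be turned into a per-patch intensity grid. It is not taken from the sample code. The helper name `clsAttentionToHeatmap`, the 14 x 14 grid (224 px input, 16 px patches), and the single class token at index 0 are all assumptions made for illustration.

```typescript
// Hypothetical helper, not part of the sample: converts the class token's
// attention row into a min-max normalized heatmap over the patch grid.
// Assumes the first `numPrefixTokens` entries are non-patch tokens (e.g. the
// class token) and the remaining entries are patches in row-major order.
function clsAttentionToHeatmap(
  clsRow: Float32Array,
  gridSize: number = 14,
  numPrefixTokens: number = 1,
): Float32Array {
  const numPatches = gridSize * gridSize;
  if (clsRow.length !== numPrefixTokens + numPatches) {
    throw new Error(
      `expected ${numPrefixTokens + numPatches} weights, got ${clsRow.length}`,
    );
  }

  // Find the value range over patch tokens only.
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < numPatches; i++) {
    const w = clsRow[numPrefixTokens + i];
    if (w < min) min = w;
    if (w > max) max = w;
  }

  // Rescale to [0, 1]; a flat row maps to all zeros.
  const range = max - min;
  const heatmap = new Float32Array(numPatches);
  for (let i = 0; i < numPatches; i++) {
    const w = clsRow[numPrefixTokens + i];
    heatmap[i] = range > 0 ? (w - min) / range : 0;
  }
  return heatmap;
}
```

Entry `row * gridSize + col` of the result is the intensity for the patch at that grid position, which an overlay could map to a color and alpha value.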
The sample is organized around the transformer compute stages: patch embedding, layer normalization, multi-head attention (Q/K/V projections, scaled dot-product scores, softmax, weighted sum), MLP with GELU, and residual connections. Each compute shader is self-contained with at most 7 bindings per bind group.
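As a CPU-side reference for the attention stages listed above (scaled dot-product scores, softmax, weighted sum), the following sketch computes a single attention head in plain TypeScript. It is illustrative only and does not reflect how the sample's shaders are written. The function names and the row-major `[n x d]` layout are assumptions.

```typescript
// Row-wise softmax, in place, over a [rows x cols] row-major matrix.
// Subtracts the row maximum before exponentiating for numerical stability.
function softmaxRows(m: Float32Array, rows: number, cols: number): void {
  for (let r = 0; r < rows; r++) {
    const base = r * cols;
    let max = -Infinity;
    for (let c = 0; c < cols; c++) max = Math.max(max, m[base + c]);
    let sum = 0;
    for (let c = 0; c < cols; c++) {
      const e = Math.exp(m[base + c] - max);
      m[base + c] = e;
      sum += e;
    }
    for (let c = 0; c < cols; c++) m[base + c] /= sum;
  }
}

// Single-head attention: softmax(Q K^T / sqrt(d)) V.
// q, k, v are [n x d] row-major (already projected).
// Returns the [n x n] attention weights and the [n x d] output.
function attentionHead(
  q: Float32Array,
  k: Float32Array,
  v: Float32Array,
  n: number,
  d: number,
): { weights: Float32Array; out: Float32Array } {
  const weights = new Float32Array(n * n);
  const scale = 1 / Math.sqrt(d);

  // Scaled dot-product scores.
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let dot = 0;
      for (let c = 0; c < d; c++) dot += q[i * d + c] * k[j * d + c];
      weights[i * n + j] = dot * scale;
    }
  }

  softmaxRows(weights, n, n);

  // Weighted sum of value rows.
  const out = new Float32Array(n * d);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      const w = weights[i * n + j];
      for (let c = 0; c < d; c++) out[i * d + c] += w * v[j * d + c];
    }
  }
  return { weights, out };
}
```

A reference like this can be handy for checking GPU results against a known-good answer on small inputs. The `weights` matrix is also the quantity an attention visualization would display.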
Model weights are int8 quantized (5.8MB committed binary). The quantized weights are dequantized to fp32 during loading. An offline Python converter (`tools/convert_deit_weights.py`) generates the weight file from the HuggingFace model; it is not needed to run the sample. If maintainers prefer the weights hosted externally instead of committed, I can move them.
Third-party attribution is in `sample/visionTransformer/THIRD_PARTY_NOTICES.md` (model: Meta DeiT, Apache-2.0; images: Unsplash).
Related to #350. This demonstrates transformer building blocks in WebGPU compute but does not specifically exercise DP4A, shader-f16, or subgroups; those would make good follow-up primitive-focused samples.
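For context on the int8-to-fp32 dequantization at load time mentioned above, here is a minimal sketch of what such a step can look like. It assumes symmetric per-tensor quantization with a single scale per tensor (`w ≈ q * scale`). The actual weight file layout and quantization scheme are not described here, so the function name and the scheme itself are assumptions.

```typescript
// Hypothetical helper: expands int8 weights to fp32.
// Assumes symmetric per-tensor quantization, i.e. one scale and no zero point.
function dequantizeInt8(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scale;
  }
  return out;
}
```

The resulting `Float32Array` could then be uploaded to a storage buffer, for example with `GPUQueue.writeBuffer`. Storing one byte per weight is what keeps a roughly 5.7M-parameter model near 5.8MB on disk.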
I'm happy to make any changes needed. Please let me know if the scope or asset size is a concern.