Add Vision Transformer sample with attention visualization #569

Open: lyonsno wants to merge 1 commit into webgpu:main from lyonsno:vit-attention-visualization

lyonsno commented Jun 23, 2026

Runs DeiT-Tiny (5.7M params) inference entirely in WebGPU compute shaders to classify images, and visualizes attention maps as interactive heatmap overlays showing which image patches the model focuses on.
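
For readers who want a concrete picture of the heatmap step, here is a minimal sketch of turning one attention row into a normalized patch grid. It assumes the usual DeiT-Tiny input geometry (224×224 image, 16×16 patches, so a 14×14 grid of 196 patch tokens) and that non-patch tokens come first in the row. `attentionToHeatmap` and its parameters are hypothetical names for illustration, not the sample's actual API.

```typescript
// Hypothetical helper: converts one row of attention weights (for example,
// the class token attending to every token) into a normalized 2D patch grid.
// Assumption: the first `numPrefixTokens` entries are non-patch tokens
// (1 for a class token; distilled DeiT variants carry 2).
function attentionToHeatmap(
  attentionRow: Float32Array,
  gridSize = 14,
  numPrefixTokens = 1,
): number[][] {
  const numPatches = gridSize * gridSize;
  if (attentionRow.length !== numPrefixTokens + numPatches) {
    throw new Error(
      `expected ${numPrefixTokens + numPatches} weights, got ${attentionRow.length}`,
    );
  }
  const patches = attentionRow.subarray(numPrefixTokens);

  // Min-max normalize so the overlay uses the full color range.
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < patches.length; i++) {
    if (patches[i] < min) min = patches[i];
    if (patches[i] > max) max = patches[i];
  }
  const range = max - min || 1; // avoid division by zero on a flat map

  // Row-major reshape: patch index p maps to (row, col) = (p / grid, p % grid).
  const heatmap: number[][] = [];
  for (let row = 0; row < gridSize; row++) {
    const line: number[] = [];
    for (let col = 0; col < gridSize; col++) {
      line.push((patches[row * gridSize + col] - min) / range);
    }
    heatmap.push(line);
  }
  return heatmap;
}
```

Each cell can then be scaled up to the patch's pixel footprint and alpha-blended over the image. Min-max normalization is one choice among several; it is used here only because it keeps the overlay readable regardless of how peaked the attention is.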

The sample is organized around the transformer compute stages: patch embedding, layer normalization, multi-head attention (Q/K/V projections, scaled dot-product scores, softmax, weighted sum), MLP with GELU, and residual connections. Each compute shader is self-contained with at most 7 bindings per bind group.
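
As a point of comparison for the attention stage listed above, the following is a minimal CPU reference for a single head: scaled dot-product scores, a row-wise softmax, then the weighted sum over V. It is a sketch for checking shader output, not the sample's implementation; `softmaxInPlace`, `attentionHead`, and the row-major `[numTokens x headDim]` layout are assumptions made for illustration.

```typescript
// Numerically stable softmax over one row, in place.
function softmaxInPlace(row: Float32Array): void {
  let max = -Infinity;
  for (let i = 0; i < row.length; i++) max = Math.max(max, row[i]);
  let sum = 0;
  for (let i = 0; i < row.length; i++) {
    row[i] = Math.exp(row[i] - max); // subtract max to avoid overflow
    sum += row[i];
  }
  for (let i = 0; i < row.length; i++) row[i] /= sum;
}

// Single-head attention: softmax(Q K^T / sqrt(headDim)) V.
// q, k, v are assumed to be row-major [numTokens x headDim] matrices.
function attentionHead(
  q: Float32Array,
  k: Float32Array,
  v: Float32Array,
  numTokens: number,
  headDim: number,
): { weights: Float32Array; output: Float32Array } {
  const scale = 1 / Math.sqrt(headDim);
  const weights = new Float32Array(numTokens * numTokens);
  const output = new Float32Array(numTokens * headDim);

  for (let i = 0; i < numTokens; i++) {
    // View into row i of the weights matrix; writes go to `weights`.
    const row = weights.subarray(i * numTokens, (i + 1) * numTokens);

    // Scaled dot-product scores between query i and every key j.
    for (let j = 0; j < numTokens; j++) {
      let dot = 0;
      for (let d = 0; d < headDim; d++) {
        dot += q[i * headDim + d] * k[j * headDim + d];
      }
      row[j] = dot * scale;
    }
    softmaxInPlace(row);

    // Weighted sum of value rows.
    for (let j = 0; j < numTokens; j++) {
      for (let d = 0; d < headDim; d++) {
        output[i * headDim + d] += row[j] * v[j * headDim + d];
      }
    }
  }
  return { weights, output };
}
```

The returned `weights` matrix is the quantity an attention visualization would read from; in a multi-head model this function would run once per head on that head's slice of Q, K, and V.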

Model weights are int8-quantized (a 5.8 MB binary committed to the repository) and are dequantized to fp32 at load time. An offline Python converter (tools/convert_deit_weights.py) generates the weight file from the HuggingFace model; it is not needed to run the sample. If maintainers prefer the weights hosted externally instead of committed, I can move them.
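
The description does not say which quantization scheme the weight file uses. Purely as an illustration, the sketch below assumes symmetric per-tensor quantization, where each int8 value `q` stands for the fp32 weight `q * scale`. `dequantizeInt8` and `quantizeInt8` are hypothetical names and do not refer to the loader or to tools/convert_deit_weights.py.

```typescript
// Hypothetical load-time step. Assumption: symmetric per-tensor quantization,
// so each int8 value q represents the fp32 weight q * scale.
function dequantizeInt8(quantized: Int8Array, scale: number): Float32Array {
  const weights = new Float32Array(quantized.length);
  for (let i = 0; i < quantized.length; i++) {
    weights[i] = quantized[i] * scale;
  }
  return weights;
}

// Matching offline step under the same assumption: pick the scale so the
// largest-magnitude weight maps to +/-127, then round to the nearest integer.
function quantizeInt8(
  weights: Float32Array,
): { quantized: Int8Array; scale: number } {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  const scale = maxAbs > 0 ? maxAbs / 127 : 1; // all-zero tensors keep scale 1
  const quantized = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.round(weights[i] / scale);
    quantized[i] = Math.max(-127, Math.min(127, q)); // clamp to int8 range
  }
  return { quantized, scale };
}
```

Under this scheme the round-trip error per weight is at most half of `scale`. At one byte per parameter, 5.7M parameters come to roughly 5.7 MB, which is in line with the reported 5.8 MB file.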

Third-party attribution is in sample/visionTransformer/THIRD_PARTY_NOTICES.md (model: Meta DeiT Apache-2.0, images: Unsplash).

Related to #350. This demonstrates transformer building blocks in WebGPU compute but does not specifically exercise DP4A, shader-f16, or subgroups; those would make good follow-up primitive-focused samples.

I'm happy to make any changes needed. Please let me know if the scope or asset size is a concern.

Runs DeiT-Tiny (5.7M params) inference entirely in WebGPU compute shaders
to classify images, and visualizes attention maps as interactive heatmap
overlays showing which image patches the model focuses on.

The sample is organized around the transformer compute stages: patch
embedding, layer normalization, multi-head attention, MLP with GELU, and
residual connections. Each compute shader is self-contained with at most
7 bindings per bind group.

Model weights are int8 quantized (5.8MB). Third-party attribution is in
sample/visionTransformer/THIRD_PARTY_NOTICES.md.

Related to webgpu#350.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lyonsno force-pushed the vit-attention-visualization branch from 234350e to e1b2fd3 on June 24, 2026 00:41