We're excited to ship the first release of the WebGPU Execution Provider as a plugin EP for ONNX Runtime. Instead of being baked into the core onnxruntime binary, the WebGPU EP is now distributed as a standalone artifact that registers with an existing ONNX Runtime installation at runtime.
Highlights
- Broad operator coverage on WebGPU. Native WebGPU kernels for the operators needed by common transformer, vision, and generative workloads — including Conv variants, MatMul/Gemm, normalizations, attention (Attention, MultiHeadAttention, GroupQueryAttention), rotary embeddings, quantized matmul, quantized Mixture-of-Experts (QMoE), and more. See the Operator coverage section below for a summary.
- Quantized & accelerated kernels. DP4A and subgroup-matrix MatMulNBits, a FlashAttention kernel, and vendor-optimized Intel MatMul/Gemm paths. See the Performance features section below.
- Plugin EP packaging. WebGPU support now ships as a separate, independently versioned library (onnxruntime_providers_webgpu) that plugs into a compatible ONNX Runtime (1.24.4 or newer) at runtime. Users can adopt WebGPU acceleration without switching their core ORT package, and the EP can iterate on its own cadence.
- Cross-platform native binaries for Windows x64/arm64 (bundled with dxil.dll/dxcompiler.dll), Linux x64, and macOS arm64.
- Language packages.
  - Python: onnxruntime-ep-webgpu wheel, installed alongside the onnxruntime package, registered via onnxruntime.register_execution_provider_library(...). See package page for details on installation and usage.
  - .NET: Microsoft.ML.OnnxRuntime.EP.WebGpu NuGet package, referenced alongside Microsoft.ML.OnnxRuntime, registered via OrtEnv.RegisterExecutionProviderLibrary(...). See package page for details on installation and usage.
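As a rough illustration of the Python registration flow described above, the sketch below checks the installed ONNX Runtime version against the stated 1.24.4 minimum, registers the plugin library, and creates a session on the devices the EP reports. It is a minimal sketch, not the package's documented usage: the registration name `webgpu_plugin`, the `library_path` argument (how to locate the shared library is package-specific; see the package page), and the EP name `WebGpuExecutionProvider` are assumptions, as are the helper function names.

```python
import re

MIN_ORT_VERSION = (1, 24, 4)  # minimum ONNX Runtime version stated in these notes
WEBGPU_EP_NAME = "WebGpuExecutionProvider"  # assumption: name the plugin EP reports


def meets_minimum_version(version, minimum=MIN_ORT_VERSION):
    """Return True if a version such as '1.24.4' or '1.25.0.dev1' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= minimum


def select_webgpu_devices(ep_devices, ep_name=WEBGPU_EP_NAME):
    """Keep only the EP devices whose ep_name matches the WebGPU EP."""
    return [device for device in ep_devices if device.ep_name == ep_name]


def create_webgpu_session(model_path, library_path, registration_name="webgpu_plugin"):
    """Register the plugin library and build a session on its devices.

    library_path is the path to the onnxruntime_providers_webgpu shared
    library; registration_name is an arbitrary, hypothetical label.
    """
    # Imported lazily so the helpers above can be used without ONNX Runtime.
    import onnxruntime as ort

    if not meets_minimum_version(ort.__version__):
        raise RuntimeError(f"ONNX Runtime {ort.__version__} is older than 1.24.4")

    ort.register_execution_provider_library(registration_name, library_path)

    devices = select_webgpu_devices(ort.get_ep_devices())
    if not devices:
        raise RuntimeError("the registered library reported no WebGPU devices")

    options = ort.SessionOptions()
    options.add_provider_for_devices(devices, {})  # empty dict: default EP options
    return ort.InferenceSession(model_path, sess_options=options)
```

Filtering by EP name before calling `add_provider_for_devices` keeps the session from silently picking up devices belonging to other registered EPs.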
Operator coverage
The WebGPU EP registers kernels for the majority of ONNX standard-domain operators used by mainstream model architectures, plus a curated set of com.microsoft contrib operators. Highlights by category:
- Math, normalization & reduction: MatMul, Gemm, Softmax, LayerNormalization, RMSNormalization, InstanceNormalization, BatchNormalization, LpNormalization, unary/binary elementwise ops, all standard reductions (ReduceMean, ReduceSum, ReduceMax, ...), CumSum, Einsum, TopK, ArgMax/ArgMin.
- Neural network: Conv, ConvTranspose, MaxPool/AveragePool (and Global* variants), plus a FusedConv contrib op.
- Tensor manipulation: Transpose, Reshape, Slice, Concat, Split, Gather/GatherElements/GatherND, ScatterElements/ScatterND, Pad, Tile, Cast, Resize, GridSample, Where, Flatten, Squeeze, Identity, Shape, and more.
- Transformer / LLM contrib ops: Attention, MultiHeadAttention, GroupQueryAttention, RotaryEmbedding, SkipLayerNormalization, SkipSimplifiedLayerNormalization, SimplifiedLayerNormalization, BiasAdd, BiasGelu, BiasSplitGelu, FastGelu, Gelu, QuickGelu, CausalConvWithState, LinearAttention.
- Quantization: DequantizeLinear, MatMulNBits (with DP4A and subgroup-matrix paths), GatherBlockQuantized, QMoE.
For the authoritative list, see the kernel registrations in webgpu_execution_provider.cc and webgpu_contrib_kernels.cc.
Performance features
- DP4A and subgroup-matrix MatMulNBits paths for accelerated quantized matmul on supported hardware.
- FlashAttention kernel for attention-heavy workloads.
- Intel-optimized MatMul/Gemm code paths for improved performance on Intel GPUs.
- Program caching to amortize shader compilation costs across runs.
- Optional PIX frame capture and WebGPU profiler integration for performance investigation.
Known limitations
- Platform support in this release is limited to the platforms listed above (no mobile, no Linux arm64, no macOS x64).
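Applications that load the EP conditionally may want to check the host platform first. The sketch below encodes the platform list from these notes as data and tests the current host against it. The helper name `is_supported_platform` and the architecture alias table are illustrative assumptions, not part of any shipped package.

```python
import platform

# Platforms listed in these release notes, as (system, architecture) pairs.
SUPPORTED_PLATFORMS = {
    ("windows", "x64"),
    ("windows", "arm64"),
    ("linux", "x64"),
    ("darwin", "arm64"),  # macOS reports its system name as "Darwin"
}

# Map the values platform.machine() commonly returns to the names used above.
_ARCH_ALIASES = {
    "amd64": "x64",
    "x86_64": "x64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def is_supported_platform(system=None, machine=None):
    """Return True if the (system, machine) pair is in the release's platform list.

    Both arguments default to the values reported for the current host.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _ARCH_ALIASES.get(machine, machine)
    return (system, arch) in SUPPORTED_PLATFORMS
```

Calling `is_supported_platform()` with no arguments checks the current host, while passing explicit values makes it easy to validate a deployment matrix ahead of time.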
Acknowledgments
This initial release is the result of contributions from engineers at Microsoft, Intel, and the broader community. Thank you to everyone who built, reviewed, and tested the WebGPU plugin EP — including (in alphabetical order):
@aciddelgado, @adrastogi, @adrianlizarraga, @chilo-ms, @daijh, @derdeljan-msft, @edgchen1, @eserscor, @feich-ms, @fs-eire, @guschmue, @HectorSVC, @ingyukoh, @jchen10, @jiangzhaoming, @Jiawei-Shao, @jing-bao, @justinchuby, @kunal-vaishnavi, @mindest, @prathikr, @qjia7, @satyajandhyala, @shaoboyan091, @sheetalarkadam, @skottmckay, @snnn, @sushraja-msft, @tianleiwu, @titaiwangms, @TomCrypto, @vraspar, @wenqinI, @xenova, @xhcao, @xiaofeihan1, @yuslepukhin.
Special thanks to the Intel team for the vendor-optimized MatMul/Gemm kernels.
Note: This list was compiled on a best-effort basis from PRs that touched WebGPU EP-specific paths, so it may not capture every contribution. If yours was missed, the omission is unintentional — your work is no less appreciated.