microsoft/onnxruntime plugin-ep-webgpu/v0.1.0 on GitHub

We're excited to ship the first release of the WebGPU Execution Provider as a plugin EP for ONNX Runtime. Instead of being baked into the core onnxruntime binary, the WebGPU EP is now distributed as a standalone artifact that registers with an existing ONNX Runtime installation at runtime.

Highlights

Broad operator coverage on WebGPU. Native WebGPU kernels for the operators needed by common transformer, vision, and generative workloads — including Conv variants, MatMul/Gemm, normalizations, attention (Attention, MultiHeadAttention, GroupQueryAttention), rotary embeddings, quantized matmul, quantized Mixture-of-Experts (QMoE), and more. See the Operator coverage section below for a summary.
Quantized & accelerated kernels. DP4A and subgroup-matrix MatMulNBits, a FlashAttention kernel, and vendor-optimized Intel MatMul/Gemm paths. See the Performance features section below.
Plugin EP packaging. WebGPU support now ships as a separate, independently versioned library (onnxruntime_providers_webgpu) that plugs into a compatible ONNX Runtime (1.24.4 or newer) at runtime. Users can adopt WebGPU acceleration without switching their core ORT package, and the EP can iterate on its own cadence.
Cross-platform native binaries for Windows x64/arm64 (bundled with dxil.dll / dxcompiler.dll), Linux x64, and macOS arm64.
Language packages.
- Python: the onnxruntime-ep-webgpu wheel, installed alongside the onnxruntime package and registered via onnxruntime.register_execution_provider_library(...). See the package page for installation and usage details.
- .NET: the Microsoft.ML.OnnxRuntime.EP.WebGpu NuGet package, referenced alongside Microsoft.ML.OnnxRuntime and registered via OrtEnv.RegisterExecutionProviderLibrary(...). See the package page for installation and usage details.
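
As a rough illustration of the Python flow described above, the sketch below checks the installed ONNX Runtime version against the 1.24.4 minimum, registers the plugin library, and creates a session on the devices the EP exposes. The library path argument, the registration name `webgpu_plugin_ep`, and the EP name `WebGpuExecutionProvider` are assumptions made for illustration; the package page is the authority on the values and helpers the wheel actually provides.

```python
import re

# Minimum compatible ONNX Runtime version stated in these release notes.
MIN_ORT_VERSION = (1, 24, 4)


def parse_version(version: str) -> tuple:
    """Parse the leading 'major.minor[.patch]' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_compatible_ort(version: str) -> bool:
    """Return True if the given ONNX Runtime version is 1.24.4 or newer."""
    return parse_version(version) >= MIN_ORT_VERSION


def create_webgpu_session(
    model_path: str,
    library_path: str,
    registration_name: str = "webgpu_plugin_ep",  # hypothetical name
    ep_name: str = "WebGpuExecutionProvider",      # assumed EP name
):
    """Register the plugin EP library and build a session that uses it."""
    # Imported lazily so the version helpers above work without onnxruntime.
    import onnxruntime as ort

    if not is_compatible_ort(ort.__version__):
        raise RuntimeError(
            f"onnxruntime {ort.__version__} is older than the required 1.24.4"
        )

    # Load the plugin library into the existing ONNX Runtime installation.
    ort.register_execution_provider_library(registration_name, library_path)

    # Keep only the devices reported by the EP we are interested in.
    devices = [d for d in ort.get_ep_devices() if d.ep_name == ep_name]
    if not devices:
        raise RuntimeError(f"no devices reported for EP {ep_name!r}")

    options = ort.SessionOptions()
    options.add_provider_for_devices(devices, {})  # no extra provider options
    return ort.InferenceSession(model_path, sess_options=options)
```

Here `library_path` would point at the platform-specific onnxruntime_providers_webgpu shared library shipped in the wheel; how the wheel exposes that path is not covered in these notes.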

Operator coverage

The WebGPU EP registers kernels for the majority of ONNX standard-domain operators used by mainstream model architectures, plus a curated set of com.microsoft contrib operators. Highlights by category:

Math, normalization & reduction: MatMul, Gemm, Softmax, LayerNormalization, RMSNormalization, InstanceNormalization, BatchNormalization, LpNormalization, unary/binary elementwise ops, all standard reductions (ReduceMean, ReduceSum, ReduceMax, ...), CumSum, Einsum, TopK, ArgMax/ArgMin.
Neural network: Conv, ConvTranspose, MaxPool/AveragePool (and Global* variants), plus a FusedConv contrib op.
Tensor manipulation: Transpose, Reshape, Slice, Concat, Split, Gather/GatherElements/GatherND, ScatterElements/ScatterND, Pad, Tile, Cast, Resize, GridSample, Where, Flatten, Squeeze, Identity, Shape, and more.
Transformer / LLM contrib ops: Attention, MultiHeadAttention, GroupQueryAttention, RotaryEmbedding, SkipLayerNormalization, SkipSimplifiedLayerNormalization, SimplifiedLayerNormalization, BiasAdd, BiasGelu, BiasSplitGelu, FastGelu, Gelu, QuickGelu, CausalConvWithState, LinearAttention.
Quantization: DequantizeLinear, MatMulNBits (with DP4A and subgroup-matrix paths), GatherBlockQuantized, QMoE.

For the authoritative list, see the kernel registrations in webgpu_execution_provider.cc and webgpu_contrib_kernels.cc.

Performance features

DP4A and subgroup-matrix MatMulNBits paths for accelerated quantized matmul on supported hardware.
FlashAttention kernel for attention-heavy workloads.
Intel-optimized MatMul/Gemm code paths for improved performance on Intel GPUs.
Program caching to amortize shader compilation costs across runs.
Optional PIX frame capture and WebGPU profiler integration for performance investigation.
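
For readers unfamiliar with DP4A, the following sketch models the arithmetic the instruction performs: a dot product of four packed signed 8-bit lanes, accumulated into a wider integer. It is a plain-Python illustration only, not the EP's shader code, and the helper names are hypothetical.

```python
import struct


def pack_i8x4(values):
    """Pack four signed 8-bit ints into one 32-bit word (lane 0 in the low byte)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_i8x4(word):
    """Split a 32-bit word back into its four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dp4a(a_word, b_word, acc=0):
    """Return acc plus the dot product of the four int8 lanes of a_word and b_word."""
    lanes_a = unpack_i8x4(a_word)
    lanes_b = unpack_i8x4(b_word)
    return acc + sum(x * y for x, y in zip(lanes_a, lanes_b))


def dot_i8(a, b):
    """Dot product of two int8 vectors (length a multiple of 4), four lanes per step."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("vectors must have equal length divisible by 4")
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(pack_i8x4(a[i:i + 4]), pack_i8x4(b[i:i + 4]), acc)
    return acc
```

A quantized matmul built on this primitive processes four 8-bit products per instruction instead of one, which is where the speedup on supporting hardware comes from.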

Known limitations

This release supports only Windows x64/arm64, Linux x64, and macOS arm64; there are no builds for mobile, Linux arm64, or macOS x64.

Acknowledgments

This initial release is the result of contributions from engineers at Microsoft, Intel, and the broader community. Thank you to everyone who built, reviewed, and tested the WebGPU plugin EP — including (in alphabetical order):

@aciddelgado, @adrastogi, @adrianlizarraga, @chilo-ms, @daijh, @derdeljan-msft, @edgchen1, @eserscor, @feich-ms, @fs-eire, @guschmue, @HectorSVC, @ingyukoh, @jchen10, @jiangzhaoming, @Jiawei-Shao, @jing-bao, @justinchuby, @kunal-vaishnavi, @mindest, @prathikr, @qjia7, @satyajandhyala, @shaoboyan091, @sheetalarkadam, @skottmckay, @snnn, @sushraja-msft, @tianleiwu, @titaiwangms, @TomCrypto, @vraspar, @wenqinI, @xenova, @xhcao, @xiaofeihan1, @yuslepukhin.

Special thanks to the Intel team for the vendor-optimized MatMul/Gemm kernels.

Note: This list was compiled on a best-effort basis from PRs that touched WebGPU EP-specific paths, so it may not capture every contribution. If yours was missed, the omission is unintentional — your work is no less appreciated.
