flashinfer-ai/flashinfer v0.6.12rc1 on GitHub

What's Changed

Loosened trtllm_ragged_attention_deepseek shape assertion by @nvjullin in #3064
Update moe gemm by @IwakuraRein in #3239
perf: optimize per-token nvfp4 quantization kernel. by @IwakuraRein in #3237
build: add sccache-backed jit-cache builds and AOT diagnostics by @dierksen in #3205
non-override tactic control by @yanqinz2 in #3260
ci(jit-cache): limit sm110 builds to aarch64 by @dierksen in #3275
feat(moe): add SM120 W4A16 b12x kernels by @lukealonso in #3271
Add dynamic tokens-per-page TRTLLM-GEN GQA kernels by @PerkzZheng in #3259
fix(cute_dsl/moe): unbias autotuner profiling for tile_size enumeration by @leejnau in #3252
Support Kimi K2.5 H64 CuTe DSL MLA decode by @saltyminty in #3235
feat: FP8 output support for CUTLASS MLA paged attention by @carlyou in #2779
fix(jit): propagate -DNDEBUG to host-side cflags by @arpera in #3278
feat: add SM120 fmha_v2 kernels to AOT pip wheel builds by @blake-snc in #2885
bench(moe_deepseek): fix moe benchmark (supersedes #2886) by @leejnau in #3292
fix(gdn_decode): widen pool indices to Int64 to prevent int32 element-offset overflow by @vadiklyutiy in #3230
[chore] Add guard to blackwell GDN prefill by @jiahanc in #3267
fix: remove over-strict K%4 assert in get_shuffle_matrix_sf_a_row_indices by @jimmyzho in #3163
ci: isolate nightly package tests from source tree by @dierksen in #3274
Fix [Spark unit test CI]: defer torch._dynamo.disable to avoid import-time crash in CI by @kahyunnam in #3290
bench(moe_deepseek): scope autotune(True) to pre-warm only by @leejnau in #3301
Improved simple mamba SSU kernel by @ishovkun in #2962
add cuda tile dependency for cuda 13.0 by @nv-yunzheq in #3305
[Fix] Fix XQA V tile reading from wrong page when nbVItersPerXIter > 1 by @qsang-nv in #3022
fix: MNNVL Allreduce uses bitwise sentinel checking to avoid subnormal value issue (#3053) by @timlee0212 in #3304
Fix: remove nvfp4 llama4 blocker by @IwakuraRein in #3313
[chore] add mamba codeowners list by @jimmyzho in #3318
Modify release deletion command in workflow by @aleozlx in #3307
Add to code owners by @dhiraj113 in #3326
feat: Add CuTe DSL grouped-gemm + combine fusion support by @nvcastet in #2944
fix(gdn): allow importing gdn_decode without a CUDA device by @kahyunnam in #3293
feat: enable glm5 router gemm by @b8zhong in #3185
fix(fmha_v2): fix FP8 V-scratch pipeline and varlen scheduler on SM90 by @jimmyzho in #3276
fix typo llama routing issue in trtllm-gen moe by @IwakuraRein in #3303
feat(logging,trace): cuda-graph-compatible level-5/10 logging + fi_trace template additions/fixes by @yyihuang in #3172
Use cudnn 9.23 new API to query workspace with override shape by @yanqinz2 in #3291
feat: Expose unpacked topk weights for routed moe (fp4) by @aleozlx in #2425
Reland support lse in trtllm paged attn kernels by @murphymatt in #3116
fix(CI unit tests, cute_dsl, spark): set USER env var before torch._dynamo import for unmapped UIDs by @kahyunnam in #3314
feat(trace): embed runnable init() in every TraceTemplate by @yyihuang in #3221
feat(cute_dsl/moe): deterministic balanced autotune profile inputs by @leejnau in #3286
feat(cute_dsl/moe): add moe_output_memset_inplace dense memset wrapper by @leejnau in #3328
Fix/3170 dense blockscaled sm12x by @leonardHONG in #3180
test: enable bmm_mxfp8 cutlass backend coverage on SM12x by @leonardHONG in #3183
Ep api design - Build Infra dependencies by @Anerudhan in #3315
[feat] Add gemma RMS AR fusion by @jiahanc in #3322
checkpointing_ssu kernel: fused replay + conditional state-write for Mamba2 by @ishovkun in #3324
Ameyn/gdn bf16 dispatcher and 4d pool by @ameynaik-hub in #3268
Update trtllm FMHA cubins by @djmmoss in #3317
fix(trace): repair TGV and XQA MLA reference tests by @yyihuang in #3365
feat: Add 8x4 swizzle layout support to MXFP4 and MXFP8 CuTe-DSL kernels by @bkryu in #3357
Add AGENTS.md shim by @aleozlx in #3342
Add list_api script by @aleozlx in #3341
Support 4over6 nvfp4 for quantizer and fused MoE by @zianglih in #3264
Add DeepSeek V4 sparse MLA TRTLLM-GEN kernels by @PerkzZheng in #3269
Reject EP configurations in b12x MoE with a clear error by @kahyunnam in #3302
fix(cute_dsl): avoid MoE wrapper runner reference cycle by @leejnau in #3340
feat: Add support for LoRA delta in MOE mxint4 x bf16, MXFP8 & BF16 to trtllm backend by @djns99 in #3153
Restore monolithic CuTe-DSL MLA decode alongside modular, gated by cute_dsl_impl= by @pgera in #3296
feat: RMSNorm + RoPE fusion for WAN: flashinfer.diffusion_ops.fused_qk_rmsnorm_rope by @kahyunnam in #3148
fix deprecation warnings from cute-dsl by @b8zhong in #3333
feat(cute_dsl/moe): re-enable use_cold_l2_cache in CuteDslMoEWrapper TuningConfig by @leejnau in #3384
Add torch.compile-compatible custom op for fp4_quantize by @Kh4L in #3081
Replace SM120 W4A16 MoE kernels by @lukealonso in #3336
bump version to 0.6.12 by @aleozlx in #3388

New Contributors

@carlyou made their first contribution in #2779
@nvcastet made their first contribution in #2944
@Kh4L made their first contribution in #3081

Full Changelog: v0.6.11rc1...v0.6.12rc1
