github flashinfer-ai/flashinfer v0.6.12rc1
Release v0.6.12rc1

8 hours ago

What's Changed

  • Loosened trtllm_ragged_attention_deepseek shape assertion by @nvjullin in #3064
  • Update moe gemm by @IwakuraRein in #3239
  • perf: optimize per-token nvfp4 quantization kernel by @IwakuraRein in #3237
  • build: add sccache-backed jit-cache builds and AOT diagnostics by @dierksen in #3205
  • non-override tactic control by @yanqinz2 in #3260
  • ci(jit-cache): limit sm110 builds to aarch64 by @dierksen in #3275
  • feat(moe): add SM120 W4A16 b12x kernels by @lukealonso in #3271
  • Add dynamic tokens-per-page TRTLLM-GEN GQA kernels by @PerkzZheng in #3259
  • fix(cute_dsl/moe): unbias autotuner profiling for tile_size enumeration by @leejnau in #3252
  • Support Kimi K2.5 H64 CuTe DSL MLA decode by @saltyminty in #3235
  • feat: FP8 output support for CUTLASS MLA paged attention by @carlyou in #2779
  • fix(jit): propagate -DNDEBUG to host-side cflags by @arpera in #3278
  • feat: add SM120 fmha_v2 kernels to AOT pip wheel builds by @blake-snc in #2885
  • bench(moe_deepseek): fix moe benchmark (supersedes #2886) by @leejnau in #3292
  • fix(gdn_decode): widen pool indices to Int64 to prevent int32 element-offset overflow by @vadiklyutiy in #3230
  • [chore] Add guard to blackwell GDN prefill by @jiahanc in #3267
  • fix: remove over-strict K%4 assert in get_shuffle_matrix_sf_a_row_indices by @jimmyzho in #3163
  • ci: isolate nightly package tests from source tree by @dierksen in #3274
  • Fix [Spark unit test CI]: defer torch._dynamo.disable to avoid import-time crash in CI by @kahyunnam in #3290
  • bench(moe_deepseek): scope autotune(True) to pre-warm only by @leejnau in #3301
  • Improved simple mamba SSU kernel by @ishovkun in #2962
  • add cuda tile dependency for cuda 13.0 by @nv-yunzheq in #3305
  • [Fix] Fix XQA V tile reading from wrong page when nbVItersPerXIter > 1 by @qsang-nv in #3022
  • fix: MNNVL Allreduce uses bitwise sentinel checking to avoid subnormal value issue (#3053) by @timlee0212 in #3304
  • Fix: remove nvfp4 llama4 blocker by @IwakuraRein in #3313
  • [chore] add mamba codeowners list by @jimmyzho in #3318
  • Modify release deletion command in workflow by @aleozlx in #3307
  • Add to code owners by @dhiraj113 in #3326
  • feat: Add CuTe DSL grouped-gemm + combine fusion support by @nvcastet in #2944
  • fix(gdn): allow importing gdn_decode without a CUDA device by @kahyunnam in #3293
  • feat: enable glm5 router gemm by @b8zhong in #3185
  • fix(fmha_v2): fix FP8 V-scratch pipeline and varlen scheduler on SM90 by @jimmyzho in #3276
  • fix typo llama routing issue in trtllm-gen moe by @IwakuraRein in #3303
  • feat(logging,trace): cuda-graph-compatible level-5/10 logging + fi_trace template additions/fixes by @yyihuang in #3172
  • Use cudnn 9.23 new API to query workspace with override shape by @yanqinz2 in #3291
  • feat: Expose unpacked topk weights for routed moe (fp4) by @aleozlx in #2425
  • Reland support lse in trtllm paged attn kernels by @murphymatt in #3116
  • fix(CI unit tests, cute_dsl, spark): set USER env var before torch._dynamo import for unmapped UIDs by @kahyunnam in #3314
  • feat(trace): embed runnable init() in every TraceTemplate by @yyihuang in #3221
  • feat(cute_dsl/moe): deterministic balanced autotune profile inputs by @leejnau in #3286
  • feat(cute_dsl/moe): add moe_output_memset_inplace dense memset wrapper by @leejnau in #3328
  • Fix/3170 dense blockscaled sm12x by @leonardHONG in #3180
  • test: enable bmm_mxfp8 cutlass backend coverage on SM12x by @leonardHONG in #3183
  • EP API design - build infra dependencies by @Anerudhan in #3315
  • [feat] Add gemma RMS AR fusion by @jiahanc in #3322
  • checkpointing_ssu kernel: fused replay + conditional state-write for Mamba2 by @ishovkun in #3324
  • Ameyn/gdn bf16 dispatcher and 4d pool by @ameynaik-hub in #3268
  • Update trtllm FMHA cubins by @djmmoss in #3317
  • fix(trace): repair TGV and XQA MLA reference tests by @yyihuang in #3365
  • feat: Add 8x4 swizzle layout support to MXFP4 and MXFP8 CuTe-DSL kernels by @bkryu in #3357
  • Add AGENTS.md shim by @aleozlx in #3342
  • Add list_api script by @aleozlx in #3341
  • Support 4over6 nvfp4 for quantizer and fused MoE by @zianglih in #3264
  • Add DeepSeek V4 sparse MLA TRTLLM-GEN kernels by @PerkzZheng in #3269
  • Reject EP configurations in b12x MoE with a clear error by @kahyunnam in #3302
  • fix(cute_dsl): avoid MoE wrapper runner reference cycle by @leejnau in #3340
  • feat: Add support for LoRA delta in MoE mxint4 x bf16, MXFP8 & BF16 to trtllm backend by @djns99 in #3153
  • Restore monolithic CuTe-DSL MLA decode alongside modular, gated by cute_dsl_impl= by @pgera in #3296
  • feat: RMSNorm + RoPE fusion for WAN: flashinfer.diffusion_ops.fused_qk_rmsnorm_rope by @kahyunnam in #3148
  • fix deprecation warnings from cute-dsl by @b8zhong in #3333
  • feat(cute_dsl/moe): re-enable use_cold_l2_cache in CuteDslMoEWrapper TuningConfig by @leejnau in #3384
  • Add torch.compile-compatible custom op for fp4_quantize by @Kh4L in #3081
  • Replace SM120 W4A16 MoE kernels by @lukealonso in #3336
  • bump version to 0.6.12 by @aleozlx in #3388
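
For readers unfamiliar with the failure mode named in #3230 (int32 element-offset overflow), here is a minimal sketch of the general problem. It is not the kernel code from that PR: the function names, the index, and the stride are hypothetical values chosen only so that the product exceeds the signed 32-bit range.

```python
import ctypes


def offset_int32(pool_index: int, stride: int) -> int:
    """Element offset as a signed 32-bit multiply would store it (wraps around)."""
    return ctypes.c_int32(pool_index * stride).value


def offset_int64(pool_index: int, stride: int) -> int:
    """Same offset stored in a signed 64-bit integer (no wrap at this size)."""
    return ctypes.c_int64(pool_index * stride).value


# Hypothetical values: 40,000 pool entries of 65,536 elements each.
pool_index = 40_000
stride = 64 * 1024

wrapped = offset_int32(pool_index, stride)
correct = offset_int64(pool_index, stride)

print("int32 offset:", wrapped)   # negative: the product exceeded 2**31 - 1
print("int64 offset:", correct)
```

With a 32-bit index the computed offset goes negative once `pool_index * stride` passes 2**31 - 1, which is why widening the index type before the multiply avoids the problem.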

Full Changelog: v0.6.11rc1...v0.6.12rc1