flashinfer-ai/flashinfer: Release v0.6.13rc1

What's Changed

  • Run high-likelihood OOM culprits separately, record memory usage and test duration for analysis by @dierksen in #2961
  • fix(autotuner): differentiate file cache entries by runner specific kernel parameters by @qiching in #3367
  • feat: integrate cute-dsl Blackwell GQA decode into BatchDecodeWithPagedKVCacheWrapper by @richardmcai in #3360
  • Fix returning reference to temporary in moe gemm by @benbarsdell in #3332
  • Enable compression of GPU device binaries by @benbarsdell in #2949
  • feat: MNNVL Allreduce quant fusion and performance optimization by @timlee0212 in #3385
  • fix(norm): widen address arithmetic to int64 for large contiguous inputs > 2**31 elements by @bkryu in #3392
  • MLA Decode Autotuning Across TRTLLM-Gen and CuTe Backends by @Vinnie6167 in #3355
  • profiler: group perfetto traces by SM, one row per block by @Edenzzzz in #3038
  • Make cute dsl mxfp8/nvfp4 quantizer bitwise exact by @zianglih in #3387
  • Extend autotuner delay kernel length by @yanqinz2 in #3373
  • Enable smaller tile N for SM100 Cute-DSL NVFP4 GEMM by @b8zhong in #3403
  • test: align test_fmha_v2_prefill SM gating with is_sm12x_supported by @leonardHONG in #3182
  • feat(autotuner): enable per-op autotune bypass for faster framework warmup by @qiching in #3396
  • bench: Unify PDL behavior, add missing norm routines, and misc improvements by @bkryu in #3435
  • feat(kda): add recurrent KDA decode kernel with per-K gating by @djmmoss in #2572
  • Add mHC post mapping and pre big-fuse kernels by @jmydurant in #3285
  • [Feat] Add num_heads < 128 support for mla decode kernel by @Observer007 in #3309
  • Fix cross-warp race in checkpointing SSU kernel (mamba) by @ishovkun in #3439
  • feat(trace): add check callbacks to trace templates by @yyihuang in #3330
  • Add BGMV MoE CUDA kernels for multi-LoRA by @taehokim20 in #3249
  • replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #3430
  • [ci feat] Support /bot run tests/ to scope CI test runs by @kahyunnam in #3422
  • perf(attention): Speed up FP8 KV-cache prefill (FA2 BatchPrefill) by repacking K/V to BF16 in shared memory by @bkryu in #3485
  • Ep api design -- Adding the actual code and tests by @Anerudhan in #3453
  • Optimize mxfp8 quantization on sm100 by @IwakuraRein in #3289
  • Fix silent no-op autotuning for cuBLAS bmm_fp8 and cuDNN bmm_fp8/mm_fp4 by @bkryu in #3437
  • fix(quantization): nvfp4_quantize(backend='cuda') silently corrupts scale factors when global_scale is not float32 by @bkryu in #3497
  • feat(moe): add SWIGLUSTEP activation to CUTLASS fused MoE by @bkryu in #3492
  • Support LSE buffers in TRTLLM API by @saltyminty in #3410
  • NFC: replace deprecated API: cute.make_fragment by @brandon-yujie-sun in #3473
  • feat(moe): write routing_replay_out from custom routing kernels by @jdebache in #3382
  • Add CuTe DSL NVFP4 quantization with 4over6 FP16 scoring by @zianglih in #3448
  • fix intermittent exit 141 (SIGPIPE) in test resource summary by @yongwww in #3498
  • add_cudnn_mxfp8 by @yanqinz2 in #3489
  • fix: use routedScalingFactor to initialize mRouteScale by @yweng0828 in #3499
  • fix: Simple approach to restore support for bias for fp4 block scale types by @djns99 in #3416
  • docs(infra): document env vars, refresh SM list, fix stale paths in CLAUDE.md by @kangbintNV in #3440
  • docs(quant/sampling/activation): canonicalize quantization.rst, stub fp4_quantization.rst by @kangbintNV in #3447
  • docs(gemm): document GEMM + grouped_mm public surface; canonicalize aliases by @kangbintNV in #3442
  • Optimize MoE routing top-k reduction and non-power-of-two sorting by @jiahanc in #3476
  • [bugfix] TorchDistBackend.bcast uses global rank instead of local rank by @xuanyu-mistral in #3418
  • docs(comm): structural RST refactor for MoeAlltoAll/DCP/Mixed comm by @kangbintNV in #3445
  • [docs] Backfill missing docstrings and decorators across kernels by @kangbintNV in #3456
  • docs(attention): backfill missing/stale Attention/POD/cuDNN/CuteDSL APIs; restore single_prefill_with_kv_cache_return_lse by @kangbintNV in #3441
  • docs(moe): close fused_moe / trtllm_*_moe / CuteDSL MoE doc gaps by @kangbintNV in #3443
  • docs(comm): NumPy-style docstrings + Deprecated leads for 21 STALE comm APIs (no decorator changes) by @kangbintNV in #3444
  • bump version to 0.6.13 by @aleozlx in #3513
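
As an illustration of the int64 address-arithmetic fix listed above (#3392), the following is a minimal sketch of why a 32-bit index breaks once a contiguous input exceeds 2**31 elements. It is not FlashInfer code: `wrap_int32`, `flat_offset`, and the tensor shape are hypothetical, and the 32-bit wraparound is only simulated in Python, on the assumption that the narrow index behaves like a typical two's-complement C `int`.

```python
def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range (simulated wraparound)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def flat_offset(row: int, row_stride: int, col: int, *, use_int64: bool) -> int:
    """Flat element offset of (row, col) in a contiguous 2-D buffer.

    Hypothetical helper: with use_int64=False the result is wrapped to
    32 bits to mimic narrow index arithmetic; with use_int64=True the
    exact value is kept (Python integers do not overflow).
    """
    offset = row * row_stride + col
    return offset if use_int64 else wrap_int32(offset)


# Assumed shape for illustration: 300,000 rows x 8,192 columns,
# i.e. 2,457,600,000 elements, which is more than 2**31.
rows, hidden = 300_000, 8192
last_row = rows - 1

narrow = flat_offset(last_row, hidden, 0, use_int64=False)
wide = flat_offset(last_row, hidden, 0, use_int64=True)

print("32-bit offset:", narrow)  # negative: would index outside the buffer
print("64-bit offset:", wide)    # exact offset of the last row
```

In this sketch the narrow offset of the last row wraps to a negative number, while the widened computation keeps the exact value; this is the class of error that widening the arithmetic avoids.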

Full Changelog: v0.6.12rc3...v0.6.13rc1
