flashinfer-ai/flashinfer v0.6.13rc1 on GitHub

What's Changed

Run high-likelihood OOM culprits separately, record memory usage and test duration for analysis by @dierksen in #2961
fix(autotuner): differentiate file cache entries by runner specific kernel parameters by @qiching in #3367
feat: integrate cute-dsl Blackwell GQA decode into BatchDecodeWithPagedKVCacheWrapper by @richardmcai in #3360
Fix returning reference to temporary in moe gemm by @benbarsdell in #3332
Enable compression of GPU device binaries by @benbarsdell in #2949
feat: MNNVL Allreduce quant fusion and performance optimization by @timlee0212 in #3385
fix(norm): widen address arithmetic to int64 for large contiguous inputs > 2**31 elements by @bkryu in #3392
MLA Decode Autotuning Across TRTLLM-Gen and CuTe Backends by @Vinnie6167 in #3355
profiler: group perfetto traces by SM, one row per block by @Edenzzzz in #3038
Make cute dsl mxfp8/nvfp4 quantizer bitwise exact by @zianglih in #3387
Extend autotuner delay kernel length by @yanqinz2 in #3373
Enable smaller tile N for SM100 Cute-DSL NVFP4 GEMM by @b8zhong in #3403
test: align test_fmha_v2_prefill SM gating with is_sm12x_supported by @leonardHONG in #3182
feat(autotuner): enable per-op autotune bypass for faster framework warmup by @qiching in #3396
bench: Unify PDL behavior, add missing norm routines, and misc improvements by @bkryu in #3435
feat(kda): add recurrent KDA decode kernel with per-K gating by @djmmoss in #2572
Add mHC post mapping and pre big-fuse kernels by @jmydurant in #3285
[Feat] Add num_heads < 128 support for mla decode kernel by @Observer007 in #3309
Fix cross-warp race in checkpointing SSU kernel (mamba) by @ishovkun in #3439
feat(trace): add check callbacks to trace templates by @yyihuang in #3330
Add BGMV MoE CUDA kernels for multi-LoRA by @taehokim20 in #3249
replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #3430
[ci feat] Support /bot run tests/ to scope CI test runs by @kahyunnam in #3422
perf(attention): Speed up FP8 KV-cache prefill (FA2 BatchPrefill) by repacking K/V to BF16 in shared memory by @bkryu in #3485
Ep api design -- Adding the actual code and tests by @Anerudhan in #3453
Optimize mxfp8 quantization on sm100 by @IwakuraRein in #3289
Fix silent no-op autotuning for cuBLAS bmm_fp8 and cuDNN bmm_fp8/mm_fp4 by @bkryu in #3437
fix(quantization): nvfp4_quantize(backend='cuda') silently corrupts scale factors when global_scale is not float32 by @bkryu in #3497
feat(moe): add SWIGLUSTEP activation to CUTLASS fused MoE by @bkryu in #3492
Support LSE buffers in TRTLLM API by @saltyminty in #3410
NFC: replace deprecated API: cute.make_fragment by @brandon-yujie-sun in #3473
feat(moe): write routing_replay_out from custom routing kernels by @jdebache in #3382
Add CuTe DSL NVFP4 quantization with 4over6 FP16 scoring by @zianglih in #3448
fix intermittent exit 141 (SIGPIPE) in test resource summary by @yongwww in #3498
add_cudnn_mxfp8 by @yanqinz2 in #3489
fix: use routedScalingFactor to initialize mRouteScale by @yweng0828 in #3499
fix: Simple approach to restore support for bias for fp4 block scale types by @djns99 in #3416
docs(infra): document env vars, refresh SM list, fix stale paths in CLAUDE.md by @kangbintNV in #3440
docs(quant/sampling/activation): canonicalize quantization.rst, stub fp4_quantization.rst by @kangbintNV in #3447
docs(gemm): document GEMM + grouped_mm public surface; canonicalize aliases by @kangbintNV in #3442
Optimize MoE routing top-k reduction and non-power-of-two sorting by @jiahanc in #3476
[bugfix] TorchDistBackend.bcast uses global rank instead of local rank by @xuanyu-mistral in #3418
docs(comm): structural RST refactor for MoeAlltoAll/DCP/Mixed comm by @kangbintNV in #3445
[docs] Backfill missing docstrings and decorators across kernels by @kangbintNV in #3456
docs(attention): backfill missing/stale Attention/POD/cuDNN/CuteDSL APIs; restore single_prefill_with_kv_cache_return_lse by @kangbintNV in #3441
docs(moe): close fused_moe / trtllm_*_moe / CuteDSL MoE doc gaps by @kangbintNV in #3443
docs(comm): NumPy-style docstrings + Deprecated leads for 21 STALE comm APIs (no decorator changes) by @kangbintNV in #3444
bump version to 0.6.13 by @aleozlx in #3513

New Contributors

@richardmcai made their first contribution in #3360
@jmydurant made their first contribution in #3285
@taehokim20 made their first contribution in #3249
@brandon-yujie-sun made their first contribution in #3430
@xuanyu-mistral made their first contribution in #3418

Full Changelog: v0.6.12rc3...v0.6.13rc1
