flashinfer-ai/flashinfer v0.6.5
Release v0.6.5

Pre-release · one month ago

What's Changed

  • feat: BF16 GEMM benchmarking support by @raayandhar in #2525
  • [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in #2489
  • test: Skip test_decode_delta_rule.py by @bkryu in #2600
  • feat: add issue self-claim workflow for external contributors by @jwu1980 in #2586
  • ci: add cleanup step to nightly release self-hosted runner jobs by @yongwww in #2510
  • ci: fix H100 cleanup by @yongwww in #2590
  • tests: add bias testing to nvfp4 moe by @jimmyzho in #2585
  • feat: cute dsl mmfp4 for blackwell by @nv-yunzheq in #2540
  • fix: correct #pragma unoll typo to #pragma unroll in vec_dtypes.cuh by @Bias92 in #2611
  • fix: get tensors by const ref to not rely on deleted move constructor for TensorView by @hypdeb in #2602
  • Mamba SSU: better automatic kernel selection; algorithm selection optionally exposed to the user by @ishovkun in #2591
  • chore/feat: Add do_finalize to trtllm-gen fp8/f16 MoE APIs by @IwakuraRein in #2548
  • docs: Document setuptools upgrade requirement for editable installs with --no-build-isolation by @bkryu in #2541
  • docs: resolve TODO by documenting log2f vs logf performance rationale in sampling by @Bias92 in #2609
  • Ameyn/gdn bf16 tolerance parallel reduction by @ameynaik-hub in #2610
  • feat: trtllm tinygemm2 in flashinfer as bf16 routergemm by @jimmyzho in #2587
  • fix: cute dsl nvfp4 moe routing index error by @nv-yunzheq in #2629
  • [bugfix] Fix FilteredTopK overflow correctness by @jiangyinzuo in #2605
  • fix: add SM121 support to SM120 version guards by @Yuening-wa in #2631
  • benchmark: Enable speculative decode microbenchmarking for paged decode by @bkryu in #2628
  • feat: add is_sm12x_supported() helper for SM12x family detection by @blake-snc in #2574
  • benchmark: Add MXFP4/MXFP8 quantization mode support to FP4 MoE benchmark by @bkryu in #2635
  • fix: duplicate username bug in codeowners_analyzer.py by @sricketts in #2637
  • Perf: Optimize GDN decode pretranspose kernel for all batch sizes by @ameynaik-hub in #2588
  • support qk_nope_head_dim for 192 check for GLM-5 by @rainj-me in #2607
  • fix: trtllm_mxint4_block_scale_moe unit test to index output list by @jimmyzho in #2627
  • chore: Update CODEOWNERS by @flashinfer-bot in #2286
  • fix: Add fused MOE and GEMM AOT modules for SM121 by @blake-snc in #2654
  • refactor: pull trtllm-gen batch-gemm/gemm headers from artifactory; update tma descriptor shape init by @jimmyzho in #2235
  • fix: Add tests for the AutoTuner and fix bug in _find_nearest_profile by @danisereb in #2617
  • Bf16 routed moe by @IwakuraRein in #2594
  • perf: Update trtllm-gen batched GEMM kernels - faster, more NVFP4 tile dims, MXFP8 with relu2 act by @amitz-nv in #2667
  • Add code owner for scripts/codeownder_overrides.js by @aleozlx in #2656
  • feat: Autotuner support CUDA graph and cold L2 cache by @amitz-nv in #2663
  • benchmarks: Add FP8 input / BF16 output in ragged prefill benchmark by @bkryu in #2666
  • Fix ImportError in AllReduceFusionWorkspace destructor during Python shutdown by @chaunceyjiang in #2659
  • Version bump to 0.6.5 by @aleozlx in #2668

Full Changelog: v0.6.4...v0.6.5
