What's Changed
- feat: BF16 GEMM benchmarking support by @raayandhar in #2525
- [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in #2489
- test: Skip test_decode_delta_rule.py by @bkryu in #2600
- feat: add issue self-claim workflow for external contributors by @jwu1980 in #2586
- ci: add cleanup step to nightly release self-hosted runner jobs by @yongwww in #2510
- ci: fix H100 cleanup by @yongwww in #2590
- tests: add bias testing to nvfp4 moe by @jimmyzho in #2585
- feat: cute dsl mmfp4 for blackwell by @nv-yunzheq in #2540
- fix: correct #pragma unoll typo to #pragma unroll in vec_dtypes.cuh by @Bias92 in #2611
- fix: get tensors by const ref to not rely on deleted move constructor for TensorView by @hypdeb in #2602
- Mamba SSM: better automatic kernel selection + algorithm selection optionally exposed to the user by @ishovkun in #2591
- chore/feat: Add do_finalize to trtllm-gen fp8/f16 MoE APIs by @IwakuraRein in #2548
- docs: Document setuptools upgrade requirement for editable installs with --no-build-isolation by @bkryu in #2541
- docs: resolve TODO by documenting log2f vs logf performance rationale in sampling by @Bias92 in #2609
- Ameyn/gdn bf16 tolerance parallel reduction by @ameynaik-hub in #2610
- feat: trtllm tinygemm2 in flashinfer as bf16 routergemm by @jimmyzho in #2587
- fix: cute dsl nvfp4 moe routing index error by @nv-yunzheq in #2629
- [bugfix] Fix FilteredTopK overflow correctness by @jiangyinzuo in #2605
- fix: add SM121 support to SM120 version guards by @Yuening-wa in #2631
- benchmark: Enable speculative decode microbenchmarking for paged decode by @bkryu in #2628
- feat: add is_sm12x_supported() helper for SM12x family detection by @blake-snc in #2574
- benchmark: Add MXFP4/MXFP8 quantization mode support to FP4 MoE benchmark by @bkryu in #2635
- fix: duplicate username bug in codeowners_analyzer.py by @sricketts in #2637
- Perf: Optimize GDN decode pretranspose kernel for all batch sizes by @ameynaik-hub in #2588
- support qk_nope_head_dim for 192 check for GLM-5 by @rainj-me in #2607
- fix: trtllm_mxint4_block_scale_moe unit test to index output list by @jimmyzho in #2627
- chore: Update CODEOWNERS by @flashinfer-bot in #2286
- fix: Add fused MOE and GEMM AOT modules for SM121 by @blake-snc in #2654
- refactor: pull trtllm-gen batch-gemm/gemm headers from artifactory; update tma descriptor shape init by @jimmyzho in #2235
- fix: Add tests for the AutoTuner and fix bug in _find_nearest_profile by @danisereb in #2617
- Bf16 routed moe by @IwakuraRein in #2594
- perf: Update trtllm-gen batched GEMM kernels - faster, more NVFP4 tile dims, MXFP8 with relu2 act by @amitz-nv in #2667
- Add code owner for scripts/codeownder_overrides.js by @aleozlx in #2656
- feat: Autotuner support CUDA graph and cold L2 cache by @amitz-nv in #2663
- benchmarks: Add FP8 input / BF16 output in ragged prefill benchmark by @bkryu in #2666
- Fix ImportError in AllReduceFusionWorkspace destructor during Python shutdown by @chaunceyjiang in #2659
- Version bump to 0.6.5 by @aleozlx in #2668
New Contributors
- @jwu1980 made their first contribution in #2586
- @Bias92 made their first contribution in #2611
- @jiangyinzuo made their first contribution in #2605
- @chaunceyjiang made their first contribution in #2659
Full Changelog: v0.6.4...v0.6.5