## What's Changed
- Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
- fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
- Made AR output optional + aesthetic changes by @nvmbreughe in #1265
- init add gemm fp8 using cudnn backend by @ttyio in #1264
- Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
- CI: install `nvidia-nvshmem-cu12` by @EmilienM in #1262
- feat: enable trtllm-gen mla MTP by @yyihuang in #1258
- Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
- add trtllm-gen context attention by @IwakuraRein in #1239
- feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
- Add missing import in comm/init.py by @joker-eph in #1275
- hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
- Unify groupwise fp8 GEMM test by @cyx-6 in #1281
- fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
- fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
- Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
- Add shuffle matrix flag by @aleozlx in #1272
- Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
- patch error handling by @aleozlx in #1293
- Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
- refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
- add mm_fp4 use cudnn backend by @ttyio in #1288
- fix: minor errors in cubin loader by @yyihuang in #1295
- perf: use lightweight API to query device property by @azhurkevich in #1298
- refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
- Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
- bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
- minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
- [Feature] SM level profiler by @Edenzzzz in #1305
- Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
- Update cutlass fp4 moe kernels by @wenscarl in #1294
- Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
- test qkvo quantization scales not equal to 1 by @weireweire in #1314
- [fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
- Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
- minor: update devcontainer by @yyihuang in #1329
- Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
- Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
- add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
- only add cudnn dependency for x86 platform by @ttyio in #1332
- Make Fp8 MoE routing_bias optional by @aleozlx in #1319
- feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
- [Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
- Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267
## New Contributors
- @vlev02 made their first contribution in #1254
- @ttyio made their first contribution in #1264
- @azhurkevich made their first contribution in #1214
- @weireweire made their first contribution in #1242
- @IwakuraRein made their first contribution in #1239
- @nvpohanh made their first contribution in #1286
- @directhex made their first contribution in #1279
- @ilmarkov made their first contribution in #1284
- @elfiegg made their first contribution in #1303
- @PerkzZheng made their first contribution in #1307
- @bkryu made their first contribution in #1323
- @timlee0212 made their first contribution in #1321
**Full Changelog**: v0.2.8...v0.2.9rc2