What's Changed
- ci: select manylinux_2_28 builder for new torch+cuda versions by @yzh119 in #1000
- misc: update README.md by @yzh119 in #1003
- bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
- misc: fix kv-layout doc references by @Edenzzzz in #1009
- misc: more benchmark scripts in Python by @yzh119 in #1010
- misc: fix instrument code for mla profiler by @yzh119 in #1014
- bugfix: import wrapper of mla decode by @dhy2000 in #1013
- feat: update decode attention APIs by @yzh119 in #1007
- doc: use latest protobuf for profiler by @xslingcn in #1021
- feat: SM-constraint Communication Kernels by @yyihuang in #994
- feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
- bugfix: Fix custom mask not being reset after converting a custom mask to causal or non-causal by @yongchaoding in #1028
- fix: add zero init for KV tiled copy by @happierpig in #1029
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
- Add workflow to build aarch64 wheel by @yongwww in #1036
- Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
- fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
- feat: Softmax-free sampling by @kf-zhang in #1035
- feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
- add multi-item scoring by @arde171 in #1015
- [nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
- [nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
- fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050 (usage sketch after this list)
- Benchmark: POD vs batched prefill by @Edenzzzz in #1052
- [nvidia] initial support for blackwell kernels by @yzh119 in #1039
- Fix KV chunking for POD. by @AKKamath in #1054
- bugfix: temporarily disable split-kv in blackwell mla by @yzh119 in #1055
- bugfix: remove device allocation by @yzh119 in #1056
- Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
- bugfix: move cum_m calculation inside kernels by @yzh119 in #1060
- misc: add pull request template by @yzh119 in #1062
- bugfix: Cast build paths to str before setuptools Extension by @farnasirim in #1058
- Add PyTorch 2.7.0 build by @huydhn in #1063
- bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
- bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
- misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
- fix: fix a typo in docs by @acelyc111 in #1077
- misc: jit: Deprecate load_cuda_ops() by @abcdabcd987 in #1066
- misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
- misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
- misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
- misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
- misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
- misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
- misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
- ci: upgrade docker ci image by @yzh119 in #1082
- bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
- perf: accelerate blackwell grouped gemm by @yzh119 in #1086
- misc: update pull request template by @yzh119 in #1088
- Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
- bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
- comm: refactor and initialize flashinfer.comm module by @yzh119 in #1089
- misc: cleanup by @b8zhong in #1092
- misc: followup by @b8zhong in #1093
- [nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
- bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
- jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
- SM100 Groupwise GEMM K-Major Scale Support by @cyx-6 in #1102
- misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
- feat: composable logits processor by @xslingcn in #1099 (usage sketch after this list)
- feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
- bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
- doc: fix LogitsPipe example by @xslingcn in #1110
- bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
- Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
- fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
- hotfix: fix the blackwell fmha stream by @yzh119 in #1116
- fix: head_dim not being defined when sm_scale is not None by @majian4work in #1119
- doc: add Ask-AI widget by @xslingcn in #1121
- bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
- misc: update slack link by @yzh119 in #1120
- release: bump version to v0.2.6 by @yzh119 in #1122
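The #1050 fix above touches flashinfer.sampling.top_k_mask_logits, which keeps the top-k entries of each logits row and sets the rest to -inf. A minimal sketch of the intended usage, assuming a CUDA-enabled PyTorch build; the shapes and the degenerate row are illustrative only:

```python
import torch
import flashinfer

# Toy logits; the second row is the degenerate case from #1050,
# where all but one entry is already -inf.
logits = torch.randn(2, 32000, device="cuda")
logits[1, 1:] = float("-inf")

# Keep the top-50 logits per row and mask the rest to -inf;
# before #1050 rows dominated by -inf could hang this call.
masked = flashinfer.sampling.top_k_mask_logits(logits, 50)
probs = torch.softmax(masked, dim=-1)
```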
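The composable logits processor from #1099 (whose docs example was fixed in #1110) chains sampling transformations into a single pipeline. A rough sketch following the LogitsPipe pattern from the docs; the operator names and keyword arguments here are assumptions if the API has since changed:

```python
import torch
from flashinfer.logits_processor import LogitsPipe, Temperature, Softmax, TopP, Sample

# Compose a pipeline: temperature scaling -> softmax -> top-p filtering -> sampling.
pipe = LogitsPipe([
    Temperature(),  # scale logits by 1/temperature
    Softmax(),      # turn logits into probabilities
    TopP(),         # nucleus (top-p) filtering
    Sample(),       # draw token ids from the filtered distribution
])

logits = torch.randn(2, 32000, device="cuda")
token_ids = pipe(logits, temperature=0.7, top_p=0.9)  # runtime parameters
```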
New Contributors
- @yongchaoding made their first contribution in #1008
- @Edenzzzz made their first contribution in #1009
- @dhy2000 made their first contribution in #1013
- @kaixih made their first contribution in #1031
- @yongwww made their first contribution in #1036
- @rickyfeng0119 made their first contribution in #1042
- @kf-zhang made their first contribution in #1035
- @arde171 made their first contribution in #1015
- @farnasirim made their first contribution in #1058
- @huydhn made their first contribution in #1063
- @acelyc111 made their first contribution in #1077
- @b8zhong made their first contribution in #1092
- @joker-eph made their first contribution in #1051
- @wenscarl made their first contribution in #1113
- @majian4work made their first contribution in #1119
Full Changelog: v0.2.5...v0.2.6