What's Changed
- ci: select manylinux_2_28 builder for new torch+cuda versions by @yzh119 in #1000
- misc: update README.md by @yzh119 in #1003
- bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
- misc: fix kv-layout doc references by @Edenzzzz in #1009
- misc: more benchmark scripts in Python by @yzh119 in #1010
- misc: fix instrument code for mla profiler by @yzh119 in #1014
- bugfix: import wrapper of mla decode by @dhy2000 in #1013
- feat: update decode attention APIs by @yzh119 in #1007
- doc: use latest protobuf for profiler by @xslingcn in #1021
- feat: SM-constraint Communication Kernels by @yyihuang in #994
- feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
- bugfix: Fix custom mask not being reset after converting a custom mask to causal or non-causal by @yongchaoding in #1028
- fix: add zero init for KV tiled copy by @happierpig in #1029
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
- Add workflow to build aarch64 wheel by @yongwww in #1036
- Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
- fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
- feat: Softmax-free sampling by @kf-zhang in #1035
- feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
- add multi-item scoring by @arde171 in #1015
- [nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
- [nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
- fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050 (usage sketch after this list)
- Benchmark: POD vs batched prefill by @Edenzzzz in #1052
- [nvidia] initial support for blackwell kernels by @yzh119 in #1039
- Fix KV chunking for POD. by @AKKamath in #1054
- bugfix: temporarily disable split-kv in blackwell mla by @yzh119 in #1055
- bugfix: remove device allocation by @yzh119 in #1056
- Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
- bugfix: move cum_m calculation inside kernels by @yzh119 in #1060
- misc: add pull request template by @yzh119 in #1062
- bugfix: Cast build paths to str before setuptools Extension by @farnasirim in #1058
- Add PyTorch 2.7.0 build by @huydhn in #1063
- bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
- bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
- misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
- fix: fix a typo in docs by @acelyc111 in #1077
- misc: jit: Deprecate load_cuda_ops() by @abcdabcd987 in #1066
- misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
- misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
- misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
- misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
- misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
- misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
- misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
- ci: upgrade docker ci image by @yzh119 in #1082
- bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
- perf: accelerate blackwell grouped gemm by @yzh119 in #1086
- misc: update pull request template by @yzh119 in #1088
- Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
- bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
- comm: refactor and initialize flashinfer.comm module by @yzh119 in #1089
- misc: cleanup by @b8zhong in #1092
- misc: followup by @b8zhong in #1093
- [nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
- bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
- jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
- SM100 Groupwise GEMM K-Major Scale Support by @cyx-6 in #1102
- misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
- feat: composable logits processor by @xslingcn in #1099 (usage sketch after this list)
- feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
- bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
- doc: fix LogitsPipe example by @xslingcn in #1110
- bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
- Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
- fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
- hotfix: fix the blackwell fmha stream by @yzh119 in #1116
- fix: head_dim not being defined when sm_scale is not None by @majian4work in #1119
- doc: add Ask-AI widget by @xslingcn in #1121
- bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
- misc: update slack link by @yzh119 in #1120
- release: bump version to v0.2.6 by @yzh119 in #1122
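The #1050 fix above touches flashinfer.sampling.top_k_mask_logits, which keeps the top-k entries of each logits row and sets the rest to -inf. A minimal sketch of the intended usage, assuming a CUDA-enabled PyTorch build; the shapes and the degenerate row are illustrative only:

```python
import torch
import flashinfer

# Toy logits; the second row is the degenerate case from #1050,
# where all but one entry is already -inf.
logits = torch.randn(2, 32000, device="cuda")
logits[1, 1:] = float("-inf")

# Keep the top-50 logits per row and mask the rest to -inf;
# before #1050 rows dominated by -inf could hang this call.
masked = flashinfer.sampling.top_k_mask_logits(logits, 50)
probs = torch.softmax(masked, dim=-1)
```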
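The composable logits processor from #1099 (whose docs example was fixed in #1110) chains sampling transformations into a single pipeline. A rough sketch following the LogitsPipe pattern from the docs; the operator names and keyword arguments here are assumptions if the API has since changed:

```python
import torch
from flashinfer.logits_processor import LogitsPipe, Temperature, Softmax, TopP, Sample

# Compose a pipeline: temperature scaling -> softmax -> top-p filtering -> sampling.
pipe = LogitsPipe([
    Temperature(),  # scale logits by 1/temperature
    Softmax(),      # turn logits into probabilities
    TopP(),         # nucleus (top-p) filtering
    Sample(),       # draw token ids from the filtered distribution
])

logits = torch.randn(2, 32000, device="cuda")
token_ids = pipe(logits, temperature=0.7, top_p=0.9)  # runtime parameters
```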
New Contributors
- @yongchaoding made their first contribution in #1008
- @Edenzzzz made their first contribution in #1009
- @dhy2000 made their first contribution in #1013
- @kaixih made their first contribution in #1031
- @yongwww made their first contribution in #1036
- @rickyfeng0119 made their first contribution in #1042
- @kf-zhang made their first contribution in #1035
- @arde171 made their first contribution in #1015
- @farnasirim made their first contribution in #1058
- @huydhn made their first contribution in #1063
- @acelyc111 made their first contribution in #1077
- @b8zhong made their first contribution in #1092
- @joker-eph made their first contribution in #1051
- @wenscarl made their first contribution in #1113
- @majian4work made their first contribution in #1119
Full Changelog: v0.2.5...v0.2.6