## What's Changed
- feat: Add backend='auto' to mm_fp4 and enable autotune for backend='cudnn' by @bkryu in #1979
- fix: Fix bench_mm_fp8.py by @bkryu in #2129
- feat: Enable API Logging for Better Debugging POC by @bkryu in #2108
- fix: add a check for int32 indices in sampling.py by @raayandhar in #2127
- update autotuner input tensor random range by @jiahanc in #2116
- enable xqa speculative decoding by @qsang-nv in #2105
- Add custom communicator for trtllm_mnnvl_ar by @wenscarl in #2056
- fix: DeepSeek activation uninitialized data by @nekorobov in #2128
- chore: Update CODEOWNERS by @flashinfer-bot in #2135
- bugfix: fix unittest error introduced in #2056 by @yzh119 in #2136
- fix flaky xqa test by @qsang-nv in #2126
- fix: some bugs of headDim 256 trtllm-gen fmha kernels. by @PerkzZheng in #2137
- fix(trtllm): reset negative strideBatch to 0 for ragged KV layout to … by @YAMY1234 in #2134
- feat: add trtllm-gen per-tensor sparseMla kernels. by @PerkzZheng in #2138
- Use global TuningConfig, to fix memory leak caused by AutoTuner LRU cache and dynamic lambda TuningConfig by @juju812 in #2140
- feat: add seed offset args to sampler to allow cuda graph support by @ksukrit in #2132
- ci: Reduce test time by moving compilation off-line by @kahyunnam in #2089
- feat: TRTLLM FMHAv2 backend for ctx attention by @jimmyzho in #2142
- refactor: pass hopper deepgemm include directory through python by @yzh119 in #2090
- bugfix: add driver support to CUPTI benchmark function, issue #2145 by @nv-yunzheq in #2154
- Bump tvm ffi version to 0.1.4 by @cyx-6 in #2155
- Update Docker CI tags to 20251202-23ff744 by @flashinfer-bot in #2158
- misc: Label APIs for Logging by @bkryu in #2153
- Update nvidia-cutlass-dsl version to 4.3.1 by @aleozlx in #2161
- chore: Update CODEOWNERS by @flashinfer-bot in #2152
- feat: C++ side tensor validation by @raayandhar in #2160
- Update Docker CI tags to 20251203-4efb7bb by @flashinfer-bot in #2164
- ci: Install CUDA version specified torch first during container building. by @bkryu in #2167
- fix xqa mha_sm90.cu by @qsang-nv in #2157
- Update Docker CI tags to 20251203-1e15fed by @flashinfer-bot in #2172
- enable sm103 moe dsl backend by @aleozlx in #2149
- ci: Use stable Torch Release for cu130 by @bkryu in #2174
- tiny upd `mm_fp4` docstring by @b8zhong in #2177
- fix: compile flags for trtllm fmha_v2 by @jimmyzho in #2175
- Fix/dsl smem query by @aleozlx in #2178
- Update Docker CI tags to 20251204-cdc5fb7 by @flashinfer-bot in #2176
- feat: MxInt4 x Bf16 TRT-LLM Gen MoE support by @nekorobov in #2159
- refactor: Move mla code from decode.py to mla.py and add to documentation by @bkryu in #2163
- Fix gemm allreduce two shot by @aleozlx in #2171
- Update Docker CI tags to 20251205-54c1678 by @flashinfer-bot in #2179
- Rename noauxtc to fused_topk_deepseek by @nv-yunzheq in #2181
- refactor: update fa3 codebase and fix hopper unittest [part 1] by @yzh119 in #2111
- Add data type check for deepseek fp4 moe by @samuellees in #2165
- benchmark: Make use_cupti the default in microbenchmarks. by @bkryu in #2180
- ci: Specify MPI implementation to mpich by @bkryu in #2182
- Update Docker CI tags to 20251206-185d63a by @flashinfer-bot in #2184
- test: Skip sm90 test in test_jit_warmup.py if not on sm90 by @bkryu in #2189
- ci: Update sm12X minimum cuda capability to 12.9 in aot.py by @bkryu in #2188
- Super tiny fix version by @fzyzcjy in #2199
- docs: Document CUDA version support in README and installation page by @bkryu in #2197
- docs: Fix inaccurate API docstrings for attention prefill by @bkryu in #2196
- feat: unit-test and api change, w4a8 grouped-gemm fused MoE for SM90 by @jimmyzho in #2193
- Permute page table in benchmarking by @jhjpark in #2194
- Fix for moe on sm110 by @jhalabi-nv in #2190
- chore: update authorized codeowners by @jimmyzho in #2210
- perf: bunch of features and optimizations for top-k (sampling + sparse attention) by @yzh119 in #2119
- Refactor trtllm_mnnvl_allreduce by @timlee0212 in #2118
- chore: Update CODEOWNERS by @flashinfer-bot in #2186
- feat: support more head dim in RoPE kernel by @raayandhar in #2109
- Port TRT-LLM communication kernels to flashinfer by @djns99 in #2102
- cicd: Add sanity test script by @kahyunnam in #2212
- feat: add memcpy and memset to CUPTI timing method by @nv-yunzheq in #2223
- Added an initial implementation of Q and KV Cache in fp8 and to use t… by @Anerudhan in #2035
- feat: Support unpadded output hidden size for trtllm_fp4_block_scale_moe by @elvischenv in #2217
- fix: Eliminate the usage of CUDA ARCH macro in host function. by @timlee0212 in #2228
- misc: support checks for gemm by @jimmyzho in #2214
- feat: Cold L2 Cache Benchmarking with Rotating Buffers by @bkryu in #2213
- Move the run function definition out of BatchedGemmInterface by @jhalabi-nv in #2211
- make DeepGEMM swapAB available for linear gemm SM90 by @katec846 in #2131
- misc: upgrade tvm-ffi dependency to 0.1.6 by @yzh119 in #2229
- A unified API for the MNNVL and single-node/multi-GPU AllReduce kernels. by @nvmbreughe in #2130
- Update Docker CI tags to 20251217-f059241 by @flashinfer-bot in #2231
- Rebase FP8 SM100 Cutlass FMHA Attention to main (original PR#1238) by @pavanimajety in #2047
- [feat] Integrate SGLang concat_mla_k kernel into flashinfer by @jiahanc in #2237
- fix: add DeepSeek routing for Bf16xBf16 and MxIntxBf16 TRT-LLM Gen MoE by @nekorobov in #2234
- fix: Fix compilation with GCC 11 by @dbari in #2242
- feat: RMSNorm/Fused RMSNorm + FP8 Quantization kernels by @BLaZeKiLL in #2243
- feat: further optimize top-k and add fused top-k page construction kernels for DSA by @yzh119 in #2215
- test: Fix MNNVL tests to skip when container lacks SYS_PTRACE capability by @bkryu in #2245
- Remove cudaStreamSynchronize from gemm_groupwise_sm120.cuh for CUDA graph compatibility by @Copilot in #2244
- feat: support variable sequence length in decode kernel of trtllm-gen attention by @yaoyaoding in #2125
- feat: Fused RMSNorm + FP4 Quantization Kernels in CuTe-DSL by @bkryu in #2233
- Allreduce auto backend improvements by @nvmbreughe in #2239
- cicd / testing: Add xfails tracker script by @kahyunnam in #2227
- chore: export compile commands for better IDE integration by @yzh119 in #2253
- feat: support non-contiguous query for trtllm-gen attention backend by @yzh119 in #2254
- Fp8 attention are now part of cuDNN 9.17.1 by @Anerudhan in #2241
- feat: Support numLocalTokens=0 for moe All-to-all by @trevor-m in #2247
- feat: support inplace update output for get_batch_indices_positions by @elvischenv in #2257
- Fix CUTLASS FP8 gemm correctness issue on SM120/SM121 for shapes where N is not divisible by ScaleGranularityN. by @yongwww in #2261
- fix: support int64 IdType for RoPE part argument in `rope_quantize_fp8_append_paged_kv_cache` by @elvischenv in #2255
- [Minor] Reduce num blocks of qknorm in small batch size by @DarkSharpness in #2264
- test: use .float() in in F.cosine_similarity() in bmm_fp8 test by @yongwww in #2266
- feat: Add support for bmm mxfp8 by @danisereb in #2256
- [performance]optimize for nvfp4 by @Bruce-x-1997 in #2268
- chore: Update CODEOWNERS by @flashinfer-bot in #2218
- agent: add CLAUDE.md and claude skills by @yzh119 in #2240
- bugfix: fix claude skills by @yzh119 in #2275
- fix: Add global scale support and optional output allocation for RMSNorm+FP4Quant fusion kernels by @bkryu in #2260
- cicd: add a github workflow for xfails report script by @kahyunnam in #2273
- feat: IdType indices in sampling kernels by @raayandhar in #2281
- feat: add GDN Attention by @guangyunh-nv in #2276
- chore: update documentation and notice year to 2026 by @yzh119 in #2285
- Tiny fix bench tgv gemm by @vincentzed in #2277
- dependency: update nvidia-cutlass-dsl by @yzh119 in #2288
- Enable Hopper FA3 FP8 attention in decode.py by @nvpohanh in #2148
- Update Docker CI tags to 20260105-a97b5d7 by @flashinfer-bot in #2289
- [WIP] Refactor: simplify torch -> cute-dsl boilerplate and enable tvm-ffi for cute-dsl kernels by @yzh119 in #2279
- fix: Decode benchmark's fa2_tc uses backend=fa2 in wrapper by @bkryu in #2302
- bugfix: use torch cached default generators by @cyx-6 in #2295
- [TRTLLM-Gen Fmha] add optimized trtllm-gen decode kernels for high throughput + speculative decoding by @PerkzZheng in #2265
- update version to 0.6.0 by @nv-yunzheq in #2300
## New Contributors
- @YAMY1234 made their first contribution in #2134
- @juju812 made their first contribution in #2140
- @ksukrit made their first contribution in #2132
- @samuellees made their first contribution in #2165
- @jhjpark made their first contribution in #2194
- @jhalabi-nv made their first contribution in #2190
- @djns99 made their first contribution in #2102
- @katec846 made their first contribution in #2131
- @dbari made their first contribution in #2242
- @BLaZeKiLL made their first contribution in #2243
- @Copilot made their first contribution in #2244
- @yaoyaoding made their first contribution in #2125
- @DarkSharpness made their first contribution in #2264
- @danisereb made their first contribution in #2256
- @Bruce-x-1997 made their first contribution in #2268
- @guangyunh-nv made their first contribution in #2276
- @vincentzed made their first contribution in #2277
**Full Changelog**: v0.5.3...v0.6.0