What's Changed
- [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
- [None][test] correct test-db context for perf yaml file by @ruodil in #6686
- [None] [feat] Add model gpt-oss by @hlu1 in #6645
- [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
- [None][feat] Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
- [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
- [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
- [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
- [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
- [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
- [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
- [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
- [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
- [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
- [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
- [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
- [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
- [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
- [TRTLLM-5252][test] add mistral_small_3.1_24b perf test by @ruodil in #6685
- [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
- [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
- [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
- [None][test] fix yml condition error under qa folder by @ruodil in #6734
- [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
- [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
- [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
- [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
- [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
- [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
- [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
- [None][fix] Revert kvcache transfer by @chuangz0 in #6709
- [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
- [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
- [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
- [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
- [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
- [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
- [None][feat] Core Metrics Implementation by @hcyezhang in #5785
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
- [TRTLLM-6637][feat] Resolve KV cache divergence issue by @ziyixiong-nv in #6628
- [None][infra] Waive test main 0808 by @EmmaQiaoCh in #6751
- [#5048][enhance] AutoDeploy: Optimize prepare_inputs by @galagam in #6634
- [None][chore] Dead code elimination: no longer record/fetch through WindowBlockManager::mContextBlocksByHash by @eopXD in #6249
- [TRTLLM-6174][feat] Enable FP32 mamba ssm cache by @shaharmor98 in #6574
- [https://nvbugs/5444937][fix] Fixing kv_cache_event unit test by @pcastonguay in #6753
- [TRTLLM-6823][doc] Add checkpoint refactor docs by @shaharmor98 in #6592
- [None][feat] Support SharedTensor on MultimodalParams by @yechank-nvidia in #6254
- [None][feat] improve dataloading for benchmark_dataset by using batch… by @zerollzeng in #6548
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in #6736
- [None][fix] fix same pp disagg by @chuangz0 in #6730
- [None][feat] Add gpt-oss GSM8K test. by @Tracin in #6732
- [None][test] Test trtllm-bench AD vs. PT BEs on H100 single GPU by @MrGeva in #6487
- [TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI by @yiqingy0 in #6777
- [None][chore] remove closed bugs by @xinhe-nv in #6772
- [None][infra] Waive failed tests on main 0811 by @EmmaQiaoCh in #6778
- fix: Ensure that Python stub generation works against libnvidia-ml stubs by @MartinMarciniszyn in #6188
- [TRTLLM-5532][feat] store the block of context request into kv cache by @byshiue in #6683
- [None][doc] Add K2 tool calling examples by @lancelly in #6667
- [None][infra] Unwaive an updated case to test by @EmmaQiaoCh in #6791
- [None][chore] always try-catch when clear build folder in build_wheel.py by @zhenhuaw-me in #6748
- [TRTLLM-6812][feat] Add standardized GitHub issue templates and disable blank issues by @venkywonka in #6494
- [None][fix] Refactoring to avoid circular import when importing torch models by @rakib-hasan in #6720
- [None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically by @achartier in #6763
- [https://nvbugs/5385987][fix] Fix Qwen2 quantization issue by pinning transformers version by @chang-l in #6673
- [None][perf] Improve the performance of online EPLB on Hopper by better overlapping by @jinyangyuan-nvidia in #6624
- [https://nvbugs/5441438][fix] Set correct draft length for the cuda graph dummy request by @ziyixiong-nv in #6701
- [TRTLLM-6854][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill by @syuoni in #6774
- [#4403][autodeploy] Refactor: Move more transformations to new inf optimizer, Add quantization_source to factory interface by @Fridah-nv in #6760
- [None][feat] CUTLASS MoE FC2+Finalize fusion by @sklevtsov-nvidia in #3294
- [TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp by @lancelly in #6745
- [None][fix] Fix attention dp log by @Shunkangz in #6570
- [None][fix] fix ci by @QiJune in #6814
- [TRTQA-2920][chore] improve hang tests by @xinhe-nv in #6781
- [https://nvbugs/5438869][fix] Set nvfp4 expert w1 w3 weight scale to the same value if they're not by @jhaotingc in #6656
- [None][feat] Add GPT OSS support for AutoDeploy by @nvchenghaoz in #6641
- [#6187][feat] add LayerNorm module by @Funatiq in #6625
- [None][refactor] Simplify decoder state initialization by @Funatiq in #6559
- [TRTLLM-7008][fix] fix wideEP weights loading and args by @dongxuy04 in #6789
- [None][fix] Refactoring input prep to allow out-of-tree models by @rakib-hasan in #6497
- feat: Support custom repo_dir for SLURM script by @kaiyux in #6546
- [None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc by @lfr-0531 in #6811
- [TRTLLM-6772][feat] Multimodal benchmark_serving support by @yechank-nvidia in #6622
- [https://nvbugs/5452167][fix] Fix ngram padding issue by @mikeiovine in #6837
- [#6530][fix] Fix script when using calibration tensors from modelopt by @achartier in #6803
- [https://nvbugs/5412456][fix] Fix an illegal instruction was encountered by @zhou-yuxin in #6776
- [None][feat] DeepEP LL combine FP4 by @yilin-void in #6822
- [TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. by @hyukn in #6545
- [TRTLLM-7030][fix] Refactor the example doc of dist-serving by @Shixiaowei02 in #6766
- [TRTLLM-7093][fix] Fix the perf regression in cvt_fp4 kernels by @PerkzZheng in #6851
- [https://nvbugs/5412885][doc] Add the workaround doc for H200 OOM by @zhenhuaw-me in #6853
- [https://nvbugs/5378031] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend by @rosenrodt in #6200
- [None][infra] Waive failed cases on main by @EmmaQiaoCh in #6863
- [None][feat] Support running heterogeneous model execution for Nemotron-H by @danielafrimi in #6866
- [https://nvbugs/5302040][feat] Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) by @wu6u3tw in #5527
- [https://nvbugs/5394685][fix] Fix the bug with spec-decoding + SWA and an accuracy issue related to 2CTA MLA by @PerkzZheng in #6834
- [https://nvbugs/5410399][chore] Unwaive mtp llmapi test by @mikeiovine in #6833
- [None][fix] max_num_sequences argument in nanobind by @Linda-Stadter in #6862
- [None][feat] Add test for speculative rejection sampler (2-model) by @IzzyPutterman in #6542
- [None][chore] fix markdown format for the deployment guide by @zhenhuaw-me in #6879
- [None][feat] Add support for Hopper MLA chunked prefill by @jmydurant in #6655
- [TRTLLM-6675][infra] Cherry-pick #6623 by @bo-nv in #6735
- [https://nvbugs/5427043][fix] Fix request length exceeding max_num_tokens by @Superjomn in #6821
- [None][fix] Add FP4 all2all unitest and fix a bug for module WideEPMoE by @StudyingShao in #6784
- [None][doc] update moe support matrix for DS R1 by @litaotju in #6883
- [None][test] Add perf-sweep scripts by @chenfeiz0326 in #6738
- [TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands by @Shixiaowei02 in #6323
- [https://nvbugs/5445466][fix] fix deepseek r1 hang by not enabling mnnvl by default by @pengbowang-nv in #6860
- [TRTLLM-6853][feat] refactor deepseekv3 model by @kris1025 in #6698
- [None][fix] Fix python-only build that uses TRTLLM_USE_PRECOMPILED by @jiaganc in #6825
- [None][infra] Waive failed cases on main 08/14 by @EmmaQiaoCh in #6902
- [TRTLLM-5966][feat] Helix: extend mapping to support different CP types by @MatthiasKohl in #6816
- [https://nvbugs/5450262][fix] Fix unsupported alltoall use case by @bobboli in #6882
- [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
- [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
- [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
- [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
- [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
- [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
- [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
- [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
- [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
- [None][fix] Fix perfect router. by @bobboli in #6797
- [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
- [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
- [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
- [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
- [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
- [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
- [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
- [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
- [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
- [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
- [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
- [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
- [None][fix] Make TP work for Triton MOE (in addition to the EP we are using) by @dongfengy in #6722
- [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
- [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
- [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
- [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
- [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
- [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
- [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
- [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
- [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
- [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
- [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
- [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
- [None][doc] Update gpt oss doc by @bobboli in #6954
- [None] [feat] Support accurate device iter time by @kaiyux in #6906
- [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
- [None] [fix] Fix the macro name by @ChristinaZ in #6983
- [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
- [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
- [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
- [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
- [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
- [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
- [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
- [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
- [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
- [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
- [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
- [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
- [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
- [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
- [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
- [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
- [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
- [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
- [None][chore] Remove closed bugs by @xinhe-nv in #6969
- [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
- [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
- [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
- [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
- [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
- [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
- [None][chore] Remove duplicate test waives by @yiqingy0 in #7044
- [None][infra] Waive failed tests on main 08/19 by @EmmaQiaoCh in #7037
- [None][feat] Use Separate QKV Input Layout for Context MLA by @zhhuang-nv in #6538
- [https://nvbugs/5444937][chore] Fixing KV events tests by @pcastonguay in #7004
- [https://nvbugs/5451296][bug] Cherry-pick #7017 from release/1.0 branch by @chzblych in #7043
- [None][fix] Accommodate Phi3/4 to work with ModelOpt's FP8 ckpts in Torch by @moraxu in #6761
- [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
- [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
- [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
- [None][fix] fix llmapi import error by @crazydemo in #7030
- [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
- [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
- [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
- [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
- [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
- [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
- [None][chore] Update namelist in blossom-ci by @karljang in #7015
- [None][ci] move unittests to sub-directories by @Funatiq in #6635
- [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
- [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
- [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
- [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
- [None][ci] move some tests of b200 to post merge by @QiJune in #7093
- [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
- [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
- [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
- [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
- [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
- [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
- [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
- [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
- [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
- [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
- [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
- [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
- [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
- [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
- [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
- [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
- [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
- [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
- [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
- [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
- [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
- [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
- [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
- [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
- [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
- [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
- [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
- [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
- [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
- [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
- [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
- [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
- [None][feat] Deepseek: Start Eagle work by @IzzyPutterman in #6210
- [None][fix] Correct KV cache percentage report out. by @FrankD412 in #7102
- [None] [feat] nsys profile output kernel classifier by @gracehonv in #7020
- [None][fix] Waive test by @Tabrizian in #7185
- [https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value by @amitz-nv in #7132
- [TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP by @dongxuy04 in #6973
- [TRTLLM-7321][doc] Refine GPT-OSS doc by @dongfengy in #7180
- [None][infra] Prepare for single GPU GB200 test pipeline by @chzblych in #7073
- [None][chore] Enable auto deploy accuracy test in CI by @ajrasane in #7179
- [None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests by @Funatiq in #6754
- [None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge by @yiqingy0 in #7074
- [TRTLLM-7096][infra] Testing cache transmission functionality in Python by @bo-nv in #7025
- [None][feat] add gpt-oss tests to sanity list by @xinhe-nv in #7158
- [None][chore] cherry-pick 6940 by @bo-nv in #7097
- [None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. by @hyukn in #7113
- [None][ci] waive test_mamba2_chunk_scan_combined_prefill_chunking[seqlens1-8] by @QiJune in #7194
- [None][test] add l20 specific qa test list by @crazydemo in #7067
- [None][fix] Fix MoE load balancer config loading by @syuoni in #7150
- [TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests by @amitz-nv in #7033
- [None][chore] remove CLI support for mamba cache dtype setting by @shaharmor98 in #7119
- [None][refactor] refactor the CUDA graph runner to manage all CUDA graphs by @QiJune in #6846
- [None][infra] Waive failed tests on main branch by @EmmaQiaoCh in #7201
- [https://nvbugs/5440241][fix] Fix 70B GSM8K Accuracy drop by @chenfeiz0326 in #6967
- [None][fix] Update to pull LLM from a central location. by @FrankD412 in #6458
- [None][chore] Refactored the handle logits pp communication by @dcampora in #7154
- [TRTLLM-7319][perf] Fuse slicing into MoE. by @bobboli in #6728
- [None][fix][AutoDeploy] canonicalize_graph before shape prop for consistent state_dict by @lucaslie in #7223
- [TRTLLM-6342][feat] TP Sharding read from the model config by @greg-kwasniewski1 in #6972
- [None][doc] update feature_combination_matrix doc by @leslie-fang25 in #6691
- [None][test] add kv cache size in bench metric and fix failed cases by @ruodil in #7160
- [None][chore] Create PyExecutor from TorchLlmArgs Part 1 by @leslie-fang25 in #7105
- [https://nvbugs/5452463][doc] update disagg doc about UCX_MAX_RNDV_RAILS by @zhengd-nv in #7205
- [None][feat] Skip prefetching consolidated safetensors when appropriate by @2ez4bz in #7013
- [None] [fix] improve kvcache allocation in PyTorch runtime by @qixiang-99 in #5933
- [None][chore] Update CI allowlist 2025-08-25 by @yuanjingx87 in #7229
- [None][test] Update qwen3 timeout to 60 minutes by @nvamyt in #7200
- [https://nvbugs/5457504][fix] fix kv cache event test in disaggregated worker tests by @zhengd-nv in #7028
- [TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server by @zhengd-nv in #6985
- [None][doc] Display tech blog for nvidia.github.io domain. by @nv-guomingz in #7241
- [https://nvbugs/5477332][fix] Relax atol in test_mamba2_chunk_scan_combined_prefill_chunking by @amitz-nv in #7215
- [None][feat] Hopper Fp8 context mla by @zhou-yuxin in #7116
- [None][infra] Add retry 3 times if ssh cluster failed by @EmmaQiaoCh in #6859
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7251
- [None][fix] Updated blog9_Deploying_GPT_OSS_on_TRTLLM by @Maurits-de-Groot in #7260
- [None][ci] move qwen3 tests from b200 to gb200 by @QiJune in #7257
- [None][perf] Accelerate global scale calculations for deepEP fp4 combine by @yilin-void in #7126
- [None][fix] Fix data type of KV Cache percentage in bench. by @FrankD412 in #7230
- [None][doc] Update autodeploy README.md, deprecate lm_eval in examples folder by @Fridah-nv in #7233
- [None][update] Update disagg code owners by @Tabrizian in #7266
- [TRTLLM-6633][feat] Padding for piecewise cudagraph by @liji-nv in #6750
- [https://nvbugs/5412456][fix] Remove from waives.txt by @zhou-yuxin in #7248
- [None][fix] Remove and fuse some element-wise ops in the ds-r1-fp8 model by @lfr-0531 in #7238
- [None][opt] Balance the request based on number of tokens in AttentionDP by @Shunkangz in #7183
- [TRTLLM-6960][fix] replace flaky scaled_mm test with more stable config by @dc3671 in #7089
- [None][feat] Add logging for OAI disagg server by @Tabrizian in #7232
- [TRTLLM-7457][ci] Update & cleanup unittest parallel config by @tongyuantongyu in #7254
- [None][chore] update disagg readme and scripts for pipeline parallelism by @raayandhar in #6875
- [None][chore] Wrap the swiglu into custom op to avoid redundant device copy. by @hyukn in #7021
- [None][fix] Fix possible hang issue in WideEP and move some tests to pre-merge by @dongxuy04 in #7262
- [None][ci] remove test_llm_api_autodeploy from B200 test db by @QiJune in #7282
- [https://nvbugs/5453727][fix] Fix bug of how GPT-OSS setup the parameters in CI by @byshiue in #7151
- [None][fix] Update maxnt of llama_v3.2_1b bench by @nvamyt in #7279
- [None][refactor] Move draft token padding out of Drafter by @mikeiovine in #7134
- [TRTLLM-7250][fix] waive failed cases by @xinhe-nv in #7292
- [None][infra] Waive failed tests on main 08/27 by @EmmaQiaoCh in #7300
- [None][ci] parallelize unit tests of auto deploy in B200 by @QiJune in #7291
- [https://nvbugs/5458798][fix] AD perf test outliers handling, tightened threshold, re-enabled in CI, fixed mem threshold by @MrGeva in #7189
- [https://nvbugs/5453727][fix] unwaive qwen3 CI tests by @byshiue in #7293
- [None][fix] Remove the wheel from intermediate docker storage by @MartinMarciniszyn in #7175
- [None] [chore] Make disagg example compatible with recommended usage by @kaiyux in #7121
- [TRTLLM-6822][infra] Add PR-Checklist github action and modify PR template by @venkywonka in #6029
- [TRTLLM-7207][feat] Chat completions API for gpt-oss by @LinPoly in #7261
- [None][ci] fix test list name by @QiJune in #7321
- [None][fix] Disable mandatory PR checklist enforcement by @venkywonka in #7325
- [https://nvbugs/5430124][ci] Unwaive Mistral 3.1 Small tests by @2ez4bz in #7274
- [None][ci] skip TestGPTOSS by @QiJune in #7333
- [TRTLLM-6876][feat] Add low precision all2all for mnnvl by @zongfeijing in #7155
- [None] [feat] Use numa to bind CPU by @kaiyux in #7304
- [https://nvbugs/5474453][fix] fix path to tested model by @nzmora-nvidia in #7272
- [None][doc] add adp balance blog by @yunruis in #7213
- [None][infra] Waive failed tests on main branch 08/26 by @EmmaQiaoCh in #7346
- [None][fix] mxfp4 padding bug for TRT-LLM and CUTLASS MoE backends by @nekorobov in #7214
- [None][chore] Some improvements for CI stability by @chzblych in #7199
- [None][feat] Refactor llama4 for multimodal encoder IFB by @dongfengy in #6844
- [https://nvbugs/5445466][fix] Bypass MLP TP split for MNNVL in DeepSeek V3 to avoid hanging. by @timlee0212 in #6886
- [TRTLLM-7457][ci] Update unittest parallel config by @tongyuantongyu in #7297
- [None][perf] Disable Swap AB when num tokens exceeds N dimension by @djns99 in #7104
- [TRTLLM-6646][test] NIM migration to TRT-LLM LLMAPI : Add QWQ-32b torch test by @aalanwyr in #7284
- [None][feat] KV Cache Connector API by @richardhuo-nv in #7228
- [None] [chore] Update .coderabbit.yaml review configuration by @venkywonka in #7351
- [https://nvbugs/5445466][fix] Eliminate race when loading HF dynamic modules by @chang-l in #7268
- [TRTLLM-7280][test] Add beam search CudaGraph + Overlap Scheduler tests by @fredricz-20070104 in #7326
- [None][fix] fix doc formula by @yunruis in #7367
- [https://nvbugs/5481385][fix] Fix max_seq_len in cuda graph warmup and intermediate_size in fused_moe_deepgemm by @lfr-0531 in #7345
- [None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss by @pengbowang-nv in #7192
- [None][infra] Waive failed tests on main branch 08/29 by @EmmaQiaoCh in #7370
- [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
- [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
- [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
- [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
- [None][chore] Bump version to 1.1.0rc3 by @yiqingy0 in #7394
- [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local reduction by @zongfeijing in #7369
- [None][feat] Support NVFP4 KV Cache by @Tom-Zheng in #6244
- [None][ci] Some improvements for Slurm CI setup by @chzblych in #7407
- [None][chore] Mass integration of release/1.0 - 2nd by @dominicshanshan in #7171
- [None][test] Update case that not support passing quantization fp8 for pytorch backend by @nvamyt in #7302
- [None][infra] Disable GB200-PyTorch-1 due to OOM issue by @yuanjingx87 in #7386
- [https://nvbugs/5481087][fix] fix bug of ci when we use mocker by @byshiue in #7332
- [None][infra] Waive failed case on main 0901 by @EmmaQiaoCh in #7447
- [TRTLLM-7353][feat] Implement capturable drafting loops for speculation by @mikeiovine in #7100
- [None] [doc] Update DeepSeek example doc by @jiahanc in #7358
- [None][fix] Fix nanobind failure by @Tom-Zheng in #7425
- [None][chore] Use llm args in create_py_executor by @leslie-fang25 in #7239
- [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
- [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
- [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
- [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
- [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
- [None][doc] fix example in docstring by @tomeras91 in #7410
- [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
- [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
- [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
- [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
- [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
- [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
- [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
- [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
- [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
- [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
- [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
- [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
- [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
- [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
- [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
- [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
- [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
- [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
- [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
- [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
- [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
- [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
- [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
- [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
- [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
- [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
- [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
- [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
- [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
- [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
- [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
- [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
- [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
- [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
- [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
- [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
- [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
- [None][test] update nim and full test list by @crazydemo in #7468
- [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
- [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
- [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
- [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
- [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
- [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
- [None][feat] Add Request specific exception by @Shunkangz in #6931
- [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
- [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
- [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
- [None][chore] Remove closed bugs by @xinhe-nv in #7408
- [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
- [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
- [None][infra] update nspect version by @niukuo in #7552
- [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in #7476
- [#6186][feat] Introduce QKNormRoPEAttention module by @Funatiq in #6830
- [None][chore] Remove executor_config in create_py_executor_instance by @leslie-fang25 in #7463
- [None][infra] Waive failed tests on main branch 0905 by @EmmaQiaoCh in #7564
- [https://nvbugs/5453806][unwaive] Unwaive fp8 kvcache attention test by @peaceh-nv in #7243
- [#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example by @lucaslie in #7221
- [None][ci] Revert "[https://nvbugs/5461761][fix] Remove the waiver (#7476)" by @QiJune in #7584
- [None][ci] move some test cases of DGX H100 to post merge by @QiJune in #7569
- [None][ci] Improve SSH connection stability by @chzblych in #7567
- [None][ci] Waive qwen3 test for accuracy bug in https://nvbugs/5505402 by @dominicshanshan in #7585
- [None][fix] DeepSeek-R1 W4A8 weight loading issue; fixes regression from #6200 by @rosenrodt in #7123
- [None][chore] share input_ids buffers among different cuda graphs by @QiJune in #7236
- [TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse by @chang-l in #7106
- [TRTLLM-4629] [feat] Step1: trtllm-gen kernels support sm103 by @VALLIS-NERIA in #7570
- [TRTLLM-7440][fix] Split fused_input_embed to separate out host sync by @chang-l in #7280
- [https://nvbugs/5502352][fix] Fix 2-model CDL path by @mikeiovine in #7543
- [TRTLLM-5950][infra] Removing remaining turtle keywords from the code base by @EmmaQiaoCh in #7086
- [https://nvbugs/5448767][fix] sync termination of requests across PP ranks by @raayandhar in #7455
- [None][infra] Skip RTX Pro 6000 test stages due to HW are offline by @EmmaQiaoCh in #7592
- [TRTLLM-7153] [feat] Move stop_criteria to sample_async by @netanel-haber in #7041
- [None][ci] Block some nodes to avoid unstable network access by @chzblych in #7593
- [None][fix] fixing the math on asymmetric tp+pp tests by @raayandhar in #7098
- [TRTLLM-7187][fix] Build wheel with NIXL by @BatshevaBlack in #7472
- [None][chore] expose tokens_per_block into KvCacheConfig by @Superjomn in #5911
- [None][docs] refine docs for accuracy evaluation of gpt-oss models by @binghanc in #7252
- [TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API by @JunyiXu-nv in #7508
- [None][chore] Mass integration of release/1.0 - 3rd by @dominicshanshan in #7519
- [https://nvbugs/5506683][fix] adjust the CI by @byshiue in #7604
- [None][infra] Add back rtx-pro-6000 stages since the node is available by @EmmaQiaoCh in #7601
- [None][feat] Update multimodal utility get_num_tokens_per_image for better generalization by @chang-l in #7544
- [TRTLLM-6142][feat] Reland: set torch recompile_limit based on cuda_graph_batch_sizes and refactored by @MrGeva in #7219
- [None][chore] remove executor config in instantiate sampler by @leslie-fang25 in #7516
- [TRTLLM-7361][feat] KV cache transfer for uneven pp by @chuangz0 in #7117
- [None][infra] Try to fix docker container failed to be killed issue by @yuanjingx87 in #7388
- [None][fix] Add try-catch in stream generator by @zhanghaotong in #7467
- [https://nvbugs/5481080][fix] Fix GPTOSS W4A16 reference by @dongfengy in #7323
- [None][test] Skip eagle3 test by @Tabrizian in #7627
- [https://nvbugs/5453709][fix] Remove transformers version limit in Qwen2VL by @Wanli-Jiang in #7152
- [TRTLLM-5877][infra] Add fmha tests and auto trigger rules by @yiqingy0 in #6050
- [None][chore] Mass integration of release/1.0 - 4th (release/1.0 doc change mainly) by @dominicshanshan in #7607
- [None][feat] Nixl support for GDS by @tshmilnvidia in #5488
- [TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to dated by @ZhanruiSunCh in #5980
- [#6529][feat] CMake option to link statically with cublas/curand by @WilliamTambellini in #7178
- [None][feat] Extend VLM factory and add Mistral3 factory by @2ez4bz in #7583
- [None][fix] add the missing import raised by #7607 by @nv-guomingz in #7639
- [None][chore] Remove closed bugs by @xinhe-nv in #7591
- [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
- [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
- [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
- [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
- [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
- [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
- [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
- [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
- [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
- [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
- [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
- [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
- [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
- [None][fix] UCX ZMQ IP support for IPv6 by @chuangz0 in #7530
- [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
- [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
- [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
- [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
- [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
- [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
- [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/7672
- [None][chore] remove executor config in kv cache creator by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/7526
- [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in https://github.com/NVIDIA/TensorRT-LLM/pull/7664
- [None][feat] Use a shell context to install dependencies by @v-shobhit in https://github.com/NVIDIA/TensorRT-LLM/pull/7383
- [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/7616
- [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7676
- [None][infra] Adjust labeling llm prompt for bug issues by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/7385
- [None][ci] move some test cases from l40s to a30 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7684
- [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in https://github.com/NVIDIA/TensorRT-LLM/pull/7133
- [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7656
- [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7679
- [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/6097
- [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6856
- [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/6742
- [None][ci] Some improvements for Slurm CI by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7689
- [None][ci] Test waives for the main branch 09/14 by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7698
- [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7612
- [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/6843
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7682
- [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7602
- [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7365
- [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7122
- [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7699
- [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7563
- [None][test] add test for min_tokens by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/7678
- [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7722
- [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7553
- [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in https://github.com/NVIDIA/TensorRT-LLM/pull/7477
- [None][ci] Test waives for the main branch 09/15 by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7709
- [None][feat] Eagle, use last hidden post norm by @IzzyPutterman in https://github.com/NVIDIA/TensorRT-LLM/pull/7546
- [None][infra] AutoDeploy: codeowners for autodeploy unit tests by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/7743
- [TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7651
- [None][ci] move qwen3 tests from GB200 to B200 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7733
- [None][feat] support attention dp for qwen3 dense model by @Nekofish-L in https://github.com/NVIDIA/TensorRT-LLM/pull/7618
- [None][doc] Fix the link in the doc by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/7713
- [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/7568
- [TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7739
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7735
- [None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks by @yilin-void in https://github.com/NVIDIA/TensorRT-LLM/pull/7614
- [None][chore] Fix error when running trtllm-bench without cuda graph. by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/7725
- [None][doc] Clean the doc folder and move the outdated docs into lega… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7729
- [TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op by @limin2021 in https://github.com/NVIDIA/TensorRT-LLM/pull/7632
- [None] [chore] cherry pick changes on slurm scripts from release/1.1.0rc2 by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/7750
- [https://nvbugs/5503529][fix] Change test_llmapi_example_multilora to get adapters path from cmd line to avoid downloading from HF by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7740
- [TRTLLM-7070][feat] add gpt-oss serve benchmark tests by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7638
- [None][fix] waive hang tests on main by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7720
- [https://nvbugs/5471106][fix] Remove the waivers by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7711
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7746
- Revert "[None][feat] support attention dp for qwen3 dense model" by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/7765
- [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7659
- [None][chore] AutoDeploy: neat disablement of transforms in pipeline by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/7736
- [None][chore] Remove unused get_quant_scales methods by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/7687
- [None][infra] add nspect allow list for false positive secrets by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/5797
- [TRTLLM-7398][doc] Add doc for KV cache salting support by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7772
- [None][infra] Update CI allowlist 2025-09-16 by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/7773
- [None][infra] Add nightly pipeline to generate lock files by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/5798
- [https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7737
- [None][waive] Waive tests by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7775
- [https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. by @timlee0212 in https://github.com/NVIDIA/TensorRT-LLM/pull/7387
- [https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7708
- [TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/7571
- [None][chore] AutoDeploy: clean up of model unit test configuration by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/7742
- [None][ci] waive test_llm_gemma_1gpu_summary_vswa by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7781
- [https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/7758
- [None][feat] add an example of KV cache host offloading by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7767
- [https://nvbugs/5485325][fix] Cherry-pick #7373: fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in https://github.com/NVIDIA/TensorRT-LLM/pull/7734
- [None][ci] waive test_llama_eagle3[True-FLASHINFER-False-False-False-False-True] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7788
- [None][chore] Remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7697
- [None][test] add gpt oss model for trtllm perf test by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/7328
- [TRTLLM-7250][fix] waive block tests by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7782
- [None][doc] fix section header of llm_kv_cache_offloading example by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7795
- [TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/7628
- [None][infra] Waive failed tests on main 09/17 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7812
- [None][doc] Update Documentation link to point to docs instead of docs source code by @asrivas in https://github.com/NVIDIA/TensorRT-LLM/pull/6495
- [TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen by @MatthiasKohl in https://github.com/NVIDIA/TensorRT-LLM/pull/6865
- [https://nvbugs/5516661][fix] Drop waive case 5516661 by @yunruis in https://github.com/NVIDIA/TensorRT-LLM/pull/7791
- [https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) by @netanel-haber in https://github.com/NVIDIA/TensorRT-LLM/pull/7796
- [#7308] [feat] AutoDeploy: graph-less transformers mode for HF by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/7635
- [None][ci] restore unwaive list by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7802
- [None][fix] Make tile_tokens_dim calculation just in time before kernel launching. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7529
- [None][chore] Version bump for 1.1.0rc6 by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7824
- [https://nvbugs/5519544][fix] fix invalid expression for disabling pa… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7806
- [TRTLLM-8070][test] add generation logits case for llama3 by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7759
- [https://nvbugs/5523080][fix] Correct the batch index in device tensors by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7803
- [None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 by @Barry-Delaney in https://github.com/NVIDIA/TensorRT-LLM/pull/7716
- [None][fix] Fix CI issue for dsl pkg install by @limin2021 in https://github.com/NVIDIA/TensorRT-LLM/pull/7784
- [https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/7771
- [None][infra] update ci allow list 2025/09/17 by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/7816
- [None][chore] Remove executor config in create_py_executor by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/7599
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7801
- [https://nvbugs/5519530][fix] Fix gptoss 2-gpu test by @dongfengy in https://github.com/NVIDIA/TensorRT-LLM/pull/7819
- [TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7207
- [None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod by @stnie in https://github.com/NVIDIA/TensorRT-LLM/pull/7732
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7807
- [TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7723
- [https://nvbugs/5519462][fix] skip deepseek test on preHopper by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7817
- [None][chore] remove generated fmha_cubin.h from source tree by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7836
- [None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/7780
- [TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm by @limin2021 in https://github.com/NVIDIA/TensorRT-LLM/pull/7764
- [None][doc] Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch by @dongfengy in https://github.com/NVIDIA/TensorRT-LLM/pull/7774
- [TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/7001
- [None][ci] set TORCHINDUCTOR_COMPILE_THREADS correctly by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7800
- [https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7790
- [TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7610
- [TRTLLM-6286] [feat] Update CUTLASS to 4.2 and enable SM103 group gemm by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/7832
- [None][fix] get Local IP by connect remote by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/7719
- [TRTLLM-7183][test] Feature fix model issue for disagg serving by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/7785
- [https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7747
- [None][test] add deepseek r1/v3 model with chunked prefill cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/7124
- [None][chore] polish error message in cute_dsl_utils.py by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7852
- [None][fix] fix load_model_on_cpu on qwen/convert_checkpoint.py by @lkm2835 in https://github.com/NVIDIA/TensorRT-LLM/pull/2382
- [None][infra] Waive failed tests in post-merge by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7859
- [None][ci] Waive llama3 auto dtype test bug in https://nvbugs/5527956. by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/7853
- [None][test] Add accuracy benchmark in stress test by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7561
- [None][chore] remove cli cases for rtx6k by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/7833
- [None][feat] Support EPLB in Qwen3 MoE by @lucifer1004 in https://github.com/NVIDIA/TensorRT-LLM/pull/7443
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7841
- [https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/7646
- [None][doc] Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7864
- [https://nvbugs/5522332][fix] Pin numpy version for Gemma. (cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/7783) by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7797
- [TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels by @MatthiasKohl in https://github.com/NVIDIA/TensorRT-LLM/pull/6904
- [https://nvbugs/5471108][chore] Unwaiving disagg acc test by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/7686
- [https://nvbugs/5522462][fix] Fix FP8 scout illegal memory access by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/7845
- [#7704][chore] Enable MathJax to fix formulas in documentation by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/7744
- [TRTLLM-6342][feat] Support for partial sharding from factory by @greg-kwasniewski1 in https://github.com/NVIDIA/TensorRT-LLM/pull/7393
- [https://nvbugs/5520490][fix] Fix intermittent test failures by avoiding external web data pulls by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7879
- [None][doc] Update tech blog12 by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7884
- [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7624
- [TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reuse by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7840
- [https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7728
- [TRTLLM-6286] [perf] Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/7757
- [TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist by @dongxuy04 in https://github.com/NVIDIA/TensorRT-LLM/pull/7727
- [None][fix] Disable torch.compile for CapturableGuidedDecoder by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7871
- [None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in https://github.com/NVIDIA/TensorRT-LLM/pull/7854
- [https://nvbugs/5512556][unwaive] Unwaive DeepSeek PP tests by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7828
- [https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow by @stnie in https://github.com/NVIDIA/TensorRT-LLM/pull/7808
- [None][fix] Fix DeepGEMM commit by @Barry-Delaney in https://github.com/NVIDIA/TensorRT-LLM/pull/7875
- [None][chore] Mass integration of release/1.0 - 5th by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/7640
- [TRTLLM-7070][feat] add gpt-oss chunked prefill tests by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7779
- [None][infra] Waive a failed case on main by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7901
- [TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7766
- [https://nvbugs/5525849][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7855
- [TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/7250
- [None][infra] Skip failed test for nvbugs 5532023 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7905
- [https://nvbugs/5351244][fix] CHERRY-PICK test_mpi_session (#7501) by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7900
- [None][chore] Upgrade transformers to 4.56.0 by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7523
- [https://nvbugs/5477359][fix] Removing test waivers by @Linda-Stadter in https://github.com/NVIDIA/TensorRT-LLM/pull/7877
- [https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7714
- [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend by @ChristinaZ in https://github.com/NVIDIA/TensorRT-LLM/pull/6794
- [https://nvbugs/5504086][fix] Fix MTP vanilla by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7904
- [TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick by @xxi-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7712
- [TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7893
- [https://nvbugs/5522847][fix] Disable GC on disagg server and client by @yuantailing in https://github.com/NVIDIA/TensorRT-LLM/pull/7858
- [None][feat] Add Tencent HunYuanDenseV1 model support by @sorenwu in https://github.com/NVIDIA/TensorRT-LLM/pull/7081
- [TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/7577
- [None][opt] Add batch waiting when scheduling by @yunruis in https://github.com/NVIDIA/TensorRT-LLM/pull/7416
- [https://nvbugs/5355128][fix] Add missing wgmma intrinsic for starcoder by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7643
- [None][fix] Read eos_token_id from generation_config for kimi_k2 by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7120
- [None][fix] Fix and add test for TRTLLM MoE backend by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7755
- [None][test] rename llm_perf_full to llm_perf_core and add missing cases by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/7899
- [None][fix] CHERRY-PICK trtllm-serve yaml loading (#7551) by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7897
- [https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7906
- [None][fix] fix a bug with trtllm-gen kernels + attention sinks by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7919
- [https://nvbugs/5532023][fix] executor with-statement bug by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/7895
- [None][fix] Re-add the import for allgather that was mistakenly removed. by @ChristinaZ in https://github.com/NVIDIA/TensorRT-LLM/pull/7920
- [None][chore] Update benchmark script by @zerollzeng in https://github.com/NVIDIA/TensorRT-LLM/pull/7860
- [None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off by @zheyuf in https://github.com/NVIDIA/TensorRT-LLM/pull/7511
- [None][test] Waive another intermittent OOM test by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7930
- [None][feat] Use list instead of torch tensor for new tokens in update requests by @dcampora in https://github.com/NVIDIA/TensorRT-LLM/pull/7730
- [None][feat] Enable gpt oss on DGX H100. by @Tracin in https://github.com/NVIDIA/TensorRT-LLM/pull/6775
- [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve (cherry-pick) by @nv-yilinf in https://github.com/NVIDIA/TensorRT-LLM/pull/7776
- [TRTLLM-6549][fix] add kv cache time output back by @zhengd-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7798
- [None][feat] support JIT mha.cu for SPEC_DEC in runtime by @jhaotingc in https://github.com/NVIDIA/TensorRT-LLM/pull/6078
- [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/7294
- [TRTLLM-7182][test] add multi-nodes test for disagg-serving by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/7470
- [TRTLLM-7015] [feat] Enable `prompt_logprobs` in pytorch backend by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/7580
- [https://nvbugs/5528405][fix] Set up draft_tokens before scheduling by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7903
- [https://nvbugs/5477404][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/7857
- [None][fix] refine `backend` option handling for commands by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/7829
- [#7692][fix] recognize RequestError as per-request error in background handler by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/7726
- [None][chore] Make sampler type beta. by @dcampora in https://github.com/NVIDIA/TensorRT-LLM/pull/7934
- [TRTLLM-6341][feature] Support SWA KV cache by @eopXD in https://github.com/NVIDIA/TensorRT-LLM/pull/6768
- [https://nvbugs/5532225] [fix] MoE use stream-dependent workspace by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/7940
- [None][infra] Skip failed test for nvbugs 5537738 by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7946
- [None][chore] remove cubins for ci cases by @qsang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7902
- [None][chore] update chunked prefill cases by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7921
- [None][feat] Return topk logprobs in torch backend by @dcaox in https://github.com/NVIDIA/TensorRT-LLM/pull/7756
- [None][ci] optimize test cases of dgx b200 by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7948
- [None][chore] Recover cutlass-dsl pkg install and dsl op testing. by @limin2021 in https://github.com/NVIDIA/TensorRT-LLM/pull/7945
- [https://nvbugs/5521799][fix] Trim incorrectly generated harmony messages by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7849
- [https://nvbugs/5532248][fix] Fix fused_moe OOM by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7931
- [None][test] Update llm_models_root to improve path handling on BareMetal environment by @yufeiwu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7876
- [None][ci] remove duplicate test cases by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7956
- [None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7944
- [TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7925
- [None][feat] add model seed-oss by @Nekofish-L in https://github.com/NVIDIA/TensorRT-LLM/pull/7496
- [None][ci] Waive some intermittent failures by @HuiGao-NV in https://github.com/NVIDIA/TensorRT-LLM/pull/7955
- [None][fix] trtllm-gen cubins compiled with wrong arch. by @PerkzZheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7953
- [None][chore] cleanup build script by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/7865
- [#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) by @MrGeva in https://github.com/NVIDIA/TensorRT-LLM/pull/7888
- [None][fix] fix get_iteration_stats IndexError by @macrocell in https://github.com/NVIDIA/TensorRT-LLM/pull/7216
- [None][fix] Fix dummy load format for DeepSeek. by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/7874
- [TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 by @pamelap-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/7662
- [https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 by @mikeiovine in https://github.com/NVIDIA/TensorRT-LLM/pull/7220
- [None][bug] Fix transformers version for Triton backend by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7964
- [OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels by @sychen52 in https://github.com/NVIDIA/TensorRT-LLM/pull/7821
- [None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7969
- [None][chore] Validate features combination by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/7630
- [https://nvbugs/5456485][bug] unwaive triton test by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/7966
- [None][feat] DeepEP LL fp8 dispatch/combine by @yilin-void in https://github.com/NVIDIA/TensorRT-LLM/pull/7927
- [None][chore] Update trtllm-bench documentation on setting FP8 KV cache by @achartier in https://github.com/NVIDIA/TensorRT-LLM/pull/7885
- [None][chore] Update the CUDA and TensorRT versions in homepage icons by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/7963
- [TRTLLM-6541][test] Add NIM perf test cases by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/7924
- [None][doc] scaffolding tech blog part one by @WeiHaocheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7835
- [TRTLLM-7758][feat] Optimize phi4-mm image modality inference by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/7918
- [None][infra] Unwaive some tests since dev already have a PR to collect more info by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/7984
- [None][perf] Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/7419
- [https://nvbugs/5536141][fix] fix_disagg_single_gpu_test by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/7990
- [https://nvbugs/4955671][fix] update test list by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7980
- [None][chore] Mass integration of release/1.0 - 6th by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/7928
- [None][chore] Remove developer name in comment by @eopXD in https://github.com/NVIDIA/TensorRT-LLM/pull/7981
- [None][chore] relax version constraints on fastapi by @PeganovAnton in https://github.com/NVIDIA/TensorRT-LLM/pull/7935
- [TRTLLM-5966][feat] Helix: add alltoall op by @MatthiasKohl in https://github.com/NVIDIA/TensorRT-LLM/pull/6815
- [None][fix] fix a bug in wideEp use DeepEP with num_chunks > 1 by @xxi-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7954
- [None][doc] Add acknowledgements in scaffolding tech blog by @WeiHaocheng in https://github.com/NVIDIA/TensorRT-LLM/pull/7983
- [None][infra] Waive failed tests on main 09/25 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8001
- [TRTLLM-8533][chore] extract weights loading related logic to model loader by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/7579
- [https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS by @dongfengy in https://github.com/NVIDIA/TensorRT-LLM/pull/7911
- [None][chore] Some clean-ups for CUDA 13.0 dependencies by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/7979
- [TRTLLM-7999][infra] Add B300/GB300 single gpu test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/7951
- [None][infra] Improve the failure message for accuracy test suite by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/7994
- [#6102][fix] support non-system python installation by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/7763
- [None][ci] Waive test_mm_encoder_standalone.py::test_multi_request_batch_chat[llava-v1.6-mistral-7b-hf] by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/8010
- [None][feat] Optimize kv cache transfer TEP by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/7613
- [TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/7363
- [None][chore] Bump version to 1.1.0 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/7942
- [None][doc] Refine perf overview.md and correct the error link in per… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/8036
- [None][fix] Fix chunked prefill state of draft request by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/8067
- [https://nvbugs/5548098][fix] Fix flakey unit test for dynamic spec decode by @zheyuf in https://github.com/NVIDIA/TensorRT-LLM/pull/8078
- [None][ci] Waive failing tests on release/1.1 by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8088
- [https://nvbugs/5451280][fix] Reduce memory fraction problem by warmu… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/7999
- [https://nvbugs/5541494] [fix] Fix missing sm100f/103a kernels and add tests by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8098
- [https://nvbugs/5550283][fix] update to the latest MoE API by @xxi-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8169
- [https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7960
- [None][chore] Waive tests failing on release/1.1 post merge by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8185
- [https://nvbugs/5550283][fix] update test case to call post quantization explicitly due to code refactor by @xxi-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8188
- [None][fix] cherry-pick !8217 pin flashinfer-python version (#8217) by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8252
- [https://nvbugs/5538098][fix] Checking connection to etcd server in unit test by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/8269
- [https://nvbugs/5532023][fix] unwaive GenerationExecutor tests by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8251
- [https://nvbugs/5565590][fix] test_request_perf_metrics_draft by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8257
- [None][infra] Remove WAR code for GH200 node by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8267
- [None][infra] Update and waive failed tests for release branch by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8291
- [None][chore] Waive test failing on pre-merge CI by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8295
- [None][chore] Update constraint for release by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8211
- [TRTLLM-8246][test] add multimodal kvcache+chunked_prefil cases in to QA test list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8212
- [https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting by @lancelly in https://github.com/NVIDIA/TensorRT-LLM/pull/8268
- [None][chore] Update test configs for release by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8224
- [https://nvbugs/5547434][fix] Fix Qwen2.5-VL device_path error by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/8057
- [https://nvbugs/5550722][fix] Fix image load by @yechank-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/8093
- [https://nvbugs/5532789] [doc] Add documents about CUDA 12.9 by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8192
- [https://nvbugs/5546202][fix] Fix concurrent bug for NIXL cache transceiver by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8147
- [https://nvbugs/5563653][infra] reduce docker image layers by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8250
- [https://nvbugs/5568951][fix] Fix guided decoding disagg tests by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/8311
- [https://nvbugs/5534837][fix] Fix KV cache split on long context by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/8247
- [https://nvbugs/5565530][fix] Unwaive test by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/8273
- [https://nvbugs/5470769][chore] unwaive test for PR7338 by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/8258
- [None][infra] cherry pick numexpr fix to release/1.1 by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/8333
- [https://nvbugs/5543770][fix] Update to Cutlass v4.2.1 by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8055
- [https://nvbugs/5465642][fix] Increase server timeout to wait for weight loading by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8297
- [https://nvbugs/5550671][fix] fix disagg-serving multinodes test failure by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/8307
- [https://nvbugs/5537878][fix] Reserve an extra slot for padded batch … by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8231
- [https://nvbugs/5574556][fix] fix Qwen3_235B_A22B::test_fp8 CI failure by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/8351
- [https://nvbugs/5565541][fix] Add timeout threshold for H100 FHMA test by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8354
- [https://nvbugs/5537348][fix] Use device tensor index for MTP by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8062
- [https://nvbugs/5565565] [fix] fp8 wideep support sm103 by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8228
- [TRTLLM-8113][test] Add pytorch workflow e2e tests with pp enabled by @StanleySun639 in https://github.com/NVIDIA/TensorRT-LLM/pull/8357
- [None][infra] Waive failed tests in release post-merge 10/15 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8386
- [None][chore] Update nim test list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8356
- [https://nvbugs/5521949][fix] Update FP8 model with BF16 LoRA test, fix test_bielik_11b_v2_2_instruct_multi_lora by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8324
- [https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 by @amitz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8313
- [https://nvbugs/5552889][fix] fix: Prevent empty batch when using attention DP with disagg by @pcastonguay in https://github.com/NVIDIA/TensorRT-LLM/pull/8372
- [https://nvbugs/5545522][fix] move PREEXIT in UB kernels to fix accuracy issue by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/8318
- [https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture (#8… by @ziyixiong-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8344
- [TRTLLM-8129][feat] Allreduce tuning and benchmark script revising by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/7870
- [None][test] cherry-pick: add test-model-suites in integration conftest.py by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/8388
- [None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/8409
- [https://nvbugs/5437384][test] fix trtllm-llmapi-launch multi tests with single launch by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8397
- [None][chore] Remove duplicate log outputs in test_perf.py by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/8418
- [https://nvbugs/5565565] [fix] Remove waiver by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8450
- [https://nvbugs/5524714][fix] Fix TP sharding of fused-QKV weight scales in W4A16 AWQ by @danielafrimi in https://github.com/NVIDIA/TensorRT-LLM/pull/8432
- [TRTLLM-8580][test] save runtime report periodically (#8312) by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8455
- [https://nvbugs/5516666][fix] cherry-pick PR 8130 to unwaive the Qwen3 CI by @byshiue in https://github.com/NVIDIA/TensorRT-LLM/pull/8444
- [https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption (#7992) by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8414
- [https://nvbugs/5569081][fix] Upgrade fmha_v2. (cherry-pick from https://github.com/NVIDIA/TensorRT-LLM/pull/8364) by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/8499
- [None][infra] Waive tests for release 1021 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8522
- [TRTLLM-8650][fix] beam search request validation by @ixlmar in https://github.com/NVIDIA/TensorRT-LLM/pull/8433
- [https://nvbugs/5569713][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8429
- [https://nvbugs/5515753][ci] Add NCCL_DEBUG=INFO flag to collect more… by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8440
- [https://nvbugs/5504095][fix] Unwaive test_user_specify_workspace case. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/8316
- [https://nvbugs/5546510][fix] Move torch.cuda.Stream out of torch com… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8494
- [https://nvbugs/5565549][fix] unwaive test_disaggregated_spec_dec_bat… by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8500
- [None][infra] Waive failed tests for release 10/22 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8574
- [https://nvbugs/5575829][fix] Unwaive gpt-oss test by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/8576
- [https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend (cherry-pick #8141) by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/8566
- [https://nvbugs/5569754][fix] trtllm llmapi launch port conflict by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8582
- [https://nvbugs/5582277][fix] rework DisaggPPTerminationHandler to fix hang issue by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/8519
- [https://nvbugs/5568961][fix] Fix a merge conflict (cherrypick from PR 8365) by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/8553
- [https://nvbugs/5549081][fix] Fix device id assignment for some visio… by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/8552
- [TRTLLM-8785][fix] create output_dir before test begin (cherry-pick #8518) by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8575
- [None][infra] Disable rtxpro6000 stages because nodes will be offline temporarily by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8616
- [https://nvbugs/5575902][fix] set max_batch_size=1 to stabilize accuracy test result by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/8609
- [https://nvbugs/5576192][fix] Unwaive the test for test_weight_only_quant_gemm. by @zheyuf in https://github.com/NVIDIA/TensorRT-LLM/pull/8546
- [https://nvbugs/5608461][fix] exclude InductorSubproc from thread leak check by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/8624
- [https://nvbugs/5541145][fix] Remove DeepSeekR1 test case from H20 to prevent OOM by @jieli-matrix in https://github.com/NVIDIA/TensorRT-LLM/pull/8610
- [None][chore] Disable GB300 stages in release branch because nodes will be offline temporarily by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8645
- [https://nvbugs/5587456][fix] Remove multimodal test cases using TRT backend by @jieli-matrix in https://github.com/NVIDIA/TensorRT-LLM/pull/8611
- [None][test] Clean cache for certain easily hang cases by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8619
- [None][infra] Waive failed tests for release 10/24 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8656
- [None][docs] Update Python wheel's short-/long-descriptions by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/8485
- [https://nvbugs/5597647][fix] Fix MNNVL Allreduce accuracy issue on Hopper by @timlee0212 in https://github.com/NVIDIA/TensorRT-LLM/pull/8612
- [https://nvbugs/5608489][fix] Fix output unpack issues for Llama3/4 NVFP4 models. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/8679
- [https://nvbugs/5572320][fix] Ported test_ad_trtllm_bench.py from main by @MrGeva in https://github.com/NVIDIA/TensorRT-LLM/pull/8671
- [https://nvbugs/5564465][fix] Overwrite only if default_max_tokens is legal by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/8538
- [https://nvbugs/5578175][fix] Fix block range index by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8470
- [https://nvbugs/5601203] [fix] Restrict fp8 blockscale moe case by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8583
- [None][fix] add readme copy to wheel stage to avoid setup.py failure (cherry-pick #8736) by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8754
- [https://nvbugs/5556020][fix] cherry-pick fix test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/8644
- [https://nvbugs/5580099][fix] Separate cuda graph workspace to prevent IMA by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8685
- [None][infra] Waive failed tests for release branch 10/29 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8760
- [https://nvbugs/5422621][fix] fix EPLB init hang (cherry-pick #8649) by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/8727
- [https://nvbugs/5569534][fix] Warm up with different sizes for more s… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8515
- [https://nvbugs/5575841] [test] Move test_moe.py to serial tests to improve stability + unwaive FP4 MoE torch unit tests by @DomBrown in https://github.com/NVIDIA/TensorRT-LLM/pull/8422
- [TRTLLM-8971][infra] Cherry-pick for Update gpu key for B300/GB300 (#8724) by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8796
- [https://nvbugs/5488118][fix] Unwaive passed tests by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8758
- [https://nvbugs/5623960][fix] Compress the warning log of AutoTuner when encountering tactic failures. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/8795
- [None][infra] Skip failed tests for release branch by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8833
- [None][infra] Remove invalid waived tests which are not in the release branch by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8841
- [https://nvbugs/5606166][fix] AutoDeploy: use tuples for cudagraph shape lookup by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/8772
- [https://nvbugs/5325296][fix] Enable relaxed acceptance test on Blackwell by @Barry-Delaney in https://github.com/NVIDIA/TensorRT-LLM/pull/8709
- [None][fix] WAR for tensorrt depending on the archived nvidia-cuda-runtime-cu13 package by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/8858
- [https://nvbugs/5474119][fix] Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/8809 by @dongfengy in https://github.com/NVIDIA/TensorRT-LLM/pull/8847
- [https://nvbugs/5444687][fix] Cherrypick online EPLB CI fix from main to release 1.1 by @dongxuy04 in https://github.com/NVIDIA/TensorRT-LLM/pull/8854
- [TRTLLM-8658][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8621
- [https://nvbugs/5606266][fix] Unwaive some test by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8867
- [https://nvbugs/5606268][fix] Fix program exit segmentation fault triggered by CublasMMWarpper destructor by @yunruis in https://github.com/NVIDIA/TensorRT-LLM/pull/8834
- [https://nvbugs/5608930][fix] Unwaive test 5608930 by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/8831
- [https://nvbugs/5461796][fix] Unwaive test test_llmapi_speculative_decoding_mtp by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/8832
- [None][infra] Modify wheel path from cuda13/ to dlfw/ by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8868
- [None][infra] Waive failed tests for release branch on 11/03 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8879
- [https://nvbugs/5521253][fix] Enable Gemma3 12B & 27B on SM100 by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8666
- [None][chore] Update test list by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8835
- [https://nvbugs/5596343] [test] Update accuracy baseline for GPT-OSS-20B by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/8842
- [https://nvbugs/5451272][fix] unwaive the test by @Shixiaowei02 in https://github.com/NVIDIA/TensorRT-LLM/pull/8608
- [https://nvbugs/5606266][test] move qwen3 multi-node test to the qa list by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/8908
- [https://nvbugs/5569754][chore] Adjust max batch size to prevent OOM by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8876
- [https://nvbugs/5606136][fix] Fix torch.onnx.export with pytorch upgrade to fallback to dynamo=False. by @SimengLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8917
- [https://nvbugs/5601682][fix] unwaive test_disaggregated_deepseek_v3_… by @bo-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8888
- [https://nvbugs/5634220][fix] Add developer guide back and fix some i… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/8911
- [TRTLLM-8813][infra] Reduce GB200 multi-node test stages for release by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8860
- [https://nvbugs/5608930][fix] Waive TestQwen3_8B::test_chunked_prefill for bug 5608930 by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/8940
- [https://nvbugs/5467531][fix] Fix moe test and wide ep fake impl by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8883
- [https://nvbugs/5630700][chore] Unwaive Qwen3_235B_A22B test by @shuyixiong in https://github.com/NVIDIA/TensorRT-LLM/pull/8901
- [https://nvbugs/5570599][fix] Set KVCache free_gpu_memory_fraction fo… by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/8780
- [https://nvbugs/5597647][fix] Fix MNNVL unit test failed due to accuracy issue on Hopper by @timlee0212 in https://github.com/NVIDIA/TensorRT-LLM/pull/8891
- [https://nvbugs/5642736][fix] fix AutoDeploy pattern matcher for torch 2.9 (#8920) by @lucaslie in https://github.com/NVIDIA/TensorRT-LLM/pull/8958
- [None][infra] Waive failed tests for release branch 11/06 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8966
- [https://nvbugs/5636946][fix] Update test model by @crazydemo in https://github.com/NVIDIA/TensorRT-LLM/pull/8993
- [TRTLLM-9213][infra] Fix boost issue by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9005
- [None][doc] Replace the relative links with absolute links in README.md. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/8997
- [https://nvbugs/5575920][fix] Fix cublas/cublasLt handle creation memory not sufficient error by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/8900
- [None][infra] Waive failed tests for release branch 11/07 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9026
- [None][chore] Lock onnx version <1.20.0 and remove WAR for TRT 10.13 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/9007
- [TRTLLM-9073][doc] Add the missing content for model support section and fix… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/9033
- [TRTLLM-9080][infra] upgrade tritonserver DLFW 25.10 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/8877
- [https://nvbugs/5608743][chore] unwaive test by @reasonsolo in https://github.com/NVIDIA/TensorRT-LLM/pull/8994
- [https://nvbugs/5284463][fix] fix ada fp8 group gemm lacks shared memory by @inocsin in https://github.com/NVIDIA/TensorRT-LLM/pull/9044
- [https://nvbugs/5570575][fix] : Use less kv cache memory on SM120 by @peaceh-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/9054
- [https://nvbugs/5628952][fix] avoid cudaFree overlap with cuda graph by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/8903
- [https://nvbugs/5628204][fix] Stop token IDs - fast path optimization for single stop token IDs only by @moraxu in https://github.com/NVIDIA/TensorRT-LLM/pull/9014
- [TRTLLM-7971][doc] Doc update for multimodal in v1.1 by @chang-l in https://github.com/NVIDIA/TensorRT-LLM/pull/9015
- [https://nvbugs/5652552][fix] Log the llm args by @leslie-fang25 in https://github.com/NVIDIA/TensorRT-LLM/pull/9119
- [https://nvbugs/5643814] [fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/9127
- [https://nvbugs/5568836][fix] Skip keyword matching for Gemma3 e2e test by @brb-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/9158
- [TRTLLM-9159][doc] Add KV Connector docs by @Shunkangz in https://github.com/NVIDIA/TensorRT-LLM/pull/9043
- [https://nvbugs/5649826][fix] Unwaive test test_llm_commandr_plus_4gpus_summary by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/9201
- [None][fix] Bypass keyword matching for multimodal tests by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/9170
- [https://nvbugs/5582133][fix] unwaive nixl test by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/9244
- [https://nvbugs/5461796][fix] Unwaive and extend time for test_llmapi_speculative_decoding_mtp by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/9092
- [TRTLLM-9092][doc] Add a pre-quantized example in quick start guide by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9223
- [https://nvbugs/5648685][fix] Fix OpenAI server waiting time to avoid timeouts when loading large model weights by @dominicshanshan in https://github.com/NVIDIA/TensorRT-LLM/pull/9254
- [https://nvbugs/5670793][fix] Solve trtllm-serve launch_disaggregated… by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/9324
- [https://nvbugs/5601682][fix] Fix cacheTransceiver hang by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/9311
- [https://nvbugs/5545522][fix] Correct Cutlass with PDL support by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/9335
- [TRTLLM-9199][docs] KV Connector Docs by @jthomson04 in https://github.com/NVIDIA/TensorRT-LLM/pull/9325
- [https://nvbugs/5676748][fix] Cherry-pick #9336: Fix mismatched nvfp4 gemm sf shape. by @hyukn in https://github.com/NVIDIA/TensorRT-LLM/pull/9437
- [TRTLLM-9160][doc] add doc to llm_runtime.py by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/9482
- [None][doc] VDR 1.0 trtllm-serve doc enhancement by @LinPoly in https://github.com/NVIDIA/TensorRT-LLM/pull/9443
- [TRTLLM-9086][doc] Clean up TODOs in documentation by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9292
- [TRTLLM-9157][doc] Guided decoding doc improvement by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/9359
- [https://nvbugs/5687820][fix] Remove self.abort() in DetokenizedGenerationResult by @syuoni in https://github.com/NVIDIA/TensorRT-LLM/pull/9450
- [None][infra] Updated Linux installation guide by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/9485
- [None][infra] Waive failed tests for release branch on 11/30 by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9553
- [TRTLLM-9075][doc] refine the slurm examples by @Superjomn in https://github.com/NVIDIA/TensorRT-LLM/pull/9548
- [TRTLLM-9093][doc] update hyper links in overview by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9568
- [TRTLLM-9092][doc] link to modelopt checkpoints in quick start guide by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9571
- [None][chore] cherry-pick: Design diagram review process change (#8748) by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/9596
- [TRTLLM-9090] [doc] Update online benchmarking docs by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/9611
- [None][infra] add attribution files for release/1.1 by @yuanjingx87 in https://github.com/NVIDIA/TensorRT-LLM/pull/9495
- [TRTLLM-9082][doc] Address Dynamo Example feedback by @Tabrizian in https://github.com/NVIDIA/TensorRT-LLM/pull/9619
- [https://nvbugs/5652552][fix] cherry-pick add printing for llm args by @ruodil in https://github.com/NVIDIA/TensorRT-LLM/pull/9206
- [TRTLLM-4629][doc] Add B300 & GB300 in documents by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/9663
- [TRTLLM-9124][infra] Modify the requirement of tensorrt from 10.13.0 to 10.13.3 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9128
- [https://nvbugs/5503138] [fix] Remove compile warnings by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/9733
- [https://nvbugs/5537738][fix] Add fp8 post-quant allgather support to release 1.1 by @ChristinaZ in https://github.com/NVIDIA/TensorRT-LLM/pull/8322
- [IB-1920][doc] Update Perf_Overview.md with Benchmarking Results for Release 1.1 by @zbpatel in https://github.com/NVIDIA/TensorRT-LLM/pull/9723
- [None][doc] Update release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9739
- [TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9824
- [https://nvbugs/5729847][doc] fix broken links to modelopt by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9868
- [None][doc] remove nano-vl-v2 model support in release notes by @QiJune in https://github.com/NVIDIA/TensorRT-LLM/pull/9887
- [None][chore] Upgrade starlette and FastAPI (#9319) by @chzblych in https://github.com/NVIDIA/TensorRT-LLM/pull/9904
New Contributors
- @chenopis made their first contribution in #6531
- @hcyezhang made their first contribution in #5785
- @qianbiaoxiang made their first contribution in #5521
- @yuhyao made their first contribution in #7072
- @gracehonv made their first contribution in #7020
- @nvamyt made their first contribution in #7200
- @Maurits-de-Groot made their first contribution in #7260
- @aalanwyr made their first contribution in #7284
- @AndyDai-nv made their first contribution in #7137
- @sorenwu made their first contribution in #7502
- @therealnaveenkamal made their first contribution in #7490
- @asrivas made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/6495
- @macrocell made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/7216
- @PeganovAnton made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/7935
- @inocsin made their first contribution in https://github.com/NVIDIA/TensorRT-LLM/pull/9044
Full Changelog: v1.0.0...v1.1.0