Announcement Highlights
- Model Support
- API
  - Support out-of-tree models in `trtllm-serve` (#9269)
- Feature
- Fix
  - Use fp32 for indexer `weight_proj` GEMM (#9243)
  - Fix multimodal `InputProcessor` dummy builder (#8916)
  - Set correct `lm_head_tp_size_upper_bound` (#9300)
  - Move `torch.cuda.Stream` out of critical `torch` computation region (#8494)
  - Fix `trtllm-llmapi-launch` port conflict (#8582)
  - Rework `DisaggPPTerminationHandler` to fix hang issue (#8519)
  - Overwrite only if `default_max_tokens` is legal (#8538)
  - Fix block range index (#8470)
  - Restrict FP8 blockscale MoE case to valid configurations (#8583)
  - Fix `L0_backend_trtllm` behavior (#9282)
  - Improve beam search request validation (#9228)
  - Avoid incorrectly filling tensors with 0 (#9296)
  - Fall back to greedy sampling in two-model overlap scheduler to improve stability (#9321)
- Documentation
- Benchmark
  - Set `max_batch_size=1` to stabilize accuracy test results (#8609)
- Test & Infra
  - Use greedy decoding in `test_openai_compatible_json_schema` (#9305) (see the sketch after this list)
  - Enable checking duplicate items in `waives.txt` in pre-commit (#9265)
  - Fix test case where chunked attention is not supported on `sm_120` (#9260)
  - Add `NCCL_DEBUG=INFO` flag to collect more information on CI failures (#8440)
  - Remove multimodal test cases using TRT backend (#8611)
  - Clean cache for easily hanging test cases (#8619)
  - Enable relaxed acceptance test on Blackwell (#8709)
  - Update linter rules for mass integration (#8918)
  - Upgrade `starlette` and `FastAPI` dependencies (#9319)
  - Update `goggles_action` repository (#9240)
  - Move third-party components to their own list file (#8986)
  - Add fallback when fetching wheel from build stage fails (#9290)
  - Add `--waives-file` flag in rerun `pytest` command (#8971)
  - Add periodic JUnit XML path in `conftest` (#9337)
  - Consume `SlurmCluster` `sshPort` for clusters with custom SSH port (#9313)
  - Add one-model and overlap-scheduling to Eagle tests for GPTOSS (#9312)
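As a minimal illustration of the greedy-decoding items above (#9305, #9321): the sketch below sends a JSON-schema-constrained request with `temperature=0` to a `trtllm-serve` OpenAI-compatible endpoint. The server URL, model name, and schema are placeholders, and the `response_format` shape follows the standard OpenAI structured-output convention rather than anything taken from these PRs.

```python
# Minimal sketch, not taken from the PRs above. Assumptions: a trtllm-serve
# instance is already running at http://localhost:8000/v1 serving a model
# registered as "my-model" (both placeholders), and its OpenAI-compatible
# endpoint accepts the standard `response_format` json_schema field.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# JSON schema the response must conform to (illustrative only).
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Name the largest city in France and its population."}
    ],
    # temperature=0 requests greedy decoding, which keeps the structured
    # output deterministic across reruns (the motivation behind #9305).
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(resp.choices[0].message.content)
```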
What's Changed
- [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
- [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
- [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
- [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
- [None][ci] waive test_disagg_server_restart by @QiJune in #9326
- [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
- [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
- [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
- [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
- [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
- [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
- [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
- [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
- [None][infra] Update goggles_action repository by @karljang in #9240
- [TRTLLM-9197][infra] Move third-party components to their own list file by @cheshirekow in #8986
- [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
- [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
- [None][infra] Add fallback when fetching wheel from build stage fails by @ZhanruiSunCh in #9290
- [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
- [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
- [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
- [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
- [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
- [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
- [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
- [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
- [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
- [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
- [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
- [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
- [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
- [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
- [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
- [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
- [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
- [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312
Full Changelog: v1.2.0rc3...v1.2.0rc4