What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skip structured outputs for models that don't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] Fix multimodal_inputs.md GitHub examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
New Contributors
- @sangbumlikeagod made their first contribution in #18809
- @djmmoss made their first contribution in #19757
- @GuyStone made their first contribution in #20497
- @bottler made their first contribution in #20487
Full Changelog: v0.9.2rc1...v0.9.2rc2