This release contains important bug fixes for v0.8.0. We highly recommend upgrading!
- V1 Fixes
- TPU
- Model
What's Changed
- [Bugfix] Fix interface for Olmo2 on V1 by @ywang96 in #14976
- [CI/Build] Use `AutoModelForImageTextToText` to load image models in tests by @DarkLight1337 in #14945
- [V1] Guard Against Main Thread Usage by @robertgshaw2-redhat in #14972
- [V1] TPU - Fix CI/CD runner for V1 and remove V0 tests by @alexm-redhat in #14974
- [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights by @tristanleclercq in #14950
- [Neuron] trim attention kernel tests to fit trn1.2x instance by @liangfu in #14988
- [Doc][V1] Fix V1 APC doc by @shen-shanshan in #14920
- [Kernels] LoRA - Retire SGMV and BGMV Kernels by @varun-sundar-rabindranath in #14685
- [Mistral-Small 3.1] Update docs and tests by @patrickvonplaten in #14977
- [Misc] Embedding model support LoRA by @jeejeelee in #14935
- [Bugfix] torchrun compatibility by @hiyouga in #14899
- [Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest` by @schoennenbeck in #14352
- [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros by @yangsijia-serena in #14347
- [Bugfix] Loosen type check to avoid errors in V1 by @DarkLight1337 in #15021
- [Bugfix] Register serializers for V0 MQ Engine by @simon-mo in #15009
- [TPU][V1][Bugfix] Fix chunked prefill with padding by @NickLucche in #15037
- MI325 configs, fused_moe_kernel bugfix by @ekuznetsov139 in #14987
- [MODEL] Add support for Zamba2 models by @yury-tokpanov in #13185
- [Bugfix] Fix broken CPU quantization due to triton import by @Isotr0py in #15038
- [Bugfix] Fix LoRA extra vocab size by @jeejeelee in #15047
- [V1] Refactor Structured Output for multiple backends by @russellb in #14694
- [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels by @WoosukKwon in #14930
- [V1] TPU - CI/CD use smaller model by @alexm-redhat in #15054
- fix long dtype in topk sampling by @chujiezheng in #15049
- [Doc] Minor v1_user_guide update by @JenZhao in #15064
- [Misc][V1] Skip device checking if not available by @comaniac in #15061
- [Model] Pixtral: Remove layer instantiation duplication by @juliendenize in #15053
- [Model] Remove duplicated message check in Mistral chat completion request by @b8zhong in #15069
- [Core] Update dtype detection and defaults by @DarkLight1337 in #14858
- [V1] Ensure using int64 for sampled token ids by @WoosukKwon in #15065
- [Bugfix] Re-enable Gemma3 for V1 by @DarkLight1337 in #14980
- [CI][Intel GPU] update XPU dockerfile and CI script by @jikunshang in #15109
- [V1][Bugfix] Fix oracle for device checking by @ywang96 in #15104
- [Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy data by @DarkLight1337 in #15107
- [Bugfix] Fix size calculation of processing cache by @DarkLight1337 in #15114
- [Doc] Update tip info on using latest transformers when creating a custom Dockerfile by @MarcCote in #15070
- [Misc][Benchmark] Add support for different `tokenizer_mode` by @aarnphm in #15040
- [Bugfix] Adjust mllama to regional compilation by @jkaniecki in #15112
- [Doc] Update the "the first vLLM China Meetup" slides link to point to the first page by @imkero in #15134
- [Frontend] Remove custom_cache_manager by @fulvius31 in #13791
- [V1] Minor V1 async engine test refactor by @andoorve in #15075
New Contributors
- @tristanleclercq made their first contribution in #14950
- @hiyouga made their first contribution in #14899
- @ekuznetsov139 made their first contribution in #14987
- @yury-tokpanov made their first contribution in #13185
- @juliendenize made their first contribution in #15053
- @MarcCote made their first contribution in #15070
- @jkaniecki made their first contribution in #15112
- @fulvius31 made their first contribution in #13791
Full Changelog: v0.8.0...v0.8.1