This release contains important bug fixes for v0.8.0. We highly recommend upgrading!
- V1 Fixes
- TPU
- Model
What's Changed
- [Bugfix] Fix interface for Olmo2 on V1 by @ywang96 in #14976
- [CI/Build] Use `AutoModelForImageTextToText` to load image models in tests by @DarkLight1337 in #14945
- [V1] Guard Against Main Thread Usage by @robertgshaw2-redhat in #14972
- [V1] TPU - Fix CI/CD runner for V1 and remove V0 tests by @alexm-redhat in #14974
- [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights by @tristanleclercq in #14950
- [Neuron] trim attention kernel tests to fit trn1.2x instance by @liangfu in #14988
- [Doc][V1] Fix V1 APC doc by @shen-shanshan in #14920
- [Kernels] LoRA - Retire SGMV and BGMV Kernels by @varun-sundar-rabindranath in #14685
- [Mistral-Small 3.1] Update docs and tests by @patrickvonplaten in #14977
- [Misc] Embedding model support LoRA by @jeejeelee in #14935
- [Bugfix] torchrun compatibility by @hiyouga in #14899
- [Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest` by @schoennenbeck in #14352
- [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros by @yangsijia-serena in #14347
- [Bugfix] Loosen type check to avoid errors in V1 by @DarkLight1337 in #15021
- [Bugfix] Register serializers for V0 MQ Engine by @simon-mo in #15009
- [TPU][V1][Bugfix] Fix chunked prefill with padding by @NickLucche in #15037
- MI325 configs, fused_moe_kernel bugfix by @ekuznetsov139 in #14987
- [MODEL] Add support for Zamba2 models by @yury-tokpanov in #13185
- [Bugfix] Fix broken CPU quantization due to triton import by @Isotr0py in #15038
- [Bugfix] Fix LoRA extra vocab size by @jeejeelee in #15047
- [V1] Refactor Structured Output for multiple backends by @russellb in #14694
- [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels by @WoosukKwon in #14930
- [V1] TPU - CI/CD use smaller model by @alexm-redhat in #15054
- fix long dtype in topk sampling by @chujiezheng in #15049
- [Doc] Minor v1_user_guide update by @JenZhao in #15064
- [Misc][V1] Skip device checking if not available by @comaniac in #15061
- [Model] Pixtral: Remove layer instantiation duplication by @juliendenize in #15053
- [Model] Remove duplicated message check in Mistral chat completion request by @b8zhong in #15069
- [Core] Update dtype detection and defaults by @DarkLight1337 in #14858
- [V1] Ensure using int64 for sampled token ids by @WoosukKwon in #15065
- [Bugfix] Re-enable Gemma3 for V1 by @DarkLight1337 in #14980
- [CI][Intel GPU] update XPU dockerfile and CI script by @jikunshang in #15109
- [V1][Bugfix] Fix oracle for device checking by @ywang96 in #15104
- [Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy data by @DarkLight1337 in #15107
- [Bugfix] Fix size calculation of processing cache by @DarkLight1337 in #15114
- [Doc] Update tip info on using latest transformers when creating a custom Dockerfile by @MarcCote in #15070
- [Misc][Benchmark] Add support for different `tokenizer_mode` by @aarnphm in #15040
- [Bugfix] Adjust mllama to regional compilation by @jkaniecki in #15112
- [Doc] Update the "the first vLLM China Meetup" slides link to point to the first page by @imkero in #15134
- [Frontend] Remove custom_cache_manager by @fulvius31 in #13791
- [V1] Minor V1 async engine test refactor by @andoorve in #15075
New Contributors
- @tristanleclercq made their first contribution in #14950
- @hiyouga made their first contribution in #14899
- @ekuznetsov139 made their first contribution in #14987
- @yury-tokpanov made their first contribution in #13185
- @juliendenize made their first contribution in #15053
- @MarcCote made their first contribution in #15070
- @jkaniecki made their first contribution in #15112
- @fulvius31 made their first contribution in #13791
Full Changelog: v0.8.0...v0.8.1