Highlights
- Support for the DeepSeek-V3 model (#11523, #11502).
  - On 8xH200s or MI300x, run `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K before running into memory issues. A sample request against this server is sketched after this list.
  - For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference.
  - We are just getting started on enhancing the support and unlocking more performance; see #11539 for planned work.
- Last mile stretch for the V1 engine refactoring: API server (#11529, #11530), penalties for the sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107, #11472)
- Breaking change: `X-Request-ID` echoing is now opt-in instead of on by default, for performance reasons. Set `--enable-request-id-headers` to enable it; see the sketch after this list.
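As a follow-up to the DeepSeek-V3 highlight, here is a minimal sketch of a request against the OpenAI-compatible server started by the `vllm serve` command above, assuming the default host and port (`localhost:8000`):

```bash
# Query the DeepSeek-V3 server started above (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
        "max_tokens": 128
      }'
```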
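And a sketch of opting back in to `X-Request-ID` echoing after the breaking change; the model name here is only for illustration:

```bash
# Re-enable X-Request-ID echoing (now off by default for performance).
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-request-id-headers

# -i prints response headers; the echoed X-Request-ID should appear among them.
curl -i http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Request-ID: trace-1234" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "ping"}]}'
```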
Model Support
- IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
- Add `QVQ` and `QwQ` to the list of supported models (#11509)
Performance
- Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
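As a rough sketch of exercising the new kernels: serving a 2:4-sparse, FP8-quantized compressed-tensors checkpoint should pick up the sparse Cutlass path automatically on supported GPUs; the checkpoint name below is hypothetical:

```bash
# Hypothetical 2:4-sparse FP8 compressed-tensors checkpoint; the quantization and
# sparsity config travel with the checkpoint, so no extra flags should be needed.
vllm serve nm-testing/Llama-3.1-8B-Instruct-2of4-FP8 --max-model-len 4096
```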
Production Engine
- Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192); usage sketches for these features follow this list
- Online Pooling API (#11457)
- Load video from base64 (#11492)
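A minimal sketch of the S3 loader; the bucket path is a placeholder, AWS credentials are assumed to come from the environment, and the streamer extra may need to be installed first:

```bash
# Optional dependency for the streamer loader (extra name per vLLM docs).
pip install "vllm[runai]"

# Stream weights straight from S3; s3://my-bucket/my-model/ is a placeholder path.
vllm serve s3://my-bucket/my-model/ --load-format runai_streamer
```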
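A sketch of the new pooling route, assuming a pooling-capable model is being served (the embedding model below is just an example):

```bash
# Serve an embedding model, then hit the new /pooling route with an input string.
vllm serve BAAI/bge-base-en-v1.5

curl http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-base-en-v1.5", "input": "vLLM pooling API test"}'
```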
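And a sketch of sending a base64-encoded video through the chat API, assuming a video-capable model such as Qwen2-VL is being served; `sample.mp4` is a placeholder:

```bash
# Encode a local clip and pass it as a data URL in a video_url content part.
VIDEO_B64=$(base64 -w 0 sample.mp4)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this video."},
            {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'"$VIDEO_B64"'"}}
          ]
        }]
      }'
```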
Others
- Add a PyPI index for every commit and nightly build (#11404)
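For example, installing the latest nightly wheel from the new index might look like this (index URL as documented for vLLM nightlies; treat it as an assumption here):

```bash
# Pull the newest nightly build from the wheel index added in #11404.
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```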
What's Changed
- [Bugfix] Set temperature=0.7 in test_guided_choice_chat by @mgoin in #11264
- [V1] Prefix caching for vision language models by @comaniac in #11187
- [Bugfix] Restore support for larger block sizes by @kzawora-intel in #11259
- [Bugfix] Fix guided decoding with tokenizer mode mistral by @wallashss in #11046
- [MISC][XPU]update ipex link for CI fix by @yma11 in #11278
- [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support by @dsikka in #10995
- [Bugfix] Fix broken phi3-v mm_processor_kwargs tests by @Isotr0py in #11263
- [CI][Misc] Remove Github Action Release Workflow by @simon-mo in #11274
- [FIX] update openai version by @jikunshang in #11287
- [Bugfix] fix minicpmv test by @joerunde in #11304
- [V1] VLM - enable processor cache by default by @alexm-neuralmagic in #11305
- [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) by @tlrmchlsmth in #11311
- [Model] IBM Granite 3.1 by @tjohnson31415 in #11307
- [CI] Expand test_guided_generate to test all backends by @mgoin in #11313
- [V1] Simplify prefix caching logic by removing `num_evictable_computed_blocks` by @heheda12345 in #11310
- [VLM] Merged multimodal processor for Qwen2-Audio by @DarkLight1337 in #11303
- [Kernel] Refactor Cutlass c3x by @varun-sundar-rabindranath in #10049
- [Misc] Optimize ray worker initialization time by @ruisearch42 in #11275
- [misc] benchmark_throughput : Add LoRA by @varun-sundar-rabindranath in #11267
- [Feature] Add load generation config from model by @liuyanyi in #11164
- [Bugfix] Cleanup Pixtral HF code by @DarkLight1337 in #11333
- [Model] Add JambaForSequenceClassification model by @yecohn in #10860
- [V1] Fix multimodal profiling for `Molmo` by @ywang96 in #11325
- [Model] Refactor Qwen2-VL to use merged multimodal processor by @Isotr0py in #11258
- [Misc] Clean up and consolidate LRUCache by @DarkLight1337 in #11339
- [Bugfix] Fix broken CPU compressed-tensors test by @Isotr0py in #11338
- [Misc] Remove unused vllm/block.py by @Ghjk94522 in #11336
- [CI] Adding CPU docker pipeline by @zhouyuan in #11261
- [Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 by @Akashcodes732 in #11331
- [ci][gh200] dockerfile clean up by @youkaichao in #11351
- [Misc] Add tqdm progress bar during graph capture by @mgoin in #11349
- [Bugfix] Fix spec decoding when seed is none in a batch by @wallashss in #10863
- [misc] add early error message for custom ops by @youkaichao in #11355
- [doc] backward compatibility for 0.6.4 by @youkaichao in #11359
- [V1] Fix profiling for models with merged input processor by @ywang96 in #11370
- [CI/Build] fix pre-compiled wheel install for exact tag by @dtrifiro in #11373
- [Core] Loading model from S3 using RunAI Model Streamer as optional loader by @omer-dayan in #10192
- [Bugfix] Don't log OpenAI field aliases as ignored by @mgoin in #11378
- [doc] explain nccl requirements for rlhf by @youkaichao in #11381
- Add ray[default] to wget to run distributed inference out of box by @Jeffwan in #11265
- [V1][Bugfix] Skip hashing empty or None mm_data by @WoosukKwon in #11386
- [Bugfix] update should_ignore_layer by @horheynm in #11354
- [V1] Make AsyncLLMEngine v1-v0 opaque by @rickyyx in #11383
- [Bugfix] Fix issues for `Pixtral-Large-Instruct-2411` by @ywang96 in #11393
- [CI] Fix flaky entrypoint tests by @ywang96 in #11403
- [cd][release] add pypi index for every commit and nightly build by @youkaichao in #11404
- [cd][release] fix race conditions by @youkaichao in #11407
- [Bugfix] Fix fully sharded LoRAs with Mixtral by @n1hility in #11390
- [CI] Unblock H100 Benchmark by @simon-mo in #11419
- [misc][perf] remove old code by @youkaichao in #11425
- mypy type checking for vllm/worker by @lucas-tucker in #11418
- [Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF by @mgoin in #11389
- [Bugfix] torch nightly version in ROCm installation guide by @terrytangyuan in #11423
- [Misc] Add assertion and helpful message for marlin24 compressed models by @dsikka in #11388
- [Misc] add w8a8 asym models by @dsikka in #11075
- [CI] Expand OpenAI test_chat.py guided decoding tests by @mgoin in #11048
- [Bugfix] Add kv cache scales to gemma2.py by @mgoin in #11269
- [Doc] Fix typo in the help message of '--guided-decoding-backend' by @yansh97 in #11440
- [Docs] Convert rST to MyST (Markdown) by @rafvasq in #11145
- [V1] TP Ray executor by @ruisearch42 in #11107
- [Misc] Suppress irrelevant exception stack trace information when CUDA… by @shiquan1988 in #11438
- [Frontend] Online Pooling API by @DarkLight1337 in #11457
- [Bugfix] Fix Qwen2-VL LoRA weight loading by @jeejeelee in #11430
- [Bugfix][Hardware][CPU] Fix CPU `input_positions` creation for text-only inputs with mrope by @Isotr0py in #11434
- [OpenVINO] Fixed installation conflicts by @ilya-lavrenov in #11458
- [attn][tiny fix] fix attn backend in MultiHeadAttention by @MengqingCao in #11463
- [Misc] Move weights mapper by @jeejeelee in #11443
- [Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #11435
- [Model] Automatic conversion of classification and reward models by @DarkLight1337 in #11469
- [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor by @ruisearch42 in #11472
- [Misc] Update disaggregation benchmark scripts and test logs by @Jeffwan in #11456
- [Frontend] Enable decord to load video from base64 by @DarkLight1337 in #11492
- [Doc] Improve GitHub links by @DarkLight1337 in #11491
- [Misc] Move some multimodal utils to modality-specific modules by @DarkLight1337 in #11494
- Mypy checking for vllm/compilation by @lucas-tucker in #11496
- [Misc][LoRA] Fix LoRA weight mapper by @jeejeelee in #11495
- [Doc] Add `QVQ` and `QwQ` to the list of supported models by @ywang96 in #11509
- [V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler by @sroy745 in #10681
- [Model] Modify MolmoForCausalLM MLP by @jeejeelee in #11510
- [Misc] Add placeholder module by @DarkLight1337 in #11501
- [Doc] Add video example to openai client for multimodal by @Isotr0py in #11521
- [V1] [1/N] [Breaking Change] API Server (Remove Proxy) by @robertgshaw2-neuralmagic in #11529
- [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization by @mgoin in #11523
- [2/N] API Server: Avoid ulimit footgun by @robertgshaw2-neuralmagic in #11530
- Deepseek v3 by @simon-mo in #11502
New Contributors
- @Ghjk94522 made their first contribution in #11336
- @Akashcodes732 made their first contribution in #11331
- @omer-dayan made their first contribution in #10192
- @horheynm made their first contribution in #11354
- @n1hility made their first contribution in #11390
- @lucas-tucker made their first contribution in #11418
- @shiquan1988 made their first contribution in #11438
Full Changelog: v0.6.5...v0.6.6