Highlights
- Support for the DeepSeek-V3 model (#11523, #11502).
  - On 8xH200s or MI300x, run `vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192`. The context length can be increased to about 32K before running into memory issues. A sample request against this server is sketched after this list.
  - For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference.
  - We are just getting started on enhancing the support and unlocking more performance; see #11539 for planned work.
- Last mile stretch for the V1 engine refactoring: API server (#11529, #11530), penalties for the sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107, #11472)
- Breaking change: `X-Request-ID` echoing is now opt-in instead of on by default, for performance reasons. Set `--enable-request-id-headers` to enable it; see the sketch after this list.
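As a follow-up to the DeepSeek-V3 highlight, here is a minimal sketch of a request against the OpenAI-compatible server started by the `vllm serve` command above, assuming the default host and port (`localhost:8000`):

```bash
# Query the DeepSeek-V3 server started above (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
        "max_tokens": 128
      }'
```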
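And a sketch of opting back in to `X-Request-ID` echoing after the breaking change; the model name here is only for illustration:

```bash
# Re-enable X-Request-ID echoing (now off by default for performance).
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-request-id-headers

# -i prints response headers; the echoed X-Request-ID should appear among them.
curl -i http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Request-ID: trace-1234" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "ping"}]}'
```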
Model Support
- IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
- Add `QVQ` and `QwQ` to the list of supported models (#11509)
Performance
- Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
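As a rough sketch of exercising the new kernels: serving a 2:4-sparse, FP8-quantized compressed-tensors checkpoint should pick up the sparse Cutlass path automatically on supported GPUs; the checkpoint name below is hypothetical:

```bash
# Hypothetical 2:4-sparse FP8 compressed-tensors checkpoint; the quantization and
# sparsity config travel with the checkpoint, so no extra flags should be needed.
vllm serve nm-testing/Llama-3.1-8B-Instruct-2of4-FP8 --max-model-len 4096
```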
Production Engine
- Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192); usage sketches for these features follow this list
- Online Pooling API (#11457)
- Load video from base64 (#11492)
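A minimal sketch of the S3 loader; the bucket path is a placeholder, AWS credentials are assumed to come from the environment, and the streamer extra may need to be installed first:

```bash
# Optional dependency for the streamer loader (extra name per vLLM docs).
pip install "vllm[runai]"

# Stream weights straight from S3; s3://my-bucket/my-model/ is a placeholder path.
vllm serve s3://my-bucket/my-model/ --load-format runai_streamer
```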
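A sketch of the new pooling route, assuming a pooling-capable model is being served (the embedding model below is just an example):

```bash
# Serve an embedding model, then hit the new /pooling route with an input string.
vllm serve BAAI/bge-base-en-v1.5

curl http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-base-en-v1.5", "input": "vLLM pooling API test"}'
```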
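And a sketch of sending a base64-encoded video through the chat API, assuming a video-capable model such as Qwen2-VL is being served; `sample.mp4` is a placeholder:

```bash
# Encode a local clip and pass it as a data URL in a video_url content part.
VIDEO_B64=$(base64 -w 0 sample.mp4)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this video."},
            {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'"$VIDEO_B64"'"}}
          ]
        }]
      }'
```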
Others
- Add a PyPI index for every commit and nightly build (#11404)
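For example, installing the latest nightly wheel from the new index might look like this (index URL as documented for vLLM nightlies; treat it as an assumption here):

```bash
# Pull the newest nightly build from the wheel index added in #11404.
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```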
What's Changed
- [Bugfix] Set temperature=0.7 in test_guided_choice_chat by @mgoin in #11264
- [V1] Prefix caching for vision language models by @comaniac in #11187
- [Bugfix] Restore support for larger block sizes by @kzawora-intel in #11259
- [Bugfix] Fix guided decoding with tokenizer mode mistral by @wallashss in #11046
- [MISC][XPU]update ipex link for CI fix by @yma11 in #11278
- [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support by @dsikka in #10995
- [Bugfix] Fix broken phi3-v mm_processor_kwargs tests by @Isotr0py in #11263
- [CI][Misc] Remove Github Action Release Workflow by @simon-mo in #11274
- [FIX] update openai version by @jikunshang in #11287
- [Bugfix] fix minicpmv test by @joerunde in #11304
- [V1] VLM - enable processor cache by default by @alexm-neuralmagic in #11305
- [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) by @tlrmchlsmth in #11311
- [Model] IBM Granite 3.1 by @tjohnson31415 in #11307
- [CI] Expand test_guided_generate to test all backends by @mgoin in #11313
- [V1] Simplify prefix caching logic by removing `num_evictable_computed_blocks` by @heheda12345 in #11310
- [VLM] Merged multimodal processor for Qwen2-Audio by @DarkLight1337 in #11303
- [Kernel] Refactor Cutlass c3x by @varun-sundar-rabindranath in #10049
- [Misc] Optimize ray worker initialization time by @ruisearch42 in #11275
- [misc] benchmark_throughput : Add LoRA by @varun-sundar-rabindranath in #11267
- [Feature] Add load generation config from model by @liuyanyi in #11164
- [Bugfix] Cleanup Pixtral HF code by @DarkLight1337 in #11333
- [Model] Add JambaForSequenceClassification model by @yecohn in #10860
- [V1] Fix multimodal profiling for `Molmo` by @ywang96 in #11325
- [Model] Refactor Qwen2-VL to use merged multimodal processor by @Isotr0py in #11258
- [Misc] Clean up and consolidate LRUCache by @DarkLight1337 in #11339
- [Bugfix] Fix broken CPU compressed-tensors test by @Isotr0py in #11338
- [Misc] Remove unused vllm/block.py by @Ghjk94522 in #11336
- [CI] Adding CPU docker pipeline by @zhouyuan in #11261
- [Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 by @Akashcodes732 in #11331
- [ci][gh200] dockerfile clean up by @youkaichao in #11351
- [Misc] Add tqdm progress bar during graph capture by @mgoin in #11349
- [Bugfix] Fix spec decoding when seed is none in a batch by @wallashss in #10863
- [misc] add early error message for custom ops by @youkaichao in #11355
- [doc] backward compatibility for 0.6.4 by @youkaichao in #11359
- [V1] Fix profiling for models with merged input processor by @ywang96 in #11370
- [CI/Build] fix pre-compiled wheel install for exact tag by @dtrifiro in #11373
- [Core] Loading model from S3 using RunAI Model Streamer as optional loader by @omer-dayan in #10192
- [Bugfix] Don't log OpenAI field aliases as ignored by @mgoin in #11378
- [doc] explain nccl requirements for rlhf by @youkaichao in #11381
- Add ray[default] to wget to run distributed inference out of box by @Jeffwan in #11265
- [V1][Bugfix] Skip hashing empty or None mm_data by @WoosukKwon in #11386
- [Bugfix] update should_ignore_layer by @horheynm in #11354
- [V1] Make AsyncLLMEngine v1-v0 opaque by @rickyyx in #11383
- [Bugfix] Fix issues for `Pixtral-Large-Instruct-2411` by @ywang96 in #11393
- [CI] Fix flaky entrypoint tests by @ywang96 in #11403
- [cd][release] add pypi index for every commit and nightly build by @youkaichao in #11404
- [cd][release] fix race conditions by @youkaichao in #11407
- [Bugfix] Fix fully sharded LoRAs with Mixtral by @n1hility in #11390
- [CI] Unblock H100 Benchmark by @simon-mo in #11419
- [misc][perf] remove old code by @youkaichao in #11425
- mypy type checking for vllm/worker by @lucas-tucker in #11418
- [Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF by @mgoin in #11389
- [Bugfix] torch nightly version in ROCm installation guide by @terrytangyuan in #11423
- [Misc] Add assertion and helpful message for marlin24 compressed models by @dsikka in #11388
- [Misc] add w8a8 asym models by @dsikka in #11075
- [CI] Expand OpenAI test_chat.py guided decoding tests by @mgoin in #11048
- [Bugfix] Add kv cache scales to gemma2.py by @mgoin in #11269
- [Doc] Fix typo in the help message of '--guided-decoding-backend' by @yansh97 in #11440
- [Docs] Convert rST to MyST (Markdown) by @rafvasq in #11145
- [V1] TP Ray executor by @ruisearch42 in #11107
- [Misc] Suppress irrelevant exception stack trace information when CUDA… by @shiquan1988 in #11438
- [Frontend] Online Pooling API by @DarkLight1337 in #11457
- [Bugfix] Fix Qwen2-VL LoRA weight loading by @jeejeelee in #11430
- [Bugfix][Hardware][CPU] Fix CPU `input_positions` creation for text-only inputs with mrope by @Isotr0py in #11434
- [OpenVINO] Fixed installation conflicts by @ilya-lavrenov in #11458
- [attn][tiny fix] fix attn backend in MultiHeadAttention by @MengqingCao in #11463
- [Misc] Move weights mapper by @jeejeelee in #11443
- [Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #11435
- [Model] Automatic conversion of classification and reward models by @DarkLight1337 in #11469
- [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor by @ruisearch42 in #11472
- [Misc] Update disaggregation benchmark scripts and test logs by @Jeffwan in #11456
- [Frontend] Enable decord to load video from base64 by @DarkLight1337 in #11492
- [Doc] Improve GitHub links by @DarkLight1337 in #11491
- [Misc] Move some multimodal utils to modality-specific modules by @DarkLight1337 in #11494
- Mypy checking for vllm/compilation by @lucas-tucker in #11496
- [Misc][LoRA] Fix LoRA weight mapper by @jeejeelee in #11495
- [Doc] Add `QVQ` and `QwQ` to the list of supported models by @ywang96 in #11509
- [V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler by @sroy745 in #10681
- [Model] Modify MolmoForCausalLM MLP by @jeejeelee in #11510
- [Misc] Add placeholder module by @DarkLight1337 in #11501
- [Doc] Add video example to openai client for multimodal by @Isotr0py in #11521
- [V1] [1/N] [Breaking Change] API Server (Remove Proxy) by @robertgshaw2-neuralmagic in #11529
- [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization by @mgoin in #11523
- [2/N] API Server: Avoid ulimit footgun by @robertgshaw2-neuralmagic in #11530
- Deepseek v3 by @simon-mo in #11502
New Contributors
- @Ghjk94522 made their first contribution in #11336
- @Akashcodes732 made their first contribution in #11331
- @omer-dayan made their first contribution in #10192
- @horheynm made their first contribution in #11354
- @n1hility made their first contribution in #11390
- @lucas-tucker made their first contribution in #11418
- @shiquan1988 made their first contribution in #11438
Full Changelog: v0.6.5...v0.6.6