## Highlights
### Model Support
- Added support for Pixtral (`mistralai/Pixtral-12B-2409`). (#8377, #8168)
- Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
- Multi-input support for LLaVA (#8238), InternVL2 models (#8201); see the example below
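A minimal sketch of the multi-image input path, assuming a LLaVA checkpoint that accepts more than one image per prompt; the model name, image files, and chat template here are illustrative placeholders, so check the supported-models documentation for your version:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Raise the default one-image-per-prompt limit so two images can be passed.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",   # illustrative; pick a multi-image-capable checkpoint
    limit_mm_per_prompt={"image": 2},
)

images = [Image.open("left.jpg"), Image.open("right.jpg")]

outputs = llm.generate(
    {
        # One <image> placeholder per image; the exact template is model-specific.
        "prompt": "USER: <image>\n<image>\nWhat differs between these two images? ASSISTANT:",
        "multi_modal_data": {"image": images},  # a list of images instead of a single one
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```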
### Performance Enhancements
- Memory optimization for `awq_gemm` and `awq_dequantize`, 2x throughput (#8248); existing AWQ checkpoints pick this up automatically (see below)
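No code changes are needed to benefit: any AWQ-quantized model served by vLLM runs through the updated kernels. A minimal sketch (the checkpoint name is an example only):

```python
from vllm import LLM

# Any AWQ checkpoint loaded this way uses the optimized kernels.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```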
### Production Engine
- Support loading and unloading LoRA adapters in the API server (#6566); see the example below
- Add progress reporting to batch runner (#8060)
- Add support for NVIDIA ModelOpt static scaling checkpoints (#6112)
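A hedged sketch of driving the runtime LoRA endpoints added in #6566 with `requests`, against a server started with `--enable-lora`; the adapter name and path are placeholders, and the endpoint paths may differ between versions, so verify against your deployment:

```python
import requests

BASE_URL = "http://localhost:8000"

# Register a LoRA adapter on the running server.
resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"},
)
resp.raise_for_status()

# ... requests can now name "sql_adapter" as the model ...

# Remove the adapter once it is no longer needed.
resp = requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
)
resp.raise_for_status()
```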
### Others
- Update the docker image to use Python 3.12 for a small performance bump (#8133)
- Added CODE_OF_CONDUCT.md (#8161)
## What's Changed
- [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
- [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
- [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
- [ci] Mark LoRA test as soft-fail by @khluu in #8160
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
- [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
- Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
- [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
- Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
- [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
- [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
- [BugFix] Fix Granite model configuration by @njhill in #8216
- [Frontend] Add --logprobs argument to `benchmark_serving.py` by @afeldman-nm in #8191
- [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
- [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
- [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
- [Misc] Remove `SqueezeLLM` by @dsikka in #8220
- [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
- [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
- [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
- [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
- Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
- [tpu][misc] fix typo by @youkaichao in #8260
- [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
- [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
- [Model][VLM] Decouple weight loading logic for `Paligemma` by @Isotr0py in #8269
- ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
- [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
- [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
- [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
- [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
- [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
- [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
- [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
- Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
- [Bugfix] Fix missing `post_layernorm` in CLIP by @DarkLight1337 in #8155
- [CI/Build] enable ccache/sccache for HIP builds by @dtrifiro in #8327
- [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
- [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in #8130
- Fix ppc64le buildkite job by @sumitd2 in #8309
- [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
- [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
- [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
- [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
- [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
- [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
- [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
- [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
- [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
- [model] Support for Llava-Next-Video model by @TKONIY in #7559
- [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
- [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
- [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
- [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
- [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
- [Misc] Move device options to a single place by @akx in #8322
- [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
- Pixtral by @patrickvonplaten in #8377
- Bump version to v0.6.1 by @simon-mo in #8379
## New Contributors
- @mmcelaney made their first contribution in #8161
- @elfiegg made their first contribution in #8173
- @Manikandan-Thangaraj-ZS0321 made their first contribution in #7860
- @sumitd2 made their first contribution in #8026
- @alugowski made their first contribution in #8060
- @vladislavkruglikov made their first contribution in #8292
- @kevin314 made their first contribution in #8224
- @TKONIY made their first contribution in #7559
- @akx made their first contribution in #8322
Full Changelog: v0.6.0...v0.6.1