Highlights
Model Support
- Support Llama 3.2 models (#8811, #8822); a client example follows this list:

  ```
  vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
  ```
- Beam search has been soft-deprecated (#8684, #8763, #8713). We are moving towards a more performant version of beam search that also simplifies vLLM's core; see the beam search sketch after this list.
  - ⚠️ You will now see the following error; this is a breaking change!

    > Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the `vllm.LLM.use_beam_search` method for dedicated beam search instead, or set the environment variable `VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1` to suppress this error. For more details, see #8306.
- Support for Solar (#8386), MiniCPM3 (#8297), and LLaVA-OneVision (#8486)
- Enhancements: pipeline parallelism for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), and tensor parallelism with bitsandbytes quantization (#8434)
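To try the multimodal Llama 3.2 server started with the `vllm serve` command above, you can send an image through the OpenAI-compatible chat API. A minimal client sketch, assuming the server runs on the default port 8000 and using a placeholder image URL:

```python
# Client sketch for the Llama 3.2 vision server started via `vllm serve`.
# Assumes the default host/port; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```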
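For the beam search migration, here is a minimal sketch of the dedicated path that replaces `SamplingParams(use_beam_search=True)`. The exact `LLM.beam_search` signature and the shape of the returned objects are assumptions and may differ between releases:

```python
# Sketch of the dedicated beam search path (#8726). Argument names and the
# structure of the returned objects are assumptions; check your vLLM version.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Instead of SamplingParams(use_beam_search=True), call the dedicated method.
outputs = llm.beam_search(
    prompts=["The capital of France is"],
    beam_width=4,
    max_tokens=32,
)
for beams in outputs:
    # Each entry holds the top-`beam_width` candidate sequences.
    print(beams)
```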
Hardware Support
- TPU: implement multi-step scheduling (#8489); use Ray as the default distributed backend (#8389)
- CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
- AMD: custom paged attention kernel for ROCm (#8310) and FP8 KV cache support (#8577)
Production Engine
- Initial support for priority scheduling (#5958)
- Support LoRA lineage and base model metadata management (#6315)
- Batch inference for the `llm.chat()` API (#8648); see the sketch below
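A quick illustration of batched `llm.chat()`: pass a list of conversations and get one output per conversation. A minimal sketch; the model name here is a placeholder:

```python
# Sketch of batched llm.chat() (#8648). The model name is a placeholder.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

conversations = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Write a one-line haiku about GPUs."}],
]

# chat() applies the model's chat template to every conversation and runs
# the whole batch in a single call, returning one result per conversation.
outputs = llm.chat(conversations)
for out in outputs:
    print(out.outputs[0].text)
```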
Performance
- Introduce `MQLLMEngine` for the API server, boosting throughput by 30% in single-step and 7% in multi-step scheduling (#8157, #8761, #8584)
- Multi-step scheduling enhancements
- Add CUDA graph support during decoding for encoder-decoder models (#7631)
Others
- Support sampling from HF datasets and image input for `benchmark_serving` (#8495); see the example below
- Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)
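A hedged invocation of the serving benchmark with the new HF-dataset sampling; the flag names follow #8495 but may vary by version, and the model and dataset here are placeholders:

```
# Sample requests (including images) from an HF dataset when benchmarking a
# running server. Flags follow #8495 and may differ by version; the model
# and dataset are placeholders.
python benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --dataset-name hf \
    --dataset-path lmms-lab/LLaVA-OneVision-Data \
    --hf-subset "chart2text(cauldron)" \
    --hf-split train \
    --num-prompts 100
```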
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
- [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
- [doc] recommend pip instead of conda by @youkaichao in #8446
- [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
- [misc][ci] fix quant test by @youkaichao in #8449
- [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
- [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
- [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
- [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
- [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
- bump version to v0.6.1.post2 by @simon-mo in #8473
- [Hardware][intel GPU] bump up ipex version to 2.3 by @jikunshang in #8365
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by @charlifu in #8310
- [Model] support minicpm3 by @SUDA-HLT-ywfang in #8297
- [torch.compile] fix functionalization by @youkaichao in #8480
- [torch.compile] add a flag to disable custom op by @youkaichao in #8488
- [TPU] Implement multi-step scheduling by @WoosukKwon in #8489
- [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by @chrisociepa in #8490
- [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel by @Isotr0py in #8357
- [Kernel] Enable 8-bit weights in Fused Marlin MoE by @ElizaWszola in #8032
- [Frontend] Expose revision arg in OpenAI server by @lewtun in #8501
- [BugFix] Fix clean shutdown issues by @njhill in #8492
- [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by @sasha0552 in #8506
- [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by @ProExpertProg in #7270
- [doc] update doc on testing and debugging by @youkaichao in #8514
- [Bugfix] Bind api server port before starting engine by @kevin314 in #8491
- [perf bench] set timeout to debug hanging by @simon-mo in #8516
- [misc] small qol fixes for release process by @simon-mo in #8517
- [Bugfix] Fix 3.12 builds on main by @joerunde in #8510
- [refactor] remove triton based sampler by @simon-mo in #8524
- [Frontend] Improve Nullable kv Arg Parsing by @alex-jw-brooks in #8525
- [Misc][Bugfix] Disable guided decoding for mistral tokenizer by @ywang96 in #8521
- [torch.compile] register allreduce operations as custom ops by @youkaichao in #8526
- [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by @ruisearch42 in #8509
- [Benchmark] Support sample from HF datasets and image input for benchmark_serving by @Isotr0py in #8495
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by @sroy745 in #7631
- [Feature][kernel] tensor parallelism with bitsandbytes quantization by @chenqianfzh in #8434
- [Model] Add mistral function calling format to all models loaded with "mistral" format by @patrickvonplaten in #8515
- [Misc] Don't dump contents of kvcache tensors on errors by @njhill in #8527
- [Bugfix] Fix TP > 1 for new granite by @joerunde in #8544
- [doc] improve installation doc by @youkaichao in #8550
- [CI/Build] Excluding kernels/test_gguf.py from ROCm by @alexeykondrat in #8520
- [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching by @tlrmchlsmth in #8012
- [CI/Build] fix Dockerfile.cpu on podman by @dtrifiro in #8540
- [Misc] Add argument to disable FastAPI docs by @Jeffwan in #8554
- [CI/Build] Avoid CUDA initialization by @DarkLight1337 in #8534
- [CI/Build] Update Ruff version by @aarnphm in #8469
- [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH by @alexm-neuralmagic in #8157
- [Core] Prompt logprobs support in Multi-step by @afeldman-nm in #8199
- [Core] zmq: bind only to 127.0.0.1 for local-only usage by @russellb in #8543
- [Model] Support Solar Model by @shing100 in #8386
- [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call by @gshtras in #8380
- [Kernel] Change interface to Mamba selective_state_update for continuous batching by @tlrmchlsmth in #8039
- [BugFix] Nonzero exit code if MQLLMEngine startup fails by @njhill in #8572
- [Bugfix] add `dead_error` property to engine client by @joerunde in #8574
- [Kernel] Remove marlin moe templating on thread_m_blocks by @tlrmchlsmth in #8573
- [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. by @sroy745 in #8545
- Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" by @ywang96 in #8593
- [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py by @KuntaiDu in #8616
- [MISC] remove engine_use_ray in benchmark_throughput.py by @jikunshang in #8615
- [Frontend] Use MQLLMEngine for embeddings models too by @njhill in #8584
- [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention by @charlifu in #8577
- [Core] simplify logits resort in _apply_top_k_top_p by @hidva in #8619
- [Doc] Add documentation for GGUF quantization by @Isotr0py in #8618
- Create SECURITY.md by @simon-mo in #8642
- [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail by @alexeykondrat in #8551
- [Misc] guard against change in cuda library name by @bnellnm in #8609
- [Bugfix] Fix Phi3.5 mini and MoE LoRA inference by @garg-amit in #8571
- [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata by @SolitaryThinker in #8474
- [Core] Support Lora lineage and base model metadata management by @Jeffwan in #6315
- [Model] Add OLMoE by @Muennighoff in #7922
- [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build by @alexeykondrat in #8670
- [Bugfix] Validate SamplingParam n is an int by @saumya-saran in #8548
- [Misc] Show AMD GPU topology in `collect_env.py` by @DarkLight1337 in #8649
- [Bugfix] Config.__init__() got an unexpected keyword argument 'engine' api_server args by @Juelianqvq in #8556
- [Bugfix][Core] Fix tekken edge case for mistral tokenizer by @patrickvonplaten in #8640
- [Doc] neuron documentation update by @omrishiv in #8671
- [Hardware][AWS] update neuron to 2.20 by @omrishiv in #8676
- [Bugfix] Fix incorrect llava next feature size calculation by @zyddnys in #8496
- [Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt` by @DarkLight1337 in #8673
- [MISC] add support custom_op check by @jikunshang in #8557
- [Core] Factor out common code in `SequenceData` and `Sequence` by @DarkLight1337 in #8675
- [beam search] add output for manually checking the correctness by @youkaichao in #8684
- [Kernel] Build flash-attn from source by @ProExpertProg in #8245
- [VLM] Use `SequenceData.from_token_counts` to create dummy data by @DarkLight1337 in #8687
- [Doc] Fix typo in AMD installation guide by @Imss27 in #8689
- [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 by @rasmith in #8646
- [dbrx] refactor dbrx experts to extend FusedMoe class by @divakar-amd in #8518
- [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu by @tlrmchlsmth in #8643
- [Bugfix] Refactor composite weight loading logic by @Isotr0py in #8656
- [ci][build] fix vllm-flash-attn by @youkaichao in #8699
- [Model] Refactor BLIP/BLIP-2 to support composite model loading by @DarkLight1337 in #8407
- [Misc] Use NamedTuple in Multi-image example by @alex-jw-brooks in #8705
- [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler by @statelesshz in #8703
- [Model][VLM] Add LLaVA-Onevision model support by @litianjian in #8486
- [SpecDec][Misc] Cleanup, remove bonus token logic. by @LiuXiaoxuanPKU in #8701
- [build] enable existing pytorch (for GH200, aarch64, nightly) by @youkaichao in #8713
- [misc] upgrade mistral-common by @youkaichao in #8715
- [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building by @tlrmchlsmth in #8702
- [Bugfix] Fix CPU CMake build by @ProExpertProg in #8723
- [Bugfix] fix docker build for xpu by @yma11 in #8652
- [Core][Frontend] Support Passing Multimodal Processor Kwargs by @alex-jw-brooks in #8657
- [Hardware][CPU] Refactor CPU model runner by @Isotr0py in #8729
- [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner by @bigPYJ1151 in #8733
- [Model] Support pp for qwen2-vl by @liuyanyi in #8696
- [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size by @janimo in #8707
- [CI/Build] use setuptools-scm to set version by @dtrifiro in #4738
- [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin by @LucasWilkinson in #7701
- [Kernel][LoRA] Add assertion for punica sgmv kernels by @jeejeelee in #7585
- [Core] Allow IPv6 in VLLM_HOST_IP with zmq by @russellb in #8575
- Fix typical acceptance sampler with correct recovered token ids by @jiqing-feng in #8562
- Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse by @alexm-neuralmagic in #8335
- [Hardware][AMD] ROCm6.2 upgrade by @hongxiayang in #8674
- Fix tests in test_scheduler.py that fail with BlockManager V2 by @sroy745 in #8728
- re-implement beam search on top of vllm core by @youkaichao in #8726
- Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" by @simon-mo in #8750
- [MISC] Skip dumping inputs when unpicklable by @comaniac in #8744
- [Core][Model] Support loading weights by ID within models by @petersalas in #7931
- [Model] Expose Phi3v num_crops as a mm_processor_kwarg by @alex-jw-brooks in #8658
- [Bugfix] Fix potentially unsafe custom allreduce synchronization by @hanzhi713 in #8558
- [Kernel] Split Marlin MoE kernels into multiple files by @ElizaWszola in #8661
- [Frontend] Batch inference for llm.chat() API by @aandyw in #8648
- [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` by @LucasWilkinson in #8748
- [CI/Build] fix setuptools-scm usage by @dtrifiro in #8771
- [misc] soft drop beam search by @youkaichao in #8763
- [Misc] Upgrade bitsandbytes to the latest version 0.44.0 by @jeejeelee in #8768
- [Core][Bugfix] Support prompt_logprobs returned with speculative decoding by @tjohnson31415 in #8047
- [Core] Adding Priority Scheduling by @apatke in #5958
- [Bugfix] Use heartbeats instead of health checks by @joerunde in #8583
- Fix test_schedule_swapped_simple in test_scheduler.py by @sroy745 in #8780
- [Bugfix][Kernel] Implement acquire/release polyfill for Pascal by @sasha0552 in #8776
- Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 by @sroy745 in #8752
- [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv by @zifeitong in #8250
- [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend by @Isotr0py in #8770
- [Bugfix] load fc bias from config for eagle by @sohamparikh in #8790
- [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer by @agt in #8672
- [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node by @darthhexx in #8767
- [Misc] Fix minor typo in scheduler by @wooyeonlee0 in #8765
- [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade by @hongxiayang in #8777
- [Kernel] Fullgraph and opcheck tests by @bnellnm in #8479
- [Misc] Add extra deps for openai server image by @jeejeelee in #8792
- [VLM][Bugfix] enable internvl running with num_scheduler_steps > 1 by @DefTruth in #8614
- [Core] Rename `PromptInputs` and `inputs`, with backwards compatibility by @DarkLight1337 in #8760
- [Frontend] MQLLMEngine supports profiling. by @abatom in #8761
- [Misc] Support FP8 MoE for compressed-tensors by @mgoin in #8588
- Revert "rename PromptInputs and inputs with backward compatibility (#8760)" by @simon-mo in #8810
- [Model] Add support for the multi-modal Llama 3.2 model by @heheda12345 in #8811
- [Doc] Update doc for Transformers 4.45 by @ywang96 in #8817
- [Misc] Support quantization of MllamaForCausalLM by @mgoin in #8822
New Contributors
- @blueyo0 made their first contribution in #8338
- @lnykww made their first contribution in #8403
- @vegaluisjose made their first contribution in #8423
- @chrisociepa made their first contribution in #8490
- @lewtun made their first contribution in #8501
- @russellb made their first contribution in #8543
- @shing100 made their first contribution in #8386
- @hidva made their first contribution in #8619
- @Muennighoff made their first contribution in #7922
- @saumya-saran made their first contribution in #8548
- @zyddnys made their first contribution in #8496
- @Imss27 made their first contribution in #8689
- @statelesshz made their first contribution in #8703
- @litianjian made their first contribution in #8486
- @yma11 made their first contribution in #8652
- @liuyanyi made their first contribution in #8696
- @janimo made their first contribution in #8707
- @jiqing-feng made their first contribution in #8562
- @aandyw made their first contribution in #8648
- @apatke made their first contribution in #5958
- @sohamparikh made their first contribution in #8790
- @darthhexx made their first contribution in #8767
- @abatom made their first contribution in #8761
- @heheda12345 made their first contribution in #8811
Full Changelog: v0.6.1...v0.6.2