v0.6.2

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822); a minimal client sketch follows this list
     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search has been soft-deprecated. We are moving towards a version of beam search that is more performant and also simplifies vLLM's core. (#8684, #8763, #8713)
    • ⚠️ You will now see the following error; this is a breaking change!

Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • Support for Solar (#8386), MiniCPM3 (#8297), and LLaVA-OneVision (#8486)
  • Enhancements: pipeline parallelism for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
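
As a follow-up to the Llama 3.2 serving command above, here is a minimal client sketch that queries the server through its OpenAI-compatible API. It is a sketch under assumptions: the server listens at the default http://localhost:8000/v1, and the image URL is a placeholder for any publicly reachable image.

    from openai import OpenAI

    # Point the OpenAI client at the local vLLM server (default port 8000).
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # Placeholder URL; substitute any publicly reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
        max_tokens=64,
    )
    print(response.choices[0].message.content)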

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for ROCm (#8310) and FP8 KV cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648); see the sketch after this list
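
A minimal sketch of the batched llm.chat() call from #8648, assuming it accepts a list of conversations (each a list of OpenAI-style message dicts); the model name is only illustrative.

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative chat model
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # One conversation per entry; the whole batch is submitted in a single call.
    conversations = [
        [{"role": "user", "content": "Write a haiku about GPUs."}],
        [{"role": "user", "content": "Explain paged attention in one sentence."}],
    ]

    outputs = llm.chat(conversations, params)
    for out in outputs:
        print(out.outputs[0].text)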

Performance

  • Introduce MQLLMEngine for the API server, boosting throughput by 30% in single-step and 7% in multi-step scheduling (#8157, #8761, #8584)
  • Multi-step scheduling enhancements (a usage sketch follows this list)
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add cuda graph support during decoding for encoder-decoder models (#7631)
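
For reference, a hedged sketch of enabling multi-step scheduling in the offline API; it assumes the num_scheduler_steps engine argument (CLI: --num-scheduler-steps) remains the control knob, and the model name is illustrative.

    from vllm import LLM, SamplingParams

    # Assumption: multi-step scheduling is enabled by num_scheduler_steps > 1,
    # letting the engine run several decode steps per scheduler invocation.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", num_scheduler_steps=8)

    outputs = llm.generate(["Multi-step scheduling reduces CPU overhead by"],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)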

Others

  • Support sampling from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed

  • [MISC] Dump model runner inputs when crashing by @comaniac in #8305
  • [misc] remove engine_use_ray by @youkaichao in #8126
  • [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
  • Fix the AMD weight loading tests by @mgoin in #8390
  • [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
  • [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
  • [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
  • [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
  • [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
  • [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
  • [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
  • [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
  • [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
  • [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
  • [Bugfix] Offline mode fix by @joerunde in #8376
  • [multi-step] add flashinfer backend by @SolitaryThinker in #7928
  • [Core] Add engine option to return only deltas or final output by @njhill in #7381
  • [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
  • [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
  • [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
  • [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
  • [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
  • [Misc] Update Pixtral example by @ywang96 in #8431
  • [BugFix] fix group_topk by @dsikka in #8430
  • [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
  • [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
  • [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
  • [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
  • [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
  • [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
  • bump version to v0.6.1.post1 by @simon-mo in #8440
  • [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
  • [doc] recommend pip instead of conda by @youkaichao in #8446
  • [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
  • [misc][ci] fix quant test by @youkaichao in #8449
  • [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
  • [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
  • [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
  • [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
  • [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
  • bump version to v0.6.1.post2 by @simon-mo in #8473
  • [Hardware][intel GPU] bump up ipex version to 2.3 by @jikunshang in #8365
  • [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by @charlifu in #8310
  • [Model] support minicpm3 by @SUDA-HLT-ywfang in #8297
  • [torch.compile] fix functionalization by @youkaichao in #8480
  • [torch.compile] add a flag to disable custom op by @youkaichao in #8488
  • [TPU] Implement multi-step scheduling by @WoosukKwon in #8489
  • [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by @chrisociepa in #8490
  • [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel by @Isotr0py in #8357
  • [Kernel] Enable 8-bit weights in Fused Marlin MoE by @ElizaWszola in #8032
  • [Frontend] Expose revision arg in OpenAI server by @lewtun in #8501
  • [BugFix] Fix clean shutdown issues by @njhill in #8492
  • [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by @sasha0552 in #8506
  • [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by @ProExpertProg in #7270
  • [doc] update doc on testing and debugging by @youkaichao in #8514
  • [Bugfix] Bind api server port before starting engine by @kevin314 in #8491
  • [perf bench] set timeout to debug hanging by @simon-mo in #8516
  • [misc] small qol fixes for release process by @simon-mo in #8517
  • [Bugfix] Fix 3.12 builds on main by @joerunde in #8510
  • [refactor] remove triton based sampler by @simon-mo in #8524
  • [Frontend] Improve Nullable kv Arg Parsing by @alex-jw-brooks in #8525
  • [Misc][Bugfix] Disable guided decoding for mistral tokenizer by @ywang96 in #8521
  • [torch.compile] register allreduce operations as custom ops by @youkaichao in #8526
  • [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by @ruisearch42 in #8509
  • [Benchmark] Support sample from HF datasets and image input for benchmark_serving by @Isotr0py in #8495
  • [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by @sroy745 in #7631
  • [Feature][kernel] tensor parallelism with bitsandbytes quantization by @chenqianfzh in #8434
  • [Model] Add mistral function calling format to all models loaded with "mistral" format by @patrickvonplaten in #8515
  • [Misc] Don't dump contents of kvcache tensors on errors by @njhill in #8527
  • [Bugfix] Fix TP > 1 for new granite by @joerunde in #8544
  • [doc] improve installation doc by @youkaichao in #8550
  • [CI/Build] Excluding kernels/test_gguf.py from ROCm by @alexeykondrat in #8520
  • [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching by @tlrmchlsmth in #8012
  • [CI/Build] fix Dockerfile.cpu on podman by @dtrifiro in #8540
  • [Misc] Add argument to disable FastAPI docs by @Jeffwan in #8554
  • [CI/Build] Avoid CUDA initialization by @DarkLight1337 in #8534
  • [CI/Build] Update Ruff version by @aarnphm in #8469
  • [Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH by @alexm-neuralmagic in #8157
  • [Core] Prompt logprobs support in Multi-step by @afeldman-nm in #8199
  • [Core] zmq: bind only to 127.0.0.1 for local-only usage by @russellb in #8543
  • [Model] Support Solar Model by @shing100 in #8386
  • [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call by @gshtras in #8380
  • [Kernel] Change interface to Mamba selective_state_update for continuous batching by @tlrmchlsmth in #8039
  • [BugFix] Nonzero exit code if MQLLMEngine startup fails by @njhill in #8572
  • [Bugfix] add dead_error property to engine client by @joerunde in #8574
  • [Kernel] Remove marlin moe templating on thread_m_blocks by @tlrmchlsmth in #8573
  • [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. by @sroy745 in #8545
  • Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" by @ywang96 in #8593
  • [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py by @KuntaiDu in #8616
  • [MISC] remove engine_use_ray in benchmark_throughput.py by @jikunshang in #8615
  • [Frontend] Use MQLLMEngine for embeddings models too by @njhill in #8584
  • [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention by @charlifu in #8577
  • [Core] simplify logits resort in _apply_top_k_top_p by @hidva in #8619
  • [Doc] Add documentation for GGUF quantization by @Isotr0py in #8618
  • Create SECURITY.md by @simon-mo in #8642
  • [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail by @alexeykondrat in #8551
  • [Misc] guard against change in cuda library name by @bnellnm in #8609
  • [Bugfix] Fix Phi3.5 mini and MoE LoRA inference by @garg-amit in #8571
  • [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata by @SolitaryThinker in #8474
  • [Core] Support Lora lineage and base model metadata management by @Jeffwan in #6315
  • [Model] Add OLMoE by @Muennighoff in #7922
  • [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build by @alexeykondrat in #8670
  • [Bugfix] Validate SamplingParam n is an int by @saumya-saran in #8548
  • [Misc] Show AMD GPU topology in collect_env.py by @DarkLight1337 in #8649
  • [Bugfix] Config.init() got an unexpected keyword argument 'engine' api_server args by @Juelianqvq in #8556
  • [Bugfix][Core] Fix tekken edge case for mistral tokenizer by @patrickvonplaten in #8640
  • [Doc] neuron documentation update by @omrishiv in #8671
  • [Hardware][AWS] update neuron to 2.20 by @omrishiv in #8676
  • [Bugfix] Fix incorrect llava next feature size calculation by @zyddnys in #8496
  • [Core] Rename PromptInputs to PromptType, and inputs to prompt by @DarkLight1337 in #8673
  • [MISC] add support custom_op check by @jikunshang in #8557
  • [Core] Factor out common code in SequenceData and Sequence by @DarkLight1337 in #8675
  • [beam search] add output for manually checking the correctness by @youkaichao in #8684
  • [Kernel] Build flash-attn from source by @ProExpertProg in #8245
  • [VLM] Use SequenceData.from_token_counts to create dummy data by @DarkLight1337 in #8687
  • [Doc] Fix typo in AMD installation guide by @Imss27 in #8689
  • [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 by @rasmith in #8646
  • [dbrx] refactor dbrx experts to extend FusedMoe class by @divakar-amd in #8518
  • [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu by @tlrmchlsmth in #8643
  • [Bugfix] Refactor composite weight loading logic by @Isotr0py in #8656
  • [ci][build] fix vllm-flash-attn by @youkaichao in #8699
  • [Model] Refactor BLIP/BLIP-2 to support composite model loading by @DarkLight1337 in #8407
  • [Misc] Use NamedTuple in Multi-image example by @alex-jw-brooks in #8705
  • [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler by @statelesshz in #8703
  • [Model][VLM] Add LLaVA-Onevision model support by @litianjian in #8486
  • [SpecDec][Misc] Cleanup, remove bonus token logic. by @LiuXiaoxuanPKU in #8701
  • [build] enable existing pytorch (for GH200, aarch64, nightly) by @youkaichao in #8713
  • [misc] upgrade mistral-common by @youkaichao in #8715
  • [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building by @tlrmchlsmth in #8702
  • [Bugfix] Fix CPU CMake build by @ProExpertProg in #8723
  • [Bugfix] fix docker build for xpu by @yma11 in #8652
  • [Core][Frontend] Support Passing Multimodal Processor Kwargs by @alex-jw-brooks in #8657
  • [Hardware][CPU] Refactor CPU model runner by @Isotr0py in #8729
  • [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner by @bigPYJ1151 in #8733
  • [Model] Support pp for qwen2-vl by @liuyanyi in #8696
  • [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size by @janimo in #8707
  • [CI/Build] use setuptools-scm to set version by @dtrifiro in #4738
  • [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin by @LucasWilkinson in #7701
  • [Kernel][LoRA] Add assertion for punica sgmv kernels by @jeejeelee in #7585
  • [Core] Allow IPv6 in VLLM_HOST_IP with zmq by @russellb in #8575
  • Fix typical acceptance sampler with correct recovered token ids by @jiqing-feng in #8562
  • Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse by @alexm-neuralmagic in #8335
  • [Hardware][AMD] ROCm6.2 upgrade by @hongxiayang in #8674
  • Fix tests in test_scheduler.py that fail with BlockManager V2 by @sroy745 in #8728
  • re-implement beam search on top of vllm core by @youkaichao in #8726
  • Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" by @simon-mo in #8750
  • [MISC] Skip dumping inputs when unpicklable by @comaniac in #8744
  • [Core][Model] Support loading weights by ID within models by @petersalas in #7931
  • [Model] Expose Phi3v num_crops as a mm_processor_kwarg by @alex-jw-brooks in #8658
  • [Bugfix] Fix potentially unsafe custom allreduce synchronization by @hanzhi713 in #8558
  • [Kernel] Split Marlin MoE kernels into multiple files by @ElizaWszola in #8661
  • [Frontend] Batch inference for llm.chat() API by @aandyw in #8648
  • [Bugfix] Fix torch dynamo fixes caused by replace_parameters by @LucasWilkinson in #8748
  • [CI/Build] fix setuptools-scm usage by @dtrifiro in #8771
  • [misc] soft drop beam search by @youkaichao in #8763
  • [Misc] Upgrade bitsandbytes to the latest version 0.44.0 by @jeejeelee in #8768
  • [Core][Bugfix] Support prompt_logprobs returned with speculative decoding by @tjohnson31415 in #8047
  • [Core] Adding Priority Scheduling by @apatke in #5958
  • [Bugfix] Use heartbeats instead of health checks by @joerunde in #8583
  • Fix test_schedule_swapped_simple in test_scheduler.py by @sroy745 in #8780
  • [Bugfix][Kernel] Implement acquire/release polyfill for Pascal by @sasha0552 in #8776
  • Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 by @sroy745 in #8752
  • [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv by @zifeitong in #8250
  • [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend by @Isotr0py in #8770
  • [Bugfix] load fc bias from config for eagle by @sohamparikh in #8790
  • [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer by @agt in #8672
  • [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node by @darthhexx in #8767
  • [Misc] Fix minor typo in scheduler by @wooyeonlee0 in #8765
  • [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade by @hongxiayang in #8777
  • [Kernel] Fullgraph and opcheck tests by @bnellnm in #8479
  • [Misc] Add extra deps for openai server image by @jeejeelee in #8792
  • [VLM][Bugfix] enable internvl running with num_scheduler_steps > 1 by @DefTruth in #8614
  • [Core] Rename PromptInputs and inputs, with backwards compatibility by @DarkLight1337 in #8760
  • [Frontend] MQLLMEngine supports profiling. by @abatom in #8761
  • [Misc] Support FP8 MoE for compressed-tensors by @mgoin in #8588
  • Revert "rename PromptInputs and inputs with backward compatibility (#8760) by @simon-mo in #8810
  • [Model] Add support for the multi-modal Llama 3.2 model by @heheda12345 in #8811
  • [Doc] Update doc for Transformers 4.45 by @ywang96 in #8817
  • [Misc] Support quantization of MllamaForCausalLM by @mgoin in #8822

New Contributors

  • @blueyo0 made their first contribution in #8338
  • @lnykww made their first contribution in #8403
  • @vegaluisjose made their first contribution in #8423
  • @chrisociepa made their first contribution in #8490
  • @lewtun made their first contribution in #8501
  • @russellb made their first contribution in #8543
  • @shing100 made their first contribution in #8386
  • @hidva made their first contribution in #8619
  • @Muennighoff made their first contribution in #7922
  • @saumya-saran made their first contribution in #8548
  • @zyddnys made their first contribution in #8496
  • @Imss27 made their first contribution in #8689
  • @statelesshz made their first contribution in #8703
  • @litianjian made their first contribution in #8486
  • @yma11 made their first contribution in #8652
  • @liuyanyi made their first contribution in #8696
  • @janimo made their first contribution in #8707
  • @jiqing-feng made their first contribution in #8562
  • @aandyw made their first contribution in #8648
  • @apatke made their first contribution in #5958
  • @sohamparikh made their first contribution in #8790
  • @darthhexx made their first contribution in #8767
  • @abatom made their first contribution in #8761
  • @heheda12345 made their first contribution in #8811

Full Changelog: v0.6.1...v0.6.2
