vllm-project/vllm v0.6.2

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822)

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
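
    The command above starts vLLM's OpenAI-compatible server. As a minimal sketch (assuming the server is reachable on the default port 8000 and the openai Python client is installed; the image URL is a placeholder), a multimodal request can then be sent like this:

      from openai import OpenAI

      # Point the standard OpenAI client at the local vLLM server; the API key is unused.
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      response = client.chat.completions.create(
          model="meta-llama/Llama-3.2-11B-Vision-Instruct",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "Describe this image in one sentence."},
                  # Placeholder image URL -- replace with your own input.
                  {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
              ],
          }],
          max_tokens=64,
      )
      print(response.choices[0].message.content)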
    
  • Beam search has been soft-deprecated. We are moving towards a more performant implementation of beam search that also simplifies vLLM's core. (#8684, #8763, #8713)

    • ⚠️ You will now see the following error; this is a breaking change!

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306
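
      As a transitional sketch (the environment variable name comes from the message above; the model choice and prompt are purely illustrative), existing code that still sets use_beam_search on SamplingParams can keep running while you migrate to the dedicated beam search API referenced in the message; check the vllm.LLM docstring for its exact signature in your version.

      import os

      # Suppress the deprecation error; set this before constructing the engine
      # or sampling params. It only silences the error -- the dedicated beam
      # search API remains the long-term replacement.
      os.environ["VLLM_ALLOW_DEPRECATED_BEAM_SEARCH"] = "1"

      from vllm import LLM, SamplingParams

      llm = LLM(model="facebook/opt-125m")  # illustrative model
      params = SamplingParams(use_beam_search=True, best_of=4,
                              temperature=0.0, max_tokens=32)
      print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)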

  • Support for the Solar model (#8386), MiniCPM3 (#8297), and LLaVA-Onevision (#8486)

  • Enhancements: pipeline parallelism for Qwen2-VL (#8696), multiple-image input for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), and tensor parallelism with bitsandbytes quantization (#8434)

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable MRoPE and support Qwen2-VL on the CPU backend (#8770)
  • AMD: custom paged attention kernel for ROCm (#8310) and FP8 KV cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)
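
    As a minimal sketch of the batched llm.chat() usage above (the model name is illustrative and keyword names may differ slightly between versions; check the vllm.LLM.chat docstring), several conversations can be passed in a single call so the engine batches them:

      from vllm import LLM, SamplingParams

      llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # illustrative model choice

      # A list of conversations, each a list of OpenAI-style chat messages.
      conversations = [
          [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
          [{"role": "user", "content": "What is PagedAttention?"}],
      ]

      outputs = llm.chat(conversations, SamplingParams(max_tokens=64))
      for out in outputs:
          print(out.outputs[0].text)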

Performance

  • Introduce MQLLMEngine for the API server, boosting throughput by 30% in single-step and 7% in multi-step scheduling (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add CUDA graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sampling from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed

  • [MISC] Dump model runner inputs when crashing by @comaniac in #8305
  • [misc] remove engine_use_ray by @youkaichao in #8126
  • [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
  • Fix the AMD weight loading tests by @mgoin in #8390
  • [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
  • [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
  • [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
  • [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
  • [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
  • [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
  • [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
  • [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
  • [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
  • [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
  • [Bugfix] Offline mode fix by @joerunde in #8376
  • [multi-step] add flashinfer backend by @SolitaryThinker in #7928
  • [Core] Add engine option to return only deltas or final output by @njhill in #7381
  • [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
  • [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
  • [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
  • [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
  • [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
  • [Misc] Update Pixtral example by @ywang96 in #8431
  • [BugFix] fix group_topk by @dsikka in #8430
  • [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
  • [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
  • [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
  • [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
  • [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
  • [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
  • bump version to v0.6.1.post1 by @simon-mo in #8440
  • [CI/Build] Enable InternVL2 PP test only on single node by @Isotr0py in #8437
  • [doc] recommend pip instead of conda by @youkaichao in #8446
  • [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by @jeejeelee in #8442
  • [misc][ci] fix quant test by @youkaichao in #8449
  • [Installation] Gate FastAPI version for Python 3.8 by @DarkLight1337 in #8456
  • [plugin][torch.compile] allow to add custom compile backend by @youkaichao in #8445
  • [CI/Build] Reorganize models tests by @DarkLight1337 in #7820
  • [Doc] Add oneDNN installation to CPU backend documentation by @Isotr0py in #8467
  • [HotFix] Fix final output truncation with stop string + streaming by @njhill in #8468
  • bump version to v0.6.1.post2 by @simon-mo in #8473
  • [Hardware][intel GPU] bump up ipex version to 2.3 by @jikunshang in #8365
  • [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by @charlifu in #8310
  • [Model] support minicpm3 by @SUDA-HLT-ywfang in #8297
  • [torch.compile] fix functionalization by @youkaichao in #8480
  • [torch.compile] add a flag to disable custom op by @youkaichao in #8488
  • [TPU] Implement multi-step scheduling by @WoosukKwon in #8489
  • [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by @chrisociepa in #8490
  • [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel by @Isotr0py in #8357
  • [Kernel] Enable 8-bit weights in Fused Marlin MoE by @ElizaWszola in #8032
  • [Frontend] Expose revision arg in OpenAI server by @lewtun in #8501
  • [BugFix] Fix clean shutdown issues by @njhill in #8492
  • [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by @sasha0552 in #8506
  • [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by @ProExpertProg in #7270
  • [doc] update doc on testing and debugging by @youkaichao in #8514
  • [Bugfix] Bind api server port before starting engine by @kevin314 in #8491
  • [perf bench] set timeout to debug hanging by @simon-mo in #8516
  • [misc] small qol fixes for release process by @simon-mo in #8517
  • [Bugfix] Fix 3.12 builds on main by @joerunde in #8510
  • [refactor] remove triton based sampler by @simon-mo in #8524
  • [Frontend] Improve Nullable kv Arg Parsing by @alex-jw-brooks in #8525
  • [Misc][Bugfix] Disable guided decoding for mistral tokenizer by @ywang96 in #8521
  • [torch.compile] register allreduce operations as custom ops by @youkaichao in #8526
  • [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by @ruisearch42 in #8509
  • [Benchmark] Support sample from HF datasets and image input for benchmark_serving by @Isotr0py in #8495
  • [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by @sroy745 in #7631
  • [Feature][kernel] tensor parallelism with bitsandbytes quantization by @chenqianfzh in #8434
  • [Model] Add mistral function calling format to all models loaded with "mistral" format by @patrickvonplaten in #8515
  • [Misc] Don't dump contents of kvcache tensors on errors by @njhill in #8527
  • [Bugfix] Fix TP > 1 for new granite by @joerunde in #8544
  • [doc] improve installation doc by @youkaichao in #8550
  • [CI/Build] Excluding kernels/test_gguf.py from ROCm by @alexeykondrat in #8520
  • [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching by @tlrmchlsmth in #8012
  • [CI/Build] fix Dockerfile.cpu on podman by @dtrifiro in #8540
  • [Misc] Add argument to disable FastAPI docs by @Jeffwan in #8554
  • [CI/Build] Avoid CUDA initialization by @DarkLight1337 in #8534
  • [CI/Build] Update Ruff version by @aarnphm in #8469
  • [Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH by @alexm-neuralmagic in #8157
  • [Core] Prompt logprobs support in Multi-step by @afeldman-nm in #8199
  • [Core] zmq: bind only to 127.0.0.1 for local-only usage by @russellb in #8543
  • [Model] Support Solar Model by @shing100 in #8386
  • [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call by @gshtras in #8380
  • [Kernel] Change interface to Mamba selective_state_update for continuous batching by @tlrmchlsmth in #8039
  • [BugFix] Nonzero exit code if MQLLMEngine startup fails by @njhill in #8572
  • [Bugfix] add dead_error property to engine client by @joerunde in #8574
  • [Kernel] Remove marlin moe templating on thread_m_blocks by @tlrmchlsmth in #8573
  • [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. by @sroy745 in #8545
  • Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" by @ywang96 in #8593
  • [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py by @KuntaiDu in #8616
  • [MISC] remove engine_use_ray in benchmark_throughput.py by @jikunshang in #8615
  • [Frontend] Use MQLLMEngine for embeddings models too by @njhill in #8584
  • [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention by @charlifu in #8577
  • [Core] simplify logits resort in _apply_top_k_top_p by @hidva in #8619
  • [Doc] Add documentation for GGUF quantization by @Isotr0py in #8618
  • Create SECURITY.md by @simon-mo in #8642
  • [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail by @alexeykondrat in #8551
  • [Misc] guard against change in cuda library name by @bnellnm in #8609
  • [Bugfix] Fix Phi3.5 mini and MoE LoRA inference by @garg-amit in #8571
  • [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata by @SolitaryThinker in #8474
  • [Core] Support Lora lineage and base model metadata management by @Jeffwan in #6315
  • [Model] Add OLMoE by @Muennighoff in #7922
  • [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build by @alexeykondrat in #8670
  • [Bugfix] Validate SamplingParam n is an int by @saumya-saran in #8548
  • [Misc] Show AMD GPU topology in collect_env.py by @DarkLight1337 in #8649
  • [Bugfix] Config.__init__() got an unexpected keyword argument 'engine' api_server args by @Juelianqvq in #8556
  • [Bugfix][Core] Fix tekken edge case for mistral tokenizer by @patrickvonplaten in #8640
  • [Doc] neuron documentation update by @omrishiv in #8671
  • [Hardware][AWS] update neuron to 2.20 by @omrishiv in #8676
  • [Bugfix] Fix incorrect llava next feature size calculation by @zyddnys in #8496
  • [Core] Rename PromptInputs to PromptType, and inputs to prompt by @DarkLight1337 in #8673
  • [MISC] add support custom_op check by @jikunshang in #8557
  • [Core] Factor out common code in SequenceData and Sequence by @DarkLight1337 in #8675
  • [beam search] add output for manually checking the correctness by @youkaichao in #8684
  • [Kernel] Build flash-attn from source by @ProExpertProg in #8245
  • [VLM] Use SequenceData.from_token_counts to create dummy data by @DarkLight1337 in #8687
  • [Doc] Fix typo in AMD installation guide by @Imss27 in #8689
  • [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 by @rasmith in #8646
  • [dbrx] refactor dbrx experts to extend FusedMoe class by @divakar-amd in #8518
  • [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu by @tlrmchlsmth in #8643
  • [Bugfix] Refactor composite weight loading logic by @Isotr0py in #8656
  • [ci][build] fix vllm-flash-attn by @youkaichao in #8699
  • [Model] Refactor BLIP/BLIP-2 to support composite model loading by @DarkLight1337 in #8407
  • [Misc] Use NamedTuple in Multi-image example by @alex-jw-brooks in #8705
  • [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler by @statelesshz in #8703
  • [Model][VLM] Add LLaVA-Onevision model support by @litianjian in #8486
  • [SpecDec][Misc] Cleanup, remove bonus token logic. by @LiuXiaoxuanPKU in #8701
  • [build] enable existing pytorch (for GH200, aarch64, nightly) by @youkaichao in #8713
  • [misc] upgrade mistral-common by @youkaichao in #8715
  • [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building by @tlrmchlsmth in #8702
  • [Bugfix] Fix CPU CMake build by @ProExpertProg in #8723
  • [Bugfix] fix docker build for xpu by @yma11 in #8652
  • [Core][Frontend] Support Passing Multimodal Processor Kwargs by @alex-jw-brooks in #8657
  • [Hardware][CPU] Refactor CPU model runner by @Isotr0py in #8729
  • [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner by @bigPYJ1151 in #8733
  • [Model] Support pp for qwen2-vl by @liuyanyi in #8696
  • [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size by @janimo in #8707
  • [CI/Build] use setuptools-scm to set version by @dtrifiro in #4738
  • [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin by @LucasWilkinson in #7701
  • [Kernel][LoRA] Add assertion for punica sgmv kernels by @jeejeelee in #7585
  • [Core] Allow IPv6 in VLLM_HOST_IP with zmq by @russellb in #8575
  • Fix typical acceptance sampler with correct recovered token ids by @jiqing-feng in #8562
  • Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse by @alexm-neuralmagic in #8335
  • [Hardware][AMD] ROCm6.2 upgrade by @hongxiayang in #8674
  • Fix tests in test_scheduler.py that fail with BlockManager V2 by @sroy745 in #8728
  • re-implement beam search on top of vllm core by @youkaichao in #8726
  • Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" by @simon-mo in #8750
  • [MISC] Skip dumping inputs when unpicklable by @comaniac in #8744
  • [Core][Model] Support loading weights by ID within models by @petersalas in #7931
  • [Model] Expose Phi3v num_crops as a mm_processor_kwarg by @alex-jw-brooks in #8658
  • [Bugfix] Fix potentially unsafe custom allreduce synchronization by @hanzhi713 in #8558
  • [Kernel] Split Marlin MoE kernels into multiple files by @ElizaWszola in #8661
  • [Frontend] Batch inference for llm.chat() API by @aandyw in #8648
  • [Bugfix] Fix torch dynamo fixes caused by replace_parameters by @LucasWilkinson in #8748
  • [CI/Build] fix setuptools-scm usage by @dtrifiro in #8771
  • [misc] soft drop beam search by @youkaichao in #8763
  • [Misc] Upgrade bitsandbytes to the latest version 0.44.0 by @jeejeelee in #8768
  • [Core][Bugfix] Support prompt_logprobs returned with speculative decoding by @tjohnson31415 in #8047
  • [Core] Adding Priority Scheduling by @apatke in #5958
  • [Bugfix] Use heartbeats instead of health checks by @joerunde in #8583
  • Fix test_schedule_swapped_simple in test_scheduler.py by @sroy745 in #8780
  • [Bugfix][Kernel] Implement acquire/release polyfill for Pascal by @sasha0552 in #8776
  • Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 by @sroy745 in #8752
  • [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv by @zifeitong in #8250
  • [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend by @Isotr0py in #8770
  • [Bugfix] load fc bias from config for eagle by @sohamparikh in #8790
  • [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer by @agt in #8672
  • [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node by @darthhexx in #8767
  • [Misc] Fix minor typo in scheduler by @wooyeonlee0 in #8765
  • [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade by @hongxiayang in #8777
  • [Kernel] Fullgraph and opcheck tests by @bnellnm in #8479
  • [Misc] Add extra deps for openai server image by @jeejeelee in #8792
  • [VLM][Bugfix] enable internvl running with num_scheduler_steps > 1 by @DefTruth in #8614
  • [Core] Rename PromptInputs and inputs, with backwards compatibility by @DarkLight1337 in #8760
  • [Frontend] MQLLMEngine supports profiling. by @abatom in #8761
  • [Misc] Support FP8 MoE for compressed-tensors by @mgoin in #8588
  • Revert "rename PromptInputs and inputs with backward compatibility (#8760) by @simon-mo in #8810
  • [Model] Add support for the multi-modal Llama 3.2 model by @heheda12345 in #8811
  • [Doc] Update doc for Transformers 4.45 by @ywang96 in #8817
  • [Misc] Support quantization of MllamaForCausalLM by @mgoin in #8822

New Contributors

  • @blueyo0 made their first contribution in #8338
  • @lnykww made their first contribution in #8403
  • @vegaluisjose made their first contribution in #8423
  • @chrisociepa made their first contribution in #8490
  • @lewtun made their first contribution in #8501
  • @russellb made their first contribution in #8543
  • @shing100 made their first contribution in #8386
  • @hidva made their first contribution in #8619
  • @Muennighoff made their first contribution in #7922
  • @saumya-saran made their first contribution in #8548
  • @zyddnys made their first contribution in #8496
  • @Imss27 made their first contribution in #8689
  • @statelesshz made their first contribution in #8703
  • @litianjian made their first contribution in #8486
  • @yma11 made their first contribution in #8652
  • @liuyanyi made their first contribution in #8696
  • @janimo made their first contribution in #8707
  • @jiqing-feng made their first contribution in #8562
  • @aandyw made their first contribution in #8648
  • @apatke made their first contribution in #5958
  • @sohamparikh made their first contribution in #8790
  • @darthhexx made their first contribution in #8767
  • @abatom made their first contribution in #8761
  • @heheda12345 made their first contribution in #8811

Full Changelog: v0.6.1...v0.6.2
