github vllm-project/vllm v0.5.4


Highlights

Model Support

  • Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
  • Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), and MiniCPM-V (#4087, #7122); see the offline inference sketch after this list
  • Added H2O Danube3-4b (#6451)
  • Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)
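
For the expanded vision-language support, offline inference follows the consolidated VLM example format from this release: a PIL image is passed alongside the prompt via multi_modal_data. A minimal sketch, assuming BLIP-2's usual Question/Answer prompt style; the image path and prompt template are illustrative, and other models (e.g. MiniCPM-V, InternVL2) use their own placeholder formats:

```python
# Minimal offline vision-language sketch for this release's consolidated VLM
# examples. The image path and BLIP-2 style prompt are illustrative only.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Salesforce/blip2-opt-2.7b")  # BLIP-2 support added in #5920
image = Image.open("example.jpg").convert("RGB")

outputs = llm.generate(
    {
        "prompt": "Question: What is shown in this image? Answer:",
        "multi_modal_data": {"image": image},  # PIL image passed next to the prompt
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```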

Hardware Support

  • TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
  • Intel CPU: enabled multiprocessing and tensor parallelism (#6125); a configuration sketch follows this list
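
As a rough illustration of the new CPU tensor parallelism, the sketch below shards a model across two CPU workers through the multiprocessing executor; the environment variable and executor setting follow the CPU backend docs and should be treated as assumptions for your installed version, and the model is only an example:

```python
# Hedged sketch of tensor parallelism on the Intel CPU backend (#6125).
# Env var and executor backend are assumptions based on the CPU backend docs.
import os

os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "40")  # KV cache space per worker, in GiB

from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-1.3b",
    tensor_parallel_size=2,             # shard across two CPU workers
    distributed_executor_backend="mp",  # multiprocessing executor (assumed for CPU TP)
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```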

Performance

We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.

  • Separated the OpenAI server's HTTP request handling from the model inference loop using ZeroMQ, bringing a 20% improvement in time to first token and a 2x improvement in inter-token latency (#6883); a generic sketch of the pattern follows this list
  • Used Python's native array data structure to speed up padding, bringing a 15% throughput improvement in large-batch scenarios (#6779)
  • Reduced unnecessary compute when logprobs=None, cutting the latency of computing log probabilities from ~30 ms to ~5 ms in large-batch scenarios (#6532)
  • Optimized the get_seqs function for a 2% throughput improvement (#7051)
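
The first item above splits the API server and the engine into separate processes that talk over ZeroMQ, so HTTP parsing and response serialization no longer compete for Python time with the inference loop. A generic pyzmq sketch of that frontend/backend split (not vLLM's actual RPC protocol; the socket address and message shape are made up for illustration):

```python
# Illustrative frontend/backend split over ZeroMQ, mirroring the pattern in
# #6883: one process handles requests, the other runs inference. The socket
# address and JSON message format are invented for this example.
import zmq

ADDR = "ipc:///tmp/demo_rpc"

def backend() -> None:
    """Engine-side loop: receive a request, 'run inference', reply."""
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(ADDR)
    while True:
        req = sock.recv_json()
        # A real engine would run the model here and stream token deltas back.
        sock.send_json({"text": "echo: " + req["prompt"]})

def frontend() -> None:
    """API-server side: forward a request and print the reply."""
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect(ADDR)
    sock.send_json({"prompt": "Hello", "max_tokens": 8})
    print(sock.recv_json())
```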

Production Features

  • Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964); a configuration sketch follows this list
  • Refactor the punica kernel based on Triton (#5036)
  • Support for guided decoding for offline LLM (#6878)
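
For the speculative-decoding work, a hedged configuration sketch using the engine arguments of the 0.5.x line; the OPT target/draft pairing is illustrative (a small draft sharing the target's tokenizer), not a performance recommendation:

```python
# Hedged speculative-decoding configuration for the 0.5.x engine arguments;
# model choices are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",  # small draft model proposing tokens
    num_speculative_tokens=5,               # tokens proposed per step before verification
    use_v2_block_manager=True,              # required by speculative decoding in 0.5.x
)
out = llm.generate(["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```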

Quantization

  • Support for W4A8 quantization in vLLM (#5218)
  • Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
  • Support for reading bitsandbytes pre-quantized models (#5753); see the loading sketch after this list
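
Reading a bitsandbytes pre-quantized checkpoint (#5753) is driven by the load format and quantization flags; a short sketch, assuming the flag values documented for the bitsandbytes integration, with an illustrative pre-quantized model ID:

```python
# Hedged sketch of loading a bitsandbytes pre-quantized checkpoint (#5753).
# The model ID is illustrative; the flag values follow the bitsandbytes docs
# and are assumptions for this release. Note that #6846 temporarily enforces
# eager mode for bnb-quantized models.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",  # example pre-quantized 4-bit checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
print(llm.generate(["Write a haiku about GPUs."],
                   SamplingParams(max_tokens=32))[0].outputs[0].text)
```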

What's Changed

  • [Docs] Announce llama3.1 support by @WoosukKwon in #6688
  • [doc][distributed] fix doc argument order by @youkaichao in #6691
  • [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
  • [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
  • Bump version to 0.5.3.post1 by @simon-mo in #6696
  • [Misc] Add ignored layers for fp8 quantization by @mgoin in #6657
  • [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in #6652
  • [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in #6519
  • Bump transformers version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in #6690
  • [build] relax wheel size limit by @youkaichao in #6704
  • [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in #6702
  • [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in #6645
  • [bitsandbytes]: support read bnb pre-quantized model by @thesues in #5753
  • [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in #6708
  • [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in #6714
  • [Bugfix] Fix token padding for chameleon by @ywang96 in #6724
  • [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in #6680
  • [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in #6711
  • [Bugfix]fix modelscope compatible issue by @liuyhwangyh in #6730
  • Adding f-string to validation error which is missing by @luizanao in #6748
  • [Bugfix] Fix speculative decode seeded test by @njhill in #6743
  • [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. by @AllenDou in #6686
  • [Frontend] split run_server into build_server and run_server by @dtrifiro in #6740
  • [Kernels] Add fp8 support to reshape_and_cache_flash by @Yard1 in #6667
  • [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in #6712
  • [Bugfix] Bump transformers to 4.43.2 by @mgoin in #6752
  • [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in #6754
  • [core][distributed] fix zmq hang by @youkaichao in #6759
  • [Frontend] Represent tokens with identifiable strings by @ezliu in #6626
  • [Model] Adding support for MiniCPM-V by @HwwwwwwwH in #4087
  • [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in #6757
  • [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in #6745
  • [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in #6755
  • [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in #6787
  • [ Misc ] fp8-marlin channelwise via compressed-tensors by @robertgshaw2-neuralmagic in #6524
  • [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints by @mgoin in #6761
  • [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in #6788
  • [Doc] Add documentations for nightly benchmarks by @KuntaiDu in #6412
  • [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in #6798
  • [doc][distributed] improve multinode serving doc by @youkaichao in #6804
  • [Docs] Publish 5th meetup slides by @WoosukKwon in #6799
  • [Core] Fix ray forward_dag error mssg by @rkooo567 in #6792
  • [ci][distributed] fix flaky tests by @youkaichao in #6806
  • [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in #6810
  • Fix ReplicatedLinear weight loading by @qingquansong in #6793
  • [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in #6770
  • [Core] Use array to speedup padding by @peng1999 in #6779
  • [doc][debugging] add known issues for hangs by @youkaichao in #6816
  • [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in #6611
  • [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in #6838
  • [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in #6811
  • [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in #6812
  • [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in #6125
  • [Doc] Add Nemotron to supported model docs by @mgoin in #6843
  • [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in #4283
  • Update README.md by @gurpreet-dhami in #6847
  • enforce eager mode with bnb quantization temporarily by @chenqianfzh in #6846
  • [TPU] Support collective communications in XLA devices by @WoosukKwon in #6813
  • [Frontend] Factor out code for running uvicorn by @DarkLight1337 in #6828
  • [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in #6852
  • [Bugfix]: Fix Tensorizer test failures by @sangstar in #6835
  • [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in #6845
  • [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in #6844
  • [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in #6784
  • [Model] H2O Danube3-4b by @g-eoj in #6451
  • [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in #5871
  • [Doc] Add missing mock import to docs conf.py by @hmellor in #6834
  • [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in #6802
  • [Misc][VLM][Doc] Consolidate offline examples for vision language models by @ywang96 in #6858
  • [Bugfix] Fix VLM example typo by @ywang96 in #6859
  • [bugfix] make args.stream work by @WrRan in #6831
  • [CI/Build][Doc] Update CI and Doc for VLM example changes by @ywang96 in #6860
  • [Model] Initial support for BLIP-2 by @DarkLight1337 in #5920
  • [Docs] Add RunLLM chat widget by @cw75 in #6857
  • [TPU] Reduce compilation time & Upgrade PyTorch XLA version by @WoosukKwon in #6856
  • [Kernel] Increase precision of GPTQ/AWQ Marlin kernel by @alexm-neuralmagic in #6795
  • Add Nemotron to PP_SUPPORTED_MODELS by @mgoin in #6863
  • [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 by @zeyugao in #6871
  • [Model] Initialize support for InternVL2 series models by @Isotr0py in #6514
  • [Kernel] Tuned FP8 Kernels for Ada Lovelace by @varun-sundar-rabindranath in #6677
  • [Core] Reduce unnecessary compute when logprobs=None by @peng1999 in #6532
  • [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel by @tlrmchlsmth in #6901
  • [TPU] Add TPU tensor parallelism to async engine by @etwk in #6891
  • [Bugfix] Allow vllm to still work if triton is not installed. by @tdoublep in #6786
  • [Frontend] New allowed_token_ids decoding request parameter by @njhill in #6753
  • [Kernel] Remove unused variables in awq/gemm_kernels.cu by @tlrmchlsmth in #6908
  • [ci] GHA workflow to remove ready label upon "/notready" comment by @khluu in #6921
  • [Kernel] Fix marlin divide-by-zero warnings by @tlrmchlsmth in #6904
  • [Kernel] Tuned int8 kernels for Ada Lovelace by @varun-sundar-rabindranath in #6848
  • [TPU] Fix greedy decoding by @WoosukKwon in #6933
  • [Bugfix] Fix PaliGemma MMP by @ywang96 in #6930
  • [Doc] Super tiny fix doc typo by @fzyzcjy in #6949
  • [BugFix] Fix use of per-request seed with pipeline parallel by @njhill in #6698
  • [Kernel] Squash a few more warnings by @tlrmchlsmth in #6914
  • [OpenVINO] Updated OpenVINO requirements and build docs by @ilya-lavrenov in #6948
  • [Bugfix] Fix tensorizer memory profiling bug during testing by @sangstar in #6881
  • [Kernel] Remove scaled_fp8_quant kernel padding footgun by @tlrmchlsmth in #6842
  • [core][misc] improve free_finished_seq_groups by @youkaichao in #6865
  • [Build] Temporarily Disable Kernels and LoRA tests by @simon-mo in #6961
  • [Nightly benchmarking suite] Remove pkill python from run benchmark suite by @cadedaniel in #6965
  • [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists by @cadedaniel in #6706
  • [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding by @cadedaniel in #6964
  • [mypy] Enable following imports for some directories by @DarkLight1337 in #6681
  • [Bugfix] Fix broadcasting logic for multi_modal_kwargs by @DarkLight1337 in #6836
  • [CI/Build] Fix mypy errors by @DarkLight1337 in #6968
  • [Bugfix][TPU] Set readonly=True for non-root devices by @WoosukKwon in #6980
  • [Bugfix] fix logit processor excceed vocab size issue by @FeiDeng in #6927
  • Support W4A8 quantization for vllm by @HandH1998 in #5218
  • [Bugfix] Clean up MiniCPM-V by @HwwwwwwwH in #6939
  • [Bugfix] Fix feature size calculation for LLaVA-NeXT by @DarkLight1337 in #6982
  • [Model] use FusedMoE layer in Jamba by @avshalomman in #6935
  • [MISC] Introduce pipeline parallelism partition strategies by @comaniac in #6920
  • [Bugfix] Support cpu offloading with quant_method.process_weights_after_loading by @mgoin in #6960
  • [Kernel] Enable FP8 Cutlass for Ada Lovelace by @varun-sundar-rabindranath in #6950
  • [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) by @varun-sundar-rabindranath in #6996
  • [Misc] Add compressed-tensors to optimized quant list by @mgoin in #7006
  • Revert "[Frontend] Factor out code for running uvicorn" by @simon-mo in #7012
  • [Kernel][RFC] Refactor the punica kernel based on Triton by @jeejeelee in #5036
  • [Model] Pipeline parallel support for Qwen2 by @xuyi in #6924
  • [Bugfix][TPU] Do not use torch.Generator for TPUs by @WoosukKwon in #6981
  • [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings by @tjohnson31415 in #6758
  • PP comm optimization: replace send with partial send + allgather by @aurickq in #6695
  • [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user by @zifeitong in #6954
  • [core][scheduler] simplify and improve scheduler by @youkaichao in #6867
  • [Build/CI] Fixing Docker Hub quota issue. by @Alexei-V-Ivanov-AMD in #7043
  • [CI/Build] Update torch to 2.4 by @SageMoore in #6951
  • [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm by @Isotr0py in #6992
  • [CI/Build] Remove sparseml requirement from testing by @mgoin in #7037
  • [Bugfix] Lower gemma's unloaded_params exception to warning by @mgoin in #7002
  • [Models] Support Qwen model with PP by @andoorve in #6974
  • Update run-amd-test.sh by @okakarpa in #7044
  • [Misc] Support attention logits soft-capping with flash-attn by @WoosukKwon in #7022
  • [CI/Build][Bugfix] Fix CUTLASS header-only line by @tlrmchlsmth in #7034
  • [Performance] Optimize get_seqs by @WoosukKwon in #7051
  • [Kernel] Fix input for flashinfer prefill wrapper. by @LiuXiaoxuanPKU in #7008
  • [mypy] Speed up mypy checking by @DarkLight1337 in #7056
  • [ci][distributed] try to fix pp test by @youkaichao in #7054
  • Fix tracing.py by @bong-furiosa in #7065
  • [cuda][misc] remove error_on_invalid_device_count_status by @youkaichao in #7069
  • [Core] Comment out unused code in sampler by @peng1999 in #7023
  • [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend by @DamonFool in #6931
  • [ci] set timeout for test_oot_registration.py by @youkaichao in #7082
  • [CI/Build] Add support for Python 3.12 by @mgoin in #7035
  • [Misc] Disambiguate quantized types via a new ScalarType by @LucasWilkinson in #6396
  • [Core] Pipeline parallel with Ray ADAG by @ruisearch42 in #6837
  • [Misc] Revive to use loopback address for driver IP by @ruisearch42 in #7091
  • [misc] add a flag to enable compile by @youkaichao in #7092
  • [ Frontend ] Multiprocessing for OpenAI Server with zeromq by @robertgshaw2-neuralmagic in #6883
  • [ci][distributed] shorten wait time if server hangs by @youkaichao in #7098
  • [Frontend] Factor out chat message parsing by @DarkLight1337 in #7055
  • [ci][distributed] merge distributed test commands by @youkaichao in #7097
  • [ci][distributed] disable ray dag tests by @youkaichao in #7099
  • [Model] Refactor and decouple weight loading logic for InternVL2 model by @Isotr0py in #7067
  • [Bugfix] Fix block table for seqs that have prefix cache hits by @zachzzc in #7018
  • [LoRA] ReplicatedLinear support LoRA by @jeejeelee in #7081
  • [CI] Temporarily turn off H100 performance benchmark by @KuntaiDu in #7104
  • [ci][test] finalize fork_new_process_for_each_test by @youkaichao in #7114
  • [Frontend] Warn if user max_model_len is greater than derived max_model_len by @fialhocoelho in #7080
  • Support for guided decoding for offline LLM by @kevinbu233 in #6878
  • [misc] add zmq in collect env by @youkaichao in #7119
  • [core][misc] simply output processing with shortcut for non-parallel sampling and non-beam search usecase by @youkaichao in #7117
  • [Model]Refactor MiniCPMV by @jeejeelee in #7020
  • [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator by @tdoublep in #7105
  • [misc][distributed] improve libcudart.so finding by @youkaichao in #7127
  • Clean up remaining Punica C information by @jeejeelee in #7027
  • [Model] Add multi-image support for minicpmv offline inference by @HwwwwwwwH in #7122
  • [Frontend] Reapply "Factor out code for running uvicorn" by @DarkLight1337 in #7095
  • [Model] SiglipVisionModel ported from transformers by @ChristopherCho in #6942
  • [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification by @cadedaniel in #6963
  • [SpecDecode] Support FlashInfer in DraftModelRunner by @bong-furiosa in #6926
  • [BugFix] Use IP4 localhost form for zmq bind by @njhill in #7163
  • [BugFix] Use args.trust_remote_code by @VastoLorde95 in #7121
  • [Misc] Fix typo in GroupCoordinator.recv() by @ruisearch42 in #7167
  • [Kernel] Update CUTLASS to 3.5.1 by @tlrmchlsmth in #7085
  • [CI/Build] Suppress divide-by-zero and missing return statement warnings by @tlrmchlsmth in #7001
  • [Bugfix][CI/Build] Fix CUTLASS FetchContent by @tlrmchlsmth in #7171
  • bump version to v0.5.4 by @simon-mo in #7139

New Contributors

Full Changelog: v0.5.3...v0.5.4
