github vllm-project/vllm v0.5.2

latest releases: v0.6.3.post1, v0.6.3, v0.6.2...
3 months ago

Major Changes

  • ❗Planned breaking change ❗: we plan to remove beam search (see more in #6226) in the next few releases. This release come with a warning when beam search is enabled for the request. Please voice your concern in the RFC if you do have a valid use case for beam search in vLLM
  • The release has moved to a Python version agnostic wheel (#6394). A single wheel can be installed across Python versions vLLM supports.

Highlights

Model Support

Hardware

  • AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)

Performance

  • ZeroMQ fallback for broadcasting large objects (#6183)
  • Simplify code to support pipeline parallel (#6406)
  • Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
  • Use CUTLASS kernels for the FP8 layers with Bias (#6270)

Features

  • Enabling bonus token in speculative decoding for KV cache based models (#5765)
  • Medusa Implementation with Top-1 proposer (#4978)
  • An experimental vLLM CLI for serving and querying OpenAI compatible server (#5090)

Others

  • Add support for multi-node on CI (#5955)
  • Benchmark: add H100 suite (#6047)
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
  • Build some nightly wheels (#6380)

What's Changed

New Contributors

Full Changelog: v0.5.1...v0.5.2

Don't miss a new vllm release

NewReleases is sending notifications on new releases.