github vllm-project/vllm v0.5.3

one month ago

Highlights

Model Support

  • vLLM now supports Meta Llama 3.1! Please check out our blog here for initial details on running the model.
    • Please check out this thread for any known issues related to the model.
    • The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (#6606, #6547, #6487, #6593, #6511, #6515, #6552)
    • The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have a fast network interconnect, you might want to consider full tensor parallelism as well. (#6599, #6598, #6529, #6569)
    • To support long context, a new RoPE extension method has been added and chunked prefill is now turned on by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
  • Support Mistral-Nemo (#6548)
  • Support Chameleon (#6633, #5770)
  • Pipeline parallel support for Mixtral (#6516)
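
As a rough illustration of why the FP8 build fits on one 8-GPU node while the BF16 build is expected to span several, here is a back-of-the-envelope sketch. It assumes the 405B-parameter Llama 3.1 variant and 80 GB per GPU; neither figure is stated in these notes. It counts weights only, so the result is a lower bound: KV cache, activations, and runtime overhead need additional memory.

```python
import math

# Assumptions (not taken from the release notes):
PARAMS = 405e9          # the 405B-parameter Llama 3.1 variant
GPU_MEMORY_GB = 80      # 80 GB per H100/A100
GPUS_PER_NODE = 8

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_nodes(num_params: float, dtype: str) -> int:
    """Smallest node count whose combined GPU memory can hold the weights."""
    node_capacity_gb = GPU_MEMORY_GB * GPUS_PER_NODE
    return math.ceil(weight_memory_gb(num_params, dtype) / node_capacity_gb)


for dtype in ("fp8", "bf16"):
    print(dtype, weight_memory_gb(PARAMS, dtype), "GB ->", min_nodes(PARAMS, dtype), "node(s)")
```

Under these assumptions the FP8 weights take about 405 GB of a node's 640 GB, while the BF16 weights take about 810 GB and so cannot fit on a single node.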

Hardware Support

Performance Enhancements

  • Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) perf improvements to existing AWQ models! (#6612)
  • Progress towards refactoring for SPMD worker execution. (#6032)
  • Progress in improving prepare inputs procedure. (#6164, #6338, #6596)
  • Memory optimization for pipeline parallelism. (#6455)

Production Engine

  • Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
  • Support dynamically loading LoRA adapters from Hugging Face (#6234)
  • Pipeline Parallel using stdlib multiprocessing module (#6130)

Others

  • CPU offloading is now available: use --cpu-offload-gb to control how much CPU memory is used to "extend" GPU memory. (#6496)
  • The new vllm CLI is now ready for testing. It comes with three commands: serve, complete, and chat. Feedback and improvements are very welcome! (#6431)
  • The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
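
To make the CLI and offloading notes above concrete, here is a minimal sketch that only assembles a `vllm serve` command line and does not run it. The helper `build_serve_command` and the model name `my-org/my-model` are hypothetical placeholders. The sketch also assumes that `serve` takes the model as a positional argument and accepts the engine flag `--tensor-parallel-size`; check `vllm serve --help` for the options your version supports.

```python
import shlex
from typing import List


def build_serve_command(model: str, cpu_offload_gb: float = 0,
                        tensor_parallel_size: int = 1) -> List[str]:
    """Assemble an argv list for `vllm serve`; nothing is executed here."""
    argv = ["vllm", "serve", model]
    if tensor_parallel_size > 1:
        argv += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if cpu_offload_gb > 0:
        # Offload budget; the flag name comes from the release notes.
        argv += ["--cpu-offload-gb", str(cpu_offload_gb)]
    return argv


# "my-org/my-model" is a placeholder, not a real model ID.
cmd = build_serve_command("my-org/my-model", cpu_offload_gb=10)
print(shlex.join(cmd))  # vllm serve my-org/my-model --cpu-offload-gb 10
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once vLLM is installed.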

What's Changed

New Contributors

Full Changelog: v0.5.2...v0.5.3
