Highlights
Model Support
- vLLM now supports Meta Llama 3.1! Please check out our blog here for initial details on running the model.
- Please check out this thread for any known issues related to the model.
- The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (see the sketch after this list) (#6606, #6547, #6487, #6593, #6511, #6515, #6552)
- The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have fast network interconnect, you might want to consider full tensor parallelism as well. (#6599, #6598, #6529, #6569)
- In order to support long context, a new RoPE extension method has been added and chunked prefill has been turned on by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
- Support Mistral-Nemo (#6548)
- Support Chameleon (#6633, #5770)
- Pipeline parallel support for Mixtral (#6516)
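A minimal offline-inference sketch of the single-node FP8 setup above, using the Python API; the checkpoint name, GPU count, and context length are placeholders to adapt to your environment:

```python
from vllm import LLM, SamplingParams

# Placeholder FP8 Llama 3.1 checkpoint; adjust to the checkpoint and node you actually use.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,   # a single 8xH100 / 8xA100 node
    max_model_len=8192,       # optionally cap the 128K context if KV-cache memory is tight
)

outputs = llm.generate(
    ["Summarize the Llama 3.1 release in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```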
Hardware Support
Performance Enhancements
- Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) performance improvements to existing AWQ models (see the sketch after this list)! (#6612)
- Progress towards refactoring for SPMD worker execution. (#6032)
- Progress in improving prepare inputs procedure. (#6164, #6338, #6596)
- Memory optimization for pipeline parallelism. (#6455)
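Existing AWQ checkpoints should not need any changes to pick up the faster path; a minimal sketch, assuming the placeholder checkpoint below and a GPU on which vLLM selects the Marlin-based AWQ kernel:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; any existing AWQ model should load unchanged.
# On supported GPUs, vLLM may route AWQ weights through the new Marlin kernel automatically.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ")

out = llm.generate(["Explain AWQ quantization briefly."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```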
Production Engine
- Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
- Support dynamically loading LoRA adapters from HuggingFace (see the sketch after this list) (#6234)
- Pipeline Parallel using stdlib multiprocessing module (#6130)
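A hedged sketch of dynamic LoRA loading through the Python API; the base model and adapter repo IDs are placeholders, and with #6234 the adapter path may be a Hugging Face repo ID that is downloaded on first use:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model; LoRA must be enabled on the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Placeholder adapter repo ID on the Hugging Face Hub; vLLM fetches it when the request is served.
lora = LoRARequest("sql-adapter", 1, "yard1/llama-2-7b-sql-lora-test")

out = llm.generate(
    ["Write a SQL query that counts users per country."],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(out[0].outputs[0].text)
```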
Others
- A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much CPU RAM to "extend" the GPU memory with (see the sketch after this list). (#6496)
- The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are greatly welcomed! (#6431)
- The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
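The CPU-offload knob is also reachable from the Python API as the engine argument behind `--cpu-offload-gb`; a minimal sketch (the model name and offload size are placeholders):

```python
from vllm import LLM, SamplingParams

# cpu_offload_gb mirrors the --cpu-offload-gb CLI flag: per-GPU gigabytes of weights kept in
# CPU RAM, effectively extending usable GPU memory at some transfer-latency cost.
llm = LLM(model="meta-llama/Llama-2-13b-hf", cpu_offload_gb=10)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```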
What's Changed
- [Docs] Add Google Cloud to sponsor list by @WoosukKwon in #6450
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in #6289
- [CI/Build][TPU] Add TPU CI test by @WoosukKwon in #6277
- Pin sphinx-argparse version by @khluu in #6453
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in #6425
- [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in #6419
- [Docs] Announce 5th meetup by @WoosukKwon in #6458
- [CI/Build] vLLM cache directory for images by @DarkLight1337 in #6444
- [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in #5923
- [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in #6470
- [Core] Use numpy to speed up padded token processing by @peng1999 in #6442
- [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in #6460
- [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in #6481
- [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in #6457
- [Doc] Fix the lora adapter path in server startup script by @Jeffwan in #6230
- [Misc] Log spec decode metrics by @comaniac in #6454
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by @mgoin in #6081
- [ci][distributed] add pipeline parallel correctness test by @youkaichao in #6410
- [misc][distributed] improve tests by @youkaichao in #6488
- [misc][distributed] add seed to dummy weights by @youkaichao in #6491
- [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in #6455
- [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in #6482
- [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in #6467
- [Doc][CI/Build] Update docs and tests to use `vllm serve` by @DarkLight1337 in #6431
- [Bugfix] Fix for multinode crash on 4 PP by @andoorve in #6495
- [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in #6504
- [Misc] Use `torch.Tensor` for type annotation by @WoosukKwon in #6505
- [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in #6164
- [DOC] - Add docker image to Cerebrium Integration by @milo157 in #6510
- [Bugfix] Fix Ray Metrics API usage by @Yard1 in #6354
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in #6338
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in #6511
- [Model] Pipeline parallel support for Mixtral by @comaniac in #6516
- [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in #6487
- [core][model] yet another cpu offload implementation by @youkaichao in #6496
- [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in #6530
- [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in #6032
- [Misc] Minor patch for draft model runner by @comaniac in #6523
- [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in #6227
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in #6501
- [TPU] Refactor TPU worker & model runner by @WoosukKwon in #6506
- [ Misc ] Improve Min Capability Checking in `compressed-tensors` by @robertgshaw2-neuralmagic in #6522
- [ci] Reword Github bot comment by @khluu in #6534
- [Model] Support Mistral-Nemo by @mgoin in #6548
- Fix PR comment bot by @khluu in #6554
- [ci][test] add correctness test for cpu offloading by @youkaichao in #6549
- [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in #6552
- [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in #6517
- Add support for a rope extension method by @simon-mo in #6553
- [Core] Multiprocessing Pipeline Parallel support by @njhill in #6130
- [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in #6034
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by @robertgshaw2-neuralmagic in #6515
- [Bugfix][Frontend] Fix missing `/metrics` endpoint by @DarkLight1337 in #6463
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in #6369
- [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in #6327
- [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in #6581
- [Misc] Small perf improvements by @Yard1 in #6520
- [Docs] Update docs for wheel location by @simon-mo in #6580
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in #6578
- [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in #6597
- [ Kernel ] Enable Dynamic Per Token `fp8` by @robertgshaw2-neuralmagic in #6547
- [Docs] Update PP docs by @andoorve in #6598
- [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in #6599
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in #6593
- [Core] Allow specifying custom Executor by @Yard1 in #6557
- [Bugfix][Core]: Guard for KeyErrors that can occur if a request is aborted with Pipeline Parallel by @tjohnson31415 in #6587
- [Misc] Consolidate and optimize logic for building padded tensors by @DarkLight1337 in #6541
- [ Misc ] `fbgemm` checkpoints by @robertgshaw2-neuralmagic in #6559
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes by @mawong-amd in #6543
- [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models by @robertgshaw2-neuralmagic in #6606
- [Misc] Fix input_scale typing in w8a8_utils.py by @mgoin in #6579
- [ Bugfix ] Fix AutoFP8 fp8 marlin by @robertgshaw2-neuralmagic in #6609
- [Frontend] Move chat utils by @DarkLight1337 in #6602
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. by @sroy745 in #6485
- [Misc] Remove abused noqa by @WoosukKwon in #6619
- [Model] Refactor and decouple phi3v image embedding by @Isotr0py in #6621
- [Kernel][Core] Add AWQ support to the Marlin kernel by @alexm-neuralmagic in #6612
- [Model] Initial Support for Chameleon by @ywang96 in #5770
- [Misc] Add a wrapper for torch.inference_mode by @WoosukKwon in #6618
- [Bugfix] Fix `vocab_size` field access in LLaVA models by @jaywonchung in #6624
- [Frontend] Refactor prompt processing by @DarkLight1337 in #4028
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels by @tlrmchlsmth in #6649
- [ci] Use different sccache bucket for CUDA 11.8 wheel build by @khluu in #6656
- [Core] Support dynamically loading Lora adapter from HuggingFace by @Jeffwan in #6234
- [ci][build] add back vim in docker by @youkaichao in #6661
- [Misc] Remove deprecation warning for beam search by @WoosukKwon in #6659
- [Core] Modulize prepare input and attention metadata builder by @comaniac in #6596
- [Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization by @cli99 in #6665
- [Misc] Enable chunked prefill by default for long context models by @WoosukKwon in #6666
- [misc] add start loading models for users information by @youkaichao in #6670
- add tqdm when loading checkpoint shards by @zhaotyer in #6569
- [Misc] Support FP8 kv cache scales from compressed-tensors by @mgoin in #6528
- [doc][distributed] add more doc for setting up multi-node environment by @youkaichao in #6529
- [Misc] Manage HTTP connections in one place by @DarkLight1337 in #6600
- [misc] only tqdm for first rank by @youkaichao in #6672
- [VLM][Model] Support image input for Chameleon by @ywang96 in #6633
- support ignore patterns in model loader by @simon-mo in #6673
- Bump version to v0.5.3 by @simon-mo in #6674
New Contributors
- @g-eoj made their first contribution in #6419
- @peng1999 made their first contribution in #6442
- @Jeffwan made their first contribution in #6230
- @wushidonguc made their first contribution in #6455
- @ShangmingCai made their first contribution in #6467
- @ruisearch42 made their first contribution in #6032
Full Changelog: v0.5.2...v0.5.3