## Highlights
### Model Support
- Support Mistral-Nemo (#6548)
- Support Chameleon (#6633, #5770); see the example sketch after this list
- Pipeline parallel support for Mixtral (#6516)
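For the new Chameleon support, here is a minimal offline-inference sketch using vLLM's multimodal prompt dict; the `facebook/chameleon-7b` checkpoint, the image file name, and the placement of the `<image>` placeholder are illustrative assumptions rather than a definitive recipe.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load the newly supported Chameleon model (#5770, #6633).
llm = LLM(model="facebook/chameleon-7b")

# Image input for Chameleon (#6633) goes through the multimodal prompt dict;
# the "<image>" placeholder marks where the image tokens are inserted.
image = Image.open("example.jpg")
outputs = llm.generate(
    {
        "prompt": "Describe this image.<image>",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```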
### Performance Enhancements
- Add AWQ support to the Marlin kernel, bringing significant (1.5-2x) performance improvements to existing AWQ models! (#6612) See the sketch after this list.
- Progress towards refactoring for SPMD worker execution. (#6032)
- Progress on improving the prepare-inputs procedure. (#6164, #6338, #6596)
- Memory optimization for pipeline parallelism. (#6455)
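As a rough sketch of exercising the faster AWQ path, assuming `awq_marlin` is the quantization name introduced by #6612 and using a public AWQ checkpoint as a stand-in; on supported GPUs vLLM should select the Marlin-based kernel for AWQ checkpoints automatically, so forcing it is optional.

```python
from vllm import LLM, SamplingParams

# Assumption: "awq_marlin" is the quantization name added in #6612.
# Omitting `quantization` lets vLLM pick the best kernel for the checkpoint.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # stand-in AWQ model
    quantization="awq_marlin",
)
print(llm.generate("Hello!", SamplingParams(max_tokens=32))[0].outputs[0].text)
```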
### Production Engine
- Correctness testing for pipeline parallelism and CPU offloading (#6410, #6549)
- Support dynamically loading LoRA adapters from HuggingFace (#6234); see the sketch after this list
- Pipeline parallel support using the stdlib multiprocessing module (#6130)
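A minimal sketch of the dynamic LoRA loading from #6234: the adapter path in `LoRARequest` may now be a HuggingFace repo id that is downloaded at runtime. The base model and adapter repo below are illustrative placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA support enabled (placeholder model id).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# With #6234, the adapter path may be a HuggingFace repo id instead of a
# local directory; it is fetched on first use.
lora = LoRARequest("sql_adapter", 1, "yard1/llama-2-7b-sql-lora-test")

outputs = llm.generate(
    ["Translate to SQL: list all users"],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```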
### Others
- A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much memory to "extend" the RAM with. (#6496) See the example after this list.
- The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are greatly welcomed! (#6431)
- The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
## What's Changed
- [Docs] Add Google Cloud to sponsor list by @WoosukKwon in #6450
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in #6289
- [CI/Build][TPU] Add TPU CI test by @WoosukKwon in #6277
- Pin sphinx-argparse version by @khluu in #6453
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in #6425
- [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in #6419
- [Docs] Announce 5th meetup by @WoosukKwon in #6458
- [CI/Build] vLLM cache directory for images by @DarkLight1337 in #6444
- [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in #5923
- [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in #6470
- [Core] Use numpy to speed up padded token processing by @peng1999 in #6442
- [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in #6460
- [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in #6481
- [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in #6457
- [Doc] Fix the lora adapter path in server startup script by @Jeffwan in #6230
- [Misc] Log spec decode metrics by @comaniac in #6454
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by @mgoin in #6081
- [ci][distributed] add pipeline parallel correctness test by @youkaichao in #6410
- [misc][distributed] improve tests by @youkaichao in #6488
- [misc][distributed] add seed to dummy weights by @youkaichao in #6491
- [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in #6455
- [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in #6482
- [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in #6467
- [Doc][CI/Build] Update docs and tests to use `vllm serve` by @DarkLight1337 in #6431
- [Bugfix] Fix for multinode crash on 4 PP by @andoorve in #6495
- [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in #6504
- [Misc] Use `torch.Tensor` for type annotation by @WoosukKwon in #6505
- [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in #6164
- [DOC] - Add docker image to Cerebrium Integration by @milo157 in #6510
- [Bugfix] Fix Ray Metrics API usage by @Yard1 in #6354
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in #6338
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in #6511
- [Model] Pipeline parallel support for Mixtral by @comaniac in #6516
- [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in #6487
- [core][model] yet another cpu offload implementation by @youkaichao in #6496
- [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in #6530
- [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in #6032
- [Misc] Minor patch for draft model runner by @comaniac in #6523
- [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in #6227
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in #6501
- [TPU] Refactor TPU worker & model runner by @WoosukKwon in #6506
- [ Misc ] Improve Min Capability Checking in `compressed-tensors` by @robertgshaw2-neuralmagic in #6522
- [ci] Reword Github bot comment by @khluu in #6534
- [Model] Support Mistral-Nemo by @mgoin in #6548
- Fix PR comment bot by @khluu in #6554
- [ci][test] add correctness test for cpu offloading by @youkaichao in #6549
- [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in #6552
- [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in #6517
- Add support for a rope extension method by @simon-mo in #6553
- [Core] Multiprocessing Pipeline Parallel support by @njhill in #6130
- [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in #6034
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by @robertgshaw2-neuralmagic in #6515
- [Bugfix][Frontend] Fix missing `/metrics` endpoint by @DarkLight1337 in #6463
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in #6369
- [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in #6327
- [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in #6581
- [Misc] Small perf improvements by @Yard1 in #6520
- [Docs] Update docs for wheel location by @simon-mo in #6580
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in #6578
- [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in #6597
- [ Kernel ] Enable Dynamic Per Token `fp8` by @robertgshaw2-neuralmagic in #6547
- [Docs] Update PP docs by @andoorve in #6598
- [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in #6599
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in #6593
- [Core] Allow specifying custom Executor by @Yard1 in #6557
- [Bugfix][Core]: Guard for KeyErrors that can occur if a request is aborted with Pipeline Parallel by @tjohnson31415 in #6587
- [Misc] Consolidate and optimize logic for building padded tensors by @DarkLight1337 in #6541
- [ Misc ] `fbgemm` checkpoints by @robertgshaw2-neuralmagic in #6559
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes by @mawong-amd in #6543
- [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models by @robertgshaw2-neuralmagic in #6606
- [Misc] Fix input_scale typing in w8a8_utils.py by @mgoin in #6579
- [ Bugfix ] Fix AutoFP8 fp8 marlin by @robertgshaw2-neuralmagic in #6609
- [Frontend] Move chat utils by @DarkLight1337 in #6602
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. by @sroy745 in #6485
- [Misc] Remove abused noqa by @WoosukKwon in #6619
- [Model] Refactor and decouple phi3v image embedding by @Isotr0py in #6621
- [Kernel][Core] Add AWQ support to the Marlin kernel by @alexm-neuralmagic in #6612
- [Model] Initial Support for Chameleon by @ywang96 in #5770
- [Misc] Add a wrapper for torch.inference_mode by @WoosukKwon in #6618
- [Bugfix] Fix `vocab_size` field access in LLaVA models by @jaywonchung in #6624
- [Frontend] Refactor prompt processing by @DarkLight1337 in #4028
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels by @tlrmchlsmth in #6649
- [ci] Use different sccache bucket for CUDA 11.8 wheel build by @khluu in #6656
- [Core] Support dynamically loading Lora adapter from HuggingFace by @Jeffwan in #6234
- [ci][build] add back vim in docker by @youkaichao in #6661
- [Misc] Remove deprecation warning for beam search by @WoosukKwon in #6659
- [Core] Modulize prepare input and attention metadata builder by @comaniac in #6596
- [Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization by @cli99 in #6665
- [Misc] Enable chunked prefill by default for long context models by @WoosukKwon in #6666
- [misc] add start loading models for users information by @youkaichao in #6670
- add tqdm when loading checkpoint shards by @zhaotyer in #6569
- [Misc] Support FP8 kv cache scales from compressed-tensors by @mgoin in #6528
- [doc][distributed] add more doc for setting up multi-node environment by @youkaichao in #6529
- [Misc] Manage HTTP connections in one place by @DarkLight1337 in #6600
- [misc] only tqdm for first rank by @youkaichao in #6672
- [VLM][Model] Support image input for Chameleon by @ywang96 in #6633
- support ignore patterns in model loader by @simon-mo in #6673
- Bump version to v0.5.3 by @simon-mo in #6674
## New Contributors
- @g-eoj made their first contribution in #6419
- @peng1999 made their first contribution in #6442
- @Jeffwan made their first contribution in #6230
- @wushidonguc made their first contribution in #6455
- @ShangmingCai made their first contribution in #6467
- @ruisearch42 made their first contribution in #6032
**Full Changelog**: v0.5.2...v0.5.3