## Highlights
### Model Support
- Support Mistral-Nemo (#6548)
- Support Chameleon (#6633, #5770); see the example sketch after this list
- Pipeline parallel support for Mixtral (#6516)
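For the new Chameleon support, here is a minimal offline-inference sketch using vLLM's multimodal prompt dict; the `facebook/chameleon-7b` checkpoint, the image file name, and the placement of the `<image>` placeholder are illustrative assumptions rather than a definitive recipe.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load the newly supported Chameleon model (#5770, #6633).
llm = LLM(model="facebook/chameleon-7b")

# Image input for Chameleon (#6633) goes through the multimodal prompt dict;
# the "<image>" placeholder marks where the image tokens are inserted.
image = Image.open("example.jpg")
outputs = llm.generate(
    {
        "prompt": "Describe this image.<image>",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```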
### Performance Enhancements
- Add AWQ support to the Marlin kernel, bringing significant (1.5-2x) performance improvements to existing AWQ models! (#6612) See the sketch after this list.
- Progress towards refactoring for SPMD worker execution. (#6032)
- Progress on improving the prepare-inputs procedure. (#6164, #6338, #6596)
- Memory optimization for pipeline parallelism. (#6455)
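As a rough sketch of exercising the faster AWQ path, assuming `awq_marlin` is the quantization name introduced by #6612 and using a public AWQ checkpoint as a stand-in; on supported GPUs vLLM should select the Marlin-based kernel for AWQ checkpoints automatically, so forcing it is optional.

```python
from vllm import LLM, SamplingParams

# Assumption: "awq_marlin" is the quantization name added in #6612.
# Omitting `quantization` lets vLLM pick the best kernel for the checkpoint.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # stand-in AWQ model
    quantization="awq_marlin",
)
print(llm.generate("Hello!", SamplingParams(max_tokens=32))[0].outputs[0].text)
```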
### Production Engine
- Correctness testing for pipeline parallelism and CPU offloading (#6410, #6549)
- Support dynamically loading LoRA adapters from HuggingFace (#6234); see the sketch after this list
- Pipeline parallel support using the stdlib multiprocessing module (#6130)
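A minimal sketch of the dynamic LoRA loading from #6234: the adapter path in `LoRARequest` may now be a HuggingFace repo id that is downloaded at runtime. The base model and adapter repo below are illustrative placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA support enabled (placeholder model id).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# With #6234, the adapter path may be a HuggingFace repo id instead of a
# local directory; it is fetched on first use.
lora = LoRARequest("sql_adapter", 1, "yard1/llama-2-7b-sql-lora-test")

outputs = llm.generate(
    ["Translate to SQL: list all users"],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```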
### Others
- A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much memory to "extend" the RAM with. (#6496) See the example after this list.
- The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are greatly welcomed! (#6431)
- The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
## What's Changed
- [Docs] Add Google Cloud to sponsor list by @WoosukKwon in #6450
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in #6289
- [CI/Build][TPU] Add TPU CI test by @WoosukKwon in #6277
- Pin sphinx-argparse version by @khluu in #6453
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in #6425
- [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in #6419
- [Docs] Announce 5th meetup by @WoosukKwon in #6458
- [CI/Build] vLLM cache directory for images by @DarkLight1337 in #6444
- [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in #5923
- [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in #6470
- [Core] Use numpy to speed up padded token processing by @peng1999 in #6442
- [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in #6460
- [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in #6481
- [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in #6457
- [Doc] Fix the lora adapter path in server startup script by @Jeffwan in #6230
- [Misc] Log spec decode metrics by @comaniac in #6454
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by @mgoin in #6081
- [ci][distributed] add pipeline parallel correctness test by @youkaichao in #6410
- [misc][distributed] improve tests by @youkaichao in #6488
- [misc][distributed] add seed to dummy weights by @youkaichao in #6491
- [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in #6455
- [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in #6482
- [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in #6467
- [Doc][CI/Build] Update docs and tests to use `vllm serve` by @DarkLight1337 in #6431
- [Bugfix] Fix for multinode crash on 4 PP by @andoorve in #6495
- [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in #6504
- [Misc] Use `torch.Tensor` for type annotation by @WoosukKwon in #6505
- [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in #6164
- [DOC] - Add docker image to Cerebrium Integration by @milo157 in #6510
- [Bugfix] Fix Ray Metrics API usage by @Yard1 in #6354
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in #6338
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in #6511
- [Model] Pipeline parallel support for Mixtral by @comaniac in #6516
- [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in #6487
- [core][model] yet another cpu offload implementation by @youkaichao in #6496
- [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in #6530
- [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in #6032
- [Misc] Minor patch for draft model runner by @comaniac in #6523
- [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in #6227
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in #6501
- [TPU] Refactor TPU worker & model runner by @WoosukKwon in #6506
- [ Misc ] Improve Min Capability Checking in `compressed-tensors` by @robertgshaw2-neuralmagic in #6522
- [ci] Reword Github bot comment by @khluu in #6534
- [Model] Support Mistral-Nemo by @mgoin in #6548
- Fix PR comment bot by @khluu in #6554
- [ci][test] add correctness test for cpu offloading by @youkaichao in #6549
- [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in #6552
- [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in #6517
- Add support for a rope extension method by @simon-mo in #6553
- [Core] Multiprocessing Pipeline Parallel support by @njhill in #6130
- [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in #6034
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by @robertgshaw2-neuralmagic in #6515
- [Bugfix][Frontend] Fix missing `/metrics` endpoint by @DarkLight1337 in #6463
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in #6369
- [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in #6327
- [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in #6581
- [Misc] Small perf improvements by @Yard1 in #6520
- [Docs] Update docs for wheel location by @simon-mo in #6580
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in #6578
- [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in #6597
- [ Kernel ] Enable Dynamic Per Token `fp8` by @robertgshaw2-neuralmagic in #6547
- [Docs] Update PP docs by @andoorve in #6598
- [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in #6599
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in #6593
- [Core] Allow specifying custom Executor by @Yard1 in #6557
- [Bugfix][Core]: Guard for KeyErrors that can occur if a request is aborted with Pipeline Parallel by @tjohnson31415 in #6587
- [Misc] Consolidate and optimize logic for building padded tensors by @DarkLight1337 in #6541
- [ Misc ] `fbgemm` checkpoints by @robertgshaw2-neuralmagic in #6559
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes by @mawong-amd in #6543
- [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models by @robertgshaw2-neuralmagic in #6606
- [Misc] Fix input_scale typing in w8a8_utils.py by @mgoin in #6579
- [ Bugfix ] Fix AutoFP8 fp8 marlin by @robertgshaw2-neuralmagic in #6609
- [Frontend] Move chat utils by @DarkLight1337 in #6602
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. by @sroy745 in #6485
- [Misc] Remove abused noqa by @WoosukKwon in #6619
- [Model] Refactor and decouple phi3v image embedding by @Isotr0py in #6621
- [Kernel][Core] Add AWQ support to the Marlin kernel by @alexm-neuralmagic in #6612
- [Model] Initial Support for Chameleon by @ywang96 in #5770
- [Misc] Add a wrapper for torch.inference_mode by @WoosukKwon in #6618
- [Bugfix] Fix `vocab_size` field access in LLaVA models by @jaywonchung in #6624
- [Frontend] Refactor prompt processing by @DarkLight1337 in #4028
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels by @tlrmchlsmth in #6649
- [ci] Use different sccache bucket for CUDA 11.8 wheel build by @khluu in #6656
- [Core] Support dynamically loading Lora adapter from HuggingFace by @Jeffwan in #6234
- [ci][build] add back vim in docker by @youkaichao in #6661
- [Misc] Remove deprecation warning for beam search by @WoosukKwon in #6659
- [Core] Modulize prepare input and attention metadata builder by @comaniac in #6596
- [Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization by @cli99 in #6665
- [Misc] Enable chunked prefill by default for long context models by @WoosukKwon in #6666
- [misc] add start loading models for users information by @youkaichao in #6670
- add tqdm when loading checkpoint shards by @zhaotyer in #6569
- [Misc] Support FP8 kv cache scales from compressed-tensors by @mgoin in #6528
- [doc][distributed] add more doc for setting up multi-node environment by @youkaichao in #6529
- [Misc] Manage HTTP connections in one place by @DarkLight1337 in #6600
- [misc] only tqdm for first rank by @youkaichao in #6672
- [VLM][Model] Support image input for Chameleon by @ywang96 in #6633
- support ignore patterns in model loader by @simon-mo in #6673
- Bump version to v0.5.3 by @simon-mo in #6674
## New Contributors
- @g-eoj made their first contribution in #6419
- @peng1999 made their first contribution in #6442
- @Jeffwan made their first contribution in #6230
- @wushidonguc made their first contribution in #6455
- @ShangmingCai made their first contribution in #6467
- @ruisearch42 made their first contribution in #6032
**Full Changelog**: v0.5.2...v0.5.3