## Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (wheels built with CUDA 11.8 are also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (see the sketch after this list)
- Preliminary support for SqueezeLLM
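
As a quick illustration of two of the API-facing items above (loading an AWQ-quantized model and attaching a logits processor), here is a minimal sketch. The model name and the exact processor signature are assumptions based on the Python API in this release, not a definitive reference:

```python
from typing import List
import torch
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint (model name is illustrative).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

# A logits processor is assumed to be a callable that receives the token ids
# generated so far plus the logits for the next token, and returns new logits.
def ban_token_42(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    logits[42] = float("-inf")  # never sample token id 42 (arbitrary example)
    return logits

params = SamplingParams(max_tokens=64, logits_processors=[ban_token_42])
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```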
## What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642 (see the sampling sketch after this list)
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
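
For the new sampling options (repetition_penalty from #1424 and min_p from #1642), a short usage sketch; the parameter values and model name are arbitrary and the comments describe the assumed semantics rather than a definitive specification:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any supported model; name is illustrative

# repetition_penalty > 1.0 penalizes tokens that have already appeared, and
# min_p is assumed to drop candidates whose probability falls below
# min_p times the probability of the most likely token.
params = SamplingParams(
    temperature=0.8,
    repetition_penalty=1.1,
    min_p=0.05,
    max_tokens=32,
)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```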
## New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
**Full Changelog**: v0.2.1...v0.2.2