## Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (wheels built with CUDA 11.8 are also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (see the sketch after this list)
- Preliminary support for SqueezeLLM
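
As a quick illustration of two of the API-facing items above (loading an AWQ-quantized model and attaching a logits processor), here is a minimal sketch. The model name and the exact processor signature are assumptions based on the Python API in this release, not a definitive reference:

```python
from typing import List
import torch
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint (model name is illustrative).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

# A logits processor is assumed to be a callable that receives the token ids
# generated so far plus the logits for the next token, and returns new logits.
def ban_token_42(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    logits[42] = float("-inf")  # never sample token id 42 (arbitrary example)
    return logits

params = SamplingParams(max_tokens=64, logits_processors=[ban_token_42])
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```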
## What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642 (see the sampling sketch after this list)
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
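
For the new sampling options (repetition_penalty from #1424 and min_p from #1642), a short usage sketch; the parameter values and model name are arbitrary and the comments describe the assumed semantics rather than a definitive specification:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any supported model; name is illustrative

# repetition_penalty > 1.0 penalizes tokens that have already appeared, and
# min_p is assumed to drop candidates whose probability falls below
# min_p times the probability of the most likely token.
params = SamplingParams(
    temperature=0.8,
    repetition_penalty=1.1,
    min_p=0.05,
    max_tokens=32,
)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```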
## New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
**Full Changelog**: v0.2.1...v0.2.2