## Major changes
- Up to 60% performance improvement from optimized de-tokenization and sampling
- Initial support for AWQ (performance not optimized)
- Support for RoPE scaling and LongChat
- Support for Mistral-7B
- Many bug fixes
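The de-tokenization speedup comes from decoding incrementally (see #984 below), emitting only the text added by each new token instead of re-decoding the whole sequence every step. A toy illustration of the idea, using a hypothetical lookup-table detokenizer rather than vLLM's actual implementation (which also caches the already-decoded prefix; this sketch recomputes it for clarity):

```python
def decode(token_ids, vocab):
    """Toy detokenizer: concatenate the surface form of each token id."""
    return "".join(vocab[t] for t in token_ids)

def incremental_decode(token_ids, prev_len, vocab):
    """Return only the text produced after the first prev_len tokens.

    Decoding the full sequence and slicing off the already-emitted prefix
    keeps streamed output correct even when token boundaries do not line
    up with character boundaries.
    """
    full_text = decode(token_ids, vocab)
    old_text = decode(token_ids[:prev_len], vocab)
    return full_text[len(old_text):]

# Stream one token at a time, emitting only the fresh suffix each step.
vocab = {0: "Hel", 1: "lo", 2: ", world"}
emitted = [incremental_decode([0, 1, 2][: i + 1], i, vocab) for i in range(3)]
print(emitted)  # each element is the text contributed by one decoding step
```

Concatenating the emitted pieces reconstructs the full decoded string, which is what makes this safe for token-by-token streaming.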
## What's Changed
- Add option to shorten prompt printing in logs by @leiwen83 in #991
- Make `max_model_len` configurable by @Yard1 in #972
- Fix typo in README.md by @eltociear in #1033
- Use TGI-like incremental detokenization by @Yard1 in #984
- Add Model Revision Support in #1014
- [FIX] Minor bug fixes by @zhuohan123 in #1035
- Announce paper release by @WoosukKwon in #1036
- Fix detokenization leaving special tokens by @Yard1 in #1044
- Add pandas to requirements.txt by @WoosukKwon in #1047
- OpenAI-Server: Only fail if `logit_bias` has actual values by @LLukas22 in #1045
- Fix warning message on LLaMA FastTokenizer by @WoosukKwon in #1037
- Abort when coroutine is cancelled by @rucyang in #1020
- Implement AWQ quantization support for LLaMA by @WoosukKwon in #1032
- Remove AsyncLLMEngine busy loop, shield background task by @Yard1 in #1059
- Fix hanging when prompt exceeds limit by @chenxu2048 in #1029
- [FIX] Don't initialize parameter by default by @zhuohan123 in #1067
- Add support for quantization in the LLM module by @orellavie1212 in #1080
- Align `llm_engine` and `async_engine` step methods by @esmeetu in #1081
- Fix get_max_num_running_seqs for waiting and swapped seq groups by @zhuohan123 in #1068
- Add safetensors support for quantized models by @WoosukKwon in #1073
- Add minimum capability requirement for AWQ by @WoosukKwon in #1064
- [Community] Add vLLM Discord server by @zhuohan123 in #1086
- Add pyarrow to dependencies & Print warning on Ray import error by @WoosukKwon in #1094
- Add `gpu_memory_utilization` and `swap_space` to LLM by @WoosukKwon in #1090
- Add documentation to Triton server tutorial by @tanmayv25 in #983
- Read `rope_theta` and `max_position_embeddings` from config by @Yard1 in #1096
- Replace `torch.cuda.DtypeTensor` with `torch.tensor` by @WoosukKwon in #1123
- Add float16 and float32 to dtype choices by @WoosukKwon in #1115
- Clean API code and remove redundant background task by @esmeetu in #1102
- Support `stop_token_ids` parameter by @gesanqiu in #1097
- Use `--ipc=host` in `docker run` for distributed inference by @WoosukKwon in #1125
- Docs: Fix broken link to openai example by @nkpz in #1145
- Announce the First vLLM Meetup by @WoosukKwon in #1148
- [Sampler] Vectorized sampling (simplified) by @zhuohan123 in #1048
- [FIX] Simplify sampler logic by @zhuohan123 in #1156
- Fix config for Falcon by @WoosukKwon in #1164
- Align `max_tokens` behavior with openai by @HermitSun in #852
- [Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs by @WoosukKwon in #1074
- Add comments on RoPE initialization by @WoosukKwon in #1176
- Allocate more shared memory to attention kernel by @Yard1 in #1154
- Support LongChat by @LiuXiaoxuanPKU in #555
- Fix typo by @WrRan in #1184
- Fix Qwen-14B model by @Sanster in #1173
- Automatically set `max_num_batched_tokens` by @WoosukKwon in #1198
- Use standard extras for `uvicorn` by @danilopeixoto in #1166
- Keep special sampling params by @blahblahasdf in #1186
- Add `rope_scaling` support for Qwen by @Sanster in #1210
- [Mistral] Mistral-7B-v0.1 support by @Bam4d in #1196
- Fix Mistral model by @WoosukKwon in #1220
- [Fix] Remove false assertion by @WoosukKwon in #1222
- Add Mistral to supported model list by @WoosukKwon in #1221
- Fix OOM in attention kernel test by @WoosukKwon in #1223
- Provide default max model length by @WoosukKwon in #1224
- Bump up the version to v0.2.0 by @WoosukKwon in #1212
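Several entries above (#555, #1096, #1210) concern RoPE scaling, which lets a model attend beyond its trained context window by compressing position indices into the trained range before computing rotary angles. A minimal sketch of the linear (position-interpolation) variant; the function name and signature are illustrative, not vLLM's API:

```python
def rope_angles(position, rotary_dim, rope_theta=10000.0, scaling_factor=1.0):
    """Rotary embedding angles for a single position.

    Linear RoPE scaling divides the position index by scaling_factor, so a
    model trained on N positions can address scaling_factor * N positions
    while the angles stay inside the range seen during training.
    """
    pos = position / scaling_factor
    return [pos / (rope_theta ** (2 * i / rotary_dim))
            for i in range(rotary_dim // 2)]

# With scaling_factor=2.0, position 100 reuses the angles of position 50,
# which is how a 2048-token model can be stretched to 4096 tokens.
assert rope_angles(100, 8, scaling_factor=2.0) == rope_angles(50, 8)
```

The base `rope_theta` is the other knob exposed in #1096: raising it slows how quickly angles grow with position, which is an alternative way of extending usable context.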
## New Contributors
- @leiwen83 made their first contribution in #991
- @LLukas22 made their first contribution in #1045
- @rucyang made their first contribution in #1020
- @chenxu2048 made their first contribution in #1029
- @orellavie1212 made their first contribution in #1080
- @tanmayv25 made their first contribution in #983
- @nkpz made their first contribution in #1145
- @WrRan made their first contribution in #1184
- @danilopeixoto made their first contribution in #1166
- @blahblahasdf made their first contribution in #1186
- @Bam4d made their first contribution in #1196
Full Changelog: v0.1.7...v0.2.0