Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics
- Fix Baichuan & Baichuan 2
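The new Prometheus metrics support exposes server stats in the standard Prometheus text exposition format. As a rough illustration of what that format looks like, here is a minimal sketch using the `prometheus_client` library; the metric names below are hypothetical examples, not vLLM's actual metric names:

```python
from prometheus_client import Counter, Gauge, generate_latest

# Hypothetical metrics for illustration only -- not vLLM's real metric names.
requests = Counter("demo_requests", "Total requests served")
running_seqs = Gauge("demo_running_sequences", "Sequences currently running")

requests.inc()
running_seqs.set(3)

# generate_latest() renders every registered metric in the text
# exposition format that a Prometheus server scrapes from /metrics.
print(generate_latest().decode())
```

A Prometheus server (or a quick `curl` against the serving endpoint) would scrape output in this same format.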
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- Fix RAM OOM when loading large models in tensor parallel mode by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.`__repr__` by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should disable logging of requests by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
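As context for the repetition-penalty change in #1577: the HuggingFace behavior it aligns with divides positive logits by the penalty and multiplies negative logits by it, for every token that has already been generated. A simplified standalone sketch of that rule (not vLLM's actual implementation):

```python
def apply_repetition_penalty(logits, prev_token_ids, penalty):
    """HuggingFace-style repetition penalty: tokens that already appeared
    become less likely when penalty > 1. Positive logits are divided by
    the penalty; negative logits are multiplied by it."""
    out = list(logits)
    for token_id in set(prev_token_ids):
        score = out[token_id]
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out

# Tokens 0 (logit 2.0) and 1 (logit -1.0) were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5]
```

Note the asymmetry: a plain division would make already-negative logits *more* likely, which is why negative scores are multiplied instead.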
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3