Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics
- Fix Baichuan & Baichuan 2
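The new Prometheus metrics support exposes server stats in the standard Prometheus text exposition format. As a rough illustration of what that format looks like, here is a minimal sketch using the `prometheus_client` library; the metric names below are hypothetical examples, not vLLM's actual metric names:

```python
from prometheus_client import Counter, Gauge, generate_latest

# Hypothetical metrics for illustration only -- not vLLM's real metric names.
requests = Counter("demo_requests", "Total requests served")
running_seqs = Gauge("demo_running_sequences", "Sequences currently running")

requests.inc()
running_seqs.set(3)

# generate_latest() renders every registered metric in the text
# exposition format that a Prometheus server scrapes from /metrics.
print(generate_latest().decode())
```

A Prometheus server (or a quick `curl` against the serving endpoint) would scrape output in this same format.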
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- Fix RAM OOM when loading large models in tensor parallel mode by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.`__repr__` by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should disable logging of requests by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
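As context for the repetition-penalty change in #1577: the HuggingFace behavior it aligns with divides positive logits by the penalty and multiplies negative logits by it, for every token that has already been generated. A simplified standalone sketch of that rule (not vLLM's actual implementation):

```python
def apply_repetition_penalty(logits, prev_token_ids, penalty):
    """HuggingFace-style repetition penalty: tokens that already appeared
    become less likely when penalty > 1. Positive logits are divided by
    the penalty; negative logits are multiplied by it."""
    out = list(logits)
    for token_id in set(prev_token_ids):
        score = out[token_id]
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out

# Tokens 0 (logit 2.0) and 1 (logit -1.0) were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5]
```

Note the asymmetry: a plain division would make already-negative logits *more* likely, which is why negative scores are multiplied instead.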
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3