## Major Changes
This version fixes the following major bugs:
- Memory leak with distributed execution (solved by using CuPy for collective communication).
- Broken support for Python 3.8.
Many smaller bug fixes are listed below.
## What's Changed
- Fixes assertion failure in prefix caching: the lora index mapping should respect `prefix_len` by @sighingnow in #2688
- fix some bugs about parameter description by @zspo in #2689
- [Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
- Add unit test for Mixtral MoE layer by @pcmoritz in #2677
- Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
- Add Internlm2 by @Leymore in #2666
- Fix compile error when using rocm by @zhaoyang-star in #2648
- fix python 3.8 syntax by @simon-mo in #2716
- Update README for meetup slides by @simon-mo in #2718
- Use revision when downloading the quantization config file by @Pernekhan in #2697
- remove hardcoded `device="cuda"` to support more devices by @jikunshang in #2503
- fix length_penalty default value to 1.0 by @zspo in #2667
- Add one example to run batch inference distributed on Ray by @c21 in #2696
- docs: update langchain serving instructions by @mspronesti in #2736
- Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
- Remove eos tokens from output by default by @zcnrex in #2611
- add requirement: triton >= 2.1.0 by @whyiug in #2746
- [Minor] Fix benchmark_latency by @WoosukKwon in #2765
- [ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
- Set local logging level via env variable by @gardberg in #2774
- [ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
- Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
- fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
- [Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
- [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
- Add documentation on how to do incremental builds by @pcmoritz in #2796
- [Ray] Integration compiled DAG off by default by @rkooo567 in #2471
- Disable custom all reduce by default by @WoosukKwon in #2808
- [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
- Add documentation section about LoRA by @pcmoritz in #2834
- Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
- Serving Benchmark Refactoring by @ywang96 in #2433
- [CI] Ensure documentation build is checked in CI by @simon-mo in #2842
- Refactor llama family models by @esmeetu in #2637
- Revert "Refactor llama family models" by @pcmoritz in #2851
- Use CuPy for CUDA graphs by @WoosukKwon in #2811
- Remove Yi model definition, please use `LlamaForCausalLM` instead by @pcmoritz in #2854
- Add LoRA support for Mixtral by @tterrysun in #2831
- Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
- Fix internlm after #2860 by @pcmoritz in #2861
- [Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
- Fix docker python version by @NikolaBorisov in #2845
- Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
- Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
- Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
- [BugFix] Fix GC bug for `LLM` class by @WoosukKwon in #2882
- Fix decilm.py by @pcmoritz in #2883
- [ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
- Prefix Caching- fix t4 triton error by @caoshiyi in #2517
- Bump up to v0.3.1 by @WoosukKwon in #2887
## New Contributors
- @sighingnow made their first contribution in #2688
- @rib-2 made their first contribution in #2316
- @Leymore made their first contribution in #2666
- @Pernekhan made their first contribution in #2697
- @jikunshang made their first contribution in #2503
- @c21 made their first contribution in #2696
- @zcnrex made their first contribution in #2611
- @whyiug made their first contribution in #2746
- @gardberg made their first contribution in #2774
- @dllehr-amd made their first contribution in #2627
- @rkooo567 made their first contribution in #2471
- @ywang96 made their first contribution in #2433
- @tterrysun made their first contribution in #2831
**Full Changelog**: v0.3.0...v0.3.1