## Major Changes
This version fixes the following major bugs:
- Memory leak with distributed execution (solved by using CuPy for collective communication).
- Broken support for Python 3.8.
Many smaller bug fixes are listed below.
## What's Changed
- Fixes assertion failure in prefix caching: the lora index mapping should respect `prefix_len` by @sighingnow in #2688
- fix some bugs about parameter description by @zspo in #2689
- [Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
- Add unit test for Mixtral MoE layer by @pcmoritz in #2677
- Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
- Add Internlm2 by @Leymore in #2666
- Fix compile error when using rocm by @zhaoyang-star in #2648
- fix python 3.8 syntax by @simon-mo in #2716
- Update README for meetup slides by @simon-mo in #2718
- Use revision when downloading the quantization config file by @Pernekhan in #2697
- remove hardcoded `device="cuda"` to support more devices by @jikunshang in #2503
- fix length_penalty default value to 1.0 by @zspo in #2667
- Add one example to run batch inference distributed on Ray by @c21 in #2696
- docs: update langchain serving instructions by @mspronesti in #2736
- Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
- Remove eos tokens from output by default by @zcnrex in #2611
- add requirement: triton >= 2.1.0 by @whyiug in #2746
- [Minor] Fix benchmark_latency by @WoosukKwon in #2765
- [ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
- Set local logging level via env variable by @gardberg in #2774
- [ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
- Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
- fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
- [Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
- [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
- Add documentation on how to do incremental builds by @pcmoritz in #2796
- [Ray] Integration compiled DAG off by default by @rkooo567 in #2471
- Disable custom all reduce by default by @WoosukKwon in #2808
- [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
- Add documentation section about LoRA by @pcmoritz in #2834
- Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
- Serving Benchmark Refactoring by @ywang96 in #2433
- [CI] Ensure documentation build is checked in CI by @simon-mo in #2842
- Refactor llama family models by @esmeetu in #2637
- Revert "Refactor llama family models" by @pcmoritz in #2851
- Use CuPy for CUDA graphs by @WoosukKwon in #2811
- Remove Yi model definition, please use `LlamaForCausalLM` instead by @pcmoritz in #2854
- Add LoRA support for Mixtral by @tterrysun in #2831
- Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
- Fix internlm after #2860 by @pcmoritz in #2861
- [Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
- Fix docker python version by @NikolaBorisov in #2845
- Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
- Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
- Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
- [BugFix] Fix GC bug for `LLM` class by @WoosukKwon in #2882
- Fix decilm.py by @pcmoritz in #2883
- [ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
- Prefix Caching- fix t4 triton error by @caoshiyi in #2517
- Bump up to v0.3.1 by @WoosukKwon in #2887
## New Contributors
- @sighingnow made their first contribution in #2688
- @rib-2 made their first contribution in #2316
- @Leymore made their first contribution in #2666
- @Pernekhan made their first contribution in #2697
- @jikunshang made their first contribution in #2503
- @c21 made their first contribution in #2696
- @zcnrex made their first contribution in #2611
- @whyiug made their first contribution in #2746
- @gardberg made their first contribution in #2774
- @dllehr-amd made their first contribution in #2627
- @rkooo567 made their first contribution in #2471
- @ywang96 made their first contribution in #2433
- @tterrysun made their first contribution in #2831
**Full Changelog**: v0.3.0...v0.3.1