## Major changes
- StarCoder2 support
- Performance optimization and LoRA support for Gemma
- 2/3/8-bit GPTQ support
- Performance optimization for MoE kernel
- [Experimental] AWS Inferentia2 support
- [Experimental] Structured output (JSON, Regex) in OpenAI Server (see the client sketch after this list)
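For the experimental structured-output highlight above, here is a minimal client-side sketch. It assumes a vLLM OpenAI-compatible server is already running locally and that the `guided_json` extra field from #2819 is accepted via the OpenAI client's `extra_body`; the model name and schema below are illustrative placeholders, not part of the release notes.

```python
# Hypothetical sketch: request JSON-constrained output from a running
# vLLM OpenAI-compatible server. Assumes the `guided_json` extra field
# from the guided-decoding PR; model name and schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON schema the generated text should conform to.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    extra_body={"guided_json": person_schema},  # vLLM-specific extension
)
print(completion.choices[0].message.content)
```

A `guided_regex` field can be passed the same way when a regular expression, rather than a JSON schema, should constrain the output.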
## What's Changed
- Update a comment in `benchmark_serving.py` by @ronensc in #2934
- Added early stopping to completion APIs by @Maxusmusti in #2939
- Migrate MistralForCausalLM to LlamaForCausalLM by @esmeetu in #2868
- Use Llama RMSNorm for Gemma by @WoosukKwon in #2974
- chore(vllm): codespell for spell checking by @mspronesti in #2820
- Optimize GeGLU layer in Gemma by @WoosukKwon in #2975
- [FIX] Fix issue #2904 by @44670 in #2983
- Remove Flash Attention in test env by @WoosukKwon in #2982
- Include tokens from prompt phase in `counter_generation_tokens` by @ronensc in #2802
- Fix nvcc not found in vllm-openai image by @zhaoyang-star in #2781
- [Fix] Fix assertion on Mistral YaRN model len by @WoosukKwon in #2984
- Port metrics from `aioprometheus` to `prometheus_client` by @hmellor in #2730
- Add LogProbs for Chat Completions in OpenAI by @jlcmoore in #2918
- Optimized fused MoE Kernel, take 2 by @pcmoritz in #2979
- [Minor] Remove gather_cached_kv kernel by @WoosukKwon in #3043
- [Minor] Remove unused config file by @esmeetu in #3039
- Fix using CuPy for eager mode by @esmeetu in #3037
- Fix stablelm by @esmeetu in #3038
- Support Orion model by @dachengai in #2539
- Fix `get_ip` error in pure IPv6 environment by @Jingru in #2931
- [Minor] Fix type annotation in fused MoE by @WoosukKwon in #3045
- Support logit bias for OpenAI API by @dylanwhawk in #3027
- [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM by @WoosukKwon in #3046
- Enables GQA support in the prefix prefill kernels by @sighingnow in #3007
- multi-lora documentation fix by @ElefHead in #3064
- Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs by @AllenDou in #3070
- Support inference with transformers-neuronx by @liangfu in #2569
- Add LoRA support for Gemma by @WoosukKwon in #3050 (see the usage sketch after this list)
- Add Support for 2/3/8-bit GPTQ Quantization Models by @chu-tianxiang in #2330
- Fix: `AttributeError` in OpenAI-compatible server by @jaywonchung in #3018
- Add cache_config's info to Prometheus metrics by @AllenDou in #3100
- Support starcoder2 architecture by @sh0416 in #3089
- Fix building from source on WSL by @aliencaocao in #3112
- [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams by @njhill in #3099
- Add guided decoding for OpenAI API server by @felixzhu555 in #2819
- Fix: Output text is always truncated in some models by @HyperdriveHustle in #3016
- Remove exclude_unset in streaming response by @sh0416 in #3143
- docs: Add tutorial on deploying vLLM model with KServe by @terrytangyuan in #2586
- fix relative import path of protocol.py by @Huarong in #3134
- Integrate Marlin Kernels for Int4 GPTQ inference by @robertgshaw2-neuralmagic in #2497
- Bump up to v0.3.3 by @WoosukKwon in #3129
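For the Gemma LoRA support added in #3050, the following is a minimal offline-inference sketch, assuming the existing vLLM LoRA API (`enable_lora` engine flag and `LoRARequest`); the model name and adapter path are placeholders.

```python
# Minimal sketch of LoRA inference with Gemma (names and paths are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora switches on adapter support at engine start-up.
llm = LLM(model="google/gemma-7b", enable_lora=True)

outputs = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(temperature=0.8, max_tokens=64),
    # Adapter name, integer id, and local path of the LoRA weights to apply.
    lora_request=LoRARequest("gemma-haiku-adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```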
## New Contributors
- @Maxusmusti made their first contribution in #2939
- @44670 made their first contribution in #2983
- @jlcmoore made their first contribution in #2918
- @dachengai made their first contribution in #2539
- @dylanwhawk made their first contribution in #3027
- @ElefHead made their first contribution in #3064
- @AllenDou made their first contribution in #3070
- @jaywonchung made their first contribution in #3018
- @sh0416 made their first contribution in #3089
- @aliencaocao made their first contribution in #3112
- @felixzhu555 made their first contribution in #2819
- @HyperdriveHustle made their first contribution in #3016
- @terrytangyuan made their first contribution in #2586
- @Huarong made their first contribution in #3134
**Full Changelog**: v0.3.2...v0.3.3