vllm-project/vllm v0.3.3


Major changes

  • StarCoder2 support
  • Performance optimization and LoRA support for Gemma
  • 2/3/8-bit GPTQ support
  • Integrated Marlin kernels for int4 GPTQ inference
  • Performance optimization for MoE kernel
  • [Experimental] AWS Inferentia2 support
  • [Experimental] Structured output (JSON, Regex) in OpenAI Server
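The experimental structured-output feature is requested through extra fields on the standard OpenAI chat-completions body served by vLLM's OpenAI-compatible server. A minimal sketch of such a request payload, assuming a running server; the model name and schema below are placeholders, and `guided_json`/`guided_regex` are the vLLM-specific extension fields:

```python
import json

# Hypothetical JSON schema to constrain the model's output to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Request body for POST /v1/chat/completions on a vLLM
# OpenAI-compatible server. "guided_json" is a vLLM-specific
# extension field; "guided_regex" takes a regex string instead.
payload = {
    "model": "my-served-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Describe a person as JSON."}
    ],
    "guided_json": schema,
}

body = json.dumps(payload)
```

Posting this body to `/v1/chat/completions` should constrain decoding so the response parses against the schema.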

Full Changelog: v0.3.2...v0.3.3
