vllm v0.8.3

Highlights

This release features 260 commits from 109 contributors, including 38 new contributors.

  • We are excited to announce Day 0 support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide; a minimal offline-inference sketch follows this list.
    • Please note that Llama 4 is currently supported only in the V1 engine.
  • V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.
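
A minimal offline-inference sketch for running Llama 4 Scout on the V1 engine. The Hugging Face repository id, GPU count, and context length below are assumptions; adjust them to your deployment.

```python
# Hedged sketch: run Llama 4 Scout offline on the V1 engine.
# The repository id, tensor-parallel size, and max_model_len are assumptions.
import os

os.environ["VLLM_USE_V1"] = "1"  # Llama 4 is currently supported only on V1

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
    tensor_parallel_size=8,   # adjust to the number of GPUs available
    max_model_len=8192,       # kept modest for a quick smoke test
)

outputs = llm.generate(
    ["Explain sliding window attention in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```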

Cluster Scale Serving

  • Single-node data parallel with API server support (#13923); a simplified replication sketch follows this list.
  • Multi-node offline DP+EP example (#15484)
  • Expert parallelism enhancements
    • CUTLASS grouped gemm fp8 MoE kernel (#13972)
    • Fused experts refactor (#15914)
    • Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
    • Adding support for fp8 gemm layer input in fp8 (#14578)
    • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
  • Support XpYd disaggregated prefill with MooncakeStore (#12957)
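
A simplified data-parallel sketch, as referenced above: it only replicates independent engines and shards prompts across them, and does not exercise the new DP/EP plumbing from #13923 / #15484. The process layout, GPU pinning, and tiny model are assumptions for illustration.

```python
# Illustrative only: one engine replica per data-parallel rank, prompts
# sharded round-robin. The real DP/EP support lives in the PRs listed above.
import multiprocessing as mp
import os


def dp_worker(rank: int, world_size: int, prompts: list[str]) -> None:
    # Pin the replica to its own GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

    from vllm import LLM, SamplingParams  # import after setting the env var

    llm = LLM(model="facebook/opt-125m")  # small model, sketch only
    shard = prompts[rank::world_size]     # this rank's slice of the work
    for out in llm.generate(shard, SamplingParams(max_tokens=32)):
        print(f"[rank {rank}] {out.outputs[0].text!r}")


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # avoid forking a CUDA context
    world_size = 2
    prompts = [f"Question {i}: what is {i} + {i}?" for i in range(8)]
    procs = [mp.Process(target=dp_worker, args=(r, world_size, prompts))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```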

Model Support

V1 Engine

  • Collective RPC (#15444)
  • Faster top-k only implementation (#15478)
  • BitsAndBytes support (#15611)
  • Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)
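
A hedged sketch of n-gram (prompt-lookup) speculative decoding on V1, the feature the metrics and interface updates above touch. The `speculative_config` keys follow the current documentation and may differ slightly at this version; treat them as assumptions and consult the speculative decoding guide.

```python
# Hedged sketch: n-gram speculative decoding on the V1 engine.
# The speculative_config keys below are assumptions for this release.
import os

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed target model
    speculative_config={
        "method": "ngram",            # draft tokens via prompt lookup
        "num_speculative_tokens": 5,  # tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram to match in the prompt
    },
)

out = llm.generate(
    ["Repeat after me: the quick brown fox jumps over the lazy dog."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```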

Features

API

  • Support Enum for xgrammar-based structured output in V1 (#15594, #15757); see the constrained-output sketch after this list.
  • A new tags parameter for wake_up (#15500)
  • V1 LoRA support CPU offload (#15843)
  • Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
  • Addition of HTTP service metrics (#15657)
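
A hedged sketch of enum-constrained output related to the structured-output items above. It uses the generic `choice` constraint from `GuidedDecodingParams` as a stand-in for the Enum support; the model and backend selection are assumptions, and whether decoding routes through xgrammar depends on the configured guided-decoding backend.

```python
# Hedged sketch: constrain generation to a fixed set of Enum values.
import os
from enum import Enum

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          guided_decoding_backend="xgrammar")  # backend name is an assumption

params = SamplingParams(
    max_tokens=8,
    guided_decoding=GuidedDecodingParams(choice=[s.value for s in Sentiment]),
)

out = llm.generate(["Review: 'The food was great!' Sentiment:"], params)
print(out[0].outputs[0].text)
```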

Performance

  • LoRA Scheduler optimization bridging V1 and V0 performance (#15422).
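
The scheduler optimization above targets multi-LoRA serving; a hedged sketch of that workload with the offline API follows. The base model, adapter name, and adapter path are assumptions.

```python
# Hedged sketch: the multi-LoRA workload the V1 scheduler optimization targets.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # assumed base model
    enable_lora=True,
    max_loras=4,        # adapters kept resident at once
    max_lora_rank=16,
)

sql_lora = LoRARequest("sql-adapter", 1, "/path/to/sql_lora_adapter")  # assumed path

out = llm.generate(
    ["Translate to SQL: list all users created this week."],
    SamplingParams(max_tokens=64),
    lora_request=sql_lora,
)
print(out[0].outputs[0].text)
```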

Hardware

  • AMD:
    • Add custom allreduce support for ROCM (#14125)
    • Quark quantization documentation (#15861)
    • AITER integration: int8 scaled gemm kernel (#15433), fused moe (#14967)
    • Paged attention for V1 (#15720)
  • CPU:
  • TPU:
    • Improve Memory Usage Estimation (#15671)
    • Optimize the all-reduce performance (#15903)
    • Support sliding window and logit soft capping in the paged attention kernel. (#15732)
    • TPU-optimized top-p implementation (avoids scattering). (#15736)

Doc, Build, Ecosystem

  • V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
  • Recommend developing with Python 3.12 in developer guide (#15811)
  • Clean up: move dockerfiles into their own directory (#14549)
  • Add minimum version for huggingface_hub to enable Xet downloads (#15873)
  • TPU CI: Add basic perf regression test (#15414)

What's Changed

  • Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in #15160
  • [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in #15409
  • [Kernel][CPU] CPU MLA by @gau-nernst in #14744
  • Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in #15402
  • [Misc] Clean up MiniCPM-V/O code by @DarkLight1337 in #15337
  • [Misc] Remove redundant num_embeds by @DarkLight1337 in #15443
  • [Doc] Update V1 user guide for multi-modality by @DarkLight1337 in #15460
  • [Kernel] Fix conflicting macro names for gguf kernels by @SzymonOzog in #15456
  • [bugfix] fix inductor cache on max_position_embeddings by @youkaichao in #15436
  • [CI/Build] Add tests for the V1 tpu_model_runner. by @yarongmu-google in #14843
  • [Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) by @oteroantoniogom in #15471
  • [bugfix] add supports_v1 platform interface by @joerunde in #15417
  • Add workaround for shared field_names in pydantic model class by @maxdebayser in #13925
  • [TPU][V1] Fix Sampler recompilation by @NickLucche in #15309
  • [V1][Minor] Use SchedulerInterface type for engine scheduler field by @njhill in #15499
  • [V1] Support long_prefill_token_threshold in v1 scheduler by @houseroad in #15419
  • [core] add bucket padding to tpu_model_runner by @Chenyaaang in #14995
  • [Core] LoRA: V1 Scheduler optimization by @varun-sundar-rabindranath in #15422
  • [CI/Build] LoRA: Delete long context tests by @varun-sundar-rabindranath in #15503
  • Transformers backend already supports V1 by @hmellor in #15463
  • [Model] Support multi-image for Molmo by @DarkLight1337 in #15438
  • [Misc] Warn about v0 in benchmark_paged_attn.py by @tlrmchlsmth in #15495
  • [BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) by @LucasWilkinson in #15492
  • [misc] LoRA - Skip LoRA kernels when not required by @varun-sundar-rabindranath in #15152
  • Fix raw_request extraction in load_aware_call decorator by @daniel-salib in #15382
  • [Feature] Enhance EAGLE Architecture with Proper RMS Norms by @luyuzhe111 in #14990
  • [FEAT][ROCm] Integrate Fused MoE Kernels from AITER by @vllmellm in #14967
  • [Misc] Enhance warning information to user-defined chat template by @wwl2755 in #15408
  • [Misc] improve example script output by @reidliu41 in #15528
  • Separate base model from TransformersModel by @hmellor in #15467
  • Apply torchfix by @cyyever in #15532
  • Improve validation of TP in Transformers backend by @hmellor in #15540
  • [Model] Add Reasoning Parser for Granite Models by @alex-jw-brooks in #14202
  • multi-node offline DP+EP example by @youkaichao in #15484
  • Fix weight loading for some models in Transformers backend by @hmellor in #15544
  • [Refactor] Remove passthrough backend when generate grammar by @aarnphm in #15317
  • [V1][Sampler] Faster top-k only implementation by @njhill in #15478
  • Support SHA256 as hash function in prefix caching by @dr75 in #15297
  • Applying some fixes for K8s agents in CI by @Alexei-V-Ivanov-AMD in #15493
  • [V1] TPU - Revert to exponential padding by default by @alexm-redhat in #15565
  • [V1] TPU CI - Fix test_compilation.py by @alexm-redhat in #15570
  • Use Cache Hinting for fused_moe kernel by @wrmedford in #15511
  • [TPU] support disabling xla compilation cache by @yaochengji in #15567
  • Support FIPS enabled machines with MD5 hashing by @MattTheCuber in #15299
  • [Kernel] CUTLASS grouped gemm fp8 MoE kernel by @ElizaWszola in #13972
  • Add automatic tpu label to mergify.yml by @mgoin in #15560
  • add platform check back by @Chenyaaang in #15578
  • [misc] LoRA: Remove unused long context test data by @varun-sundar-rabindranath in #15558
  • [Doc] Update V1 user guide for fp8 kv cache support by @wayzeng in #15585
  • [moe][quant] add weight name case for offset by @MengqingCao in #15515
  • [V1] Refactor num_computed_tokens logic by @comaniac in #15307
  • Allow torchao quantization in SiglipMLP by @jerryzh168 in #15575
  • [ROCm] Env variable to trigger custom PA by @gshtras in #15557
  • [TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS by @yaochengji in #15583
  • [Misc] Restrict ray version dependency and update PP feature warning in V1 by @ruisearch42 in #15556
  • [TPU] Avoid Triton Import by @robertgshaw2-redhat in #15589
  • [Misc] Consolidate LRUCache implementations by @Avabowler in #15481
  • [Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM by @robertgshaw2-redhat in #15587
  • [Misc] Clean up scatter_patch_features by @DarkLight1337 in #15559
  • [Misc] Use model_redirect to redirect the model name to a local folder. by @noooop in #14116
  • Fix incorrect filenames in vllm_compile_cache.py by @zou3519 in #15494
  • [Doc] update --system for transformers installation in docker doc by @reidliu41 in #15616
  • [Model] MiniCPM-V/O supports V1 by @DarkLight1337 in #15487
  • [Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 by @h-sugi in #15211
  • [Doc] Link to onboarding tasks by @DarkLight1337 in #15629
  • [Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs by @DarkLight1337 in #15620
  • [Feature] Add middleware to log API Server responses by @terrytangyuan in #15593
  • [Misc] Avoid direct access of global mm_registry in compute_encoder_budget by @DarkLight1337 in #15621
  • [Doc] Use absolute placement for Ask AI button by @hmellor in #15628
  • [Bugfix][TPU][V1] Fix recompilation by @NickLucche in #15553
  • Correct PowerPC to modern IBM Power by @clnperez in #15635
  • [CI] Update rules for applying tpu label. by @russellb in #15634
  • [V1] AsyncLLM data parallel by @njhill in #13923
  • [TPU] Lazy Import by @robertgshaw2-redhat in #15656
  • [Quantization][V1] BitsAndBytes support V1 by @jeejeelee in #15611
  • [Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. by @kebe7jun in #14948
  • [Doc] Fix dead links in Job Board by @wwl2755 in #15637
  • [CI][TPU] Temporarily Disable Quant Test on TPU by @robertgshaw2-redhat in #15649
  • Revert "Use Cache Hinting for fused_moe kernel (#15511)" by @wrmedford in #15645
  • [Misc]add coding benchmark for speculative decoding by @CXIAAAAA in #15303
  • [Quantization][FP8] Adding support for fp8 gemm layer input in fp8 by @gshtras in #14578
  • Refactor error handling for multiple exceptions in preprocessing by @JasonZhu1313 in #15650
  • [Bugfix] Fix mm_hashes forgetting to be passed by @DarkLight1337 in #15668
  • [V1] Remove legacy input registry by @DarkLight1337 in #15673
  • [TPU][CI] Fix TPUModelRunner Test by @robertgshaw2-redhat in #15667
  • [Refactor][Frontend] Keep all logic about reasoning into one class by @gaocegege in #14428
  • [CPU][CI] Improve CPU Dockerfile by @bigPYJ1151 in #15690
  • [Bugfix] Fix 'InductorAdaptor object has no attribute 'cache_dir' by @jeejeelee in #15674
  • [Misc] Fix test_sleep to use query parameters by @lizzzcai in #14373
  • [Bugfix][Frontend] Eliminate regex based check in reasoning full generator by @gaocegege in #14821
  • [Frontend] update priority for --api-key and VLLM_API_KEY by @reidliu41 in #15588
  • [Docs] Add "Generation quality changed" section to troubleshooting by @hmellor in #15701
  • [Model] Adding torch compile annotations to chatglm by @jeejeelee in #15624
  • [Bugfix][v1] xgrammar structured output supports Enum. by @chaunceyjiang in #15594
  • [Bugfix] embed_is_patch for Idefics3 by @DarkLight1337 in #15696
  • [V1] Support disable_any_whtespace for guidance backend by @russellb in #15584
  • [doc] add missing imports by @reidliu41 in #15699
  • [Bugfix] Fix regex compile display format by @kebe7jun in #15368
  • Fix cpu offload testing for gptq/awq/ct by @mgoin in #15648
  • [Minor] Remove TGI launching script by @WoosukKwon in #15646
  • [Misc] Remove unused utils and clean up imports by @DarkLight1337 in #15708
  • [Misc] Remove stale func in KVTransferConfig by @ShangmingCai in #14746
  • [TPU] [Perf] Improve Memory Usage Estimation by @robertgshaw2-redhat in #15671
  • [Bugfix] [torch.compile] Add Dynamo metrics context during compilation by @ProExpertProg in #15639
  • [V1] TPU - Fix the chunked prompt bug by @alexm-redhat in #15713
  • [Misc] cli auto show default value by @reidliu41 in #15582
  • implement prometheus fast-api-instrumentor for http service metrics by @daniel-salib in #15657
  • [Docs][V1] Optimize diagrams in prefix caching design by @simpx in #15716
  • [ROCm][AMD][Build] Update AMD supported arch list by @gshtras in #15632
  • [Model] Support Skywork-R1V by @pengyuange in #15397
  • [Docs] Document v0 engine support in reasoning outputs by @gaocegege in #15739
  • [Misc][V1] Misc code streamlining by @njhill in #15723
  • [Bugfix] LoRA V1: add and fix entrypoints tests by @varun-sundar-rabindranath in #15715
  • [CI] Speed up V1 structured output tests by @russellb in #15718
  • Use numba 0.61 for python 3.10+ to support numpy>=2 by @cyyever in #15692
  • [Bugfix] set VLLM_WORKER_MULTIPROC_METHOD=spawn for vllm.entrypoionts.openai.api_server by @jinzhen-lin in #15700
  • [TPU][V1][Bugfix] Fix w8a8 recompiilation with GSM8K by @NickLucche in #15714
  • [Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 by @yarongmu-google in #15659
  • [doc] update doc by @reidliu41 in #15740
  • [FEAT] [ROCm] Add AITER int8 scaled gemm kernel by @tjtanaa in #15433
  • [V1] [Feature] Collective RPC by @wwl2755 in #15444
  • [Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore by @ShangmingCai in #12957
  • [V1] Support interleaved modality items by @ywang96 in #15605
  • [V1][Minor] Simplify rejection sampler's parse_output by @WoosukKwon in #15741
  • [Bugfix] Fix Mllama interleaved images input support by @Isotr0py in #15564
  • [CI] xgrammar structured output supports Enum. by @chaunceyjiang in #15757
  • [Bugfix] Fix Mistral guided generation using xgrammar by @juliendenize in #15704
  • [doc] update conda to usage link in installation by @reidliu41 in #15761
  • fix test_phi3v by @pansicheng in #15321
  • [V1] Override mm_counts for dummy data creation by @DarkLight1337 in #15703
  • fix: lint fix a ruff checkout syntax error by @yihong0618 in #15767
  • [Bugfix] Added embed_is_patch mask for fuyu model by @kylehh in #15731
  • fix: Comments to English for better dev experience by @yihong0618 in #15768
  • [V1][Scheduler] Avoid calling _try_schedule_encoder_inputs for every request by @WoosukKwon in #15778
  • [Misc] update the comments by @lcy4869 in #15780
  • [Benchmark] Update Vision Arena Dataset and HuggingFaceDataset Setup by @JenZhao in #15748
  • [Feature][ROCm]Enable fusion pass for torch.compile on ROCm by @charlifu in #15050
  • Recommend developing with Python 3.12 in developer guide by @hmellor in #15811
  • fix: better install requirement for install in setup.py by @yihong0618 in #15796
  • [V1] Fully Transparent Implementation of CPU Offloading by @youkaichao in #15354
  • [Model] Update support for NemotronNAS models by @Naveassaf in #15008
  • [Bugfix] Fix Crashing When Loading Modules With Batchnorm Stats by @alex-jw-brooks in #15813
  • [Bugfix] Fix missing return value in load_weights method of adapters.py by @noc-turne in #15542
  • Upgrade transformers to v4.50.3 by @hmellor in #13905
  • [Bugfix] Check dimensions of multimodal embeddings in V1 by @DarkLight1337 in #15816
  • [V1][Spec Decode] Remove deprecated spec decode config params by @ShangmingCai in #15466
  • fix: change GB to GiB in logging close #14979 by @yihong0618 in #15807
  • [V1] TPU CI - Add basic perf regression test by @alexm-redhat in #15414
  • Fix Transformers backend compatibility check by @hmellor in #15290
  • [V1][Core] Remove unused speculative config from scheduler by @markmc in #15818
  • Move dockerfiles into their own directory by @hmellor in #14549
  • [Distributed] Add custom allreduce support for ROCM by @ilmarkov in #14125
  • Rename fallback model and refactor supported models section by @hmellor in #15829
  • [Frontend] Add Phi-4-mini function calling support by @kinfey in #14886
  • [Bugfix][Model] fix mllama multi-image by @yma11 in #14883
  • [Bugfix] Fix extra comma by @haochengxia in #15851
  • [Bugfix]: Fix is_embedding_layer condition in VocabParallelEmbedding by @alexwl in #15824
  • [V1] TPU - Fix fused MOE by @alexm-redhat in #15834
  • [sleep mode] clear pytorch cache after sleep by @lionelvillard in #15248
  • [ROCm] Use device name in the warning by @gshtras in #15838
  • [V1] Implement sliding window attention in kv_cache_manager by @heheda12345 in #14097
  • fix: can not use uv run collect_env close #13888 by @yihong0618 in #15792
  • [Feature] specify model in config.yaml by @wayzeng in #15798
  • [Misc] Enable V1 LoRA by default by @varun-sundar-rabindranath in #15320
  • [Misc] Fix speculative config repr string by @ShangmingCai in #15860
  • [Docs] Fix small error in link text by @hmellor in #15868
  • [Bugfix] Fix no video/image profiling edge case for MultiModalDataParser by @Isotr0py in #15828
  • [Misc] Use envs.VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE by @ruisearch42 in #15831
  • setup correct nvcc version with CUDA_HOME by @chenyang78 in #15725
  • [Model] Support Mistral3 in the HF Transformers format by @mgoin in #15505
  • [Misc] remove unused script by @reidliu41 in #15746
  • Remove format.sh as it's been unsupported >70 days by @hmellor in #15884
  • [New Model]: jinaai/jina-reranker-v2-base-multilingual by @noooop in #15876
  • [Doc] Quark quantization documentation by @cha557 in #15861
  • Reinstate format.sh and make pre-commit installation simpler by @hmellor in #15890
  • [Misc] Allow using OpenCV as video IO fallback by @Isotr0py in #15055
  • [ROCm][Build][Bugfix] Bring the base dockerfile in sync with the ROCm fork by @gshtras in #15820
  • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. by @bnellnm in #13932
  • [CI/Build] Clean up LoRA tests by @jeejeelee in #15867
  • [Model] Aya Vision by @JenZhao in #15441
  • [Model] Add module name prefixes to gemma3 by @cloud11665 in #15889
  • [CI] Disable flaky structure decoding test temporarily. by @ywang96 in #15892
  • [V1][Metrics] Initial speculative decoding metrics by @markmc in #15151
  • [V1][Spec Decode] Implement Eagle Proposer [1/N] by @WoosukKwon in #15729
  • [Docs] update usage stats language by @simon-mo in #15898
  • [BugFix] make sure socket close by @yihong0618 in #15875
  • [Model][MiniMaxText01] Support MiniMaxText01 model inference by @ZZBoom in #13454
  • [Docs] Add Ollama meetup slides by @simon-mo in #15905
  • [Docs] Add Intel as Sponsor by @simon-mo in #15913
  • Fix input triton kernel for eagle by @ekagra-ranjan in #15909
  • [V1] Fix: make sure k_index is int64 for apply_top_k_only by @b8zhong in #15907
  • [Bugfix] Fix imports for MoE on CPU by @gau-nernst in #15841
  • [V1][Minor] Enhance SpecDecoding Metrics Log in V1 by @WoosukKwon in #15902
  • [Doc] Update rocm.inc.md by @chun37 in #15917
  • [V1][Bugfix] Fix typo in MoE TPU checking by @ywang96 in #15927
  • [Benchmark]Fix error message by @Potabk in #15866
  • [Misc] Replace print with logger by @chaunceyjiang in #15923
  • [CI/Build] Further clean up LoRA tests by @jeejeelee in #15920
  • [Bugfix] Fix cache block size calculation for CPU MLA by @gau-nernst in #15848
  • [Build/CI] Update lm-eval to 0.4.8 by @cthi in #15912
  • [Kernel] Add more dtype support for GGUF dequantization by @LukasBluebaum in #15879
  • [core] Add tags parameter to wake_up() by @erictang000 in #15500
  • [V1] Fix json_object support with xgrammar by @russellb in #15488
  • Add minimum version for huggingface_hub to enable Xet downloads by @hmellor in #15873
  • [Bugfix][Benchmarks] Ensure async_request_deepspeed_mii uses the OpenAI choices key by @b8zhong in #15926
  • [CI] Remove duplicate entrypoints-test by @yankay in #15940
  • [Bugfix] Fix the issue where the model name is empty string, causing no response with the model name. by @chaunceyjiang in #15938
  • [Metrics] Hide deprecated metrics by @markmc in #15458
  • [Frontend] Implement Tool Calling with tool_choice='required' by @meffmadd in #13483
  • [CPU][Bugfix] Using custom allreduce for CPU backend by @bigPYJ1151 in #15934
  • [Model] use AutoWeightsLoader in model load_weights by @lengrongfu in #15770
  • [Misc] V1 LoRA support CPU offload by @jeejeelee in #15843
  • Restricted cmake to be less than version 4 as 4.x breaks the build of… by @npanpaliya in #15859
  • [misc] instruct pytorch to use nvml-based cuda check by @youkaichao in #15951
  • [V1] Support Mistral3 in V1 by @mgoin in #15950
  • Fix huggingface-cli[hf-xet] -> huggingface-cli[hf_xet] by @hmellor in #15969
  • [V1][TPU] TPU-optimized top-p implementation (avoids scattering). by @hyeygit in #15736
  • [TPU] optimize the all-reduce performance by @yaochengji in #15903
  • [V1][TPU] Do not compile sampling more than needed by @NickLucche in #15883
  • [ROCM][KERNEL] Paged attention for V1 by @maleksan85 in #15720
  • fix: better error message for get_config close #13889 by @yihong0618 in #15943
  • [bugfix] add seed in torchrun_example.py by @youkaichao in #15980
  • [ROCM][V0] PA kennel selection when no sliding window provided by @maleksan85 in #15982
  • [Benchmark] Add AIMO Dataset to Benchmark by @StevenShi-23 in #15955
  • [misc] improve error message for "Failed to infer device type" by @youkaichao in #15994
  • [Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process by @wwl2755 in #15367
  • [doc] update contribution link by @reidliu41 in #15922
  • fix: tiny fix make format.sh excutable by @yihong0618 in #16015
  • [SupportsQuant] Bert, Blip, Blip2, Bloom by @kylesayrs in #15573
  • [SupportsQuant] Chameleon, Chatglm, Commandr by @kylesayrs in #15952
  • [Neuron][kernel] Fuse kv cache into a single tensor by @liangfu in #15911
  • [Minor] Fused experts refactor by @bnellnm in #15914
  • [Misc][Performance] Advance tpu.txt to the most recent nightly torch … by @yarongmu-google in #16024
  • Re-enable the AMD Testing for the passing tests. by @Alexei-V-Ivanov-AMD in #15586
  • [TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. by @vanbasten23 in #15732
  • [TPU] Switch Test to Non-Sliding Window by @robertgshaw2-redhat in #15981
  • [Bugfix] Fix function names in test_block_fp8.py by @bnellnm in #16033
  • [ROCm] Tweak the benchmark script to run on ROCm by @huydhn in #14252
  • [Misc] improve gguf check by @reidliu41 in #15974
  • [TPU][V1] Remove ragged attention kernel parameter hard coding by @yaochengji in #16041
  • doc: add info for macos clang errors by @yihong0618 in #16049
  • [V1][Spec Decode] Avoid logging useless nan metrics by @markmc in #16023
  • [Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt by @jonghyunchoe in #15939
  • [Hardware][Gaudi][BugFix] fix arguments of hpu fused moe by @zhenwei-intel in #15945
  • [Bugfix][kernels] Fix half2float conversion in gguf kernels by @Isotr0py in #15995
  • [Benchmark][Doc] Update throughput benchmark and README by @StevenShi-23 in #15998
  • [CPU] Change default block_size for CPU backend by @bigPYJ1151 in #16002
  • [Distributed] [ROCM] Fix custom allreduce enable checks by @ilmarkov in #16010
  • [ROCm][Bugfix] Use platform specific FP8 dtype by @gshtras in #15717
  • [ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only by @gshtras in #15413
  • [Bugfix] Fix default behavior/fallback for pp in v1 by @mgoin in #16057
  • [CI] Reorganize .buildkite directory by @khluu in #16001
  • [V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue by @njhill in #15906
  • [V1] Scatter and gather placeholders in the model runner by @DarkLight1337 in #15712
  • Revert "[V1] Scatter and gather placeholders in the model runner" by @ywang96 in #16075
  • [Kernel][Bugfix] Re-fuse triton moe weight application by @bnellnm in #16071
  • [Bugfix][TPU] Fix V1 TPU worker for sliding window by @mgoin in #16059
  • [V1][Spec Decode] Update N-gram Proposer Interface by @WoosukKwon in #15750
  • [Model] Support Llama4 in vLLM by @houseroad in #16104

New Contributors

  • @Shafi-Hussain made their first contribution in #15402
  • @oteroantoniogom made their first contribution in #15471
  • @cyyever made their first contribution in #15532
  • @dr75 made their first contribution in #15297
  • @wrmedford made their first contribution in #15511
  • @MattTheCuber made their first contribution in #15299
  • @jerryzh168 made their first contribution in #15575
  • @Avabowler made their first contribution in #15481
  • @zou3519 made their first contribution in #15494
  • @h-sugi made their first contribution in #15211
  • @clnperez made their first contribution in #15635
  • @kebe7jun made their first contribution in #14948
  • @CXIAAAAA made their first contribution in #15303
  • @lizzzcai made their first contribution in #14373
  • @simpx made their first contribution in #15716
  • @pengyuange made their first contribution in #15397
  • @pansicheng made their first contribution in #15321
  • @lcy4869 made their first contribution in #15780
  • @Naveassaf made their first contribution in #15008
  • @noc-turne made their first contribution in #15542
  • @ilmarkov made their first contribution in #14125
  • @kinfey made their first contribution in #14886
  • @haochengxia made their first contribution in #15851
  • @alexwl made their first contribution in #15824
  • @lionelvillard made their first contribution in #15248
  • @cha557 made their first contribution in #15861
  • @cloud11665 made their first contribution in #15889
  • @ZZBoom made their first contribution in #13454
  • @ekagra-ranjan made their first contribution in #15909
  • @chun37 made their first contribution in #15917
  • @cthi made their first contribution in #15912
  • @LukasBluebaum made their first contribution in #15879
  • @erictang000 made their first contribution in #15500
  • @yankay made their first contribution in #15940
  • @meffmadd made their first contribution in #13483
  • @lengrongfu made their first contribution in #15770
  • @StevenShi-23 made their first contribution in #15955

Full Changelog: v0.8.2...v0.8.3
