vllm v0.5.1

Highlights

  • vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120). You can now run the API server with --pipeline-parallel-size; see the example below. This feature is in its early stages; please let us know your feedback.
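    For example (the model name is illustrative, and early pipeline-parallel support may require the Ray distributed backend):

        python -m vllm.entrypoints.openai.api_server \
            --model meta-llama/Meta-Llama-3-8B-Instruct \
            --pipeline-parallel-size 2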

Model Support

  • Support Gemma 2 (#5908, #6051). Please note that for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded here
  • Support Jamba (#4115). This is vLLM's first state space model!
  • Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented; we are looking for contributions!
  • Vision-language models: added support for Phi-3-Vision, dynamic image sizes, and a registry for processing model inputs (#4986, #5276, #5214); a serving example follows this list
    • Notably, this includes a breaking change: all VLM-specific arguments have been removed from the engine APIs, so they no longer need to be set globally via the CLI. You now only need to include <image> in the prompt instead of using complicated prompt formatting. See more here
    • There is also a new guide on adding VLMs! We would love your contributions of new models!
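    A minimal sketch of querying a vision-language model through the OpenAI-compatible server, one of the VLM-serving improvements in this release (the model name, image URL, and question are illustrative; the server must already be running with a supported VLM and, for some models, a chat template):

        # Sketch only: send an image plus a text question to a vLLM-served VLM
        # via the OpenAI-compatible API. All names and URLs are placeholders.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

        response = client.chat.completions.create(
            model="llava-hf/llava-1.5-7b-hf",  # whichever VLM the server was launched with
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is shown in this image?"},
                    {"type": "image_url",
                     "image_url": {"url": "https://example.com/image.jpg"}},
                ],
            }],
            max_tokens=64,
        )
        print(response.choices[0].message.content)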

Hardware Support

Production Service

  • Support for sharded tensorized models (#4990)
  • Continuous streaming of OpenAI response token stats (#5742); see the sketch after this list
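    A minimal sketch of requesting streamed usage stats from the completions endpoint (include_usage is the standard OpenAI stream option; the continuous_usage_stats field name is an assumption about the per-chunk option this change exposes, and the model name and prompt are illustrative):

        # Sketch only: stream a completion and print the per-chunk "usage" object.
        import json
        import requests

        payload = {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative
            "prompt": "Write a haiku about GPUs.",
            "max_tokens": 32,
            "stream": True,
            # include_usage is the standard option; continuous_usage_stats (per-chunk
            # usage) is assumed to be the knob introduced here and may differ.
            "stream_options": {"include_usage": True, "continuous_usage_stats": True},
        }

        with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as r:
            for line in r.iter_lines():
                if line and line.startswith(b"data: ") and line != b"data: [DONE]":
                    chunk = json.loads(line[len(b"data: "):])
                    print(chunk.get("usage"))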

Performance

  • Enhancement in distributed communication via shared memory (#5399)
  • Latency enhancement in block manager (#5584)
  • Enhancements to compressed-tensors supporting Marlin, W4A16 (#5435, #5385)
  • Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
  • Option to use FlashInfer for prefill and decode, with CUDA Graph support for decode (#4628); see the sketch after this list
  • Speculative Decoding
    • Draft Model Runner (#5799)
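    A minimal sketch of opting into the FlashInfer backend, which both the Gemma 2 note above and this change rely on (assumes the FlashInfer wheels are installed; the model and prompt are illustrative):

        # Sketch only: select the FlashInfer attention backend via an environment
        # variable before constructing the engine, then run a short greedy generation.
        import os

        os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

        from vllm import LLM, SamplingParams

        llm = LLM(model="google/gemma-2-9b-it")
        params = SamplingParams(temperature=0.0, max_tokens=32)
        outputs = llm.generate(["The three primary colors are"], params)
        print(outputs[0].outputs[0].text)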

Development Productivity

  • Post-merge benchmarks are now available at perf.vllm.ai!
  • Addition of A100 in CI environment (#5658)
  • Step towards nightly wheel publication (#5610)

What's Changed

  • [CI/Build] Add is_quant_method_supported to control quantization test configurations by @mgoin in #5253
  • Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" by @simon-mo in #5463
  • [CI] Upgrade codespell version. by @rkooo567 in #5381
  • [Hardware] Initial TPU integration by @WoosukKwon in #5292
  • [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in #5402
  • [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in #5464
  • [Kernel] Vectorized FP8 quantize kernel by @comaniac in #5396
  • [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in #5444
  • [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in #4990
  • [misc] add hint for AttributeError by @youkaichao in #5462
  • [Doc] Update debug docs by @DarkLight1337 in #5438
  • [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in #5470
  • [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in #5425
  • [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in #5451
  • [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in #5293
  • [ci] Use sccache to build images by @khluu in #5419
  • [Bugfix]if the content is started with ":"(response of ping), client should i… by @sywangyi in #5303
  • [Kernel] w4a16 support for compressed-tensors by @dsikka in #5385
  • [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by @mgoin in #5466
  • [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in #5497
  • [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in #4971
  • [Docs] Add 4th meetup slides by @WoosukKwon in #5509
  • [Misc] Add vLLM version getter to utils by @DarkLight1337 in #5098
  • [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in #5100
  • [Doc] Update LLaVA docs by @DarkLight1337 in #5437
  • [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in #5391
  • [MISC] Remove FP8 warning by @comaniac in #5472
  • Seperate dev requirements into lint and test by @Yard1 in #5474
  • Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in #5478
  • [misc] fix format.sh by @youkaichao in #5511
  • [CI/Build] Disable test_fp8.py by @tlrmchlsmth in #5508
  • [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in #5505
  • Add cuda_device_count_stateless by @Yard1 in #5473
  • [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in #5452
  • [Bugfix]typofix by @AllenDou in #5507
  • bump version to v0.5.0.post1 by @simon-mo in #5522
  • [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label by @KuntaiDu in #5073
  • [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in #5529
  • [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in #5516
  • [Misc] Fix arg names by @AllenDou in #5524
  • [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in #5432
  • [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in #5401
  • [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in #5546
  • [Core] Remove duplicate processing in async engine by @DarkLight1337 in #5525
  • [misc][distributed] fix benign error in is_in_the_same_node by @youkaichao in #5512
  • [Docs] Add ZhenFund as a Sponsor by @simon-mo in #5548
  • [Doc] Update documentation on Tensorizer by @sangstar in #5471
  • [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in #5460
  • [Bugfix] Fix typo in Pallas backend by @WoosukKwon in #5558
  • [Core][Distributed] improve p2p cache generation by @youkaichao in #5528
  • Add ccache to amd by @simon-mo in #5555
  • [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #5364
  • [mypy] Enable type checking for test directory by @DarkLight1337 in #5017
  • [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in #5568
  • [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in #5538
  • add gptq_marlin test for bug report #5088 by @alexm-neuralmagic in #5145
  • [BugFix] Don't start a Ray cluster when not using Ray by @njhill in #5570
  • [Fix] Correct OpenAI batch response format by @zifeitong in #5554
  • Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in #5518
  • [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in #5577
  • [build][misc] limit numpy version by @youkaichao in #5582
  • [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in #5581
  • Fix w8a8 benchmark and add Llama-3-8B by @comaniac in #5562
  • [Model] Rename Phi3 rope scaling type by @garg-amit in #5595
  • Correct alignment in the seq_len diagram. by @CharlesRiggins in #5592
  • [Kernel] compressed-tensors marlin 24 support by @dsikka in #5435
  • [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in #5588
  • [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in #3814
  • [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in #5574
  • [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in #5571
  • [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in #5604
  • [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in #5584
  • [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in #5142
  • [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in #5606
  • [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in #5131
  • [Model] Initialize Phi-3-vision support by @Isotr0py in #4986
  • [Kernel] Add punica dimensions for Granite 13b by @joerunde in #5559
  • [misc][typo] fix typo by @youkaichao in #5620
  • [Misc] Fix typo by @DarkLight1337 in #5618
  • [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in #5615
  • [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in #5612
  • [Misc] Remove import from transformers logging by @CatherineSue in #5625
  • [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in #5623
  • [ci] Deprecate original CI template by @khluu in #5624
  • [Misc] Add OpenTelemetry support by @ronensc in #4687
  • [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in #5542
  • [ci] Setup Release pipeline and build release wheels with cache by @khluu in #5610
  • [Model] LoRA support added for command-r by @sergey-tinkoff in #5178
  • [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in #5639
  • [Doc] Added cerebrium as Integration option by @milo157 in #5553
  • [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in #5642
  • [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in #5643
  • [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in #5628
  • [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in #5659
  • [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in #5641
  • [misc][distributed] use localhost for single-node by @youkaichao in #5619
  • [Model] Add FP8 kv cache for Qwen2 by @mgoin in #5656
  • [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in #5684
  • [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in #5629
  • [CI/Build] Add tqdm to dependencies by @DarkLight1337 in #5680
  • [ci] Add A100 queue into AWS CI template by @khluu in #5648
  • [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in #5688
  • [ci][distributed] add tests for custom allreduce by @youkaichao in #5689
  • [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in #5654
  • [Doc] Update docker references by @rafvasq in #5614
  • [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in #5650
  • [ci] Limit num gpus if specified for A100 by @khluu in #5694
  • [Misc] Improve conftest by @DarkLight1337 in #5681
  • [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in #5703
  • [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in #5514
  • [Model] Port over CLIPVisionModel for VLMs by @ywang96 in #5591
  • [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in #5275
  • [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in #5715
  • [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in #5718
  • [distributed][misc] use fork by default for mp by @youkaichao in #5669
  • [Model] MLPSpeculator speculative decoding support by @JRosenkranz in #4947
  • [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in #5441
  • [BugFix] Fix test_phi3v.py by @CatherineSue in #5725
  • [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in #5665
  • [Core][Distributed] add shm broadcast by @youkaichao in #5399
  • [Kernel][CPU] Add Quick gelu to CPU by @ywang96 in #5717
  • [Doc] Documentation on supported hardware for quantization methods by @mgoin in #5745
  • [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in #5668
  • [ci][test] fix ca test in main by @youkaichao in #5746
  • [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in #5603
  • [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in #5616
  • [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in #5710
  • [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in #5756
  • [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in #5760
  • [Docs][TPU] Add installation tip for TPU by @WoosukKwon in #5761
  • [core][distributed] improve shared memory broadcast by @youkaichao in #5754
  • [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in #5744
  • [Distributed] Add send and recv helpers by @andoorve in #5719
  • [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in #5772
  • [doc][faq] add warning to download models for every nodes by @youkaichao in #5783
  • [Doc] Add "Suggest edit" button to doc pages by @mgoin in #5789
  • [Doc] Add Phi-3-medium to list of supported models by @mgoin in #5788
  • [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in #5795
  • [ci] Remove aws template by @khluu in #5757
  • [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in #5818
  • [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in #5414
  • [Misc] Remove useless code in cpu_worker by @DamonFool in #5824
  • [Core] Add fault tolerance for RayTokenizerGroupPool by @Yard1 in #5748
  • [doc][distributed] add both gloo and nccl tests by @youkaichao in #5834
  • [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in #5798
  • [Misc] Update w4a16 compressed-tensors support to include w8a16 by @dsikka in #5794
  • [Hardware][TPU] Refactor TPU backend by @WoosukKwon in #5831
  • [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in #5422
  • [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in #5850
  • [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in #5791
  • [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in #5841
  • [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in #5408
  • [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in #5832
  • [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in #5801
  • [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in #5829
  • [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in #5860
  • [CI/Build] Refactor image test assets by @DarkLight1337 in #5821
  • [Kernel] Adding bias epilogue support for cutlass_scaled_mm by @ProExpertProg in #5560
  • [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in #5054
  • [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in #5855
  • [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in #5869
  • Support CPU inference with VSX PowerPC ISA by @ChipKerchner in #5652
  • [doc] update usage of env var to avoid conflict by @youkaichao in #5873
  • [Misc] Add example for LLaVA-NeXT by @ywang96 in #5879
  • [BugFix] Fix cuda graph for MLPSpeculator by @njhill in #5875
  • [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in #5887
  • [VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly by @xwjiang2010 in #5880
  • [Model] Add base class for LoRA-supported models by @DarkLight1337 in #5018
  • [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in #5888
  • [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in #5526
  • [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in #5896
  • [doc][misc] add note for Kubernetes users by @youkaichao in #5916
  • [BugFix] Fix MLPSpeculator handling of num_speculative_tokens by @njhill in #5876
  • [BugFix] Fix min_tokens behaviour for multiple eos tokens by @njhill in #5849
  • [CI/Build] Fix Args for _get_logits_warper in Sampler Test by @ywang96 in #5922
  • [Model] Add Gemma 2 by @WoosukKwon in #5908
  • [core][misc] remove logical block by @youkaichao in #5882
  • [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in #5932
  • [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in #5878
  • [VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. by @xwjiang2010 in #5905
  • [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in #5956
  • [Core] Registry for processing model inputs by @DarkLight1337 in #5214
  • Unmark fused_moe config json file as executable by @tlrmchlsmth in #5960
  • [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in #5379
  • [Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high by @tdoublep in #5894
  • [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in #5904
  • [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in #5927
  • [Spec Decode] Introduce DraftModelRunner by @comaniac in #5799
  • [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in #5931
  • [ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in #5928
  • [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in #5921
  • Support Deepseek-V2 by @zwd003 in #4650
  • [Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled by @mgoin in #5936
  • Unmark more files as executable by @tlrmchlsmth in #5962
  • [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError by @robertgshaw2-neuralmagic in #5963
  • [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode by @LiuXiaoxuanPKU in #4628
  • [Bugfix][TPU] Fix TPU sampler output by @WoosukKwon in #5978
  • [Bugfix][TPU] Fix pad slot id by @WoosukKwon in #5977
  • [Bugfix] fix missing last itl in openai completions benchmark by @mcalman in #5926
  • [Misc] Extend vLLM Metrics logging API by @SolitaryThinker in #5925
  • [Kernel] Add punica dimensions for Granite 3b and 8b by @joerunde in #5930
  • [Bugfix] Fix precisions in Gemma 1 by @WoosukKwon in #5913
  • [Misc] Update Phi-3-Vision Example by @ywang96 in #5981
  • [Bugfix] Support eos_token_id from config.json by @DarkLight1337 in #5954
  • [Core] Optimize SequenceStatus.is_finished by switching to IntEnum by @Yard1 in #5974
  • [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k by @comaniac in #5939
  • [ CI/Build ] Added E2E Test For Compressed Tensors by @robertgshaw2-neuralmagic in #5839
  • [CI/Build] Add TP test for vision models by @DarkLight1337 in #5892
  • [ CI/Build ] LM Eval Harness Based CI Testing by @robertgshaw2-neuralmagic in #5838
  • [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests by @mawong-amd in #5949
  • [CI/Build] Temporarily Remove Phi3-Vision from TP Test by @ywang96 in #5989
  • [CI/Build] Reuse code for checking output consistency by @DarkLight1337 in #5988
  • [CI/Build] [3/3] Reorganize entrypoints tests by @DarkLight1337 in #5966
  • [ci][distributed] fix some cuda init that makes it necessary to use spawn by @youkaichao in #5991
  • [Frontend]: Support base64 embedding by @llmpros in #5935
  • [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. by @rkooo567 in #5909
  • [ CI ] Temporarily Disable Large LM-Eval Tests by @robertgshaw2-neuralmagic in #6005
  • [Misc] Fix get_min_capability by @dsikka in #5971
  • [ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) by @robertgshaw2-neuralmagic in #5940
  • [misc][cuda] use nvml query to avoid accidentally cuda initialization by @youkaichao in #6007
  • [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker by @sroy745 in #5348
  • [ CI ] Re-enable Large Model LM Eval by @robertgshaw2-neuralmagic in #6031
  • [doc][misc] remove deprecated api server in doc by @youkaichao in #6037
  • [Misc] update benchmark backend for scalellm by @zhyncs in #6018
  • [doc][misc] further lower visibility of simple api server by @youkaichao in #6041
  • [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool by @Yard1 in #6039
  • [Bugfix] adding chunking mechanism to fused_moe to handle large inputs by @avshalomman in #6029
  • add FAQ doc under 'serving' by @llmpros in #5946
  • [Bugfix][Doc] Fix Doc Formatting by @ywang96 in #6048
  • [Bugfix] Add explicit end_forward calls to flashinfer by @Yard1 in #6044
  • [BugFix] Ensure worker model loop is always stopped at the right time by @njhill in #5987
  • [Frontend] Relax api url assertion for openai benchmarking by @jamestwhedbee in #6046
  • [Model] Changes to MLPSpeculator to support tie_weights and input_scale by @tdoublep in #5965
  • [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) by @alexm-neuralmagic in #5602
  • [Frontend] Add template related params to request by @danieljannai21 in #5709
  • [VLM] Remove image_input_type from VLM config by @xwjiang2010 in #5852
  • [Doc] Reinstate doc dependencies by @DarkLight1337 in #6061
  • [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) by @sirejdua in #6050
  • [Core] Pipeline Parallel Support by @andoorve in #4412
  • Update conftest.py by @robertgshaw2-neuralmagic in #6076
  • [ Misc ] Refactor MoE to isolate Fp8 From Mixtral by @robertgshaw2-neuralmagic in #5970
  • [CORE] Quantized lm-head Framework by @Qubitium in #4442
  • [Model] Jamba support by @mzusman in #4115
  • [hardware][misc] introduce platform abstraction by @youkaichao in #6080
  • [Core] Dynamic image size support for VLMs by @DarkLight1337 in #5276
  • [CI] Fix base url doesn't strip "/" by @rkooo567 in #6087
  • [BugFix] Avoid unnecessary Ray import warnings by @njhill in #6079
  • [misc][distributed] error on invalid state by @youkaichao in #6092
  • [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API by @ywang96 in #6091
  • [Doc] Fix Mock Import by @ywang96 in #6094
  • [Bugfix] Fix compute_logits in Jamba by @ywang96 in #6093
  • [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin by @mgoin in #5975
  • [core][distributed] allow custom allreduce when pipeline parallel size > 1 by @youkaichao in #6117
  • [vlm] Remove vision language config. by @xwjiang2010 in #6089
  • [ Misc ] Clean Up CompressedTensorsW8A8 by @robertgshaw2-neuralmagic in #6113
  • [doc][misc] bump up py version in installation doc by @youkaichao in #6119
  • [core][distributed] support layer size undividable by pp size in pipeline parallel inference by @youkaichao in #6115
  • [Bugfix] set OMP_NUM_THREADS to 1 by default when using the multiproc_gpu_executor by @tjohnson31415 in #6109
  • [Distributed][Core] Support Py39 and Py38 for PP by @andoorve in #6120
  • [CI/Build] Cleanup VLM tests by @DarkLight1337 in #6107
  • [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention by @gshtras in #6043
  • [misc][doc] try to add warning for latest html by @youkaichao in #5979
  • [Hardware][Intel CPU] Adding intel openmp tunings in Docker file by @zhouyuan in #6008
  • [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer by @LiuXiaoxuanPKU in #6051
  • [VLM] Calculate maximum number of multi-modal tokens by model by @DarkLight1337 in #6121
  • [VLM] Improve consistency between feature size calculation and dummy data for profiling by @ywang96 in #6146
  • [VLM] Cleanup validation and update docs by @DarkLight1337 in #6149
  • [Bugfix] Use templated datasource in grafana.json to allow automatic imports by @frittentheke in #6136
  • [Frontend] Continuous usage stats in OpenAI completion API by @jvlunteren in #5742
  • [Bugfix] Add verbose error if scipy is missing for blocksparse attention by @JGSweets in #5695
  • bump version to v0.5.1 by @simon-mo in #6157
  • [Docs] Fix readthedocs for tag build by @simon-mo in #6158

New Contributors

  • @kimdwkimdw made their first contribution in #5444
  • @sywangyi made their first contribution in #5303
  • @garg-amit made their first contribution in #5595
  • @CharlesRiggins made their first contribution in #5592
  • @zhyncs made their first contribution in #5588
  • @bfontain made their first contribution in #5142
  • @sroy745 made their first contribution in #5131
  • @joerunde made their first contribution in #5559
  • @sergey-tinkoff made their first contribution in #5178
  • @milo157 made their first contribution in #5553
  • @ShukantPal made their first contribution in #5628
  • @rafvasq made their first contribution in #5614
  • @JRosenkranz made their first contribution in #4947
  • @rohithkrn made their first contribution in #5603
  • @wooyeonlee0 made their first contribution in #5414
  • @aws-patlange made their first contribution in #5841
  • @stephanie-wang made their first contribution in #5408
  • @ProExpertProg made their first contribution in #5560
  • @ChipKerchner made their first contribution in #5652
  • @ilya-lavrenov made their first contribution in #5379
  • @mcalman made their first contribution in #5926
  • @SolitaryThinker made their first contribution in #5925
  • @llmpros made their first contribution in #5935
  • @avshalomman made their first contribution in #6029
  • @danieljannai21 made their first contribution in #5709
  • @sirejdua made their first contribution in #6050
  • @gshtras made their first contribution in #6043
  • @frittentheke made their first contribution in #6136
  • @jvlunteren made their first contribution in #5742
  • @JGSweets made their first contribution in #5695

Full Changelog: v0.5.0...v0.5.1
