vllm v0.5.0

Highlights

Production Features

Hardware Support

  • Improvements to the Intel CPU CI (#4113, #5241)
  • Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)

Others

  • Debugging tips documentation (#5409, #5430)
  • Dynamic Per-Token Activation Quantization (#5037)
  • Customizable RoPE theta (#5197); see the sketch below
  • Enable passing multiple LoRA adapters at once to generate() (#5300); see the sketch below
  • OpenAI tools support named functions (#5032); see the sketch below
  • Support stream_options for OpenAI protocol (#5319, #5135); see the sketch below
  • Update Outlines Integration from FSM to Guide (#4109)
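
For the customizable RoPE theta change (#5197), here is a minimal sketch of how an override might be passed. It assumes the release exposes the RoPE base frequency as a rope_theta engine argument alongside the existing rope_scaling override; the model name and value below are placeholders, not taken from the release notes.

```python
from vllm import LLM, SamplingParams

# Assumption: #5197 exposes the RoPE base frequency override as an engine
# argument named rope_theta; the model and value below are placeholders.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    rope_theta=1_000_000.0,            # override the model's default RoPE base
)

outputs = llm.generate(
    ["Long-context prompt goes here."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```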
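
For passing multiple LoRA adapters at once to generate() (#5300), a sketch assuming lora_request now accepts a list matched one-to-one with the prompts; the base model, adapter names, and paths are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder model

prompts = [
    "Translate to French: Hello, world.",
    "Translate to German: Hello, world.",
]
# Assumption per #5300: one LoRARequest per prompt, matched by position.
lora_requests = [
    LoRARequest("french-adapter", 1, "/path/to/french-adapter"),  # placeholder path
    LoRARequest("german-adapter", 2, "/path/to/german-adapter"),  # placeholder path
]

outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=64),
    lora_request=lora_requests,
)
for out in outputs:
    print(out.outputs[0].text)
```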
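
For named-function tool support in the OpenAI frontend (#5032), a sketch using the standard openai client against a locally running vLLM OpenAI-compatible server; the tool definition, served model name, and port are placeholders.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="my-served-model",  # placeholder served model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # #5032: tool_choice naming a specific function, rather than "auto".
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls)
```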
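
For stream_options (#5319, #5135), a sketch of requesting token usage on a streamed chat completion via the openai client; the server address and served model name are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-served-model",  # placeholder served model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    # #5135 / #5319: stream_options for chat and text completions.
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        # With include_usage, the final chunk reports aggregate token counts.
        print("\nusage:", chunk.usage)
```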

What's Changed

  • [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
  • [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
  • [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
  • [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
  • [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
  • [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
  • [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
  • [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
  • [BugFix] Prevent LLM.encode for non-generation Models by @robertgshaw2-neuralmagic in #5184
  • Update test_ignore_eos by @simon-mo in #4898
  • [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
  • [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
  • [Misc] Simplify code and fix type annotations in conftest.py by @DarkLight1337 in #5118
  • [Core] Support image processor by @DarkLight1337 in #4197
  • [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
  • [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
  • [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
  • [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
  • New CI template on AWS stack by @khluu in #5110
  • [FRONTEND] OpenAI tools support named functions by @br3no in #5032
  • [Bugfix] Support prompt_logprobs==0 by @toslunar in #5217
  • [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
  • [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
  • [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
  • [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
  • [CI/Build] Add inputs tests by @DarkLight1337 in #5215
  • [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
  • [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
  • [CI/Build] Simplify model loading for HfRunner by @DarkLight1337 in #5251
  • [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
  • [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
  • [Misc] Add transformers version to collect_env.py by @mgoin in #5259
  • [Misc] update collect env by @youkaichao in #5261
  • [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
  • [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
  • [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
  • [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) by @tomeras91 in #5278
  • [CI] Add nightly benchmarks by @simon-mo in #5260
  • [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
  • [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
  • [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
  • [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
  • [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
  • [Docs] Add Sequoia as sponsors by @simon-mo in #5287
  • [Speculative Decoding] Add ProposerWorkerBase abstract class by @njhill in #5252
  • [BugFix] Fix log message about default max model length by @njhill in #5284
  • [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
  • [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
  • [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
  • [Docs] Add Ray Summit CFP by @simon-mo in #5295
  • [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
  • [Frontend][Core] Update Outlines Integration from FSM to Guide by @br3no in #4109
  • [CI/Build] Update vision tests by @DarkLight1337 in #5307
  • Bugfix: fix broken download of models from modelscope by @liuyhwangyh in #5233
  • [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
  • [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
  • [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
  • [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
  • [Misc] Missing error message for custom ops import by @DamonFool in #5282
  • [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest by @Etelis in #5135
  • [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
  • [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
  • Remove Ray health check by @Yard1 in #4693
  • Add missing ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
  • [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
  • [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
  • [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
  • fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in #5340
  • [Bug Fix] Fix the support check for FP8 CUTLASS by @cli99 in #5352
  • [Misc] Add args for selecting distributed executor to benchmarks by @BKitor in #5335
  • [ROCm][AMD] Use pytorch sdpa math backend to do naive attention by @hongxiayang in #4965
  • [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) by @youkaichao in #5347
  • [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) by @youkaichao in #5357
  • [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale by @mgoin in #5353
  • [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint by @youkaichao in #5074
  • [Misc][CI/Test] fix flaky test in tests/test_sharded_state_loader.py by @youkaichao in #5361
  • [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @bnellnm in #5047
  • [Bugfix] Fix KeyError: 1 When Using LoRA adapters by @BlackBird-Coding in #5164
  • [Misc] Update to comply with the new compressed-tensors config by @dsikka in #5350
  • [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server by @ywang96 in #5374
  • [misc][typo] fix typo by @youkaichao in #5372
  • [Misc] Improve error message when LoRA parsing fails by @DarkLight1337 in #5194
  • [Model] Initial support for LLaVA-NeXT by @DarkLight1337 in #4199
  • [Feature][Frontend]: Continued stream_options implementation also in CompletionRequest by @Etelis in #5319
  • [Bugfix] Fix LLaVA-NeXT by @DarkLight1337 in #5380
  • [ci] Use small_cpu_queue for doc build by @khluu in #5331
  • [ci] Mount buildkite agent on Docker container to upload benchmark results by @khluu in #5330
  • [Docs] Add Docs on Limitations of VLM Support by @ywang96 in #5383
  • [Docs] Alphabetically sort sponsors by @WoosukKwon in #5386
  • Bump version to v0.5.0 by @simon-mo in #5384
  • [Doc] Add documentation for FP8 W8A8 by @mgoin in #5388
  • [ci] Fix Buildkite agent path by @khluu in #5392
  • [Misc] Various simplifications and typing fixes by @njhill in #5368
  • [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs by @maor-ps in #5312
  • [Bugfix][Frontend] Cleanup "fix chat logprobs" by @DarkLight1337 in #5026
  • [Doc] add debugging tips by @youkaichao in #5409
  • [Doc][Typo] Fixing Missing Comma by @ywang96 in #5403
  • [Misc] Remove VLLM_BUILD_WITH_NEURON env variable by @WoosukKwon in #5389
  • [CI] docfix by @rkooo567 in #5410
  • [Speculative decoding] Initial spec decode docs by @cadedaniel in #5400
  • [Doc] Add an automatic prefix caching section in vllm documentation by @KuntaiDu in #5324
  • [Docs] [Spec decode] Fix docs error in code example by @cadedaniel in #5427
  • [Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 by @jsato8094 in #5254
  • [Bugfix] fix lora_dtype value type in arg_utils.py by @c3-ali in #5398
  • [Frontend] Customizable RoPE theta by @sasha0552 in #5197
  • [Core][Distributed] add same-node detection by @youkaichao in #5369
  • [Core][Doc] Default to multiprocessing for single-node distributed case by @njhill in #5230
  • [Doc] add common case for long waiting time by @youkaichao in #5430

New Contributors

Full Changelog: v0.4.3...v0.5.0
