## Highlights

### Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode (#4580)
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
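The headline features above are opt-in at engine startup. A hedged sketch of how to try them, based on the flags and environment variable introduced in the linked PRs (the model name is a placeholder; verify flag names against your installed version and the chunked prefill documentation from #4580):

```shell
# Chunked prefill (experimental): prompts are processed in chunks bounded by
# --max-num-batched-tokens, and decode requests are prioritized within a batch.
python -m vllm.entrypoints.openai.api_server \
    --model <your-model> \
    --enable-chunked-prefill \
    --max-num-batched-tokens 512

# Ngram speculative decoding (#4237): draft tokens come from prompt lookup
# rather than a separate draft model.
python -m vllm.entrypoints.openai.api_server \
    --model <your-model> \
    --speculative-model "[ngram]" \
    --num-speculative-tokens 4 \
    --ngram-prompt-lookup-max 3

# FlashInfer attention backend (#4353), selected via environment variable.
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server \
    --model <your-model>
```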
### Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanding Marlin kernel to support all GPTQ models (#3922, #4466, #4533)
### Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
### Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574)
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
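With environment variables now centralized and documented (#4548, #4574), installation-time behavior can be controlled consistently. A hedged sketch of two variables mentioned in this release's PRs (semantics per #4534 and #4574; check the new environment-variable documentation for the authoritative list):

```shell
# Skip kernel compilation and reuse prebuilt binaries when installing from source
# (see VLLM_USE_PRECOMPILED fix in #4534).
VLLM_USE_PRECOMPILED=1 pip install -e .

# Select the target device at build time (installation-time variable, #4574).
VLLM_TARGET_DEVICE=cpu pip install -e .
```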
## What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
- [Doc] Add note for docker user by @youkaichao in #4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in #4300
- [Model] Adds Phi-3 support by @caiom in #4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in #4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
- [Doc] README Phi-3 name fix. by @caiom in #4372
- [Core]refactor aqlm quant ops by @jikunshang in #4351
- [Mypy] Typing lora folder by @rkooo567 in #4337
- [Misc] Optimize flash attention backend log by @esmeetu in #4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in #4349
- [Core] Move function tracing setup to util function by @njhill in #4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in #4107
- [Frontend] Add --log-level option to api server by @normster in #4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
- [Misc] add RFC issue template by @youkaichao in #4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in #4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
- [Misc] Fix logger format typo by @esmeetu in #4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in #4389
- ✨ support local cache for models by @prashantgupta24 in #4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in #4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in #4428
- Add more Prometheus metrics by @ronensc in #2764
- [CI] clean docker cache for neuron by @simon-mo in #4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
- [CI] hotfix: soft fail neuron test by @simon-mo in #4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in #4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
- [Core]Refactor gptq_marlin ops by @jikunshang in #4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in #4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
- fix_tokenizer_snapshot_download_bug by @kingljl in #4493
- Unable to find Punica extension issue during source code installation by @kingljl in #4494
- [Core] Centralize GPU Worker construction by @njhill in #4419
- [Misc][Typo] type annotation fix by @HarryWu99 in #4495
- [Misc] fix typo in block manager by @Juelianqvq in #4453
- Allow user to define whitespace pattern for outlines by @robcaulk in #4305
- [Misc]Add customized information for models by @jeejeelee in #4132
- [Test] Add ignore_eos test by @rkooo567 in #4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in #4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in #4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in #4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
- [Bugfix] Add validation for seed by @sasha0552 in #4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
- [Core][Distributed] fix pynccl del error by @youkaichao in #4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by @rkooo567 in #4451
- [CI]Add regression tests to ensure the async engine generates metrics by @ronensc in #4524
- [mypy][6/N] Fix all the core subdirectory typing by @rkooo567 in #4450
- [Core][Distributed] enable multiple tp group by @youkaichao in #4512
- [Kernel] Support running GPTQ 8-bit models in Marlin by @alexm-nm in #4533
- [mypy][7/N] Cover all directories by @rkooo567 in #4555
- [Misc] Exclude the `tests` directory from being packaged by @itechbear in #4552
- [BugFix] Include target-device specific requirements.txt in sdist by @markmc in #4559
- [Misc] centralize all usage of environment variables by @youkaichao in #4548
- [kernel] fix sliding window in prefix prefill Triton kernel by @mmoskal in #4405
- [CI/Build] AMD CI pipeline with extended set of tests. by @Alexei-V-Ivanov-AMD in #4267
- [Core] Ignore infeasible swap requests. by @rkooo567 in #4557
- [Core][Distributed] enable allreduce for multiple tp groups by @youkaichao in #4566
- [BugFix] Prevent the task of `_force_log` from being garbage collected by @Atry in #4567
- [Misc] remove chunk detected debug logs by @DefTruth in #4571
- [Doc] add env vars to the doc by @youkaichao in #4572
- [Core][Model runner refactoring 1/N] Refactor attn metadata term by @rkooo567 in #4518
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by @mgoin in #4586
- Fix/async chat serving by @schoennenbeck in #2727
- [Kernel] Use flashinfer for decoding by @LiuXiaoxuanPKU in #4353
- [Speculative decoding] Support target-model logprobs by @cadedaniel in #4378
- [Misc] add installation time env vars by @youkaichao in #4574
- [Misc][Refactor] Introduce ExecuteModelData by @comaniac in #4540
- [Doc] Chunked Prefill Documentation by @rkooo567 in #4580
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by @mgoin in #4527
- [CI] check size of the wheels by @simon-mo in #4319
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by @DearPlanet in #3937
- bump version to v0.4.2 by @simon-mo in #4600
- [CI] Reduce wheel size by not shipping debug symbols by @simon-mo in #4602
## New Contributors
- @zifeitong made their first contribution in #4300
- @caiom made their first contribution in #4298
- @Alexei-V-Ivanov-AMD made their first contribution in #4213
- @normster made their first contribution in #4377
- @FurtherAI made their first contribution in #3524
- @chestnut-Q made their first contribution in #4363
- @prashantgupta24 made their first contribution in #4374
- @fgreinacher made their first contribution in #3467
- @alpayariyak made their first contribution in #4467
- @HarryWu99 made their first contribution in #4495
- @Juelianqvq made their first contribution in #4453
- @robcaulk made their first contribution in #4305
- @AnyISalIn made their first contribution in #4173
- @sasha0552 made their first contribution in #4531
- @tdg5 made their first contribution in #4273
- @itechbear made their first contribution in #4552
- @markmc made their first contribution in #4559
- @Atry made their first contribution in #4567
- @schoennenbeck made their first contribution in #2727
- @DearPlanet made their first contribution in #3937
**Full Changelog**: v0.4.1...v0.4.2