## Highlights

### Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode (#4580)
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
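The headline features above are opt-in at engine startup. A hedged sketch of how to try them, based on the flags and environment variable introduced in the linked PRs (the model name is a placeholder; verify flag names against your installed version and the chunked prefill documentation from #4580):

```shell
# Chunked prefill (experimental): prompts are processed in chunks bounded by
# --max-num-batched-tokens, and decode requests are prioritized within a batch.
python -m vllm.entrypoints.openai.api_server \
    --model <your-model> \
    --enable-chunked-prefill \
    --max-num-batched-tokens 512

# Ngram speculative decoding (#4237): draft tokens come from prompt lookup
# rather than a separate draft model.
python -m vllm.entrypoints.openai.api_server \
    --model <your-model> \
    --speculative-model "[ngram]" \
    --num-speculative-tokens 4 \
    --ngram-prompt-lookup-max 3

# FlashInfer attention backend (#4353), selected via environment variable.
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server \
    --model <your-model>
```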
### Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanding Marlin kernel to support all GPTQ models (#3922, #4466, #4533)
### Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
### Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574)
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
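With environment variables now centralized and documented (#4548, #4574), installation-time behavior can be controlled consistently. A hedged sketch of two variables mentioned in this release's PRs (semantics per #4534 and #4574; check the new environment-variable documentation for the authoritative list):

```shell
# Skip kernel compilation and reuse prebuilt binaries when installing from source
# (see VLLM_USE_PRECOMPILED fix in #4534).
VLLM_USE_PRECOMPILED=1 pip install -e .

# Select the target device at build time (installation-time variable, #4574).
VLLM_TARGET_DEVICE=cpu pip install -e .
```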
## What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
- [Doc] Add note for docker user by @youkaichao in #4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in #4300
- [Model] Adds Phi-3 support by @caiom in #4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in #4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
- [Doc] README Phi-3 name fix. by @caiom in #4372
- [Core]refactor aqlm quant ops by @jikunshang in #4351
- [Mypy] Typing lora folder by @rkooo567 in #4337
- [Misc] Optimize flash attention backend log by @esmeetu in #4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in #4349
- [Core] Move function tracing setup to util function by @njhill in #4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in #4107
- [Frontend] Add --log-level option to api server by @normster in #4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
- [Misc] add RFC issue template by @youkaichao in #4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in #4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
- [Misc] Fix logger format typo by @esmeetu in #4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in #4389
- ✨ support local cache for models by @prashantgupta24 in #4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in #4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in #4428
- Add more Prometheus metrics by @ronensc in #2764
- [CI] clean docker cache for neuron by @simon-mo in #4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
- [CI] hotfix: soft fail neuron test by @simon-mo in #4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in #4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
- [Core]Refactor gptq_marlin ops by @jikunshang in #4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in #4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
- fix_tokenizer_snapshot_download_bug by @kingljl in #4493
- Unable to find Punica extension issue during source code installation by @kingljl in #4494
- [Core] Centralize GPU Worker construction by @njhill in #4419
- [Misc][Typo] type annotation fix by @HarryWu99 in #4495
- [Misc] fix typo in block manager by @Juelianqvq in #4453
- Allow user to define whitespace pattern for outlines by @robcaulk in #4305
- [Misc]Add customized information for models by @jeejeelee in #4132
- [Test] Add ignore_eos test by @rkooo567 in #4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in #4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in #4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in #4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
- [Bugfix] Add validation for seed by @sasha0552 in #4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
- [Core][Distributed] fix pynccl del error by @youkaichao in #4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by @rkooo567 in #4451
- [CI]Add regression tests to ensure the async engine generates metrics by @ronensc in #4524
- [mypy][6/N] Fix all the core subdirectory typing by @rkooo567 in #4450
- [Core][Distributed] enable multiple tp group by @youkaichao in #4512
- [Kernel] Support running GPTQ 8-bit models in Marlin by @alexm-nm in #4533
- [mypy][7/N] Cover all directories by @rkooo567 in #4555
- [Misc] Exclude the `tests` directory from being packaged by @itechbear in #4552
- [BugFix] Include target-device specific requirements.txt in sdist by @markmc in #4559
- [Misc] centralize all usage of environment variables by @youkaichao in #4548
- [kernel] fix sliding window in prefix prefill Triton kernel by @mmoskal in #4405
- [CI/Build] AMD CI pipeline with extended set of tests. by @Alexei-V-Ivanov-AMD in #4267
- [Core] Ignore infeasible swap requests. by @rkooo567 in #4557
- [Core][Distributed] enable allreduce for multiple tp groups by @youkaichao in #4566
- [BugFix] Prevent the task of `_force_log` from being garbage collected by @Atry in #4567
- [Misc] remove chunk detected debug logs by @DefTruth in #4571
- [Doc] add env vars to the doc by @youkaichao in #4572
- [Core][Model runner refactoring 1/N] Refactor attn metadata term by @rkooo567 in #4518
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by @mgoin in #4586
- Fix/async chat serving by @schoennenbeck in #2727
- [Kernel] Use flashinfer for decoding by @LiuXiaoxuanPKU in #4353
- [Speculative decoding] Support target-model logprobs by @cadedaniel in #4378
- [Misc] add installation time env vars by @youkaichao in #4574
- [Misc][Refactor] Introduce ExecuteModelData by @comaniac in #4540
- [Doc] Chunked Prefill Documentation by @rkooo567 in #4580
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by @mgoin in #4527
- [CI] check size of the wheels by @simon-mo in #4319
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by @DearPlanet in #3937
- bump version to v0.4.2 by @simon-mo in #4600
- [CI] Reduce wheel size by not shipping debug symbols by @simon-mo in #4602
## New Contributors
- @zifeitong made their first contribution in #4300
- @caiom made their first contribution in #4298
- @Alexei-V-Ivanov-AMD made their first contribution in #4213
- @normster made their first contribution in #4377
- @FurtherAI made their first contribution in #3524
- @chestnut-Q made their first contribution in #4363
- @prashantgupta24 made their first contribution in #4374
- @fgreinacher made their first contribution in #3467
- @alpayariyak made their first contribution in #4467
- @HarryWu99 made their first contribution in #4495
- @Juelianqvq made their first contribution in #4453
- @robcaulk made their first contribution in #4305
- @AnyISalIn made their first contribution in #4173
- @sasha0552 made their first contribution in #4531
- @tdg5 made their first contribution in #4273
- @itechbear made their first contribution in #4552
- @markmc made their first contribution in #4559
- @Atry made their first contribution in #4567
- @schoennenbeck made their first contribution in #2727
- @DearPlanet made their first contribution in #3937
**Full Changelog**: v0.4.1...v0.4.2