vllm-project/vllm v0.6.3.post1

Highlights

New Models

  • Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
  • Support multiple and interleaved images for Llama 3.2 (#9095)
  • Support VLM2Vec, the first multimodal embedding model in vLLM (#9303); a usage sketch follows this list
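As a rough illustration of the new multimodal embedding path, the sketch below embeds one image plus an instruction through the offline `LLM.encode` entrypoint. The checkpoint name, prompt template, image file, and constructor options are assumptions based on the VLM2Vec (Phi-3-Vision based) setup in #9303; check the docs for the exact invocation in your version.

```python
from PIL import Image
from vllm import LLM

# Assumed checkpoint and options from the VLM2Vec setup in #9303;
# adjust to the model card and docs for your vLLM version.
llm = LLM(
    model="TIGER-Lab/VLM2Vec-Full",
    trust_remote_code=True,
    max_model_len=4096,
)

image = Image.open("cherry_blossom.jpg")  # hypothetical local image
prompt = "<|image_1|> Represent the given image with the following question: What is in the image"

# encode() runs the embedding (pooling) path instead of text generation.
outputs = llm.encode({"prompt": prompt, "multi_modal_data": {"image": image}})
print(len(outputs[0].outputs.embedding))  # embedding dimension
```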

Important Bug Fixes

  • Fix chat API continuous usage stats (#9357)
  • Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
  • Fix Molmo text-only input bug (#9397)
  • Fix CUDA 11.8 Build (#9386)
  • Fix _version.py not found issue (#9375)

Other Enhancements

  • Remove block manager v1 and make block manager v2 default (#8704)
  • Speculative decoding: optimize ngram lookup performance (#9333); an illustrative sketch follows this list
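For intuition, ngram lookup drafting (also known as prompt lookup decoding) proposes draft tokens by matching the sequence's trailing n-gram against earlier context and reusing the tokens that followed the match; the target model then verifies the draft as in any speculative decoding scheme. The sketch below is an illustrative reimplementation of the lookup step, not vLLM's actual code; the function and parameter names are invented.

```python
from typing import Optional


def ngram_propose(
    token_ids: list[int],
    ngram_size: int = 3,
    num_draft_tokens: int = 5,
) -> Optional[list[int]]:
    """Illustrative ngram lookup: match the trailing n-gram against
    earlier context and propose the tokens that followed the match."""
    if len(token_ids) <= ngram_size:
        return None
    tail = token_ids[-ngram_size:]
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            draft = token_ids[start + ngram_size:
                              start + ngram_size + num_draft_tokens]
            return draft or None
    return None


# The trailing (1, 2, 3) also occurs earlier, so its continuation is drafted.
print(ngram_propose([5, 6, 7, 1, 2, 3, 9, 9, 1, 2, 3]))  # [9, 9, 1, 2, 3]
```

In vLLM of this era, this drafter is enabled with settings along the lines of `speculative_model="[ngram]"`, `num_speculative_tokens`, and `ngram_prompt_lookup_max`; see the speculative decoding docs for your version for the exact flags.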

Full Changelog: v0.6.3...v0.6.3.post1
