vllm-project/vllm v0.6.6

Highlights

  • Support for the DeepSeek-V3 model (#11523, #11502).

    • On 8xH200s or MI300x: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192. The context length can be increased to about 32K before running into memory issues.
    • For other devices, follow our distributed inference guide to enable tensor-parallel and/or pipeline-parallel inference.
    • We are just getting started on enhancing the support and unlocking more performance. See #11539 for planned work.
  • Last-mile stretch of the V1 engine refactoring: API server (#11529, #11530), penalties in the sampler (#10681), prefix caching for vision-language models (#11187, #11305), TP Ray executor (#11107, #11472)

  • Breaking change: X-Request-ID echoing is now opt-in instead of on by default, for performance reasons. Set --enable-request-id-headers to enable it (see the sketch after this list).
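
As an illustration, here is a minimal sketch (not taken from the release) of querying a server launched with the command above plus --enable-request-id-headers; the request ID value and prompt are arbitrary:

    # Minimal sketch: query an OpenAI-compatible vLLM server started with, e.g.:
    #   vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 \
    #     --trust-remote-code --max-model-len 8192 --enable-request-id-headers
    import requests

    payload = {
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }
    # With --enable-request-id-headers set, the server echoes X-Request-ID back
    # in the response headers.
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json=payload,
        headers={"X-Request-ID": "demo-trace-001"},
    )
    print(resp.headers.get("X-Request-ID"))
    print(resp.json()["choices"][0]["message"]["content"])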

Model Support

  • IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
  • Add QVQ and QwQ to the list of supported models (#11509)
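
As a quick way to try one of the newly listed models, a minimal offline-inference sketch; the Hugging Face model id below is an assumption, so substitute whichever Granite 3.1, QVQ, or QwQ checkpoint you intend to run:

    # Minimal offline-inference sketch using vLLM's Python API.
    # "ibm-granite/granite-3.1-8b-instruct" is assumed here; swap in the
    # checkpoint you actually want to evaluate.
    from vllm import LLM, SamplingParams

    llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Briefly explain tensor parallelism."], params)
    print(outputs[0].outputs[0].text)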

Performance

  • Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
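
A minimal sketch of consuming this from the Python API, assuming you already have a 2:4-sparse FP8- or INT8-quantized checkpoint in a format vLLM recognizes; the model id below is a placeholder, not a real checkpoint name:

    # Placeholder checkpoint id: replace with an actual 2:4-sparse quantized
    # model available to your environment. vLLM picks up the quantization and
    # sparsity configuration from the checkpoint itself.
    from vllm import LLM

    llm = LLM(model="your-org/llama-3.1-8b-2of4-sparse-fp8")
    print(llm.generate(["The capital of France is"])[0].outputs[0].text)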

Production Engine

  • Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192)
  • Online Pooling API (#11457)
  • Load video from base64 (#11492)
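
To illustrate the base64 video path, a hedged sketch against the OpenAI-compatible chat endpoint; the vision model name and the video_url/data-URL message format are assumptions about the multimodal chat schema, not text from this release:

    # Sketch: send a locally read video as a base64 data URL to a vLLM server
    # assumed to be running a video-capable model, e.g. started with:
    #   vllm serve Qwen/Qwen2-VL-7B-Instruct
    # The "video_url" content part below is an assumption about the schema.
    import base64
    import requests

    with open("clip.mp4", "rb") as f:
        video_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this video."},
                {"type": "video_url",
                 "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            ],
        }],
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(resp.json()["choices"][0]["message"]["content"])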

Others

  • Add pypi index for every commit and nightly build (#11404)
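
For example, assuming the published index layout matches the current vLLM documentation, nightly wheels can be installed with pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly; the index URL here is an assumption, not quoted from this release.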

Full Changelog: v0.6.5...v0.6.6
