vllm-project/vllm v0.7.1


Highlights

This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, it offers ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
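
As a rough illustration of the pipeline-parallel path mentioned above, the sketch below loads a DeepSeek-family checkpoint through the offline `LLM` entrypoint. The checkpoint name, parallel sizes, and sampling settings are assumptions chosen for the example, not configurations taken from this release.

```python
# Minimal sketch, not from the release notes: offline inference with a
# DeepSeek-family model, combining tensor and pipeline parallelism.
# The checkpoint and parallel sizes below are assumptions for illustration;
# pick values that match your cluster.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed checkpoint for the example
    tensor_parallel_size=8,           # split each layer across 8 GPUs
    pipeline_parallel_size=2,         # add a second pipeline stage for more KV-cache room
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(
    ["Explain multi-head latent attention in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```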

V1

For the V1 architecture, we

Models

  • New Model: MiniCPM-o (text outputs only) (#12069)
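
A minimal sketch of loading the new model offline, assuming the Hugging Face checkpoint id `openbmb/MiniCPM-o-2_6`. Only the text path is shown, matching the "text outputs only" note; image and audio inputs would go through vLLM's usual multimodal input mechanism, whose exact prompt format for this model is documented in vLLM's multimodal examples.

```python
# Minimal sketch, assuming the checkpoint id openbmb/MiniCPM-o-2_6;
# only text-in/text-out is shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/MiniCPM-o-2_6",  # assumed checkpoint id for the example
    trust_remote_code=True,         # MiniCPM-o ships custom modeling code
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Summarize what MiniCPM-o is in one sentence."], params)
print(out[0].outputs[0].text)
```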

Hardware

  • Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
  • AMD: Llama 3.2 support upstreamed (#12421)

Others

  • Support overriding the generation config via engine arguments (#12409)
  • Support reasoning content in the API for DeepSeek-R1 (#12473); see the sketch below
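
A minimal sketch of reading the new reasoning field through the OpenAI-compatible client, assuming a vLLM server is already running with reasoning output enabled for a DeepSeek-R1 model (e.g. started with `--enable-reasoning --reasoning-parser deepseek_r1`). The base URL, API key, and model name are placeholders.

```python
# Minimal sketch, assuming a vLLM OpenAI-compatible server is already running
# locally with reasoning output enabled, e.g.:
#   vllm serve deepseek-ai/DeepSeek-R1 --enable-reasoning --reasoning-parser deepseek_r1
# The base URL, API key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is 9.11 minus 9.8?"}],
)

message = resp.choices[0].message
# With the reasoning parser enabled, the chain-of-thought is surfaced in a
# separate field from the final answer; use getattr in case the field is absent.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```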

What's Changed

New Contributors

Full Changelog: v0.7.0...v0.7.1
