github huggingface/text-generation-inference v2.4.0


Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decreased latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of MLLama
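
Prefill chunking is opt-in in this release via the PREFILL_CHUNKING=1 environment variable noted above. A minimal sketch of enabling it before launching the server (the launcher invocation below is commented out and its model id and port are placeholders, not part of these notes):

```shell
# Opt in to experimental prefill chunking (the PREFILL_CHUNKING=1 switch
# from this release's notable changes):
export PREFILL_CHUNKING=1

# Then launch the server as usual, e.g. (model id and port are placeholders):
# text-generation-launcher --model-id <model> --port 8080

# Confirm the variable is set in the launch environment:
echo "PREFILL_CHUNKING=$PREFILL_CHUNKING"
```

Since the feature is experimental, leaving the variable unset preserves the previous prefill behavior.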

What's Changed

Full Changelog: v2.3.0...v2.4
