huggingface/text-generation-inference v0.9.0

Highlights

  • server: add paged attention to flash models
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked (see the request sketch after this list)
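The prefill-logprobs highlight changes the request contract: prompt-token logprobs are now skipped unless the caller opts in. Below is a minimal sketch of opting in over the REST API, assuming a server already running on localhost:8080 and that the request parameters are named `details` and `decoder_input_details`; adjust to your deployment.

```python
import requests

# Assumed endpoint and parameter names; localhost:8080 is a placeholder, and
# `decoder_input_details` mirrors this release's "only compute prefill
# logprobs when asked" behaviour.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20,
            "details": True,                # return generation details
            "decoder_input_details": True,  # also return prefill (prompt) logprobs
        },
    },
    timeout=60,
)
response.raise_for_status()
details = response.json().get("details", {})
print(details.get("prefill", []))  # prompt tokens with ids, texts, and logprobs
```

Leaving the opt-in off keeps the default path cheaper, since the server no longer materializes prompt logprobs for every request.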

Features

  • launcher: parse OOM signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball (see the streaming sketch after this list)
  • router: add arg validation
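Two of the router features above concern streaming: generate_stream emits server-sent events, and a request header can now ask intermediaries not to buffer them. A minimal consumer sketch follows, assuming a server on localhost:8080; the X-Accel-Buffering header name is an assumption (it is the conventional nginx control), so check the router option added in this release for the exact mechanism.

```python
import json
import requests

# Assumed endpoint and header: /generate_stream emits server-sent events, and
# X-Accel-Buffering is the conventional nginx header for disabling proxy
# buffering; the exact option added by this release may differ.
with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Write a haiku about GPUs", "parameters": {"max_new_tokens": 30}},
    headers={"X-Accel-Buffering": "no"},  # assumption, see note above
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)
```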

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16 (see the launch sketch after this list)
  • launcher: fix issue where launcher does not properly report shard failures
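The dtype change above is a launch-time option. A minimal sketch of starting the server with bfloat16 instead of the float16 default, assuming the launcher exposes this as a --dtype flag (the flag name is an assumption tied to this change, and the model id and port are placeholders):

```python
import subprocess

# Assumed flag name: --dtype, selecting bfloat16 instead of the float16
# default. Model id and port are placeholders for illustration only.
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "bigscience/bloom-560m",
        "--dtype", "bfloat16",
        "--port", "8080",
    ],
    check=True,
)
```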

New Contributors

Full Changelog: v0.8.2...v0.9.0
