huggingface/text-generation-inference v0.9.0

Highlights

  • server: add paged attention to flash models
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked (see the request sketch after this list)
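The prefill-logprobs highlight changes the request contract: prompt-token logprobs are now skipped unless the caller opts in. Below is a minimal sketch of opting in over the REST API, assuming a server already running on localhost:8080 and that the request parameters are named `details` and `decoder_input_details`; adjust to your deployment.

```python
import requests

# Assumed endpoint and parameter names; localhost:8080 is a placeholder, and
# `decoder_input_details` mirrors this release's "only compute prefill
# logprobs when asked" behaviour.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20,
            "details": True,                # return generation details
            "decoder_input_details": True,  # also return prefill (prompt) logprobs
        },
    },
    timeout=60,
)
response.raise_for_status()
details = response.json().get("details", {})
print(details.get("prefill", []))  # prompt tokens with ids, texts, and logprobs
```

Leaving the opt-in off keeps the default path cheaper, since the server no longer materializes prompt logprobs for every request.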

Features

  • launcher: parse OOM signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball (see the streaming sketch after this list)
  • router: add arg validation
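Two of the router features above concern streaming: generate_stream emits server-sent events, and a request header can now ask intermediaries not to buffer them. A minimal consumer sketch follows, assuming a server on localhost:8080; the X-Accel-Buffering header name is an assumption (it is the conventional nginx control), so check the router option added in this release for the exact mechanism.

```python
import json
import requests

# Assumed endpoint and header: /generate_stream emits server-sent events, and
# X-Accel-Buffering is the conventional nginx header for disabling proxy
# buffering; the exact option added by this release may differ.
with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Write a haiku about GPUs", "parameters": {"max_new_tokens": 30}},
    headers={"X-Accel-Buffering": "no"},  # assumption, see note above
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)
```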

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16 (see the launch sketch after this list)
  • launcher: fix issue where launcher does not properly report shard failures
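The dtype change above is a launch-time option. A minimal sketch of starting the server with bfloat16 instead of the float16 default, assuming the launcher exposes this as a --dtype flag (the flag name is an assumption tied to this change, and the model id and port are placeholders):

```python
import subprocess

# Assumed flag name: --dtype, selecting bfloat16 instead of the float16
# default. Model id and port are placeholders for illustration only.
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "bigscience/bloom-560m",
        "--dtype", "bfloat16",
        "--port", "8080",
    ],
    check=True,
)
```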

New Contributors

Full Changelog: v0.8.2...v0.9.0
