huggingface/text-generation-inference v0.9.4

Features

  • server: automatic max_batch_total_tokens for flash attention models #630 (see the KV cache sketch after the Fixes list)
  • router: ngrok edge #642
  • server: add trust_remote_code to the quantize script by @ChristophRaab #647 (sketched below)
  • server: add exllama GPTQ CUDA kernel support #553 #666
  • server: directly load GPTBigCode to the specified device by @Atry #618 (sketched below)
  • server: add CUDA memory fraction option #659 (sketched below)
  • server: use quantize_config.json instead of GPTQ_BITS environment variables #671 (sketched below)
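
The trust_remote_code addition lets the quantize script load models whose modeling code ships inside the Hub repository itself. A minimal sketch of what that opt-in looks like in plain transformers (the model id is illustrative, and this is not the script's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoints whose architecture lives as custom code in the repo can
# only be loaded when the caller explicitly opts in.
model_id = "tiiuae/falcon-7b"  # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # the flag #647 threads through the quantize script
)
```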
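
Loading GPTBigCode directly to the specified device avoids first materializing the weights on CPU. A sketch of the general technique using PyTorch's device context manager (requires PyTorch 2.0+; not necessarily the exact approach of #618):

```python
import torch
from transformers import AutoModelForCausalLM

# Inside a device context, newly created tensors land on that device,
# so the model skeleton is never fully materialized in CPU RAM.
device = torch.device("cuda:0")
with device:
    model = AutoModelForCausalLM.from_pretrained(
        "bigcode/starcoder",       # GPTBigCode architecture
        torch_dtype=torch.float16,
    )
```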
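
The CUDA memory fraction option caps how much of a device the server may claim, which helps when several processes share one GPU. In plain PyTorch the same effect looks roughly like this (a sketch; the option's exact plumbing in #659 may differ):

```python
import torch

# Restrict this process to a fraction of the device's total memory,
# e.g. so two model replicas can share one GPU.
fraction = 0.5  # illustrative value
device = torch.device("cuda:0")
torch.cuda.set_per_process_memory_fraction(fraction, device=device)

# Allocations beyond fraction * total_memory now raise an out-of-memory
# error instead of spilling into memory meant for other processes.
```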
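
Reading quantize_config.json means a GPTQ checkpoint carries its own quantization metadata instead of relying on GPTQ_BITS-style environment variables. A sketch of such a lookup, assuming the conventional GPTQ field names "bits" and "group_size" (the server's real loader may differ):

```python
import json
from pathlib import Path

def read_gptq_config(model_dir: str) -> tuple[int, int]:
    """Return (bits, group_size) from a checkpoint's quantize_config.json."""
    config_path = Path(model_dir) / "quantize_config.json"
    with config_path.open() as f:
        config = json.load(f)
    # "bits" and "group_size" are the conventional GPTQ field names.
    return config["bits"], config["group_size"]

bits, group_size = read_gptq_config("/data/llama-2-7b-gptq")  # illustrative path
print(f"GPTQ checkpoint: {bits}-bit, group size {group_size}")
```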

Fixes

  • server: fix Llama v2 GPTQ #648
  • server: fix handling of non-parameter tensors in the quantize script, with bigcode/starcoder as an example #661
  • server: use mem_get_info to compute the KV cache size #664 (see the sketch after this list)
  • server: fix exllama buffers #689
  • server: fix quantization Python requirements #708
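
Both the automatic max_batch_total_tokens feature (#630) and this mem_get_info fix size the KV cache from measured free GPU memory rather than guesses. A simplified sketch of that arithmetic (the model dimensions, dtype size, and safety margin are illustrative, not TGI's actual values):

```python
import torch

def estimate_kv_cache_tokens(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,        # float16
    safety_margin: float = 0.9,  # leave headroom; illustrative value
) -> int:
    """Estimate how many tokens of KV cache fit in currently free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Per token: one key and one value of shape (num_layers, num_heads, head_dim).
    bytes_per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes
    return int(free_bytes * safety_margin) // bytes_per_token

# e.g. Llama-2-7B-like dimensions (illustrative)
print(estimate_kv_cache_tokens(num_layers=32, num_heads=32, head_dim=128))
```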

New Contributors

  • @ChristophRaab made their first contribution in #647
  • @Atry made their first contribution in #618

Full Changelog: v0.9.3...v0.9.4
