huggingface/text-generation-inference v2.2.0 on GitHub

Notable changes

Llama 3.1 support, including 405B, with FP8 support in many mixed configurations (FP8, AWQ, GPTQ, FP8+FP16).
Gemma2 softcap support
Deepseek v2 support.
Lots of internal reworks/cleanup (allowing for cool features)
Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default)
Flash decoding support (enabled with the FLASH_DECODING=1 environment variable; this will probably enable some nice improvements in the future)
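
For readers unfamiliar with the Gemma2 softcap item above: soft-capping squashes logits smoothly into a bounded range with a scaled tanh, instead of clipping them hard. The snippet below is a minimal pure-Python sketch of that idea, not the kernel code shipped in this release. The function name `softcap` and the cap value 30.0 are illustrative assumptions.

```python
import math


def softcap(logits, cap):
    """Bound each logit to [-cap, cap] using cap * tanh(x / cap).

    Hypothetical helper for illustration only; TGI applies the same
    idea inside its model code and attention kernels, not via this function.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


# Small logits pass through almost unchanged; large ones saturate near +/-cap.
capped = softcap([-1000.0, -1.0, 0.0, 1.0, 1000.0], cap=30.0)
print(capped)
```

Because tanh is smooth and monotonic, the ordering of logits is preserved while extreme values are bounded.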

What's Changed

Preparing patch release. by @Narsil in #2186
Adding "longrope" for Phi-3 (#2172) by @amihalik in #2179
Refactor dead code - Removing all flash_xxx.py files. by @Narsil in #2166
Fix Starcoder2 after refactor by @danieldk in #2189
GPTQ CI improvements by @danieldk in #2151
Consistently take prefix in model constructors by @danieldk in #2191
fix dbrx & opt model prefix bug by @icyxp in #2201
hotfix: Fix number of KV heads by @danieldk in #2202
Fix incorrect cache allocation with multi-query by @danieldk in #2203
Falcon/DBRX: get correct number of key-value heads by @danieldk in #2205
add doc for intel gpus by @sywangyi in #2181
fix: python deserialization by @jaluma in #2178
update to metrics 0.23.0 or could work with metrics-exporter-promethe… by @sywangyi in #2190
feat: use model name as adapter id in chat endpoints by @drbh in #2128
Fix nccl regression on PyTorch 2.3 upgrade by @fxmarty in #2099
Fix buildx cache + change runner type by @glegendre01 in #2176
Fixed README ToC by @vinkamath in #2196
Updating the self check by @Narsil in #2209
Move quantized weight handling out of the Weights class by @danieldk in #2194
Add support for FP8 on compute capability >=8.0, <8.9 by @danieldk in #2213
fix: append DONE message to chat stream by @drbh in #2221
[fix] Modifying base in yarn embedding by @SeongBeomLEE in #2212
Use symmetric quantization in the quantize subcommand by @danieldk in #2120
feat: simple mistral lora integration tests by @drbh in #2180
fix custom cache dir by @ErikKaum in #2226
fix: Remove bitsandbytes installation when running cpu-only install by @Hugoch in #2216
Add support for AWQ-quantized Idefics2 by @danieldk in #2233
server quantize: expose groupsize option by @danieldk in #2225
Remove stray quantize argument in get_weights_col_packed_qkv by @danieldk in #2237
fix(server): fix cohere by @OlivierDehaene in #2249
Improve the handling of quantized weights by @danieldk in #2250
Hotfix: fix of use of unquantized weights in Gemma GQA loading by @danieldk in #2255
Hotfix: various GPT-based model fixes by @danieldk in #2256
Hotfix: fix MPT after recent refactor by @danieldk in #2257
Hotfix: pass through model revision in VlmCausalLM by @danieldk in #2258
usage stats and crash reports by @ErikKaum in #2220
add usage stats to toctree by @ErikKaum in #2260
fix: adjust default tool choice by @drbh in #2244
Add support for Deepseek V2 by @danieldk in #2224
re-push to internal registry by @XciD in #2242
Add FP8 release test by @danieldk in #2261
feat(fp8): use fbgemm kernels and load fp8 weights directly by @OlivierDehaene in #2248
fix(server): fix deepseekv2 loading by @OlivierDehaene in #2266
Hotfix: fix of use of unquantized weights in Mixtral GQA loading by @icyxp in #2269
legacy warning on text_generation client by @ErikKaum in #2271
fix(ci): test new instances by @XciD in #2272
fix(server): fix fp8 weight loading by @OlivierDehaene in #2268
Softcapping for gemma2. by @Narsil in #2273
use proper name for ci by @XciD in #2274
Fixing mistral nemo. by @Narsil in #2276
fix(l4): fix fp8 logic on l4 by @OlivierDehaene in #2277
Add support for repacking AWQ weights for GPTQ-Marlin by @danieldk in #2278
[WIP] Add support for Mistral-Nemo by supporting head_dim through config by @shaltielshmid in #2254
Preparing for release. by @Narsil in #2285
Add support for Llama 3 rotary embeddings by @danieldk in #2286
hotfix: pin numpy by @danieldk in #2289
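
As background for "Use symmetric quantization in the quantize subcommand" (#2120): in a symmetric scheme the integer range is centered on zero, so only a scale is needed and 0.0 always maps to integer 0. The snippet below is a generic sketch of that idea under simplifying assumptions (one scale for the whole list, 4 bits, pure Python). It is not the GPTQ quantizer used by the subcommand, and the names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Quantize floats to signed integers in [-qmax, qmax] with a single scale.

    Illustrative only: real quantizers typically work per group of weights.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [x * scale for x in q]


weights = [-0.7, 0.0, 0.3, 0.7]
q, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(q, scale)
print(q, scale, restored)
```

An asymmetric scheme would also store a zero point per group; dropping it simplifies the stored format at the cost of some range when the weights are not centered on zero.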

New Contributors

@jaluma made their first contribution in #2178
@vinkamath made their first contribution in #2196
@ErikKaum made their first contribution in #2226
@Hugoch made their first contribution in #2216
@XciD made their first contribution in #2242
@shaltielshmid made their first contribution in #2254

Full Changelog: v2.1.1...v2.2.0