huggingface/text-generation-inference v2.3.0


Important changes

  • Renamed HUGGINGFACE_HUB_CACHE to HF_HOME, to harmonize environment variables across the HF ecosystem.
    As a result, data locations in the Docker image moved from /data/models-.... to /data/hub/models-.... (see the Docker example after this list).

  • Prefix caching by default! To help with long-running queries, TGI now uses prefix caching: pre-existing prefixes already in the KV cache are reused to speed up TTFT (time to first token). This should be entirely transparent for most users; however, it required an intensive rewrite of the internals, so bugs may exist. We also switched kernels from paged_attention to flashinfer (with flashdecoding as a fallback for the specific models flashinfer does not support). A sketch of the kind of workload this helps is shown after this list.

  • Lots of performance improvements with Marlin and quantization.
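
A minimal sketch of running the new image; the model id and host path are placeholders, not from the release notes. With the host volume mounted at /data, downloaded weights now land under /data/hub/models-.... inside the container rather than /data/models-....:

    # Placeholder model and volume; adjust to your setup.
    docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:2.3.0 \
      --model-id HuggingFaceH4/zephyr-7b-beta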
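
And a minimal sketch of the kind of workload prefix caching speeds up: two requests to TGI's standard /generate endpoint that share a long common prefix, so the second one can reuse the prefix's KV-cache entries and see a lower TTFT. The prompt text is made up for illustration:

    prefix="You are a helpful assistant. Always answer concisely. "

    # First request populates the KV cache for the shared prefix.
    curl 127.0.0.1:8080/generate -X POST \
      -H 'Content-Type: application/json' \
      -d "{\"inputs\": \"${prefix}What is TTFT?\", \"parameters\": {\"max_new_tokens\": 32}}"

    # Same prefix, different suffix: the cached prefix is reused, lowering TTFT.
    curl 127.0.0.1:8080/generate -X POST \
      -H 'Content-Type: application/json' \
      -d "{\"inputs\": \"${prefix}What is flashinfer?\", \"parameters\": {\"max_new_tokens\": 32}}"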

What's Changed

New Contributors

Full Changelog: v2.2.0...v2.3.0
