github huggingface/text-generation-inference v2.4.0


Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decreased latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of MLLama
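
Prefill chunking is opt-in in this release via the PREFILL_CHUNKING=1 environment variable noted above. A minimal sketch of enabling it before launching the server (the launcher invocation below is commented out and its model id and port are placeholders, not part of these notes):

```shell
# Opt in to experimental prefill chunking (the PREFILL_CHUNKING=1 switch
# from this release's notable changes):
export PREFILL_CHUNKING=1

# Then launch the server as usual, e.g. (model id and port are placeholders):
# text-generation-launcher --model-id <model> --port 8080

# Confirm the variable is set in the launch environment:
echo "PREFILL_CHUNKING=$PREFILL_CHUNKING"
```

Since the feature is experimental, leaving the variable unset preserves the previous prefill behavior.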

What's Changed

Full Changelog: v2.3.0...v2.4
