huggingface/text-generation-inference v2.4.1 on GitHub

Notable changes

Choose input/total tokens automatically based on available VRAM
Support Qwen2 VL
Decrease latency of very large batches (> 128)
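
To give a feel for the first item, here is a minimal sketch of how a token budget could be derived from free VRAM. It is not TGI's actual logic: the function names, the safety margin, and the model dimensions are all hypothetical. It relies only on the standard estimate that a key/value cache stores one key and one value vector per layer and per KV head for each token.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed for one token (hypothetical helper)."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_total_tokens(free_vram_bytes, bytes_per_token, safety_margin=0.9):
    """Number of tokens whose KV cache fits in the free VRAM (assumed rule)."""
    # Keep a fraction of memory in reserve for activations and fragmentation.
    usable = int(free_vram_bytes * safety_margin)
    return usable // bytes_per_token


if __name__ == "__main__":
    # Illustrative numbers only: 32 layers, 8 KV heads, head size 128, fp16.
    per_token = kv_cache_bytes_per_token(32, 8, 128)
    print(per_token, max_total_tokens(8 * 1024**3, per_token))
```

In a real server, the free-memory figure would be measured after the weights are loaded and a warmup pass has run. The sketch takes it as an input.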

What's Changed

feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in #2687
Avoiding timeout for bloom tests. by @Narsil in #2693
Green main by @Narsil in #2697
Choosing input/total tokens automatically based on available VRAM? by @Narsil in #2673
We can have a tokenizer anywhere. by @Narsil in #2527
Update poetry lock. by @Narsil in #2698
Fixing auto bloom test. by @Narsil in #2699
More timeout on docker start ? by @Narsil in #2701
Monkey patching as a desperate measure. by @Narsil in #2704
add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in #2702
Support qwen2 vl by @drbh in #2689
fix cuda graphs for qwen2-vl by @drbh in #2708
fix: create position ids for text only input by @drbh in #2714
fix: add chat_tokenize endpoint to api docs by @drbh in #2710
Hotfixing auto length (warmup max_s was wrong). by @Narsil in #2716
Fix prefix caching + speculative decoding by @tgaddair in #2711
Fixing linting on main. by @Narsil in #2719
nix: move to tgi-nix main by @danieldk in #2718
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in #2717
add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in #2725
Add initial support for compressed-tensors checkpoints by @danieldk in #2732
nix: update nixpkgs by @danieldk in #2746
benchmark: fix prefill throughput by @danieldk in #2741
Fix: Change model_type from ssm to mamba by @mokeddembillel in #2740
Fix: Change embeddings to embedding by @mokeddembillel in #2738
fix response type of document for Text Generation Inference by @jitokim in #2743
Upgrade outlines to 0.1.1 by @aW3st in #2742
Upgrading our deps. by @Narsil in #2750
feat: return streaming errors as an event formatted for openai's client by @drbh in #2668
Remove vLLM dependency for CUDA by @danieldk in #2751
fix: improve find_segments via numpy diff by @drbh in #2686
add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in #2707
Add support for compressed-tensors w8a8 int checkpoints by @danieldk in #2745
feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in #2721
Simplify two ipex conditions by @danieldk in #2755
Update to moe-kernels 0.7.0 by @danieldk in #2720
PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme by @drbh in #2645
fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in #2760
nix: update for outlines 0.1.4 by @danieldk in #2764
Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in #2758
nix: build and cache impure devshells by @danieldk in #2765
fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in #2766
nix: downgrade to outlines 0.1.3 by @danieldk in #2768
fix: incomplete generations w/ single token generations and models that did not support chunking by @OlivierDehaene in #2770
fix: tweak grammar test response by @drbh in #2769
Add a README section about using Nix by @danieldk in #2767
Remove guideline from API by @Wauplin in #2762
feat: Add automatic nightly benchmarks by @Hugoch in #2591
feat: add payload limit by @OlivierDehaene in #2726
Update to marlin-kernels 0.3.6 by @danieldk in #2771
chore: prepare 2.4.1 release by @OlivierDehaene in #2773
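
The `find_segments` change listed above (#2686) names a general technique: locating runs of equal values with `numpy.diff`. The sketch below illustrates that technique only. The function name, its return shape, and the use of -1 for "no adapter" are assumptions made for illustration, not the code from the PR.

```python
import numpy as np


def find_segments_sketch(adapter_indices):
    """Split a sequence into runs of equal consecutive values.

    Returns (boundaries, values): boundaries[i]:boundaries[i + 1] is the
    slice covered by run i, and values[i] is the value repeated in it.
    """
    indices = np.asarray(adapter_indices)
    if indices.size == 0:
        return [0], []
    # A nonzero difference means the value changed between two neighbours,
    # so the run starts one position after each nonzero entry of the diff.
    change_points = np.flatnonzero(np.diff(indices)) + 1
    starts = np.concatenate(([0], change_points))
    boundaries = np.concatenate((starts, [indices.size]))
    return boundaries.tolist(), indices[starts].tolist()


if __name__ == "__main__":
    print(find_segments_sketch([0, 0, 1, 1, 1, -1]))
```

Computing the boundaries with one vectorized diff avoids a Python-level loop over every element, which is the usual motivation for this kind of rewrite.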

New Contributors

@tgaddair made their first contribution in #2711
@mokeddembillel made their first contribution in #2740
@jitokim made their first contribution in #2743

Full Changelog: v2.3.0...v2.4.1