huggingface/text-generation-inference v2.1.0 on GitHub

Notable changes

New models: gemma2
Multi-LoRA adapters. You can now run multiple LoRA adapters on the same TGI deployment (#2010).
Faster GPTQ inference and Marlin support (up to 2x speedup).
Reworked the entire scheduling logic (better block allocation, and room for further speedups in future releases).
Lots of ROCm support and bug fixes.
Lots of new contributors! Thanks a lot for these contributions.
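
As a rough illustration of the multi-LoRA item above, the sketch below builds per-request payloads that pick different adapters on a single deployment. It assumes the generate route accepts an `adapter_id` field under `parameters` and that the adapters were already loaded when the server started. The adapter names and the `build_generate_payload` helper are hypothetical, so check the TGI documentation for the exact options in your version.

```python
import json


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=40):
    """Build a JSON body for a text-generation request.

    `adapter_id` is assumed to name a LoRA adapter already loaded by the
    server; leaving it as None targets the base model.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


# Hypothetical adapter names, used only for illustration.
payloads = [
    build_generate_payload("Hello, who are you?", adapter_id="org/adapter-a"),
    build_generate_payload("Hello, who are you?", adapter_id="org/adapter-b"),
    build_generate_payload("Hello, who are you?"),  # base model, no adapter
]

for payload in payloads:
    print(json.dumps(payload))
```

Each payload would be sent as the body of a POST request to the deployment. Only the `adapter_id` value changes between requests, and the same server handles all of them.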

What's Changed

OpenAI function calling compatible support by @phangiabao98 in #1888
Fixing types. by @Narsil in #1906
Types. by @Narsil in #1909
Fixing signals. by @Narsil in #1910
Removing some unused code. by @Narsil in #1915
MI300 compatibility by @fxmarty in #1764
Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
Update grafana template by @fxmarty in #1918
Fix TunableOp bug by @fxmarty in #1920
Fix TGI issues with ROCm by @fxmarty in #1921
Fixing the download strategy for ibm-fms by @Narsil in #1917
ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
docs: Fix grafana dashboard url by @edwardzjl in #1925
feat: include token in client test like server tests by @drbh in #1932
Creating doc automatically for supported models. by @Narsil in #1929
fix: use path inside of speculator config by @drbh in #1935
feat: add train medusa head tutorial by @drbh in #1934
reenable xpu for tgi by @sywangyi in #1939
Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
Improving the logging system. by @Narsil in #1938
Fixing codellama loads by using purely AutoTokenizer. by @Narsil in #1947
Fix seeded output. by @Narsil in #1949
Fix (flash) Gemma prefix and enable tests by @danieldk in #1950
Fix GPTQ for models which do not have float16 at the default dtype (simpler) by @danieldk in #1953
Processor config chat template by @drbh in #1954
fix small typo and broken link by @MoritzLaurer in #1958
Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). by @Narsil in #1959
Fix (non-container) pytest stdout buffering-related lock-up by @danieldk in #1963
Fixing the text part from tokenizer endpoint. by @Narsil in #1967
feat: adjust attn weight loading logic by @drbh in #1975
Add support for exl2-quantized models by @danieldk in #1965
Update documentation version to 2.0.4 by @fxmarty in #1980
Purely refactors paged/attention into layers/attention and make hardware differences more obvious with 1 file per hardware. by @Narsil in #1986
Fixing exl2 scratch buffer. by @Narsil in #1990
single char ` addition for docs by @nbroad1881 in #1989
Fixing GPTQ imports. by @Narsil in #1994
reenable xpu, broken by gptq and setuptools upgrade by @sywangyi in #1988
router: send the input as chunks to the backend by @danieldk in #1981
Fix Phi-2 with tp>1 by @danieldk in #2003
fix: update triton implementation reference by @emmanuel-ferdman in #2002
feat: add SchedulerV3 by @OlivierDehaene in #1996
Support GPTQ models with column-packed up/gate tensor by @danieldk in #2006
Making make install work better by default. by @Narsil in #2004
Hotfixing make install. by @Narsil in #2008
Do not initialize scratch space when there are no ExLlamaV2 layers by @danieldk in #2015
feat: move allocation logic to rust by @OlivierDehaene in #1835
Fixing rocm. by @Narsil in #2021
Fix GPTQWeight import by @danieldk in #2020
Update version on __init__.py to 0.7.0 by @andimarafioti in #2017
Add support for Marlin-quantized models by @danieldk in #2014
marlin: support tp>1 when group_size==-1 by @danieldk in #2032
marlin: improve build by @danieldk in #2031
Internal runner ? by @Narsil in #2023
Xpu gqa by @sywangyi in #2013
server: use chunked inputs by @danieldk in #1985
ROCm and sliding windows fixes by @fxmarty in #2033
Add Phi-3 medium support by @danieldk in #2039
feat(ci): add trufflehog secrets detection by @McPatate in #2038
fix(ci): remove unnecessary permissions by @McPatate in #2045
Update LLMM1 bound by @fxmarty in #2050
Support chat response format by @drbh in #2046
fix(server): fix OPT implementation by @OlivierDehaene in #2061
fix(layers): fix SuRotaryEmbedding by @OlivierDehaene in #2060
PR #2049 CI run by @drbh in #2054
implement Open Inference Protocol endpoints by @drbh in #1942
Add support for GPTQ Marlin by @danieldk in #2052
Update the link for qwen2 by @xianbaoqian in #2068
Adding architecture document by @tengomucho in #2044
Support different image sizes in prefill in VLMs by @danieldk in #2065
Contributing guide & Code of Conduct by @LysandreJik in #2074
fix build.rs watch files by @zirconium-n in #2072
Set maximum grpc message receive size to 2GiB by @danieldk in #2075
CI: Tailscale improvements by @glegendre01 in #2079
CI: pass pre-commit hooks again by @danieldk in #2084
feat: rotate tests ci token by @drbh in #2091
Support exl2-quantized Qwen2 models by @danieldk in #2085
Factor out sharding of packed tensors by @danieldk in #2059
Fix text-generation-server quantize by @danieldk in #2103
feat: sort cuda graphs in descending order by @drbh in #2104
New runner. Manual squash. by @Narsil in #2110
Fix cargo-chef prepare by @ur4t in #2101
Support HF_TOKEN environment variable by @Wauplin in #2066
Add OTLP Service Name Environment Variable by @KevinDuffy94 in #2076
corrected Pydantic warning. by @yukiman76 in #2095
use xpu-smi to dump used memory by @sywangyi in #2047
fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api by @sunxichen in #2089
Cpu tgi by @sywangyi in #1936
feat: add simple tests for weights by @drbh in #2092
Removing IPEX_AVAIL. by @Narsil in #2115
fix cpu and xpu issue by @sywangyi in #2116
Add pytest release marker by @danieldk in #2114
Fix CI. by @Narsil in #2118
Enable multiple LoRa adapters by @drbh in #2010
Support AWQ quantization with bias by @danieldk in #2117
Add support for Marlin 2:4 sparsity by @danieldk in #2102
fix: simplify kserve endpoint and fix imports by @drbh in #2119
Fixing prom leak by upgrading. by @Narsil in #2129
Bumping to 2.1 by @Narsil in #2131
Idefics2: sync added image tokens with transformers by @danieldk in #2080
Fixing malformed rust tokenizers by @Narsil in #2134
Fixing gemma2. by @Narsil in #2135
fix: refactor post_processor logic and add test by @drbh in #2137

New Contributors

@phangiabao98 made their first contribution in #1888
@edwardzjl made their first contribution in #1925
@thomas-schillaci made their first contribution in #1869
@nbroad1881 made their first contribution in #1989
@emmanuel-ferdman made their first contribution in #2002
@andimarafioti made their first contribution in #2017
@McPatate made their first contribution in #2038
@xianbaoqian made their first contribution in #2068
@tengomucho made their first contribution in #2044
@LysandreJik made their first contribution in #2074
@zirconium-n made their first contribution in #2072
@glegendre01 made their first contribution in #2079
@ur4t made their first contribution in #2101
@KevinDuffy94 made their first contribution in #2076
@yukiman76 made their first contribution in #2095
@sunxichen made their first contribution in #2089

Full Changelog: v2.0.3...v2.1.0