Release v4.44.0: End-to-end generate compilation! Gemma 2 (with assisted decoding), Codestral (Mistral for code), Nemotron, efficient SFT training, CPU-offloaded KV cache, torch.export for static cache
This release comes a bit early in our cycle because we wanted to ship important and requested models along with performance improvements for everyone!
All of these come with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 It showcases what is now possible with the shipped features. Kudos to @gante, @sanchit-gandhi and @xenova
💥 End-to-end generation compile
- Generate: end-to-end compilation #30788 by @gante: `model.generate` can now be compiled with `torch.compile`! There are a few limitations, but here is a small snippet:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)

output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
```
⚡ 3 to 5x compile speedup (compilation time 👀, not runtime)
- 3-5x faster torch.compile forward compilation for autoregressive decoder models #32227 by @fxmarty.
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run `model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)`, as in the sketch below.
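A minimal sketch of that pattern (the Llama 3.1 checkpoint from above and a CUDA GPU are assumed, purely for illustration): compile the forward pass once and let `generate` reuse the compiled graph through a static cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile only the forward pass; this is the compilation step that #32227 makes 3-5x faster
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
# a static cache keeps tensor shapes fixed, so the compiled graph is reused at every decoding step
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```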
🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀
- Offloaded KV Cache #31325 by @n17s: you just have to set `cache_implementation="offloaded"` when calling `generate()`, or pass it through a `GenerationConfig`:
```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as num_beams=4, num_beam_groups=2, num_return_sequences=4,
    # diversity_penalty=1.0, max_new_tokens=50, early_stopping=True
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
```
📦 Torch export for static cache
The PyTorch team gave us a great gift: you can now use `torch.export` with the static cache, and the exported program is directly compatible with ExecuTorch! Find examples here.
This also unlocks support for prompt reuse:
```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# run the shared prompt once and keep its KV cache for later reuse
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(device)
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to(device)
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to(device)
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
```
Gemma2: assisted decoding
- Gemma 2: support assisted generation #32357 by @gante
We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.
```python
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON'T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
Nemotron support
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
- Add Nemotron HF Support #31699
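A minimal loading sketch, assuming a checkpoint already converted to the Hugging Face format (the `nvidia/Minitron-4B-Base` name below is illustrative, not an official recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# illustrative checkpoint name: any Nemotron/Minitron weights converted to the HF format
ckpt = "nvidia/Minitron-4B-Base"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The Eiffel Tower was built in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```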
Codestral support
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.
Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.
It uses the Mamba2 architecture; it was a bit of a pain to remove all the einops, but we hope we made it better for everyone!
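Here is a minimal sketch of loading it through the new Mamba2 support; the `mistralai/Mamba-Codestral-7B-v0.1` checkpoint name is an assumption, and the device/dtype choices are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# illustrative checkpoint name for Codestral Mamba weights in the HF format
ckpt = "mistralai/Mamba-Codestral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```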
Breaking changes:
We removed the default chat templates from the code; chat templates should now all live on the Hub! A short example of the new behaviour follows the PR link below.
- 🚨 No more default chat templates #31733 by @Rocketknight1
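Concretely, a tokenizer whose checkpoint does not ship a chat template must now be given one explicitly before `apply_chat_template` can be used; the template string below is purely illustrative:

```python
from transformers import AutoTokenizer

# gpt2 ships no chat template on the Hub, so there is no longer a silent class-level default
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# set a template explicitly (illustrative Jinja template, not one any model officially expects)
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

chat = [{"role": "user", "content": "Hello, who are you?"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
print(prompt)
```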
Long-form decoding for whisper, even faster:
Our great @sanchit-gandhi worked on porting the recent compile upgrades to long-form decoding in:
- [whisper] compile compatibility with long-form decoding #31772
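A minimal usage sketch combining the static cache, a compiled forward pass, and long-form (over 30 seconds) audio; the checkpoint and dataset choices are illustrative and a CUDA GPU is assumed:

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch.float16
).to("cuda")

# static cache + compiled forward: the combination this PR makes compatible with long-form decoding
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]  # roughly one minute of audio, i.e. long-form

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,        # keep the full long-form audio instead of the 30s window
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to("cuda", torch.float16)

generated_ids = model.generate(**inputs, return_timestamps=True)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```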
What's Changed
- Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
- Updated `ruff` to the latest version by @Sai-Suraj-27 in #31926
- fix by @gante in #32162
- fix: Fixed an if condition that is always evaluating to true by @Sai-Suraj-27 in #32160
- [docs] change temperature to a positive value by @faaany in #32077
- adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by @rohitdwivedula in #32171
- fix: default value reflects the runtime environment variables rather than the ones present at import time. by @junrae6454 in #32153
- Update qwen2.md by @ArtificialZeng in #32108
- Remove conversational pipeline tests by @amyeroberts in #32099
- RoPE: relaxed rope validation by @gante in #32182
- let's not warn when someone is running a forward by @ArthurZucker in #32176
- Fix resize embedding with Deepspeed by @zucchini-nlp in #32192
- Fix float8_e4m3fn in modeling_utils by @SunMarc in #32193
- Support dequantizing GGUF FP16 format by @PenutChen in #31783
- 🚨 No more default chat templates by @Rocketknight1 in #31733
- fix: Replaced deprecated `unittest` method with the correct one by @Sai-Suraj-27 in #32198
- [whisper] fix short-form output type by @sanchit-gandhi in #32178
- remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 by @statelesshz in #32210
- Update question_answering.py by @avlewis in #32208
- [BigBird Pegasus] set _supports_param_buffer_assignment to False by @kashif in #32222
- [warnings] fix E721 warnings by @kashif in #32223
- Follow up for #31973 by @ydshieh in #32025
- translate philosophy.md to chinese by @statelesshz in #32177
- Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac by @jrhe in #31846
- Fix code snippet for Grounding DINO by @qubvel in #32229
- Generation: stop at `eos` for assisted decoding by @zucchini-nlp in #31301
- Llava: generate without images by @zucchini-nlp in #32183
- Resize embeds with DeepSpeed by @zucchini-nlp in #32214
- don't log base model architecture in wandb if log model is false by @joaonadkarni in #32143
- Refactor: Removed un-necessary `object` base class by @Sai-Suraj-27 in #32230
- Adds: extra_repr for RMSNorm layers in most models by @rohitdwivedula in #32204
- Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 by @catalys1 in #31934
- [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` by @faaany in #32039
- Flash-Attn: fix generation when no attention mask or no padding by @zucchini-nlp in #32241
- More flexible trigger condition by @ydshieh in #32251
- Llama 3.1: replace for loop by tensor ops at inv_freq initialization by @gante in #32244
- 🚨 Bloom support for cache class by @zucchini-nlp in #31445
- Upload new model failure report to Hub by @ydshieh in #32264
- Optimize t5 tokenize logic to avoid redundant calls by @leejet in #32270
- fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call by @Sai-Suraj-27 in #32262
- Repo: remove exceptions in `check_docstrings` by @gante in #32259
- make `p_mask` a numpy array before passing to `select_starts_ends` by @faaany in #32076
- fix(docs): Fixed a link in docs by @Sai-Suraj-27 in #32274
- Generate: end-to-end compilation by @gante in #30788
- Whisper tokenizer word level timestamps by @kamilakesbi in #32197
- [pipeline] fix padding for 1-d tensors by @sanchit-gandhi in #31776
- Make static cache compatible with torch.export by @guangy10 in #32168
- Add stream messages from agent run for gradio chatbot by @aymeric-roucher in #32142
- use torch 2.4 in 2 CI jobs by @ydshieh in #32302
- Docs: fix GaLore optimizer code example by @gil2rok in #32249
- Fix GGUF dequantize for `gguf==0.9.1` by @Isotr0py in #32298
- Cast epochs_trained to int when resuming training by @teddy-f-47 in #32286
- feat(ci): set `fetch-depth: 0` in trufflehog checkout step by @McPatate in #31663
- Fix M4T for ASR pipeline by @ylacombe in #32296
- Docs: formatting nits by @gante in #32247
- Alternative agent plan by @plaggy in #32295
- fix: Added missing raise keyword for few exceptions by @Sai-Suraj-27 in #32333
- fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit by @winglian in #32276
- fixes #32329 : The Torch code is correct - to get an average of 10% o… by @fkrasnov2 in #32335
- Repo checks: skip docstring checks if not in the diff by @gante in #32328
- Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process by @xenova in #32191
- LLaVA-NeXT: fix anyres shapes by @zucchini-nlp in #32314
- Gemma2 and flash-attention by @zucchini-nlp in #32188
- Llama 3.1: Fix incorrect `inv_freq` assignment by @gante in #32330
- [Idefics2] - Fix FA2 call for Perceiver layer by @amyeroberts in #32275
- Gemma 2: support assisted generation by @gante in #32357
- Fix error when streaming to gradio with non-string tool arguments by @aymeric-roucher in #32360
- 3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in #32227
- fix: Fixed `staticmethods` with self as first argument by @Sai-Suraj-27 in #32361
- fix: warmup_steps check for training_args by @Ricardo-L-C in #32236
- LLaVa: add cache class attribute by @zucchini-nlp in #32278
- [enc-dec cache] fix bug in indexing by @sanchit-gandhi in #32370
- [whisper] compile compatibility with long-form decoding by @sanchit-gandhi in #31772
- Remove size check between attn_weights and kv_seq_len for phi3 by @helunwencser in #32339
- add missing attribute _supports_param_buffer_assignment for gpt-j. by @nv-guomingz in #32359
- Check device map for saving tokenizer config on TPU (fix for issue #31971) by @ayukh in #32043
- update clean_up_tokenization_spaces warning by @itazap in #32371
- Empty list in defaults for LLaMA special tokens during weights conversion by @ViktorooReps in #32342
- Fix conflicting key in init kwargs in PreTrainedTokenizerBase by @OmarManzoor in #31233
- Offloaded KV Cache by @n17s in #31325
- Docker: add `speech` dep to the consistency docker image by @gante in #32374
- Fixed Hybrid Cache Shape Initialization. by @OsamaS99 in #32163
- Yell at the user if zero-3 init wasn't performed, but expected to have been done by @muellerzr in #32299
- Update docs by @zucchini-nlp in #32368
- RoPE: Add numerical tests ✨ by @gante in #32380
- [generate] only require an attention mask for mps with torch<2.4 by @sanchit-gandhi in #32367
- fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. by @fshp971 in #32157
- MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. by @Luke20000429 in #31500
- Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer by @dependabot in #32393
- fix: SeamlessM4TFeatureExtractor stride remainder by @TechInterMezzo in #32088
- Phi3 tests: fix typing for Python 3.8 by @zucchini-nlp in #32388
- #32184 save total_vocab_size by @itazap in #32240
- add values for neftune by @nbroad1881 in #32399
- Fix documentation references to google/bit-50 model by @JuanFKurucz in #32407
- Persist embedding type of BART and mBART models after resize by @AbdiHaryadi in #32242
- fix: Updated `test_embeded_special_tokens` for luke and mluke models by @Sai-Suraj-27 in #32413
- Respect the config's attn_implementation if set by @amyeroberts in #32383
- Fix documentation links and code reference to model llava-next by @JuanFKurucz in #32434
- Cache: create docs by @zucchini-nlp in #32150
- Llava: fix checkpoint_doc by @RUFFY-369 in #32458
- add the missing flash attention test marker by @faaany in #32419
- Update kwargs validation for `preprocess` with decorator by @qubvel in #32024
- Fix get large model config for Switch Transformer encoder only tester by @JuanFKurucz in #32438
- Dependencies: fix typo by @gante in #32389
- Add Nemotron HF Support by @suiyoubi in #31699
- Generate: fix end to end compilation by @gante in #32465
- Add codestral mamba2 by @molbap in #32080
New Contributors
- @RhuiDih made their first contribution in #31629
- @rohitdwivedula made their first contribution in #32171
- @ArtificialZeng made their first contribution in #32108
- @avlewis made their first contribution in #32208
- @jrhe made their first contribution in #31846
- @joaonadkarni made their first contribution in #32143
- @catalys1 made their first contribution in #31934
- @leejet made their first contribution in #32270
- @guangy10 made their first contribution in #32168
- @gil2rok made their first contribution in #32249
- @teddy-f-47 made their first contribution in #32286
- @plaggy made their first contribution in #32295
- @fkrasnov2 made their first contribution in #32335
- @helunwencser made their first contribution in #32339
- @nv-guomingz made their first contribution in #32359
- @ayukh made their first contribution in #32043
- @n17s made their first contribution in #31325
- @OsamaS99 made their first contribution in #32163
- @fshp971 made their first contribution in #32157
- @Luke20000429 made their first contribution in #31500
- @TechInterMezzo made their first contribution in #32088
- @AbdiHaryadi made their first contribution in #32242
- @RUFFY-369 made their first contribution in #32458
- @suiyoubi made their first contribution in #31699
Full Changelog: v4.43.4...v4.44.0