Llama
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of llama recipes to showcase usage for inference as well as full and partial fine-tuning of the different variants.
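As a quick way to try the new checkpoints, they can be loaded through the standard text-generation pipeline. This is a minimal sketch; the repository id below is an assumption about the Hub naming, and the checkpoints are gated, so access must be requested on the Hub first.

```python
import torch
from transformers import pipeline

# Assumed Hub repository id for the 8B instruct variant; adjust to the actual
# (gated) checkpoint name you have access to.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("Give me one fun fact about llamas.", max_new_tokens=64))
```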
Chameleon
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Meta AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses. A minimal inference sketch follows the PR link below.
- Chameleon: add model by @zucchini-nlp in #31534
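A minimal sketch of mixed image+text prompting, under the assumption that the facebook/chameleon-7b checkpoint id and the `<image>` placeholder convention match what is documented for the model:

```python
import requests
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

# Assumed checkpoint id; swap in the Chameleon checkpoint you want to use.
model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
# The <image> tag marks where the quantized image tokens are interleaved with the text.
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# Match the pixel values to the model's compute dtype.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```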
ZoeDepth
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head with a novel bin adjustment design, called the metric bins module, is used for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. A minimal pipeline sketch follows the PR link below.
- Add ZoeDepth by @NielsRogge in #30136
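Since ZoeDepth plugs into the existing depth-estimation pipeline, usage can look like the following minimal sketch; the checkpoint id is an assumption, so pick whichever ZoeDepth checkpoint on the Hub fits your domain:

```python
import requests
from PIL import Image
from transformers import pipeline

# Assumed checkpoint id for the NYU+KITTI fine-tuned variant.
depth_estimator = pipeline("depth-estimation", model="Intel/zoedepth-nyu-kitti")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth_estimator(image)
print(result["predicted_depth"].shape)  # raw metric depth tensor
result["depth"].show()                  # depth map rendered as a PIL image
```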
Hiera
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer.
The paper introduces "Hiera," a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed "bells-and-whistles," are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed in both inference and training across various image and video recognition tasks. The approach suggests that the spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity. A minimal classification sketch follows the PR link below.
- Adding hiera by @Namangarg110 in #30356
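A minimal image-classification sketch using the Auto classes; the checkpoint id below is an assumption, so substitute any Hiera classification checkpoint from the Hub:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed checkpoint id for an ImageNet-1k fine-tuned Hiera variant.
checkpoint = "facebook/hiera-tiny-224-in1k-hf"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```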
Agents
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine figure out what to return, so we generalized the final_answer tool to all agents.
- Adds final answer tool for all agents by @aymeric-roucher in #31703
Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later reuse!
- Code agent: allow function persistence between steps by @aymeric-roucher in #31769
This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. Planning is activated if you set an integer for planning_interval upon agent initialization. At step 0, an initial plan is generated. At later steps (e.g. steps 3, 6, and 9 if you set planning_interval=3), this plan is updated by the agent based on the history of previous steps. More detail soon! A minimal usage sketch follows the PR link below.
- Agents planning by @aymeric-roucher in #31702
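A minimal sketch of turning planning on, under the assumption that planning_interval is passed directly to the agent constructor and that llm_engine can be any callable mapping a list of chat messages to a completion string:

```python
from transformers.agents import ReactCodeAgent

# Placeholder engine: any callable taking a list of chat messages (and optional
# stop sequences) and returning a string can serve as llm_engine.
def llm_engine(messages, stop_sequences=None):
    raise NotImplementedError("Plug in a call to your preferred LLM here.")

# With planning_interval=3, the agent drafts a plan at step 0 and revises it at
# steps 3, 6, 9, ... based on the history of previous steps.
agent = ReactCodeAgent(tools=[], llm_engine=llm_engine, planning_interval=3)

# agent.run("How many seconds does light take to travel from the Sun to the Earth?")
```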
Notable changes to the codebase
A significant RoPE refactor was done to make it model-agnostic and more easily adaptable to any architecture. It is only applied to Llama for now, but will be applied to all models using RoPE over the coming days.
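In practice, the refactor means RoPE variants are selected through the model config rather than per-model code. The sketch below is a hedged illustration for a Llama-architecture checkpoint: the rope_scaling keys ("rope_type", "factor") mirror the Llama implementation, and the non-gated checkpoint id is only illustrative.

```python
from transformers import AutoModelForCausalLM

# Illustrative, non-gated Llama-architecture checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # Select a RoPE variant via the config: dynamic NTK scaling with factor 2.
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
print(model.config.rope_scaling)
```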
Breaking changes
TextGenerationPipeline and tokenizer kwargs
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. This means some models using `TextGenerationPipeline` previously did not add a `<bos>` token by default, which (negatively) impacted their performance. In practice, this is a breaking change.
Example of a script changed as a result of this PR:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Gemma-2's tokenizer prepends <bos> by default; the pipeline now follows that default.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
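To see which special tokens the pipeline will now add for a given checkpoint, it is enough to inspect the tokenizer's own defaults; the exact tokens printed depend on the model and are only an illustration here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
ids = tokenizer("Foo bar").input_ids
# The pipeline now relies on these defaults; for Gemma-2 the sequence starts with <bos>.
print(tokenizer.convert_ids_to_tokens(ids))
```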
Bugfixes and improvements
- Fix post gemma merge by @ArthurZucker in #31660
- Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
- [docs] Llama3 by @stevhliu in #31662
- [HybridCache] Fix `get_seq_length` method by @sanchit-gandhi in #31661
- don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
- Fix Gemma2 4d attention mask by @hiyouga in #31674
- Fix return_dict in encodec by @jla524 in #31646
- add gather_use_object arguments by @SangbumChoi in #31514
- Gemma capping is a must for big models by @ArthurZucker in #31698
- Add French version of run scripts tutorial by @jadechoghari in #31483
- dependencies: `keras-nlp<0.14` pin by @gante in #31684
- remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
- Move some test files (`tets/test_xxx_utils.py`) to `tests/utils` by @ydshieh in #31730
- Fix mistral ONNX export by @fxmarty in #31696
- [whisper] static kv cache by @sanchit-gandhi in #31166
- Make tool JSON schemas consistent by @Rocketknight1 in #31756
- Fix documentation for Gemma2. by @jbornschein in #31682
- fix assisted decoding by @jiqing-feng in #31401
- Requires for torch.tensor before casting by @echarlaix in #31755
- handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
- Gemma 2: Update slow tests by @gante in #31759
- Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
- [fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
- Fix RT-DETR cache for generate_anchors by @qubvel in #31671
- Fix RT-DETR weights initialization by @qubvel in #31724
- `pytest_num_workers=4` for some CircleCI jobs by @ydshieh in #31764
- Fix Gemma2 types by @hiyouga in #31779
- Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
- Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
- Fix serialization for offloaded model by @SunMarc in #31727
- Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
- Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
- Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
- Fix gemma tests by @ydshieh in #31794
- Add training support for SigLIP by @aliencaocao in #31495
- Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
- Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
- Fix galore lr display with schedulers by @vasqu in #31710
- Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
- Depth Anything: update conversion script for V2 by @pcuenca in #31522
- Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
- Add FA2 and `sdpa` support for SigLIP by @qubvel in #31499
- Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
- Fix typos by @omahs in #31819
- transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
- Avoid failure `TFBlipModelTest::test_pipeline_image_to_text` by @ydshieh in #31827
- Fix incorrect accelerator device handling for MPS in `TrainingArguments` by @andstor in #31812
- Mamba & RecurrentGemma: enable strict signature by @gante in #31549
- Deprecate `vocab_size` in other two VLMs by @zucchini-nlp in #31681
- FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
- [Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
- chore: remove duplicate words by @hattizai in #31853
- save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
- Test loading generation config with safetensor weights by @gante in #31550
- docs: typo in tf qa example by @chen-keinan in #31864
- Generate: Add new decoding strategy "DoLa" in `.generate()` by @voidism in #29619
- Fix `_init_weights` for `ResNetPreTrainedModel` by @ydshieh in #31851
- Update depth estimation task guide by @merveenoyan in #31860
- Bump zipp from 3.7.0 to 3.19.1 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31871
- Add return type annotation to PreTrainedModel.from_pretrained by @mauvilsa in #31869
- Revert "Fix
_init_weights
forResNetPreTrainedModel
" by @ydshieh in #31868 - Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/visual_bert by @dependabot[bot] in #31872
- add warning when using gradient_checkpointing with FSDP full shard by @yundai424 in #31578
- Add conversion for interleave llava by @zucchini-nlp in #31858
- remove duplicate words in msg by @yukionfire in #31876
- Fix file type checks in data splits for contrastive training example script by @npyoung in #31720
- Fix failed tests in #31851 by @ydshieh in #31879
- fix: Removed `duplicate` field definitions in some classes by @Sai-Suraj-27 in #31888
- Push sharded checkpoint to hub when `push_to_hub=True` in `TrainingArguments` by @SunMarc in #31808
- [RT-DETR] Add resources by @NielsRogge in #31815
- Modify `warnings` in a `with` block to avoid flaky tests by @ydshieh in #31893
- Add a condition for nested_detach by @haikuoxin in #31855
- InstructBlipVideo: Update docstring by @zucchini-nlp in #31886
- Fixes to alternating SWA layers in Gemma2 by @turboderp in #31775
- Processor accepts any kwargs by @zucchini-nlp in #31889
- [`ConvertSlow`] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
- [`Gemma2`] Support FA2 softcapping by @ArthurZucker in #31887
- Fix missing methods for Fuyu by @Isotr0py in #31880
- fix: Fixed the `1st argument` name in classmethods by @Sai-Suraj-27 in #31907
- add gather_use_object arguments II by @SangbumChoi in #31799
- Add warning message for beta and gamma parameters by @OmarManzoor in #31654
- Fix fx tests with inputs_embeds by @fxmarty in #31862
- Refactor flash attention implementation in transformers by @ArthurZucker in #31446
- Generate: fix `SlidingWindowCache.reset()` by @gante in #31917
- 🚨 fix(SigLip): remove spurious exclusion of first vision output token by @transmissions11 in #30952
- Allow `Trainer.get_optimizer_cls_and_kwargs` to be overridden by @apoorvkh in #31875
- [Bug Fix] fix qa pipeline tensor to numpy by @jiqing-feng in #31585
- Docker: TF pin on the consistency job by @gante in #31928
- fix prompt strip to support tensors and np arrays by @AvivSham in #27818
- Fix `GenerationMixin.generate` compatibility with pytorch profiler by @fxmarty in #31935
- Generate: remove deprecated code due to `Cache` and `cache_position` being default by @gante in #31898
- Generate: v4.42 deprecations 🧹🧹 by @gante in #31956
- Whisper: move to tensor cpu before converting to np array at decode time by @gante in #31954
- fix: Removed a wrong key-word argument in `sigmoid_focal_loss()` function call by @Sai-Suraj-27 in #31951
- Generate: handle `logits_warper` update in models with custom generate fn by @gante in #31957
- fix: Fixed the arguments in `create_repo()` function call by @Sai-Suraj-27 in #31947
- Notify new docker images built for circleci by @ydshieh in #31701
- Avoid race condition by @ydshieh in #31973
- Masking: remove flakiness from test by @gante in #31939
- Generate: doc nits by @gante in #31982
- Fix the incorrect permutation of gguf by @PenutChen in #31788
- Cambricon MLUs support SDPA and flash_attn by @huismiling in #31102
- Speedup model init on CPU (by 10x+ for llama-3-8B as one example) by @muellerzr in #31771
- [tests] fix deepspeed zero3 config for `test_stage3_nvme_offload` by @faaany in #31881
- Fix bad test about slower init by @muellerzr in #32002
- Tests: remove cuda versions when the result is the same 🧹🧹 by @gante in #31955
- Bug report update by @gante in #31983
- add flash-attn deterministic option to flash-attn>=2.4.1 by @junrae6454 in #31961
- fix: Fixed incorrect dictionary assignment in `src/transformers/__init__.py` by @Sai-Suraj-27 in #31993
- Bug report update -- round 2 by @gante in #32006
- Fix gather when collecting 'num_input_tokens_seen' by @CodeCreator in #31974
- Fix if else and actually enable superfast init by @muellerzr in #32007
- SpeechEncoderDecoder doesn't support param buffer assignments by @muellerzr in #32009
- Fix tests skip by @qubvel in #32012
- Fixed `log messages` that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
- Fix typo in classification function selection logic to improve code consistency by @moses in #32031
- doc: fix broken BEiT and DiNAT model links on Backbone page by @dvrogozh in #32029
- Pass missing arguments to `SeamlessM4Tv2ConformerEncoderLayer.forward()` when gradient checkpointing is enabled by @anferico in #31945
- Add language to word timestamps for Whisper by @robinderat in #31572
- Add `sdpa` and FA2 for CLIP by @qubvel in #31940
- unpin `numpy<2.0` by @ydshieh in #32018
- Chameleon: minor fixes after shipping by @zucchini-nlp in #32037
- Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31458
- Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples by @dependabot[bot] in #32052
- [mistral] Support passing `head_dim` through config (and do not require `head_dim * num_heads == hidden_size`) by @xenova in #32050
- Add torch.compile Support For Mamba by @zhenglongjiepheonix in #31247
- fix: Removed `duplicate entries` in a dictionary by @Sai-Suraj-27 in #32041
- docs: Fixed 2 links in the docs along with some minor fixes by @Sai-Suraj-27 in #32058
- Llava: add default chat templates by @zucchini-nlp in #31691
- [Chameleon, Hiera] Improve docs by @NielsRogge in #32038
- Incorrect Whisper long-form decoding timestamps by @kamilakesbi in #32003
- [mistral] Fix FA2 attention reshape for Mistral Nemo by @xenova in #32065
- VideoLLaVa: fix chat format in docs by @zucchini-nlp in #32083
- Fix progress callback deepcopy by @fozziethebeat in #32070
- Fixes to chameleon docs by @merveenoyan in #32078
- Add image-text-to-text task guide by @merveenoyan in #31777
- Support generating with fallback for short form audio in Whisper by @kamilakesbi in #30984
- Disable quick init for deepspeed by @muellerzr in #32066
- Chameleon: not supported with fast load by @zucchini-nlp in #32091
- Fix tests after `huggingface_hub` 0.24 by @Wauplin in #32054
- Fix shard order by @b-chu in #32023
- Generate: store special token tensors under a unique variable name by @gante in #31980
- fix: Replaced deprecated `mktemp()` function by @Sai-Suraj-27 in #32123
- Mention model_info.id instead of model_info.modelId by @Wauplin in #32106
- [generate] fix eos/pad id check on mps devices by @sanchit-gandhi in #31695
- Fix failing test with race condition by @Rocketknight1 in #32140
- Update `ko/_toctree.yml` and remove `custom_tools.md` to reflect latest changes by @jungnerd in #31969
- fix: Fixed raising `TypeError` instead of `ValueError` for invalid type by @Sai-Suraj-27 in #32111
- [RoBERTa] Minor clarifications to model doc by @bt2513 in #31949
- Return assistant generated tokens mask in apply_chat_template by @yonigottesman in #30650
- Don't default to other weights file when use_safetensors=True by @amyeroberts in #31874
- set warning level to info for special tokens have been added by @ArthurZucker in #32138
- Add new quant method by @SunMarc in #32047
- Add llama3-llava-next-8b to llava_next conversion script by @jamt9000 in #31395
- LLaVaNeXT: pad on right if training by @zucchini-nlp in #32134
- Remove `trust_remote_code` when loading Libri Dummy by @sanchit-gandhi in #31748
- [modelling] remove un-necessary transpose for fa2 attention by @sanchit-gandhi in #31749
- Fix mask creations of `GPTNeoX` and `GPT2` by @vasqu in #31944
- Add method to retrieve used chat template by @KonradSzafer in #32032
- Add YaRN and Dynamic-YaRN RoPE Scaling Methods by @mig-mfreitas in #30910
- Disable quick init for TapasPreTrainedModel by @daniellok-db in #32149
- Modify resize_token_embeddings to ensure output type is same as input by @bayllama in #31979
- gguf conversion add_prefix_space=None for llama3 by @itazap in #31937
- Fix flash attention speed issue by @Cyrilvallez in #32028
- Fix video batching to videollava by @merveenoyan in #32139
- Added mamba.py backend by @alxndrTL in #30139
- Rename Phi-3 rope scaling type by @garg-amit in #31436
- Revert "Incorrect Whisper long-form decoding timestamps " by @sanchit-gandhi in #32148
- Fix typing to be compatible with later py versions by @amyeroberts in #32155
- feat(cache): StaticCache uses index_copy_ to avoid useless copy by @tengomucho in #31857
- Added additional kwarg for successful running of optuna hyperparameter search by @DeF0017 in #31924
- Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @aliencaocao
- @voidism
    - Generate: Add new decoding strategy "DoLa" in `.generate()` (#29619)
- @Namangarg110
    - Adding hiera (#30356)