Llama
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of llama recipes to showcase usage for inference as well as full and partial fine-tuning of the different variants.
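As a quick way to try the new checkpoints, they can be loaded through the standard text-generation pipeline. This is a minimal sketch; the repository id below is an assumption about the Hub naming, and the checkpoints are gated, so access must be requested on the Hub first.

```python
import torch
from transformers import pipeline

# Assumed Hub repository id for the 8B instruct variant; adjust to the actual
# (gated) checkpoint name you have access to.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("Give me one fun fact about llamas.", max_new_tokens=64))
```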
Chameleon
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Meta AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses. A minimal inference sketch follows the PR link below.
- Chameleon: add model by @zucchini-nlp in #31534
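A minimal sketch of mixed image+text prompting, under the assumption that the facebook/chameleon-7b checkpoint id and the `<image>` placeholder convention match what is documented for the model:

```python
import requests
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

# Assumed checkpoint id; swap in the Chameleon checkpoint you want to use.
model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
# The <image> tag marks where the quantized image tokens are interleaved with the text.
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# Match the pixel values to the model's compute dtype.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```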
ZoeDepth
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head with a novel bin adjustment design, called the metric bins module, is used for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. A minimal pipeline sketch follows the PR link below.
- Add ZoeDepth by @NielsRogge in #30136
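Since ZoeDepth plugs into the existing depth-estimation pipeline, usage can look like the following minimal sketch; the checkpoint id is an assumption, so pick whichever ZoeDepth checkpoint on the Hub fits your domain:

```python
import requests
from PIL import Image
from transformers import pipeline

# Assumed checkpoint id for the NYU+KITTI fine-tuned variant.
depth_estimator = pipeline("depth-estimation", model="Intel/zoedepth-nyu-kitti")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth_estimator(image)
print(result["predicted_depth"].shape)  # raw metric depth tensor
result["depth"].show()                  # depth map rendered as a PIL image
```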
Hiera
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer.
The paper introduces "Hiera," a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed "bells-and-whistles," are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed in both inference and training across various image and video recognition tasks. The approach suggests that the spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity. A minimal classification sketch follows the PR link below.
- Adding hiera by @Namangarg110 in #30356
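A minimal image-classification sketch using the Auto classes; the checkpoint id below is an assumption, so substitute any Hiera classification checkpoint from the Hub:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed checkpoint id for an ImageNet-1k fine-tuned Hiera variant.
checkpoint = "facebook/hiera-tiny-224-in1k-hf"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```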
Agents
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine figure out what to return, so we generalized the final_answer tool to all agents.
- Adds final answer tool for all agents by @aymeric-roucher in #31703
Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later reuse!
- Code agent: allow function persistence between steps by @aymeric-roucher in #31769
This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. Planning is activated if you set an integer for planning_interval upon agent initialization. At step 0, an initial plan is generated. At later steps (e.g. steps 3, 6, and 9 if you set planning_interval=3), this plan is updated by the agent based on the history of previous steps. More detail soon! A minimal usage sketch follows the PR link below.
- Agents planning by @aymeric-roucher in #31702
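A minimal sketch of turning planning on, under the assumption that planning_interval is passed directly to the agent constructor and that llm_engine can be any callable mapping a list of chat messages to a completion string:

```python
from transformers.agents import ReactCodeAgent

# Placeholder engine: any callable taking a list of chat messages (and optional
# stop sequences) and returning a string can serve as llm_engine.
def llm_engine(messages, stop_sequences=None):
    raise NotImplementedError("Plug in a call to your preferred LLM here.")

# With planning_interval=3, the agent drafts a plan at step 0 and revises it at
# steps 3, 6, 9, ... based on the history of previous steps.
agent = ReactCodeAgent(tools=[], llm_engine=llm_engine, planning_interval=3)

# agent.run("How many seconds does light take to travel from the Sun to the Earth?")
```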
Notable changes to the codebase
A significant RoPE refactor was done to make it model-agnostic and more easily adaptable to any architecture. It is only applied to Llama for now, but will be applied to all models using RoPE over the coming days.
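In practice, the refactor means RoPE variants are selected through the model config rather than per-model code. The sketch below is a hedged illustration for a Llama-architecture checkpoint: the rope_scaling keys ("rope_type", "factor") mirror the Llama implementation, and the non-gated checkpoint id is only illustrative.

```python
from transformers import AutoModelForCausalLM

# Illustrative, non-gated Llama-architecture checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # Select a RoPE variant via the config: dynamic NTK scaling with factor 2.
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
print(model.config.rope_scaling)
```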
Breaking changes
TextGenerationPipeline and tokenizer kwargs
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. This means some models using `TextGenerationPipeline` previously did not add a `<bos>` token by default, which (negatively) impacted their performance. In practice, this is a breaking change.
Example of a script changed as a result of this PR:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Gemma-2's tokenizer prepends <bos> by default; the pipeline now follows that default.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
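To see which special tokens the pipeline will now add for a given checkpoint, it is enough to inspect the tokenizer's own defaults; the exact tokens printed depend on the model and are only an illustration here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
ids = tokenizer("Foo bar").input_ids
# The pipeline now relies on these defaults; for Gemma-2 the sequence starts with <bos>.
print(tokenizer.convert_ids_to_tokens(ids))
```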
Bugfixes and improvements
- Fix post gemma merge by @ArthurZucker in #31660
- Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
- [docs] Llama3 by @stevhliu in #31662
- [HybridCache] Fix `get_seq_length` method by @sanchit-gandhi in #31661
- don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
- Fix Gemma2 4d attention mask by @hiyouga in #31674
- Fix return_dict in encodec by @jla524 in #31646
- add gather_use_object arguments by @SangbumChoi in #31514
- Gemma capping is a must for big models by @ArthurZucker in #31698
- Add French version of run scripts tutorial by @jadechoghari in #31483
- dependencies: `keras-nlp<0.14` pin by @gante in #31684
- remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
- Move some test files (`tets/test_xxx_utils.py`) to `tests/utils` by @ydshieh in #31730
- Fix mistral ONNX export by @fxmarty in #31696
- [whisper] static kv cache by @sanchit-gandhi in #31166
- Make tool JSON schemas consistent by @Rocketknight1 in #31756
- Fix documentation for Gemma2. by @jbornschein in #31682
- fix assisted decoding by @jiqing-feng in #31401
- Requires for torch.tensor before casting by @echarlaix in #31755
- handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
- Gemma 2: Update slow tests by @gante in #31759
- Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
- [fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
- Fix RT-DETR cache for generate_anchors by @qubvel in #31671
- Fix RT-DETR weights initialization by @qubvel in #31724
- `pytest_num_workers=4` for some CircleCI jobs by @ydshieh in #31764
- Fix Gemma2 types by @hiyouga in #31779
- Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
- Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
- Fix serialization for offloaded model by @SunMarc in #31727
- Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
- Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
- Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
- Fix gemma tests by @ydshieh in #31794
- Add training support for SigLIP by @aliencaocao in #31495
- Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
- Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
- Fix galore lr display with schedulers by @vasqu in #31710
- Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
- Depth Anything: update conversion script for V2 by @pcuenca in #31522
- Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
- Add FA2 and `sdpa` support for SigLIP by @qubvel in #31499
- Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
- Fix typos by @omahs in #31819
- transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
- Avoid failure `TFBlipModelTest::test_pipeline_image_to_text` by @ydshieh in #31827
- Fix incorrect accelerator device handling for MPS in `TrainingArguments` by @andstor in #31812
- Mamba & RecurrentGemma: enable strict signature by @gante in #31549
- Deprecate `vocab_size` in other two VLMs by @zucchini-nlp in #31681
- FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
- [Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
- chore: remove duplicate words by @hattizai in #31853
- save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
- Test loading generation config with safetensor weights by @gante in #31550
- docs: typo in tf qa example by @chen-keinan in #31864
- Generate: Add new decoding strategy "DoLa" in `.generate()` by @voidism in #29619
- Fix `_init_weights` for `ResNetPreTrainedModel` by @ydshieh in #31851
- Update depth estimation task guide by @merveenoyan in #31860
- Bump zipp from 3.7.0 to 3.19.1 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31871
- Add return type annotation to PreTrainedModel.from_pretrained by @mauvilsa in #31869
- Revert "Fix
_init_weights
forResNetPreTrainedModel
" by @ydshieh in #31868 - Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/visual_bert by @dependabot[bot] in #31872
- add warning when using gradient_checkpointing with FSDP full shard by @yundai424 in #31578
- Add conversion for interleave llava by @zucchini-nlp in #31858
- remove duplicate words in msg by @yukionfire in #31876
- Fix file type checks in data splits for contrastive training example script by @npyoung in #31720
- Fix failed tests in #31851 by @ydshieh in #31879
- fix: Removed `duplicate` field definitions in some classes by @Sai-Suraj-27 in #31888
- Push sharded checkpoint to hub when `push_to_hub=True` in `TrainingArguments` by @SunMarc in #31808
- [RT-DETR] Add resources by @NielsRogge in #31815
- Modify `warnings` in a `with` block to avoid flaky tests by @ydshieh in #31893
- Add a condition for nested_detach by @haikuoxin in #31855
- InstructBlipVideo: Update docstring by @zucchini-nlp in #31886
- Fixes to alternating SWA layers in Gemma2 by @turboderp in #31775
- Processor accepts any kwargs by @zucchini-nlp in #31889
- [`ConvertSlow`] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
- [`Gemma2`] Support FA2 softcapping by @ArthurZucker in #31887
- Fix missing methods for Fuyu by @Isotr0py in #31880
- fix: Fixed the `1st argument` name in classmethods by @Sai-Suraj-27 in #31907
- add gather_use_object arguments II by @SangbumChoi in #31799
- Add warning message for beta and gamma parameters by @OmarManzoor in #31654
- Fix fx tests with inputs_embeds by @fxmarty in #31862
- Refactor flash attention implementation in transformers by @ArthurZucker in #31446
- Generate: fix `SlidingWindowCache.reset()` by @gante in #31917
- 🚨 fix(SigLip): remove spurious exclusion of first vision output token by @transmissions11 in #30952
- Allow `Trainer.get_optimizer_cls_and_kwargs` to be overridden by @apoorvkh in #31875
- [Bug Fix] fix qa pipeline tensor to numpy by @jiqing-feng in #31585
- Docker: TF pin on the consistency job by @gante in #31928
- fix prompt strip to support tensors and np arrays by @AvivSham in #27818
- Fix `GenerationMixin.generate` compatibility with pytorch profiler by @fxmarty in #31935
- Generate: remove deprecated code due to `Cache` and `cache_position` being default by @gante in #31898
- Generate: v4.42 deprecations 🧹🧹 by @gante in #31956
- Whisper: move to tensor cpu before converting to np array at decode time by @gante in #31954
- fix: Removed a wrong key-word argument in `sigmoid_focal_loss()` function call by @Sai-Suraj-27 in #31951
- Generate: handle `logits_warper` update in models with custom generate fn by @gante in #31957
- fix: Fixed the arguments in `create_repo()` function call by @Sai-Suraj-27 in #31947
- Notify new docker images built for circleci by @ydshieh in #31701
- Avoid race condition by @ydshieh in #31973
- Masking: remove flakiness from test by @gante in #31939
- Generate: doc nits by @gante in #31982
- Fix the incorrect permutation of gguf by @PenutChen in #31788
- Cambricon MLUs support SDPA and flash_attn by @huismiling in #31102
- Speedup model init on CPU (by 10x+ for llama-3-8B as one example) by @muellerzr in #31771
- [tests] fix deepspeed zero3 config for `test_stage3_nvme_offload` by @faaany in #31881
- Fix bad test about slower init by @muellerzr in #32002
- Tests: remove cuda versions when the result is the same 🧹🧹 by @gante in #31955
- Bug report update by @gante in #31983
- add flash-attn deterministic option to flash-attn>=2.4.1 by @junrae6454 in #31961
- fix: Fixed incorrect dictionary assignment in `src/transformers/__init__.py` by @Sai-Suraj-27 in #31993
- Bug report update -- round 2 by @gante in #32006
- Fix gather when collecting 'num_input_tokens_seen' by @CodeCreator in #31974
- Fix if else and actually enable superfast init by @muellerzr in #32007
- SpeechEncoderDecoder doesn't support param buffer assignments by @muellerzr in #32009
- Fix tests skip by @qubvel in #32012
- Fixed `log messages` that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
- Fix typo in classification function selection logic to improve code consistency by @moses in #32031
- doc: fix broken BEiT and DiNAT model links on Backbone page by @dvrogozh in #32029
- Pass missing arguments to `SeamlessM4Tv2ConformerEncoderLayer.forward()` when gradient checkpointing is enabled by @anferico in #31945
- Add language to word timestamps for Whisper by @robinderat in #31572
- Add `sdpa` and FA2 for CLIP by @qubvel in #31940
- unpin `numpy<2.0` by @ydshieh in #32018
- Chameleon: minor fixes after shipping by @zucchini-nlp in #32037
- Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31458
- Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples by @dependabot[bot] in #32052
- [mistral] Support passing `head_dim` through config (and do not require `head_dim * num_heads == hidden_size`) by @xenova in #32050
- Add torch.compile Support For Mamba by @zhenglongjiepheonix in #31247
- fix: Removed `duplicate entries` in a dictionary by @Sai-Suraj-27 in #32041
- docs: Fixed 2 links in the docs along with some minor fixes by @Sai-Suraj-27 in #32058
- Llava: add default chat templates by @zucchini-nlp in #31691
- [Chameleon, Hiera] Improve docs by @NielsRogge in #32038
- Incorrect Whisper long-form decoding timestamps by @kamilakesbi in #32003
- [mistral] Fix FA2 attention reshape for Mistral Nemo by @xenova in #32065
- VideoLLaVa: fix chat format in docs by @zucchini-nlp in #32083
- Fix progress callback deepcopy by @fozziethebeat in #32070
- Fixes to chameleon docs by @merveenoyan in #32078
- Add image-text-to-text task guide by @merveenoyan in #31777
- Support generating with fallback for short form audio in Whisper by @kamilakesbi in #30984
- Disable quick init for deepspeed by @muellerzr in #32066
- Chameleon: not supported with fast load by @zucchini-nlp in #32091
- Fix tests after `huggingface_hub` 0.24 by @Wauplin in #32054
- Fix shard order by @b-chu in #32023
- Generate: store special token tensors under a unique variable name by @gante in #31980
- fix: Replaced deprecated `mktemp()` function by @Sai-Suraj-27 in #32123
- Mention model_info.id instead of model_info.modelId by @Wauplin in #32106
- [generate] fix eos/pad id check on mps devices by @sanchit-gandhi in #31695
- Fix failing test with race condition by @Rocketknight1 in #32140
- Update `ko/_toctree.yml` and remove `custom_tools.md` to reflect latest changes by @jungnerd in #31969
- fix: Fixed raising `TypeError` instead of `ValueError` for invalid type by @Sai-Suraj-27 in #32111
- [RoBERTa] Minor clarifications to model doc by @bt2513 in #31949
- Return assistant generated tokens mask in apply_chat_template by @yonigottesman in #30650
- Don't default to other weights file when use_safetensors=True by @amyeroberts in #31874
- set warning level to info for special tokens have been added by @ArthurZucker in #32138
- Add new quant method by @SunMarc in #32047
- Add llama3-llava-next-8b to llava_next conversion script by @jamt9000 in #31395
- LLaVaNeXT: pad on right if training by @zucchini-nlp in #32134
- Remove `trust_remote_code` when loading Libri Dummy by @sanchit-gandhi in #31748
- [modelling] remove un-necessary transpose for fa2 attention by @sanchit-gandhi in #31749
- Fix mask creations of `GPTNeoX` and `GPT2` by @vasqu in #31944
- Add method to retrieve used chat template by @KonradSzafer in #32032
- Add YaRN and Dynamic-YaRN RoPE Scaling Methods by @mig-mfreitas in #30910
- Disable quick init for TapasPreTrainedModel by @daniellok-db in #32149
- Modify resize_token_embeddings to ensure output type is same as input by @bayllama in #31979
- gguf conversion add_prefix_space=None for llama3 by @itazap in #31937
- Fix flash attention speed issue by @Cyrilvallez in #32028
- Fix video batching to videollava by @merveenoyan in #32139
- Added mamba.py backend by @alxndrTL in #30139
- Rename Phi-3 rope scaling type by @garg-amit in #31436
- Revert "Incorrect Whisper long-form decoding timestamps " by @sanchit-gandhi in #32148
- Fix typing to be compatible with later py versions by @amyeroberts in #32155
- feat(cache): StaticCache uses index_copy_ to avoid useless copy by @tengomucho in #31857
- Added additional kwarg for successful running of optuna hyperparameter search by @DeF0017 in #31924
- Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @aliencaocao
- @voidism
    - Generate: Add new decoding strategy "DoLa" in `.generate()` (#29619)
- @Namangarg110
    - Adding hiera (#30356)