huggingface/transformers v4.51.0 on GitHub

New Model Additions

Llama 4

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:

The highly capable Llama 4 Maverick has 17B active parameters out of ~400B total, with 128 experts.
The efficient Llama 4 Scout also has 17B active parameters, out of ~109B total, using just 16 experts.
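
To make the "active versus total" distinction concrete, the short sketch below computes the share of parameters used per token from the rounded figures quoted above. The dictionary and function names are illustrative only, and the totals are the approximate values from this section, not exact parameter counts.

```python
# Rounded parameter counts (in billions) taken from the model list above.
LLAMA4_MODELS = {
    "Maverick": {"active_b": 17, "total_b": 400, "experts": 128},
    "Scout": {"active_b": 17, "total_b": 109, "experts": 16},
}


def active_share(active_b: float, total_b: float) -> float:
    """Return the fraction of all parameters that is used for each token."""
    return active_b / total_b


for name, spec in LLAMA4_MODELS.items():
    share = active_share(spec["active_b"], spec["total_b"])
    print(f"{name}: {share:.1%} of ~{spec['total_b']}B parameters active per token")
```

Both models do the same amount of per-token work (17B active parameters); they differ in how much total capacity the experts add on top of that.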

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
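
As a back-of-the-envelope check on these deployment formats, the sketch below estimates weight storage from the rounded parameter totals. It is an approximation under stated assumptions: it counts weights only (no activations, KV cache, or quantization metadata) and assumes every parameter is stored in the named format, which real quantization schemes may not do. The names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative.

```python
# Assumed storage cost per parameter for each format, in bytes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_b: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    total_params_b is the parameter count in billions, so billions of
    parameters times bytes per parameter gives gigabytes directly.
    """
    return total_params_b * BYTES_PER_PARAM[fmt]


for model, total_b, formats in [
    ("Scout", 109, ["bf16", "int8", "int4"]),
    ("Maverick", 400, ["bf16", "fp8"]),
]:
    for fmt in formats:
        print(f"{model} {fmt}: ~{weight_memory_gb(total_b, fmt):.1f} GB of weights")
```

Comparing these figures with the memory of your own accelerator gives a first indication of which format is feasible; actual usage will be higher once activations and the KV cache are included.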

Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:

pip install -U transformers[hf_xet]
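
If you want to confirm the installed version from Python, a minimal sketch using only the standard library could look like the following. The helper names `parse_release` and `meets_minimum` are hypothetical, and the comparison only looks at the leading numeric components, so a pre-release such as `4.51.0.dev0` is treated as `4.51.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (4, 51, 0)  # minimum transformers release stated above


def parse_release(version_string: str) -> tuple:
    """Extract the leading numeric components, e.g. '4.52.0.dev0' -> (4, 52, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    # A missing patch component (e.g. '5.0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version_string: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if the numeric release is at least the minimum."""
    return parse_release(version_string) >= minimum


try:
    installed = version("transformers")
    print(f"transformers {installed}, recent enough: {meets_minimum(installed)}")
except PackageNotFoundError:
    print("transformers is not installed")
```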

Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:

torchrun --nproc-per-node=8 script.py

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])

Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!

Phi4-Multimodal

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:

Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Vision: English
Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

Add Phi4 multimodal by @Cyrilvallez in #36939

DeepSeek-v3

DeepSeek-v3 is covered in depth in the following model-based release, and we recommend reading it if you want all the information relating to that model.

The model is detailed in the following paper.

Overview

The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.

The abstract from the paper is the following:

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
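
To put the abstract's headline numbers in proportion, here is a small worked calculation (not part of the paper's abstract). The four constants come from the abstract; the cluster size passed to `wall_clock_days` is a hypothetical value chosen purely for illustration. Because the GPU-hour figure covers the full training run while the token count is for pre-training only, the tokens-per-GPU-hour figure is a rough lower bound on pre-training throughput.

```python
TOTAL_PARAMS_B = 671    # total parameters, in billions (from the abstract)
ACTIVE_PARAMS_B = 37    # parameters activated per token, in billions
TRAIN_TOKENS_T = 14.8   # pre-training tokens, in trillions
GPU_HOURS_M = 2.788     # full training cost, in millions of H800 GPU hours


def active_share(active_b: float, total_b: float) -> float:
    """Fraction of parameters activated for each token."""
    return active_b / total_b


def tokens_per_gpu_hour(tokens_t: float, gpu_hours_m: float) -> float:
    """Average tokens processed per GPU hour."""
    return (tokens_t * 1e12) / (gpu_hours_m * 1e6)


def wall_clock_days(gpu_hours_m: float, num_gpus: int) -> float:
    """Days of wall-clock time if the work is spread evenly over num_gpus."""
    return (gpu_hours_m * 1e6) / num_gpus / 24


HYPOTHETICAL_CLUSTER_SIZE = 2048  # assumption for illustration only

print(f"Active share: {active_share(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"Tokens per GPU hour: {tokens_per_gpu_hour(TRAIN_TOKENS_T, GPU_HOURS_M):,.0f}")
print(f"Wall clock on {HYPOTHETICAL_CLUSTER_SIZE} GPUs: "
      f"{wall_clock_days(GPU_HOURS_M, HYPOTHETICAL_CLUSTER_SIZE):.1f} days")
```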

[WIP] add deepseek-v3 by @bzantium in #35926

Qwen3

The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of this release, the model checkpoints themselves have not yet been published; stay tuned for a release from the Qwen team!

Adding Qwen3 and Qwen3MoE by @bozheng-hit in #36878

Documentation

Model docs are getting a significant overhaul: they now provide much-needed, ready-to-use examples that you can copy and paste into your modules or consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.

[docs] Model docs by @stevhliu in #36469

Significant model improvements

A very large PR by @nikosanto13 added modular files to all speech models in the library. Seeing the differences between these models is now much simpler, as are maintenance and eventual refactors.

Introduce modular files for speech models by @nikosanto13 in #35902

Bugfixes and improvements

fix: loss computation after embeddings resize - mllama by @Ssukriti in #36840
Simplify keep_in_fp32_modules logic by @Cyrilvallez in #36722
Fix Pan and Scan on batched images Gemma3 by @yonigozlan in #36864
Update installation.md by @ariG23498 in #36826
fix Gemma3 Config by @eljandoubi in #36893
Fix torch version guard at import by @zucchini-nlp in #36907
[Fix] Add original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
tests: fix asyncio.wait() usage for python>=3.11 by @dvrogozh in #36898
[chameleon] fix num image token check by @zucchini-nlp in #36918
Fix Compressed tensors to_dict_diff by @MekkCyber in #36922
Use another repo. for Mistral3 processor testing by @ydshieh in #36925
Fix typos by @omahs in #36910
Update trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
[2/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36857
Fix pytorch defomr attn path by @qubvel in #36923
More precise comment by @ydshieh in #36935
Added support for seed in DataCollatorForWholeWordMask by @capemox in #36903
Fix processor kwargs qwen2 vl by @yonigozlan in #36890
Disallow Offload to disk for gguf files by @MekkCyber in #36933
Deprecate #36741 and map Causal to Conditional by @zucchini-nlp in #36917
Fixing _pre_quantization_dtype when torch_dtype is None by @MekkCyber in #36930
Export for Phi4-mini by @guangy10 in #36780
fix typos in the tests directory by @threewebcode in #36932
Fix cuda index issue in cache allocator by @SunMarc in #36937
[Utils] torch version checks optionally accept dev versions by @gante in #36847
Update after #36962 by @ydshieh in #36965
Change GPUS to GPUs by @zhanluxianshen in #36945
typo fixed in README_fr.md by @NargiT in #36951
Updated docker files to use uv for installing packages by @Sai-Suraj-27 in #36957
update examples after ruff being updated by @ydshieh in #36972
Remove extra tensor clone in PyTorch code by @cyyever in #36748
[docs] Fix image link by @stevhliu in #36869
Add ruff target-version by @cyyever in #36971
update bot comment again by @ydshieh in #36974
🚨Deprecate legacy argument for image-text-to-text models and adopt new behavior by default by @yonigozlan in #36307
Fix tensor dtype mismatch by @cyyever in #36985
byebye CircleCI TF jobs by @ydshieh in #36998
Use torch.expm1 by @cyyever in #36995
Install networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
Fix Optional type annotation by @cyyever in #36841
Fix get_device_properties by @ivarflakstad in #36997
Allow easy registration of custom attention functions by @Cyrilvallez in #36889
Fix removing "cpu" from frozenset in bitsandbytes.py to allow better ROCm support. by @anadon in #36975
Fix device_map check for ggml files by @MekkCyber in #37003
Log the correct learning rate by @SunMarc in #36973
fix typos in the code comments and error messages by @threewebcode in #36993
Remove deprecated training arguments by @cyyever in #36946
[docs] Attention mask image by @stevhliu in #36970
fix transformers_cli import relative path issue by @yao-matrix in #36989
Support QuestionAnswering Module for ModernBert based models. by @bakrianoo in #35566
Fix PixtralProcessor patch_size when spatial_merge_size is used by @mgoin in #37019
[Modeling] Load FP8 safetensors such as DeepSeek by @kylesayrs in #36828
Mark 2 tests as flaky for now by @ydshieh in #37038
remove redundant code in trainer by @hiyouga in #36994
Skip FP8 linear tests For device capability 9.0 by @MekkCyber in #37008
Add Distill Any Depth by @keetrap in #36614
fix pegasus init weights and other copied models by @jiqing-feng in #36844
Optimize to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
Fixup for distill_any_depth conversion script by @qubvel in #37043
[chat templates} support loading audio from video by @zucchini-nlp in #36955
[audio utils] fix fft_bin_width computation by @eustlb in #36603
[generate, cache] handle more complex device maps by @gante in #37014
clean pipeline question_answering. by @zhanluxianshen in #36986
Avoid unnecessary device operations in loss computing by @cyyever in #36950
Set weights_only in torch.load by @cyyever in #36991
Replace default split function with jnp.split() in flax models by @premmurugan229 in #37001
Remove deprecated batch_size parameter by @cyyever in #37007
fixed typo by @finnoh in #37036
fix: Fully remove legacy cache from Llama by @Wheest in #36958
Fix SDPA implementation in Qwen2-VL (issues with torch==2.6.0) by @ManuelFay in #36891
fix: AttributeError: 'LlavaProcessor' object has no attribute 'image_token_id' by @jp1924 in #37026
Fix some typos about benchmark scripts. by @zhanluxianshen in #37027
Change deprecated PT functions by @cyyever in #37041
[blip-2] Fix dtype mismatch when keep in fp32 by @zucchini-nlp in #37068
fix tied weigths issue by @ydshieh in #37031
Update w/ new account by @muellerzr in #37084
Fix state_dict map location when quantized by @Cyrilvallez in #37086
Fix AttentionInterface following feedback by @Cyrilvallez in #37010
fixed typo. by @zhanluxianshen in #37057
[generate] beam search -- fix output cropping by @gante in #37080
[Cache] rename dtype attribute 🚨 🚨 by @gante in #37044
Kenlm by @ydshieh in #37091
🌐 [i18n-KO] Translated qwen2_vl.md to Korean by @MinJu-Ha in #36750
Gaudi: Fix the pipeline failed issue with hpu device by @yuanwu2017 in #36990
Support passing flash_attn_kwargs when gradient_checkpointing is enabled by @efsotr in #37037
Fix 4090/ada not detected as having FP8 support by @Qubitium in #37067
enable tp on CPU by @jiqing-feng in #36299
fix whisper re-compile by @jiqing-feng in #36712
[MLU] Fix FA2 check error, remove deepspeed-mlu deps. by @huismiling in #36159
Fix Gemma3 embedding scaling by @gau-nernst in #37109
RWKV: fix mask warning typo by @RobinKa in #37114
Remove deprecated code by @cyyever in #37059
[tests] remove cuda-only test marker in AwqConfigTest by @faaany in #37032
Export T5 (encoder-decoder) to ExecuTorch by @guangy10 in #36486
skip by @ydshieh in #37141
[qwen3] fix generation tests by @zucchini-nlp in #37142
Fix more inefficient PT operations by @cyyever in #37060
Fix std initialization in Idefics variants by @yaswanth19 in #37100
add gpt2 test on XPU by @jiqing-feng in #37028
Fix llava xpu tests. by @jiqing-feng in #37130
enable test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
Use public export API on torch 2.5 and future by @guangy10 in #36781
Convert _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
Only count num items in batch when needed by @IlyasMoutawwakil in #36867
Make canine model exportable by removing unncessary complicated logic by @tugsbayasgalan in #37124
[ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
fix XPU UT error case brough by RNG difference btw XPU and CUDA by @yao-matrix in #37121
Fixes the inconsistency of the optionality of attention_mask by @Zephyr271828 in #37153
Avoid pipeline test failing related to Hub call by @ydshieh in #37170
Fix meta state dict loading with quantizers by @Cyrilvallez in #37136
Revert #37031 by @Cyrilvallez in #37178
[doc] Fix link for Quark quantization page by @BowenBao in #37179
[chat-template] fix video loading by @zucchini-nlp in #37146
Skip code 307 in RequestCounter by @ydshieh in #36953
Add device workaround for int4 weight only quantization after API update by @jerryzh168 in #36980
Fixes DynamicCache export issues due to control flow and inplace modifications by @xadupre in #36652
Try to avoid/reduce some remaining CI job failures by @ydshieh in #37202
fix: Add 'image-text-to-text' to TASK_MAPPING by @saattrupdan in #37107
Fix some code annotation typos. by @zhanluxianshen in #37102
Merge tensor operations with device transfer operations by @cyyever in #37097
[3/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36936
Add py.typed by @cyyever in #37022
No more dtype_byte_size() by @Rocketknight1 in #37144
[Tests] add min_new_tokens to prevent flaky length checks by @gante in #37175
Stop DOSing the Hub in the CI by @Rocketknight1 in #37209
More ReDOS fixes! by @Rocketknight1 in #36964
Updated the model card for CLIP by @purusharthmalik in #37040
Update falcon model card by @ricalanis in #37184
Updated model card for Qwen2 by @Aravind-11 in #37192
Fix static cache export by @guangy10 in #37229
[Phi4] add multimodal chat template by @zucchini-nlp in #36996
Add new dim to num_items_in_batch if necessary by @regisss in #36967
Fix test by @Cyrilvallez in #37213
[tests] fix mamba integration simple inference precision issue by @faaany in #37193
[CI] lazy loading external datasets by @gante in #37218
enable 2 types of case on XPU by @yao-matrix in #37198
Fix AST parsing when looking for remote code imports by @Rocketknight1 in #37245
Add support for fast image processing in image-pretraining example by @jafraustro in #37021
Allow flexible generation params arg when checking pipeline specs by @Rocketknight1 in #37211
[CI] green llama tests by @gante in #37244
Adding links to ShieldGemma 2 technical report by @RyanMullins in #37247
feat: updated model card for qwen_2.5_vl by @arkhamHack in #37099
Update model card for Cohere by @bimal-gajera in #37056
chore: Update model doc for code_llama by @AbhishekRP2002 in #37115
Update Model Card for ModernBERT by @ParagEkbote in #37052
Update model card for electra by @Wu-n0 in #37063
[qwen-vl] fix image processor by @zucchini-nlp in #37258
update error msg by @itazap in #37207
Fix utils/check_bad_commit.py by @ydshieh in #37272
Support return_tensors in audio chat templates by @zucchini-nlp in #34601
Update ruff to 0.11.2 by @ydshieh in #36962
Fix typing for None valued variables by @cyyever in #37004
Use lru_cache for tokenization tests by @ydshieh in #36818
Create and Expose SamVisionModel as public for better accessibility by @geetu040 in #36493
[Feature] Support using FlashAttention2 on Ascend NPU by @FightingZhen in #36696
Remove low_cpu_mem_usage and _fast_init by @Cyrilvallez in #36963
Refactor return_dict logic to remove complicated if/else paths by @qubvel in #36794
Refactor attention for SigLIP based models by @qubvel in #36981
Add Optional to types by @cyyever in #37163
Purge unused ModelTester code by @Rocketknight1 in #37085

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@cyyever
- [2/N] Use pyupgrade --py39-plus to improve code (#36857)
- Remove extra tensor clone in PyTorch code (#36748)
- Add ruff target-version (#36971)
- Fix tensor dtype mismatch (#36985)
- Use torch.expm1 (#36995)
- Fix Optional type annotation (#36841)
- Remove deprecated training arguments (#36946)
- Avoid unnecessary device operations in loss computing (#36950)
- Fix typing for None valued variables (#37004)
- Set weights_only in torch.load (#36991)
- Remove deprecated batch_size parameter (#37007)
- Change deprecated PT functions (#37041)
- Remove deprecated code (#37059)
- Fix more inefficient PT operations (#37060)
- Merge tensor operations with device transfer operations (#37097)
- [3/N] Use pyupgrade --py39-plus to improve code (#36936)
- Add py.typed (#37022)
- Add Optional to types (#37163)
@bzantium
- [WIP] add deepseek-v3 (#35926)
@bozheng-hit
- Adding Qwen3 and Qwen3MoE (#36878)
@geetu040
- Create and Expose SamVisionModel as public for better accessibility (#36493)
@FightingZhen
- [Feature] Support using FlashAttention2 on Ascend NPU (#36696)
@nikosanto13
- Introduce modular files for speech models (#35902)
