New Model Additions
Llama 4
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
- The highly capable Llama 4 Maverick, with 17B active parameters out of ~400B total, spread across 128 experts.
- The efficient Llama 4 Scout, also with 17B active parameters out of ~109B total, using just 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:
pip install -U transformers[hf_xet]
Here's a quick example using the instruction-tuned Maverick model to answer a question about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
torchrun --nproc-per-node=8 script.py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
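To illustrate the on-the-fly quantization mentioned above, here is a minimal sketch of loading Llama 4 Scout in 4-bit with bitsandbytes. The checkpoint name and quantization settings are assumptions for illustration only; defer to the model card's quantization details for the supported recipe:

from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration
import torch

# Assumed Scout instruct checkpoint; adjust to the repo you actually use.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# On-the-fly 4-bit quantization via bitsandbytes (pip install bitsandbytes).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality/speed
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)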
Phi4-Multimodal

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. It underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are the following:
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
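As a hedged sketch of how the multimodal pipeline fits together in transformers, the snippet below runs the model on a single image. The checkpoint name and the <|user|>/<|image_1|>/<|end|>/<|assistant|> prompt markers are assumptions taken from the model card, so verify the exact chat format there before relying on it:

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prompt markers follow the model card; double-check them before use.
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])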
- Add Phi4 multimodal by @Cyrilvallez in #36939
DeepSeek-v3
DeepSeek-v3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.
The model is detailed in the following paper.
Overview
The DeepSeek-V3 model was proposed in the DeepSeek-V3 Technical Report by the DeepSeek-AI team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
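Purely as an illustration of the API surface (the full 671B-parameter checkpoint needs a large multi-GPU node), a hedged loading sketch might look like the following; the repo id, dtype, and device mapping are assumptions, and the model card remains the reference:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "deepseek-ai/DeepSeek-V3"  # full checkpoint is ~671B parameters

# Depending on how the Hub repo is configured, trust_remote_code=True may be needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # see the FP8 safetensors loading PR (#36828) for FP8 checkpoints
    device_map="auto",           # shards across the available GPUs
)

inputs = tokenizer("Explain Mixture-of-Experts in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))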
Qwen3
The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of release, the models themselves had not yet been released - stay tuned for a release from the Qwen team!
- Adding Qwen3 and Qwen3MoE by @bozheng-hit in #36878
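Since no checkpoints were available yet, a minimal sketch that exercises the new architecture is to instantiate a tiny, randomly initialized Qwen3 model from a config; all config values below are arbitrary and for illustration only:

from transformers import Qwen3Config, Qwen3ForCausalLM

# Tiny, arbitrary configuration purely to exercise the new architecture;
# real checkpoints from the Qwen team will ship with their own configs.
config = Qwen3Config(
    vocab_size=1000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
)

model = Qwen3ForCausalLM(config)  # randomly initialized weights
print(sum(p.numel() for p in model.parameters()), "parameters")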
Documentation
Model docs are getting a significant overhaul: they now provide much-needed, ready-to-use examples that can be copy-pasted into modules and consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.
Significant model improvements
A very large PR by @nikosanto13 added modular files to all speech models in the library; seeing the differences between them is now much simpler, and maintenance and eventual refactors are easier as well.
- Introduce modular files for speech models by @nikosanto13 in #35902
Bugfixes and improvements
- fix: loss computation after embeddings resize - mllama by @Ssukriti in #36840
- Simplify keep_in_fp32_modules logic by @Cyrilvallez in #36722
- Fix Pan and Scan on batched images Gemma3 by @yonigozlan in #36864
- Update installation.md by @ariG23498 in #36826
- fix Gemma3 Config by @eljandoubi in #36893
- Fix torch version guard at import by @zucchini-nlp in #36907
- [Fix] Add original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
- tests: fix asyncio.wait() usage for python>=3.11 by @dvrogozh in #36898
- [chameleon] fix num image token check by @zucchini-nlp in #36918
- Fix Compressed tensors to_dict_diff by @MekkCyber in #36922
- Use another repo. for Mistral3 processor testing by @ydshieh in #36925
- Fix typos by @omahs in #36910
- Update trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
- [2/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36857
- Fix pytorch defomr attn path by @qubvel in #36923
- More precise comment by @ydshieh in #36935
- Added support for seed in DataCollatorForWholeWordMask by @capemox in #36903
- Fix processor kwargs qwen2 vl by @yonigozlan in #36890
- Disallow Offload to disk for gguf files by @MekkCyber in #36933
- Deprecate #36741 and map Causal to Conditional by @zucchini-nlp in #36917
- Fixing _pre_quantization_dtype when torch_dtype is None by @MekkCyber in #36930
- Export for Phi4-mini by @guangy10 in #36780
- fix typos in the tests directory by @threewebcode in #36932
- Fix cuda index issue in cache allocator by @SunMarc in #36937
- [Utils] torch version checks optionally accept dev versions by @gante in #36847
- Update after #36962 by @ydshieh in #36965
- Change GPUS to GPUs by @zhanluxianshen in #36945
- typo fixed in README_fr.md by @NargiT in #36951
- Updated docker files to use uv for installing packages by @Sai-Suraj-27 in #36957
- update examples after ruff being updated by @ydshieh in #36972
- Remove extra tensor clone in PyTorch code by @cyyever in #36748
- [docs] Fix image link by @stevhliu in #36869
- Add ruff target-version by @cyyever in #36971
- update bot comment again by @ydshieh in #36974
- 🚨Deprecate legacy argument for image-text-to-text models and adopt new behavior by default by @yonigozlan in #36307
- Fix tensor dtype mismatch by @cyyever in #36985
- byebye CircleCI TF jobs by @ydshieh in #36998
- Use torch.expm1 by @cyyever in #36995
- Install networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
- Fix Optional type annotation by @cyyever in #36841
- Fix get_device_properties by @ivarflakstad in #36997
- Allow easy registration of custom attention functions by @Cyrilvallez in #36889
- Fix removing "cpu" from frozenset in bitsandbytes.py to allow better ROCm support. by @anadon in #36975
- Fix device_map check for ggml files by @MekkCyber in #37003
- Log the correct learning rate by @SunMarc in #36973
- fix typos in the code comments and error messages by @threewebcode in #36993
- Remove deprecated training arguments by @cyyever in #36946
- [docs] Attention mask image by @stevhliu in #36970
- fix transformers_cli import relative path issue by @yao-matrix in #36989
- Support QuestionAnswering Module for ModernBert based models. by @bakrianoo in #35566
- Fix PixtralProcessor patch_size when spatial_merge_size is used by @mgoin in #37019
- [Modeling] Load FP8 safetensors such as DeepSeek by @kylesayrs in #36828
- Mark 2 tests as flaky for now by @ydshieh in #37038
- remove redundant code in trainer by @hiyouga in #36994
- Skip FP8 linear tests For device capability 9.0 by @MekkCyber in #37008
- Add Distill Any Depth by @keetrap in #36614
- fix pegasus init weights and other copied models by @jiqing-feng in #36844
- Optimize to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
- Fixup for distill_any_depth conversion script by @qubvel in #37043
- [chat templates] support loading audio from video by @zucchini-nlp in #36955
- [audio utils] fix fft_bin_width computation by @eustlb in #36603
- [generate, cache] handle more complex device maps by @gante in #37014
- clean pipeline question_answering. by @zhanluxianshen in #36986
- Avoid unnecessary device operations in loss computing by @cyyever in #36950
- Set weights_only in torch.load by @cyyever in #36991
- Replace default split function with jnp.split() in flax models by @premmurugan229 in #37001
- Remove deprecated batch_size parameter by @cyyever in #37007
- fixed typo by @finnoh in #37036
- fix: Fully remove legacy cache from Llama by @Wheest in #36958
- Fix SDPA implementation in Qwen2-VL (issues with torch==2.6.0) by @ManuelFay in #36891
- fix: AttributeError: 'LlavaProcessor' object has no attribute 'image_token_id' by @jp1924 in #37026
- Fix some typos about benchmark scripts. by @zhanluxianshen in #37027
- Change deprecated PT functions by @cyyever in #37041
- [blip-2] Fix dtype mismatch when keep in fp32 by @zucchini-nlp in #37068
- fix tied weigths issue by @ydshieh in #37031
- Update w/ new account by @muellerzr in #37084
- Fix state_dict map location when quantized by @Cyrilvallez in #37086
- Fix AttentionInterface following feedback by @Cyrilvallez in #37010
- fixed typo. by @zhanluxianshen in #37057
- [generate] beam search -- fix output cropping by @gante in #37080
- [Cache] rename dtype attribute 🚨 🚨 by @gante in #37044
- Kenlm by @ydshieh in #37091
- 🌐 [i18n-KO] Translated qwen2_vl.md to Korean by @MinJu-Ha in #36750
- Gaudi: Fix the pipeline failed issue with hpu device by @yuanwu2017 in #36990
- Support passing flash_attn_kwargs when gradient_checkpointing is enabled by @efsotr in #37037
- Fix 4090/ada not detected as having FP8 support by @Qubitium in #37067
- enable tp on CPU by @jiqing-feng in #36299
- fix whisper re-compile by @jiqing-feng in #36712
- [MLU] Fix FA2 check error, remove deepspeed-mlu deps. by @huismiling in #36159
- Fix Gemma3 embedding scaling by @gau-nernst in #37109
- RWKV: fix mask warning typo by @RobinKa in #37114
- Remove deprecated code by @cyyever in #37059
- [tests] remove cuda-only test marker in AwqConfigTest by @faaany in #37032
- Export T5 (encoder-decoder) to ExecuTorch by @guangy10 in #36486
- skip by @ydshieh in #37141
- [qwen3] fix generation tests by @zucchini-nlp in #37142
- Fix more inefficient PT operations by @cyyever in #37060
- Fix std initialization in Idefics variants by @yaswanth19 in #37100
- add gpt2 test on XPU by @jiqing-feng in #37028
- Fix llava xpu tests. by @jiqing-feng in #37130
- enable test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
- Use public export API on torch 2.5 and future by @guangy10 in #36781
- Convert _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
- Only count num items in batch when needed by @IlyasMoutawwakil in #36867
- Make canine model exportable by removing unncessary complicated logic by @tugsbayasgalan in #37124
- [ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
- fix XPU UT error case brough by RNG difference btw XPU and CUDA by @yao-matrix in #37121
- Fixes the inconsistency of the optionality of attention_mask by @Zephyr271828 in #37153
- Avoid pipeline test failing related to Hub call by @ydshieh in #37170
- Fix meta state dict loading with quantizers by @Cyrilvallez in #37136
- Revert #37031 by @Cyrilvallez in #37178
- [doc] Fix link for Quark quantization page by @BowenBao in #37179
- [chat-template] fix video loading by @zucchini-nlp in #37146
- Skip code 307 in RequestCounter by @ydshieh in #36953
- Add device workaround for int4 weight only quantization after API update by @jerryzh168 in #36980
- Fixes DynamicCache export issues due to control flow and inplace modifications by @xadupre in #36652
- Try to avoid/reduce some remaining CI job failures by @ydshieh in #37202
- fix: Add 'image-text-to-text' to TASK_MAPPING by @saattrupdan in #37107
- Fix some code annotation typos. by @zhanluxianshen in #37102
- Merge tensor operations with device transfer operations by @cyyever in #37097
- [3/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36936
- Add py.typed by @cyyever in #37022
- No more dtype_byte_size() by @Rocketknight1 in #37144
- [Tests] add min_new_tokens to prevent flaky length checks by @gante in #37175
- Stop DOSing the Hub in the CI by @Rocketknight1 in #37209
- More ReDOS fixes! by @Rocketknight1 in #36964
- Updated the model card for CLIP by @purusharthmalik in #37040
- Update falcon model card by @ricalanis in #37184
- Updated model card for Qwen2 by @Aravind-11 in #37192
- Fix static cache export by @guangy10 in #37229
- [Phi4] add multimodal chat template by @zucchini-nlp in #36996
- Add new dim to num_items_in_batch if necessary by @regisss in #36967
- Fix test by @Cyrilvallez in #37213
- [tests] fix mamba integration simple inference precision issue by @faaany in #37193
- [CI] lazy loading external datasets by @gante in #37218
- enable 2 types of case on XPU by @yao-matrix in #37198
- Fix AST parsing when looking for remote code imports by @Rocketknight1 in #37245
- Add support for fast image processing in image-pretraining example by @jafraustro in #37021
- Allow flexible generation params arg when checking pipeline specs by @Rocketknight1 in #37211
- [CI] green llama tests by @gante in #37244
- Adding links to ShieldGemma 2 technical report by @RyanMullins in #37247
- feat: updated model card for qwen_2.5_vl by @arkhamHack in #37099
- Update model card for Cohere by @bimal-gajera in #37056
- chore: Update model doc for code_llama by @AbhishekRP2002 in #37115
- Update Model Card for ModernBERT by @ParagEkbote in #37052
- Update model card for electra by @Wu-n0 in #37063
- [qwen-vl] fix image processor by @zucchini-nlp in #37258
- update error msg by @itazap in #37207
- Fix utils/check_bad_commit.py by @ydshieh in #37272
- Support return_tensors in audio chat templates by @zucchini-nlp in #34601
- Update ruff to 0.11.2 by @ydshieh in #36962
- Fix typing for None valued variables by @cyyever in #37004
- Use lru_cache for tokenization tests by @ydshieh in #36818
- Create and Expose SamVisionModel as public for better accessibility by @geetu040 in #36493
- [Feature] Support using FlashAttention2 on Ascend NPU by @FightingZhen in #36696
- Remove low_cpu_mem_usage and _fast_init by @Cyrilvallez in #36963
- Refactor return_dict logic to remove complicated if/else paths by @qubvel in #36794
- Refactor attention for SigLIP based models by @qubvel in #36981
- Add Optional to types by @cyyever in #37163
- Purge unused ModelTester code by @Rocketknight1 in #37085
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @cyyever
- [2/N] Use pyupgrade --py39-plus to improve code (#36857)
- Remove extra tensor clone in PyTorch code (#36748)
- Add ruff target-version (#36971)
- Fix tensor dtype mismatch (#36985)
- Use torch.expm1 (#36995)
- Fix Optional type annotation (#36841)
- Remove deprecated training arguments (#36946)
- Avoid unnecessary device operations in loss computing (#36950)
- Fix typing for None valued variables (#37004)
- Set weights_only in torch.load (#36991)
- Remove deprecated batch_size parameter (#37007)
- Change deprecated PT functions (#37041)
- Remove deprecated code (#37059)
- Fix more inefficient PT operations (#37060)
- Merge tensor operations with device transfer operations (#37097)
- [3/N] Use pyupgrade --py39-plus to improve code (#36936)
- Add py.typed (#37022)
- Add Optional to types (#37163)
- @bzantium
- [WIP] add deepseek-v3 (#35926)
- @bozheng-hit
- Adding Qwen3 and Qwen3MoE (#36878)
- @geetu040
- Create and Expose SamVisionModel as public for better accessibility (#36493)
- @FightingZhen
- [Feature] Support using FlashAttention2 on Ascend NPU (#36696)
- @nikosanto13
- Introduce modular files for speech models (#35902)