github huggingface/transformers v5.5.0
Release v5.5.0

9 hours ago



New Model additions

Gemma4

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that outputs images within a fixed token budget and a spatial 2D RoPE that encodes vision-specific positional information across the height and width axes.


You can find all the original Gemma 4 checkpoints under the Gemma 4 release.

The key difference from previous Gemma releases is the new design for processing images of different sizes with a fixed budget of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 preserves the image's natural aspect ratio while resizing it to fit the budget. There are a couple of constraints to follow:

  • The total number of pixels must fit within a patch budget
  • Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
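The two constraints above can be satisfied with a simple resizing rule. The sketch below is illustrative, not the library's actual implementation: `fit_image_to_budget` is a hypothetical helper that scales an image to roughly preserve its aspect ratio, then rounds each side down to a multiple of 48 so the result stays within the patch budget.

```python
import math

PATCH_SIZE = 16                      # pixels per patch side
POOL_KERNEL = 3                      # pooling kernel applied over patches
DIVISOR = PATCH_SIZE * POOL_KERNEL   # = 48; both sides must divide by this

def fit_image_to_budget(height, width, patch_budget=2520):
    """Resize (height, width) so the total patch count stays within
    patch_budget, the aspect ratio is roughly preserved, and both sides
    are divisible by 48 (hypothetical helper, not the library API)."""
    max_pixels = patch_budget * PATCH_SIZE * PATCH_SIZE
    scale = math.sqrt(max_pixels / (height * width))
    # Round each side DOWN to the nearest multiple of 48 so the scaled
    # image never exceeds the pixel budget.
    new_h = max(DIVISOR, int(height * scale) // DIVISOR * DIVISOR)
    new_w = max(DIVISOR, int(width * scale) // DIVISOR * DIVISOR)
    return new_h, new_w
```

For a 1080×1920 input with the default 2,520-patch budget, this yields 576×1056, which divides evenly by 48 and stays under the ~645K pixel limit.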

Important

Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
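For intuition, the [-1, 1] mapping mentioned above amounts to a simple linear rescale of uint8 pixel values rather than per-channel ImageNet mean/std normalization. This is an illustrative sketch of that mapping, not the model's internal code:

```python
# Linear rescale of uint8 pixel values into [-1, 1]:
# 0 -> -1.0, 255 -> +1.0 (illustrative, not the model's internal code)
pixels = [0, 127, 255]
scaled = [p / 127.5 - 1.0 for p in pixels]
```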

The number of "soft tokens" (aka vision tokens) the image processor produces per image is configurable. The supported options are listed below; the default is 280 soft tokens per image.

| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|------------:|-------------------------:|--------------------|
| 70          | 630                      | ~161K pixels       |
| 140         | 1,260                    | ~323K pixels       |
| 280         | 2,520                    | ~645K pixels       |
| 560         | 5,040                    | ~1.3M pixels       |
| 1,120       | 10,080                   | ~2.6M pixels       |

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector with the same dimensions as the patch embedding. Gemma 4's 2D RoPE independently rotates half of the attention head dimensions using the x-coordinate and the other half using the y-coordinate. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
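The axis-split rotation can be sketched in a few lines. This is a simplified illustration of the general 2D RoPE idea described above, not Gemma 4's exact frequency schedule: the first half of the head dimensions is rotated by angles derived from the x-coordinate, the second half from the y-coordinate.

```python
import math

def rope_2d(vec, x, y, base=10000.0):
    """Simplified 2D RoPE sketch: rotate the first half of the head
    dimensions by x-derived angles and the second half by y-derived
    angles. Assumes len(vec) is divisible by 4 (pairs per axis)."""
    half = len(vec) // 2
    out = list(vec)
    for axis_pos, start in ((x, 0), (y, half)):
        for i in range(0, half, 2):
            theta = axis_pos / (base ** (i / half))
            c, s = math.cos(theta), math.sin(theta)
            a, b = out[start + i], out[start + i + 1]
            out[start + i] = a * c - b * s
            out[start + i + 1] = a * s + b * c
    return out
```

Because each step is a pure rotation, the vector's norm is preserved, and the patch at position (0, 0) is left unchanged.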

NomicBERT

NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.
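The instruction prefixes mentioned above select the task at embedding time by prepending a short tag to the input text. The prefix strings below follow Nomic's published convention, but treat them as an assumption and verify against the model card:

```python
# Task instruction prefixes for an embedding model (illustrative sketch;
# the exact prefix strings are an assumption -- check the model card).
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}

def with_prefix(task, text):
    """Prepend the task-specific instruction prefix before embedding."""
    return PREFIXES[task] + text
```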

Links: Documentation | Paper

MusicFlamingo

Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.
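The Rotary Time Embedding idea can be sketched by deriving the rotation angle from an absolute timestamp in seconds rather than a token index, so audio frames keep consistent temporal positions regardless of frame rate. This is a simplified illustration under that assumption, not Music Flamingo's actual implementation:

```python
import math

def rote_angle(t_seconds, channel, dim, base=10000.0):
    """Sketch of a Rotary Time Embedding angle: computed from absolute
    time in seconds instead of a token index (illustrative only)."""
    return t_seconds / (base ** (2 * channel / dim))

def rotate_pair(a, b, theta):
    """Apply a 2D rotation to one (a, b) feature pair."""
    c, s = math.cos(theta), math.sin(theta)
    return a * c - b * s, a * s + b * c
```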

Links: Documentation | Paper

Breaking changes

Mamba and hybrid model caches are now first-class citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.

Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.

  • 🚨 [LightGlue] Remove remote code execution (#45122) by @vasqu

Vision

Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.

Cache

Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ed22699
    • Internalise the NomicBERT model (#43067)
  • @tarekziade
    • Use doc-builder runnable example for GLM-ASR (#44277)
    • refactoring: speedup static checks with disk cache (#44992)
    • feature: added import complexity checker (#45013)
    • refactor: added cache in check_repo (#45012)
    • chore: remove old extras (#45024)
    • chore: Fix mlinter cache location (#45052)
    • refactor: speed up docstring checker (#45009)
  • @Krishnachaitanyakc
    • fix: correct type annotations across config classes for @strict validation (#45007)
    • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985)
  • @lashahub
  • @Lidang-Jiang
    • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045)
