github huggingface/transformers v5.5.0
Release v5.5.0

9 hours ago



New Model additions

Gemma4

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that outputs images within a fixed token budget and a spatial 2D RoPE that encodes vision-specific positional information across the height and width axes.


You can find all the original Gemma 4 checkpoints under the Gemma 4 release.

The key difference from previous Gemma releases is the new design for processing images of different sizes with a fixed budget of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 preserves the image's natural aspect ratio while resizing it to fit the budget. There are a couple of constraints to follow:

  • The total number of pixels must fit within a patch budget
  • Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
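The two constraints above can be satisfied with a simple resizing rule. The sketch below is illustrative, not the library's actual implementation: `fit_image_to_budget` is a hypothetical helper that scales an image to roughly preserve its aspect ratio, then rounds each side down to a multiple of 48 so the result stays within the patch budget.

```python
import math

PATCH_SIZE = 16                      # pixels per patch side
POOL_KERNEL = 3                      # pooling kernel applied over patches
DIVISOR = PATCH_SIZE * POOL_KERNEL   # = 48; both sides must divide by this

def fit_image_to_budget(height, width, patch_budget=2520):
    """Resize (height, width) so the total patch count stays within
    patch_budget, the aspect ratio is roughly preserved, and both sides
    are divisible by 48 (hypothetical helper, not the library API)."""
    max_pixels = patch_budget * PATCH_SIZE * PATCH_SIZE
    scale = math.sqrt(max_pixels / (height * width))
    # Round each side DOWN to the nearest multiple of 48 so the scaled
    # image never exceeds the pixel budget.
    new_h = max(DIVISOR, int(height * scale) // DIVISOR * DIVISOR)
    new_w = max(DIVISOR, int(width * scale) // DIVISOR * DIVISOR)
    return new_h, new_w
```

For a 1080×1920 input with the default 2,520-patch budget, this yields 576×1056, which divides evenly by 48 and stays under the ~645K pixel limit.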

Important

Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
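For intuition, the [-1, 1] mapping mentioned above amounts to a simple linear rescale of uint8 pixel values rather than per-channel ImageNet mean/std normalization. This is an illustrative sketch of that mapping, not the model's internal code:

```python
# Linear rescale of uint8 pixel values into [-1, 1]:
# 0 -> -1.0, 255 -> +1.0 (illustrative, not the model's internal code)
pixels = [0, 127, 255]
scaled = [p / 127.5 - 1.0 for p in pixels]
```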

The number of "soft tokens" (aka vision tokens) the image processor produces per image is configurable. The supported options are listed below; the default is 280 soft tokens per image.

| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|------------:|-------------------------:|--------------------|
| 70          | 630                      | ~161K pixels       |
| 140         | 1,260                    | ~323K pixels       |
| 280         | 2,520                    | ~645K pixels       |
| 560         | 5,040                    | ~1.3M pixels       |
| 1,120       | 10,080                   | ~2.6M pixels       |

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector with the same dimensions as the patch embedding. Gemma 4's 2D RoPE independently rotates half of the attention head dimensions using the x-coordinate and the other half using the y-coordinate. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
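The axis-split rotation can be sketched in a few lines. This is a simplified illustration of the general 2D RoPE idea described above, not Gemma 4's exact frequency schedule: the first half of the head dimensions is rotated by angles derived from the x-coordinate, the second half from the y-coordinate.

```python
import math

def rope_2d(vec, x, y, base=10000.0):
    """Simplified 2D RoPE sketch: rotate the first half of the head
    dimensions by x-derived angles and the second half by y-derived
    angles. Assumes len(vec) is divisible by 4 (pairs per axis)."""
    half = len(vec) // 2
    out = list(vec)
    for axis_pos, start in ((x, 0), (y, half)):
        for i in range(0, half, 2):
            theta = axis_pos / (base ** (i / half))
            c, s = math.cos(theta), math.sin(theta)
            a, b = out[start + i], out[start + i + 1]
            out[start + i] = a * c - b * s
            out[start + i + 1] = a * s + b * c
    return out
```

Because each step is a pure rotation, the vector's norm is preserved, and the patch at position (0, 0) is left unchanged.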

NomicBERT

NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.
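The instruction prefixes mentioned above select the task at embedding time by prepending a short tag to the input text. The prefix strings below follow Nomic's published convention, but treat them as an assumption and verify against the model card:

```python
# Task instruction prefixes for an embedding model (illustrative sketch;
# the exact prefix strings are an assumption -- check the model card).
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}

def with_prefix(task, text):
    """Prepend the task-specific instruction prefix before embedding."""
    return PREFIXES[task] + text
```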

Links: Documentation | Paper

MusicFlamingo

Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.
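The Rotary Time Embedding idea can be sketched by deriving the rotation angle from an absolute timestamp in seconds rather than a token index, so audio frames keep consistent temporal positions regardless of frame rate. This is a simplified illustration under that assumption, not Music Flamingo's actual implementation:

```python
import math

def rote_angle(t_seconds, channel, dim, base=10000.0):
    """Sketch of a Rotary Time Embedding angle: computed from absolute
    time in seconds instead of a token index (illustrative only)."""
    return t_seconds / (base ** (2 * channel / dim))

def rotate_pair(a, b, theta):
    """Apply a 2D rotation to one (a, b) feature pair."""
    c, s = math.cos(theta), math.sin(theta)
    return a * c - b * s, a * s + b * c
```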

Links: Documentation | Paper

Breaking changes

Mamba and hybrid model caches are now first-class citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.

Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.

  • 🚨 [LightGlue] Remove remote code execution (#45122) by @vasqu

Vision

Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.

Cache

Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ed22699
    • Internalise the NomicBERT model (#43067)
  • @tarekziade
    • Use doc-builder runnable example for GLM-ASR (#44277)
    • refactoring: speedup static checks with disk cache (#44992)
    • feature: added import complexity checker (#45013)
    • refactor: added cache in check_repo (#45012)
    • chore: remove old extras (#45024)
    • chore: Fix mlinter cache location (#45052)
    • refactor: speed up docstring checker (#45009)
  • @Krishnachaitanyakc
    • fix: correct type annotations across config classes for @strict validation (#45007)
    • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985)
  • @lashahub
  • @Lidang-Jiang
    • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045)
