pypi transformers 5.10.1
Release v5.10.1

4 hours ago

Release v5.10.1

v5.10.0 was yanked as we publish on a corrupted branch. Sorry everyone, this happens when we rush a release!!!

New Model additions

Gemma4 unified+ Gemma4 MTP

image

Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.

Key differences from standard Gemma 4:

  • No Vision Tower: Raw pixel patches are projected directly into LM space via a Dense + LayerNorm pipeline with factorized 2D positional embeddings, replacing the vision encoder.
  • No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple RMSNorm → Linear pipeline, replacing the mel spectrogram + Conformer encoder.
  • Shared Multimodal Pipeline: Both vision and audio use the same Gemma4UnifiedMultimodalEmbedder (RMSNorm → Linear) for the final projection to text hidden space.

You can find the original Gemma 4 12B Unified checkpoints under the Gemma 4 release.

Sapiens2

Sapiens2 is a family of high-resolution vision transformers pretrained on ~1 billion curated human images, designed for human-centric computer vision tasks including pose estimation, body-part segmentation, surface normal estimation, and pointmap estimation. The models scale from 0.4B to 5B parameters and train at native 1K resolution, with hierarchical 4K variants for extended spatial reasoning. Sapiens2 achieves substantial improvements over its predecessor with +4 mAP in pose estimation, +24.3 mIoU in body-part segmentation, and 45.6% error reduction in normal estimation.

Links: Documentation | Paper

DeepSeek-OCR-2

DeepSeek-OCR-2 is an OCR-specialized vision-language model built on a distinctive architecture that combines a SAM ViT-B vision encoder with a Qwen2 hybrid attention encoder, connected through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. The model features a hybrid attention mechanism that applies bidirectional attention over image tokens and causal attention over query tokens, enabling efficient and accurate document understanding. It supports both plain OCR tasks and grounding capabilities with coordinate-aware output for document conversion to markdown format.

Links: Documentation

Mellum

Mellum is a code-focused Mixture-of-Experts language model developed by JetBrains. It is derived from the Qwen3-MoE architecture with per-layer-type RoPE and interleaved sliding window attention. The model has 12B total parameters with 2.5B active parameters per token, using 64 routed experts with 8 activated per token across 28 layers.

Links: Documentation

Breaking changes

The Gemma4 vision pooler now casts inputs to float32 before scaling to prevent float16 overflow (inf saturation) with large checkpoints, which may cause minor numerical differences in outputs for users running Gemma-4 vision models in float16.

Audio Language Models (ALMs) now have a dedicated base model class without a language modeling head, aligning them with the design of Vision Language Models (VLMs); users relying on the previous model class structure should update their code to use the new base model class where appropriate.

Parallelization

This release includes numerous bug fixes for model parallelism across multiple models (Gemma4, AltCLIP, ChineseClip, Blip-2, Whisper, Ovis2, Moshi) and parallel execution strategies, including fixes for tensor parallelism (TP), expert parallelism (EP), beam search under model parallel settings, and loss over-counting under TP/EP configurations. The continuous batching manager was also reworked for clearer control flow and improved TP race condition handling, and FSDP initialization via from_pretrained was introduced.

Cache

Fixed a regression in encoder-decoder cache initialization where the decoder config was incorrectly applied to the cross-attention cache, and resolved a RuntimeError caused by buffer size limits when warming up the cache on MPS devices. Additional test infrastructure improvements were made to support read-only cache environments used in CI.

Quantization

Added support for DeepGEMM BF16, mixed FP8/FP4, and MegaMoE quantization via a grouped linear refactor, while fixing two bugs: an FP8 MoE reverse substring issue affecting DSv4 initialization, and a BitsAndBytes 4-bit/8-bit quantization bug that silently dropped chunked tensors from one-to-many weight converters.

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @filipinescu
    • [docs] Romanian translation of contributing.md, modular_transformers.md, multimodal_processing.md, add_vision_processing_components.md, add_audio_processing_components.md, modeling_rules.md, model_output_tracing.md, auto_docstring.md, testing.md, pr_checks.md and add_new_model.md . (#46345)
    • [docs] Romanian translation of weightconverter.md, models.md, custom_models.md, monkey_patching.md, fusion_mapping.md, how_to_hack_models.md, model_sharing.md and serialization.md. (#46309)
    • Romanian translation of README.md, index.md, installation.md, _config.py and quicktour.md. (#46166)
  • @remi-or
    • [CB] [Major] Rework manager to have clearer control flow + handle TP (#46070)
  • @thisisiron
    • Add Deepseek-OCR-2 model (#45075)
  • @kaixuanliu
    • Add Expectations for pipeline token classification tests (#46151)
    • fix series of bugs for model parallel beam search (#46280)
    • add more generic support for distributed trainer tests (#46109)
    • add XPU Expectations for florence2 and lfm2_vl model test (#46275)
    • Fix model parallel issue for altclip model and ChineseClip model (#45487)
    • Model parallel fix (#46230)
    • fix(hrm_text): Add XPU Expectations for tests (#46214)
    • Fix model parallel bugs for Gemma4 (#45817)
    • Fix bnb 4bit/8bit quantization drop chunked tensors bug (#46210)
    • fix model parallel device mismatch issue in create_bidirectional_mask (#46221)
    • Fix a regression in encoder-decoder generation cache initialization (#46111)
  • @shadeMe
    • feat: Add support for JetBrains' Mellum v2 code generation model (#46112)
  • @vasqu
    • [Revert] FSDP+Dtensor refactor related changes (#46246)
    • [Configs] Fix layer type validation to include its mlp counterpart (#46220)
  • @zRzRzRzRzRzRzR
    • [GLM-4.6V] Update with GLM-GA Processor (#46184)
  • @eustlb
    • 🚨 [ALM] Add base model without head (#45534)

Don't miss a new transformers release

NewReleases is sending notifications on new releases.