huggingface/transformers v5.3.0
v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2


New Model additions

EuroBERT


EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.
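The key architectural difference from a Llama-style decoder is the attention mask: as an encoder, every token can attend to every other token. A minimal sketch of the two mask patterns (plain Python for illustration, not the actual EuroBERT implementation):

```python
def attention_mask(seq_len, bidirectional):
    # mask[i][j] is True when token i may attend to token j.
    # A causal (decoder) mask only allows attending to past positions;
    # a bidirectional (encoder) mask, as in EuroBERT, allows all positions.
    return [[True if bidirectional else j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

With `bidirectional=False` this reproduces the lower-triangular causal mask of Llama-style decoders.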

Links: Documentation | Paper | Blog Post

VibeVoice ASR


VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
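Since the tokenizers expect 24 kHz audio, input recorded at other sample rates must be resampled first (the model's processor normally handles this). A toy linear-interpolation resampler to illustrate the idea — real pipelines use higher-quality filters:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono waveform by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in the source
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# The 60-minute input cap corresponds to 86.4 million samples at 24 kHz.
max_samples = 60 * 60 * 24_000
```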

Links: Documentation | Paper

TimesFM2.5


TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
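Input patching means the context window is split into fixed-length patches that are embedded as single tokens, rather than feeding one time step per token. A minimal sketch — the actual patch length and padding scheme are model internals; left-padding with zeros here is an assumption:

```python
def patch_series(values, patch_len):
    # Left-pad with zeros so the length is a multiple of patch_len,
    # then split into consecutive, non-overlapping patches.
    pad = (-len(values)) % patch_len
    padded = [0.0] * pad + list(values)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]
```

Patching shortens the effective sequence length the attention layers see, which is what makes long forecasting contexts tractable.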

Links: Documentation | Paper

PP-DocLayoutV2


PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
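The pointer network learns reading order from data, but the intuition can be sketched with a simple geometric heuristic: sort detected boxes top-to-bottom, breaking ties left-to-right. This hypothetical helper only approximates what the learned model does:

```python
def heuristic_reading_order(boxes, line_tol=10):
    """boxes: list of (x0, y0, x1, y1); returns indices in approximate reading order.

    Boxes whose top edges fall within the same line_tol-pixel band are
    treated as one line and read left to right.
    """
    return sorted(range(len(boxes)),
                  key=lambda i: (boxes[i][1] // line_tol, boxes[i][0]))
```

A learned pointer network handles cases this heuristic cannot, such as multi-column layouts and figures interleaved with text.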

Links: Documentation

OlmoHybrid

OLMo Hybrid is a hybrid-architecture model from Ai2 that interleaves standard transformer attention layers with linear attention layers based on Gated DeltaNet, aiming to improve efficiency while maintaining model quality. The model uses a custom cache system that handles both the KV cache for attention layers and the recurrent state for linear attention layers.
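A sketch of the interleaving and the two kinds of per-layer state. The actual ratio of full-attention to linear-attention layers and the cache classes are model internals; the 1-in-4 pattern below is an assumption for illustration:

```python
def build_layer_types(num_layers, full_attn_every=4):
    # Interleave: every full_attn_every-th layer uses full softmax attention,
    # the rest use Gated DeltaNet-style linear attention.
    return ["full_attention" if (i + 1) % full_attn_every == 0 else "linear_attention"
            for i in range(num_layers)]

def init_hybrid_cache(layer_types):
    # Full-attention layers grow a KV cache with sequence length;
    # linear-attention layers keep a fixed-size recurrent state instead.
    return [{"keys": [], "values": []} if t == "full_attention"
            else {"recurrent_state": None}
            for t in layer_types]
```

The fixed-size recurrent state is what gives the linear layers constant memory per token, while the occasional full-attention layers preserve exact long-range retrieval.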

Links: Documentation

ModernVBert


ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.

Links: Documentation | Paper

ColModernVBert

ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
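In the ColPali-style late-interaction scheme, relevance is the sum, over query-token embeddings, of each token's maximum similarity to any document-patch embedding (MaxSim). A minimal pure-Python version of that scoring rule:

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) relevance score.

    query_vecs: one embedding per query token.
    doc_vecs: one embedding per document image patch/token.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, take its best match in the document, then sum.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

In practice the embeddings are L2-normalized so the dot product is a cosine similarity, and scoring is batched on GPU across many documents.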

Links: Documentation | Paper

Higgs Audio V2


Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.

Links: Documentation

Higgs Audio V2 Tokenizer

The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
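At 24 kHz and 25 tokens per second, each audio token covers 960 samples (40 ms); a 50 fps baseline needs twice as many tokens for the same clip:

```python
import math

SAMPLE_RATE = 24_000   # Hz, the tokenizer's unified training rate
FRAME_RATE = 25        # audio tokens per second

samples_per_token = SAMPLE_RATE // FRAME_RATE   # 960 samples = 40 ms per token

def num_tokens(duration_s, frame_rate=FRAME_RATE):
    # Tokens needed to cover a clip, rounding the last partial frame up.
    return math.ceil(duration_s * frame_rate)
```

A one-minute clip is 1,500 tokens at 25 fps versus 3,000 at a 50 fps baseline, which directly halves the sequence length an audio language model must process.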

Links: Documentation

Breaking changes

Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.

  • 🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille

The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.

  • 🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu

Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.

3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.

🚨 Tokenizer x vLLM fixes 🚨

Unigram tokenizers were missing support for the SentencePiece precompiled charsmap. We ran an overall v4-vs-v5 regression test and fixed the gaps we had missed.

This was done in:

  • [vllm + v5 fix] handle TokenizersBackend fallback properly for v5 (#44255) by @itazap

Generation

Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
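The refactor means the generation loop slices the sequence itself and hands only the not-yet-cached tokens to `prepare_inputs_for_generation`. A schematic of that slicing (an illustration, not the actual transformers code):

```python
def slice_new_tokens(input_ids, past_length):
    # During prefill past_length is 0 and the whole prompt is fed forward;
    # during decoding only the tokens the KV cache has not seen yet remain.
    return input_ids[past_length:]
```

For example, after one decoding step with a 3-token prompt, `slice_new_tokens([5, 6, 7, 8], 3)` returns just the newest token, `[8]`.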

Tokenization

Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.

Kernels

Several kernel-related issues were fixed: a security vulnerability was patched, Mamba kernel loading now handles incompatible import structures, the Liger Kernel is properly enabled during hyperparameter search, and Flash Attention was expanded to support multiple compatible implementations.

Quantization

This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.
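Both MXFP4 and NVFP4 store weights as 4-bit E2M1 floats (1 sign, 2 exponent, 1 mantissa bit) plus a shared per-block scale. Ignoring the block scale, the element format can represent only sixteen values; a sketch of round-to-nearest quantization onto that grid:

```python
# Magnitudes representable by the E2M1 (FP4) element format.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x):
    # Round to the nearest signed FP4 value. Real MXFP4/NVFP4 first divides
    # each block of elements by a shared scale so values fit within [-6, 6].
    candidates = [s * m for m in FP4_MAGNITUDES for s in (1.0, -1.0)]
    return min(candidates, key=lambda v: abs(v - x))
```

The shared block scale (a power of two in MXFP4, an FP8 value in NVFP4) is what lets this tiny grid cover the full dynamic range of a weight tensor.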

Vision

Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ArthurZucker
    • Add eurobert (#39455)
    • [ Dynamic weight loader] fix remote code when format matches (#44396)
    • Fix special token maps BC (#44281)
    • bring back our demons: clean_up_tokenization_spaces (#44035)
    • add default flash impl (#44081)
  • @liding-nv
    • add support for nemotron_3 (#44390)
  • @kashif
    • [timesfm2_5] fix timesfm2.5 loss (#44331)
    • [timesfm2_5] fix timesfm mlp bias (#44325)
    • Timesfm 2.5 (#41763)
  • @remi-or
    • [CB] Small fixes (#44227)
    • [CB] [Major] Asynchronous batching (#43960)
  • @ebezzam
    • [VibeVoice ASR] Use updated padding cache for ASR model. (#44392)
    • Add VibeVoice ASR (#43625)
  • @MekkCyber
    • [Quantization] Fixing mxfp4 saving using reverse_op (#43148)
    • [Quantization] Add metal quantization for MPS devices! (#43934)
  • @tarekziade
    • perf: Optimize SynthID logits processor batch index construction (#44172)
    • Improve has_similar_generate_outputs assertions (#44166)
    • fix(flaky): idefics generate cache flake (#44180)
    • chore: added CLAUDE.md alias (#44232)
    • Fix return value - fixes #44238 (#44240)
    • fix: VersionComparison.from_string return type mismatch (#43709)
    • fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201)
    • chore(typing): initial ty integration (#44167)
    • fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173)
    • fix(flaky): Different approach to make sure loss exists (#43804)
    • Fix: flaky Kosmos2ModelTest test (#44061)
  • @zhang-prog
    • [Model] Add PP-DocLayoutV2 Model Support (#43018)
  • @yanhong-lbh
    • Add OLMo Hybrid model (#43358)
  • @vasqu
    • 🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)
    • [Modular] Fix file type regression (#44283)
    • [Mamba] Fix kernel loading (#44176)
    • [Flash Attn] Enable compatible implementations (#44177)
  • @jackcook
    • Add Four Over Six quantization integration (#43970)
  • @winglian
    • refactor _inner_training_loop to smaller methods (#44041)
  • @paultltc
    • Add ModernVBERT models (#42504)
  • @TinderZ
    • [docs] Add Chinese translations for common NLP task tutorials (#44144)
  • @szhengac
    • Add Higgs Audio V2 Model (#40294)
