New Model additions
EuroBERT
EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It covers a mixture of European and other widely spoken languages and supports sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
- Add eurobert (#39455) by @ArthurZucker in #39455
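The only structural change from a Llama-style decoder is that the attention mask is not causal. A minimal numpy sketch of the two mask variants (the `attention_mask` helper is illustrative, not transformers code):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Additive attention mask: 0.0 where attention is allowed, -inf where blocked."""
    mask = np.zeros((seq_len, seq_len))
    if causal:
        # Decoder-style (Llama): token i may only attend to positions <= i.
        mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

causal = attention_mask(4, causal=True)          # lower-triangular visibility
bidirectional = attention_mask(4, causal=False)  # full visibility, encoder-style
# In the causal mask, position 0 cannot see position 3; in the
# bidirectional mask every position sees every other, which is what
# makes EuroBERT an encoder despite the Llama-like stack.
```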
VibeVoice ASR
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
TimesFM2.5
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
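Input patching turns a raw series into fixed-length chunks that the decoder treats as tokens. A rough numpy sketch of the idea (the helper and the patch length are illustrative, not the model's actual preprocessing):

```python
import numpy as np

def patch_series(series: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a 1-D series into non-overlapping input patches,
    left-padding with zeros so the length divides evenly."""
    pad = (-len(series)) % patch_len
    padded = np.concatenate([np.zeros(pad), series])
    return padded.reshape(-1, patch_len)

patches = patch_series(np.arange(100, dtype=float), patch_len=32)
# 100 points are left-padded to 128 and become 4 patches of 32;
# each patch is then embedded and fed to the decoder-only stack.
print(patches.shape)  # (4, 32)
```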
PP-DocLayoutV2
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
- [Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018
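The pointer network learns reading order from data; for intuition only, here is a naive geometric baseline that sorts detected boxes top-to-bottom, then left-to-right (this is not the model's algorithm and breaks on multi-column layouts, which is exactly why a learned orderer is used):

```python
def naive_reading_order(boxes):
    """Return box indices sorted top-to-bottom, then left-to-right.
    Each box is (x0, y0, x1, y1). A learned pointer network replaces
    this heuristic in PP-DocLayoutV2."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))

boxes = [
    (50, 300, 500, 350),  # body paragraph
    (50, 10, 500, 60),    # title
    (50, 100, 250, 280),  # figure under the title
]
print(naive_reading_order(boxes))  # [1, 2, 0]: title, figure, paragraph
```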
OlmoHybrid
OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.
Links: Documentation
- Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358
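The key property of the custom cache is that the two layer types store different things: attention layers grow a KV cache with sequence length, while linear-attention layers keep a fixed-size recurrent state. A minimal sketch of that split (not the actual transformers implementation):

```python
class HybridCacheSketch:
    """Per-layer cache holding KV pairs for full-attention layers and a
    single recurrent state for linear-attention (Gated DeltaNet) layers."""

    def __init__(self, layer_types):
        self.layer_types = layer_types  # e.g. ["linear", "full", ...]
        self.kv = {}                    # layer index -> list of (k, v) per step
        self.state = {}                 # layer index -> latest recurrent state

    def update(self, layer_idx, **tensors):
        if self.layer_types[layer_idx] == "full":
            # Attention layers append to a growing KV cache.
            self.kv.setdefault(layer_idx, []).append((tensors["k"], tensors["v"]))
        else:
            # Linear-attention layers overwrite a constant-size state.
            self.state[layer_idx] = tensors["state"]

cache = HybridCacheSketch(["linear", "full"])
cache.update(0, state="s0")
cache.update(1, k="k0", v="v0")
cache.update(0, state="s1")
cache.update(1, k="k1", v="v1")
# The KV cache grew to 2 steps; the recurrent state stayed O(1).
```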
ModernVBert
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
ColModernVBert
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
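Following the ColPali/ColBERT late-interaction recipe, relevance is scored by comparing every query-token embedding against every document-patch embedding. A numpy sketch of the MaxSim scoring step (shapes are illustrative, not the model's actual dimensions):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late interaction: for each query token embedding, take the maximum
    cosine similarity over all document patch embeddings, then sum over
    query tokens."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))  # 8 query-token embeddings
doc = rng.normal(size=(64, 128))   # 64 document-patch embeddings
score = maxsim_score(query, doc)   # higher = more relevant document
```

Because each per-token cosine similarity is at most 1, the score is bounded by the number of query tokens, which makes scores comparable across documents for a fixed query.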
Higgs Audio V2
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
Higgs Audio V2 Tokenizer
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
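The frame-rate figures translate directly into sequence-length savings for the downstream audio language model, as this small arithmetic check shows:

```python
SAMPLE_RATE = 24_000  # Hz, the tokenizer's unified training rate
FRAME_RATE = 25       # audio frames (token steps) per second

samples_per_frame = SAMPLE_RATE // FRAME_RATE  # samples compressed per frame
frames_per_minute = FRAME_RATE * 60            # token steps per minute of audio

print(samples_per_frame, frames_per_minute)  # 960 1500
# At a 50 fps baseline the same minute of audio would take 3000 steps,
# so halving the frame rate halves the sequence the LM must model.
```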
Breaking changes
Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
- 🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille
The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.
- 🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1
3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
- 🚨 Unify 3D position ids (#43972) by @zucchini-nlp
🚨 Tokenizer x vLLM fixes 🚨:
Unigram tokenizers were missing support for the SentencePiece (spm) precompiled charsmap. We ran an overall v4 vs v5 regression test and fixed what we had missed.
This was done in:
Generation
Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
- [higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]
- fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]
- Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]
- [generate] Always pass full input_ids in `prepare_inputs_for_generation` (#44226) by @Cyrilvallez in [#44226]
- fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]
- [generate] Completely stop relying on `cache_position` to prepare inputs (#44130) by @Cyrilvallez in [#44130]
- Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]
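The core idea of the input-preparation refactor can be sketched as follows (an illustrative helper, not the actual transformers code):

```python
import numpy as np

def prepare_decode_inputs(input_ids, past_length):
    """During cached decoding, only tokens not yet covered by the KV cache
    need to reach the model, so the generation loop can pre-slice the ids
    once instead of every model re-deriving the slice from cache_position."""
    return input_ids[:, past_length:]

prompt = np.array([[101, 7, 42, 9]])
# Prefill: no cache yet, the model sees the whole prompt.
prefill = prepare_decode_inputs(prompt, past_length=0)
# Decode step: 4 tokens are cached, only the newly sampled token is passed.
step = prepare_decode_inputs(np.array([[101, 7, 42, 9, 55]]), past_length=4)
print(prefill.shape, step.tolist())  # (1, 4) [[55]]
```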
Tokenization
Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.
- [tiny] Add olmo_hybrid to tokenizer auto-mapping (#44416) by @tyler-romero in [#44416]
- fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor (#44362) by @harshaljanjani in [#44362]
- update fuyu tokenizer class (#44235) by @itazap in [#44235]
- fix(testing): Fix LayoutXLM tokenization test and LightOnOCR SDPA flash test failures on main CI (#43988) by @harshaljanjani in [#43988]
- [docs] tokenizer summary (#43965) by @stevhliu in [#43965]
- [docs] refactor tokenizer docs (#43900) by @stevhliu in [#43900]
Kernels
Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.
- Fix kernels security issue (#44395) by @Cyrilvallez in [#44395]
- Enable Liger Kernel when doing hyperparameter search. (#44329) by @linfeng-du in [#44329]
- [`Mamba`] Fix kernel loading (#44176) by @vasqu in [#44176]
- [`Flash Attn`] Enable compatible implementations (#44177) by @vasqu in [#44177]
- Fix percentage formatting in help messages for gradient checkpointing, Liger Kernel, and empty cache steps (#44100) by @qgallouedec in [#44100]
Quantization
This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.
- [Quantization] Fixing mxfp4 saving using reverse_op (#43148) by @MekkCyber in [#43148]
- [Quantization] Add metal quantization for MPS devices! (#43934) by @MekkCyber in [#43934]
- Enable mxfp4 model on CPU (#43512) by @jiqing-feng in [#43512]
- Add Four Over Six quantization integration (#43970) by @jackcook in [#43970]
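For intuition about what 4-bit weight quantization does, here is a generic symmetric integer fake-quantization round trip. Note this is not the MXFP4 or NVFP4 format (those use 4-bit floating-point codes with block scales); it only illustrates the quantize/dequantize step and its error bound:

```python
import numpy as np

def fake_quant_int4(x: np.ndarray):
    """Symmetric 4-bit fake quantization: map values onto 15 signed
    integer levels in [-7, 7] and dequantize back with one shared scale."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale, scale

weights = np.array([0.7, -0.21, 0.05, -0.7])
dequant, scale = fake_quant_int4(weights)
# Round-trip error is bounded by half a quantization step (scale / 2);
# real 4-bit formats shrink that error further with per-block scales.
```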
Vision
Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.
- [AMD CI] Add missing timm dependency to ROCm Docker image (#44389) by @Abdennacer-Badaoui in [#44389]
- update glm image model expected out for tests (#43907) by @kaixuanliu in [#43907]
- Fix image processors `from_dict` backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]
Bugfixes and improvements
- Update PR template (#44415) by @SunMarc in [#44415]
- Add Qwen3.5 support for sequence classification (#44406) by @medhakimbedhief in [#44406]
- update the expected output for qwen2_5_vl w/ pytorch 2.10 XPU (#44426) by @kaixuanliu in [#44426]
- add support for nemotron_3 (#44390) by @liding-nv in [#44390]
- [ Dynamic weight loader] fix remote code when format matches (#44396) by @ArthurZucker in [#44396]
- [timesfm2_5] fix timesfm2.5 loss (#44331) by @kashif in [#44331]
- Fix peft conversion mappings (#44413) by @Cyrilvallez in [#44413]
- Reduce tqdm verbosity during model loading (#44414) by @Cyrilvallez in [#44414]
- docs: Add NeMo Automodel community integration docs (#44304) by @adil-a in [#44304]
- [CB] Small fixes (#44227) by @remi-or in [#44227]
- Support non-gated experts (#44319) by @IlyasMoutawwakil in [#44319]
- [Bugfix] fix qwen3.5 no split module (#44382) by @JJJYmmm in [#44382]
- Fix mutable default arguments and resource leaks (#44287) by @jashshah999 in [#44287]
- skip 2 invalid test cases for voxtral_realtime model (#44321) by @kaixuanliu in [#44321]
- Mamba-1/-2 init weights in mixer class (#43778) by @kevinli573 in [#43778]
- add expectations for xpu for olmo_hybrid model (#44353) by @kaixuanliu in [#44353]
- [VITS] Add `speaking_rate` as an optional forward argument (#43283) by @gau-nernst in [#43283]
- Strict export cleanup (#44293) by @IlyasMoutawwakil in [#44293]
- [docs] kernelconfig fix (#44337) by @stevhliu in [#44337]
- Add `ProcessingKwargs`, `ImagesKwargs` etc. to docs (#44269) by @yonigozlan in [#44269]
- Fix typos in comments and docstrings (#44332) by @tysoncung in [#44332]
- Add testing guide for agents for trainer tests (#44328) by @SunMarc in [#44328]
- Update common tests Trainer (#44260) by @SunMarc in [#44260]
- [timesfm2_5] fix timesfm mlp bias (#44325) by @kashif in [#44325]
- fix zero3 init config (#44236) by @SunMarc in [#44236]
- Update expected output for Jais2 model tests (#43910) by @kaixuanliu in [#43910]
- Improve `has_similar_generate_outputs` assertions (#44166) by @tarekziade in [#44166]
- Fix failed test case for exaone_moe model (#43938) by @kaixuanliu in [#43938]
- fix(modeling_attn_mask_utils): remove FutureWarning from logger.warning_once() (#44307) by @imstevenpmwork in [#44307]
- Remove remaining vestiges of the TranslationPipeline (#43869) by @Rocketknight1 in [#43869]
- XPU now supports backward for the FA2 fixed path (#43905) by @YangKai0616 in [#43905]
- Fix: use `TokenizersBackend` for Olmo3 to preserve custom `pre_tokenizer` (#44294) by @mario-sanz in [#44294]
- Fix special token maps BC (#44281) by @ArthurZucker in [#44281]
- [`Modular`] Fix file type regression (#44283) by @vasqu in [#44283]
- [auto_docstring] Improve typing parsing and add tests (#43748) by @yonigozlan in [#43748]
- Restore response_schema saving-loading (#44282) by @Rocketknight1 in [#44282]
- Use associative scan HOP mamba recurrentgemma (#43737) by @riccardofelluga in [#43737]
- chore: fixes in `Trainer` class docs (`compute_loss` & `hyperparameter_search`) (#44268) by @ethanknights in [#44268]
- fix(trainer): pass optim_args to SGD, Adagrad, and RMSprop optimizers (#44203) by @nightcityblade in [#44203]
- fix(utils): Make torch_compilable_check compatible with torch.export strict mode (#44266) by @harshaljanjani in [#44266]
- Fix TypeError in convert_rope_params_to_dict when ignore_keys is a list (#44272) by @hangjun-ezra in [#44272]
- [docs] callbacks and collators (#44239) by @stevhliu in [#44239]
- [docs] trainer part 1 (#44185) by @stevhliu in [#44185]
- Remove refs to grouped_entities (#44182) by @Rocketknight1 in [#44182]
- [mimi] nit (#44237) by @eustlb in [#44237]
- Fix local dataset loading priority in run_image_classification_no_tra… (#44199) by @gowthamr-tech in [#44199]
- chore: added CLAUDE.md alias (#44232) by @tarekziade in [#44232]
- fix: add missing return type annotations to type-checking utilities in generic.py (#44241) by @yushiran in [#44241]
- Fix return value - fixes #44238 (#44240) by @tarekziade in [#44240]
- fix regression report_to "all" (#44250) by @SunMarc in [#44250]
- [`fix`] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen in [#44078]
- Add processing tests for phi4 multimodal (#44234) by @yonigozlan in [#44234]
- fix: `VersionComparison.from_string` return type mismatch (#43709) by @tarekziade in [#43709]
- refactor _inner_training_loop to smaller methods (#44041) by @winglian in [#44041]
- [docs] fix broken chat_templating links in tasks docs (#44115) by @Deep-unlearning in [#44115]
- Add missing backtick in `AnyToAnyPipeline.__call__` docstring (#44229) by @alvarobartt in [#44229]
- Docs(it): fix typo in sentencepiece install command (#44218) by @matisgagneux21 in [#44218]
- Docs(it): fix typo in docstring wording (#44219) by @matisgagneux21 in [#44219]
- fix bug with position_ids on qwen3-vl models, such that position_ids include text position (#44158) by @leopold-tzafon in [#44158]
- Update 404ing BillSum dataset URL on Summarization Task guide (#44212) by @alexandercarruthers in [#44212]
- fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding (#44187) by @harshaljanjani in [#44187]
- [CB] [Major] Asynchronous batching (#43960) by @remi-or in [#43960]
- Fix LASR feature extractor regression from invalid center argument (#44207) by @ainergiz in [#44207]
- Models with incorrect tokenizer_class in tokenization_config.json tha… (#44179) by @itazap in [#44179]
- chore(typing): initial ty integration (#44167) by @tarekziade in [#44167]
- fix(flaky): `test_generate_with_and_without_position_ids` in GLM ORC (#44173) by @tarekziade in [#44173]
- [docs] Add Chinese translations for common NLP task tutorials (#44144) by @TinderZ in [#44144]
- [Mimi] Calibrate to ensure encoder streaming performs correctly (#43971) by @caffeinism in [#43971]
- ESM2 attention_mask and token_dropout fix (#44163) by @lhallee in [#44163]
- bring back our demons: clean_up_tokenization_spaces (#44035) by @ArthurZucker in [#44035]
- Fix `Seq2SeqTrainingArguments` documentation (#35258) by @qgallouedec in [#35258]
- AutoGrad support for grouped_mm fallback (#44152) by @IlyasMoutawwakil in [#44152]
- Patch `__setitem__` on `ModelOutput` even if the parameter was previously `None` (#44080) by @tomaarsen in [#44080]
- [`simple`] Fix up `__repr__` whitespace/brackets (#44048) by @tomaarsen in [#44048]
- [`chore`] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen in [#44051]
- Raise informative error when loading video processors (#44125) by @zucchini-nlp in [#44125]
- fix(flaky): Different approach to make sure loss exists (#43804) by @tarekziade in [#43804]
- [voxtral] fix voxtral proc (#44132) by @eustlb in [#44132]
- [docs] Fix typos in GenerationConfig docstring (#44143) by @nightcityblade in [#44143]
- Fix gemma3n `get_audio_features` (#44040) by @zucchini-nlp in [#44040]
- Fix UMT5EncoderModel embedding weights not being tied after loading (#43880) by @jiqing-feng in [#43880]
- fix(testing): Update stale device override test in GraniteSpeech (#44113) by @harshaljanjani in [#44113]
- [Misc][vlms] Use text_config when initializing the fine-grained FP8Expert (#44032) by @JJJYmmm in [#44032]
- docs: fix typo 'AuoQuant' → 'AutoQuant' and clarify FINEGRAINED_FP8 library column (#44131) by @cluster2600 in [#44131]
- Update post proc (#44090) by @itazap in [#44090]
- Fix: flaky `Kosmos2ModelTest` test (#44061) by @tarekziade in [#44061]
- AutoTokenizer ignores config when model_type is None (#44127) by @itazap in [#44127]
- Migrate GPT2 to standardized output capture decorators (#43983) by @Aki-07 in [#43983]
- `grouped_mm` fallback (#44043) by @IlyasMoutawwakil in [#44043]
- Bump dev version (#44099) by @qgallouedec in [#44099]
- Fix loading logic issue (#44095) by @Cyrilvallez in [#44095]
- [docs] customizing tokenizers (#43929) by @stevhliu in [#43929]
- Merge test_keep_in_fp32_modules and test_keep_in_fp32_modules_strict (#44097) by @Rocketknight1 in [#44097]
- [voxtral-realtime] update runner expected values (#44096) by @eustlb in [#44096]
- Use torch.isfinite (#44069) by @cyyever in [#44069]
- add default flash impl (#44081) by @ArthurZucker in [#44081]
- Remove unused dependencies (#43904) by @cyyever in [#43904]
- Fix patchtsmixer call to post_init (#44082) by @Cyrilvallez in [#44082]
- Fix false positive right-padding warning for decoder-only models in pipeline (#44021) by @ in [#44021]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ArthurZucker
- @liding-nv
- add support for nemotron_3 (#44390)
- @kashif
- @remi-or
- @ebezzam
- @MekkCyber
- @tarekziade
- perf: Optimize SynthID logits processor batch index construction (#44172)
- Improve `has_similar_generate_outputs` assertions (#44166)
- fix(flaky): idefics generate cache flake (#44180)
- chore: added CLAUDE.md alias (#44232)
- Fix return value - fixes #44238 (#44240)
- fix: `VersionComparison.from_string` return type mismatch (#43709)
- fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201)
- chore(typing): initial ty integration (#44167)
- fix(flaky): `test_generate_with_and_without_position_ids` in GLM ORC (#44173)
- fix(flaky): Different approach to make sure loss exists (#43804)
- Fix: flaky `Kosmos2ModelTest` test (#44061)
- @zhang-prog
- [Model] Add PP-DocLayoutV2 Model Support (#43018)
- @yanhong-lbh
- Add OLMo Hybrid model (#43358)
- @vasqu
- @jackcook
- Add Four Over Six quantization integration (#43970)
- @winglian
- refactor _inner_training_loop to smaller methods (#44041)
- @paultltc
- Add ModernVBERT models (#42504)
- @TinderZ
- [docs] Add Chinese translations for common NLP task tutorials (#44144)
- @szhengac
- Add Higgs Audio V2 Model (#40294)