v4.57.0: Qwen3-Next, Vault Gemma, Qwen3 VL, LongCat Flash, Flex OLMO, LFM2 VL, BLT, Qwen3 OMNI MoE, Parakeet, EdgeTAM, OLMO3


New model additions

Qwen3 Next


The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:

  • Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling.
  • High-Sparsity MoE: Achieves an extremely low activation ratio of 1:50 in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
  • Multi-Token Prediction (MTP): Boosts pretraining performance and accelerates inference.
  • Other Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, Gated Attention, and other stabilizing enhancements for robust training.

Built on this architecture, the Qwen team trained and open-sourced Qwen3-Next-80B-A3B (80B total parameters, only 3B active), achieving extreme sparsity and efficiency.

Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost.
Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.

For more details, please see the Qwen3-Next blog post.
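
A minimal generation sketch using the standard auto classes; the checkpoint name is an assumption (check the Qwen organization on the Hub for the exact repository), and the 80B model needs enough GPU memory for device_map="auto" to shard it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository name; verify on the Hub.
model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Give me a short introduction to hybrid attention.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))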

Vault Gemma


VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the Attention and MLP blocks and uses full attention for all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters that uses a 1024-token sequence length.

VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
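
A minimal usage sketch, assuming the checkpoint is published as google/vaultgemma-1b (verify the name on the Hub); since this is a pretrained, non-instruction-tuned model, plain completion is the natural mode:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

# Keep prompt + generation within the 1024-token training sequence length.
inputs = tokenizer("Differential privacy is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))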

Qwen3 VL


Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.

Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.

These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.
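
A hedged image-understanding sketch with the auto classes; the checkpoint name below is an assumption (pick any Instruct variant from the Qwen3-VL collection on the Hub):

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])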

LongCat Flash


The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.

The abstract from the paper is the following:

We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.

Tips:

  • LongCat-Flash uses a unique shortcut-connected MoE architecture that enables faster inference compared to traditional MoE models
  • The model supports up to 128k context length for long-form tasks
  • Dynamic parameter activation makes it computationally efficient while maintaining high performance
  • Best suited for applications requiring strong reasoning, coding, and tool-calling capabilities
  • The MoE architecture includes zero experts (nn.Identity modules) which act as skip connections, allowing tokens to bypass expert computation when appropriate
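
A minimal chat sketch, assuming the meituan-longcat/LongCat-Flash-Chat checkpoint name and enough GPUs for device_map="auto" to shard the 560B parameters:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meituan-longcat/LongCat-Flash-Chat"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the assistant reply, skipping the prompt tokens.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))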

Flex Olmo


FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.

You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.
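
A short loading sketch; the repository name below is a placeholder, so substitute an actual checkpoint from the FlexOlmo collection:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository name: pick a real checkpoint from the FlexOlmo collection.
model_id = "allenai/FlexOlmo-7x7B-1T"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Mixture-of-experts models route tokens", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))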

LFM2 VL


LFM2-VL is the first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency and device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs with variable resolutions.

Architecture

LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

  • Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
  • Base (86M) for fast image processing for LFM2-VL-450M

The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP with pixel unshuffle to reduce the image token count.
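
A quick sketch with the image-text-to-text pipeline; the repository name is an assumption based on the variant names above (verify it on the Liquid AI organization on the Hub):

from transformers import pipeline

# Assumed repository name for the 450M variant.
pipe = pipeline("image-text-to-text", model="LiquidAI/LFM2-VL-450M")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64))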

BLT


The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer.
BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.

The abstract from the paper is the following:

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

Usage Tips:

  • Dual Model Architecture: BLT consists of two separate trained models:

    • Patcher (Entropy Model): A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.
    • Main Transformer Model: The primary model that processes the patches through a Local Encoder, Global Transformer, and Local Decoder.
  • Dynamic Patching: The model uses entropy-based dynamic patching where:

    • High-entropy regions (complex data) get shorter patches with more computational attention
    • Low-entropy regions (predictable data) get longer patches for efficiency
    • This allows the model to allocate compute resources where they're most needed
  • Local Encoder: Processes byte sequences with cross-attention to patch embeddings

  • Global Transformer: Processes patch-level representations with full attention across patches

  • Local Decoder: Generates output with cross-attention back to the original byte sequence

  • Byte-Level Tokenizer: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary.
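
The byte-level mapping in the last tip can be illustrated with a few lines of plain Python; the offset reserved for special tokens below is an assumption for illustration, not BLT's exact table:

# Illustration of the byte-level tokenizer idea: text -> UTF-8 bytes -> token ids.
SPECIAL_TOKEN_OFFSET = 4  # assumed: ids 0-3 reserved for special tokens

def bytes_to_ids(text: str) -> list[int]:
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]

def ids_to_text(ids: list[int]) -> str:
    return bytes(i - SPECIAL_TOKEN_OFFSET for i in ids).decode("utf-8", errors="replace")

ids = bytes_to_ids("Byte Latent Transformer")
print(ids[:8])            # first few byte-level ids
print(ids_to_text(ids))   # round-trips to the original text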

Qwen3 Omni MoE


The Qwen2.5-Omni model is a unified multimodal model proposed in Qwen2.5-Omni Technical Report from the Qwen team, Alibaba Group.

Notes

  • Use [Qwen2_5OmniForConditionalGeneration] to generate audio and text output. To generate only one output type, use [Qwen2_5OmniThinkerForConditionalGeneration] for text-only and [Qwen2_5OmniTalkerForConditionalGeneration] for audio-only outputs.
  • Audio generation with [Qwen2_5OmniForConditionalGeneration] supports only a batch size of 1 at the moment.
  • In case of out-of-memory errors when working with video input, decrease processor.max_pixels. By default the maximum is set to a very large value, and high-resolution visuals will not be resized unless their resolution exceeds processor.max_pixels.
  • The processor has its own [~ProcessorMixin.apply_chat_template] method to convert chat messages to model inputs.
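
Following the notes above, a hedged text-only sketch with the Thinker model; the checkpoint name is an assumption (verify it on the Hub):

import torch
from transformers import Qwen2_5OmniProcessor, Qwen2_5OmniThinkerForConditionalGeneration

model_id = "Qwen/Qwen2.5-Omni-7B"  # assumed repository name
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Briefly explain what a multimodal model is."}]}
]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])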

Parakeet


Parakeet models, introduced by NVIDIA NeMo, combine a Fast Conformer encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token-and-duration transducer (TDT) decoder for automatic speech recognition.

Model Architecture

  • Fast Conformer Encoder: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is a more efficient version of the Conformer Encoder found in FastSpeech2Conformer (see [ParakeetEncoder] for the encoder implementation and details).
  • ParakeetForCTC: a Fast Conformer Encoder + a CTC decoder
    • CTC Decoder: Simple but effective decoder consisting of:
      • 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
      • CTC loss computation for training.
      • Greedy CTC decoding for inference.
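
A hedged transcription sketch with the automatic-speech-recognition pipeline; the checkpoint name and audio path below are placeholders, not confirmed repository names:

from transformers import pipeline

# Placeholder repository name: substitute a Parakeet CTC checkpoint available for transformers.
asr = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")

result = asr("sample.wav")  # any local audio file or URL
print(result["text"])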

EdgeTAM


The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran.

EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.

OLMO3

More details to come soon 👀

Continuous batching

We are introducing Continuous Batching (CB) in this release, and we consider it a stable feature. The main use case for CB is batched generation, which makes it very efficient in the context of GRPO training or evaluation. Thanks to CB, researchers and model developers are now free to use transformers in these contexts without having to spin up an additional inference engine.

CB currently supports both full attention and sliding window attention: this means that the vast majority of models are supported, such as llama, gemma3, and gpt-oss.

CB is also integrated with transformers serve, which means that you can deploy transformers as an OpenAI-compatible HTTP server.
Here is a small snippet showing how to use it:

import datasets
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
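
# Load the model with the paged SDPA attention implementation used by continuous batching.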
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, _attn_implementation="sdpa_paged", device_map="auto"
)
model.generation_config.max_new_tokens = 32
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]

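# Generate completions for all requests using continuous batching.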
batch_outputs = model.generate_batch(inputs=simple_batch_inputs)
for request in batch_outputs:
    print(tokenizer.decode(batch_outputs[request].generated_tokens))
"""
 Let's break down the problem step by step:

1. **Total eggs laid per day**:  
   Janet’s ducks lay **16 eggs per day**
 Let's break down the problem step by step:

1. **Blue fiber**: The robe takes **2 bolts** of blue fiber.
2. **White fiber
 To determine Josh's profit from flipping the house, let's go step by step.

---

### Step 1: Initial cost of the house
Josh buys the
 To find the total distance James runs in a week, we can break down the problem step by step:

1. **Sprints per session**: James runs 
 To determine how many cups of feed Wendi needs to give her chickens in the final meal of the day, let's go step by step.
"""

Breaking changes

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @hiyouga
    • Support batch size > 1 image-text inference (#36682)
  • @cyyever
    • Fix typos (#40585)
    • Fix inexistent imports (#40580)
    • Remove unnecessary pillow version check (#40604)
    • Fix invalid typing (#40612)
    • Enable more ruff UP rules (#40579)
    • Avoid attention_mask copy in qwen2.5 (#40658)
    • Fix parent classes of ProcessingKwargs (#40676)
    • Fix parent classes of AllKwargsForChatTemplate (#40685)
    • Fix arguments (#40605)
    • Add Optional typing (#40686)
    • Fix np array typing (#40741)
    • Fix more typos (#40627)
    • Remove reference of video_load_backend and video_fps for processor (#40719)
    • Enable ruff on benchmark and scripts (#40634)
    • Fix typos in tests and util (#40780)
    • Fix invalid PipelineParallel member (#40789)
    • Use functools.cached_property (#40607)
    • Remove use_ipex option from Trainer (#40784)
    • Fix typos in src (#40782)
    • Improve torch_dtype checks (#40808)
    • Use checkpoint in auto_class_docstring (#40844)
    • Clarify passing is_causal in sdpa_attention_paged_forward (#40838)
    • Use torch.expm1 and torch.log1p for better numerical results (#40860)
    • Remove dict branch of attention_mask in sdpa_attention_paged_forward (#40882)
    • remove dummy EncodingFast (#40864)
    • Don't list dropout in eager_paged_attention_forward (#40924)
    • Benchmarking V2: framework impl (#40486)
    • Change docker image to preview for the MI355 CI (#40693)
    • Redirect MI355 CI results to dummy dataset (#40862)
  • @voidism
    • fix MetaCLIP 2 wrong link & wrong model names in the docstrings (#40565)
  • @RyanMullins
    • add: embedding model (#40694)
    • add: differential privacy research model (#40851)
  • @LawJarp-A
    • Add EfficientLoFTRImageProcessorFast for GPU-accelerated image processing (#40215)
  • @bozheng-hit
    • Adding Support for Qwen3-Next (#40771)
    • Align torch implementation of Gated DeltaNet in Qwen3-Next with fla library. (#40807)
    • Fix the misalignment between the l2norm in GDN of Qwen3-Next and the implementation in the FLA library. (#40842)
  • @wangzhen0518
  • @HyunZ118
    • 🌐 [i18n-KO] Translated clipseg.md to Korean (#39903)
    • 🌐 [i18n-KO] Translated smolvlm.md to Korean (#40414)
    • 🌐 [i18n-KO] Translated imageprocessor.md to Korean (#39557)
  • @JJJYmmm
    • Adding Support for Qwen3-VL Series (#40795)
  • @SamuelBarryCS
    • Add Fast PromptDepthAnything Processor (#40602)
  • @2015aroras
