github huggingface/transformers v5.2.0
v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer


New Model additions

VoxtralRealtime


VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.

The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
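The padding-cache idea can be illustrated in isolation. The sketch below is not the VoxtralRealtime implementation, just a toy demonstration of the general technique it describes: a causal 1-D convolution that carries the last `K - 1` input samples between chunks, so chunk-by-chunk inference reproduces offline inference exactly. All names and shapes are illustrative.

```python
import numpy as np

def causal_conv1d(x, w, cache=None):
    """Causal 1-D convolution over one chunk.

    x:     (T,) input chunk
    w:     (K,) kernel
    cache: (K-1,) trailing samples from the previous chunk (zeros at start)
    Returns (y, new_cache) where y has the same length as x.
    """
    k = len(w)
    if cache is None:
        cache = np.zeros(k - 1)
    padded = np.concatenate([cache, x])  # left-pad with cached history
    # y[t] = sum_i w[i] * x[t - i]  (causal: only past and current samples)
    y = np.array([padded[t : t + k] @ w[::-1] for t in range(len(x))])
    return y, padded[-(k - 1):]          # new cache = last K-1 input samples

rng = np.random.default_rng(0)
signal = rng.standard_normal(16)
kernel = rng.standard_normal(4)

# Offline: convolve the whole signal at once.
offline, _ = causal_conv1d(signal, kernel)

# Streaming: two chunks, carrying the padding cache across the boundary.
y1, cache = causal_conv1d(signal[:7], kernel)
y2, _ = causal_conv1d(signal[7:], kernel, cache=cache)
streaming = np.concatenate([y1, y2])

assert np.allclose(offline, streaming)  # chunked output matches offline
```

Because the cache makes each chunk see exactly the history a full-signal convolution would see, latency drops to one chunk without any change in the output.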

GLM-5 - GlmMoeDsa


The zAI team launches GLM-5, and introduces it as follows:

GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), substantially reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the inefficiency of RL training. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvements compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
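The gap between 744B total and 40B active parameters comes from mixture-of-experts routing: a router selects a few experts per token, so only their weights participate in each forward pass. The toy sketch below illustrates that mechanism in general; the expert count, dimensions, and top-k value are illustrative, not GLM-5's actual configuration.

```python
import numpy as np

def topk_moe(x, router_w, experts, k=2):
    """Route one token through the top-k of E experts.

    x:        (d,) token representation
    router_w: (E, d) router weights
    experts:  list of E (d, d) expert weight matrices
    """
    logits = router_w @ x
    top = np.argsort(logits)[-k:]                 # indices of top-k experts
    weights = np.exp(logits[top])
    gates = weights / weights.sum()               # renormalized softmax gates
    # Only the selected experts' parameters are touched for this token.
    out = sum(g * (experts[i] @ x) for g, i in zip(gates, top))
    return out, top

rng = np.random.default_rng(1)
d, E = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(E)]
router_w = rng.standard_normal((E, d))
x = rng.standard_normal(d)

out, used = topk_moe(x, router_w, experts, k=2)

total_params = E * d * d   # parameters stored in all experts
active_params = 2 * d * d  # parameters actually used for this token
```

Here only 2 of 16 experts run per token, so compute and activation memory scale with `active_params` while model capacity scales with `total_params`.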

Qwen3.5, Qwen3.5 Moe


The Qwen team launches Qwen 3.5, and introduces it as follows:

We are delighted to announce the official release of Qwen3.5, introducing the open weights of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
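The linear-attention side of that hybrid can be sketched with a toy gated delta rule, the recurrence behind Gated Delta Networks in general: instead of a KV cache that grows with sequence length, the layer keeps a fixed-size matrix state that is decayed by a forget gate and corrected toward each new value, giving constant memory per token. Dimensions, gate values, and function names below are illustrative, not Qwen3.5's actual implementation.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a toy gated delta rule.

    S:     (d_v, d_k) fixed-size state matrix
    q, k:  (d_k,) query / key (k assumed unit-norm)
    v:     (d_v,) value
    alpha: scalar forget gate in (0, 1]
    beta:  scalar write strength in (0, 1]
    Returns (new_state, output_for_this_token).
    """
    S = alpha * S                            # gated decay of old memory
    S = S + beta * np.outer(v - S @ k, k)    # delta rule: correct S toward v
    return S, S @ q                          # read out with the query

d_k, d_v, T = 4, 4, 10
rng = np.random.default_rng(2)
S = np.zeros((d_v, d_k))
outs = []
for _ in range(T):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)                   # unit-norm key
    q = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)
    outs.append(o)
```

With `alpha = beta = 1` and a unit-norm key, one step stores the association exactly (`S @ k == v`); the state never grows, which is the efficiency argument for mixing this with sparse MoE layers.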

VibeVoice Acoustic Tokenizer


VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.

One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.

Breaking changes

  • 🚨 [Attn] New attn mask interface everywhere (#42848)
  • 🚨 Modify ModernBERT's default attention implementation to stop using FA (#43764)

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ChiaraBoretti
    • SINQ quantization strategy integration (adapted for Transformers V5) (#43112)
  • @cyyever
    • Reduce CUDA sync (#44005)
    • Use torch.xlogy (#44006)
    • Improve use of torch.is_autocast_enabled (#43930)
    • Fix old tech stack in doc (#43902)
    • Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753)
    • Remove unnecessary code or checks for PT 2.4+ (#43787)
    • Fix old tech stack in doc (#43879)
    • Delete batch_split from EncoderDecoderCache (#43814)
    • Fix markdown documentation (#43076)
  • @eustlb
    • Add Voxtral Realtime (#43769)
    • [MistralCommonBackend] fix loading proc (#43887)
  • @ebezzam
    • Fix expected DAC outputs due to (old) change in CI settings. (#43896)
    • Add VibeVoice Acoustic Tokenizer (#43400)
  • @vasqu
    • [Jamba] Fallback to slow path and warn instead of error out (#43889)
    • 🚨 [Attn] New attn mask interface everywhere (#42848)
    • [Repo Consistency] Fix rms norm (#43803)
    • [Modular Dependencies] Fixup qwen rms norms (#43772)
  • @bozheng-hit
    • Adding Support for Qwen3.5 (#43830)
