huggingface/transformers v4.52.4-Kyutai-STT-preview on GitHub

A new model is added to transformers: Kyutai-STT
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.52.4-Kyutai-STT-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

Kyutai-STT

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

Usage example

Kyutai-STT can be found on the Huggingface Hub.

Inference

import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))

Batched Inference

import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)

huggingface/transformers v4.52.4-Kyutai-STT-preview Kyutai-STT (based on v4.52.4) on GitHub

Kyutai-STT

Usage example

Inference

Batched Inference

huggingface/transformers v4.52.4-Kyutai-STT-preview
Kyutai-STT (based on v4.52.4)

on GitHub