A new model is added to transformers: Kyutai-STT
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview
.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-Kyutai-STT-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main
branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0
.
Kyutai-STT
Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:
- kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy
Usage example
Kyutai-STT can be found on the Huggingface Hub.
Inference
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
# 2. load audio samples
ds = load_dataset(
"hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
inputs = processor(
ds[0]["audio"]["array"],
)
inputs.to(torch_device)
# 4. infer the model
output_tokens = model.generate(**inputs)
# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
Batched Inference
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
# 2. load audio samples
ds = load_dataset(
"hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)
# 4. infer the model
output_tokens = model.generate(**inputs)
# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
print(output)