A new local transcription option: Voxtral Mini 3B
Sokuji now offers Voxtral Mini 3B 2507 as a local ASR model — a smaller, faster sibling to the existing Voxtral Mini 4B Realtime that runs entirely on your GPU.
Why pick this one over 4B Realtime?
- More accurate when your source language is known. This is Voxtral 3B's headline feature: it accepts a language hint (e.g. "this is German") and uses it to lock onto the right language during transcription. The 4B Realtime model, by contrast, has to auto-detect the language for every sentence, and that detection can drift, especially between typologically close languages.
- Smaller and faster to load. Around 2.7 GB on GPUs with shader-f16 support, around 3.0 GB elsewhere. Lower VRAM use, quicker startup.
- 8 supported languages: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian. For other source languages, the existing 4B Realtime model stays the right pick — it covers a wider language set.
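The size difference above comes down to whether the GPU exposes half-precision shader support. A minimal sketch of how an app might probe for it via the standard WebGPU API and choose a weight variant accordingly; the function names and variant labels here are illustrative assumptions, not Sokuji's actual internals:

```typescript
// Illustrative only: decide between weight variants based on the WebGPU
// "shader-f16" adapter feature. Labels and names are made up for this sketch.
type WeightVariant = "fp16" | "fp32";

// Pure decision step: fp16 weights (~2.7 GB) when the GPU supports
// half-precision shaders, fp32 (~3.0 GB) otherwise.
function pickWeightVariant(hasShaderF16: boolean): WeightVariant {
  return hasShaderF16 ? "fp16" : "fp32";
}

// Browser-side detection via the standard WebGPU API; resolves to false
// when WebGPU is unavailable (e.g. outside Chrome/Edge).
async function detectShaderF16(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false;
  const adapter = await gpu.requestAdapter();
  return adapter?.features?.has("shader-f16") ?? false;
}
```

In use, something like `detectShaderF16().then(pickWeightVariant)` would select the download before fetching weights.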
The model appears in Local Inference → Model Management when your source language is one of the supported eight and you are in a WebGPU-capable browser (Chrome or Edge). Download it once, then pick it in the ASR selector alongside 4B Realtime.
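The gating described here boils down to two conditions. A small sketch of that predicate, assuming ISO-style language codes and a hypothetical function name (neither is taken from Sokuji's code):

```typescript
// Illustrative availability check: Voxtral Mini 3B shows up only when the
// source language is one of the eight supported ones AND WebGPU is usable.
// The code list and function name are assumptions for this sketch.
const VOXTRAL_3B_LANGS = new Set([
  "en", "es", "fr", "pt", "hi", "de", "nl", "it",
]);

function voxtral3bAvailable(sourceLang: string, webgpuSupported: boolean): boolean {
  return webgpuSupported && VOXTRAL_3B_LANGS.has(sourceLang);
}
```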
Finer VAD silence-duration control
The Min Silence Duration slider in Local Inference settings now allows finer adjustment: the minimum is 0.05 s with 0.05 s steps, down from 0.1 s previously. Useful if you want the model to commit to a transcription sooner (or wait a little longer) when there's a brief pause.
Full Changelog: v0.22.0...v0.23.0