kizuna-ai-lab/sokuji v0.24.0 on GitHub

Push-to-Translate: speak the target language directly, hold to translate

A new Push-to-Translate Speech Mode for bilingual users. It inverts the idea of Push-to-Talk:

Idle (key not held): your raw mic audio flows directly to Sokuji's virtual microphone — speak the target language naturally and your voice goes straight into the meeting, no translation, no AI involvement.
Hold Space: your mic is routed to the AI provider instead. Sokuji listens.
Release Space: the translation plays back through the virtual microphone.
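
The three states above amount to a small routing switch. The following is a minimal sketch of that idea, not Sokuji's actual implementation: `createPushToTranslateRouter`, the `AudioSinks` interface, and all of its callbacks are hypothetical names introduced here for illustration, and the real app wires key events and audio differently.

```typescript
// Hypothetical sketch of Push-to-Translate routing. None of these names
// come from the Sokuji codebase; they only illustrate the behavior above.
type Route = "virtual-mic" | "ai-provider";

interface AudioSinks {
  toVirtualMic: (chunk: Float32Array) => void; // what the meeting hears
  toProvider: (chunk: Float32Array) => void;   // audio sent for translation
  commitTurn: () => void;                      // assumed "end of utterance" signal
}

function createPushToTranslateRouter(sinks: AudioSinks) {
  let held = false; // true while the key (Space in the release notes) is down

  return {
    route(): Route {
      return held ? "ai-provider" : "virtual-mic";
    },
    keyDown(): void {
      held = true;
    },
    keyUp(): void {
      if (!held) return; // ignore a stray key-up
      held = false;
      sinks.commitTurn(); // ask the provider to translate what was captured
    },
    // Raw mic audio: passthrough when idle, provider while the key is held.
    onMicChunk(chunk: Float32Array): void {
      if (held) sinks.toProvider(chunk);
      else sinks.toVirtualMic(chunk);
    },
    // Translated audio from the provider plays back through the virtual mic.
    onProviderAudio(chunk: Float32Array): void {
      sinks.toVirtualMic(chunk);
    },
  };
}

// Usage: record where each chunk ends up.
const demoLog: string[] = [];
const demoRouter = createPushToTranslateRouter({
  toVirtualMic: () => demoLog.push("virtual-mic"),
  toProvider: () => demoLog.push("ai-provider"),
  commitTurn: () => demoLog.push("commit"),
});
demoRouter.onMicChunk(new Float32Array(128)); // idle: goes straight to the meeting
demoRouter.keyDown();
demoRouter.onMicChunk(new Float32Array(128)); // held: goes to the provider
demoRouter.keyUp();                           // release: provider is asked to translate
console.log(demoLog.join(" -> "));
```

The only difference from Push-to-Talk in this sketch is the idle branch: Push-to-Talk would drop idle audio, whereas here it passes through untouched.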

The use case: you're a bilingual user in a meeting, and you can speak the other side's language well enough most of the time. You no longer need to switch input devices in your meeting app to "use Sokuji only when needed" — you stay on Sokuji's virtual mic the whole call, and only press a key when you hit a phrase you can't translate yourself.

How to enable: Settings → your provider's Speech Mode section → Push-to-Translate.

Available on: OpenAI, OpenAI-compatible, Kizuna AI, Gemini, Volcengine AST 2.0, Local Inference. Requires WebSocket transport (not WebRTC). Not available on PalabraAI / Volcengine ST (those providers don't have manual hold-to-talk to build on).
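
As a rough illustration of the availability rule above, the check could be written as a small predicate. This is a sketch under assumptions: the provider identifiers, the `PttTransport` type, and `supportsPushToTranslate` are hypothetical and not taken from Sokuji's source; only the provider list and the WebRTC restriction come from these notes.

```typescript
// Hypothetical identifiers for the providers named in the release notes.
type ProviderId =
  | "openai"
  | "openai-compatible"
  | "kizuna-ai"
  | "gemini"
  | "volcengine-ast2"
  | "local-inference"
  | "palabra-ai"
  | "volcengine-st";

type PttTransport = "websocket" | "webrtc";

// Providers listed as supporting Push-to-Translate in v0.24.0.
const PUSH_TO_TRANSLATE_PROVIDERS: ReadonlySet<ProviderId> = new Set<ProviderId>([
  "openai",
  "openai-compatible",
  "kizuna-ai",
  "gemini",
  "volcengine-ast2",
  "local-inference",
]);

// Transport is optional here on the assumption that not every provider
// exposes a transport choice; WebRTC is the only value that disqualifies.
function supportsPushToTranslate(provider: ProviderId, transport?: PttTransport): boolean {
  return PUSH_TO_TRANSLATE_PROVIDERS.has(provider) && transport !== "webrtc";
}

console.log(supportsPushToTranslate("openai", "websocket"));
console.log(supportsPushToTranslate("openai", "webrtc"));
console.log(supportsPushToTranslate("palabra-ai"));
```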

Section rename: Turn Detection → Speech Mode

The provider settings section previously called "Turn Detection" / "Voice Activity Detection" is now uniformly called Speech Mode across all four providers, with a single consistent set of options:

Auto / Normal / Semantic — VAD modes (provider auto-detects when you finish speaking)
Push-to-Talk — hold a key to send audio
Push-to-Translate — the new mode above

OpenAI's third option, previously labeled "Disabled" (which made sense under "Turn Detection" but read confusingly under "Speech Mode"), now correctly displays as Push-to-Talk — the same label the other providers use for the equivalent behavior. Your saved settings carry over unchanged.
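
One way to picture why saved settings are unaffected: only the display label changes, while the stored value stays the same. The sketch below assumes hypothetical stored values (`"Normal"`, `"Semantic"`, `"Disabled"`) and a hypothetical `speechModeLabel` helper; the actual keys and values in Sokuji's settings may differ.

```typescript
// Hypothetical stored values for OpenAI's setting; the real ones may differ.
type StoredOpenAIMode = "Normal" | "Semantic" | "Disabled";

// Only the label shown in the UI changes; the stored value is left as-is.
const OPENAI_SPEECH_MODE_LABELS: Record<StoredOpenAIMode, string> = {
  Normal: "Normal",
  Semantic: "Semantic",
  Disabled: "Push-to-Talk", // previously displayed as "Disabled"
};

function speechModeLabel(stored: StoredOpenAIMode): string {
  return OPENAI_SPEECH_MODE_LABELS[stored];
}

console.log(speechModeLabel("Disabled"));
```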

Translations

Both the new strings and the updated tooltips are translated into all 30 supported languages (best-effort AI translation; native-speaker review PRs are welcome).

Other

Tooltip text for the Speech Mode section now uses one line per option for better readability.
Section-level tooltips on Volcengine AST 2.0 and Local Inference describe the new Push-to-Translate behavior alongside existing modes.