github pipecat-ai/pipecat v1.1.0

Added

  • Added MistralSTTService for real-time speech-to-text using Mistral's Voxtral Realtime API (voxtral-mini-transcribe-realtime-2602). Supports streaming transcription with interim results, automatic language detection, and VAD-driven utterance lifecycle.
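    A minimal construction sketch, assuming the service follows pipecat's usual STT pattern of an api_key keyword argument (the exact signature and import path are not shown in this entry):

```python
stt = MistralSTTService(
    api_key=os.getenv("MISTRAL_API_KEY"),
    # Defaults to the realtime Voxtral model named above.
    model="voxtral-mini-transcribe-realtime-2602",
)
# Place it between transport input and the LLM as with any pipecat STT service.
```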
    (PR #4253)

  • Added buttons field to OutputDTMFFrame and OutputDTMFUrgentFrame for sending multi-key DTMF sequences as a list[KeypadEntry]. Use OutputDTMFFrame.from_string("123#") (or the equivalent on OutputDTMFUrgentFrame) to build one from a dial string, and to_string() to convert back.
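    The dial-string round trip can be illustrated with a simplified stand-in (the real OutputDTMFFrame and KeypadEntry live in pipecat; the member names below are assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum


class KeypadEntry(str, Enum):
    # Hypothetical mirror of pipecat's KeypadEntry.
    ONE = "1"
    TWO = "2"
    THREE = "3"
    POUND = "#"


@dataclass
class OutputDTMFFrame:
    # Simplified stand-in showing only the new buttons field.
    buttons: list[KeypadEntry] = field(default_factory=list)

    @classmethod
    def from_string(cls, dial: str) -> "OutputDTMFFrame":
        return cls(buttons=[KeypadEntry(ch) for ch in dial])

    def to_string(self) -> str:
        return "".join(self.buttons)


frame = OutputDTMFFrame.from_string("123#")
print(frame.to_string())  # -> 123#
```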
    (PR #4313)

  • Added DailyTransport.send_dtmf() to expose the Daily call client's DTMF sending capability, enabling applications to send tones during a call (e.g. IVR navigation).
    (PR #4313)

  • Added DailyOutputDTMFFrame and DailyOutputDTMFUrgentFrame frames. In addition to the inherited buttons, they accept session_id, digit_duration_ms and method, which are forwarded to Daily's send_dtmf as sessionId, digitDurationMs and method.
    (PR #4313)

  • Added incremental pyright type checking. A pyrightconfig.json at the repo root uses typeCheckingMode: "basic" with an explicit include list of modules that pass cleanly (clocks, metrics, transcriptions, frames, observers, extensions, turns, pipeline, runner). Remaining modules will be added in subsequent PRs. CI enforces the checked set via uv run pyright in the format workflow.
    (PR #4324)

  • Added multilingual support to DeepgramFluxSTTService via a new language_hints: list[Language] setting. Works with Deepgram's new flux-general-multi model to bias transcription across English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. Omit the hints to use auto-detection, or pass a subset to bias toward expected languages. Hints can be updated mid-stream via STTUpdateSettingsFrame (sent as a Deepgram Configure control message, no reconnect) to support detect-then-lock flows.
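    A detect-then-lock sketch based on the names in this entry (the constructor kwargs and the STTUpdateSettingsFrame payload shape are assumptions):

```python
stt = DeepgramFluxSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    model="flux-general-multi",
    # Omit language_hints to start in auto-detect mode.
)

# Once the first turn reveals the speaker's language, lock it in mid-stream;
# this goes out as a Deepgram Configure control message, with no reconnect.
await task.queue_frame(
    STTUpdateSettingsFrame(settings={"language_hints": [Language.ES]})
)
```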
    (PR #4326)

  • Added fine-grained server-side VAD tuning options to SarvamSTTService.Settings for the saaras:v3 model, including speech thresholds, frame-count controls, pre-speech padding, interruption sensitivity, and initial-frame skipping.
    (PR #4334)

  • Added XAISTTService for real-time speech-to-text using xAI's voice STT WebSocket API (wss://api.x.ai/v1/stt). Streams raw audio (PCM, µ-law, or A-law) and emits interim and final transcription frames driven by the server's is_final / speech_final flags. Settings expose interim_results, endpointing, language, multichannel, channels, and diarize. Requires the xai optional extra (pip install "pipecat-ai[xai]").
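    A construction sketch using the settings named above (whether they are constructor kwargs or live on a nested settings object is an assumption; values are illustrative):

```python
stt = XAISTTService(
    api_key=os.getenv("XAI_API_KEY"),
    interim_results=True,   # emit interim transcription frames
    endpointing=True,       # let the server drive speech_final
    language="en",
    diarize=False,
)
```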
    (PR #4340)

  • Added XAITTSService for streaming text-to-speech using xAI's WebSocket TTS endpoint (wss://api.x.ai/v1/tts). Streams text.delta chunks up and base64 audio.delta chunks down on the same connection so audio begins flowing before the full utterance finishes synthesizing; complements the batch-HTTP XAIHttpTTSService. Defaults to raw PCM output so TTSAudioRawFrame needs no decoding. The xai optional extra now pulls in pipecat-ai[websockets-base].
    (PR #4341)

  • Added SonioxTTSService, a real-time WebSocket TTS service that streams text in and audio out over a persistent connection. Install with pip install "pipecat-ai[soniox]".
    (PR #4360)

  • Added support for Daily's built-in screenVideo destination in DailyTransport. When "screenVideo" is included in the video_out_destinations transport parameter, a dedicated screen video track is created at join time and frames with transport_destination="screenVideo" are routed to it.

    params = DailyParams(
        video_out_enabled=True,
        video_out_is_live=True,
        video_out_width=1280,
        video_out_height=720,
        video_out_destinations=["screenVideo"],
    )
    
    ...
    
    frame = OutputImageRawFrame(...)
    frame.transport_destination = "screenVideo"

    (PR #4370)

  • Added camera_out_send_settings to DailyParams. This dict is passed verbatim to the Daily client's camera publishing settings, allowing applications to fully control encoding, codec, bitrate, and framerate.

    params = DailyParams(
        camera_out_send_settings={
            "maxQuality": "high",
            "encodings": {
                "high": {"maxBitrate": 2_000_000, "maxFramerate": 30}
            },
        },
    )

    (PR #4370)

  • Added tool_resources to PipelineTask and FunctionCallParams. Pass an application-defined object (DB handles, clients, state, etc.) to PipelineTask(..., tool_resources=...) and access it from any tool handler via params.tool_resources. Passed by reference; the caller retains their handle and can read mutations after the task finishes. Resolves #4256.
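    A sketch of the flow described above (AppResources, my_db, and the tool body are illustrative, not part of pipecat):

```python
@dataclass
class AppResources:
    db: Any                                    # any application-owned handle
    cities: list[str] = field(default_factory=list)

resources = AppResources(db=my_db)
task = PipelineTask(pipeline, tool_resources=resources)

async def get_weather(params: FunctionCallParams):
    res = params.tool_resources                # same object, by reference
    res.cities.append(params.arguments["city"])
    ...

# After the task finishes, the caller still holds the handle and can read
# mutations made inside tool handlers:
print(resources.cities)
```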
    (PR #4371)

Changed

  • Updated NVIDIA STT services to align with Nemotron Speech defaults and
    configuration: api_key is now optional for local deployments, additional
    recognition settings are available (including alternatives, word offsets, and
    diarization), and streaming/segmented docs now reflect Nemotron Speech APIs.

    • NVIDIA streaming STT now sets TranscriptionFrame.finalized=True when the provider marks a result as final, and preserves language on both TranscriptionFrame and InterimTranscriptionFrame.
      (PR #4269)

  • Updated NvidiaLLMService to emit model reasoning as LLMThought*Frames (from both reasoning_content and <think>...</think> output), avoid mixing reasoning text into normal assistant content, and allow keyless local NIM endpoints while warning when the cloud endpoint is used without an API key.
    (PR #4270)

  • STT services now reconnect safely when settings change: reconnection is deferred until the current user turn ends (i.e., until UserStoppedSpeakingFrame is received) rather than interrupting an active speech session. Audio frames received while the reconnect is in progress are buffered and replayed once the new connection is ready. CartesiaSTTService and DeepgramSTTService both use this new behavior.
    (PR #4311)

  • Reduced debug log noise for LLM services. The system instruction is now logged once when composed (e.g. when turn completion is enabled) instead of on every LLM call. Per-call logs now show only the conversation messages, consistent across Google, Anthropic, AWS, and OpenAI services.
    (PR #4314)

  • LiveKitRunnerArguments.token is now a required str (previously str | None with a default of None). LiveKit requires a token to join a room, so the type now reflects reality. This only affects custom runners that construct LiveKitRunnerArguments directly; code consuming the argument from the standard runner is unaffected.
    (PR #4324)

  • TranscriptionFrame.language and InterimTranscriptionFrame.language emitted by DeepgramFluxSTTService now reflect the language Deepgram detected for each turn (read from the languages field on Flux's TurnInfo event). On flux-general-multi this gives per-turn accuracy for downstream consumers (e.g. TTS voice selection). flux-general-en continues to emit Language.EN.
    (PR #4326)

  • Added includes_inter_frame_spaces parameter to
    TTSService.add_word_timestamps and _add_word_timestamps (default None).
    When True, downstream consumers will not inject additional spaces between
    tokens; None leaves each frame's own default unchanged.

    • InworldTTSService now passes includes_inter_frame_spaces=True when reporting word timestamps, since Inworld tokens already include inter-word spacing.
      (PR #4330)

  • SarvamSTTService now uses saaras:v3 as its default model instead of saarika:v2.5. Applications that relied on the previous default should set settings=SarvamSTTService.Settings(model="saarika:v2.5") explicitly.
    (PR #4334)

  • SpeechTimeoutUserTurnStopStrategy now waits only user_speech_timeout when a transcript arrives without a VAD stop event, rather than max(ttfs_p99_latency, user_speech_timeout). If you had ttfs_p99_latency > user_speech_timeout, turn detection in that path is slightly faster than before.
    (PR #4337)

  • If you use an STT service that emits finalized transcripts (Speechmatics, Soniox, Deepgram Flux, AssemblyAI) with SpeechTimeoutUserTurnStopStrategy, user turns now end as soon as user_speech_timeout elapses after VAD stop. Previously the strategy also waited for the STT P99 latency (ttfs_p99_latency) even when the transcript was already marked final. user_speech_timeout is still honored as a floor — STT finalization never shortens it.
    (PR #4337)

  • ⚠️ PlivoFrameSerializer and TelnyxFrameSerializer now raise ValueError at construction when auto_hang_up=True (the default) but required credentials are missing, matching TwilioFrameSerializer. Previously they constructed successfully and the hangup failed silently at call-end, leaving phantom billable sessions on the provider. If you relied on the old silent behavior, pass auto_hang_up=False explicitly or provide the credentials. The specific fields checked are call_id/auth_id/auth_token for Plivo and call_control_id/api_key for Telnyx.
    (PR #4349)

  • ToolsSchema(standard_tools=...) now accepts any Sequence[FunctionSchema | DirectFunction] rather than requiring an exact list of the union. Callers can pass a narrower list[FunctionSchema] (or any other Sequence) without the type checker complaining about list invariance.
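    For example, a narrowly typed list now passes the type checker without a cast (FunctionSchema construction details elided):

```python
weather = FunctionSchema(name="get_weather", ...)

tools: list[FunctionSchema] = [weather]
# Previously rejected: list is invariant, and the parameter was typed as
# list[FunctionSchema | DirectFunction]. Sequence accepts the narrower list.
schema = ToolsSchema(standard_tools=tools)
```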
    (PR #4352)

  • Updated aic-sdk dependency to ~=2.2.0. The AIC_LICENSE_KEY environment variable replaces the previous AICOUSTICS_LICENSE_KEY.
    (PR #4362)

  • Loosened the protobuf dependency to >=5.29.6,<7, so projects pinned to protobuf 5.x can install pipecat-ai again. The previous >=6.31.1,<7 pin (introduced in 1.0.8 alongside the nvidia-riva-client 2.25.1 upgrade) silently blocked any environment whose dependency graph already constrained protobuf to the 5.x line. The bundled frames_pb2.py is now compiled with protoc 5.x so it imports cleanly on both 5.x and 6.x runtimes.

    Installing the nvidia extra still pulls protobuf 6.x: nvidia-riva-client 2.25.1 ships gencode that requires a 6.x runtime, so pipecat-ai[nvidia] now declares protobuf>=6.31.1,<7 explicitly to cover an upstream packaging gap (nvidia-riva/python-clients#172).
    (PR #4372)

  • Daily rooms created by the development runner (pipecat.runner.run) now expire after 4 hours with eject_at_room_exp=True, mirroring Pipecat Cloud's max session limit. Previously, runner-created rooms inherited a 2-hour expiration on the default code paths and had no expiration at all when callers posted partial dailyRoomProperties (e.g. {"start_video_off": true}) to /start, causing rooms to accumulate indefinitely. Explicit exp and eject_at_room_exp values in dailyRoomProperties are still respected.
    (PR #4374)

  • Updated daily-python dependency to ~=0.28.0.
    (PR #4379)

Deprecated

  • Deprecated TransportParams.video_out_bitrate for the Daily transport. Use DailyParams.camera_out_send_settings instead to configure camera publishing encodings (bitrate, framerate, codec, etc.).
    (PR #4370)

Fixed

  • Fixed missing tool handlers so unregistered tool calls fail with a normal final tool result instead of leaving tool-call state hanging.
    (PR #4301)

  • Fixed pipecat-ai[tavus] not installing the required daily-python dependency. Installing the tavus extra now correctly pulls in pipecat-ai[daily].
    (PR #4304)

  • Fixed audio loss and potential errors when STT settings were updated
    mid-speech. Previously, CartesiaSTTService and DeepgramSTTService would
    immediately disconnect and reconnect when settings changed, dropping any
    in-flight audio. Reconnection is now deferred until the user stops speaking,
    and audio arriving during the reconnect window is buffered and replayed.
    (PR #4311)

  • Fixed SmallestTTSService WebSocket endpoint URL to match the Smallest AI v4.0.0 API (now wss://waves-api.smallest.ai, previously wss://api.smallest.ai) and restored keepalive using a silent space message instead of the unsupported flush command.
    (PR #4320)

  • Fixed whitespace handling in TTS token streaming mode. Inter-token whitespace (e.g., spaces between words) is now preserved for correct prosody, while leading whitespace before the first non-whitespace token is still stripped to avoid issues with TTS models that are sensitive to leading spaces.
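    The intended behavior can be illustrated with a small stand-in aggregator (not pipecat's actual implementation):

```python
def stream_tokens(tokens: list[str]) -> str:
    """Strip whitespace only until the first non-whitespace token arrives,
    then preserve inter-token spacing for correct prosody."""
    out: list[str] = []
    seen_text = False
    for tok in tokens:
        if not seen_text:
            tok = tok.lstrip()
            if not tok:
                continue  # drop purely-whitespace tokens before any text
            seen_text = True
        out.append(tok)
    return "".join(out)


print(stream_tokens(["  ", " Hello", ",", " world", "!"]))  # -> Hello, world!
```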
    (PR #4323)

  • Fixed SentryMetrics silently dropping MetricsFrames from stop_ttfb_metrics and stop_processing_metrics. SentryMetrics called the base FrameProcessorMetrics implementation but discarded its return value, so FrameProcessor never pushed the MetricsFrame downstream. This prevented observers (e.g. UserBotLatencyObserver, MetricsLogObserver) from seeing TTFB and processing metrics for any service using metrics=SentryMetrics(). The metrics were still calculated and Sentry transactions still completed — only the downstream frame push was affected.
    (PR #4325)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService emitting word timestamps and TTSTextFrame content that matched the input text instead of the spoken audio when a pronunciation dictionary (pronunciation_dictionary_locators) or text normalization rewrote the input. Both services now consume ElevenLabs' normalized alignment, so downstream consumers (captions, transcripts, context aggregation) reflect what the listener actually hears.
    (PR #4344)

  • Fixed a crash in DeepgramSTTService when an STTUpdateSettingsFrame arrived before the WebSocket handshake completed (for example, when pushing an update upstream on StartFrame). The settings-triggered reconnect cancelled the in-flight connection task before its keepalive task was created, causing an UnboundLocalError: cannot access local variable 'keepalive_task' in the handler's finally block.
    (PR #4347)

  • Fixed direct-function registration crashing for functions without a docstring. DirectFunctionWrapper passed inspect.getdoc()'s result to docstring_parser.parse(), which raises when the docstring is None. Functions now register cleanly whether or not they have a docstring; an empty docstring produces empty description and parameter metadata as expected.
    (PR #4352)

  • Fixed AssemblyAISTTService, CartesiaSTTService, GradiumSTTService, and SonioxSTTService crashing the pipeline on transient WebSocket send failures. Each run_stt sent audio directly without catching errors, so a single network hiccup mid-stream raised an uncaught exception through process_frame. The guards now log a warning and let the connection-state check on the next call handle recovery, matching the pattern used by Deepgram, xAI, Azure, and other push-based STTs.
    (PR #4352)

  • Fixed Gemini Live losing conversation history in the (rare) case of a WebSocket reconnect before any session resumption handle is received. When the session reconnects (e.g. on system instruction change), conversation history is now re-seeded into the new session before it is marked ready for input.
    (PR #4355)

  • Fixed SmallWebRTC data channel silently stalling on networks with a 1280-byte MTU (IPv6, Tailscale overlays, many consumer VPNs). aiortc's default SCTP chunk size of 1200 bytes produces ~1305-byte UDP datagrams after headers, which the kernel rejects with EMSGSIZE; aiortc has no path-MTU discovery so it retransmits forever at the same oversized size. The chunk size is now clamped to 1100 bytes (~1205-byte datagrams, ~75 bytes of slack). Override with PIPECAT_SCTP_MAX_CHUNK_SIZE if your path MTU requires a different value.
    (PR #4358)
