github pipecat-ai/pipecat v1.2.0

Added

  • Added a session_id field to RunnerArguments so bots can log or trace a per-session identifier in local development the same way they can in Pipecat Cloud. The development runner now mints a UUID wherever it constructs runner arguments, and paths that already return a sessionId to the caller (Daily /start, the dial-in webhook) share that same UUID with the runner args rather than generating two separate IDs. The SmallWebRTC /api/offer endpoint also accepts an optional session_id query parameter so the /sessions/{session_id}/... proxy can thread it through.
    (PR #4385)

  • Added a max_buffer_delay_ms constructor argument to CartesiaTTSService for controlling Cartesia's server-side text buffering. When unset, Pipecat picks a sensible default based on text_aggregation_mode: 0 in SENTENCE mode (custom buffering — avoids stacking client-side aggregation on top of Cartesia's default 3000ms server buffer) and unset in TOKEN mode (Cartesia's managed buffering applies). Pass an explicit value (0–5000ms) to override.
    (PR #4390)

  • Added a mip_opt_out constructor argument to DeepgramTTSService and DeepgramHttpTTSService so callers can opt out of the Deepgram Model Improvement Program. When set, the value is forwarded to Deepgram as a query parameter on the speak request. Defaults to None, which preserves the existing behavior. See https://dpgr.am/deepgram-mip for pricing implications before enabling.
    (PR #4400)

  • Added an opt-in add_tool_change_messages flag to the LLM aggregators (set via LLMContextAggregatorPair(..., add_tool_change_messages=True)) that appends a developer-role message to the context whenever LLMSetToolsFrame changes the set of advertised standard tools. Helps the LLM stay coherent across mid-conversation tool changes, mitigating several flavors of tool-call-related hallucination: calling tools that have been removed, avoiding tools that have been re-added, and hallucinating output (made-up answers or tool-call-shaped non-tool-calls) when tools are unavailable.
    (PR #4404)

  • Added deferred(strategy) and DeferredUserTurnStopStrategy in pipecat.turns.user_stop. Wraps a stop strategy so it fires only the inference-triggered event and suppresses on_user_turn_stopped, leaving finalization to another strategy in the chain such as LLMTurnCompletionUserTurnStopStrategy.
    (PR #4405)

  • Added ExternalUserTurnCompletionStopStrategy in pipecat.turns.user_stop — a generic stop strategy that finalizes the user turn whenever a UserTurnInferenceCompletedFrame arrives, regardless of which component produced it. LLMTurnCompletionUserTurnStopStrategy now extends this base; future producers (Flux, custom end-of-turn classifiers, etc.) can use the base directly or subclass it to add producer-specific setup.
    (PR #4405)

  • Added on_user_turn_inference_triggered, a new event on the user turn controller, processor, aggregator and stop strategies that fires when a strategy has enough signal to start LLM inference. By default it fires together with on_user_turn_stopped; a gating strategy can fire only the inference-triggered event and defer finalization to a peer.
    (PR #4405)

  • Added FilterIncompleteUserTurnStrategies in pipecat.turns.user_turn_strategies — a UserTurnStrategies specialization that wraps the detector chain with deferred(...) and appends LLMTurnCompletionUserTurnStopStrategy as the finalizer. Common case: user_turn_strategies=FilterIncompleteUserTurnStrategies(). Pass config=UserTurnCompletionConfig(...) to customize timeouts and prompts.
    (PR #4405)
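
A toy sketch of the deferred(...) gating pattern in the entries above (the classes here are illustrative stand-ins, not the real pipecat.turns API): the wrapped strategy may flag that inference can start, while finalization of the user turn is left to a peer strategy.

```python
class AlwaysStop:
    """Stand-in for a detector that wants to stop the turn."""
    def should_stop(self) -> bool:
        return True

class DeferredStop:
    """Fires only the inference-triggered signal; never finalizes."""
    def __init__(self, inner):
        self.inner = inner
        self.inference_triggered = False

    def on_stop_signal(self) -> bool:
        if self.inner.should_stop():
            self.inference_triggered = True  # start LLM inference early
        return False  # leave on_user_turn_stopped to the finalizer strategy
```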

  • Added LLMTurnCompletionUserTurnStopStrategy in pipecat.turns.user_stop. When installed, the strategy gates on_user_turn_stopped on a UserTurnInferenceCompletedFrame (a new fieldless system frame emitted by any component that can judge turn completeness — e.g. the UserTurnCompletionLLMServiceMixin). A finalization_timeout provides a safety net if no completion frame ever arrives.
    (PR #4405)

  • Added first-class RTVI support for the UI Agent Protocol:

    • Adds ui-event, ui-snapshot, and ui-cancel-task client-to-server messages, plus ui-command and ui-task server-to-client messages, with paired *Data / *Message pydantic models.
    • Adds built-in command payload models for Toast, Navigate, ScrollTo, Highlight, Focus, Click, SetInputValue, and SelectText; matching default handlers live in @pipecat-ai/client-react.
    • Adds RTVIProcessor.on_ui_message for inbound ui-event, ui-snapshot, and ui-cancel-task messages.
    • Adds five UI pipeline frames, mirroring the client-message frame-and-event pattern: downstream code pushes RTVIUICommandFrame / RTVIUITaskFrame for the observer to wrap into outbound UICommandMessage / UITaskMessage envelopes, while the processor pushes inbound RTVIUIEventFrame, RTVIUISnapshotFrame, and RTVIUICancelTaskFrame alongside on_ui_message.
    • Bumps the RTVI PROTOCOL_VERSION from 1.2.0 to 1.3.0.
      (PR #4407)

  • AWS Transcribe STT, Polly TTS, Bedrock LLM, and the Bedrock AgentCore processor now resolve credentials via the standard boto3 provider chain (EC2 instance profiles, EKS pod roles / IRSA, ECS task roles, SSO, ~/.aws/credentials) when explicit credentials and AWS_* environment variables are absent. Services running with IAM roles no longer need to export static credentials.
    (PR #4416)

  • Added keyterms support to ElevenLabs STT services so Scribe V2 callers can bias transcription in both file-based and realtime modes.
    (PR #4426)

  • Added watchdog_min_timeout parameter to DeepgramFluxSTT and DeepgramFluxSageMakerSTT (default 0.5 seconds) to control the minimum silence duration before the watchdog sends a silence packet to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout), so it also adapts automatically to the audio chunk size in use.
    (PR #4430)
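
The resulting threshold is simply (in seconds):

```python
def watchdog_silence_threshold(chunk_duration: float,
                               watchdog_min_timeout: float = 0.5) -> float:
    """Adaptive silence threshold from the entry above: at least two
    audio-chunk durations, but never below the configured minimum."""
    return max(chunk_duration * 2, watchdog_min_timeout)
```

For example, 100 ms chunks keep the 0.5 s default, while 400 ms chunks raise the threshold to 0.8 s.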

  • Added cancel_on_interruption=False support for GeminiLiveLLMService on models that support Gemini's NON_BLOCKING tool mechanism (currently Gemini 2.x); the conversation now continues while the tool runs. On models that don't yet support NON_BLOCKING (Gemini 3.x), the service surfaces a one-time warning explaining the limitation. (Note: an intermittent 1008 error can occasionally fire on Gemini 2.5 during long-running tool calls; we auto-reconnect.)
    (PR #4448)

  • Added NvidiaSageMakerWebsocketSTTService for streaming speech recognition using NVIDIA Nemotron ASR via an AWS SageMaker bidirectional-stream endpoint. Produces InterimTranscriptionFrame and TranscriptionFrame frames, is VAD-aware, and automatically reconnects on error.
    (PR #4464)

  • Added NVIDIA Magpie TTS services via AWS SageMaker: NvidiaSageMakerHTTPTTSService (single HTTP invocation, streams raw PCM back) and NvidiaSageMakerWebsocketTTSService (persistent HTTP/2 bidi-stream with full interruption support via InterruptibleTTSService).
    (PR #4464)

  • Added support for reasoning configuration on OpenAIRealtimeLLMService, for use with reasoning-capable Realtime models such as gpt-realtime-2.
    (PR #4470)

  • Inworld TTS updates:

    • Added delivery_mode setting (STABLE/BALANCED/CREATIVE) to InworldTTSService and InworldHttpTTSService, enabling the stability-vs-creativity tradeoff in inworld-tts-2.
    • Added language support to InworldTTSService and InworldHttpTTSService. The language setting is now forwarded to the API, and a new language_to_inworld_language() helper normalizes Pipecat Language enums to Inworld's BCP-47 locale tags.
      (PR #4473)

Changed

  • Updated the default SonioxTTSService model from tts-rt-v1-preview to the generally available tts-rt-v1.
    (PR #4386)

  • Default cartesia_version for CartesiaTTSService bumped from 2025-04-16 to 2026-03-01, matching CartesiaHttpTTSService and unlocking the use_normalized_timestamps and max_buffer_delay_ms fields.
    (PR #4390)

  • ⚠️ CartesiaTTSService now sends use_normalized_timestamps: true instead of the deprecated use_original_timestamps field. Word timestamps now reflect what was actually spoken (post text-normalization and pronunciation-dictionary substitution), matching the convention Pipecat uses for ElevenLabs. This is a behavior change for sonic-3 users, who were previously receiving timestamps tied to the input transcript.
    (PR #4390)

  • Broadened tool_resources to app_resources for easy access not just in tool handlers but in other places like custom FrameProcessors. Three changes: a rename (tool_resources → app_resources), a new app_resources property on PipelineTask, and a new pipeline_task property on FrameProcessor. Tool handlers now read params.app_resources; custom processors read self.pipeline_task.app_resources. The previous tool_resources aliases (on PipelineTask, FunctionCallParams, and FrameProcessorSetup) keep working but are deprecated as of 1.2.0 and emit DeprecationWarnings.
    (PR #4395)

  • Lowered the per-message log in SmallWebRTCInputTransport._handle_app_message from debug to trace. App messages can be high-frequency and were noisy at debug level; set the loguru level to TRACE to see them again.
    (PR #4397)

  • Changed the default model for GrokRealtimeLLMService to grok-voice-think-fast-1.0, xAI's recommended Voice Agent model. The previous default of grok-voice-fast-1.0 has been deprecated by xAI and is being removed.
    (PR #4401)

  • Changed the default Inworld TTS model from inworld-tts-1.5-max to inworld-tts-2 (Realtime TTS-2) across InworldHttpTTSService, InworldTTSService, and the InworldRealtimeLLMService cascade. Existing users can pin the prior model explicitly via the model/tts_model argument; both inworld-tts-1.5-max and inworld-tts-1.5-mini remain valid model IDs.
    (PR #4422)

  • Changed the default model for GrokLLMService from grok-3 to grok-4.20-non-reasoning. xAI is retiring grok-3 on May 15, 2026.
    (PR #4429)

  • DeepgramFluxSTT watchdog silence threshold is now dynamic: max(chunk_duration * 2, watchdog_min_timeout) instead of a fixed 500 ms. This prevents false silence injections when large audio chunks are sent at lower frequency.
    (PR #4430)

  • ElevenLabsTTSService now sends close_context to the server as soon as the turn is complete (on on_turn_context_completed) rather than waiting until all audio has finished playing back. The isFinal message from ElevenLabs is now used to signal TTSStoppedFrame and clean up the audio context, improving turn transition timing.
    (PR #4433)

  • Updated InworldHttpTTSService and InworldTTSService to use PCM audio encoding by default, which returns audio bytes without headers.
    (PR #4446)

  • Moved create_task, cancel_task, the task_manager property, and setup(task_manager) up from FrameProcessor to BaseObject. Custom BaseObject subclasses (turn strategies, controllers, etc.) now inherit these methods directly instead of reimplementing the task manager wiring. Owners propagate the task manager to their child BaseObjects via await child.setup(task_manager).
    (PR #4449)

  • Changed the default OpenAI Realtime input audio transcription model from gpt-4o-transcribe to gpt-realtime-whisper for both OpenAIRealtimeSTTService and OpenAIRealtimeLLMService. The new model does not accept the prompt parameter; if a prompt is supplied alongside gpt-realtime-whisper, it is dropped automatically and a warning is logged. To keep using prompt hints, explicitly pin model="gpt-4o-transcribe" (or "gpt-4o-mini-transcribe").
    (PR #4450)

  • Updated the default model for CartesiaTTSService and CartesiaHttpTTSService from sonic-3 to sonic-3.5.
    (PR #4462)

  • Changed the default model for OpenAIRealtimeLLMService from gpt-realtime-1.5 to gpt-realtime-2.
    (PR #4472)

Deprecated

  • Deprecated LLMUserAggregatorParams.filter_incomplete_user_turns. Use user_turn_strategies=FilterIncompleteUserTurnStrategies() (or add LLMTurnCompletionUserTurnStopStrategy to a custom user_turn_strategies.stop) instead. Setting the legacy flag still works for one release: the aggregator emits a DeprecationWarning and rewires the strategies as if you had passed FilterIncompleteUserTurnStrategies directly.
    (PR #4405)

  • Deprecated ResampyResampler in favor of SOXRAudioResampler (or the create_file_resampler() / create_stream_resampler() factories). Instantiating ResampyResampler now emits a DeprecationWarning. The class will be removed in Pipecat 2.0 along with the default resampy and numba dependencies.
    (PR #4428)

Fixed

  • Fixed CartesiaTTSService surfacing flush_done messages from Cartesia as ErrorFrames. The latest API emits a flush_done per transcript when server-side buffering is disabled; Pipecat now consumes them silently since each turn already has its own context_id.
    (PR #4390)

  • Fixed Cartesia tag helpers (SPELL, EMOTION_TAG, PAUSE_TAG, VOLUME_TAG, SPEED_TAG) raising TypeError when called on an instance (e.g. tts.SPELL("hi")). They're now @staticmethod and callable from both the class and an instance.
    (PR #4390)

  • Fixed CartesiaHttpTTSService pushing two ErrorFrames on a non-200 response — one with the API's error text and a second, less informative "Unknown error" frame from the outer exception handler. It now pushes a single frame that includes the HTTP status code and returns cleanly.
    (PR #4390)

  • Fixed an issue where LocalSmartTurnAnalyzerV3 was imported unconditionally for user turn stop strategies. It is now only imported when default_user_turn_stop_strategies() is called. This improves startup time and removes the transformers "PyTorch/TensorFlow/Flax not found" warning when the default stop strategies are not used.
    (PR #4393)

  • Fixed GrokRealtimeLLMService ignoring the configured model. The model was stored in Settings but never sent to xAI, so every session silently fell back to xAI's server-side default. The model is now passed via the ?model= query parameter on the WebSocket URL as xAI's Voice Agent API requires.
    (PR #4401)

  • Fixed on_user_turn_stopped firing prematurely when filter_incomplete_user_turns was enabled. The event now fires only after the LLM confirms the user turn is complete; previously the smart-turn detector's tentative stop was bubbling up before the LLM had a chance to veto it, causing observers, transcript appenders and UI indicators to receive an early — and sometimes duplicated — signal.
    (PR #4405)

  • Fixed TTSSpeakFrame(append_to_context=True) greetings sometimes splitting across two assistant messages in the LLM context and not surfacing in on_assistant_turn_stopped. The LLMAssistantPushAggregationFrame emitted at the end of a TTS context now carries a PTS just past the last word so it can't overtake clock-queued TTSTextFrames in the transport's output, and LLMAssistantAggregator now triggers on_assistant_turn_started/on_assistant_turn_stopped when it receives the frame outside an LLM response cycle (restoring v0.0.104 behavior for greeting transcripts).
    (PR #4414)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService producing merged words (e.g. bookLook) when using Flash models. Flash often splits sentences mid-stream into alignment chunks that begin with a real inter-word space, but the previous fix unconditionally stripped that space from every chunk. Leading spaces are now stripped only on the first alignment chunk of an utterance, so subsequent chunks correctly flush partial words across boundaries.
    (PR #4415)

  • Fixed AWS Polly TTS, Bedrock LLM, and the Bedrock AgentCore processor erroring out when only one of AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY was set in the environment. The half-populated kwargs are no longer forwarded to aioboto3; partial env-var configurations now fall through to the boto3 credential chain like fully-unset configurations do.
    (PR #4416)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService writing romanized/normalized text to the LLM context. With non-Latin input (e.g., Chinese), the assistant transcript was getting populated with pinyin (Ni Hao ! instead of 你好!), which then degraded subsequent LLM turns. The services now consume alignment by default and only switch to normalizedAlignment / normalized_alignment when pronunciation_dictionary_locators is configured (where alignment has overlapping restarts that produce duplicated/garbled words, per #4316). Both fields are read with preferred-with-fallback semantics since each is nullable per the API schema.
    (PR #4424)

  • Fixed a deadlock in TTSService that could permanently stall pipeline processing when all three conditions occurred together: pause_frame_processing=True, an interruption arrived before any TTS audio was played, and an UninterruptibleFrame (e.g. TTSUpdateSettingsFrame, FunctionCallResultFrame) was in the processing queue at that moment. The process task would block on __process_event.wait() indefinitely because BotStoppedSpeakingFrame never arrives (no audio was played) and the interruption handler did not resume processing. Affects services using pause_frame_processing=True such as ElevenLabs, Rime, AsyncAI, Gradium, and ResembleAI.
    (PR #4431)

  • Fixed interruptions being delayed when a slow non-uninterruptible frame was processing and an uninterruptible frame was waiting in the queue. The bot would stall until the slow frame finished instead of cancelling it immediately on interruption.
    (PR #4434)

  • Fixed TTSService dropping uninterruptible frames (e.g. FunctionCallResultFrame) from its internal serialization queue when an interruption occurs. Previously, the queue was recreated on every interruption, silently discarding any queued frames. The queue is now reset instead of recreated, preserving uninterruptible frames so they are always delivered downstream.
    (PR #4435)
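
The preserving reset can be pictured with a list-based sketch (the Frame class here is a stand-in, not Pipecat's frame types):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Stand-in for a queued pipeline frame."""
    name: str
    uninterruptible: bool = False

def reset_queue(frames: list[Frame]) -> list[Frame]:
    """Hypothetical sketch: on an interruption, keep frames marked
    uninterruptible instead of recreating the queue and losing them."""
    return [f for f in frames if f.uninterruptible]
```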

  • Fixed a race condition in the Daily transport that caused AttributeError: 'NoneType' object has no attribute 'send_app_message' when tearing down a pipeline. Both DailyInputTransport and DailyOutputTransport share the same DailyTransportClient and both call cleanup(), which was releasing the underlying CallClient on the first call — leaving the second caller with a None client.
    (PR #4440)

  • Restored cancel_on_interruption=False support for AWSNovaSonicLLMService and OpenAIRealtimeLLMService. These services previously honored the flag by simply not cancelling in-flight function calls on interruption; the introduction of the new async-tool mechanism (which threads started/intermediate/final messages through the LLM context) broke that path because the realtime services didn't know how to interpret those messages. Note that new-style streamed intermediate results (FunctionCallResultProperties(is_final=False)) are not supported on these realtime services. Similar fixes for other impacted realtime services are forthcoming.
    (PR #4441)

  • Fixed two misspelled Gemini TTS voice names in GeminiTTSService.AVAILABLE_VOICES.
    (PR #4443)

  • Extended the cancel_on_interruption=False regression fix to GrokRealtimeLLMService, AzureRealtimeLLMService, and UltravoxRealtimeLLMService. Grok and Azure use the same approach as in #4441 (each service detects async-tool messages in the LLM context and routes the final result to its formal tool-result channel; Azure inherits transitively from OpenAIRealtimeLLMService). Ultravox needed a different approach because its API freezes the conversation between client_tool_invocation and the matching client_tool_result — for async-registered functions it now ships a placeholder client_tool_result immediately when the function is invoked (to unfreeze the conversation), then injects the real result as user-side text once the tool finishes. Streamed intermediate results (FunctionCallResultProperties(is_final=False)) are still not supported on any of these realtime services. GeminiLiveLLMService and InworldRealtimeLLMService are excluded for now: Gemini Live's async-tool path needs deeper investigation, and Inworld tool calling needs to be sorted out first.
    (PR #4447)

  • Fixed OpenAIRealtimeLLMService handling of multi-output-item responses (observed with gpt-realtime-2). A single response can now contain more than one audio item, and the first item's audio.done may arrive after the second item's deltas have started. Deltas still arrive strictly in playback order, so we continue to forward them as received (matching OpenAI's reference implementation). The fix removes spurious warnings, ensures truncation always targets the latest audio item, and emits a single bracketing TTSStartedFrame/TTSStoppedFrame pair per assistant turn (the Stopped is now pushed on response.done).
    (PR #4465)

  • Fixed missing output attribute on LLM OpenTelemetry spans when the LLM call is interrupted mid-stream.
    (PR #4467)

  • Fixed incorrect metrics.ttfb on STT OpenTelemetry spans, and parented them to the current turn span.
    (PR #4467)

  • Fixed incorrect metrics.ttfb on TTS OpenTelemetry spans for streaming services.
    (PR #4467)

  • Extended the cancel_on_interruption=False regression fix to InworldRealtimeLLMService. Uses the same approach as in #4441 (the service detects async-tool messages in the LLM context and routes the final result to its formal tool-result channel). Note: as of this writing, Inworld Realtime doesn't appear to handle the resulting delayed tool result reliably — the routing is best-effort and the service surfaces a one-time warning when async-tool messages are seen. Streamed intermediate results (FunctionCallResultProperties(is_final=False)) are still not supported on this realtime service. (Inworld was excluded from #4447 pending resolution of an unrelated tool-calling issue, which turned out to be an account-level matter.)
    (PR #4474)

  • Fixed Cartesia TTS Korean word timestamps to use normal spacing rules, preserving word boundaries and per-word timestamp alignment during downstream aggregation.
    (PR #4475)

  • Fixed Cartesia TTS Chinese and Japanese timestamp grouping to preserve provider text spacing, avoiding artificial spaces when timestamp groups are reassembled downstream.
    (PR #4475)

  • Fixed SonioxSTTService final transcription frames missing detected language metadata when Soniox returns token-level language annotations.
    (PR #4482)

  • Fixed Soniox final transcription language detection to use the most common recognized token language, avoiding mislabeling an utterance when the last token is tagged with a different language.
    (PR #4495)

  • Fixed dropped audio in streaming TTS services whose wire protocol doesn't echo context_id back on incoming audio (Sarvam, Smallest, Soniox, Inworld, and others). Previously, audio that arrived between contexts or at the very start of a turn was tagged with context_id=None and silently dropped with an "unable to append audio to context: no context ID provided" debug log. TTSService.get_active_audio_context_id() now falls back to the synthesis-side _turn_context_id when the playback cursor isn't set yet.
    (PR #4497)

Security

  • Fixed a path traversal issue in the development runner's /files/{filename:path} download endpoint. Previously, when the runner was started with --folder, a request like /files/..%2F..%2Fetc%2Fpasswd could escape the configured folder because %2F-encoded separators bypassed Starlette's path normalization. The endpoint now resolves the joined path and rejects any filename that escapes the allowed base with a 403, and also returns 404 (instead of an implicit null 200) when --folder is unset.
    (PR #4417)
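
The containment check amounts to something like this (a sketch assuming Python 3.9+ pathlib; the real endpoint returns HTTP 403 rather than raising):

```python
from pathlib import Path

def resolve_download_path(base_folder: str, filename: str) -> Path:
    """Resolve the joined path and reject anything that escapes the
    allowed base, even when '/' separators arrived %2F-encoded and so
    survived the framework's URL path normalization."""
    base = Path(base_folder).resolve()
    target = (base / filename).resolve()
    if not target.is_relative_to(base):
        raise PermissionError(f"{filename!r} escapes {base}")
    return target
```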
