github pipecat-ai/pipecat v1.3.0

16 hours ago
  • Pipecat pipelines are multi-agent compatible by default. The new multi-agent framework (pipecat.workers) turns every PipelineWorker (previously PipelineTask) into a peer on a shared bus that passes typed messages, dispatches @job work, and coordinates with siblings, while existing single-pipeline code keeps running untouched. examples/multi-worker/ ships ready-to-run patterns: LLM handoff, parallel debate, sidecar code assistants and hardware controllers, distributed deployments over Redis or PGMQ, point-to-point WebSocket proxies, and UI workers driving a web client over RTVI.
    (PR #4493)

  • Added UIWorker (pipecat.workers.ui): an LLM worker that observes and drives a client web UI over the RTVI UI channel — for voice agents that act on what the user is looking at. It reads the page's accessibility snapshots, routes client UI events to @ui_event handlers, drives the page with UI commands (scroll_to, highlight, select_text, click, set_input_value), and answers screen-grounded questions. PipelineWorker connects it to the client automatically when RTVI is enabled — no extra wiring.

    • A voice agent delegates a turn via the built-in respond job; the worker returns an answer for the voice LLM to speak, or speaks it verbatim through the agent's TTS with respond_to_job(answer, tts_speak=True).
    • ReplyToolMixin provides a ready-made reply tool (a spoken answer plus the standard UI actions).
    • ui_job_group(...) fans work out to peer workers, surfaced to the client as cancellable progress cards.
    • UI_STATE_PROMPT_GUIDE is drop-in system-prompt text that teaches the LLM the <ui_state> wire format.
      (PR #4540)
  • Added VonageVideoConnectorTransport, a new transport integration for real-time Vonage WebRTC sessions using the Vonage Video Connector library.
    (PR #4052)

  • Added InceptionLLMService for Inception's Mercury 2 diffusion reasoning model, with support for reasoning_effort and realtime settings.
    (PR #4423)

  • Added plain WebSocket transport support to the development runner. Bots can now accept connections from non-telephony WebSocket clients (e.g., browser apps using protobuf framing) via the /ws-client endpoint alongside other transports.
    (PR #4442)

  • Added GET /status endpoint to the development runner that reports which transports the running instance accepts (all by default, or the single transport passed via -t).
    (PR #4442)

  • Added support for the Rime coda TTS model to RimeTTSService and RimeHttpTTSService. The temperature, top_p, and repetition_penalty settings are not used by coda. Also added a timeScaleFactor setting (for the arcana and coda models) to both services — values above 1.0 slow down audio playback; values below 1.0 speed it up.
    (PR #4511)

  • Added max_endpoint_delay_ms to SonioxSTTService.Settings, controlling the maximum delay (500-3000 ms) before endpoint detection finalizes a turn.
    (PR #4521)

  • Added LLMService.append_system_instruction(...): append durable text to a service's system instruction so it's included on every inference and survives context resets.
    (PR #4540)

  • Added CartesiaTurnsSTTService for streaming speech-to-text against the Cartesia Streaming ASR v2 (Ink-2) turn-based WebSocket endpoint (/stt/turns/websocket). The server drives turn boundaries via turn.start / turn.update / turn.end messages, which the service translates into UserStartedSpeakingFrame, finalized TranscriptionFrame, and UserStoppedSpeakingFrame. Eager end-of-turn predictions and turn resumes (turn.eager_end and turn.resume) are surfaced via the on_turn_eager_end and on_turn_resume event handlers.
    (PR #4552)

  • Added the STTService.supports_ttfs property, which subclasses can override to return False when TTFS doesn't apply to their architecture (e.g. turn-based STTs where the server defines turn boundaries). When False, STTMetadataFrame is broadcast with ttfs_p99_latency=0.0 and the "ttfs_p99_latency not set" warning is suppressed.
    (PR #4585)

Changed

  • ⚠️ The development runner now supports all transports (WebRTC, Daily, telephony, plain WebSocket) simultaneously from a single server. The /start endpoint accepts a "transport" field to select the transport per-request; omitting -t at startup enables all transports instead of defaulting to WebRTC. The Daily browser-redirect route moved from GET / to GET /daily.
    (PR #4442)

  • Changed the default model for RimeTTSService and RimeHttpTTSService from arcana to coda. Code that relied on the implicit default should set model="arcana" explicitly to preserve previous behavior.
    (PR #4511)

  • OpenRouter LLM service now defaults to openai/gpt-4.1.
    (PR #4513)

  • OpenRouter LLM requests now convert developer messages to user messages by default for broader model compatibility. Override this by subclassing OpenRouterLLMService or setting llm.supports_developer_role = True for models that support the developer role.
    (PR #4513)

  • SonioxSTTService now applies settings updates (e.g. via STTUpdateSettingsFrame) using a graceful reconnect instead of a hard disconnect/reconnect, preserving the service's reconnect retry behavior.
    (PR #4521)

  • Updated the default p99 TTFS latency values for Smallest AI, Mistral, and XAI STT so turn stop timing uses measured values instead of the conservative fallback.
    (PR #4522)

  • Updated the development runner startup banner to show the prebuilt client URL once and list enabled or disabled transports with install hints.
    (PR #4524)

  • Services and transports with missing optional dependencies now raise ImportError instead of a bare Exception when their module is imported without the required extra installed. The original ModuleNotFoundError is preserved as __cause__, so code that wraps these imports can now use except ImportError: cleanly instead of except Exception:.
    (PR #4525)

  • Bumped pipecat-ai-prebuilt to 1.0.1 in the runner extra, updating the prebuilt client UI served by the development runner.
    (PR #4531)

  • Replaced the transformers.WhisperFeatureExtractor dependency in LocalSmartTurnAnalyzerV3 with a vendored numpy-only implementation, reducing peak RSS at import from ~566 MB to ~60 MB and cold-start time from ~5.0 s to ~0.3 s. Behavior is numerically equivalent (matches the reference numpy code path within 1e-5 absolute tolerance; ONNX model output is bit-identical on representative inputs).

    • Smart Turn v3 no longer imports transformers at module load.
    • Prepares the ground for making transformers an optional dependency in a future release.
    • The vendored STFT is vectorized via numpy.lib.stride_tricks.sliding_window_view + batched np.fft.rfft, cutting _power_spectrogram runtime by ~55% (~4.0 ms → ~1.8 ms per call on a typical 8-second segment at 16 kHz) while preserving the same parity tolerances against the reference implementation.
      (PR #4536)
  • ⚠️ Renamed the RTVI UI Worker Protocol's vocabulary from the pipecat-subagents task/agent terms to Pipecat's native job/worker. This spans the wire messages (ui-taskui-job-group, ui-cancel-taskui-cancel-job-group), their envelope kinds and fields (task_idjob_id, agents/agent_nameworkers/worker_name), the paired Python models/frames (UITask*UIJobGroup*, RTVIUITask*FrameRTVIUIJobGroup*Frame), and the @pipecat-ai/client-js / client-react APIs (RTVIEvent.UITaskUIJobGroup, cancelUITaskcancelUIJobGroup, useUITasksuseUIJobGroups, UITasksProviderUIJobGroupsProvider). These primitives shipped in 1.2.0 but were never documented, so no real consumers are affected.
    (PR #4540)

  • transformers is no longer a base dependency, so pip install pipecat-ai no longer pulls it in. This follows Smart Turn v3 dropping its transformers import; the only remaining users (the deprecated LocalSmartTurnAnalyzerV2/CoreML analyzers and the Moondream service) already require the local-smart-turn and moondream extras, which continue to install transformers.
    (PR #4546)

  • Widened the deepgram extra to deepgram-sdk>=6.1.1,<8 so installations can resolve to either deepgram-sdk 6.x or 7.x. DeepgramSTTService now handles the agent_rest keyword argument that deepgram-sdk 7.2.0 added to DeepgramClientEnvironment, so custom base_url configuration keeps working on both 6.x and 7.x.
    (PR #4565)

  • Dropped the upper bound on the websockets-base extra (websockets>=13.1) so downstream deployments can resolve to websockets 16.x and beyond. Pipecat's websockets usage relies only on the modern websockets.asyncio API plus a handful of public symbols, all of which are retained in 16.x.
    (PR #4565)

  • Changed the default voice for GradiumTTSService to _6Aslh2DxfmnRLmP.
    (PR #4569)

  • InworldRealtimeLLMService now defaults the STT model to inworld/inworld-stt-1.
    (PR #4573)

Deprecated

  • FrameProcessor.pipeline_task is deprecated; read FrameProcessor.pipeline_worker instead. The old name still works but emits a DeprecationWarning and will be removed in a future release.
    (PR #4493)

  • Passing a worker to WorkerRunner.run() is deprecated. Register the worker with WorkerRunner.add_workers() before calling run() instead. The worker argument still works but emits a DeprecationWarning and will be removed in a future release.
    (PR #4493)

  • PipelineTask, PipelineTaskParams, and the pipecat.pipeline.task module have been renamed to PipelineWorker, WorkerParams, and pipecat.pipeline.worker. The old names still resolve (the module re-exports the new symbols) but constructing PipelineTask / PipelineTaskParams emits a DeprecationWarning; they will be removed in a future release.
    (PR #4493)

  • PipelineRunner has been renamed to WorkerRunner and moved to pipecat.workers.runner, since the runner now runs workers (of which PipelineWorker is one kind), not just pipelines. Import WorkerRunner from pipecat.workers.runner. The old pipecat.pipeline.runner module still re-exports both names, and PipelineRunner still works as a subclass alias, but it emits a DeprecationWarning and will be removed in a future release.
    (PR #4589)

Removed

  • Removed the unsupported Georgian (Language.KA) language mapping from SonioxSTTService.
    (PR #4521)

Fixed

  • Fixed Azure TTS last word being missed by observers and RTVI UI. The completion signal was racing with word timestamp processing, causing the final word's TTSTextFrame to arrive after TTSStoppedFrame. Completion is now routed through the word boundary queue to ensure all words are processed before signaling stream end.
    (PR #4306)

  • Fixed skipped TTS frames (e.g. code blocks filtered via skip_aggregator_types) being emitted to the assistant context immediately instead of waiting for preceding spoken frames to finish. They now hold their position in the frame sequence and are flushed only after all earlier spoken sentences are complete, keeping context ordering correct.
    (PR #4380)

  • Fixed Cartesia word timestamps leaking SSML tag text (e.g. <spell>, <emotion>, <break>) into word entries. Tags are now stripped before processing, so word-to-text attribution remains accurate when SSML markup is present in the TTS input.
    (PR #4380)

  • Fixed BaseOutputTransport reordering frames that share the same presentation timestamp. Frames with equal PTS values are now emitted in insertion order, preventing subtle audio/text sequencing bugs when multiple frames arrive at the same time.
    (PR #4380)

  • Fixed TTSTextFrame entries losing their original text structure when word timestamps are enabled. Each TTSTextFrame now carries a raw_text field containing the corresponding span of the original LLM-produced text (including pattern delimiters such as <card>4111 1111 1111 1111</card>), so the assistant context receives properly-tagged content rather than the cleaned words returned by the TTS provider. Also handles words that straddle two sentence boundaries by splitting them and attributing each part to its correct source frame.
    (PR #4380)

  • Fixed PipelineTask.cancel() hanging when cancellation is requested before the initial StartFrame reaches the pipeline sink.
    (PR #4455)

  • Fixed SmallWebRTCClient.read_audio_frame and read_video_frame busy-looping on MediaStreamError. When a track raises MediaStreamError, the generator now clears the track reference (_audio_input_track / _video_input_track / _screen_video_track) so the loop parks on the existing is None gate instead of re-entering recv() at ~100 Hz on a permanently-failed track. Renegotiation still resumes seamlessly: when _handle_client_connected reassigns a fresh track, the loop picks up frames from the new track.
    (PR #4491)

  • Fixed ElevenLabsSTTService crashing when language was passed as None. When language is not set, the service now lets ElevenLabs auto-detect the audio language.
    (PR #4507)

  • Fixed NvidiaSTTService so unexpected gRPC stream drops reconnect cleanly using the active audio iterator, while service shutdown and cancellation still close that iterator and stop the streaming worker without leaving it stuck waiting for more audio.
    (PR #4512)

  • Fixed websocket STT connection setup failures so services clear stale websocket state and emit non-fatal error frames, allowing ServiceSwitcher failover to keep agents running.
    (PR #4514)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService inserting unwanted spaces between words when synthesizing Chinese or Japanese. Word timestamps for these languages already include their own spacing, so they are now forwarded with includes_inter_frame_spaces=True to avoid double-spacing in transcripts and context.
    (PR #4517)

  • Fixed the development runner so missing optional transport dependencies disable only their related routes instead of failing startup in all-transport mode.
    (PR #4524)

  • Fixed a race in ElevenLabsTTSService where the periodic keepalive could be sent for a new turn's context before that context's voice_settings initialization message, causing ElevenLabs to close the WebSocket with a 1008 policy violation (voice_settings field must be provided in the first message ...). The keepalive now only targets a context once its context-init has been sent.
    (PR #4527)

  • Switched BaseSmartTurn from time.time() to time.monotonic() for its three internal interval-math sites (audio-buffer timestamps, speech-start tracking, and the pre-speech buffer-trim loop). Wall-clock time can step forward or backward when NTP adjusts the system clock, which would silently corrupt the buffer trim (prune everything / prune nothing) and the speech-window extraction. The corrected primitive is monotonic and matches the existing time.perf_counter() usage already in place for inference-latency metrics.
    (PR #4542)

  • Fixed SOXRAudioResampler and SOXRStreamAudioResampler ignoring the configured quality setting. Both resamplers were hardcoded to VHQ, which meant RNNoiseFilter's resampler_quality argument (defaulting to QQ for low-latency real-time use) had no effect. The resamplers now honor the configured quality, with VHQ retained as the default.
    (PR #4551)

  • Fixed GeminiLiveLLMService (and GeminiVertexLiveLLMService) crashing with 'ContextWindowCompressionParams' object has no attribute 'get' when context_window_compression was supplied through the settings API (e.g. settings=GeminiLiveLLMService.Settings(context_window_compression=ContextWindowCompressionParams(...))). The setting is now handled whether it arrives as a ContextWindowCompressionParams object or as a dict.
    (PR #4563)

  • Fixed AudioBufferProcessor concatenating utterances separated by a silent gap. When no user audio arrives for more than 200 ms, silence proportional to the wall-clock gap is now inserted into the user buffer; the same fix is applied symmetrically to the bot buffer, so two bot utterances spoken seconds apart (e.g. progressive hold messages played while a slow function call runs) remain temporally separated in the recorded audio.
    (PR #4567)

  • InworldRealtimeLLMService no longer logs WARNINGs for unrecognized realtime server events (e.g. response.output_text.done); they are now logged at DEBUG.
    (PR #4573)

  • Fixed a spurious ttfs_p99_latency not set, using default 1.0s warning emitted by turn-based STT services (CartesiaTurnsSTTService, DeepgramFluxSTTService) at pipeline start. These services have no meaningful "speech end → final transcript" interval to measure, because the server defines turn boundaries directly.
    (PR #4585)

Performance

  • BaseSmartTurn now stores raw int16 PCM views in its audio buffer and defers the float32 conversion to the once-per-turn segment extraction, eliminating ~50 per-frame numpy allocations per second per analyzer. Output is bit-identical to the previous per-frame conversion path because int16 → float32 / 32768.0 distributes over concatenation; subclasses (LocalSmartTurnAnalyzerV3, LocalCoreMLSmartTurnAnalyzer, HttpSmartTurnAnalyzer) all receive the same float32 audio_array they did before. Also removes a spurious np.frombuffer(audio_int16, dtype=np.int16) re-wrap that was a no-op view-of-a-view of already-int16 data.
    (PR #4542)

  • Reduced the soxr resampling quality preset in LocalSmartTurnAnalyzerV3 from VHQ (~26-tap polyphase) to HQ (~16-tap), cutting resample CPU time by 30–50% on non-16 kHz audio sources (~3–10 ms saved per turn at 24/48 kHz). Pipelines already delivering 16 kHz audio are unaffected — the existing actual_rate == _MODEL_SAMPLE_RATE fast path skips resampling entirely. The two quality presets differ in filter length, not cutoff or interpolation semantics; on a Whisper-style log-mel feature representation the audible difference sits well below the mel filterbank's quantization noise floor, so model predictions are unchanged on representative inputs.
    (PR #4542)

Don't miss a new pipecat release

NewReleases is sending notifications on new releases.