github pipecat-ai/pipecat v0.0.108


Added

  • Added SarvamLLMService with support for sarvam-30b, sarvam-30b-16k, sarvam-105b and sarvam-105b-32k.
    (PR #3978)

  • Added on_turn_context_created(context_id) hook to TTSService. Override this to perform provider-specific setup (e.g. eagerly opening a server-side context) before text starts flowing. Called each time a new turn context ID is created.
    (PR #4013)
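The shape of such an override can be sketched as below. This is not pipecat's actual class hierarchy: `StubTTSService` and `EagerContextTTSService` are stand-ins invented for illustration; only the hook name and its `context_id` argument come from this entry.

```python
import asyncio


class StubTTSService:
    """Stand-in for pipecat's TTSService base class (not imported here)."""

    async def on_turn_context_created(self, context_id: str) -> None:
        # Default: no provider-specific setup.
        pass


class EagerContextTTSService(StubTTSService):
    """Hypothetical subclass that eagerly opens a server-side context."""

    def __init__(self) -> None:
        self.open_contexts: set[str] = set()

    async def on_turn_context_created(self, context_id: str) -> None:
        # Open the provider-side context before any text flows, so the
        # first audio chunk arrives with lower latency.
        self.open_contexts.add(context_id)


async def main() -> set[str]:
    service = EagerContextTTSService()
    await service.on_turn_context_created("turn-1")
    return service.open_contexts


print(asyncio.run(main()))  # {'turn-1'}
```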

  • Added XAIHttpTTSService for text-to-speech using xAI's HTTP TTS API.
    (PR #4031)

  • Added support for "developer" role messages in conversation context across all LLM adapters. For non-OpenAI services (Anthropic, Google, AWS Bedrock), "developer" messages are converted to "user" messages (use system_instruction to set the system instruction). For OpenAI services, "developer" messages pass through in conversation history. For the Responses API, they are kept as "developer" role (matching the existing "system" → "developer" conversion).
    (PR #4089)
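The role mapping described above can be sketched as a small pure function. This is not the adapter code itself, just the mapping the entry describes; the provider names are illustrative labels.

```python
def normalize_developer_messages(messages: list[dict], provider: str) -> list[dict]:
    """Sketch of the "developer" role handling described above.

    Non-OpenAI providers see "developer" converted to "user"; OpenAI
    chat-completions services pass it through unchanged.
    """
    non_openai = {"anthropic", "google", "aws_bedrock"}
    out = []
    for msg in messages:
        role = msg["role"]
        if role == "developer" and provider in non_openai:
            role = "user"
        out.append({**msg, "role": role})
    return out


msgs = [
    {"role": "developer", "content": "Answer tersely."},
    {"role": "user", "content": "Hi"},
]
print(normalize_developer_messages(msgs, "anthropic")[0]["role"])  # user
print(normalize_developer_messages(msgs, "openai")[0]["role"])     # developer
```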

  • Added SmallestTTSService, a WebSocket-based TTS service integration with Smallest AI's Waves API. Supports the Lightning v2 and v3.1 models with configurable voice, language, speed, consistency, similarity, and enhancement settings.
    (PR #4092)

  • Added warnings in turn stop strategies when VADParams.stop_secs differs from the recommended default (0.2s) or when stop_secs >= STT p99 latency, which collapses the STT wait timeout to 0s and may cause delayed turn detection. The warnings guide developers to re-run the stt-benchmark with their VAD settings.
    (PR #4115)

  • Added domain parameter to AssemblyAISTTSettings for specialized recognition modes such as Medical Mode (domain="medical-v1").
    (PR #4117)

  • Added NovitaLLMService for using Novita AI's LLM models via their OpenAI-compatible API.
    (PR #4119)

  • Added cleanup() method to VADAnalyzer and VADController so VAD analyzer resources are properly released when no longer needed. Custom VADAnalyzer subclasses can override cleanup() to free any held resources.
    (PR #4120)
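A minimal sketch of such an override, using a stand-in base class (`StubVADAnalyzer` and the model-handle details are invented for illustration; the real hook lives on pipecat's VADAnalyzer):

```python
class StubVADAnalyzer:
    """Stand-in for pipecat's VADAnalyzer base class (not imported here)."""

    def cleanup(self) -> None:
        # Base implementation releases nothing; subclasses override.
        pass


class ModelBackedVADAnalyzer(StubVADAnalyzer):
    """Hypothetical analyzer holding a model handle that must be freed."""

    def __init__(self) -> None:
        self._model = object()  # pretend this is an expensive native handle

    def cleanup(self) -> None:
        # Release held resources when the pipeline no longer needs VAD.
        self._model = None


analyzer = ModelBackedVADAnalyzer()
analyzer.cleanup()
print(analyzer._model is None)  # True
```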

  • Added on_end_of_turn event handler to AssemblyAISTTService. This fires after the final transcript is pushed, providing a reliable hook for end-of-turn logic that doesn't race with TranscriptionFrame. Works in both Pipecat and AssemblyAI turn detection modes.
    (PR #4128)

  • Added DeepgramFluxSageMakerSTTService for running Deepgram Flux speech-to-text on AWS SageMaker endpoints. Use with ExternalUserTurnStrategies to take advantage of Flux's turn detection.
    (PR #4143)

  • Added Mem0MemoryService.get_memories() convenience method for retrieving all stored memories outside the pipeline (e.g. to build a personalized greeting at connection time). This avoids the need to manually handle client type branching, filter construction, and async wrapping.
    (PR #4156)

Changed

  • Added context prewarming path for InworldTTSService to improve first audio latency.
    (PR #4013)

  • Added KrispVivaVadAnalyzer for Voice Activity Detection using the Krisp VIVA SDK (requires krisp_audio).
    (PR #4022)

  • Modified InworldTTSService to close its context at the end of a turn instead of relying on the idle timeout.
    (PR #4028)

  • Added Gemini 3 support to the Gemini Live service.
    (PR #4078)

  • TTSService: the default stop_frame_timeout_s (idle time before an automatic TTSStoppedFrame is pushed when push_stop_frames=True) has changed from 2.0 to 3.0 seconds.
    (PR #4084)

  • ⚠️ GeminiLLMAdapter now only treats messages[0] as the initial system message, matching all other adapters. Previously it searched for the first "system" message anywhere in the conversation history. A "system" message appearing later in the list will now be converted to "user" instead of being extracted as the system instruction.
    (PR #4089)

  • Fixed InworldTTSService to fall back to the full text when TTS timestamps are not received.
    (PR #4113)

  • ⚠️ Realtime services (Gemini Live, OpenAI Realtime, Grok Realtime, Nova Sonic) now prefer system_instruction from service settings over an initial system message in the LLM context, matching the behavior of non-realtime services. Previously, context-provided system instructions took precedence. A warning is now logged when both are set.
    (PR #4130)

  • Bumped nvidia-riva-client minimum version to >=2.25.1.
    (PR #4136)

  • Upgraded protobuf from 5.x to 6.x (>=6.31.1,<7).
    (PR #4136)

  • Unrecognized language strings (e.g. Deepgram's "multi") no longer produce a warning at startup. The log message has been downgraded to debug level since these are valid service-specific values that are passed through correctly.
    (PR #4137)

  • GrokLLMService and GrokRealtimeLLMService now live in the pipecat.services.xai module alongside XAIHttpTTSService, since all three use the same xAI API. Update imports from pipecat.services.grok.* to pipecat.services.xai.* (e.g. from pipecat.services.xai.llm import GrokLLMService).
    (PR #4142)

  • ⚠️ Bumped mem0ai dependency from ~=0.1.94 to >=1.0.8,<2. Users of the mem0 extra will need to update their mem0ai package.
    (PR #4156)

Deprecated

  • pipecat.services.grok.llm, pipecat.services.grok.realtime.llm, and
    pipecat.services.grok.realtime.events are deprecated. The old import paths
    still work but emit a DeprecationWarning; use pipecat.services.xai.llm,
    pipecat.services.xai.realtime.llm, and
    pipecat.services.xai.realtime.events instead.
    (PR #4142)

Removed

  • ⚠️ TTSService.add_word_timestamps() no longer supports the "Reset" and "TTSStoppedFrame" sentinel strings. If you have a custom TTS service that called await self.add_word_timestamps([("Reset", 0)]) or await self.add_word_timestamps([("TTSStoppedFrame", 0), ("Reset", 0)], ctx_id), replace them with await self.append_to_audio_context(ctx_id, TTSStoppedFrame(context_id=ctx_id)) and let _handle_audio_context manage the word-timestamp reset automatically.
    (PR #4145)

  • Removed SambaNovaSTTService. SambaNova no longer offers speech-to-text audio models. Use another STT provider instead.
    (PR #4154)

Fixed

  • Fixed Gemini Live (GoogleGeminiLiveLLMService) not honoring settings.system_instruction. The system instruction was being read from a deprecated constructor parameter instead of the settings object, causing it to be silently ignored.
    (PR #4089)

  • Fixed AWSBedrockLLMAdapter sending an empty message list to the API when the only message in context was a system message. The lone system message is now converted to "user" role instead of being extracted, matching the existing Anthropic adapter behavior.
    (PR #4089)

  • Fixed Gemini Live pipeline hanging indefinitely when an EndFrame was deferred while waiting for the bot to finish responding and turn_complete never arrived. As a possible root-cause fix, turn_complete messages are now handled even if they lack usage_metadata. As a fallback, the deferred EndFrame now has a 30-second safety timeout.
    (PR #4125)

  • Fixed ElevenLabs WebSocket disconnections (1008 "Maximum simultaneous contexts exceeded") caused by rapid user interruptions. When interruptions arrived before any TTS text was generated, phantom contexts were created on the ElevenLabs server that were never closed, eventually exceeding the 5-context limit.
    (PR #4126)
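One way to picture the guard is below: server-side contexts are only opened once text actually exists, so interruptions that arrive before any TTS text can no longer leak contexts. This is a stand-in sketch, not the ElevenLabs service's actual implementation; `ContextTracker` and its methods are invented for illustration, and only the 5-context limit and the 1008 close reason come from this entry.

```python
class ContextTracker:
    """Hypothetical tracker modeling the server-side context limit."""

    def __init__(self, max_contexts: int = 5) -> None:
        self.open: set[str] = set()
        self.max_contexts = max_contexts

    def maybe_open(self, context_id: str, text: str) -> bool:
        # Interruptions that arrive before any text produce no server-side
        # context, so rapid interrupts can't exhaust the context limit.
        if not text:
            return False
        if len(self.open) >= self.max_contexts:
            raise RuntimeError("1008: Maximum simultaneous contexts exceeded")
        self.open.add(context_id)
        return True

    def close(self, context_id: str) -> None:
        self.open.discard(context_id)


tracker = ContextTracker()
print(tracker.maybe_open("ctx-1", ""))       # False: interrupted before text
print(tracker.maybe_open("ctx-2", "Hello"))  # True
```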

  • Fixed the final sentence being dropped from the conversation context when using RTVI text input with non-word-timestamp TTS services. The LLMFullResponseEndFrame was racing ahead of the last TTSTextFrame, causing the LLMAssistantAggregator to finalize the context before the final sentence arrived.
    (PR #4127)

  • Fixed audio crackling and popping in recordings when both user and bot are speaking. AudioBufferProcessor no longer injects silence into a track's buffer while that track is actively producing audio, preventing mid-utterance interruptions in the recorded output.
    (PR #4135)

  • Fixed websocket TTS word timestamps so interrupted contexts cannot leak stale words or backward PTS values into later turns.
    (PR #4145)

  • Fixed a race condition in InterruptibleTTSService where, if run_tts had been invoked but BotStartedSpeakingFrame had not yet been received, a user interruption could allow stale audio to leak through.
    (PR #4145)

  • Fixed Gemini Live local VAD mode (GeminiVADParams(disabled=True) with external VAD) not working. The bot now correctly detects user speech and signals turn boundaries to the Gemini API.
    (PR #4146)

  • Fixed Gemini Live message handling to process all server_content fields independently. Gemini 3.x can bundle multiple fields (e.g. model_turn and output_transcription) on the same message, but the previous elif chain only processed the first match, silently dropping the rest.
    (PR #4147)
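The difference can be sketched with plain dicts standing in for the protobuf message (the handler below is invented for illustration; only the field names come from this entry): independent `if` checks process every field present, where an `if`/`elif` chain would stop at the first match.

```python
handled: list[str] = []


def handle(server_content: dict) -> None:
    # Fixed behavior: check every field independently instead of using an
    # if/elif chain that silently drops everything after the first match.
    if "model_turn" in server_content:
        handled.append("model_turn")
    if "output_transcription" in server_content:
        handled.append("output_transcription")


# Gemini 3.x can bundle both fields on one message; both are now handled.
handle({"model_turn": {}, "output_transcription": {}})
print(handled)  # ['model_turn', 'output_transcription']
```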

  • Fixed ServiceSwitcher with ServiceSwitcherStrategyFailover incorrectly triggering failover when ErrorFrames from other pipeline stages (e.g. TTS) propagated upstream through the switcher. Previously, any non-fatal error passing through would be misattributed to the active service and trigger an unwanted service switch. Now only errors originating from the switcher's own managed services trigger failover.
    (PR #4149)

  • Fixed LiveKitOutputTransport not clearing the rtc.AudioSource internal buffer on interruption, causing the bot to continue speaking for several seconds after being interrupted.
    (PR #4151)

  • Fixed a crash in OpenAI LLM processing when the provider returns chunk.choices[0].delta.audio = None, which caused 'NoneType' object has no attribute 'get' errors during audio transcript handling.
    (PR #4152)

  • Fixed error floods in DeepgramSTTService when the WebSocket connection drops. With Deepgram SDK 6.x, send_media() raises exceptions on a dead connection instead of silently failing, causing every queued audio frame to log an error. Now send_media() failures are caught gracefully: a single warning is logged and audio frames are skipped until the existing reconnection logic restores the connection.
    (PR #4153)
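The "warn once, then skip" pattern reads roughly like this sketch, built with stand-in objects (`GuardedSender` and `DeadConnection` are invented for illustration; the real service's implementation differs):

```python
import logging

logger = logging.getLogger("stt")


class GuardedSender:
    """Hypothetical wrapper: log one warning, then drop frames quietly."""

    def __init__(self, connection) -> None:
        self._connection = connection
        self._warned = False

    def send_media(self, chunk: bytes) -> bool:
        try:
            self._connection.send_media(chunk)
            self._warned = False  # connection is healthy again
            return True
        except Exception:
            if not self._warned:
                logger.warning("send_media failed; skipping audio until reconnect")
                self._warned = True
            return False  # skip this frame; reconnection logic runs elsewhere


class DeadConnection:
    def send_media(self, chunk: bytes) -> None:
        # Simulates a dropped WebSocket: every send raises.
        raise ConnectionError("socket closed")


sender = GuardedSender(DeadConnection())
print([sender.send_media(b"\x00") for _ in range(3)])  # [False, False, False]
```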

  • Mem0MemoryService no longer blocks the event loop during memory storage and retrieval. All Mem0 API calls now run in a background thread, and message storage is fire-and-forget so it doesn't delay downstream processing.
    (PR #4156)

  • Fixed Mem0MemoryService failing to store messages when the context contained system or developer role messages. The Mem0 API only accepts user and assistant roles, so other roles are now filtered out before storing.
    (PR #4156)
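The filtering described above amounts to a one-liner, sketched here as a standalone function (not the service's actual code; only the accepted roles come from this entry):

```python
MEM0_ALLOWED_ROLES = {"user", "assistant"}


def filter_for_mem0(messages: list[dict]) -> list[dict]:
    """Drop roles the Mem0 API rejects (system, developer, etc.)."""
    return [m for m in messages if m.get("role") in MEM0_ALLOWED_ROLES]


ctx = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hello"},
    {"role": "developer", "content": "Use JSON."},
    {"role": "assistant", "content": "Hi!"},
]
print([m["role"] for m in filter_for_mem0(ctx)])  # ['user', 'assistant']
```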

  • Added missing on_dtmf_event callback to LemonSliceTransportClient.setup() DailyCallbacks construction, fixing a ValidationError at pipeline setup time.
    (PR #4161)

  • Fixed an issue in InworldTTSService where, in cases of fast interruption, we would continue receiving audio from the previous context.
    (PR #4167)

  • Fixed a word timestamp interleaving issue in InworldTTSService when processing multiple sentences.
    (PR #4167)

  • Fixed duplicate TTSStoppedFrame being pushed in TTS services using push_stop_frames=True. When the stop-frame timeout fired, a second TTSStoppedFrame could be pushed after the normal one at context completion.
    (PR #4172)

  • ⚠️ Fixed DeepgramSTTService compatibility with deepgram-sdk 6.1.0. The SDK now requires explicit message objects for send_keep_alive(), send_close_stream(), and send_finalize(). The minimum deepgram-sdk version is now 6.1.0.
    (PR #4174)

  • Fixed RTVI events not being delivered to clients when using WebSocket transports. ProtobufFrameSerializer now sets ignore_rtvi_messages=False by default.
    (PR #4176)

  • Fixed a timing issue where turn detection timer tasks (idle controller, speech timeout, turn analyzer, and turn completion) could miss their first tick because the newly created asyncio task was not yet scheduled when the caller continued.
    (PR #4183)
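The underlying asyncio hazard is that `create_task()` returns before the task body has run at all, so work done immediately afterward can beat the timer's first tick. A common remedy, sketched below with an invented timer (this is not pipecat's controller code), is to hand the task a "started" event and wait on it:

```python
import asyncio


async def timer(started: asyncio.Event, ticks: list[str]) -> None:
    started.set()  # signal that the task body is actually running
    ticks.append("tick")


async def main() -> list[str]:
    ticks: list[str] = []
    started = asyncio.Event()
    task = asyncio.create_task(timer(started, ticks))
    # Without this wait, the caller could race ahead before the timer
    # task is even scheduled, missing its first tick.
    await started.wait()
    await task
    return ticks


print(asyncio.run(main()))  # ['tick']
```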

  • Fixed FastAPIWebsocketTransport intermittently hanging on shutdown when the remote side (e.g. Twilio) disconnects while audio is being sent. A race condition between the send and receive paths could cause the on_client_disconnected callback to be skipped, leaving the pipeline waiting for a disconnect signal that never came.
    (PR #4186)

Performance

  • RimeTTSService now handles Rime's done WebSocket message to complete audio contexts immediately, eliminating the 3-second idle timeout that previously added latency at the end of each utterance.
    (PR #4172)
