### Added

- Introducing user turn strategies. User turn strategies indicate when the user turn starts or stops. In conversational agents, these are often referred to as start/stop speaking or turn-taking plans or policies.

  User turn start strategies indicate when the user starts speaking (e.g. using VAD events or when the user says one or more words). User turn stop strategies indicate when the user stops speaking (e.g. using an end-of-turn detection model or by observing incoming transcriptions).

  A list of strategies can be specified for both start and stop; strategies are evaluated in order until one evaluates to true.

  Available user turn start strategies:

  - `VADUserTurnStartStrategy`
  - `TranscriptionUserTurnStartStrategy`
  - `MinWordsUserTurnStartStrategy`
  - `ExternalUserTurnStartStrategy`

  Available user turn stop strategies:

  - `TranscriptionUserTurnStopStrategy`
  - `TurnAnalyzerUserTurnStopStrategy`
  - `ExternalUserTurnStopStrategy`

  The default strategies are:

  - start: `[VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]`
  - stop: `[TranscriptionUserTurnStopStrategy]`

  Turn strategies are configured when setting up `LLMContextAggregatorPair`. For example:

  ```python
  context_aggregator = LLMContextAggregatorPair(
      context,
      user_params=LLMUserAggregatorParams(
          user_turn_strategies=UserTurnStrategies(
              stop=[
                  TurnAnalyzerUserTurnStopStrategy(
                      turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
                  )
              ],
          )
      ),
  )
  ```

  In order to use the user turn strategies you must update to the new universal `LLMContext` and `LLMContextAggregatorPair`. (PR #3045)
- Added `RNNoiseFilter` for real-time noise suppression using the RNNoise neural network via the pyrnnoise library. (PR #3205)
- Added `GrokRealtimeLLMService` for xAI's Grok Voice Agent API with real-time voice conversations:

  - Support for real-time audio streaming with WebSocket connection
  - Built-in server-side VAD (Voice Activity Detection)
  - Multiple voice options: Ara, Rex, Sal, Eve, Leo
  - Built-in tools support: `web_search`, `x_search`, `file_search`
  - Custom function calling with standard Pipecat tools schema
  - Configurable audio formats (PCM at 8kHz-48kHz)
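
  A minimal construction sketch (the constructor arguments shown are assumptions for illustration; check the service docs for the confirmed signature):

  ```python
  llm = GrokRealtimeLLMService(
      api_key=os.getenv("XAI_API_KEY"),  # assumed env var name
      voice="Ara",  # one of the available voices: Ara, Rex, Sal, Eve, Leo
  )
  ```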
  (PR #3267)
- Added an approximation of TTFB for Ultravox. (PR #3268)
- Added a new `AudioContextTTSService` to the TTS service base classes. The `AudioContextWordTTSService` now inherits from `AudioContextTTSService` and `WebsocketWordTTSService`. (PR #3289)
- `LLMUserAggregator` now exposes the following events:

  - `on_user_turn_started`: triggered when a user turn starts
  - `on_user_turn_stopped`: triggered when a user turn ends
  - `on_user_turn_stop_timeout`: triggered when a user turn does not stop and times out
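
  A registration sketch (assuming these events follow Pipecat's usual `event_handler` decorator pattern; the handler signature is illustrative):

  ```python
  user_aggregator = context_aggregator.user()

  @user_aggregator.event_handler("on_user_turn_started")
  async def on_user_turn_started(aggregator):
      # Handler arguments are an assumption; check the event documentation.
      print("user turn started")
  ```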
  (PR #3291)
- Introducing user mute strategies. User mute strategies indicate when user input should be muted based on the current system state. In conversational agents, user mute strategies are used to prevent user input from interrupting bot speech, tool execution, or other critical system operations.

  A list of strategies can be specified; all strategies are evaluated for every frame so that each strategy can maintain its internal state. A user frame is muted if any of the configured strategies indicates it should be muted.

  Available user mute strategies:

  - `FirstSpeechUserMuteStrategy`
  - `MuteUntilFirstBotCompleteUserMuteStrategy`
  - `AlwaysUserMuteStrategy`
  - `FunctionCallUserMuteStrategy`

  User mute strategies replace the legacy `STTMuteFilter` and provide a more flexible and composable approach to muting user input.

  User mute strategies are configured when setting up the `LLMContextAggregatorPair`. For example:

  ```python
  context_aggregator = LLMContextAggregatorPair(
      context,
      user_params=LLMUserAggregatorParams(
          user_mute_strategies=[
              FirstSpeechUserMuteStrategy(),
          ]
      ),
  )
  ```

  In order to use user mute strategies you should update to the new universal `LLMContext` and `LLMContextAggregatorPair`. (PR #3292)
- Added `use_ssl` parameter to `NvidiaSTTService`, `NvidiaSegmentedSTTService` and `NvidiaTTSService`.
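
  For example (arguments other than `use_ssl` are illustrative):

  ```python
  stt = NvidiaSTTService(
      api_key=os.getenv("NVIDIA_API_KEY"),  # assumed env var name
      use_ssl=True,
  )
  ```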
  (PR #3300)
- Added `enable_interruptions` constructor argument to all user turn strategies. This tells the `LLMUserAggregator` whether or not to push an `InterruptionFrame`.
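
  For example, to detect user turns from transcriptions without pushing interruptions (mirroring the configuration shown in the Deprecated section below):

  ```python
  context_aggregator = LLMContextAggregatorPair(
      context,
      user_params=LLMUserAggregatorParams(
          user_turn_strategies=UserTurnStrategies(
              start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
          ),
      ),
  )
  ```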
  (PR #3316)
- Added `split_sentences` parameter to `SpeechmaticsSTTService` to control sentence splitting behavior for finals on sentence boundaries.
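
  A configuration sketch (placement of `split_sentences` inside `InputParams` is an assumption based on the service's other parameters):

  ```python
  stt = SpeechmaticsSTTService(
      api_key=os.getenv("SPEECHMATICS_API_KEY"),
      params=SpeechmaticsSTTService.InputParams(
          language=Language.EN,
          split_sentences=True,
      ),
  )
  ```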
  (PR #3328)
- Added word-level timestamp support to `AzureTTSService` for accurate text-to-audio synchronization. (PR #3334)
- Added `pronunciation_dict_id` parameter to `CartesiaTTSService.InputParams` and `CartesiaHttpTTSService.InputParams` to support Cartesia's pronunciation dictionary feature for custom pronunciations.
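
  For example (arguments other than `pronunciation_dict_id` are illustrative):

  ```python
  tts = CartesiaTTSService(
      api_key=os.getenv("CARTESIA_API_KEY"),
      voice_id="...",  # your Cartesia voice ID
      params=CartesiaTTSService.InputParams(
          pronunciation_dict_id="...",  # ID of a pronunciation dictionary created in Cartesia
      ),
  )
  ```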
  (PR #3346)
- Added support for using the HeyGen LiveAvatar API with the `HeyGenTransport` (see https://www.liveavatar.com/). (PR #3357)
- Added image support to `OpenAIRealtimeLLMService` via `InputImageRawFrame`:

  - New `start_video_paused` parameter to control initial video input state
  - New `video_frame_detail` parameter to set image processing quality ("auto", "low", or "high"). This corresponds to OpenAI Realtime's `image_detail` parameter.
  - New `set_video_input_paused()` method to pause/resume video input at runtime
  - New `set_video_frame_detail()` method to adjust video frame quality dynamically
  - Automatic rate limiting (1 frame per second) to prevent API overload
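
  A configuration sketch (constructor arguments besides the two new parameters, and the awaitability of the runtime setters, are assumptions):

  ```python
  llm = OpenAIRealtimeLLMService(
      api_key=os.getenv("OPENAI_API_KEY"),
      start_video_paused=True,  # begin with video input paused
      video_frame_detail="low",  # maps to OpenAI Realtime's image_detail
  )

  # Later, at runtime:
  await llm.set_video_input_paused(False)
  await llm.set_video_frame_detail("high")
  ```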
  (PR #3360)
- Added `UserTurnProcessor`, a frame processor built on `UserTurnController` that pushes `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` frames and interruptions based on the controller's user turn strategies. (PR #3372)
- Added `UserTurnController` to manage user turns. It emits `on_user_turn_started`, `on_user_turn_stopped`, and `on_user_turn_stop_timeout` events, and can be integrated into processors to detect and handle user turns. `LLMUserAggregator` and `UserTurnProcessor` are implemented using this controller. (PR #3372)
- Added `should_interrupt` property to `DeepgramFluxSTTService`, `DeepgramSTTService`, and `SpeechmaticsSTTService` to configure whether the bot should be interrupted when the external service detects user speech. (PR #3374)
- `LLMAssistantAggregator` now exposes the following events:

  - `on_assistant_turn_started`: triggered when the assistant turn starts
  - `on_assistant_turn_stopped`: triggered when the assistant turn ends
  - `on_assistant_thought`: triggered when there's an assistant thought available
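
  A registration sketch (assuming these events follow Pipecat's usual `event_handler` decorator pattern; the handler signature is illustrative):

  ```python
  assistant_aggregator = context_aggregator.assistant()

  @assistant_aggregator.event_handler("on_assistant_turn_stopped")
  async def on_assistant_turn_stopped(aggregator):
      # Handler arguments are an assumption; check the event documentation.
      print("assistant turn stopped")
  ```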
  (PR #3385)
- Added `KrispVivaTurn` analyzer for end of turn detection using the Krisp VIVA SDK (requires `krisp_audio`). (PR #3391)
- Added support for setting up a pipeline task from external files. You can now register custom pipeline task setup files by setting the `PIPECAT_SETUP_FILES` environment variable. This variable should contain a colon-separated list of Python files (e.g. `export PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a function with the following signature:

  ```python
  async def setup_pipeline_task(task: PipelineTask): ...
  ```
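
  A setup-file sketch (only the `setup_pipeline_task` signature comes from this entry; the body and import path are illustrative):

  ```python
  # my_setup.py -- register with: export PIPECAT_SETUP_FILES="my_setup.py"
  from pipecat.pipeline.task import PipelineTask  # import path assumed

  async def setup_pipeline_task(task: PipelineTask):
      # Configure the task here however your deployment needs.
      print(f"setting up pipeline task: {task}")
  ```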
  (PR #3397)
- Added a keepalive task for `InworldTTSService` to keep the service connected in the event of no generations for longer periods of time. (PR #3403)
- Added `enable_vad` to `Params` for use in the `GladiaSTTService`. When enabled, `GladiaSTTService` acts as the turn controller, emitting `UserStartedSpeakingFrame`, `UserStoppedSpeakingFrame`, and optionally `InterruptionFrame`.
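
  A configuration sketch (the location of the `Params` class and the other arguments are assumptions; only `enable_vad` comes from this entry):

  ```python
  stt = GladiaSTTService(
      api_key=os.getenv("GLADIA_API_KEY"),
      params=GladiaSTTService.Params(  # assumed params class location
          enable_vad=True,  # let Gladia act as the turn controller
      ),
  )
  ```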
  (PR #3404)
- Added `should_interrupt` property to `GladiaSTTService` to configure whether the bot should be interrupted when the external service detects user speech. (PR #3404)
- Added `VonageFrameSerializer` for the Vonage Video API Audio Connector WebSocket protocol. (PR #3410)
- Added `append_trailing_space` parameter to `TTSService` to automatically append a trailing space to text before sending to TTS, helping prevent some services from vocalizing trailing punctuation.
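
  Since this is a base `TTSService` parameter, it should be settable on derived services; a sketch (the service choice and other arguments are illustrative):

  ```python
  tts = CartesiaTTSService(
      api_key=os.getenv("CARTESIA_API_KEY"),
      voice_id="...",  # your voice ID
      append_trailing_space=True,  # avoid vocalizing trailing punctuation
  )
  ```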
  (PR #3424)

### Changed

- Updated `ElevenLabsRealtimeSTTService` to accept the `include_language_detection` parameter to detect language.

  ```python
  stt = ElevenLabsRealtimeSTTService(
      api_key=os.getenv("ELEVENLABS_API_KEY"),
      include_language_detection=True,
  )
  ```

  (PR #3216)
- Updated `SpeechmaticsSTTService` to use the new Python Voice SDK, which provides improved VAD and Smart Turn capabilities and brings dramatic improvements to latency without any impact on accuracy. Use the `turn_detection_mode` parameter to control the endpointing of speech, with `TurnDetectionMode.EXTERNAL` (default), `TurnDetectionMode.ADAPTIVE`, or `TurnDetectionMode.SMART_TURN`.

  ```python
  stt = SpeechmaticsSTTService(
      api_key=os.getenv("SPEECHMATICS_API_KEY"),
      params=SpeechmaticsSTTService.InputParams(
          language=Language.EN,
          turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
          speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
      ),
  )
  ```

  (PR #3225)
- `daily-python` updated to 0.23.0. (PR #3257)
- `TranscriptionFrame` and `InterimTranscriptionFrame` produced by `DailyTransport` now include the transport source (i.e., the originating audio track). (PR #3257)
- Updates to Inworld TTS services:

  - Improved `InworldTTSService`'s websocket implementation to better flush and close contexts, improving handling of long inputs.
  - Improved docstrings for `InworldTTSService` and `InworldHttpTTSService`.

  (PR #3288)
- Updated `DeepgramSTTService` to push user started/stopped speaking and interruption frames when `vad_enabled` is set to true. This centralizes the frames into the service, removing the need to have your application code handle Deepgram's events and push these frames. (PR #3314)
- Added encoding validation to `DeepgramTTSService` to prevent unsupported encodings from reaching the API. The service now raises `ValueError` at initialization with a clear error message. (PR #3329)
- Updated `read_audio_frame` & `read_video_frame` methods in `SmallWebRTCClient` to check if the track is enabled before logging a warning. (PR #3336)
- Updated `CartesiaTTSService` to support setting `language=None`, resulting in Cartesia auto-detecting the language of the conversation.
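
  For example (placement of `language` inside `InputParams` follows the service's existing pattern; other arguments are illustrative):

  ```python
  tts = CartesiaTTSService(
      api_key=os.getenv("CARTESIA_API_KEY"),
      voice_id="...",  # your voice ID
      params=CartesiaTTSService.InputParams(
          language=None,  # let Cartesia auto-detect the conversation language
      ),
  )
  ```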
  (PR #3366)
- The bundled Smart Turn weights are now updated to v3.2, which has better handling of short utterances and is more robust against background noise. (PR #3367)
- Updated `SpeechmaticsSTTService` dependency to `speechmatics-voice[smart]>=0.2.6`. (PR #3371)
- Smart Turn now takes into account `vad_start_seconds` when buffering audio, meaning that the start of the turn audio is not cut off. This improves accuracy for short utterances.

  - The default value of `pre_speech_ms` is now set to 500ms for Smart Turn.

  (PR #3377)
- Improved Krisp SDK management to allow `KrispVivaTurn` and `KrispVivaFilter` to share a single SDK instance within the same process. (PR #3391)
- Updated default model for `GroqTTSService` to `canopylabs/orpheus-v1-english` and voice ID to `autumn`. (PR #3399)
- Enhanced `FastAPIWebsocketTransport` with optional protocol-level audio packetization via the `fixed_audio_packet_size` parameter to support media endpoints requiring strict framing and real-time pacing.
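
  A configuration sketch (placement of `fixed_audio_packet_size` on the transport params, and the value shown, are assumptions):

  ```python
  transport = FastAPIWebsocketTransport(
      websocket=websocket,
      params=FastAPIWebsocketParams(
          audio_out_enabled=True,
          fixed_audio_packet_size=640,  # bytes per outgoing packet (illustrative)
      ),
  )
  ```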
  (PR #3410)
- `DeepgramTTSService` and `RimeTTSService` now set `append_trailing_space` to `True` to prevent punctuation (e.g., “dot”) from being pronounced. (PR #3424)
- Updated `GeminiLiveLLMService` to push `LLMThoughtStartFrame`, `LLMThoughtTextFrame`, and `LLMThoughtEndFrame` when the model returns thought content. (PR #3431)

### Deprecated

- `pipecat.audio.interruptions.MinWordsInterruptionStrategy` is deprecated. Use `pipecat.turns.user_start.MinWordsUserTurnStartStrategy` with `LLMUserAggregator`'s new `user_turn_strategies` parameter instead. (PR #3045)
- `FrameProcessor.interruption_strategies` is deprecated; use `LLMUserAggregator`'s new `user_turn_strategies` parameter instead. (PR #3045)
- The `LLMUserAggregatorParams` and `LLMAssistantAggregatorParams` classes in `pipecat.processors.aggregators.llm_response` are now deprecated. Use the new universal `LLMContext` and `LLMContextAggregatorPair` instead. (PR #3045)
- Deprecated the `emulated` field in the `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` frames. (PR #3045)
- `EmulateUserStartedSpeakingFrame` and `EmulateUserStoppedSpeakingFrame` frames are deprecated. (PR #3045)
- ⚠️ `TransportParams.turn_analyzer` is deprecated and might result in unexpected behavior; use `LLMUserAggregator`'s new `user_turn_strategies` parameter instead. (PR #3045)
- For `SpeechmaticsSTTService`, the `end_of_utterance_mode` parameter is deprecated. Use the new `turn_detection_mode` parameter instead, with `TurnDetectionMode.EXTERNAL`, `TurnDetectionMode.ADAPTIVE`, or `TurnDetectionMode.SMART_TURN`. The `enable_vad` parameter is also deprecated and is inferred from the `turn_detection_mode`. (PR #3225)
- `OpenAILLMContext` and its associated machinery (context aggregators, etc.) are now deprecated in favor of the universal `LLMContext` and its equivalents. From the developer's point of view, switching to the `LLMContext` machinery will usually be a matter of going from this:

  ```python
  context = OpenAILLMContext(messages, tools)
  context_aggregator = llm.create_context_aggregator(context)
  ```

  To this:

  ```python
  context = LLMContext(messages, tools)
  context_aggregator = LLMContextAggregatorPair(context)
  ```

  (PR #3263)
- `STTMuteFilter` is deprecated and will be removed in a future version. Use `LLMUserAggregator`'s new `user_mute_strategies` instead. (PR #3292)
- `FrameProcessor.interruptions_allowed` is now deprecated; use `LLMUserAggregator`'s new parameter `user_mute_strategies` instead. (PR #3297)
- `PipelineParams.allow_interruptions` is now deprecated; use `LLMUserAggregator`'s new parameter `user_turn_strategies` instead. For example, to disable interruptions but still get user turns you can do:

  ```python
  context_aggregator = LLMContextAggregatorPair(
      context,
      user_params=LLMUserAggregatorParams(
          user_turn_strategies=UserTurnStrategies(
              start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
          ),
      ),
  )
  ```

  (PR #3297)
- `TranscriptProcessor` and related data classes and frames (`TranscriptionMessage`, `ThoughtTranscriptionMessage`, `TranscriptionUpdateFrame`) are deprecated. Use `LLMUserAggregator`'s and `LLMAssistantAggregator`'s new events (`on_user_turn_stopped` and `on_assistant_turn_stopped`) instead. (PR #3385)
- Deprecated support for the `vad_events` `LiveOptions` setting in `DeepgramSTTService`. Instead, use a local Silero VAD for VAD events. Additionally, deprecated `should_interrupt`, which will be removed along with `vad_events` support in a future release. (PR #3386)
- Loading external observers from files is deprecated; use the new pipeline task setup files and the `PIPECAT_SETUP_FILES` environment variable instead. (PR #3397)

### Fixed

- Improved error handling in `ElevenLabsRealtimeSTTService`:

  - Fixed an issue in `ElevenLabsRealtimeSTTService` causing an infinite loop that blocks the process if the websocket disconnects due to an error

  (PR #3233)
- Fixed a bug in `STTMuteFilter` where the user was not always muted during function calls, especially when there were multiple simultaneous calls. (PR #3292)
- Fixed a `RNNoiseFilter` issue that would cause a "[Errno 12] Cannot allocate memory" error when processing silence audio frames. (PR #3322)
- Updated `SpeechmaticsSTTService` for version `0.0.99+`:

  - Fixed `SpeechmaticsSTTService` to listen for `VADUserStoppedSpeakingFrame` in order to finalize transcription.
  - Default to `TurnDetectionMode.FIXED` for Pipecat-controlled end of turn detection.
  - Only emit VAD + interruption frames if VAD is enabled within the plugin (modes other than `TurnDetectionMode.FIXED` or `TurnDetectionMode.EXTERNAL`).

  (PR #3328)
- Fixed an issue with function calling where a handler failing to invoke its result callback could leave the context stuck in IN_PROGRESS, causing LLM inference for subsequent function call results to block while waiting on the unresolved call. (PR #3343)
- Fixed an issue with `DeepgramTTSService` where the model would output "Dot" instead of a period in some circumstances. (PR #3345)
- Fixed an issue in `traced_stt` where `model_name` in OpenTelemetry appears as `unknown`. (PR #3351)
- Fixed an issue in `GeminiLiveLLMService` where `TranscriptionFrame`s were occasionally not pushed. (PR #3356)
- Fixed potential memory leaks and initialization issues in `KrispVivaFilter` by improving SDK lifecycle management. (PR #3391)
- Fixed a timing issue in `BaseOutputTransport` where the bot speaking flag was set after awaiting, allowing the event loop to re-enter the method before the guard was set. (PR #3400)
- Fixed an issue in `traced_llm` where `model_name` in OpenTelemetry appears as `unknown`. (PR #3422)
- Fixed an issue in `traced_tts`, `traced_gemini_live`, and `traced_openai_realtime` where `model_name` in OpenTelemetry appears as `unknown`. (PR #3428)
- Fixed `request_image_frame` (for backwards compatibility) and restored function-call-related fields in `UserImageRequestFrame` and `UserImageRawFrame`, preventing a case where adding a non-LLM message to the context could trigger duplicate LLM inferences (on image arrival and on function-call result), potentially causing an infinite inference loop. (PR #3430)
- Fixed `LLMContext.create_audio_message()` by correcting an internal helper that was incorrectly declared async while being run in `asyncio.to_thread()`. (PR #3435)

### Other

- Added `52-live-transcription.py` foundational example demonstrating live transcription and translation from English to Spanish. In this example, the bot is not interruptible: as the user continues speaking, English transcriptions are queued, and the bot continuously translates and speaks each queued sentence in Spanish without being interrupted by new user speech. (PR #3316)
- Added a new foundational example `53-concurrent-llm-evaluation.py` that shows how to use `UserTurnProcessor`. (PR #3372)
- Added a new foundational example `28-user-assistant-turns.py` that shows how to use the new `LLMUserAggregator` and `LLMAssistantAggregator` events to gather a conversation transcript. (PR #3385)