pipecat-ai/pipecat v0.0.99

Added

  • Introducing user turn strategies. User turn strategies indicate when the user turn starts or stops. In conversational agents, these are often referred to as start/stop speaking detection, or as turn-taking plans or policies.

    User turn start strategies indicate when the user starts speaking (e.g. using VAD events or when a user says one or more words).

    User turn stop strategies indicate when the user stops speaking (e.g. using an end-of-turn detection model or by observing incoming transcriptions).

    A list of strategies can be specified for both start and stop; the strategies are evaluated in order until one evaluates to true.

    Available user turn start strategies:
    - VADUserTurnStartStrategy
    - TranscriptionUserTurnStartStrategy
    - MinWordsUserTurnStartStrategy
    - ExternalUserTurnStartStrategy

    Available user turn stop strategies:
    - TranscriptionUserTurnStopStrategy
    - TurnAnalyzerUserTurnStopStrategy
    - ExternalUserTurnStopStrategy

    The default strategies are:
    - start: [VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]
    - stop: [TranscriptionUserTurnStopStrategy]

    Turn strategies are configured when setting up LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                stop=[
                    TurnAnalyzerUserTurnStopStrategy(
                        turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
                    )
                ],
            )
        ),
    )

    To use user turn strategies, you must update to the new universal LLMContext and LLMContextAggregatorPair. (PR #3045)

  • Added RNNoiseFilter for real-time noise suppression using RNNoise neural network via pyrnnoise library. (PR #3205)
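
    A minimal usage sketch, assuming the filter is attached via the transport's audio_in_filter parameter (the import path is an assumption):

    from pipecat.audio.filters.rnnoise_filter import RNNoiseFilter  # assumed import path
    from pipecat.transports.base_transport import TransportParams

    # Suppress noise on all incoming audio before downstream processing
    # (VAD, STT, etc.).
    transport_params = TransportParams(
        audio_in_enabled=True,
        audio_in_filter=RNNoiseFilter(),
    )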

  • Added GrokRealtimeLLMService for xAI's Grok Voice Agent API with real-time voice conversations:

    • Support for real-time audio streaming with WebSocket connection
    • Built-in server-side VAD (Voice Activity Detection)
    • Multiple voice options: Ara, Rex, Sal, Eve, Leo
    • Built-in tools support: web_search, x_search, file_search
    • Custom function calling with standard Pipecat tools schema
    • Configurable audio formats (PCM at 8kHz-48kHz)
      (PR #3267)
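
    A minimal construction sketch (import path and constructor arguments are assumptions):

    import os

    from pipecat.services.grok.realtime import GrokRealtimeLLMService  # assumed import path

    llm = GrokRealtimeLLMService(
        api_key=os.getenv("XAI_API_KEY"),  # assumed environment variable
        voice="Ara",                       # one of: Ara, Rex, Sal, Eve, Leo
    )
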
  • Added an approximation of TTFB for Ultravox.
    (PR #3268)

  • Added a new AudioContextTTSService to the TTS service base classes. The AudioContextWordTTSService now inherits from AudioContextTTSService and WebsocketWordTTSService. (PR #3289)

  • LLMUserAggregator now exposes the following events:

    • on_user_turn_started: triggered when a user turn starts
    • on_user_turn_stopped: triggered when a user turn ends
    • on_user_turn_stop_timeout: triggered when a user turn does not stop and times out
      (PR #3291)
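
    A sketch of subscribing to these events, assuming the user-side aggregator is obtained via the pair's user() accessor (handler signatures assumed):

    context_aggregator = LLMContextAggregatorPair(context)

    @context_aggregator.user().event_handler("on_user_turn_started")
    async def on_user_turn_started(aggregator, *args):  # signature assumed
        print("user turn started")

    @context_aggregator.user().event_handler("on_user_turn_stopped")
    async def on_user_turn_stopped(aggregator, *args):  # signature assumed
        print("user turn stopped")
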
  • Introducing user mute strategies. User mute strategies indicate when user input should be muted based on the current system state.

    In conversational agents, user mute strategies are used to prevent user input from interrupting bot speech, tool execution, or other critical system operations.

    A list of strategies can be specified; all strategies are evaluated for every frame so that each strategy can maintain its internal state. A user frame is muted if any of the configured strategies indicates it should be muted.

    Available user mute strategies:

    • FirstSpeechUserMuteStrategy
    • MuteUntilFirstBotCompleteUserMuteStrategy
    • AlwaysUserMuteStrategy
    • FunctionCallUserMuteStrategy

    User mute strategies replace the legacy STTMuteFilter and provide a more flexible and composable approach to muting user input.

    User mute strategies are configured when setting up the LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_mute_strategies=[
                FirstSpeechUserMuteStrategy(),
            ]
        ),
    )

    To use user mute strategies, you should update to the new universal LLMContext and LLMContextAggregatorPair.
    (PR #3292)

  • Added use_ssl parameter to NvidiaSTTService, NvidiaSegmentedSTTService, and NvidiaTTSService.
    (PR #3300)

  • Added enable_interruptions constructor argument to all user turn strategies. This tells the LLMUserAggregator whether to push an InterruptionFrame.
    (PR #3316)

  • Added split_sentences parameter to SpeechmaticsSTTService to control whether final transcripts are split on sentence boundaries.
    (PR #3328)

  • Added word-level timestamp support to AzureTTSService for accurate text-to-audio synchronization.
    (PR #3334)

  • Added pronunciation_dict_id parameter to CartesiaTTSService.InputParams and CartesiaHttpTTSService.InputParams to support Cartesia's pronunciation dictionary feature for custom pronunciations.
    (PR #3346)
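
    A usage sketch (the dictionary ID shown is hypothetical):

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="...",  # your Cartesia voice ID
        params=CartesiaTTSService.InputParams(
            pronunciation_dict_id="pd_12345",  # hypothetical dictionary ID
        ),
    )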

  • Added support for using the HeyGen LiveAvatar API with the HeyGenTransport (see https://www.liveavatar.com/).
    (PR #3357)

  • Added image support to OpenAIRealtimeLLMService via InputImageRawFrame:

    • New start_video_paused parameter to control initial video input state
    • New video_frame_detail parameter to set image processing quality ("auto", "low", or "high"). This corresponds to OpenAI Realtime's image_detail parameter.
    • set_video_input_paused() method to pause/resume video input at runtime
    • set_video_frame_detail() method to adjust video frame quality dynamically
    • Automatic rate limiting (1 frame per second) to prevent API overload
      (PR #3360)
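
    A sketch of the new parameters and runtime controls (constructor details and method await-ness assumed):

    llm = OpenAIRealtimeLLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        start_video_paused=True,    # don't process video until explicitly resumed
        video_frame_detail="low",   # "auto", "low", or "high"
    )

    # Later, at runtime:
    await llm.set_video_input_paused(False)   # resume consuming InputImageRawFrames
    await llm.set_video_frame_detail("high")  # adjust frame quality dynamically
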
  • Added UserTurnProcessor, a frame processor built on UserTurnController that pushes UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames, as well as interruptions, based on the controller's user turn strategies.
    (PR #3372)

  • Added UserTurnController to manage user turns. It emits on_user_turn_started, on_user_turn_stopped, and on_user_turn_stop_timeout events, and can be integrated into processors to detect and handle user turns. LLMUserAggregator and UserTurnProcessor are implemented using this controller.
    (PR #3372)

  • Added should_interrupt property to DeepgramFluxSTTService, DeepgramSTTService, and SpeechmaticsSTTService to configure whether the bot should be interrupted when the external service detects user speech.
    (PR #3374)

  • LLMAssistantAggregator now exposes the following events:

    • on_assistant_turn_started: triggered when the assistant turn starts
    • on_assistant_turn_stopped: triggered when the assistant turn ends
    • on_assistant_thought: triggered when there's an assistant thought available
      (PR #3385)
  • Added KrispVivaTurn analyzer for end-of-turn detection using the Krisp VIVA SDK (requires the krisp_audio package).
    (PR #3391)

  • Added support for setting up a pipeline task from external files. You can now register custom pipeline task setup files by setting the PIPECAT_SETUP_FILES environment variable. This variable should contain a colon-separated list of Python files (e.g. export PIPECAT_SETUP_FILES="setup1.py:setup.py:..."). Each file must define a function with the following signature:

    async def setup_pipeline_task(task: PipelineTask):
        ...

    (PR #3397)

  • Added a keepalive task for InworldTTSService to keep the service connected when there are no generations for extended periods of time.
    (PR #3403)

  • Added enable_vad to Params for use in the GladiaSTTService. When enabled, GladiaSTTService acts as the turn controller, emitting UserStartedSpeakingFrame, UserStoppedSpeakingFrame, and optionally InterruptionFrame.
    (PR #3404)

  • Added should_interrupt property to GladiaSTTService to configure whether the bot should be interrupted when the external service detects user speech.
    (PR #3404)

  • Added VonageFrameSerializer for the Vonage Video API Audio Connector WebSocket protocol.
    (PR #3410)

  • Added append_trailing_space parameter to TTSService to automatically append a trailing space to text before sending to TTS, helping prevent some services from vocalizing trailing punctuation.
    (PR #3424)

Changed

  • Updated ElevenLabsRealtimeSTTService to accept an include_language_detection parameter that enables language detection.

      stt = ElevenLabsRealtimeSTTService(
          api_key=os.getenv("ELEVENLABS_API_KEY"),
          include_language_detection=True
      )

    (PR #3216)

  • Updated SpeechmaticsSTTService to use the new Python Voice SDK, which improves VAD and Smart Turn capabilities and dramatically reduces latency without any impact on accuracy. Use the turn_detection_mode parameter to control speech endpointing, with TurnDetectionMode.EXTERNAL (default), TurnDetectionMode.ADAPTIVE, or TurnDetectionMode.SMART_TURN.

      stt = SpeechmaticsSTTService(
          api_key=os.getenv("SPEECHMATICS_API_KEY"),
          params=SpeechmaticsSTTService.InputParams(
              language=Language.EN,
              turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
              speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
          ),
      )

    (PR #3225)

  • daily-python updated to 0.23.0.
    (PR #3257)

  • TranscriptionFrame and InterimTranscriptionFrame produced by DailyTransport now include the transport source (i.e., the originating audio track).
    (PR #3257)

  • Updates to Inworld TTS services:

    • Improved InworldTTSService's websocket implementation to flush and close contexts more reliably, better handling long inputs.
    • Improved docstrings for InworldTTSService and InworldHttpTTSService.
      (PR #3288)
  • Updated DeepgramSTTService to push user started/stopped speaking and interruption frames when vad_enabled is set to true. This centralizes the frames into the service, removing the need to have your application code handle Deepgram's events and push these frames.
    (PR #3314)

  • Added encoding validation to DeepgramTTSService to prevent unsupported encodings from reaching the API. The service now raises ValueError at initialization with a clear error message.
    (PR #3329)

  • Updated the read_audio_frame and read_video_frame methods in SmallWebRTCClient to check if the track is enabled before logging a warning.
    (PR #3336)

  • Updated CartesiaTTSService to support setting language=None, resulting in Cartesia auto-detecting the language of the conversation.
    (PR #3366)

  • The bundled Smart Turn weights are now updated to v3.2, which handles short utterances better and is more robust against background noise.
    (PR #3367)

  • Updated SpeechmaticsSTTService dependency to speechmatics-voice[smart]>=0.2.6
    (PR #3371)

  • Smart Turn now takes into account vad_start_seconds when buffering audio, meaning that the start of the turn audio is not cut off. This improves accuracy for short utterances.

    • The default value of pre_speech_ms is now set to 500ms for Smart Turn.
      (PR #3377)
  • Improved Krisp SDK management to allow KrispVivaTurn and KrispVivaFilter to share a single SDK instance within the same process.
    (PR #3391)

  • Updated default model for GroqTTSService to canopylabs/orpheus-v1-english and voice ID to autumn.
    (PR #3399)

  • Enhanced FastAPIWebsocketTransport with optional protocol-level audio packetization via the fixed_audio_packet_size parameter to support media endpoints requiring strict framing and real-time pacing.
    (PR #3410)

  • DeepgramTTSService and RimeTTSService now set append_trailing_space to True to prevent punctuation (e.g., “dot”) from being pronounced.
    (PR #3424)

  • Updated GeminiLiveLLMService to push LLMThoughtStartFrame, LLMThoughtTextFrame, and LLMThoughtEndFrame when the model returns thought content.
    (PR #3431)

Deprecated

  • pipecat.audio.interruptions.MinWordsInterruptionStrategy is deprecated. Use pipecat.turns.user_start.MinWordsUserTurnStartStrategy with LLMUserAggregator's new user_turn_strategies parameter instead.
    (PR #3045)
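
    A migration sketch (the min_words argument is assumed to carry over from the old strategy):

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                start=[MinWordsUserTurnStartStrategy(min_words=3)],  # min_words assumed
            ),
        ),
    )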

  • FrameProcessor.interruption_strategies is deprecated; use LLMUserAggregator's new user_turn_strategies parameter instead.
    (PR #3045)

  • The LLMUserAggregatorParams and LLMAssistantAggregatorParams classes in pipecat.processors.aggregators.llm_response are now deprecated. Use the new universal LLMContext and LLMContextAggregatorPair instead.
    (PR #3045)

  • Deprecated the emulated field in the UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames.
    (PR #3045)

  • EmulateUserStartedSpeakingFrame and EmulateUserStoppedSpeakingFrame frames are deprecated.
    (PR #3045)

  • ⚠️ TransportParams.turn_analyzer is deprecated and might result in unexpected behavior; use LLMUserAggregator's new user_turn_strategies parameter instead.
    (PR #3045)

  • For SpeechmaticsSTTService, the end_of_utterance_mode parameter is deprecated. Use the new turn_detection_mode parameter instead, with TurnDetectionMode.EXTERNAL, TurnDetectionMode.ADAPTIVE, or TurnDetectionMode.SMART_TURN. The enable_vad parameter is also deprecated and is now inferred from the turn_detection_mode.
    (PR #3225)

  • OpenAILLMContext and its associated machinery (context aggregators, etc.) are now deprecated in favor of the universal LLMContext and its equivalents.

    From the developer's point of view, switching to using LLMContext machinery will usually be a matter of going from this:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)

    To this:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (PR #3263)

  • STTMuteFilter is deprecated and will be removed in a future version. Use LLMUserAggregator's new user_mute_strategies instead.
    (PR #3292)

  • FrameProcessor.interruptions_allowed is now deprecated; use LLMUserAggregator's new user_mute_strategies parameter instead.
    (PR #3297)

  • PipelineParams.allow_interruptions is now deprecated; use LLMUserAggregator's new user_turn_strategies parameter instead. For example, to disable interruptions but still get user turns:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
            ),
        ),
    )

    (PR #3297)

  • TranscriptProcessor and related data classes and frames (TranscriptionMessage, ThoughtTranscriptionMessage, TranscriptionUpdateFrame) are deprecated. Use LLMUserAggregator's and LLMAssistantAggregator's new events (on_user_turn_stopped and on_assistant_turn_stopped) instead.
    (PR #3385)
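
    A sketch of gathering a transcript with the new events, given a context_aggregator built as above (handler signatures and event payloads assumed; see the foundational example 28-user-assistant-turns.py for a complete version):

    transcript = []

    @context_aggregator.user().event_handler("on_user_turn_stopped")
    async def on_user_turn_stopped(aggregator, *args):  # payload shape assumed
        transcript.append(("user", args))

    @context_aggregator.assistant().event_handler("on_assistant_turn_stopped")
    async def on_assistant_turn_stopped(aggregator, *args):  # payload shape assumed
        transcript.append(("assistant", args))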

  • Deprecated support for the vad_events LiveOptions setting in DeepgramSTTService. Instead, use a local Silero VAD for VAD events. Additionally, deprecated should_interrupt, which will be removed along with vad_events support in a future release.
    (PR #3386)
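
    For example, a local Silero VAD can be configured on the transport:

    from pipecat.audio.vad.silero import SileroVADAnalyzer
    from pipecat.transports.base_transport import TransportParams

    transport_params = TransportParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # local VAD replaces Deepgram's vad_events
    )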

  • Loading external observers from files is deprecated, use the new pipeline task setup files and PIPECAT_SETUP_FILES environment variable instead.
    (PR #3397)

Fixed

  • Improved error handling in ElevenLabsRealtimeSTTService: fixed an issue causing an infinite loop that blocked the process if the websocket disconnected due to an error.
    (PR #3233)
  • Fixed a bug in STTMuteFilter where the user was not always muted during function calls, especially when there were multiple simultaneous calls.
    (PR #3292)

  • Fixed an RNNoiseFilter issue that would cause a "[Errno 12] Cannot allocate memory" error when processing silent audio frames.
    (PR #3322)

  • Updated SpeechmaticsSTTService for version 0.0.99+:

    • Fixed SpeechmaticsSTTService to listen for VADUserStoppedSpeakingFrame in order to finalize transcription.
    • Default to TurnDetectionMode.FIXED for Pipecat-controlled end of turn detection.
    • Only emit VAD + interruption frames if VAD is enabled within the plugin (modes other than TurnDetectionMode.FIXED or TurnDetectionMode.EXTERNAL).
      (PR #3328)
  • Fixed an issue with function calling where a handler failing to invoke its result callback could leave the context stuck in IN_PROGRESS, causing LLM inference for subsequent function call results to block while waiting on the unresolved call.
    (PR #3343)

  • Fixed an issue with DeepgramTTSService where the model would output "Dot" instead of a period in some circumstances.
    (PR #3345)

  • Fixed an issue in traced_stt where model_name in OpenTelemetry appears as unknown.
    (PR #3351)

  • Fixed an issue in GeminiLiveLLMService where TranscriptionFrames were occasionally not pushed.
    (PR #3356)

  • Fixed potential memory leaks and initialization issues in KrispVivaFilter by improving SDK lifecycle management.
    (PR #3391)

  • Fixed a timing issue in BaseOutputTransport where the bot speaking flag was set after awaiting, allowing the event loop to re-enter the method before the guard was set.
    (PR #3400)

  • Fixed an issue in traced_llm where model_name in OpenTelemetry appears as unknown.
    (PR #3422)

  • Fixed an issue in traced_tts, traced_gemini_live, and traced_openai_realtime where model_name in OpenTelemetry appears as unknown.
    (PR #3428)

  • Fixed request_image_frame (for backwards compatibility) and restored function-call–related fields in UserImageRequestFrame and UserImageRawFrame, preventing a case where adding a non-LLM message to the context could trigger duplicate LLM inferences (on image arrival and on function-call result), potentially causing an infinite inference loop.
    (PR #3430)

  • Fixed LLMContext.create_audio_message() by correcting an internal helper that was incorrectly declared async while being run in asyncio.to_thread().
    (PR #3435)

Other

  • Added 52-live-transcription.py foundational example demonstrating live transcription and translation from English to Spanish. In this example, the bot is not interruptible: as the user continues speaking, English transcriptions are queued, and the bot continuously translates and speaks each queued sentence in Spanish without being interrupted by new user speech.
    (PR #3316)

  • Added a new foundational example 53-concurrent-llm-evaluation.py that shows how to use UserTurnProcessor.
    (PR #3372)

  • Added a new foundational example 28-user-assistant-turns.py that shows how to use the new LLMUserAggregator and LLMAssistantAggregator events to gather a conversation transcript.
    (PR #3385)
