## 🎃 The Haunted Edition 👻
### Added
- Added a new `DeepgramHttpTTSService`, which delivers a meaningful reduction in latency when compared to the `DeepgramTTSService`.
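
  A minimal construction sketch; the import path and the `api_key`/`voice` parameter names are assumptions based on the existing `DeepgramTTSService` and may differ:

  ```python
  import os

  from pipecat.services.deepgram.tts import DeepgramHttpTTSService  # import path assumed

  tts = DeepgramHttpTTSService(
      api_key=os.getenv("DEEPGRAM_API_KEY"),
      voice="aura-2-andromeda-en",  # any Deepgram Aura voice
  )
  ```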
- Added support for the `speaking_rate` input parameter in `GoogleHttpTTSService`.
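
  A sketch of how the new parameter might be passed, assuming it is a field on the service's `InputParams`; the constructor arguments shown are illustrative:

  ```python
  from pipecat.services.google.tts import GoogleHttpTTSService

  tts = GoogleHttpTTSService(
      credentials_path="service-account.json",
      voice_id="en-US-Chirp3-HD-Charon",
      # speaking_rate living on InputParams is an assumption; check the service docs.
      params=GoogleHttpTTSService.InputParams(speaking_rate=1.15),
  )
  ```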
- Added `enable_speaker_diarization` and `enable_language_identification` to `SonioxSTTService`.
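
  A hedged sketch; the `SonioxInputParams` name and the placement of the new flags are assumptions:

  ```python
  import os

  from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService  # names assumed

  stt = SonioxSTTService(
      api_key=os.getenv("SONIOX_API_KEY"),
      params=SonioxInputParams(
          enable_speaker_diarization=True,
          enable_language_identification=True,
      ),
  )
  ```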
- Added `SpeechmaticsTTSService`, which uses Speechmatics' TTS API. Updated examples 07a* to use the new TTS service.
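
  A minimal construction sketch; the import path and constructor parameters are assumptions:

  ```python
  import os

  from pipecat.services.speechmatics.tts import SpeechmaticsTTSService  # import path assumed

  tts = SpeechmaticsTTSService(api_key=os.getenv("SPEECHMATICS_API_KEY"))
  ```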
- Added support for including images or audio in LLM context messages using `LLMContext.create_image_message()` or `LLMContext.create_image_url_message()` (not all LLMs support URLs) and `LLMContext.create_audio_message()`. For example, when creating an `LLMMessagesAppendFrame`:

  ```python
  message = LLMContext.create_image_message(image=..., size=...)
  await self.push_frame(LLMMessagesAppendFrame(messages=[message], run_llm=True))
  ```
- New event handlers for the `DeepgramFluxSTTService`: `on_start_of_turn`, `on_turn_resumed`, `on_end_of_turn`, `on_eager_end_of_turn`, `on_update`.
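
  Handlers are registered with the usual Pipecat event handler decorator; the import path, constructor arguments, and per-event callback arguments are assumptions:

  ```python
  import os

  from pipecat.services.deepgram.flux.stt import DeepgramFluxSTTService  # import path assumed

  stt = DeepgramFluxSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

  # Register one of the new handlers; extra callback arguments vary per event.
  @stt.event_handler("on_start_of_turn")
  async def on_start_of_turn(stt, *args):
      print("Start of turn detected")
  ```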
- Added `generation_config` parameter support to `CartesiaTTSService` and `CartesiaHttpTTSService` for Cartesia Sonic-3 models. Includes a new `GenerationConfig` class with `volume` (0.5-2.0), `speed` (0.6-1.5), and `emotion` (60+ options) parameters for fine-grained speech generation control.
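
  A sketch of how this might look; the import location of `GenerationConfig`, whether `generation_config` is set via `InputParams` (shown here), and the `emotion` value format are assumptions:

  ```python
  import os

  from pipecat.services.cartesia.tts import CartesiaTTSService, GenerationConfig  # import location assumed

  tts = CartesiaTTSService(
      api_key=os.getenv("CARTESIA_API_KEY"),
      voice_id="your-voice-id",
      model="sonic-3",
      params=CartesiaTTSService.InputParams(
          generation_config=GenerationConfig(
              volume=1.2,  # 0.5-2.0
              speed=0.9,  # 0.6-1.5
              emotion="calm",  # one of 60+ options
          )
      ),
  )
  ```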
- Expanded support for the universal `LLMContext` to `OpenAIRealtimeLLMService`. As a reminder, the context-setup pattern when using `LLMContext` is:

  ```python
  context = LLMContext(messages, tools)
  context_aggregator = LLMContextAggregatorPair(context)
  ```

  (Note that even though `OpenAIRealtimeLLMService` now supports the universal `LLMContext`, it is not meant to be swapped out for another LLM service at runtime with `LLMSwitcher`.)

  Note: `TranscriptionFrame`s and `InterimTranscriptionFrame`s now go upstream from `OpenAIRealtimeLLMService`, so if you're using `TranscriptProcessor`, say, you'll want to adjust accordingly:

  ```python
  pipeline = Pipeline(
      [
          transport.input(),
          context_aggregator.user(),
          # BEFORE
          # llm,
          # transcript.user(),
          # AFTER
          transcript.user(),
          llm,
          transport.output(),
          transcript.assistant(),
          context_aggregator.assistant(),
      ]
  )
  ```

  Also worth noting: whether or not you use the new context-setup pattern with `OpenAIRealtimeLLMService`, some types have changed under the hood:

  ```python
  ## BEFORE:

  # Context aggregator type
  context_aggregator: OpenAIContextAggregatorPair
  # Context frame type
  frame: OpenAILLMContextFrame
  # Context type
  context: OpenAIRealtimeLLMContext  # or context: OpenAILLMContext

  ## AFTER:

  # Context aggregator type
  context_aggregator: LLMContextAggregatorPair
  # Context frame type
  frame: LLMContextFrame
  # Context type
  context: LLMContext
  ```

  Also note that `RealtimeMessagesUpdateFrame` and `RealtimeFunctionCallResultFrame` have been deprecated, since they're no longer used by `OpenAIRealtimeLLMService`. OpenAI Realtime now works more like other LLM services in Pipecat, relying on updates to its context, pushed by context aggregators, to update its internal state. Listen for `LLMContextFrame`s for context updates.

  Finally, `LLMTextFrame`s are no longer pushed from `OpenAIRealtimeLLMService` when it's configured with `output_modalities=['audio']`. If you need to process its output, listen for `TTSTextFrame`s instead.
- Expanded support for the universal `LLMContext` to `GeminiLiveLLMService`. As a reminder, the context-setup pattern when using `LLMContext` is:

  ```python
  context = LLMContext(messages, tools)
  context_aggregator = LLMContextAggregatorPair(context)
  ```

  (Note that even though `GeminiLiveLLMService` now supports the universal `LLMContext`, it is not meant to be swapped out for another LLM service at runtime with `LLMSwitcher`.)

  Worth noting: whether or not you use the new context-setup pattern with `GeminiLiveLLMService`, some types have changed under the hood:

  ```python
  ## BEFORE:

  # Context aggregator type
  context_aggregator: GeminiLiveContextAggregatorPair
  # Context frame type
  frame: OpenAILLMContextFrame
  # Context type
  context: GeminiLiveLLMContext  # or context: OpenAILLMContext

  ## AFTER:

  # Context aggregator type
  context_aggregator: LLMContextAggregatorPair
  # Context frame type
  frame: LLMContextFrame
  # Context type
  context: LLMContext
  ```

  Also note that `LLMTextFrame`s are no longer pushed from `GeminiLiveLLMService` when it's configured with `modalities=GeminiModalities.AUDIO`. If you need to process its output, listen for `TTSTextFrame`s instead.
### Changed
- The development runner's `/start` endpoint now supports passing `dailyRoomProperties` and `dailyMeetingTokenProperties` in the request body when `createDailyRoom` is true. Properties are validated against the `DailyRoomProperties` and `DailyMeetingTokenProperties` types respectively and passed to Daily's room and token creation APIs.
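
  For illustration, a request to a locally running dev runner; the host/port and the specific properties shown are assumptions:

  ```python
  import requests

  response = requests.post(
      "http://localhost:7860/start",
      json={
          "createDailyRoom": True,
          "dailyRoomProperties": {"enable_chat": True},
          "dailyMeetingTokenProperties": {"is_owner": True},
      },
  )
  print(response.json())
  ```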
- `UserImageRawFrame` has new fields `append_to_context` and `text`. The `append_to_context` field indicates whether this image and text should be added to the LLM context (by the LLM assistant aggregator). The `text` field, if set, might also guide the LLM or the vision service on how to analyze the image.
- `UserImageRequestFrame` has new fields `append_to_context` and `text`. Both fields will be used to set the same fields on the captured `UserImageRawFrame`.
- `UserImageRequestFrame` no longer requires a function call name and ID; see the sketch below.
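
  A minimal sketch of requesting a user image with the new fields, from inside a frame processor; the `user_id` value and `text` are illustrative:

  ```python
  from pipecat.frames.frames import UserImageRequestFrame

  # Function call name/ID are no longer required. The text and append_to_context
  # values are carried over to the captured UserImageRawFrame.
  frame = UserImageRequestFrame(
      user_id="user-123",
      text="Describe what the camera sees.",
      append_to_context=True,
  )
  await self.push_frame(frame)
  ```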
- Updated `MoondreamService` to process `UserImageRawFrame`.
- `VisionService` expects a `UserImageRawFrame` in order to analyze images.
- `DailyTransport` triggers an `on_error` event if transcription can't be started or stopped.
- `DailyTransport` updates: `start_dialout()` now returns two values: `session_id` and `error`. `start_recording()` now returns two values: `stream_id` and `error`.
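
  A sketch of the new return shape; the dial-out `settings` payload shown is illustrative:

  ```python
  from loguru import logger

  # Both calls now return a (value, error) pair.
  session_id, error = await transport.start_dialout(settings={"phoneNumber": "+15551234567"})
  if error:
      logger.error(f"Dial-out failed: {error}")

  stream_id, error = await transport.start_recording()
  if error:
      logger.error(f"Recording failed: {error}")
  ```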
- Updated `daily-python` to 0.21.0.
- `SimliVideoService` now accepts `api_key` and `face_id` parameters directly, with optional `params` for `max_session_length` and `max_idle_time` configuration, aligning with other Pipecat service patterns.
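
  A minimal sketch of the new constructor; the import path, the `InputParams` name, and the units for the session limits are assumptions:

  ```python
  import os

  from pipecat.services.simli.video import SimliVideoService  # import path assumed

  simli = SimliVideoService(
      api_key=os.getenv("SIMLI_API_KEY"),
      face_id=os.getenv("SIMLI_FACE_ID"),
      params=SimliVideoService.InputParams(
          max_session_length=600,  # assumed seconds
          max_idle_time=30,
      ),
  )
  ```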
- Updated the default model to `sonic-3` for `CartesiaTTSService` and `CartesiaHttpTTSService`.
- `FunctionFilter` now has a `filter_system_frames` arg, which controls whether or not `SystemFrame`s are filtered.
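
  A hedged sketch; the filter function shown is illustrative and the default value of `filter_system_frames` is an assumption:

  ```python
  from pipecat.frames.frames import Frame, TextFrame
  from pipecat.processors.filters.function_filter import FunctionFilter

  async def only_text(frame: Frame) -> bool:
      return isinstance(frame, TextFrame)

  # When filter_system_frames is True, SystemFrames are also run through the filter.
  text_filter = FunctionFilter(filter=only_text, filter_system_frames=True)
  ```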
- Upgraded `aws_sdk_bedrock_runtime` to v0.1.1 to resolve potential CPU issues when running `AWSNovaSonicLLMService`.
### Deprecated
- The `expect_stripped_words` parameter of `LLMAssistantAggregatorParams` is ignored when used with the newer `LLMAssistantAggregator`, which now handles word spacing automatically.
- `LLMService.request_image_frame()` is deprecated; push a `UserImageRequestFrame` instead.
- `UserResponseAggregator` is deprecated and will be removed in a future version.
- The `send_transcription_frames` argument to `OpenAIRealtimeLLMService` is deprecated. Transcription frames are now always sent. They go upstream, to be handled by the user context aggregator. See the "Added" section for details.
- Types in `pipecat.services.openai.realtime.context` and `pipecat.services.openai.realtime.frames` are deprecated, as they're no longer used by `OpenAIRealtimeLLMService`. See the "Added" section for details.
- The `SimliVideoService` `simli_config` parameter is deprecated. Use the `api_key` and `face_id` parameters instead.
### Removed
- Removed `enable_non_final_tokens` and `max_non_final_tokens_duration_ms` from `SonioxSTTService`.
- Removed the `aiohttp_session` arg from `SarvamTTSService` as it's no longer used.
### Fixed
- Fixed a `PipelineTask` issue that was causing an idle timeout for frames that were being generated but not reaching the end of the pipeline. Since the exact point where frames are discarded is unknown, we now monitor pipeline frames using an observer. If the observer detects that frames are being generated, it prevents the pipeline from being considered idle.
- Fixed an issue in `HumeTTSService` where only Octave 2 was being used, which does not support the `description` field. Now, if a description is provided, the service switches to Octave 1.
- Fixed an issue where `DailyTransport` would time out prematurely on join and on leave.
- Fixed an issue in the runner where starting a `DailyTransport` room via `/start` didn't support using the `DAILY_SAMPLE_ROOM_URL` env var.
- Fixed an issue in `ServiceSwitcher` where using `STTService`s would result in all STT services producing `TranscriptionFrame`s.
### Other
- Updated all vision 12-series foundational examples to load images from a file.
- Added 14-series video examples for different services. These new examples request an image from the user camera through a function call.