⭐️ Highlights
🔌 Pipelines got simpler
With the updated logic, Pipelines can now:
- connect multiple `list[T]` component outputs directly to a single `list[T]` input of the next component, simplifying pipeline definitions when multiple components produce compatible outputs. For example, you can directly connect multiple converters to a writer component in ingestion pipelines without a `DocumentJoiner` component.
- support automatic conversion between `ChatMessage` and `str` types, enabling simpler connections between various components. For example, you can easily connect an Agent component (which returns a `ChatMessage` as `last_message`) to a text embedder component (which expects a `str` as query) without an `OutputAdapter` component.
- automatically convert `list[ChatMessage]` to `ChatMessage` and `list[str]` to `str` by taking the first element of the list. Another supported conversion is `list[ChatMessage]` to `str`, enabling the connection between a chat generator (which returns `list[ChatMessage]` as `messages`) and a BM25 retriever (which expects a `str` as query).
- perform list wrapping: a component returning type `T` can be connected to a component expecting type `list[T]`.
Together, these changes eliminate the need for `OutputAdapter` and many joiners (`ListJoiner`, `DocumentJoiner`) in common setups such as query rewriting and hybrid search.
🏁 Rankers handle duplicate documents (more consistent hybrid retrieval)
Rankers now deduplicate documents by `id` before ranking, preventing the same document from being scored multiple times in hybrid retrieval setups and eliminating the need for a `DocumentJoiner` after recent pipeline updates.
⚠️ Breaking change: `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` now also deduplicate by `id` (not by content). Documents are only considered duplicates if they share the same `id`.
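To make the new behavior concrete, here is a minimal plain-Python sketch (illustrative only, not the retrievers' actual implementation) showing why two documents with identical content but different metadata get different hash-based ids and therefore both survive deduplication:

```python
import hashlib


def doc_id(content: str, meta: dict) -> str:
    # Mimics an id automatically derived from a hash of the document's attributes.
    return hashlib.sha256(f"{content}|{sorted(meta.items())}".encode()).hexdigest()


def deduplicate_by_id(docs: list[dict]) -> list[dict]:
    # Keep the first occurrence of each id; later duplicates are dropped.
    seen, unique = set(), []
    for doc in docs:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            unique.append(doc)
    return unique


a = {"content": "Paris", "meta": {"source": "web"}}
b = {"content": "Paris", "meta": {"source": "pdf"}}  # same content, different meta
a["id"] = doc_id(a["content"], a["meta"])
b["id"] = doc_id(b["content"], b["meta"])

# Different metadata -> different ids -> both documents are kept.
result = deduplicate_by_id([a, b, a])
```

If you need content-based deduplication back, assign the same user-defined `id` to documents with identical content before they reach the retriever.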
⬆️ Upgrade Notes
- Deduplication in Rankers is a breaking change for users who rely on keeping duplicate documents with the same user-defined `id` in the ranking output. This change only affects users with custom document ids who want duplicates preserved. To keep the previous behavior, ensure that your user-defined document ids are unique across retriever outputs.
  Affected Rankers: `HuggingFaceTEIRanker`, `LostInTheMiddleRanker`, `MetaFieldRanker`, `MetaFieldGroupingRanker`, `SentenceTransformersDiversityRanker`, `SentenceTransformersSimilarityRanker`, `TransformersSimilarityRanker`.
- Deduplication behavior in `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` has changed and may be breaking for users who relied on deduplication based on document content rather than document id.
  Documents are now considered duplicates only if they share the same `id`. Document ids can be user-defined or are automatically generated as a hash of the document's attributes (e.g. content, metadata, etc.).
  This change affects setups where multiple documents have identical content but different ids (for example, due to differing metadata). To preserve the previous behavior, ensure that documents with identical content are assigned the same `id` across retriever outputs.
- Removed the deprecated `deserialize_document_store_in_init_params_inplace` function. This function was deprecated in Haystack 2.23.0 and is no longer used.
🚀 New Features
- Pipelines now natively support connecting multiple outputs directly to a single component input without requiring an explicit Joiner component. This only works when the connected outputs and inputs are of compatible list types, such as `list[Document]`.
  This simplifies pipeline definitions when multiple components produce compatible outputs. For example, multiple outputs from a `FileTypeRouter` can now be connected directly to a single converter or writer, without defining an intermediate `ListJoiner` or `DocumentJoiner`.

  ```python
  from haystack import Pipeline
  from haystack.components.converters import HTMLToDocument, TextFileToDocument
  from haystack.components.routers import FileTypeRouter
  from haystack.components.writers import DocumentWriter
  from haystack.dataclasses import ByteStream
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  sources = [
      ByteStream.from_string(text="Text file content", mime_type="text/plain", meta={"file_type": "txt"}),
      ByteStream.from_string(
          text="\n<html><body>Some content</body></html>\n",
          mime_type="text/html",
          meta={"file_type": "html"},
      ),
  ]

  doc_store = InMemoryDocumentStore()

  pipe = Pipeline()
  pipe.add_component("router", FileTypeRouter(mime_types=["text/plain", "text/html"]))
  pipe.add_component("txt_converter", TextFileToDocument())
  pipe.add_component("html_converter", HTMLToDocument())
  pipe.add_component("writer", DocumentWriter(doc_store))

  pipe.connect("router.text/plain", "txt_converter.sources")
  pipe.connect("router.text/html", "html_converter.sources")

  # The DocumentWriter accepts documents from both converters without needing a DocumentJoiner
  pipe.connect("txt_converter.documents", "writer.documents")
  pipe.connect("html_converter.documents", "writer.documents")

  result = pipe.run({"router": {"sources": sources}})
  # result["writer"]["documents_written"] == 2
  ```
- Pipelines now support connection and automatic conversion between `ChatMessage` and `str` types.
  - When a `str` output is connected to a `ChatMessage` input, it is automatically converted to a user `ChatMessage`.
  - When a `ChatMessage` output is connected to a `str` input, its `text` attribute is automatically extracted. If `text` is `None`, an informative `PipelineRuntimeError` is raised.
  - To maintain backward compatibility, when multiple connections are available, strict type matching is prioritized over conversion.
- Pipelines now support list wrapping: a component returning type `T` can be connected to a component expecting type `list[T]`. The output will be wrapped automatically.
  In addition, Pipelines support automatic conversion between `list[T]` and `T`, for `str` and `ChatMessage` types only. When converting from `list[T]` to `T`, the first element of the list is used. If the list is empty, an informative `PipelineRuntimeError` is raised.
  Together with other recent changes, this makes pipelines more flexible and removes the need for explicit adapter components in many cases. For example, the following pipeline automatically converts the `list[ChatMessage]` produced by the LLM into the `str` expected by the retriever, which previously required an `OutputAdapter` component.

  ```python
  from haystack import Pipeline
  from haystack.components.builders import ChatPromptBuilder
  from haystack.components.generators.chat import OpenAIChatGenerator
  from haystack.components.retrievers import InMemoryBM25Retriever
  from haystack.dataclasses import Document
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  document_store = InMemoryDocumentStore()
  documents = [
      Document(content="Bob lives in Paris."),
      Document(content="Alice lives in London."),
      Document(content="Ivy lives in Melbourne."),
      Document(content="Kate lives in Brisbane."),
      Document(content="Liam lives in Adelaide."),
  ]
  document_store.write_documents(documents)

  template = """{% message role="user" %}
  Rewrite the following query to be used for keyword search.
  {{ query }}
  {% endmessage %}
  """

  p = Pipeline()
  p.add_component("prompt_builder", ChatPromptBuilder(template=template))
  p.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))
  p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))
  p.connect("prompt_builder", "llm")
  p.connect("llm", "retriever")

  query = """Someday I'd love to visit Brisbane, but for now I just want to know the names of the people who live there."""

  results = p.run(data={"prompt_builder": {"query": query}})
  ```
- Introduced the `MarkdownHeaderSplitter` component:
  - Splits documents into chunks at Markdown headers (`#`, `##`, etc.), preserving header hierarchy as metadata.
  - Supports secondary splitting (by word, passage, period, or line) for further chunking after header-based splitting, using Haystack's `DocumentSplitter`.
  - Preserves and propagates metadata such as parent headers and page numbers.
  - Handles edge cases such as documents with no headers, empty content, and non-text documents.
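To make header-based splitting concrete, here is a minimal plain-Python sketch of the core idea (not the component's actual implementation): split text at Markdown headers and attach the current header to each chunk as metadata.

```python
import re


def split_on_headers(text: str) -> list[dict]:
    """Split Markdown text at headers, attaching each chunk's header as metadata."""
    chunks, header, lines = [], None, []

    def flush():
        # Emit the accumulated lines as one chunk under the current header.
        if lines:
            chunks.append({"content": "\n".join(lines).strip(), "meta": {"header": header}})

    for line in text.splitlines():
        match = re.match(r"(#+)\s+(.*)", line)
        if match:
            flush()
            header, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return chunks


chunks = split_on_headers("# Intro\nHello.\n## Details\nMore text.")
```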
- Added support for Chat Messages that include files using the `FileContent` dataclass to `OpenAIResponsesChatGenerator` and `AzureOpenAIResponsesChatGenerator`. Users can now pass files such as PDFs when using Haystack Chat Generators based on the Responses API.
- User Chat Messages can now include files using the new `FileContent` dataclass. Most API-based LLMs support file inputs such as PDFs (and, for some models, additional file types).
  For now, this feature is implemented for `OpenAIChatGenerator` and `AzureOpenAIChatGenerator`, with support for more model providers coming soon.
  For advanced PDF handling, such as reducing image size or selecting specific page ranges, we recommend using the `PDFToImageContent` component instead.
  `FileContent` example:

  ```python
  from haystack.components.generators.chat.openai import OpenAIChatGenerator
  from haystack.dataclasses.chat_message import ChatMessage
  from haystack.dataclasses.file_content import FileContent

  file_content = FileContent.from_url("https://arxiv.org/pdf/2309.08632")
  chat_message = ChatMessage.from_user(content_parts=[file_content, "Summarize this paper in 100 words."])

  llm = OpenAIChatGenerator(model="gpt-4.1-mini")
  response = llm.run(messages=[chat_message])
  ```
⚡️ Enhancement Notes
- Resolve postponed type annotations (from `from __future__ import annotations`) when creating component input sockets, so pipelines can correctly match compatible types. This fixes cases where connecting `ChatPromptBuilder` to `FallbackChatGenerator` failed because the generator's annotations were interpreted as strings (for example `'list[ChatMessage]'`), resulting in a `PipelineConnectError` due to mismatched socket types.
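For context, a short plain-Python illustration of the underlying mechanism (this mirrors the idea, not Haystack's exact code): postponed annotations are stored as strings, and `typing.get_type_hints` resolves them back into real type objects so compatible sockets can be matched.

```python
import typing


def run(messages: "list[str]") -> dict:
    # The string annotation here is equivalent to what
    # `from __future__ import annotations` produces module-wide.
    return {"replies": messages}


# Raw __annotations__ holds the plain string, which naive
# type comparison would treat as a mismatched socket type.
raw = run.__annotations__["messages"]

# get_type_hints resolves the string into the actual type object.
resolved = typing.get_type_hints(run)["messages"]
```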
- Agent components allow grouping multiple tools under a single confirmation strategy.
  Here is an example of how three tools can be grouped under a `BlockingConfirmationStrategy`:

  ```python
  confirmation_strategies = {("tool1", "tool2", "tool3"): BlockingConfirmationStrategy()}
  ```

  instead of previously needing:

  ```python
  confirmation_strategies = {
      "tool1": BlockingConfirmationStrategy(),
      "tool2": BlockingConfirmationStrategy(),
      "tool3": BlockingConfirmationStrategy(),
  }
  ```
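The grouped form can be read as shorthand for the per-tool mapping. A hedged plain-Python sketch of that normalization (not the Agent's actual code):

```python
def expand_strategies(strategies: dict) -> dict:
    """Expand tuple keys so each tool name maps to its own strategy instance."""
    expanded = {}
    for key, strategy in strategies.items():
        names = key if isinstance(key, tuple) else (key,)
        for name in names:
            expanded[name] = strategy
    return expanded


shared = object()  # stand-in for a BlockingConfirmationStrategy instance
expanded = expand_strategies({("tool1", "tool2", "tool3"): shared, "tool4": shared})
```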
- Add `run_async` method to `SearchApiWebSearch` and `SerperDevWebSearch`.
- Add new DocumentStore standard tests for the following operations: `delete_all_documents()`, `update_by_filter()`, `delete_by_filter()`.
- The `InMemoryDocumentStore` now has three new operations: `delete_all_documents()`, `update_by_filter()`, and `delete_by_filter()`.
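To illustrate what filter-based deletion does, here is a minimal plain-Python sketch over an in-memory list (illustrative only; the real `delete_by_filter()` uses Haystack's standard filter syntax with `field`, `operator`, and `value` keys):

```python
def matches(doc: dict, filters: dict) -> bool:
    """Minimal '==' comparison filter, a tiny subset of the real filter syntax."""
    return doc.get("meta", {}).get(filters["field"]) == filters["value"]


def delete_by_filter(store: list[dict], filters: dict) -> list[dict]:
    """Return the store with every document matching the filter removed."""
    return [doc for doc in store if not matches(doc, filters)]


store = [
    {"content": "Hallo", "meta": {"lang": "de"}},
    {"content": "Hello", "meta": {"lang": "en"}},
]
store = delete_by_filter(store, {"field": "lang", "operator": "==", "value": "de"})
```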
- Rankers now deduplicate documents by `id` before ranking, preventing identical documents from being scored multiple times in hybrid retrieval setups and keeping ranking outputs more consistent.
  It also means you can connect multiple retriever outputs directly to a Ranker without inserting a `DocumentJoiner` just to avoid duplicates. For example:

  ```python
  from haystack import Pipeline
  from haystack.components.embedders import SentenceTransformersTextEmbedder
  from haystack.components.rankers import TransformersSimilarityRanker
  from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  document_store = InMemoryDocumentStore()

  text_embedder = SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5")
  embedding_retriever = InMemoryEmbeddingRetriever(document_store)
  bm25_retriever = InMemoryBM25Retriever(document_store)
  ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

  hybrid_retrieval = Pipeline()
  hybrid_retrieval.add_component("text_embedder", text_embedder)
  hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
  hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
  hybrid_retrieval.add_component("ranker", ranker)

  hybrid_retrieval.connect("text_embedder", "embedding_retriever")
  hybrid_retrieval.connect("embedding_retriever", "ranker")
  hybrid_retrieval.connect("bm25_retriever", "ranker")

  query = "apnea in infants"

  result = hybrid_retrieval.run(
      {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
  )
  ```
- Add `strip_whitespaces` and `replace_regexes` parameters to the DocumentCleaner component.
  The `strip_whitespaces` parameter removes leading and trailing whitespace from document content using Python's `str.strip()` method. Unlike `remove_extra_whitespaces`, this only affects the beginning and end of the text, preserving internal whitespace, which is useful for maintaining Markdown formatting.
  The `replace_regexes` parameter accepts a dictionary mapping regex patterns to replacement strings, allowing custom text transformations. For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline. This is applied after `remove_regex` and provides more flexibility than simple pattern removal.
  Example usage:

  ```python
  from haystack.components.preprocessors import DocumentCleaner
  from haystack.dataclasses import Document

  cleaner = DocumentCleaner(
      strip_whitespaces=True,
      replace_regexes={r'\n\n+': '\n'}
  )
  doc = Document(content=" \n\nHello World\n\n\n ")
  result = cleaner.run(documents=[doc])
  # Result: "Hello World\n"
  ```
- The `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` now deduplicate documents by `id` instead of by `content`, preventing identical documents from being returned multiple times.
Deprecation Notes
- Deprecated `PipelineTemplate`, `PredefinedPipeline` and its options (like `PredefinedPipeline.CHAT_WITH_WEBSITE`). These templates will be removed in Haystack 2.25. Users should switch to using Pipeline YAML files.
Bug Fixes
- Remove outdated Marqo document store links from the document store guide and docs sidebars to avoid broken integration URLs.
- Ensure `Pipeline` and `AsyncPipeline` deep-copy component inputs before execution so mutable outputs (e.g., `Document` dataclasses) shared across multiple downstream components don't get mutated by reference. This prevents side effects where one component's in-place modifications could unexpectedly affect other branches in the pipeline.
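A small plain-Python illustration of the class of bug this fixes (not Haystack's actual code): without a deep copy, a branch that mutates its input in place leaks that change into every other consumer of the same object.

```python
import copy


def uppercase_branch(docs: list[dict]) -> list[dict]:
    # Mutates its input in place, like a component modifying shared Document objects.
    for doc in docs:
        doc["content"] = doc["content"].upper()
    return docs


# Without a deep copy, the mutation leaks into the shared input...
shared_input = [{"content": "hello"}]
uppercase_branch(shared_input)
leaked = shared_input[0]["content"]

# ...whereas deep-copying per branch leaves the shared input untouched.
shared_input = [{"content": "hello"}]
uppercase_branch(copy.deepcopy(shared_input))
preserved = shared_input[0]["content"]
```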
💙 Big thank you to everyone who contributed to this release!
@agnieszka-m, @Amanbig, @anakin87, @bilgeyucel, @Bobholamovic, @bogdankostic, @davidsbatista, @julian-risch, @kacperlukawski, @maxdswain, @OGuggenbuehl, @OiPunk, @sjrl, @srini047, @VedantMadane