⭐️ Highlights
🔌 Pipelines got simpler
With the updated logic, Pipelines can now:
- connect multiple `list[T]` component outputs directly to a single `list[T]` input of the next component, simplifying pipeline definitions when multiple components produce compatible outputs. For example, you can directly connect multiple converters to a writer component in ingestion pipelines without a `DocumentJoiner` component.
- support automatic conversion between `ChatMessage` and `str` types, enabling simpler connections between various components. For example, you can easily connect an Agent component (which returns a `ChatMessage` as `last_message`) to a text embedder component (which expects a `str` as query) without an `OutputAdapter` component.
- automatically convert `list[ChatMessage]` to `ChatMessage` and `list[str]` to `str` by taking the first element of the list. Another supported conversion is `list[ChatMessage]` to `str`, enabling the connection between a chat generator (which returns `list[ChatMessage]` as `messages`) and a BM25 retriever (which expects a `str` as query).
- perform list wrapping: a component returning type `T` can be connected to a component expecting type `list[T]`.
Together, these changes eliminate the need for `OutputAdapter` and many joiners (`ListJoiner`, `DocumentJoiner`) in common setups such as query rewriting and hybrid search.
🏁 Rankers handle duplicate documents (more consistent hybrid retrieval)
Rankers now deduplicate documents by `id` before ranking, preventing the same document from being scored multiple times in hybrid retrieval setups and eliminating the need for a `DocumentJoiner` after recent pipeline updates.
⚠️ Breaking change: `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` now also deduplicate by `id` (not by content). Documents are only considered duplicates if they share the same `id`.
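To make the new behavior concrete, here is a minimal plain-Python sketch (illustrative only, not the retrievers' actual implementation) showing why two documents with identical content but different metadata get different hash-based ids and therefore both survive deduplication:

```python
import hashlib


def doc_id(content: str, meta: dict) -> str:
    # Mimics an id automatically derived from a hash of the document's attributes.
    return hashlib.sha256(f"{content}|{sorted(meta.items())}".encode()).hexdigest()


def deduplicate_by_id(docs: list[dict]) -> list[dict]:
    # Keep the first occurrence of each id; later duplicates are dropped.
    seen, unique = set(), []
    for doc in docs:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            unique.append(doc)
    return unique


a = {"content": "Paris", "meta": {"source": "web"}}
b = {"content": "Paris", "meta": {"source": "pdf"}}  # same content, different meta
a["id"] = doc_id(a["content"], a["meta"])
b["id"] = doc_id(b["content"], b["meta"])

# Different metadata -> different ids -> both documents are kept.
result = deduplicate_by_id([a, b, a])
```

If you need content-based deduplication back, assign the same user-defined `id` to documents with identical content before they reach the retriever.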
⬆️ Upgrade Notes
- Deduplication in Rankers is a breaking change for users who rely on keeping duplicate documents with the same user-defined `id` in the ranking output. This change only affects users with custom document ids who want duplicates preserved. To keep the previous behavior, ensure that your user-defined document ids are unique across retriever outputs.
  Affected Rankers: `HuggingFaceTEIRanker`, `LostInTheMiddleRanker`, `MetaFieldRanker`, `MetaFieldGroupingRanker`, `SentenceTransformersDiversityRanker`, `SentenceTransformersSimilarityRanker`, `TransformersSimilarityRanker`.
- Deduplication behavior in `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` has changed and may be breaking for users who relied on deduplication based on document content rather than document id.
  Documents are now considered duplicates only if they share the same `id`. Document ids can be user-defined or are automatically generated as a hash of the document's attributes (e.g. content, metadata, etc.).
  This change affects setups where multiple documents have identical content but different ids (for example, due to differing metadata). To preserve the previous behavior, ensure that documents with identical content are assigned the same `id` across retriever outputs.
- Removed the deprecated `deserialize_document_store_in_init_params_inplace` function. This function was deprecated in Haystack 2.23.0 and is no longer used.
🚀 New Features
- Pipelines now natively support connecting multiple outputs directly to a single component input without requiring an explicit Joiner component. This only works when the connected outputs and inputs are of compatible list types, such as `list[Document]`.
  This simplifies pipeline definitions when multiple components produce compatible outputs. For example, multiple outputs from a `FileTypeRouter` can now be connected directly to a single converter or writer, without defining an intermediate `ListJoiner` or `DocumentJoiner`.

  ```python
  from haystack import Pipeline
  from haystack.components.converters import HTMLToDocument, TextFileToDocument
  from haystack.components.routers import FileTypeRouter
  from haystack.components.writers import DocumentWriter
  from haystack.dataclasses import ByteStream
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  sources = [
      ByteStream.from_string(text="Text file content", mime_type="text/plain", meta={"file_type": "txt"}),
      ByteStream.from_string(
          text="\n<html><body>Some content</body></html>\n",
          mime_type="text/html",
          meta={"file_type": "html"},
      ),
  ]

  doc_store = InMemoryDocumentStore()

  pipe = Pipeline()
  pipe.add_component("router", FileTypeRouter(mime_types=["text/plain", "text/html"]))
  pipe.add_component("txt_converter", TextFileToDocument())
  pipe.add_component("html_converter", HTMLToDocument())
  pipe.add_component("writer", DocumentWriter(doc_store))

  pipe.connect("router.text/plain", "txt_converter.sources")
  pipe.connect("router.text/html", "html_converter.sources")

  # The DocumentWriter accepts documents from both converters without needing a DocumentJoiner
  pipe.connect("txt_converter.documents", "writer.documents")
  pipe.connect("html_converter.documents", "writer.documents")

  result = pipe.run({"router": {"sources": sources}})
  # result["writer"]["documents_written"] == 2
  ```
- Pipelines now support connection and automatic conversion between `ChatMessage` and `str` types.
  - When a `str` output is connected to a `ChatMessage` input, it is automatically converted to a user `ChatMessage`.
  - When a `ChatMessage` output is connected to a `str` input, its `text` attribute is automatically extracted. If `text` is `None`, an informative `PipelineRuntimeError` is raised.
  - To maintain backward compatibility, when multiple connections are available, strict type matching is prioritized over conversion.
- Pipelines now support list wrapping: a component returning type `T` can be connected to a component expecting type `list[T]`. The output will be wrapped automatically.
  In addition, Pipelines support automatic conversion between `list[T]` and `T`, for `str` and `ChatMessage` types only. When converting from `list[T]` to `T`, the first element of the list is used. If the list is empty, an informative `PipelineRuntimeError` is raised.
  Together with other recent changes, this makes pipelines more flexible and removes the need for explicit adapter components in many cases. For example, the following pipeline automatically converts the `list[ChatMessage]` produced by the LLM into the `str` expected by the retriever, which previously required an `OutputAdapter` component.

  ```python
  from haystack import Pipeline
  from haystack.components.builders import ChatPromptBuilder
  from haystack.components.generators.chat import OpenAIChatGenerator
  from haystack.components.retrievers import InMemoryBM25Retriever
  from haystack.dataclasses import Document
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  document_store = InMemoryDocumentStore()
  documents = [
      Document(content="Bob lives in Paris."),
      Document(content="Alice lives in London."),
      Document(content="Ivy lives in Melbourne."),
      Document(content="Kate lives in Brisbane."),
      Document(content="Liam lives in Adelaide."),
  ]
  document_store.write_documents(documents)

  template = """{% message role="user" %}
  Rewrite the following query to be used for keyword search.
  {{ query }}
  {% endmessage %}
  """

  p = Pipeline()
  p.add_component("prompt_builder", ChatPromptBuilder(template=template))
  p.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))
  p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))
  p.connect("prompt_builder", "llm")
  p.connect("llm", "retriever")

  query = """Someday I'd love to visit Brisbane, but for now I just want to know the names of the people who live there."""

  results = p.run(data={"prompt_builder": {"query": query}})
  ```
- Introduced the `MarkdownHeaderSplitter` component:
  - Splits documents into chunks at Markdown headers (`#`, `##`, etc.), preserving header hierarchy as metadata.
  - Supports secondary splitting (by word, passage, period, or line) for further chunking after header-based splitting, using Haystack's `DocumentSplitter`.
  - Preserves and propagates metadata such as parent headers and page numbers.
  - Handles edge cases such as documents with no headers, empty content, and non-text documents.
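To make header-based splitting concrete, here is a minimal plain-Python sketch of the core idea (not the component's actual implementation): split text at Markdown headers and attach the current header to each chunk as metadata.

```python
import re


def split_on_headers(text: str) -> list[dict]:
    """Split Markdown text at headers, attaching each chunk's header as metadata."""
    chunks, header, lines = [], None, []

    def flush():
        # Emit the accumulated lines as one chunk under the current header.
        if lines:
            chunks.append({"content": "\n".join(lines).strip(), "meta": {"header": header}})

    for line in text.splitlines():
        match = re.match(r"(#+)\s+(.*)", line)
        if match:
            flush()
            header, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return chunks


chunks = split_on_headers("# Intro\nHello.\n## Details\nMore text.")
```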
- Added support for Chat Messages that include files using the `FileContent` dataclass to `OpenAIResponsesChatGenerator` and `AzureOpenAIResponsesChatGenerator`. Users can now pass files such as PDFs when using Haystack Chat Generators based on the Responses API.
- User Chat Messages can now include files using the new `FileContent` dataclass. Most API-based LLMs support file inputs such as PDFs (and, for some models, additional file types).
  For now, this feature is implemented for `OpenAIChatGenerator` and `AzureOpenAIChatGenerator`, with support for more model providers coming soon.
  For advanced PDF handling, such as reducing image size or selecting specific page ranges, we recommend using the `PDFToImageContent` component instead.
  `FileContent` example:

  ```python
  from haystack.components.generators.chat.openai import OpenAIChatGenerator
  from haystack.dataclasses.chat_message import ChatMessage
  from haystack.dataclasses.file_content import FileContent

  file_content = FileContent.from_url("https://arxiv.org/pdf/2309.08632")
  chat_message = ChatMessage.from_user(content_parts=[file_content, "Summarize this paper in 100 words."])

  llm = OpenAIChatGenerator(model="gpt-4.1-mini")
  response = llm.run(messages=[chat_message])
  ```
⚡️ Enhancement Notes
- Resolve postponed type annotations (from `from __future__ import annotations`) when creating component input sockets, so pipelines can correctly match compatible types. This fixes cases where connecting `ChatPromptBuilder` to `FallbackChatGenerator` failed because the generator's annotations were interpreted as strings (for example `'list[ChatMessage]'`), resulting in a `PipelineConnectError` due to mismatched socket types.
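For context, a short plain-Python illustration of the underlying mechanism (this mirrors the idea, not Haystack's exact code): postponed annotations are stored as strings, and `typing.get_type_hints` resolves them back into real type objects so compatible sockets can be matched.

```python
import typing


def run(messages: "list[str]") -> dict:
    # The string annotation here is equivalent to what
    # `from __future__ import annotations` produces module-wide.
    return {"replies": messages}


# Raw __annotations__ holds the plain string, which naive
# type comparison would treat as a mismatched socket type.
raw = run.__annotations__["messages"]

# get_type_hints resolves the string into the actual type object.
resolved = typing.get_type_hints(run)["messages"]
```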
- Agent components allow grouping multiple tools under a single confirmation strategy.
  Here is an example of how three tools can be grouped under a `BlockingConfirmationStrategy`:

  ```python
  confirmation_strategies = {("tool1", "tool2", "tool3"): BlockingConfirmationStrategy()}
  ```

  instead of previously needing:

  ```python
  confirmation_strategies = {
      "tool1": BlockingConfirmationStrategy(),
      "tool2": BlockingConfirmationStrategy(),
      "tool3": BlockingConfirmationStrategy(),
  }
  ```
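The grouped form can be read as shorthand for the per-tool mapping. A hedged plain-Python sketch of that normalization (not the Agent's actual code):

```python
def expand_strategies(strategies: dict) -> dict:
    """Expand tuple keys so each tool name maps to its own strategy instance."""
    expanded = {}
    for key, strategy in strategies.items():
        names = key if isinstance(key, tuple) else (key,)
        for name in names:
            expanded[name] = strategy
    return expanded


shared = object()  # stand-in for a BlockingConfirmationStrategy instance
expanded = expand_strategies({("tool1", "tool2", "tool3"): shared, "tool4": shared})
```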
- Add `run_async` method to `SearchApiWebSearch` and `SerperDevWebSearch`.
- Add new DocumentStore standard tests for the following operations: `delete_all_documents()`, `update_by_filter()`, `delete_by_filter()`.
- The `InMemoryDocumentStore` now has three new operations: `delete_all_documents()`, `update_by_filter()`, and `delete_by_filter()`.
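To illustrate what filter-based deletion does, here is a minimal plain-Python sketch over an in-memory list (illustrative only; the real `delete_by_filter()` uses Haystack's standard filter syntax with `field`, `operator`, and `value` keys):

```python
def matches(doc: dict, filters: dict) -> bool:
    """Minimal '==' comparison filter, a tiny subset of the real filter syntax."""
    return doc.get("meta", {}).get(filters["field"]) == filters["value"]


def delete_by_filter(store: list[dict], filters: dict) -> list[dict]:
    """Return the store with every document matching the filter removed."""
    return [doc for doc in store if not matches(doc, filters)]


store = [
    {"content": "Hallo", "meta": {"lang": "de"}},
    {"content": "Hello", "meta": {"lang": "en"}},
]
store = delete_by_filter(store, {"field": "lang", "operator": "==", "value": "de"})
```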
- Rankers now deduplicate documents by `id` before ranking, preventing identical documents from being scored multiple times in hybrid retrieval setups and keeping ranking outputs more consistent.
  It also means you can connect multiple retriever outputs directly to a Ranker without inserting a `DocumentJoiner` just to avoid duplicates. For example:

  ```python
  from haystack import Pipeline
  from haystack.components.embedders import SentenceTransformersTextEmbedder
  from haystack.components.rankers import TransformersSimilarityRanker
  from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
  from haystack.document_stores.in_memory import InMemoryDocumentStore

  document_store = InMemoryDocumentStore()

  text_embedder = SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5")
  embedding_retriever = InMemoryEmbeddingRetriever(document_store)
  bm25_retriever = InMemoryBM25Retriever(document_store)
  ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

  hybrid_retrieval = Pipeline()
  hybrid_retrieval.add_component("text_embedder", text_embedder)
  hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
  hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
  hybrid_retrieval.add_component("ranker", ranker)

  hybrid_retrieval.connect("text_embedder", "embedding_retriever")
  hybrid_retrieval.connect("embedding_retriever", "ranker")
  hybrid_retrieval.connect("bm25_retriever", "ranker")

  query = "apnea in infants"

  result = hybrid_retrieval.run(
      {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
  )
  ```
- Add `strip_whitespaces` and `replace_regexes` parameters to the DocumentCleaner component.
  The `strip_whitespaces` parameter removes leading and trailing whitespace from document content using Python's `str.strip()` method. Unlike `remove_extra_whitespaces`, this only affects the beginning and end of the text, preserving internal whitespace, which is useful for maintaining Markdown formatting.
  The `replace_regexes` parameter accepts a dictionary mapping regex patterns to replacement strings, allowing custom text transformations. For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline. This is applied after `remove_regex` and provides more flexibility than simple pattern removal.
  Example usage:

  ```python
  from haystack.components.preprocessors import DocumentCleaner
  from haystack.dataclasses import Document

  cleaner = DocumentCleaner(
      strip_whitespaces=True,
      replace_regexes={r'\n\n+': '\n'}
  )
  doc = Document(content=" \n\nHello World\n\n\n ")
  result = cleaner.run(documents=[doc])
  # Result: "Hello World\n"
  ```
- The `MultiQueryEmbeddingRetriever` and `MultiQueryTextRetriever` now deduplicate documents by `id` instead of by `content`, preventing identical documents from being returned multiple times.
Deprecation Notes
- Deprecated `PipelineTemplate`, `PredefinedPipeline` and its options (like `PredefinedPipeline.CHAT_WITH_WEBSITE`). These templates will be removed in Haystack 2.25. Users should switch to using Pipeline YAML files.
Bug Fixes
- Remove outdated Marqo document store links from the document store guide and docs sidebars to avoid broken integration URLs.
- Ensure `Pipeline` and `AsyncPipeline` deep-copy component inputs before execution so mutable outputs (e.g., `Document` dataclasses) shared across multiple downstream components don't get mutated by reference. This prevents side effects where one component's in-place modifications could unexpectedly affect other branches in the pipeline.
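A small plain-Python illustration of the class of bug this fixes (not Haystack's actual code): without a deep copy, a branch that mutates its input in place leaks that change into every other consumer of the same object.

```python
import copy


def uppercase_branch(docs: list[dict]) -> list[dict]:
    # Mutates its input in place, like a component modifying shared Document objects.
    for doc in docs:
        doc["content"] = doc["content"].upper()
    return docs


# Without a deep copy, the mutation leaks into the shared input...
shared_input = [{"content": "hello"}]
uppercase_branch(shared_input)
leaked = shared_input[0]["content"]

# ...whereas deep-copying per branch leaves the shared input untouched.
shared_input = [{"content": "hello"}]
uppercase_branch(copy.deepcopy(shared_input))
preserved = shared_input[0]["content"]
```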
💙 Big thank you to everyone who contributed to this release!
@agnieszka-m, @Amanbig, @anakin87, @bilgeyucel, @Bobholamovic, @bogdankostic, @davidsbatista, @julian-risch, @kacperlukawski, @maxdswain, @OGuggenbuehl, @OiPunk, @sjrl, @srini047, @VedantMadane