Release Notes
v2.24.0-rc1
Upgrade Notes
- Deduplication in Rankers is a breaking change for users who rely on keeping duplicate documents with the same user-defined id in the ranking output. This change only affects users with custom document ids who want duplicates preserved. To keep the previous behavior, ensure that your user-defined document ids are unique across retriever outputs.
  Affected Rankers: HuggingFaceTEIRanker, LostInTheMiddleRanker, MetaFieldRanker, MetaFieldGroupingRanker, SentenceTransformersDiversityRanker, SentenceTransformersSimilarityRanker, TransformersSimilarityRanker.
- Deduplication behavior in MultiQueryEmbeddingRetriever and MultiQueryTextRetriever has changed and may be breaking for users who relied on deduplication based on document content rather than document id.
  Documents are now considered duplicates only if they share the same id. Document ids can be user-defined or are automatically generated as a hash of the document’s attributes (e.g. content, metadata, etc.).
  This change affects setups where multiple documents have identical content but different ids (for example, due to differing metadata). To preserve the previous behavior, ensure that documents with identical content are assigned the same id across retriever outputs.
- Removed the deprecated deserialize_document_store_in_init_params_inplace function. This function was deprecated in Haystack 2.23.0 and is no longer used.
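The difference between id-based and content-based deduplication described in the notes above can be illustrated with a small, library-free sketch. Everything in it is hypothetical: Doc is a stand-in for a Haystack Document, and dedup_by_id and dedup_by_content are illustrative helpers, not Haystack functions. Keeping the first occurrence of each duplicate is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""
    id: str
    content: str


def dedup_by_id(docs):
    """Treat documents as duplicates only when they share an id.

    Assumption: the first occurrence of each id is kept.
    """
    seen = {}
    for doc in docs:
        seen.setdefault(doc.id, doc)
    return list(seen.values())


def dedup_by_content(docs):
    """Treat documents as duplicates when they share the same content.

    Assumption: the first occurrence of each content string is kept.
    """
    seen = {}
    for doc in docs:
        seen.setdefault(doc.content, doc)
    return list(seen.values())


docs = [
    Doc(id="a", content="Bob lives in Paris."),
    Doc(id="b", content="Bob lives in Paris."),  # same content, different id
    Doc(id="a", content="Bob lives in Paris."),  # same id as the first document
]

by_id = dedup_by_id(docs)
by_content = dedup_by_content(docs)

print([d.id for d in by_id])       # ['a', 'b']
print([d.id for d in by_content])  # ['a']
```

With id-based deduplication, the two documents that differ only in id both survive. This is why documents with identical content need to share an id if only one copy should be returned.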
New Features
- Pipelines now natively support connecting multiple outputs directly to a single component input without requiring an explicit Joiner component. This only works when the connected outputs and inputs are of compatible list types, such as list[Document].
  This simplifies pipeline definitions when multiple components produce compatible outputs. For example, multiple outputs from a FileTypeRouter can now be connected directly to a single converter or writer, without defining an intermediate ListJoiner or DocumentJoiner.

      from haystack import Pipeline
      from haystack.components.converters import HTMLToDocument, TextFileToDocument
      from haystack.components.routers import FileTypeRouter
      from haystack.components.writers import DocumentWriter
      from haystack.dataclasses import ByteStream
      from haystack.document_stores.in_memory import InMemoryDocumentStore

      sources = [
          ByteStream.from_string(text="Text file content", mime_type="text/plain", meta={"file_type": "txt"}),
          ByteStream.from_string(
              text="\n<html><body>Some content</body></html>\n",
              mime_type="text/html",
              meta={"file_type": "html"},
          ),
      ]

      doc_store = InMemoryDocumentStore()

      pipe = Pipeline()
      pipe.add_component("router", FileTypeRouter(mime_types=["text/plain", "text/html"]))
      pipe.add_component("txt_converter", TextFileToDocument())
      pipe.add_component("html_converter", HTMLToDocument())
      pipe.add_component("writer", DocumentWriter(doc_store))

      pipe.connect("router.text/plain", "txt_converter.sources")
      pipe.connect("router.text/html", "html_converter.sources")

      # The DocumentWriter accepts documents from both converters without needing a DocumentJoiner
      pipe.connect("txt_converter.documents", "writer.documents")
      pipe.connect("html_converter.documents", "writer.documents")

      result = pipe.run({"router": {"sources": sources}})
      # result["writer"]["documents_written"] == 2
- Pipelines now support connection and automatic conversion between ChatMessage and str types.
  - When a str output is connected to a ChatMessage input, it is automatically converted to a user ChatMessage.
  - When a ChatMessage output is connected to a str input, its text attribute is automatically extracted. If text is None, an informative PipelineRuntimeError is raised.
  - To maintain backward compatibility, when multiple connections are available, strict type matching is prioritized over conversion.
- Pipelines now support list wrapping: a component returning type T can be connected to a component expecting type list[T]. The output will be wrapped automatically.
  In addition, Pipelines support automatic conversion between list[T] and T, for str and ChatMessage types only. When converting from list[T] to T, the first element of the list is used. If the list is empty, an informative PipelineRuntimeError is raised.
  With other recent changes, this makes pipelines more flexible and removes the need for explicit adapter components in many cases. For example, the following pipeline automatically converts a list[ChatMessage] produced by the LLM into a str expected by the retriever, which previously required an Output Adapter component.

      from haystack.document_stores.in_memory import InMemoryDocumentStore
      from haystack.dataclasses import Document
      from haystack.components.retrievers import InMemoryBM25Retriever
      from haystack import Pipeline
      from haystack.components.builders import ChatPromptBuilder
      from haystack.components.generators.chat import OpenAIChatGenerator

      document_store = InMemoryDocumentStore()
      documents = [
          Document(content="Bob lives in Paris."),
          Document(content="Alice lives in London."),
          Document(content="Ivy lives in Melbourne."),
          Document(content="Kate lives in Brisbane."),
          Document(content="Liam lives in Adelaide."),
      ]
      document_store.write_documents(documents)

      template = """{% message role="user" %}
      Rewrite the following query to be used for keyword search.
      {{ query }}
      {% endmessage %}
      """

      p = Pipeline()
      p.add_component("prompt_builder", ChatPromptBuilder(template=template))
      p.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))
      p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))

      p.connect("prompt_builder", "llm")
      p.connect("llm", "retriever")

      query = """Someday I'd love to visit Brisbane, but for now I just want to know the names of the people who live there."""

      results = p.run(data={"prompt_builder": {"query": query}})
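The conversion rules listed in the features above (str to ChatMessage, ChatMessage to str, wrapping T into list[T], and taking the first element of list[T]) can be sketched without Haystack. The names below are hypothetical stand-ins: Message models only the role and text of a ChatMessage, and ConversionError plays the part of PipelineRuntimeError. This sketch only shows the rules as the notes describe them; it is not how Pipelines implement them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage; only role and text are modeled."""
    role: str
    text: Optional[str]


class ConversionError(RuntimeError):
    """Hypothetical stand-in for PipelineRuntimeError."""


def str_to_message(value: str) -> Message:
    # A str connected to a ChatMessage input becomes a user message.
    return Message(role="user", text=value)


def message_to_str(message: Message) -> str:
    # A ChatMessage connected to a str input yields its text, or fails if there is none.
    if message.text is None:
        raise ConversionError("message has no text to extract")
    return message.text


def wrap(value):
    # T -> list[T]: the single value is wrapped in a list.
    return [value]


def first(values: list):
    # list[T] -> T: the first element is used; an empty list is an error.
    if not values:
        raise ConversionError("cannot convert an empty list to a single value")
    return values[0]


# list[ChatMessage] -> str, mirroring the LLM -> retriever connection
replies = [Message(role="assistant", text="people living in Brisbane")]
query = message_to_str(first(replies))
print(query)  # people living in Brisbane
```

The two failure cases (a message without text, an empty list) raise an error instead of silently passing None downstream, matching the behavior described in the notes.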
Enhancement Notes
- Add a run_async method to SearchApiWebSearch and SerperDevWebSearch.
- Add new DocumentStore standard tests for the following operations: delete_all_documents(), update_by_filter(), delete_by_filter().
- The InMemoryDocumentStore now has three new operations: delete_all_documents(), update_by_filter() and delete_by_filter().
- Rankers now deduplicate documents by id before ranking, preventing identical documents from being scored multiple times in hybrid retrieval setups and keeping ranking outputs more consistent.
  It also means you can connect multiple retriever outputs directly to a Ranker without inserting a DocumentJoiner just to avoid duplicates. For example:

      from haystack import Pipeline
      from haystack.components.embedders import SentenceTransformersTextEmbedder
      from haystack.components.rankers import TransformersSimilarityRanker
      from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
      from haystack.document_stores.in_memory import InMemoryDocumentStore

      document_store = InMemoryDocumentStore()

      text_embedder = SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5")
      embedding_retriever = InMemoryEmbeddingRetriever(document_store)
      bm25_retriever = InMemoryBM25Retriever(document_store)
      ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

      hybrid_retrieval = Pipeline()
      hybrid_retrieval.add_component("text_embedder", text_embedder)
      hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
      hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
      hybrid_retrieval.add_component("ranker", ranker)

      hybrid_retrieval.connect("text_embedder", "embedding_retriever")
      hybrid_retrieval.connect("embedding_retriever", "ranker")
      hybrid_retrieval.connect("bm25_retriever", "ranker")

      query = "apnea in infants"
      result = hybrid_retrieval.run(
          {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
      )
- Add strip_whitespaces and replace_regexes parameters to the DocumentCleaner component.
  The strip_whitespaces parameter removes leading and trailing whitespace from document content using Python's str.strip() method. Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace, which is useful for maintaining Markdown formatting.
  The replace_regexes parameter accepts a dictionary mapping regex patterns to replacement strings, allowing custom text transformations. For example, {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline. This is applied after remove_regex and provides more flexibility than simple pattern removal.
  Example usage:

      from haystack.components.preprocessors import DocumentCleaner
      from haystack.dataclasses import Document

      cleaner = DocumentCleaner(
          strip_whitespaces=True,
          replace_regexes={r'\n\n+': '\n'},
      )
      doc = Document(content=" \n\nHello World\n\n\n ")
      result = cleaner.run(documents=[doc])
      # Cleaned content: "Hello World" (str.strip() also removes the leading and trailing newlines)
- The MultiQueryEmbeddingRetriever and MultiQueryTextRetriever now deduplicate documents by id instead of by content, preventing identical documents from being returned multiple times.
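For the DocumentCleaner parameters described above, the following standard-library sketch shows what a replace_regexes dictionary and str.strip() do to a Markdown snippet. The helper apply_replacements is hypothetical, and applying the replacements before stripping is an assumption made for illustration. The notes only state that replace_regexes runs after remove_regex.

```python
import re


def apply_replacements(text: str, replace_regexes: dict) -> str:
    """Apply each regex -> replacement pair in turn (hypothetical helper)."""
    for pattern, replacement in replace_regexes.items():
        text = re.sub(pattern, replacement, text)
    return text


raw = "  # Title\n\n\n\n- item one\n    - nested item\n\n  "

# Collapse runs of two or more newlines into a single newline.
collapsed = apply_replacements(raw, {r"\n\n+": "\n"})

# Strip only the leading and trailing whitespace; internal indentation is kept.
cleaned = collapsed.strip()

print(repr(cleaned))  # '# Title\n- item one\n    - nested item'
```

The four-space indentation of the nested list item survives, which illustrates why stripping only the ends of the text is gentler on Markdown than collapsing all repeated whitespace.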
Deprecation Notes
- Deprecated PipelineTemplate, PredefinedPipeline and its options (like PredefinedPipeline.CHAT_WITH_WEBSITE). These templates will be removed in Haystack 2.25. Users should switch to using Pipeline YAML files.
Bug Fixes
- Remove outdated Marqo document store links from the document store guide and docs sidebars to avoid broken integration URLs.
💙 Big thank you to everyone who contributed to this release!
@agnieszka-m, @Amanbig, @anakin87, @bilgeyucel, @Bobholamovic, @bogdankostic, @davidsbatista, @julian-risch, @kacperlukawski, @maxdswain, @OiPunk, @sjrl, @VedantMadane