Release Notes

v2.24.0-rc1

Upgrade Notes

Rankers now deduplicate documents by id. This is a breaking change only for users who set custom document ids and rely on documents sharing the same id all being kept in the ranking output. To keep the previous behavior, ensure that your user-defined document ids are unique across retriever outputs.
Affected Rankers: HuggingFaceTEIRanker, LostInTheMiddleRanker, MetaFieldRanker, MetaFieldGroupingRanker, SentenceTransformersDiversityRanker, SentenceTransformersSimilarityRanker, TransformersSimilarityRanker.
Deduplication behavior in MultiQueryEmbeddingRetriever and MultiQueryTextRetriever has changed and may be breaking for users who relied on deduplication based on document content rather than document id.
Documents are now considered duplicates only if they share the same id. Document ids are either user-defined or automatically generated as a hash of the document's attributes (such as content and metadata).
This change affects setups where multiple documents have identical content but different ids (for example, due to differing metadata). To preserve the previous behavior, ensure that documents with identical content are assigned the same id across retriever outputs.
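
The difference between content-based and id-based deduplication can be illustrated with a minimal, library-independent sketch. The `Doc` class, its hashing scheme, `content_id`, and `dedupe_by_id` are hypothetical stand-ins and not Haystack's actual implementation; the only assumption carried over from the note above is that an auto-generated id depends on both content and metadata.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not the Haystack Document class."""

    content: str
    meta: dict = field(default_factory=dict)
    id: str = ""

    def __post_init__(self):
        # Assumption: when no id is given, derive one from content AND metadata.
        if not self.id:
            payload = json.dumps({"content": self.content, "meta": self.meta}, sort_keys=True)
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe_by_id(docs):
    """Keep the first document seen for each id (an arbitrary choice for this sketch)."""
    seen = {}
    for doc in docs:
        seen.setdefault(doc.id, doc)
    return list(seen.values())


def content_id(text):
    """Hypothetical helper: an id that depends on the content only."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


text = "Bob lives in Paris."

# Same content, different metadata -> different auto-generated ids -> both are kept.
kept_auto = dedupe_by_id([Doc(text, {"source": "bm25"}), Doc(text, {"source": "embedding"})])

# Same content with an explicit, content-derived id -> treated as duplicates.
kept_shared = dedupe_by_id(
    [
        Doc(text, {"source": "bm25"}, id=content_id(text)),
        Doc(text, {"source": "embedding"}, id=content_id(text)),
    ]
)

print(len(kept_auto), len(kept_shared))  # 2 1
```

Assigning a content-derived id, as in the second case, is one way to approximate the earlier content-based behavior.
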
Removed the deprecated deserialize_document_store_in_init_params_inplace function. This function was deprecated in Haystack 2.23.0 and is no longer used.

New Features

Pipelines now natively support connecting multiple outputs directly to a single component input without requiring an explicit Joiner component. This only works when the connected outputs and inputs are of compatible list types, such as list[Document].

This simplifies pipeline definitions when multiple components produce compatible outputs. For example, multiple outputs from a FileTypeRouter can now be connected directly to a single converter or writer, without defining an intermediate ListJoiner or DocumentJoiner.

from haystack import Pipeline
from haystack.components.converters import HTMLToDocument, TextFileToDocument
from haystack.components.routers import FileTypeRouter
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import ByteStream
from haystack.document_stores.in_memory import InMemoryDocumentStore

sources = [
    ByteStream.from_string(text="Text file content", mime_type="text/plain", meta={"file_type": "txt"}),
    ByteStream.from_string(
        text="\n<html><body>Some content</body></html>\n", mime_type="text/html", meta={"file_type": "html"},
    ),
]

doc_store = InMemoryDocumentStore()
pipe = Pipeline()

pipe.add_component("router", FileTypeRouter(mime_types=["text/plain", "text/html"]))
pipe.add_component("txt_converter", TextFileToDocument())
pipe.add_component("html_converter", HTMLToDocument())
pipe.add_component("writer", DocumentWriter(doc_store))

pipe.connect("router.text/plain", "txt_converter.sources")
pipe.connect("router.text/html", "html_converter.sources")
# The DocumentWriter accepts documents from both converters without needing a DocumentJoiner
pipe.connect("txt_converter.documents", "writer.documents")
pipe.connect("html_converter.documents", "writer.documents")

result = pipe.run({"router": {"sources": sources}})
# result["writer"]["documents_written"] == 2

Pipelines now support connection and automatic conversion between ChatMessage and str types.
- When a str output is connected to a ChatMessage input, it is automatically converted to a user ChatMessage.
- When a ChatMessage output is connected to a str input, its text attribute is automatically extracted. If text is None, an informative PipelineRuntimeError is raised.
- To maintain backward compatibility, when multiple connections are available, strict type matching is prioritized over conversion.
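
The two conversion rules above can be sketched with stand-in types. `Message`, `ConversionError`, `str_to_message`, and `message_to_str` are hypothetical names used only for illustration; in a real pipeline the conversion happens internally with ChatMessage and PipelineRuntimeError.

```python
from dataclasses import dataclass
from typing import Optional


class ConversionError(RuntimeError):
    """Hypothetical stand-in for the PipelineRuntimeError mentioned above."""


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage with only a role and optional text."""

    role: str
    text: Optional[str] = None


def str_to_message(value: str) -> Message:
    # A str sent to a ChatMessage input becomes a user message.
    return Message(role="user", text=value)


def message_to_str(message: Message) -> str:
    # A ChatMessage sent to a str input contributes its text, which must exist.
    if message.text is None:
        raise ConversionError("Cannot convert a message without text to str.")
    return message.text


converted = str_to_message("What is the capital of France?")
extracted = message_to_str(Message(role="assistant", text="Paris"))

try:
    message_to_str(Message(role="assistant"))  # no text, e.g. a tool-call-only message
    failed = False
except ConversionError:
    failed = True
```
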

Pipelines now support list wrapping: a component returning type T can be connected to a component expecting type list[T]. The output will be wrapped automatically.

In addition, Pipelines support automatic conversion between list[T] and T, for str and ChatMessage types only. When converting from list[T] to T, the first element of the list is used. If the list is empty, an informative PipelineRuntimeError is raised.
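
As a rough illustration of the wrapping and unwrapping rules, the following sketch uses two hypothetical helpers, `wrap` and `unwrap_first`. It raises a plain RuntimeError where a real pipeline would raise PipelineRuntimeError, and it does not enforce the restriction to str and ChatMessage types.

```python
from typing import List, TypeVar

T = TypeVar("T")


def wrap(value: T) -> List[T]:
    # T -> list[T]: the single value is wrapped in a one-element list.
    return [value]


def unwrap_first(values: List[T]) -> T:
    # list[T] -> T: the first element is used; an empty list is an error.
    if not values:
        raise RuntimeError("Cannot convert an empty list to a single value.")
    return values[0]


wrapped = wrap("keyword query")
first = unwrap_first(["first reply", "second reply"])

try:
    unwrap_first([])
    empty_failed = False
except RuntimeError:
    empty_failed = True
```

Because only the first element is used, any additional items in the list are silently dropped by this kind of conversion.
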

Together with other recent changes, this makes pipelines more flexible and removes the need for explicit adapter components in many cases. For example, the following pipeline automatically converts the list[ChatMessage] produced by the LLM into the str expected by the retriever, which previously required an OutputAdapter component.

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator

document_store = InMemoryDocumentStore()

documents = [
    Document(content="Bob lives in Paris."),
    Document(content="Alice lives in London."),
    Document(content="Ivy lives in Melbourne."),
    Document(content="Kate lives in Brisbane."),
    Document(content="Liam lives in Adelaide."),
]

document_store.write_documents(documents)

template = """{% message role="user" %}
Rewrite the following query to be used for keyword search.
{{ query }}
{% endmessage %}
"""

p = Pipeline()
p.add_component("prompt_builder", ChatPromptBuilder(template=template))
p.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))

p.connect("prompt_builder", "llm")
p.connect("llm", "retriever")

query = """Someday I'd love to visit Brisbane, but for now I just want
to know the names of the people who live there."""

results = p.run(data={"prompt_builder": {"query": query}})

Enhancement Notes

Add run_async method to SearchApiWebSearch.
Add run_async method to SerperDevWebSearch.
Add new DocumentStore standard tests for the following operations: delete_all_documents(), update_by_filter(), delete_by_filter()
The InMemoryDocumentStore now has three new operations: delete_all_documents(), update_by_filter(), and delete_by_filter().

Rankers now deduplicate documents by id before ranking, preventing identical documents from being scored multiple times in hybrid retrieval setups and keeping ranking outputs more consistent.

It also means you can connect multiple retriever outputs directly to a Ranker without inserting a DocumentJoiner just to avoid duplicates. For example:

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
text_embedder = SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5")

embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)
ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("ranker", ranker)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect("embedding_retriever", "ranker")
hybrid_retrieval.connect("bm25_retriever", "ranker")

query = "apnea in infants"
result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
)

Add strip_whitespaces and replace_regexes parameters to the DocumentCleaner component.
The strip_whitespaces parameter removes leading and trailing whitespace from document content using Python's str.strip() method. Unlike remove_extra_whitespaces, it only affects the beginning and end of the text and preserves internal whitespace, which is useful for maintaining markdown formatting.
The replace_regexes parameter accepts a dictionary mapping regex patterns to replacement strings, allowing custom text transformations. For example, {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline. This is applied after remove_regex and provides more flexibility than simple pattern removal.
Example usage:
```
from haystack.components.preprocessors import DocumentCleaner
from haystack.dataclasses import Document

cleaner = DocumentCleaner(
    strip_whitespaces=True,
    replace_regexes={r'\n\n+': '\n'}
)

doc = Document(content="  \n\nHello World\n\n\n  ")
result = cleaner.run(documents=[doc])
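# Leading and trailing whitespace is stripped, so the cleaned content is "Hello World":
# result["documents"][0].content == "Hello World"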
```
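To show how the two parameters interact, here is a minimal sketch in plain Python. The `clean` helper is hypothetical and covers only these two options; the order used here (strip first, then regex replacement) is an assumption, and the real component has further options that this sketch ignores.

```python
import re


def clean(text, strip_whitespaces=False, replace_regexes=None):
    """Hypothetical helper mirroring only the two options described above."""
    if strip_whitespaces:
        # Only leading and trailing whitespace is removed.
        text = text.strip()
    for pattern, replacement in (replace_regexes or {}).items():
        # Each pattern is replaced in turn, in dictionary order.
        text = re.sub(pattern, replacement, text)
    return text


raw = "  \n\n# Title\n\n\n\nBody  text\n\n  "
cleaned = clean(raw, strip_whitespaces=True, replace_regexes={r"\n\n+": "\n"})
print(repr(cleaned))  # '# Title\nBody  text'
```

The double space inside "Body  text" survives because neither option touches internal spaces, while the run of newlines between the two lines collapses to one.
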
The MultiQueryEmbeddingRetriever and MultiQueryTextRetriever now deduplicate documents by id instead of by content, preventing identical documents from being returned multiple times.

Deprecation Notes

Deprecated PipelineTemplate, PredefinedPipeline and its options (like PredefinedPipeline.CHAT_WITH_WEBSITE). These templates will be removed in Haystack 2.25. Users should switch to using Pipeline YAML files.

Bug Fixes

Remove outdated Marqo document store links from the document store guide and docs sidebars to avoid broken integration URLs.

💙 Big thank you to everyone who contributed to this release!

@agnieszka-m, @Amanbig, @anakin87, @bilgeyucel, @Bobholamovic, @bogdankostic, @davidsbatista, @julian-risch, @kacperlukawski, @maxdswain, @OiPunk, @sjrl, @VedantMadane

deepset-ai/haystack v2.24.0-rc1 on GitHub