github deepset-ai/haystack v2.24.0-rc1

pre-release, 16 hours ago

Release Notes

v2.24.0-rc1

Upgrade Notes

  • Deduplication in Rankers is a breaking change for users who rely on keeping duplicate documents with the same user-defined id in the ranking output. This change only affects users with custom document ids who want duplicates preserved. To keep the previous behavior, ensure that your user-defined document ids are unique across retriever outputs.

    Affected Rankers: HuggingFaceTEIRanker, LostInTheMiddleRanker, MetaFieldRanker, MetaFieldGroupingRanker, SentenceTransformersDiversityRanker, SentenceTransformersSimilarityRanker, TransformersSimilarityRanker.

  • Deduplication behavior in MultiQueryEmbeddingRetriever and MultiQueryTextRetriever has changed and may be breaking for users who relied on deduplication based on document content rather than document id.

    Documents are now considered duplicates only if they share the same id. Document ids are either user-defined or generated automatically as a hash of the document’s attributes (for example, content and metadata).

    This change affects setups where multiple documents have identical content but different ids (for example, due to differing metadata). To preserve the previous behavior, ensure that documents with identical content are assigned the same id across retriever outputs.

  • Removed the deprecated deserialize_document_store_in_init_params_inplace function. This function was deprecated in Haystack 2.23.0 and is no longer used.
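
The id-based deduplication described in the first two notes can be pictured with a plain-Python sketch. It does not use Haystack: `Doc`, `make_id`, and `dedup_by_id` are hypothetical names, the hashing only stands in for Haystack's automatic id generation, and keeping the first document seen for each id is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def make_id(content: str, meta: dict) -> str:
    """Hypothetical stand-in for an auto-generated id: a hash of content and metadata."""
    payload = content + repr(sorted(meta.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Doc:
    """Hypothetical, minimal stand-in for a document with an id."""
    content: str
    meta: dict = field(default_factory=dict)
    id: str = ""

    def __post_init__(self):
        # Keep a user-defined id if one was given, otherwise derive one from the attributes.
        if not self.id:
            self.id = make_id(self.content, self.meta)


def dedup_by_id(docs):
    """Keep the first document seen for each id, preserving order (an assumption)."""
    seen = {}
    for doc in docs:
        seen.setdefault(doc.id, doc)
    return list(seen.values())


a = Doc("Bob lives in Paris.", meta={"source": "bm25"})
b = Doc("Bob lives in Paris.", meta={"source": "embedding"})  # same content, different meta
c = Doc("Bob lives in Paris.", meta={"source": "bm25"})       # identical to `a`

unique = dedup_by_id([a, b, c])
print(len(unique))  # 2: `a` and `b` have different ids, `c` collapses into `a`

# Giving same-content documents the same explicit id makes them collapse again.
a2 = Doc("Bob lives in Paris.", meta={"source": "bm25"}, id="bob-paris")
b2 = Doc("Bob lives in Paris.", meta={"source": "embedding"}, id="bob-paris")
print(len(dedup_by_id([a2, b2])))  # 1
```

In this sketch, documents with identical content but different metadata survive deduplication unless they are given the same id, which mirrors the advice in the notes above.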

New Features

  • Pipelines now natively support connecting multiple outputs directly to a single component input without requiring an explicit Joiner component. This only works when the connected outputs and inputs are of compatible list types, such as list[Document].

    This simplifies pipeline definitions when multiple components produce compatible outputs. For example, multiple outputs from a FileTypeRouter can now be connected directly to a single converter or writer, without defining an intermediate ListJoiner or DocumentJoiner.

    from haystack import Pipeline
    from haystack.components.converters import HTMLToDocument, TextFileToDocument
    from haystack.components.routers import FileTypeRouter
    from haystack.components.writers import DocumentWriter
    from haystack.dataclasses import ByteStream
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    sources = [
        ByteStream.from_string(text="Text file content", mime_type="text/plain", meta={"file_type": "txt"}),
        ByteStream.from_string(
            text="\n<html><body>Some content</body></html>\n", mime_type="text/html", meta={"file_type": "html"},
        ),
    ]
    
    doc_store = InMemoryDocumentStore()
    pipe = Pipeline()
    
    pipe.add_component("router", FileTypeRouter(mime_types=["text/plain", "text/html"]))
    pipe.add_component("txt_converter", TextFileToDocument())
    pipe.add_component("html_converter", HTMLToDocument())
    pipe.add_component("writer", DocumentWriter(doc_store))
    
    pipe.connect("router.text/plain", "txt_converter.sources")
    pipe.connect("router.text/html", "html_converter.sources")
    # The DocumentWriter accepts documents from both converters without needing a DocumentJoiner
    pipe.connect("txt_converter.documents", "writer.documents")
    pipe.connect("html_converter.documents", "writer.documents")
    
    result = pipe.run({"router": {"sources": sources}})
    # result["writer"]["documents_written"] == 2

  • Pipelines now support connection and automatic conversion between ChatMessage and str types.

    • When a str output is connected to a ChatMessage input, it is automatically converted to a user ChatMessage.
    • When a ChatMessage output is connected to a str input, its text attribute is automatically extracted. If text is None, an informative PipelineRuntimeError is raised.
    • To maintain backward compatibility, when multiple connections are available, strict type matching is prioritized over conversion.

  • Pipelines now support list wrapping: a component returning type T can be connected to a component expecting type list[T]. The output will be wrapped automatically.

    In addition, Pipelines support automatic conversion between list[T] and T, for str and ChatMessage types only. When converting from list[T] to T, the first element of the list is used. If the list is empty, an informative PipelineRuntimeError is raised.

    With other recent changes, this makes pipelines more flexible and removes the need for explicit adapter components in many cases. For example, the following pipeline automatically converts a list[ChatMessage] produced by the LLM into a str expected by the retriever, which previously required an Output Adapter component.

    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack.dataclasses import Document
    from haystack.components.retrievers import InMemoryBM25Retriever
    from haystack import Pipeline
    from haystack.components.builders import ChatPromptBuilder
    from haystack.components.generators.chat import OpenAIChatGenerator
    document_store = InMemoryDocumentStore()
    
    documents = [
        Document(content="Bob lives in Paris."),
        Document(content="Alice lives in London."),
        Document(content="Ivy lives in Melbourne."),
        Document(content="Kate lives in Brisbane."),
        Document(content="Liam lives in Adelaide."),
    ]
    
    document_store.write_documents(documents)
    
    template = """{% message role="user" %}
    Rewrite the following query to be used for keyword search.
    {{ query }}
    {% endmessage %}
    """
    
    p = Pipeline()
    p.add_component("prompt_builder", ChatPromptBuilder(template=template))
    p.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))
    p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))
    
    p.connect("prompt_builder", "llm")
    p.connect("llm", "retriever")
    
    query = """Someday I'd love to visit Brisbane, but for now I just want
    to know the names of the people who live there."""
    
    results = p.run(data={"prompt_builder": {"query": query}})
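
The conversion rules listed in this section (str to ChatMessage, ChatMessage to str, T to list[T], and list[T] to T) can be summarized in a plain-Python sketch. It does not use Haystack: `Message` is a hypothetical stand-in for ChatMessage, `ConversionError` stands in for the PipelineRuntimeError mentioned above, and the helper names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class ConversionError(RuntimeError):
    """Hypothetical stand-in for the PipelineRuntimeError mentioned above."""


@dataclass
class Message:
    """Hypothetical, minimal stand-in for ChatMessage."""
    role: str
    text: Optional[str]


def str_to_message(value: str) -> Message:
    # A str sent to a ChatMessage input becomes a user message.
    return Message(role="user", text=value)


def message_to_str(message: Message) -> str:
    # A ChatMessage sent to a str input contributes its text; None is an error.
    if message.text is None:
        raise ConversionError("message has no text to extract")
    return message.text


def wrap(value):
    # T -> list[T]: the single value is wrapped in a list.
    return [value]


def first(values: list):
    # list[T] -> T: the first element is used; an empty list is an error.
    if not values:
        raise ConversionError("cannot take the first element of an empty list")
    return values[0]


# list[ChatMessage] -> str, as in the LLM-to-retriever connection above:
# take the first reply, then extract its text.
replies = [Message(role="assistant", text="people living in Brisbane")]
query = message_to_str(first(replies))
print(query)  # people living in Brisbane
```

The sketch leaves out the prioritization rule: when several connections are possible, strict type matches are preferred over any of these conversions.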

Enhancement Notes

  • Add run_async method to SearchApiWebSearch.

  • Add run_async method to SerperDevWebSearch.

  • Add new DocumentStore standard tests for the following operations: delete_all_documents(), update_by_filter(), delete_by_filter()

  • The InMemoryDocumentStore now has three new operations: delete_all_documents(), update_by_filter(), and delete_by_filter().

  • Rankers now deduplicate documents by id before ranking, preventing identical documents from being scored multiple times in hybrid retrieval setups and keeping ranking outputs more consistent.

    It also means you can connect multiple retriever outputs directly to a Ranker without inserting a DocumentJoiner just to avoid duplicates. For example:

    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersTextEmbedder
    from haystack.components.rankers import TransformersSimilarityRanker
    from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    
    document_store = InMemoryDocumentStore()
    text_embedder = SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5")
    
    embedding_retriever = InMemoryEmbeddingRetriever(document_store)
    bm25_retriever = InMemoryBM25Retriever(document_store)
    ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")
    
    hybrid_retrieval = Pipeline()
    hybrid_retrieval.add_component("text_embedder", text_embedder)
    hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
    hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
    hybrid_retrieval.add_component("ranker", ranker)
    
    hybrid_retrieval.connect("text_embedder", "embedding_retriever")
    hybrid_retrieval.connect("embedding_retriever", "ranker")
    hybrid_retrieval.connect("bm25_retriever", "ranker")
    
    query = "apnea in infants"
    result = hybrid_retrieval.run(
        {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
    )

  • Add strip_whitespaces and replace_regexes parameters to DocumentCleaner component.

    The strip_whitespaces parameter removes leading and trailing whitespace from document content using Python's str.strip() method. Unlike remove_extra_whitespaces, this only affects the beginning and end of the text and preserves internal whitespace, which is useful for maintaining Markdown formatting.

    The replace_regexes parameter accepts a dictionary mapping regex patterns to replacement strings, allowing custom text transformations. For example, {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline. This is applied after remove_regex and provides more flexibility than simple pattern removal.

    Example usage:

    from haystack.components.preprocessors import DocumentCleaner
    from haystack.dataclasses import Document
    
    cleaner = DocumentCleaner(
        strip_whitespaces=True,
        replace_regexes={r'\n\n+': '\n'}
    )
    
    doc = Document(content="  \n\nHello World\n\n\n  ")
    result = cleaner.run(documents=[doc])
    # Result: "Hello World\n"

  • The MultiQueryEmbeddingRetriever and MultiQueryTextRetriever now deduplicate documents by id instead of by content, preventing identical documents from being returned multiple times.
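
As a rough illustration of what the two new DocumentCleaner parameters do to a string, here is a plain-Python sketch using only `str.strip()` and `re.sub`. It does not call Haystack: the helper names are hypothetical, and the order in which the two steps are chained is an assumption rather than a statement about DocumentCleaner's internals.

```python
import re


def strip_edges(text: str) -> str:
    # Mirrors the idea behind strip_whitespaces: only the edges are trimmed.
    return text.strip()


def apply_replacements(text: str, replacements: dict) -> str:
    # Mirrors the idea behind replace_regexes: each pattern maps to a replacement string.
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text)
    return text


raw = "  \n# Title\n\n\n\nSome   spaced   text\n  "

stripped = strip_edges(raw)
cleaned = apply_replacements(stripped, {r"\n\n+": "\n"})

print(repr(stripped))  # '# Title\n\n\n\nSome   spaced   text'
print(repr(cleaned))   # '# Title\nSome   spaced   text'
```

The runs of spaces inside the text survive both steps, which is the difference from remove_extra_whitespaces that the note above points out.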

Deprecation Notes

  • Deprecated PipelineTemplate, PredefinedPipeline and its options (like PredefinedPipeline.CHAT_WITH_WEBSITE). These templates will be removed in Haystack 2.25. Users should switch to using Pipeline YAML files.

Bug Fixes

  • Remove outdated Marqo document store links from the document store guide and docs sidebars to avoid broken integration URLs.

💙 Big thank you to everyone who contributed to this release!

@agnieszka-m, @Amanbig, @anakin87, @bilgeyucel, @Bobholamovic, @bogdankostic, @davidsbatista, @julian-risch, @kacperlukawski, @maxdswain, @OiPunk, @sjrl, @VedantMadane
