github deepset-ai/haystack v2.11.0-rc2

Pre-release · 22 hours ago

Release Notes

v2.12.0-rc0

Highlights

Thanks to lazy imports, Haystack's import performance has significantly improved. Importing individual components now uses 50% less CPU time on average, and overall import performance is better as well: for example, import haystack now uses only 2-5% of the CPU time it previously required.

Upgrade Notes

  • The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass, deprecated in Haystack 2.10.0, have now been removed. pandas is no longer a required dependency for Haystack, making the installation lighter. If a component you use requires pandas, an informative error will be raised, prompting you to install it. For details and motivation, see the GitHub discussion: #8688.

  • Starting from Haystack 2.11.0, Python 3.8 is no longer supported. Python 3.8 reached its end of life in October 2024.

  • The AzureOCRDocumentConverter no longer produces Document objects with the deprecated dataframe field.

    Am I affected?

    • If your workflow relies on the dataframe field in Document objects generated by AzureOCRDocumentConverter, you are affected.
    • If you saw a DeprecationWarning in Haystack 2.10 when initializing a Document with a dataframe, this change will now remove that field entirely.

    How to handle the change:

    • Instead of storing detected tables as a dataframe, AzureOCRDocumentConverter now represents tables as CSV-formatted text in the content field of the Document.
    • Update your processing logic to handle CSV-formatted tables instead of a dataframe. If needed, you can convert the CSV text back into a dataframe using pandas.read_csv().
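
The suggested round-trip can be sketched as follows; the CSV snippet below is a made-up example, not actual AzureOCRDocumentConverter output:

```python
import io

import pandas as pd

# Hypothetical CSV-formatted table text, as it might appear in the
# content field of a Document produced by the converter.
csv_text = "name,score\nalice,0.9\nbob,0.7\n"

# Rebuild a DataFrame from the CSV text when you still need one.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```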

New Features

  • Add a new MSGToDocument component to convert .msg files into Haystack Document objects.
    • Extracts email metadata (e.g. sender, recipients, CC, BCC, subject) and body content into a Document.
    • Converts attachments into ByteStream objects which can be passed onto a FileTypeRouter + relevant converters.
  • We've introduced a new type_validation parameter to control type compatibility checks in pipeline connections. It defaults to True; setting it to False disables type checks entirely, so any connection is allowed.
  • Add run_async method to HuggingFaceAPIChatGenerator. This method relies internally on the AsyncInferenceClient from the huggingface_hub library to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Add run_async method to OpenAIChatGenerator. This method internally uses the async version of the OpenAI client to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • The InMemoryDocumentStore and the associated InMemoryBM25Retriever and InMemoryEmbeddingRetriever retrievers now support async mode.
  • Add run_async method to DocumentWriter. This method supports the same parameters as the run method and relies on the DocumentStore to implement write_documents_async. It returns a coroutine that can be awaited.
  • Add run_async method to AzureOpenAIChatGenerator. This method uses AsyncAzureOpenAI to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Sentence Transformers components now support ONNX and OpenVINO backends through the "backend" parameter. Supported backends are torch (default), onnx, and openvino. Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html) for more information.
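
The run_async methods above all follow the same pattern: calling the method returns a coroutine that must be awaited. A minimal sketch of that pattern, using a stand-in class rather than a real generator (no model or API key is assumed):

```python
import asyncio


class StubChatGenerator:
    """Stand-in mimicking the run/run_async pairing; not a Haystack class."""

    def run(self, messages):
        return {"replies": [f"echo: {m}" for m in messages]}

    async def run_async(self, messages):
        # A real implementation would await an async client here
        # (e.g. an async HTTP call) instead of delegating to run().
        return self.run(messages)


async def main():
    generator = StubChatGenerator()
    # run_async returns a coroutine, so it must be awaited.
    return await generator.run_async(["hello"])


result = asyncio.run(main())
print(result["replies"][0])  # echo: hello
```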

Enhancement Notes

  • Improved AzureDocumentEmbedder to handle embedding generation failures gracefully. Errors are logged, and processing continues with the remaining batches.
  • In the FileTypeRouter, added explicit support for classifying .msg files with the MIME type "application/vnd.ms-outlook", since the mimetypes module returns None for .msg files by default.
  • Added the store_full_path init parameter to XLSXToDocument to let users toggle whether the full path of the source file is stored in the meta of the Document. This is set to False by default to increase privacy.
  • Increased default timeout for Mermaid server to 30 seconds. Mermaid server is used to draw Pipelines. Exposed the timeout as a parameter for the Pipeline.show and Pipeline.draw methods. This allows users to customize the timeout as needed.
  • Optimize import times through extensive use of lazy imports across packages. Importing one component of a package no longer imports all components of that package. For example, importing OpenAIChatGenerator no longer imports AzureOpenAIChatGenerator.
  • Haystack now officially supports Python 3.13. Some components and integrations may not yet be compatible. Specifically, the NamedEntityExtractor does not work with Python 3.13 when using the spacy backend. Additionally, you may encounter issues installing openai-whisper, which is required by the LocalWhisperTranscriber component, if you use uv or poetry for installation. In this case, we recommend using pip for installation.
  • EvaluationRunResult can now output results as JSON, a pandas DataFrame, or a CSV file.
  • Updated ListJoiner so that list_type is now optional. By default it uses List, which behaves like List[Any].
    • This allows the ListJoiner to combine any incoming lists into a single flattened list.
    • Users can still pass list_type if they would like to have stricter type validation in their pipelines.
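
Conceptually, the default behavior concatenates whatever lists arrive into one flat list, as in this plain-Python sketch (an illustration, not the actual ListJoiner implementation):

```python
def join_lists(inputs):
    # With the default List type (acting like List[Any]), incoming lists
    # of any element type are concatenated into one flat list.
    joined = []
    for item_list in inputs:
        joined.extend(item_list)
    return joined


print(join_lists([[1, 2], ["a"], [3.0]]))  # [1, 2, 'a', 3.0]
```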

Deprecation Notes

  • The use of pandas DataFrames in EvaluationRunResult is now optional, and the methods score_report, to_pandas, and comparative_individual_scores_report are deprecated; they will be removed in the next Haystack release.

Bug Fixes

  • In the ChatMessage.to_openai_dict_format utility method, include the name field in the returned dictionary, if present. Previously, the name field was erroneously skipped.
  • Fixed pipelines failing when components return plain pandas DataFrames. Socket values are now compared with 'is not' instead of '!=' to avoid ambiguous truth-value errors with DataFrames.
  • Make sure that OpenAIChatGenerator sets additionalProperties: False in the tool schema when tool_strict is set to True.
  • Fix a bug where the output_type of a ConditionalRouter was not being serialized correctly. This would cause the router to work incorrectly after being serialized and deserialized.
  • Fixed accumulation of a tool's arguments when streaming with an OpenAIChatGenerator.
  • Added a fix to the pipeline's component scheduling algorithm to reduce edge cases where the execution order of components that are simultaneously waiting for inputs affects a pipeline's output. We look at topological order first to decide which of the waiting components should run, and fall back to lexicographical order when the components are on the same topology level. In cyclic pipelines, if the waiting components are in the same cycle, we fall back to lexicographical order immediately.
  • Fixed serialization of typing.Any when using the serialize_type utility.
  • Fixes an edge case in the pipeline-run logic where an existing input could be overwritten if the same component connects to the socket from multiple output sockets.
  • ComponentTool does not truncate 'description' anymore.
  • Updated import paths for type hints so that ddtrace 3.0.0 works with our Datadog tracer.

v2.11.0-rc0

Highlights

We are introducing the `AsyncPipeline`: it supports running pipelines asynchronously, schedules components concurrently whenever possible, and brings major speed improvements to pipelines that can run workloads in parallel.

Major refactoring of Pipeline.run() to fix multiple bugs. We moved from a mostly graph-based execution logic to a dynamic, dataflow-driven one. While most pipelines should remain unaffected, we recommend carefully checking your pipeline executions to ensure their output hasn't changed.

Upgrade Notes

  • The DOCXToDocument converter now returns a Document object with DOCX metadata stored in the meta field as a dictionary under the key docx. Previously, the metadata was represented as a DOCXMetadata dataclass. This change does not impact reading from or writing to a Document Store.
  • Removed the deprecated NLTKDocumentSplitter; its functionality is now supported by the DocumentSplitter.
  • The deprecated FUNCTION role has been removed from the ChatRole enum. Use TOOL instead. The deprecated class method ChatMessage.from_function has been removed. Use ChatMessage.from_tool instead.

New Features

  • Added a new component ListJoiner, which joins lists of values from different components into a single list.

  • Introduced the OpenAPIConnector component, enabling direct invocation of REST endpoints as specified in an OpenAPI specification. This component is designed for direct REST endpoint invocation without LLM-generated payloads; users need to pass the run parameters explicitly.

    Example:

    ```python
    from haystack.utils import Secret
    from haystack.components.connectors.openapi import OpenAPIConnector

    connector = OpenAPIConnector(
        openapi_spec="https://bit.ly/serperdev_openapi",
        credentials=Secret.from_env_var("SERPERDEV_API_KEY"),
    )
    response = connector.run(
        operation_id="search",
        parameters={"q": "Who was Nikola Tesla?"}
    )
    ```

  • Added a new component LLMMetadataExtractor, which can be used in an indexing pipeline to extract metadata from documents based on a user-provided prompt, returning the documents with the metadata field populated from the LLM's output.

  • Add support for Tools in the Azure OpenAI Chat Generator.

  • Introduced CSVDocumentCleaner component for cleaning CSV documents.

    • Removes empty rows and columns, while preserving specified ignored rows and columns.
    • Customizable number of rows and columns to ignore during processing.
  • Introducing CSVDocumentSplitter: it splits CSV documents into structured sub-tables by recursively splitting along empty rows and columns larger than a specified threshold. This is particularly useful when converting Excel files, which often have multiple tables within one sheet.

  • Drawing pipelines, i.e.: calls to draw() or show(), can now be done using a custom Mermaid server and additional parameters. This allows for more flexibility in how pipelines are rendered. See Mermaid.ink's [documentation](https://github.com/jihchi/mermaid.ink) for more information on how to set up a custom server.

  • Added a new AsyncPipeline implementation that allows pipelines to be executed from async code, supporting concurrent scheduling of pipeline components for faster processing.

  • Added tool support to HuggingFaceLocalChatGenerator.
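
The speedup from the AsyncPipeline comes from scheduling components with no dependency on each other at the same time. A toy sketch of that idea with asyncio, using stand-in coroutines rather than the real AsyncPipeline scheduler:

```python
import asyncio
import time


async def component(name, delay):
    # Stand-in for an independent, I/O-bound pipeline component.
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.perf_counter()
    # Two components with no dependency between them run concurrently,
    # so the total wall time is close to the slower one, not the sum.
    results = await asyncio.gather(
        component("retriever", 0.2),
        component("ranker", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
```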

Enhancement Notes

  • Enhanced SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to accept an additional parameter, which is passed directly to the underlying SentenceTransformer.encode method for greater flexibility in embedding customization.
  • Added completion_start_time metadata to track time-to-first-token (TTFT) in streaming responses from Hugging Face API and OpenAI (Azure).
  • Enhancements to Date Filtering in MetadataRouter
    • Improved date parsing in filter utilities by introducing _parse_date, which first attempts datetime.fromisoformat(value) for backward compatibility and then falls back to dateutil.parser.parse() for broader ISO 8601 support.
    • Resolved a common issue where comparing naive and timezone-aware datetimes resulted in TypeError. Added _ensure_both_dates_naive_or_aware, which ensures both datetimes are either naive or aware. If one is missing a timezone, it is assigned the timezone of the other for consistency.
  • When Pipeline.from_dict receives an invalid type (e.g. empty string), an informative PipelineError is now raised.
  • Add jsonschema library as a core dependency. It is used in Tool and JsonSchemaValidator.
  • Added support for passing a streaming callback as a run parameter to Hugging Face chat generators.
  • For the CSVDocumentCleaner, added remove_empty_rows & remove_empty_columns to optionally remove rows and columns. Also added keep_id to optionally allow for keeping the original document ID.
  • Enhanced OpenAPIServiceConnector to support and be compatible with the new ChatMessage format.
  • Updated the Document's metadata after initializing the Document in DocumentSplitter, as requested in issue #8741.
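
The naive/aware alignment described for MetadataRouter's date filtering can be sketched with the standard library alone (the real helpers live in Haystack's filter utilities; this is an illustration of the idea):

```python
from datetime import datetime


def ensure_both_naive_or_aware(d1, d2):
    # If exactly one datetime carries a timezone, copy it onto the other
    # so the two can be compared without raising a TypeError.
    if d1.tzinfo is None and d2.tzinfo is not None:
        d1 = d1.replace(tzinfo=d2.tzinfo)
    elif d1.tzinfo is not None and d2.tzinfo is None:
        d2 = d2.replace(tzinfo=d1.tzinfo)
    return d1, d2


naive = datetime.fromisoformat("2025-01-01T12:00:00")
aware = datetime.fromisoformat("2025-01-01T11:00:00+00:00")
a, b = ensure_both_naive_or_aware(naive, aware)
print(a > b)  # True
```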

Deprecation Notes

  • The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass are deprecated and will be removed in Haystack 2.11.0. Check out the GitHub discussion for motivation and details: #8688

Bug Fixes

  • Fixes a bug that causes pyright type checker to fail for all component objects.
  • Haystack pipelines with Mermaid graphs are now compressed to reduce the size of the encoded base64 and avoid HTTP 400 errors when the graph is too large.
  • The DOCXToDocument component now skips comment blocks in DOCX files that previously caused errors.
  • Callable deserialization now works for all fully qualified import paths.
  • Fixed error messages for Document Classifier components that suggested using nonexistent components for text classification.
  • Fixed JSONConverter to properly skip converting JSON files that are not utf-8 encoded.
  • Fixed multiple issues in the Pipeline.run() logic:
    • acyclic pipelines with multiple lazy variadic components not running all components
    • cyclic pipelines not passing intermediate outputs to components outside the cycle
    • cyclic pipelines with two or more optional or greedy variadic edges showing unexpected execution behavior
    • cyclic pipelines with two cycles sharing an edge raising errors
  • Updated the PDFMinerToDocument convert function to insert double newlines between container_text elements so that passages can later be split by DocumentSplitter.
  • In the Hugging Face API embedders, the InferenceClient.feature_extraction method is now used instead of InferenceClient.post to compute embeddings. This ensures a more robust and future-proof implementation.
  • Improved OpenAIChatGenerator streaming response tool call processing: The logic now scans all chunks to correctly identify the first chunk with tool calls, ensuring accurate payload construction and preventing errors when tool call data isn’t confined to the initial chunk.
