github deepset-ai/haystack v1.20.0


⭐ Highlights


🪄LostInTheMiddleRanker and DiversityRanker

We are excited to introduce two new rankers to Haystack: LostInTheMiddleRanker and DiversityRanker!

LostInTheMiddleRanker is based on the research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. It reorders documents according to the "Lost in the Middle" strategy, which places the most relevant paragraphs at the beginning and end of the context, while less relevant paragraphs are positioned in the middle. This ranker can be used in Retrieval-Augmented Generation (RAG) pipelines. Here is an example of how to use it:

from haystack import Pipeline
from haystack.nodes import WebRetriever, TopPSampler
from haystack.nodes.ranker import DiversityRanker, LostInTheMiddleRanker

# `search_key` is your search engine API key; `prompt_node` is a configured PromptNode (defined elsewhere).
web_retriever = WebRetriever(api_key=search_key, top_search_results=5, mode="preprocessed_documents", top_k=50)

sampler = TopPSampler(top_p=0.97)
diversity_ranker = DiversityRanker()
litm_ranker = LostInTheMiddleRanker(word_count_threshold=1024)

pipeline = Pipeline()
pipeline.add_node(component=web_retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=sampler, name="Sampler", inputs=["Retriever"])
pipeline.add_node(component=diversity_ranker, name="DiversityRanker", inputs=["Sampler"])
pipeline.add_node(component=litm_ranker, name="LostInTheMiddleRanker", inputs=["DiversityRanker"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["LostInTheMiddleRanker"])

In this example, the LostInTheMiddleRanker is positioned as the last component before the PromptNode. LostInTheMiddleRanker is designed to be used in combination with other rankers: place it towards the end of the pipeline (as the last ranker) so that it reorders documents that have already been ranked by the other rankers.

DiversityRanker increases the diversity of a set of documents. It uses sentence-transformers models to calculate semantic embeddings for each document and then ranks the documents so that each subsequent one is the least similar to those already selected. The result is a list in which each document contributes the most to the overall diversity of the selected set.

The example above also shows that DiversityRanker can be used in combination with other rankers: we recommend placing it after the similarity ranker and before the LostInTheMiddleRanker. DiversityRanker is typically used in generative RAG pipelines to ensure that the generated answer draws on a diverse set of documents, a setup typical for Long-Form Question Answering (LFQA) tasks. For details, check out the Enhancing RAG Pipelines in Haystack: Introducing DiversityRanker and LostInTheMiddleRanker article on the Haystack Blog.
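
To illustrate the greedy ordering idea, here is a minimal, self-contained sketch that uses sentence-transformers directly. It is an approximation of the technique, not the code inside DiversityRanker; the model name and the choice of the first document are assumptions for the example:

from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer

def diversity_order(texts: List[str], model_name: str = "all-MiniLM-L6-v2") -> List[str]:
    # Embed and L2-normalize all documents so that dot products equal cosine similarities.
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, normalize_embeddings=True)

    selected = [0]  # seed with the first document for simplicity
    while len(selected) < len(texts):
        # Mean similarity of every candidate to the documents selected so far.
        mean_sim = (embeddings @ embeddings[selected].T).mean(axis=1)
        mean_sim[selected] = np.inf  # never re-pick an already selected document
        selected.append(int(np.argmin(mean_sim)))  # pick the least similar candidate next
    return [texts[i] for i in selected]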

📰 New release note management

We have implemented a new release note management system, reno. From now on, every contributor is responsible for adding release notes for the feature or bugfix they're introducing in Haystack in the same Pull Request containing the code changes. The goal is to encourage detailed and accurate notes for every release, especially when it comes to complex features or breaking changes.

See how to work with the new release notes in our Contribution Guide.

⬆️ Upgrade Notes


  • If you're a Haystack contributor, you need a new tool called reno to manage the release notes.
    Please run pip install -e .[dev] to ensure you have reno available in your environment.

  • The OpenSearch custom_query syntax has changed: the old filter placeholders for custom_query are no longer supported.
    Replace your custom filter expressions with the new ${filters} placeholder:

    Old:

      retriever = BM25Retriever(
        custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": [
                          {"terms": {"year": ${years}}},
                          {"terms": {"quarter": ${quarters}}},
                          {"range": {"date": {"gte": ${date}}}}
                      ]
                  }
              }
          }
        """
      )
    
      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"years": [2019, 2020], "quarters": [1, 2, 3], "date": "2019-03-01"}
      )

    New:

      retriever = BM25Retriever(
        custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": ${filters}
                  }
              }
          }
        """
      )
    
      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"year": [2019, 2020], "quarter": [1, 2, 3], "date": {"$gte": "2019-03-01"}}
      )
  • This update impacts only those who have created custom invocation layers by subclassing PromptModelInvocationLayer.
    Previously, the invoke() method in your custom layer received all prompt template parameters (such as query and
    documents) as keyword arguments. With this change, these parameters are no longer passed in as keyword arguments.
    If you've implemented such a custom layer, you may need to update your code to accommodate this change.
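
    For illustration, here is a minimal, hypothetical sketch of the pattern (the class name and the prompt kwarg are assumptions for the example; a real custom layer subclasses PromptModelInvocationLayer):

      # Hypothetical custom layer, shown without the real base class for brevity.
      class MyCustomInvocationLayer:
          def invoke(self, *args, **kwargs):
              # Before v1.20, template variables such as kwargs["query"] or kwargs["documents"]
              # arrived here as keyword arguments. From v1.20 on, do not rely on them.
              prompt = kwargs.get("prompt", "")  # read only what the layer actually needs
              return [f"Processed: {prompt}"]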

🥳 New Features


  • The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing LostInTheMiddleRanker with the desired parameters, it can be used to rank/reorder a list of documents based on the "Lost in the Middle" order - the most relevant documents are located at the top and bottom of the returned list, while the least relevant documents are found in the middle. We advise that you use this ranker in combination with other rankers, and to place it towards the end of the pipeline.
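
    For example, the ranker can also be used on its own (a minimal sketch; the query, documents, and threshold below are illustrative):

      from haystack import Document
      from haystack.nodes.ranker import LostInTheMiddleRanker

      ranker = LostInTheMiddleRanker(word_count_threshold=1024)
      docs = [
          Document(content="Berlin is the capital of Germany."),
          Document(content="Paris is the capital of France."),
          Document(content="Madrid is the capital of Spain."),
      ]
      # The most relevant documents end up at the top and bottom of the returned list.
      reordered = ranker.predict(query="What is the capital of Germany?", documents=docs)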

  • The DiversityRanker can be used like other rankers in Haystack and it can be particularly helpful in cases where you have highly relevant yet similar sets of documents. By ensuring a diversity of documents, this new ranker facilitates a more comprehensive utilization of the documents and, particularly in RAG pipelines, potentially contributes to more accurate and rich model responses.

  • When using custom_query in BM25Retriever along with OpenSearch or Elasticsearch, we added support for dynamic filters, like in regular queries. With this change, you can pass filters at query time without having to modify the custom_query:
    Instead of defining filter expressions and field placeholders, all you have to do is set the ${filters} placeholder in your custom_query, analogous to the ${query} placeholder.
    For example:

      {
          "query": {
              "bool": {
                  "should": [{"multi_match": {
                      "query": ${query},                 // mandatory query placeholder
                      "type": "most_fields",
                      "fields": ["content", "title"]}}
                  ],
                  "filter": ${filters}                 // optional filters placeholder
              }
          }
      }
  • DeepsetCloudDocumentStore supports searching multiple fields in sparse queries. This enables you to search meta fields as well when using BM25Retriever. For example, set search_fields=["content", "title"] to search the title meta field along with the document content.
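
    For example (a sketch; the exact DeepsetCloudDocumentStore constructor arguments shown are illustrative placeholders):

      from haystack.document_stores import DeepsetCloudDocumentStore
      from haystack.nodes import BM25Retriever

      document_store = DeepsetCloudDocumentStore(
          api_key="<DEEPSET_CLOUD_API_KEY>",   # placeholder
          workspace="default",
          index="my-index",
          search_fields=["content", "title"],  # also search the title meta field
      )
      retriever = BM25Retriever(document_store=document_store)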

  • Rework DocumentWriter to remove DocumentStoreAwareMixin. A generic DocumentStore is now required when initializing the writer.

  • Rework MemoryRetriever to remove DocumentStoreAwareMixin. A MemoryDocumentStore is now required when initializing the retriever.

  • Introduced allowed_domains parameter in WebRetriever for domain-specific searches, thus enabling "talk to a website" and "talk to docs" scenarios.
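
    For example (a sketch; the API key and domain are placeholders):

      from haystack.nodes import WebRetriever

      retriever = WebRetriever(
          api_key="<SEARCH_ENGINE_API_KEY>",             # placeholder
          allowed_domains=["docs.haystack.deepset.ai"],  # restrict results to these domains
          mode="preprocessed_documents",
          top_k=20,
      )
      docs = retriever.retrieve(query="How do I use the WebRetriever?")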

✨ Enhancements


  • The WebRetriever now employs an enhanced caching mechanism that caches web page content based on search engine results rather than the query.

  • Upgrade transformers to the latest version 4.32.1 so that Haystack benefits from Llama and T5 bugfixes: https://github.com/huggingface/transformers/releases/tag/v4.32.1

  • Upgrade Transformers to version 4.32.0.
    This version adds support for GPTQ quantization and integrates MPT models.

  • Add top_k parameter to the DiversityRanker init method.

  • Enable setting the max_length value when running PromptNodes using local HF text2text-generation models.

  • Enable passing use_fast to the underlying transformers' pipeline

  • Enhance FileTypeClassifier to detect media file types like mp3, mp4, mpeg, m4a, and similar.
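
    For example, a minimal sketch assuming the node's usual supported_types parameter:

      from haystack.nodes import FileTypeClassifier

      # Each supported type is routed to its own output edge (output_1, output_2, ...)
      # when the classifier is used in an indexing pipeline.
      media_classifier = FileTypeClassifier(supported_types=["mp3", "mp4", "mpeg", "m4a"])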

  • Minor PromptNode HFLocalInvocationLayer test improvements

  • Several minor enhancements for LinkContentFetcher:

    • Dynamic content handler resolution
    • Custom User-Agent header (optional, minimize blocking)
    • PDF support
    • Register new content handlers
  • If LinkContentFetcher encounters a block or receives any response code other than HTTPStatus.OK, it returns the search engine snippet as content, if available.

  • Allow loading Tokenizers for prompt models not natively supported by transformers by setting trust_remote_code to True.

  • Refactor and simplify WebRetriever to use LinkContentFetcher component

  • Remove template variables from invocation layer kwargs

  • Allow WebRetriever users to specify a custom LinkContentFetcher instance

🐛 Bug Fixes


  • Fix a bug where the responses of Agents using local HF models contained the prompt text.

  • Fix issue 5485: TransformersImageToText.generate_captions accepts str input.

  • Fix StopWordsCriteria not checking stop word tokens in a continuous and sequential order

  • Ensure the leading whitespace in the generated text is preserved when using stop_words in the Hugging Face invocation layer of the PromptNode.

  • Restricts the criteria for identifying an OpenAI model in the PromptNode and in the EmbeddingRetriever.
    Previously, the criteria were quite loose, leading to more false positives.

  • Make the Crawler work properly with Selenium>=4.11.0.
    Simplify the Crawler, as the new version of Selenium automatically finds or installs the necessary drivers.

👁️ Haystack 2.0 preview


  • Adds FileExtensionClassifier to preview components.

  • Add SentenceTransformersDocumentEmbedder.
    It computes embeddings of Documents. The embedding of each Document is stored in the embedding field of the Document.

  • Add SentenceTransformersTextEmbedder.
    It is a simple component that embeds strings into vectors.

  • Add Answer base class for Haystack v2

  • Add GeneratedAnswer and ExtractedAnswer

  • Improve error messaging in the FileExtensionClassifier constructor to avoid common mistakes.

  • Migrate existing v2 components to Canals 0.4.0

  • Fix TextFileToDocument using wrong Document class

  • Change import paths under the "preview" package to minimize module namespace pollution.

  • Migrate all components to Canals==0.7.0

  • Add serialization and deserialization methods for all Haystack components

  • Added new DocumentWriter component to Haystack v2 preview so that documents can be written to stores.

  • Copy lazy_imports.py to preview

  • Remove BaseTestComponent class used to test Components

  • Remove DocumentStoreAwareMixin as it's not necessary anymore

  • Remove Pipeline specialisation to support DocumentStores.

  • Add Sentence Transformers Embedding Backend.
    It will be used by Embedder components and is responsible for computing embeddings.

  • Add a utility factory function store_class to create Stores for testing purposes.

  • Add from_dict and to_dict methods to Store Protocol

  • Add default from_dict and to_dict implementations to classes decorated with @store

  • Add new TextFileToDocument component to Haystack v2 preview so that text files can be converted to Haystack Documents.
