⭐ Highlights
🪄LostInTheMiddleRanker and DiversityRanker
We are excited to introduce two new rankers to Haystack: LostInTheMiddleRanker and DiversityRanker!
LostInTheMiddleRanker is based on the research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. It reorders documents according to the "Lost in the Middle" strategy, which places the most relevant paragraphs at the beginning and end of the context, while less relevant paragraphs are positioned in the middle. This ranker can be used in Retrieval-Augmented Generation (RAG) pipelines.
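The reordering itself is easy to sketch. Assuming the input list is already sorted by relevance (best first), here is a minimal illustration of the "Lost in the Middle" layout — this is not Haystack's actual implementation, and the function name is hypothetical:

```python
def lost_in_the_middle_order(docs):
    """Reorder relevance-sorted docs so the most relevant ones sit at the
    beginning and end of the list, with the least relevant in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs):
        if i % 2 == 0:
            front.append(doc)  # ranks 1, 3, 5, ... fill the list from the start
        else:
            back.append(doc)   # ranks 2, 4, 6, ... fill the list from the end
    return front + back[::-1]
```

For five documents ranked 1–5 by relevance, this yields `[1, 3, 5, 4, 2]`: the two most relevant documents end up at the edges of the context, the least relevant one in the middle.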
DiversityRanker aims to maximize the overall diversity of the given documents. It leverages sentence-transformer models to calculate semantic embeddings for each document. The ranker orders documents so that each next one, on average, is least similar to the already selected documents. This ranking results in a list where each subsequent document contributes the most to the overall diversity of the selected document set.
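The greedy selection behind this ordering can be sketched in a few lines. This simplified illustration uses plain cosine similarity over precomputed embedding vectors and seeds the selection with the first document; the actual ranker computes embeddings with sentence-transformers and may seed the selection differently:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def diversity_order(embeddings):
    """Greedy ordering: start with the first document, then repeatedly pick
    the document whose mean similarity to the already selected ones is lowest."""
    selected = [0]
    remaining = list(range(1, len(embeddings)))
    while remaining:
        next_idx = min(
            remaining,
            key=lambda i: sum(cosine(embeddings[i], embeddings[j]) for j in selected)
            / len(selected),
        )
        selected.append(next_idx)
        remaining.remove(next_idx)
    return selected
```

With three embeddings where the first two point in nearly the same direction and the third is orthogonal, the ranker picks the orthogonal one second, since it adds the most diversity.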
📰 New release note management
We have implemented a new release note management system based on reno. From now on, every contributor is responsible for adding release notes for the feature or bugfix they're introducing to Haystack in the same Pull Request containing the code changes. The goal is to encourage detailed and accurate notes for every release, especially when it comes to complex features or breaking changes.
See how to work with the new release notes in our Contribution Guide.
⬆️ Upgrade Notes
- If you're a Haystack contributor, you need a new tool called `reno` to manage the release notes. Please run `pip install -e .[dev]` to ensure you have `reno` available in your environment.
- The OpenSearch custom query syntax changes: the old filter placeholders for `custom_query` are no longer supported. Replace your custom filter expressions with the new `${filters}` placeholder.

  Old:

  ```python
  retriever = BM25Retriever(
      custom_query="""
      {
          "query": {
              "bool": {
                  "should": [
                      {"multi_match": {"query": ${query}, "type": "most_fields", "fields": ["content", "title"]}}
                  ],
                  "filter": [
                      {"terms": {"year": ${years}}},
                      {"terms": {"quarter": ${quarters}}},
                      {"range": {"date": {"gte": ${date}}}}
                  ]
              }
          }
      }
      """
  )
  retriever.retrieve(
      query="What is the meaning of life?",
      filters={"years": [2019, 2020], "quarters": [1, 2, 3], "date": "2019-03-01"},
  )
  ```

  New:

  ```python
  retriever = BM25Retriever(
      custom_query="""
      {
          "query": {
              "bool": {
                  "should": [
                      {"multi_match": {"query": ${query}, "type": "most_fields", "fields": ["content", "title"]}}
                  ],
                  "filter": ${filters}
              }
          }
      }
      """
  )
  retriever.retrieve(
      query="What is the meaning of life?",
      filters={"year": [2019, 2020], "quarter": [1, 2, 3], "date": {"$gte": "2019-03-01"}},
  )
  ```
- This update impacts only those who have created custom invocation layers by subclassing `PromptModelInvocationLayer`. Previously, the `invoke()` method in your custom layer received all prompt template parameters (like `query`, `documents`, etc.) as keyword arguments. With this change, these parameters are no longer passed in as keyword arguments. If you've implemented such a custom layer, you may need to update your code to accommodate this change.
🥳 New Features
- The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing LostInTheMiddleRanker with the desired parameters, it can be used to rank/reorder a list of documents based on the "Lost in the Middle" order: the most relevant documents are located at the top and bottom of the returned list, while the least relevant documents are found in the middle. We advise using this ranker in combination with other rankers, and placing it towards the end of the pipeline.
- The DiversityRanker can be used like other rankers in Haystack. It can be particularly helpful in cases where you have highly relevant yet similar sets of documents. By ensuring a diversity of documents, this new ranker facilitates a more comprehensive utilization of the documents and, particularly in RAG pipelines, potentially contributes to more accurate and rich model responses.
- When using `custom_query` in `BM25Retriever` along with OpenSearch or Elasticsearch, we added support for dynamic `filters`, like in regular queries. With this change, you can pass filters at query time without having to modify the `custom_query`. Instead of defining filter expressions and field placeholders, all you have to do is set the `${filters}` placeholder, analogous to the `${query}` placeholder, in your `custom_query`. For example:

  ```
  {
      "query": {
          "bool": {
              "should": [
                  {"multi_match": {"query": ${query},  // mandatory query placeholder
                                   "type": "most_fields",
                                   "fields": ["content", "title"]}}
              ],
              "filter": ${filters}  // optional filters placeholder
          }
      }
  }
  ```
- `DeepsetCloudDocumentStore` supports searching multiple fields in sparse queries. This enables you to search meta fields as well when using `BM25Retriever`. For example, set `search_fields=["content", "title"]` to search the `title` meta field along with the document `content`.
- Rework `DocumentWriter` to remove `DocumentStoreAwareMixin`. Now we require a generic `DocumentStore` when initializing the writer.
- Rework `MemoryRetriever` to remove `DocumentStoreAwareMixin`. Now we require a `MemoryDocumentStore` when initializing the retriever.
- Introduced the `allowed_domains` parameter in `WebRetriever` for domain-specific searches, thus enabling "talk to a website" and "talk to docs" scenarios.
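To make the `${filters}` mechanics concrete, here is a rough sketch of how query-time filters could be rendered into the placeholder. The helper names are hypothetical and the filter translation covers only two clause types; this is an illustration of the idea, not Haystack's internal implementation:

```python
import json
from string import Template

# A custom_query template using the ${query} and ${filters} placeholders.
CUSTOM_QUERY = """
{
  "query": {
    "bool": {
      "should": [{"multi_match": {"query": ${query}, "type": "most_fields", "fields": ["content", "title"]}}],
      "filter": ${filters}
    }
  }
}
"""

def render_filters(filters):
    """Translate a simple Haystack-style filter dict into search engine
    filter clauses. Only lists (terms) and {"$gte": ...} (range) are
    handled in this sketch."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, list):
            clauses.append({"terms": {field: value}})
        elif isinstance(value, dict) and "$gte" in value:
            clauses.append({"range": {field: {"gte": value["$gte"]}}})
    return clauses

def build_query(query, filters):
    """Substitute both placeholders and return the final request body."""
    body = Template(CUSTOM_QUERY).substitute(
        query=json.dumps(query),
        filters=json.dumps(render_filters(filters)),
    )
    return json.loads(body)
```

The point is that the filter clauses are generated from whatever `filters` dict arrives at query time, so the `custom_query` string itself never has to change.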
✨ Enhancements
- The WebRetriever now employs an enhanced caching mechanism that caches web page content based on search engine results rather than the query.
- Upgrade transformers to the latest version 4.32.1 so that Haystack benefits from Llama and T5 bugfixes: https://github.com/huggingface/transformers/releases/tag/v4.32.1
- Upgrade transformers to version 4.32.0. This version adds support for GPTQ quantization and integrates MPT models.
- Add a `top_k` parameter to the DiversityRanker init method.
- Enable setting the `max_length` value when running PromptNodes using local HF text2text-generation models.
- Enable passing `use_fast` to the underlying transformers pipeline.
- Enhance FileTypeClassifier to detect media file types like mp3, mp4, mpeg, m4a, and similar.
- Minor PromptNode HFLocalInvocationLayer test improvements.
- Several minor enhancements for LinkContentFetcher:
  - Dynamic content handler resolution
  - Custom User-Agent header (optional, to minimize blocking)
  - PDF support
  - Registration of new content handlers
- If LinkContentFetcher encounters a block or receives any response code other than HTTPStatus.OK, return the search engine snippet as content, if it's available.
- Allow loading tokenizers for prompt models not natively supported by transformers by setting `trust_remote_code` to `True`.
- Refactor and simplify WebRetriever to use the LinkContentFetcher component.
- Remove template variables from invocation layer kwargs.
- Allow WebRetriever users to specify a custom LinkContentFetcher instance.
🐛 Bug Fixes
- Fix the bug where the responses of Agents using local HF models contain the prompt text.
- Fix issue 5485: TransformersImageToText.generate_captions accepts `str`.
- Fix StopWordsCriteria not checking stop word tokens in a continuous and sequential order.
- Ensure the leading whitespace in the generated text is preserved when using `stop_words` in the Hugging Face invocation layer of the PromptNode.
- Restrict the criteria for identifying an OpenAI model in the PromptNode and in the EmbeddingRetriever. Previously, the criteria were quite loose, leading to more false positives.
- Make the Crawler work properly with Selenium>=4.11.0. Simplify the Crawler, as the new version of Selenium automatically finds or installs the necessary drivers.
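The StopWordsCriteria fix above boils down to matching the stop word's token IDs as one contiguous run at the end of the generated sequence, rather than counting matching tokens scattered anywhere in the output. A minimal sketch of that check — the function name is hypothetical, and this is not the actual criteria class:

```python
def ends_with_stop_sequence(generated_ids, stop_ids):
    """Return True only if stop_ids appears as a contiguous run at the end
    of generated_ids; matching tokens scattered through the output, or out
    of order, do not count."""
    if not stop_ids or len(stop_ids) > len(generated_ids):
        return False
    return generated_ids[-len(stop_ids):] == stop_ids
```

Before the fix, a check that merely asked "have all the stop word's tokens appeared somewhere?" could stop generation spuriously.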
👁️ Haystack 2.0 preview
- Adds FileExtensionClassifier to preview components.
- Add Sentence Transformers Document Embedder. It computes embeddings of Documents. The embedding of each Document is stored in the `embedding` field of the Document.
- Add Sentence Transformers Text Embedder. It is a simple component that embeds strings into vectors.
- Add an Answer base class for Haystack v2.
- Add GeneratedAnswer and ExtractedAnswer.
- Improve error messaging in the FileExtensionClassifier constructor to avoid common mistakes.
- Migrate existing v2 components to Canals 0.4.0.
- Fix TextFileToDocument using the wrong Document class.
- Change import paths under the "preview" package to minimize module namespace pollution.
- Migrate all components to Canals==0.7.0.
- Add serialization and deserialization methods for all Haystack components.
- Added new DocumentWriter component to the Haystack v2 preview so that documents can be written to stores.
- Copy lazy_imports.py to preview.
- Remove the `BaseTestComponent` class used to test `Component`s.
- Remove `DocumentStoreAwareMixin` as it's not necessary anymore.
- Remove the Pipeline specialisation to support DocumentStores.
- Add Sentence Transformers Embedding Backend. It will be used by Embedder components and is responsible for computing embeddings.
- Add a utility `store_class` factory function to create `Store`s for testing purposes.
- Add `from_dict` and `to_dict` methods to the `Store` `Protocol`.
- Add default `from_dict` and `to_dict` implementations to classes decorated with `@store`.
- Add a new TextFileToDocument component to the Haystack v2 preview so that text files can be converted to Haystack Documents.