github deepset-ai/haystack v1.7.0

⭐ Highlights

This time we have a couple of smaller yet important feature highlights: lots of them coming from you, our amazing community! 🥂
Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳

Here's what you'll find in Haystack 1.7:

Support for OpenAI GPT-3

If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it like any other model. Just sign up to OpenAI, copy your API key, and run the following code. To compare it to other models, check out our evaluation guide.

from haystack.nodes import OpenAIAnswerGenerator
from haystack import Document

reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)

docs = [Document(content="""The Big Bang Theory is an American sitcom.
                            The four main characters are all avid fans of nerd culture. 
                            Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia. 
                            Star Trek in particular is frequently referenced""")]
res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
print(res)

#2605
#3036

Zero-Shot Query Classification

Until now, TransformersQueryClassifier was built very closely around the excellent binary query-type classifier model of shahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to models that output binary labels. One of our amazing community contributions has now lifted this restriction.
But that's not all: @anakin87 added support for zero-shot classification models as well!
So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:

from haystack.nodes import TransformersQueryClassifier

# In zero-shot-classification, you are free to choose the labels
labels = ["music", "cinema", "food"]

query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)

queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]

for query in queries:
    result = query_classifier.run(query=query)
    print(f'Query "{query}" was sent to {result[1]}')

#2965
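To act on the classifier's decision, it helps to translate the edge name printed above back into a label. The following is a minimal sketch, assuming the classifier numbers its outgoing edges in the order of the labels list, starting at output_1. The helper names edge_to_label and label_for_edge are hypothetical and not part of Haystack.

```python
# Sketch only: assumes outgoing edges are numbered in label order, starting at 1.
labels = ["music", "cinema", "food"]

# Hypothetical lookup table, not part of Haystack.
edge_to_label = {f"output_{i}": label for i, label in enumerate(labels, start=1)}


def label_for_edge(edge: str) -> str:
    """Translate an edge name such as "output_2" back into its label."""
    return edge_to_label[edge]


print(label_for_edge("output_2"))  # cinema
```

With such a mapping you can decide which downstream node to attach to each edge, for example a music-specific retriever on output_1.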

Adding Page Numbers to Document Meta

Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're just halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make it easier, we added the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. If you set it to True, it adds a meta field "page" to documents containing the page number of the text snippet or a table within the original file.

from haystack import Pipeline
from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore

converter = PDFToTextConverter()
# add_page_number=True adds a "page" meta field to every resulting document
preprocessor = PreProcessor(add_page_number=True)
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_node(component=converter, name="Converter", inputs=["File"])
# The node name must match the name referenced in the next node's inputs
pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["Converter"])
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

#2932
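Once the "page" meta field is available, a search frontend could use it to open the source file at the right position. The following is a rough sketch, assuming the document's meta is a plain dictionary and that the viewer honors the #page=N URL fragment (many PDF viewers do, but not all). The function link_to_page and the example URL are hypothetical.

```python
from typing import Any, Dict


def link_to_page(file_url: str, meta: Dict[str, Any]) -> str:
    """Append a page fragment to the file URL if the meta has a page number."""
    page = meta.get("page")
    if page is None:
        # No page information: fall back to opening the file at the start.
        return file_url
    return f"{file_url}#page={page}"


# Hypothetical meta of a document produced with add_page_number=True
meta = {"name": "report.pdf", "page": 12}
print(link_to_page("https://example.com/files/report.pdf", meta))
```

Documents without a "page" entry simply link to the start of the file, so the helper is safe to use on mixed results.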

Gradient Accumulation for FARMReader

Training big Transformer models in low-resource environments is hard. Batch size plays a significant role when it comes to hyper-parameter tuning during the training process. The batch size you can run on your machine is restricted by the amount of memory available on your GPUs. Gradient accumulation is a well-known technique to work around that restriction: it adds up the gradients across iterations and updates the model weights only once after a certain number of iterations.
We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/squad20"
reader.train(
    data_dir=data_dir,
    train_filename="dev-v2.0.json",
    use_gpu=True,
    n_epochs=1,
    save_dir="my_model",
    grad_acc_steps=8,
)

#2925
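To see why gradient accumulation can stand in for a larger batch, here is a small pure-Python sketch on a toy one-parameter model. It is not Haystack's training loop: the model, the data, and the function names are all made up for illustration. It shows that averaging the gradients of equally sized micro-batches gives the same gradient as one pass over the full batch.

```python
# Toy illustration of gradient accumulation on the model y = w * x.
# All names and data here are illustrative, not Haystack internals.

def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    micro_batches = [
        data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # One forward/backward pass per micro-batch; gradients are added up
        # and the weight update would happen only after the loop.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.9), (6.0, 12.1), (7.0, 14.0), (8.0, 15.8)]
w = 0.5

full = grad_mse(w, data)  # one batch of 8
accumulated = accumulated_grad(w, data, micro_batch_size=2)  # 4 micro-batches of 2
print(abs(full - accumulated) < 1e-9)  # True
```

Only one micro-batch has to fit into memory at a time, which is the point of the technique. The equality holds here because all micro-batches have the same size.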

Extended Ray Support

Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with a dual-retriever setup on Ray. Also, we now support Ray Serve deployment arguments in Pipeline YAMLs so that you can fully control your Ray deployments.

pipelines:
  - name: ray_query_pipeline
    nodes:
      - name: EmbeddingRetriever
        replicas: 2
        inputs: [ Query ]
        serve_deployment_kwargs:
          num_replicas: 2
          version: Twenty
          ray_actor_options:
            num_gpus: 0.25
            num_cpus: 0.5
          max_concurrent_queries: 17
      - name: Reader
        inputs: [ EmbeddingRetriever ]

#2981
#2918

Support for Custom Sentence Tokenizers in PreProcessor

In some specific domains (for example, legal, with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the tokenizer_model_folder parameter to PreProcessor.

from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_length=10,
    split_overlap=0,
    split_by="sentence",
    split_respect_sentence_boundary=False,
    language="pt",
    tokenizer_model_folder="/home/user/custom_tokenizer_models",
)

#2783

Making it Easier to Switch Document Stores

We had yet another amazing community contribution by @zoltan-fedor: support for BM25 with the Weaviate document store.
Besides that, we streamlined the methods of BaseDocumentStore and added update_document_meta() to InMemoryDocumentStore. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, use in-memory for quick prototyping, then move to something more production-ready).
#2860
#2689

Almost 2x Performance Gain for Electra Reader Models

We did a major refactoring of our language_modeling module, resolving a bug that caused Electra models to execute the forward pass twice.
#2703

⚠️ Breaking Changes

⚠️ Breaking Changes for Contributors

Default Branch will be Renamed to main on Tuesday, 16th of August

We will rename the default branch from master to main after this release. For a nice recap about good reasons for doing this, have a look at the Software Freedom Conservancy's blog.
Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

Pre-Commit Hooks Instead of CI Jobs

To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:

pre-commit install

For more information, check our contributing guidelines.
#2819

Other Changes

Pipeline

  • Fix _debug info getting lost for previous nodes when using join nodes by @tstadel in #2776
  • fix pipeline run loop on joined pipelines without debug flag by @tstadel in #2777
  • Fix crawler long file names by @danielbichuetti in #2723
  • Prevent PDFToTextConverter from failing on PDFs with spaces in their names by @danielbichuetti in #2786
  • Passing the meta-data in the summarizer response by @SjSnowball in #2179
  • Fix YAML validation for ElasticsearchDocumentStore.custom_query by @ZanSara in #2789
  • Fix gold_contexts_similarity for table retrieval evaluation by @tstadel in #2815
  • Fix validation for dynamic outgoing edges by @tstadel in #2850
  • Print eval reports improvements by @vblagoje in #2941
  • Add progress bar to batch run component ops by @vblagoje in #2864
  • feat: warn users if they're calling get_all_labels on a document index and vice-versa (Elasticsearch & Opensearch only) by @ZanSara in #2990
  • Make MultiLabel preserve order by @anakin87 in #2956
  • bug: fix UnboundLocalError in Pipeline.run_batch() by @anakin87 in #3016
  • feat: enable the JoinDocuments node to work with documents with score=None by @zoltan-fedor in #2984
  • Resolving issue 2853: no answer logic in FARMReader by @sjrl in #2856
  • bug: Make TranslationWrapperPipeline work with QuestionAnswerGenerationPipeline by @bogdankostic in #3034

Models

DocumentStores

Documentation

Tutorials

Other Changes

  • API key check in OpenAIAnswerGenerator by @ZanSara in #2791
  • API tests by @masci in #2738
  • Allow values that are not dictionaries in the request params in the /search endpoint by @masci in #2720
  • fix healthcheck cmds for annotation tool postgres by @tstadel in #2840
  • Remove deprecated method prepare_seq2seq_batch by @anakin87 in #2852
  • Fix corrupted csv from EvaluationResult.save() by @tstadel in #2854
  • Fix audio dependency chain issue on Python 3.10 by @danielbichuetti in #2900
  • Add switch for BiAdaptive and TriAdaptiveModel in Evaluator by @ZanSara in #2908
  • Fix serialization of numpy arrays and pandas dataframes in REST API by @tstadel in #2838
  • Update minimum selenium version supported for crawler by @sjrl in #2921
  • Enable Opensearch unit tests in Windows CI by @masci in #2936
  • Remove unused variable by @sjrl in #2974
  • Bump streamlit version to latest by @masci in #3002
  • Testing order in test_multilabel by @jamescalam in #3015
  • fix: move azure-core pin into the dev dependency list by @ZanSara in #3022
  • Fix broken MultiLabel serialization by @tstadel in #3037

New Contributors

❤️ Big thanks to all contributors and the whole community!

Full Changelog: v1.6.0...v1.7.0
