⭐ Highlights

This time we have a couple of smaller yet important feature highlights, lots of them coming from you, our amazing community! 🥂
Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳

Here's what you'll find in Haystack 1.7:

Support for OpenAI GPT-3

If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it like any other model. Just sign up to OpenAI, copy your API key, and run the following code. To compare it to other models, check out our evaluation guide.

from haystack.nodes import OpenAIAnswerGenerator
from haystack import Document

reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)

docs = [Document(content="""The Big Bang Theory is an American sitcom.
                            The four main characters are all avid fans of nerd culture. 
                            Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia. 
                            Star Trek in particular is frequently referenced""")]
res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
print(res)

#2605
#3036

Zero-Shot Query Classification

Until now, TransformersQueryClassifier was very closely built around the excellent binary query-type classifier model of shahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to models that output binary labels. One of our amazing community contributions has now lifted this restriction.
But that's not all: @anakin87 added support for zero-shot classification models as well!
So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:

from haystack.nodes import TransformersQueryClassifier

# In zero-shot-classification, you are free to choose the labels
labels = ["music", "cinema", "food"]

query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)

queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]

for query in queries:
    result = query_classifier.run(query=query)
    print(f'Query "{query}" was sent to {result[1]}')

#2965
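The loop above prints an edge name rather than a category. As a minimal sketch, assuming the classifier names its outgoing edges `output_1`, `output_2`, and so on in the same order as `labels`, an edge name can be translated back into its label like this. `edge_to_label` is a hypothetical helper for illustration and is not part of Haystack.

```python
# Hypothetical helper, not part of Haystack. It assumes outgoing edges are
# named "output_1", "output_2", ... in the same order as `labels`.

labels = ["music", "cinema", "food"]


def edge_to_label(edge_name, labels):
    """Translate an edge name such as "output_2" back into its label."""
    prefix = "output_"
    if not edge_name.startswith(prefix):
        raise ValueError(f"Unexpected edge name: {edge_name!r}")
    # Edge numbering starts at 1, list indices start at 0.
    index = int(edge_name[len(prefix):]) - 1
    if not 0 <= index < len(labels):
        raise ValueError(f"No label for edge: {edge_name!r}")
    return labels[index]


print(edge_to_label("output_2", labels))
```

With the labels from the example, a query routed to `output_2` would belong to the "cinema" branch of the pipeline, if the ordering assumption holds.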

Adding Page Numbers to Document Meta

Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're just halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make it easier, we added the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. If you set it to True, it adds a meta field "page" to documents containing the page number of the text snippet or a table within the original file.

from haystack import Pipeline
from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore

converter = PDFToTextConverter()
preprocessor = PreProcessor(add_page_number=True)
document_store = InMemoryDocumentStore()

# The names listed in `inputs` must match the names of previously added nodes
pipeline = Pipeline()
pipeline.add_node(component=converter, name="Converter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["Converter"])
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

#2932
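To make the idea of a "page" meta field concrete, here is a small standalone sketch that does not use Haystack. It assumes that page boundaries in the converted text are marked by form feed characters (`"\f"`), and it records the page on which each chunk starts. The function `split_with_page_numbers` and the chunk layout are hypothetical and only illustrate the bookkeeping, not how PreProcessor is implemented.

```python
# Hypothetical helper, not part of Haystack: tracks page numbers while
# splitting text whose pages are separated by form feed characters ("\f").

def split_with_page_numbers(text, words_per_chunk):
    """Split text into word chunks and record the page each chunk starts on."""
    chunks = []
    current_words = []
    start_page = 1
    for page, page_text in enumerate(text.split("\f"), start=1):
        for word in page_text.split():
            if not current_words:
                # A new chunk begins here, so remember the current page.
                start_page = page
            current_words.append(word)
            if len(current_words) == words_per_chunk:
                chunks.append(
                    {"content": " ".join(current_words), "meta": {"page": start_page}}
                )
                current_words = []
    if current_words:
        # Flush the last, possibly shorter, chunk.
        chunks.append({"content": " ".join(current_words), "meta": {"page": start_page}})
    return chunks


text = "alpha beta gamma\fdelta epsilon zeta eta\ftheta"
chunks = split_with_page_numbers(text, words_per_chunk=3)
for chunk in chunks:
    print(chunk["meta"]["page"], chunk["content"])
```

A search UI could then use the page stored in each chunk's meta to open the document viewer at the right position instead of at the first page.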

Gradient Accumulation for FARMReader

Training big Transformer models in low-resource environments is hard. Batch size plays a significant role when it comes to hyper-parameter tuning during the training process. The batch size you can use on your machine is restricted by the amount of memory available on your GPUs. Gradient accumulation is a well-known technique to work around that restriction: it adds up the gradients across iterations and updates the model weights only once after a certain number of iterations.
We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/squad20"
reader.train(
    data_dir=data_dir,
    train_filename="dev-v2.0.json",
    use_gpu=True,
    n_epochs=1,
    save_dir="my_model",
    grad_acc_steps=8,
)

#2925
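The arithmetic behind gradient accumulation can be checked on a toy problem. The sketch below is plain Python and independent of Haystack: it fits a one-parameter linear model with mean squared error and shows that averaging the gradients of several equally sized micro-batches gives the same gradient as one large batch. All function names are hypothetical, and the way FARMReader applies `grad_acc_steps` internally may differ in detail.

```python
# Illustrative only: a tiny linear model y = w * x with mean squared error.
# None of these names come from Haystack; they are hypothetical helpers.

def mse_gradient(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Add up scaled gradients of equally sized micro-batches."""
    micro_batches = [
        data[i : i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    grad_acc_steps = len(micro_batches)
    total = 0.0
    for batch in micro_batches:
        # Scaling by 1 / grad_acc_steps makes the sum match the gradient
        # of a single batch containing all samples.
        total += mse_gradient(w, batch) / grad_acc_steps
    return total, grad_acc_steps


data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]  # 8 samples
w = 0.5

full_batch_grad = mse_gradient(w, data)
acc_grad, steps = accumulated_gradient(w, data, micro_batch_size=2)

# Only one weight update happens after all micro-batches were processed.
learning_rate = 0.01
w_after_update = w - learning_rate * acc_grad

print(steps, full_batch_grad, acc_grad)
```

Only one micro-batch has to fit into memory at a time, while the update behaves like one computed on the larger effective batch (micro-batch size times the number of accumulation steps).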

Extended Ray Support

Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with a dual-retriever setup on Ray. Also, we now support Ray Serve deployment arguments in Pipeline YAMLs so that you can fully control your Ray deployments.

pipelines:
  - name: ray_query_pipeline
    nodes:
      - name: EmbeddingRetriever
        replicas: 2
        inputs: [ Query ]
        serve_deployment_kwargs:
          num_replicas: 2
          version: Twenty
          ray_actor_options:
            num_gpus: 0.25
            num_cpus: 0.5
          max_concurrent_queries: 17
      - name: Reader
        inputs: [ EmbeddingRetriever ]

#2981
#2918

Support for Custom Sentence Tokenizers in PreProcessor

In some specific domains (for example, legal with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the tokenizer_model_folder parameter to PreProcessor.

from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_length=10,
    split_overlap=0,
    split_by="sentence",
    split_respect_sentence_boundary=False,
    language="pt",
    tokenizer_model_folder="/home/user/custom_tokenizer_models",
)

#2783

Making it Easier to Switch Document Stores

We had yet another amazing community contribution by @zoltan-fedor, who added support for BM25 with the Weaviate document store.
Besides that, we streamlined the methods of BaseDocumentStore and added update_document_meta() to InMemoryDocumentStore. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, use in-memory for quick prototyping, then move to something more production-ready).
#2860
#2689

Almost 2x Performance Gain for Electra Reader Models

We did a major refactoring of our language_modeling module, resolving a bug that caused Electra models to execute the forward pass twice.
#2703

⚠️ Breaking Changes

Add update_document_meta to InMemoryDocumentStore by @bogdankostic in #2689
Add support for BM25 with the Weaviate document store by @zoltan-fedor in #2860
Extending the Ray Serve integration to allow attributes for Serve deployments by @zoltan-fedor in #2918
bug: make MultiLabel ids consistent across python interpreters by @camillepradel in #2998

⚠️ Breaking Changes for Contributors

Default Branch will be Renamed to `main` on Tuesday, 16th of August

We will rename the default branch from master to main after this release. For a nice recap about good reasons for doing this, have a look at the Software Freedom Conservancy's blog.
Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

Pre-Commit Hooks Instead of CI Jobs

To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:

pre-commit install

For more information, check our contributing guidelines.
#2819

Other Changes

Pipeline

Fix _debug info getting lost for previous nodes when using join nodes by @tstadel in #2776
fix pipeline run loop on joined pipelines without debug flag by @tstadel in #2777
Fix crawler long file names by @danielbichuetti in #2723
Prevent PDFToTextConverter from failing on PDFs with spaces in their names by @danielbichuetti in #2786
Passing the meta-data in the summarizer response by @SjSnowball in #2179
Fix YAML validation for ElasticsearchDocumentStore.custom_query by @ZanSara in #2789
Fix gold_contexts_similarity for table retrieval evaluation by @tstadel in #2815
Fix validation for dynamic outgoing edges by @tstadel in #2850
Print eval reports improvements by @vblagoje in #2941
Add progress bar to batch run component ops by @vblagoje in #2864
feat: warn users if they're calling get_all_labels on a document index and vice-versa (Elasticsearch & Opensearch only) by @ZanSara in #2990
Make MultiLabel preserve order by @anakin87 in #2956
bug: fix UnboundLocalError in Pipeline.run_batch() by @anakin87 in #3016
feat: enable the JoinDocuments node to work with documents with score=None by @zoltan-fedor in #2984
Resolving issue 2853: no answer logic in FARMReader by @sjrl in #2856
bug: Make TranslationWrapperPipeline work with QuestionAnswerGenerationPipeline by @bogdankostic in #3034

Models

Simplify language_modeling.py and tokenization.py by @ZanSara in #2703
Validate OpenAI response by @anakin87 in #2844
remove unnecessary if else block #2835 by @kekayan in #2842
Explicitly specify all parameters to forward call by @vblagoje in #2886
Use batch_size in QuestionGenerator by @GianiStatie in #2870
Generalize , and tokens of QuestionGenerator node by @francescocastelli in #2769
Component batch_size should be defined rather than Optional by @vblagoje in #2958
Better check for "DebertaV2" architecture in Trainer.train by @sjrl in #2966

DocumentStores

Fix confusing elasticsearch exception by @tstadel in #2763
added mock pinecone client by @jamescalam in #2770
changed mock pinecone to use dict rather than list index by @jamescalam in #2845
Handle invalid metadata for SQLDocumentStore by @anakin87 in #2868
Use opensearch-py in OpenSearchDocumentStore by @masci in #2691
Wrap opensearch imports into safe_import by @ZanSara in #2907
Bug fix Weaviate document deletion by @stevenhaley in #2899
switch label variables in test_labels by @jamescalam in #3011
Adding support for additional distance/similarity metrics for Weaviate by @zoltan-fedor in #3001
test: add meta fields for meta_config to be used during testing by @jamescalam in #3021
Fix embeddings_field_supports_similarity of OpenSearchDocumentStore when creating index by @tstadel in #3030
Forbid the key id from Documents to be written in WeaviateDocumentStore by @thenewera-ru in #2846

Documentation

Trying out some smaller images for docs by @TuanaCelik in #2772
Clean OpenAIAnswerGenerator docstrings by @brandenchan in #2797
Add a custom pydoc renderer for Readme.io by @masci in #2825
Typo README.md by @danielfleischer in #2895
Fix typos in Contributing.md by @stevenhaley in #2897
Fix docs code format for sentence transformers by @bilgeyucel in #2957
Update Seq2SeqGenerator API documentation by @vblagoje in #2970
Add API page for util functions by @brandenchan in #2863
docs: update File Classifier Docstring by @brandenchan in #3018

Tutorials

Fix load_from_yaml example in the Pipelines tutorial by @agnieszka-m in #2774
Tutorial 12: add introduction by @vblagoje in #2798
Exclude docker from Tutorial 15 by @anakin87 in #2861
Remove logging config from Haystack by @julian-risch in #2848
docs: extend tutorial14 about query classification by @anakin87 in #3013
Tutorial 06: Replace DPR with EmbeddingRetriever by @bglearning in #2910

Other Changes

API key check in OpenAIAnswerGenerator by @ZanSara in #2791
API tests by @masci in #2738
Allow values that are not dictionaries in the request params in the /search endpoint by @masci in #2720
fix healthcheck cmds for annotation tool postgres by @tstadel in #2840
Remove deprecated method prepare_seq2seq_batch by @anakin87 in #2852
Fix corrupted csv from EvaluationResult.save() by @tstadel in #2854
Fix audio dependency chain issue on Python 3.10 by @danielbichuetti in #2900
Add switch for BiAdaptive and TriAdaptiveModel in Evaluator by @ZanSara in #2908
Fix serialization of numpy arrays and pandas dataframes in REST API by @tstadel in #2838
Update minimum selenium version supported for crawler by @sjrl in #2921
Enable Opensearch unit tests in Windows CI by @masci in #2936
Remove unused variable by @sjrl in #2974
Bump streamlit version to latest by @masci in #3002
Testing order in test_multilabel by @jamescalam in #3015
fix: move azure-core pin into the dev dependency list by @ZanSara in #3022
Fix broken MultiLabel serialization by @tstadel in #3037

New Contributors

@kekayan made their first contribution in #2842
@sjrl made their first contribution in #2884
@zoltan-fedor made their first contribution in #2860
@danielfleischer made their first contribution in #2895
@stevenhaley made their first contribution in #2897
@GianiStatie made their first contribution in #2870
@bglearning made their first contribution in #2910
@bilgeyucel made their first contribution in #2957
@wochinge made their first contribution in #2883
@camillepradel made their first contribution in #2998
@thenewera-ru made their first contribution in #2846

❤️ Big thanks to all contributors and the whole community!

Full Changelog: v1.6.0...v1.7.0

deepset-ai/haystack v1.7.0 on GitHub
