⭐ Highlights

Make Your QA Pipelines Talk with Audio Nodes! (#2584)

Indexing pipelines can use a new DocumentToSpeech node, which generates an audio file for each indexed document and stores it alongside the text content in a SpeechDocument. A GPU is recommended for this step to increase indexing speed. During querying, SpeechDocuments allow accessing the stored audio version of the documents the answers are extracted from. There is also a new AnswerToSpeech node that can be used in QA pipelines to generate the audio of an answer on the fly. See the new tutorial for a step by step guide on how to make your QA pipelines talk!

Save Models to Remote (#2618)

A new save_to_remote method was introduced to the FARMReader, so that you can easily upload a trained model to the Hugging Face Model Hub. More of this to come in the following releases!

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="roberta-base")
reader.train(data_dir="my_squad_data", train_filename="squad2.json", n_epochs=1, save_dir="my_model")

reader.save_to_remote(repo_id="your-user-name/roberta-base-squad2", private=True, commit_message="First version of my qa model trained with Haystack")

Note that you need to be logged in with transformers-cli login. Otherwise there will be an error message with instructions how to log in. Further, if you make your model private by setting private=True, others won't be able to use it and you will need to pass an authentication token when you reload the model from the Model Hub, which is created also via transformers-cli login.

new_reader = FARMReader(model_name_or_path="your-user-name/roberta-base-squad2", use_auth_token=True)

Multi-Hop Dense Retrieval (#2571)

There is a new MultihopEmbeddingRetriever node that applies iterative retrieval steps and a shared encoder for the query and the documents. Used together with a reader node in a QA pipeline, it is suited for answering complex open-domain questions that require "hopping" multiple relevant documents. See the original paper by Xiong et al. for more details: "Answering complex open-domain questions with multi-hop dense retrieval".

from haystack.nodes import MultihopEmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
retriever = MultihopEmbeddingRetriever(
            document_store=document_store,
            embedding_model="deutschmann/mdr_roberta_q_encoder",
        )

Big thanks to our community member @deutschmn for the PR!

InMemoryKnowledgeGraph (#2678)

Besides querying texts and tables, Haystack also allows querying knowledge graphs with the help of pre-trained models that translate text queries to graph queries. The latest Haystack release adds an InMemoryKnowledgeGraph allowing to store knowledge graphs without setting up complex graph databases. Try out the tutorial as a notebook on colab!

from pathlib import Path
from haystack.nodes import Text2SparqlRetriever
from haystack.document_stores import InMemoryKnowledgeGraph
from haystack.utils import fetch_archive_from_http

# Fetch data represented as triples of subject, predicate, and object statements
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip", output_dir="data/tutorial10")

# Fetch a pre-trained BART model that translates text queries to SPARQL queries
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip", output_dir="../saved_models/tutorial10/")

# Initialize knowledge graph and import triples from a ttl file
kg = InMemoryKnowledgeGraph(index="tutorial10")
kg.create_index()
kg.import_from_ttl_file(index="tutorial10", path=Path("data/tutorial10/triples.ttl"))

# Initialize retriever from pre-trained model
kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path("../saved_models/tutorial10/hp_v3.4"))

# Translate a text query to a SPARQL query and execute it on the knowledge graph
print(kgqa_retriever.retrieve(query="In which house is Harry Potter?"))

Big thanks to our community member @anakin87 for the PR!

Torch 1.12 and Transformers 4.20.1 Support

Haystack is now compatible with last week's PyTorch v1.12 release so that you can take advantage of Apple silicon GPUs (Apple M1) for accelerated training and evaluation. PyTorch shared an impressive analysis of speedups over CPU-only here.
Haystack is also compatible with the latest Transformers v4.20.1 release and we will continuously ensure that you can benefit from the latest features in Haystack!

Other Changes

Pipeline

Fix JoinAnswer/JoinNode by @MichelBartels in #2612
Reduce logging messages and simplify logging by @julian-risch in #2682
Correct docstring parameter name by @julian-risch in #2757
AnswerToSpeech by @ZanSara in #2584
Fix params being changed during pipeline.eval() by @tstadel in #2638
Make crawler extract also hidden text by @anakin87 in #2642
Update document scores based on ranker node by @mathislucka in #2048
Improved crawler support for dynamically loaded pages by @danielbichuetti in #2710
Replace deprecated Selenium methods by @ZanSara in #2724
Fix EvaluationSetCliet.get_labels() by @tstadel in #2690
Show warning in reader.eval() about differences compared to pipeline.eval() by @tstadel in #2477
Fix using id_hash_keys as pipeline params by @tstadel in #2717
Fix loading of tokenizers in DPR by @bogdankostic in #2755
Add support for Multi-Hop Dense Retrieval by @deutschmn in #2571
Create target folder if not exists in EvalResult.save() by @tstadel in #2647
Validate max_seq_length in SquadProcessor by @francescocastelli in #2740

Models

Use AutoTokenizer by default, to easily adapt to new models and token… by @apohllo in #1902
first version of save_to_remote for HF from FarmReader by @TuanaCelik in #2618

DocumentStores

Move Opensearch document store in its own module by @masci in #2603
Extract common code for ES and OS into a base class by @masci in #2664
Fix bugs in loading code from yaml by @masci in #2705
fix error in log message by @anakin87 in #2719
Pin es client to include bugfixes by @masci in #2735
Make check of document & embedding count optional in FAISS and Pinecone by @julian-risch in #2677
In memory knowledge graph by @anakin87 in #2678
Pinecone unary queries upgrade by @jamescalam in #2657
wait for postgres to be ready before data migrations by @masci in #2654

Documentation & Tutorials

Update docstrings for GPL by @agnieszka-m in #2633
Add GPL API docs, unit tests update by @vblagoje in #2634
Add GPL adaptation tutorial by @vblagoje in #2632
GPL tutorial - add GPU header and open in colab button by @vblagoje in #2736
Add execute_eval_run example to Tutorial 5 by @tstadel in #2459
Tutorial 14 edit by @robpasternak in #2663

Misc

Replace question issue with link to discussions by @masci in #2697
Upgrade transformers to 4.20.1 by @julian-risch in #2702
Upgrade torch to 1.12 by @julian-risch in #2741
Remove rapidfuzz version pin by @tstadel in #2730

New Contributors

@ryanrussell made their first contribution in #2617
@apohllo made their first contribution in #1902
@robpasternak made their first contribution in #2663
@danielbichuetti made their first contribution in #2710
@francescocastelli made their first contribution in #2740
@deutschmn made their first contribution in #2571

❤️ Big thanks to all contributors and the whole community!

Full Changelog: v1.5.0...v1.6.0

deepset-ai/haystack v1.6.0 on GitHub