github deepset-ai/haystack v0.6.0


⭐ Highlights

Flexible Pipelines powered by DAGs (#596)

In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we have always had great building blocks in Haystack, we lacked a good way to stick them together. That's why we put a lot of thought into it over the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example of a "standard" Open-Domain QA pipeline:

from haystack.pipeline import Pipeline

# retriever and reader are initialized beforehand (e.g. an ElasticsearchRetriever and a FARMReader)
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)

You can draw the DAG to better inspect what you are building:

p.draw(path="custom_pipe.png")


Multiple retrievers

You can now also use multiple Retrievers and join their results:

p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)


Custom nodes

You can easily build your own custom nodes. Just respect the following requirements (a minimal sketch follows after this list):

  1. Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
  2. Do whatever you want within run() (e.g. reformatting the query)
  3. Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. return (output_dict, "output_1")
  4. Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
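Here is a minimal sketch of such a node that follows the contract above; the class name QueryCleaner and its whitespace-stripping logic are made up purely for illustration:

class QueryCleaner:
    # regular node: exactly one outgoing edge (only decision nodes need more)
    outgoing_edges = 1

    def run(self, **kwargs):
        # kwargs holds the output of the previous node, including the incoming query
        kwargs["query"] = kwargs["query"].strip()
        # return the data for the next node and the name of the outgoing edge
        return kwargs, "output_1"

You can then plug it into a pipeline like any built-in node, e.g. pipe.add_node(component=QueryCleaner(), name="QueryCleaner", inputs=["Query"]).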

Decision nodes

Or you can add decision nodes where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and, depending on the result, route it to different modules:

    class QueryClassifier():
        outgoing_edges = 2

        def run(self, **kwargs):
            if "?" in kwargs["query"]:
                return (kwargs, "output_1")

            else:
                return (kwargs, "output_2")

    pipe = Pipeline()
    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
                  inputs=["ESRetriever", "DPRRetriever"])
    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)

Default Pipelines (replacing the "Finder")

Last but not least, we added some "Default Pipelines" that let you run standard patterns with very few lines of code.
They replace the Finder class, which is now deprecated.

from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, Pipeline, JoinDocuments

# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)

# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)

# Generative QA
gen_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = gen_pipe.run(query="Physics Einstein", top_k_retriever=1)

# FAQ based QA
faq_pipe = FAQPipeline(retriever=retriever)
res = faq_pipe.run(query="How can I change my address?", top_k_retriever=3)

We are planning many more features around the new pipelines, including parallelized execution, distributed execution, definition via YAML files, dry runs, and more.

New DocumentStore for Open Distro for Elasticsearch (#676)

From now on, we also support Open Distro for Elasticsearch. This makes it easier to use many hosted Elasticsearch services (e.g. from AWS) with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:

document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
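"Similar usage" means the familiar DocumentStore methods behave the same way. A rough sketch (the sample document and its dict layout are assumptions for illustration, not taken from the release itself):

# write and read documents exactly as with the regular ElasticsearchDocumentStore
document_store.write_documents([{"text": "Albert Einstein worked on the theory of relativity.", "meta": {"name": "einstein_bio"}}])
docs = document_store.get_all_documents()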

⚠️ Breaking Changes

As Haystack is expanding from QA to other search types, we decided to rename all parameters from question to query.
This includes, for example, the predict() methods of the Readers, but also several other places. See #614 for details.
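For illustration, a Reader call now passes query instead of question; the documents and top_k arguments below are just a sketch of a typical call and may differ for your Reader:

# docs: a list of candidate Documents, e.g. returned by a Retriever

# before v0.6.0 (old parameter name):
# prediction = reader.predict(question="What did Einstein work on?", documents=docs, top_k=5)

# from v0.6.0 on (new parameter name):
prediction = reader.predict(query="What did Einstein work on?", documents=docs, top_k=5)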

🤓 Detailed Changes

Preprocessing / File Conversion

  • Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
  • Add needed whitespace before sentence start #582

DocumentStore

  • Scale dot product into probabilities #667
  • Add refresh_type param for Elasticsearch update_embeddings() #630
  • Add return_embedding parameter for get_all_documents() #615
  • Adding support for update_existing_documents to sql and faiss document stores #584
  • Add filters for delete_all_documents() #591

Retriever

  • Fix saving tokenizers in DPR training + unify save and load dirs #682
  • fix a typo, num_negatives -> num_positives #681
  • Refactor DensePassageRetriever._get_predictions #642
  • Move DPR embeddings from GPU to CPU straight away #618
  • Add MAP retriever metric for open-domain case #572

Reader / Generator

  • add GPU support for rag #669
  • Enable dynamic parameter updates for the FARMReader #650
  • Add option in API Config to configure if reader can return "No Answer" #609
  • Fix various generator issues #590

Pipeline

  • Add support for building custom Search Pipelines #596
  • Add set_node() for Pipeline #659
  • Add support for aggregating scores in JoinDocuments node #683
  • Add pipelines for GenerativeQA & FAQs #645

Other

  • Cleanup Pytest Fixtures #639
  • Add latest benchmark run #652
  • Fix image links in tutorial #663
  • Update query arg in Tutorial 7 #656
  • Fix benchmarks #648
  • Add link to FAISS Info in documentation #643
  • Improve User Feedback Documentation #539
  • Add formatting checks for shell scripts #627
  • Update md files for API docs #631
  • Clean API docs and increase coverage #621
  • Add boxes for recommendations #629
  • Automate benchmarks via CML #518
  • Add contributor hall of fame #628
  • README: Fix link to roadmap #626
  • Fix docstring examples #604
  • Cleaning the api docs #616
  • Fix link to DocumentStore page #613
  • Make more changes to documentation #578
  • Remove column in benchmark website #608
  • Make benchmarks clearer #606
  • Fixing defaults configs for rest_apis #583
  • Allow list of filter values in REST API #568
  • Fix CI bug due to new Elasticsearch release and new model release #579
  • Update Colab Torch Version #576

❤️ Big thanks to all contributors!

@sadakmed @Krak91 @icy @lalitpagaria @guillim @tanaysoni @tholor @Timoeller @PiffPaffM @bogdankostic
