github deepset-ai/haystack v0.6.0


⭐ Highlights

Flexible Pipelines powered by DAGs (#596)

In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we have always had great building blocks in Haystack, we lacked a good way to stick them together. That's why we put a lot of thought into it over the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example of a "standard" Open-Domain QA pipeline:

from haystack.pipeline import Pipeline

# retriever and reader are initialized beforehand (e.g. an ElasticsearchRetriever and a FARMReader)
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)

You can draw the DAG to better inspect what you are building:

p.draw(path="custom_pipe.png")


Multiple retrievers

You can now also use multiple Retrievers and join their results:

p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)


Custom nodes

You can easily build your own custom nodes. Just respect the following requirements (a minimal sketch follows after this list):

  1. Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
  2. Do whatever you want within run() (e.g. reformatting the query)
  3. Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. return (output_dict, "output_1")
  4. Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
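Here is a minimal sketch of such a node that follows the contract above; the class name QueryCleaner and its whitespace-stripping logic are made up purely for illustration:

class QueryCleaner:
    # regular node: exactly one outgoing edge (only decision nodes need more)
    outgoing_edges = 1

    def run(self, **kwargs):
        # kwargs holds the output of the previous node, including the incoming query
        kwargs["query"] = kwargs["query"].strip()
        # return the data for the next node and the name of the outgoing edge
        return kwargs, "output_1"

You can then plug it into a pipeline like any built-in node, e.g. pipe.add_node(component=QueryCleaner(), name="QueryCleaner", inputs=["Query"]).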

Decision nodes

Or you can add decision nodes where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and, depending on the result, route it to different modules:

    class QueryClassifier():
        outgoing_edges = 2

        def run(self, **kwargs):
            if "?" in kwargs["query"]:
                return (kwargs, "output_1")

            else:
                return (kwargs, "output_2")

    pipe = Pipeline()
    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
                  inputs=["ESRetriever", "DPRRetriever"])
    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)

Default Pipelines (replacing the "Finder")

Last but not least, we added some "Default Pipelines" that let you run standard patterns with very few lines of code.
They replace the Finder class, which is now deprecated.

from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, Pipeline, JoinDocuments

# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)

# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)

# Generative QA
gen_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = gen_pipe.run(query="Physics Einstein", top_k_retriever=1)

# FAQ based QA
faq_pipe = FAQPipeline(retriever=retriever)
res = faq_pipe.run(query="How can I change my address?", top_k_retriever=3)

We are planning many more features around the new pipelines, including parallelized execution, distributed execution, definition via YAML files, dry runs, and more.

New DocumentStore for Open Distro for Elasticsearch (#676)

From now on, we also support Open Distro for Elasticsearch. This makes it easier to use many hosted Elasticsearch services (e.g. from AWS) with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:

document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
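"Similar usage" means the familiar DocumentStore methods behave the same way. A rough sketch (the sample document and its dict layout are assumptions for illustration, not taken from the release itself):

# write and read documents exactly as with the regular ElasticsearchDocumentStore
document_store.write_documents([{"text": "Albert Einstein worked on the theory of relativity.", "meta": {"name": "einstein_bio"}}])
docs = document_store.get_all_documents()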

⚠️ Breaking Changes

As Haystack is expanding from QA to other search types, we decided to rename all parameters from question to query.
This includes, for example, the predict() methods of the Readers, but also several other places. See #614 for details.
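For illustration, a Reader call now passes query instead of question; the documents and top_k arguments below are just a sketch of a typical call and may differ for your Reader:

# docs: a list of candidate Documents, e.g. returned by a Retriever

# before v0.6.0 (old parameter name):
# prediction = reader.predict(question="What did Einstein work on?", documents=docs, top_k=5)

# from v0.6.0 on (new parameter name):
prediction = reader.predict(query="What did Einstein work on?", documents=docs, top_k=5)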

🤓 Detailed Changes

Preprocessing / File Conversion

  • Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
  • Add needed whitespace before sentence start #582

DocumentStore

  • Scale dot product into probabilities #667
  • Add refresh_type param for Elasticsearch update_embeddings() #630
  • Add return_embedding parameter for get_all_documents() #615
  • Adding support for update_existing_documents to sql and faiss document stores #584
  • Add filters for delete_all_documents() #591

Retriever

  • Fix saving tokenizers in DPR training + unify save and load dirs #682
  • fix a typo, num_negatives -> num_positives #681
  • Refactor DensePassageRetriever._get_predictions #642
  • Move DPR embeddings from GPU to CPU straight away #618
  • Add MAP retriever metric for open-domain case #572

Reader / Generator

  • add GPU support for rag #669
  • Enable dynamic parameter updates for the FARMReader #650
  • Add option in API Config to configure if reader can return "No Answer" #609
  • Fix various generator issues #590

Pipeline

  • Add support for building custom Search Pipelines #596
  • Add set_node() for Pipeline #659
  • Add support for aggregating scores in JoinDocuments node #683
  • Add pipelines for GenerativeQA & FAQs #645

Other

  • Cleanup Pytest Fixtures #639
  • Add latest benchmark run #652
  • Fix image links in tutorial #663
  • Update query arg in Tutorial 7 #656
  • Fix benchmarks #648
  • Add link to FAISS Info in documentation #643
  • Improve User Feedback Documentation #539
  • Add formatting checks for shell scripts #627
  • Update md files for API docs #631
  • Clean API docs and increase coverage #621
  • Add boxes for recommendations #629
  • Automate benchmarks via CML #518
  • Add contributor hall of fame #628
  • README: Fix link to roadmap #626
  • Fix docstring examples #604
  • Cleaning the api docs #616
  • Fix link to DocumentStore page #613
  • Make more changes to documentation #578
  • Remove column in benchmark website #608
  • Make benchmarks clearer #606
  • Fixing defaults configs for rest_apis #583
  • Allow list of filter values in REST API #568
  • Fix CI bug due to new Elasticsearch release and new model release #579
  • Update Colab Torch Version #576

❤️ Big thanks to all contributors!

@sadakmed @Krak91 @icy @lalitpagaria @guillim @tanaysoni @tholor @Timoeller @PiffPaffM @bogdankostic
