deepset-ai/haystack v0.5.0


Highlights

💬 Generative Question Answering via RAG (#484)

Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG").
Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast, GPT-3 relies on the web data seen during training)

Example:

    question = "who got the first nobel prize in physics?"

    # Retrieve related documents from retriever
    retrieved_docs = retriever.retrieve(query=question)

    # Now generate answer from question and retrieved documents
    predicted_result = generator.predict(
        question=question,
        documents=retrieved_docs,
        top_k=1
    )
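
For context, here is a minimal sketch of how the retriever and generator used above could be set up. The import paths, class names, and model identifier below are assumptions based on this release and may not match your installed version exactly:

    # Setup sketch only -- import paths and the model name are assumptions,
    # not guaranteed to match this Haystack version exactly.
    from haystack.document_store.faiss import FAISSDocumentStore
    from haystack.retriever.dense import DensePassageRetriever
    from haystack.generator.transformers import RAGenerator

    document_store = FAISSDocumentStore()  # holds documents + DPR embeddings
    retriever = DensePassageRetriever(document_store=document_store)
    generator = RAGenerator(model_name_or_path="facebook/rag-token-nq")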

You can already play around with it in this minimal tutorial:

We are looking forward to improving this class of models further in the coming months and already plan a tighter integration into the Finder class.

↗️ Better DPR (incl. training) (#527)

We migrated the existing DensePassageRetriever to its own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.

The new train() signature:

DensePassageRetriever.train(
    data_dir: str,
    train_filename: str,
    dev_filename: str = None,
    test_filename: str = None,
    batch_size: int = 16,
    embed_title: bool = True,
    num_hard_negatives: int = 1,
    n_epochs: int = 3,
)
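
A usage sketch (the data directory and file names below are placeholders for your own DPR-format training data, and retriever is assumed to be an existing DensePassageRetriever instance):

    # Placeholder paths -- substitute your own DPR-format training files
    retriever.train(
        data_dir="data/dpr_training",
        train_filename="train.json",
        dev_filename="dev.json",
        batch_size=16,
        embed_title=True,
        num_hard_negatives=1,
        n_epochs=3,
    )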

Future improvements: At the moment, training is only supported on a single GPU. We will add support for multi-GPU training via DDP soon.

📊 New Benchmarks

We are happy to introduce a new benchmark section on our website!
Do you wonder whether you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elasticsearch's BM25? How would this impact speed and accuracy?

See the relevant metrics here to guide your decision:
👉 https://haystack.deepset.ai/bm/benchmarks

We will extend this section over time with more models, metrics and key parameters.

⚠️ Breaking Changes

Consistent parameter naming for TransformersReader #510

# old
TransformersReader(model="distilbert-base-uncased-distilled-squad", ...)

# new
TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", ...)

FAISS: Remove phi normalization, support more index types #467

The new default index type is "Flat", and the init params have changed slightly:

# old
FAISSDocumentStore(
        sql_url: str = "sqlite:///",
        index_buffer_size: int = 10_000,
        vector_size: int = 768,
        faiss_index: Optional[IndexHNSWFlat] = None,

# new
FAISSDocumentStore(
        sql_url: str = "sqlite:///",
        index_buffer_size: int = 10_000,
        vector_dim: int = 768,
        faiss_index_factory_str: str = "Flat",
        faiss_index: Optional[faiss.swigfaiss.Index] = None,
        return_embedding: Optional[bool] = True,
        **kwargs,
)
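
For example (the values below are illustrative; "Flat" is the new default, and other FAISS index types can be selected via the factory string):

    # Default exact ("Flat") index
    document_store = FAISSDocumentStore(sql_url="sqlite:///", vector_dim=768)

    # Approximate HNSW index, selected via the FAISS index factory string
    document_store = FAISSDocumentStore(faiss_index_factory_str="HNSW")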

DPR signature

We split max_seq_len into two independent params (max_seq_len_query and max_seq_len_passage) and removed the remove_sep_tok_from_untitled_passages param.

# old
DensePassageRetriever(
                 document_store: BaseDocumentStore,
                 query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
                 passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
                 max_seq_len: int = 256,
                 use_gpu: bool = True,
                 batch_size: int = 16,
                 embed_title: bool = True,
                 remove_sep_tok_from_untitled_passages: bool = True
                 )

# new
DensePassageRetriever(
                 document_store: BaseDocumentStore,
                 query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
                 passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
                 max_seq_len_query: int = 64,
                 max_seq_len_passage: int = 256,
                 use_gpu: bool = True,
                 batch_size: int = 16,
                 embed_title: bool = True,
                 use_fast_tokenizers: bool = True,
                 similarity_function: str = "dot_product"
                 )
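
A short initialization sketch using the split sequence-length params (document_store is assumed to be an already-initialized DocumentStore instance; the model names are the defaults shown above):

    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
        max_seq_len_query=64,
        max_seq_len_passage=256,
        embed_title=True,
        use_fast_tokenizers=True,
    )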

Detailed Changes

Preprocessing / File Conversion

  • Add preprocessing pipeline #473
  • Restructure checks in PreProcessor #504
  • Updated the example code to Indexing PDF / Docx files #502
  • Fix meta data = None in PreProcessor #496
  • add explicit encoding mode to file_converter/txt.py #478
  • Skip file conversion if file type is not supported #456

DocumentStore

  • Add support for MySQL database #556
  • Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
  • Add support to return embedding #514
  • Fix scoring in Elasticsearch for dot product #517
  • Allow filters for get_document_count() #512
  • Make creation of label index optional #490
  • Fix update_embeddings function in FAISSDocumentStore #481
  • FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
  • Enable bulk operations on vector IDs for FAISSDocumentStore #460
  • fixing ElasticsearchDocumentStore initialisation #415
  • bug: filters on a query_by_embedding #464

Retriever

  • DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
  • Fix retriever evaluation metrics #547
  • Add save and load method for DPR #550
  • Typo in dense.py comment #545
  • Make returning predictions in Finder & Retriever eval() possible #524
  • Make title info optional when evaluating on QA data #494
  • Make sentence-transformers usage more user-friendly #439

Reader

  • Fix FARMReader.eval() handling of no_answers #531
  • Added automatic mixed precision (AMP) support for reader training from Haystack side #463
  • Update ONNX conversion for FARMReader #438

Other

  • Fix sentencepiece dependencies in Dockerfiles #553
  • Update Dockerfile #537
  • Removing (deprecated) warnings from the Haystack codebase. #530
  • Pytest fix memory leak and put pytest marker on slow tests #520
  • [enhancement] Create deploy_website.yml #450
  • Add Docker Images & Setup for the Annotation Tool #444

REST API

  • Make filter value optional in REST API #497
  • Add Elasticsearch Query DSL compliant Query API #471
  • Allow configuration of log level in REST API #541
  • Add create_index and similarity metric to api config #493
  • Add deepcopy for meta dicts in answers #485
  • Fix windows platform installation #480
  • Update GPU docker & fix race condition with multiple workers #436

Documentation / Benchmarks / Tutorials

  • New readme #534
  • Add public roadmap #432
  • Time and performance benchmarks for all readers and retrievers #339
  • Added new formatting for examples in docstrings #555
  • Update annotation docs for website #505
  • Add annotation tool manual to README.md #523
  • Change metric to queries per second on benchmarks webpage #529
  • Add --ci and --update-json to CLI for benchmarks #522
  • Add requirement to colab notebooks #509
  • Update doc string for ElasticsearchDocumentStore.write_documents() & sync markdown files #501
  • Add versioning docs #495
  • READ.me for Docstring Generation #468
  • Separate data and view for benchmarks #451
  • Update DPR docstring for embed_title #459
  • Update Tutorial4_FAQ_style_QA.py #416

❤️ Big thanks to all contributors!

@lalitpagaria @guillim @elyase @kolk @rsanjaykamath @antoniolanza1996 @Zenahr @Futurne @tanaysoni @tholor @Timoeller @PiffPaffM @bogdankostic
