deepset-ai/haystack 0.3.0 on GitHub

🔍 Dense Passage Retrieval

We are glad to introduce the new Dense Passage Retriever (aka DPR).
Dense embeddings are a powerful alternative way to score the similarity of texts. This retriever uses two BERT models: one to embed your query, one to embed your passage. This dual-encoder architecture copes much better with the different nature of queries and texts (length, syntax, ...). It was published by Karpukhin et al. and shows impressive performance, especially if there's no direct overlap between the tokens in your queries and your texts.

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]

See Tutorial 6 for more details
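
To make the dual-encoder idea concrete, here is a minimal sketch of how two separate encoders can rank passages by inner product, the similarity used in the DPR paper. This is not Haystack's implementation: the keyword tables, `encode_query`, `encode_passage` and `rank_passages` are hypothetical stand-ins for the two BERT models, chosen so the example runs with the standard library only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy "encoders": each side has its OWN vocabulary, mimicking the
# two separate models of a dual encoder. Real DPR uses two BERT models instead.
QUERY_KEYWORDS: Dict[str, int] = {"compare": 0, "vectors": 0, "bake": 1}
PASSAGE_KEYWORDS: Dict[str, int] = {
    "cosine": 0, "similarity": 0, "embeddings": 0, "oven": 1, "flour": 1,
}


def _toy_encode(text: str, keywords: Dict[str, int], dim: int = 2) -> List[float]:
    """Count keyword hits per dimension (a stand-in for a learned embedding)."""
    vec = [0.0] * dim
    for token in text.lower().replace("?", " ").replace(".", " ").split():
        if token in keywords:
            vec[keywords[token]] += 1.0
    return vec


def encode_query(text: str) -> List[float]:
    return _toy_encode(text, QUERY_KEYWORDS)


def encode_passage(text: str) -> List[float]:
    return _toy_encode(text, PASSAGE_KEYWORDS)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query: str,
    passages: List[str],
    query_encoder: Callable[[str], List[float]],
    passage_encoder: Callable[[str], List[float]],
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    """Score every passage against the query by inner product, best first."""
    q_vec = query_encoder(query)
    scored = [(dot(q_vec, passage_encoder(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


passages = [
    "Preheat the oven and sift the flour.",
    "Cosine similarity measures the angle between embeddings.",
]
# The query shares no tokens with the best passage, yet both land in the
# same region of the (toy) embedding space.
ranked = rank_passages(
    "How do I compare two vectors?", passages, encode_query, encode_passage, top_k=1
)
print(ranked[0][1])
```

In a real setup the passage embeddings would be computed once at indexing time and only the query would be encoded at search time.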

📊 Evaluation

We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.

document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)

See Tutorial 5 for more details
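
One way to approach the question "is it worth increasing top_k for the retriever?" is to measure how often the correct document appears among the first k results. The sketch below is an illustration under assumptions, not what eval() does internally: `recall_at_k`, the document ids and the rankings are all made up for the example.

```python
from typing import Dict, List


def recall_at_k(
    ranked_ids: Dict[str, List[str]], gold_ids: Dict[str, str], k: int
) -> float:
    """Fraction of questions whose gold document is among the top-k results."""
    hits = sum(
        1
        for question, gold in gold_ids.items()
        if gold in ranked_ids.get(question, [])[:k]
    )
    return hits / len(gold_ids)


# Hypothetical retriever output: document ids per question, best first.
ranked_ids = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d6", "d5"],
    "q4": ["d0", "d11", "d12"],
}
# Hypothetical labels: the one document that contains the answer.
gold_ids = {"q1": "d1", "q2": "d2", "q3": "d5", "q4": "d10"}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(ranked_ids, gold_ids, k):.2f}")
```

If recall keeps climbing as k grows, the retriever is likely the bottleneck and a larger top_k (at the cost of reader speed) may pay off; if it plateaus early, the reader is the better place to look.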

📄 Basic Support for PDF and Docx Files

You can now index PDF and docx files into your DocumentStore more easily. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True,
                               valid_languages=["de", "en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)
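
The split between shared cleaning in a base class and format-specific extraction in child classes can be pictured roughly as follows. This is a sketch of the pattern only, not the code of BaseConverter: the class names `SketchBaseConverter` and `FormFeedTextConverter`, the "mostly digits" heuristic and its threshold are assumptions made for illustration.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import List


class SketchBaseConverter(ABC):
    """Hypothetical base class: owns the cleaning, delegates the extraction."""

    def __init__(self, remove_numeric_tables: bool = False):
        self.remove_numeric_tables = remove_numeric_tables

    @abstractmethod
    def extract_pages(self, file_path: str) -> List[str]:
        """Return the text of the file, one string per page."""

    def clean(self, page: str) -> str:
        """Drop lines that look like rows of a numeric table, if enabled."""
        if not self.remove_numeric_tables:
            return page
        kept = [ln for ln in page.splitlines() if not self._is_mostly_numeric(ln)]
        return "\n".join(kept)

    @staticmethod
    def _is_mostly_numeric(line: str, threshold: float = 0.5) -> bool:
        # Assumed heuristic: more than half of the non-space characters are digits.
        chars = [c for c in line if not c.isspace()]
        if not chars:
            return False
        return sum(c.isdigit() for c in chars) / len(chars) > threshold


class FormFeedTextConverter(SketchBaseConverter):
    """Hypothetical child class: plain-text files with form-feed page breaks."""

    def extract_pages(self, file_path: str) -> List[str]:
        with open(file_path, encoding="utf-8") as f:
            raw = f.read()
        return [self.clean(page) for page in raw.split("\f")]


# Small demo on a temporary two-page file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Revenue grew.\n2019 120 340\n\fSecond page.")
    pages = FormFeedTextConverter(remove_numeric_tables=True).extract_pages(path)

print(pages)
```

Supporting a further file format then only needs another child class with its own extract_pages(), while the cleaning options stay in one place.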

And there's much more that happened ...

Preprocessing

Added Support for Docx Files #225
Add PDF parser for indexing #109
Adjust PDF conversion subprocess for Python v3.6 #194
Fix boundary condition in detection of header/footer in file converters #165

Retriever

Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
Add dummy retriever for benchmarking / reader-only settings #235
Fix id for documents returned by the TfidfRetriever #232
Tutorial for Dense Passage Retriever #186
Fix device arg for sentence transformers #124
Fix embeddings from sentence-transformers (type cast & gpu flags) #121
Adding metadata to be returned from tfidf retriever #122

Reader

Add ONNXRuntime support #157
Fix multi gpu training via Dataparallel #234
Fix document id missing in farm inference output #174
Add document meta for Transformer Reader #114
Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
Adjust to farm handling of no answer #170

DocumentStores

Move document_name attribute to meta #217
Remove meta field when writing documents in Elasticsearch #240
Harmonize meta data handling across doc stores #214
Add filtering by tags for InMemoryDocumentStore #108
Make FAQ question field customizable #146
Increase timeout for Elasticsearch bulk indexing #119
Add embedding query for InMemoryDocumentStore #112
Increase timeout for bulk indexing in ES #130
Add custom port to ElasticsearchDocumentStore #129
Remove hard-coded embedding field #107

REST API

Move out REST API from PyPI package #160
Fix format of /export-doc-qa-feedback to comply with SQuAD #241
Create file upload directory in the REST API #166
Add API endpoint to upload files #154
Missing PORT and SCHEME for elasticsearch to run the API #134
Add EMBEDDING_MODEL_FORMAT in API config #152
Add success response for successful file upload API #195
Add response time in logs #201
Fix rest api in Docker image after refactoring #178

Other

Upgrade to new FARM / Transformers / PyTorch versions #212
Fix Evaluation Dataset #233
Remove mutation of documents in write_documents() #231
Remove mutation of results dict in print_answers() #230
Make doc name optional #100
Fix Dockerfile to build successfully without models directory #210
Docker available for TransformersReader Class #180
Fix embedding method in FAQ-QA Tutorial #220
Add more tests #213
Update docstring for embedding_field and embedding_dim #208
Make "meta" field generic for Document Schema #102
Update tutorials #200
Upgrade FARM version #172
Fix for installing PyTorch on Windows OS #159
Remove Literal type hint #156
Remove PyMuPDF dependency #148
Add missing type hints #138
Add a GitHub Action to start Elasticsearch instance for Build workflow #142
Correct field in evaluation tutorial #139
Update Haystack version in tutorials #136
Fix evaluation #132
Add stalebot #131
Add Reader/Retriever validations in Finder #113
Add document metadata for FAQ style QA #106
Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
Make saving more explicit in tutorial #95

Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan