github deepset-ai/haystack v0.4.0


Highlights

💥 New Project Website & Documentation

As the project is growing, we have more and more content that doesn't fit in GitHub.
In this first version of the website, we focused on documentation, including a quick start, usage guides, and the API reference.
In the future, we plan to extend this with benchmarks, FAQs, use cases, and other content that helps you build your QA system.

👉 https://haystack.deepset.ai

📈 Scalable dense retrieval: FAISSDocumentStore

With recent performance gains of dense retrieval methods (learn more about it here), we need document stores that efficiently store vectors and find the most similar ones at query time. While Elasticsearch can also handle vectors, it quickly reaches its limits when dealing with larger datasets. We evaluated a couple of projects (FAISS, Scann, Milvus, Jina ...) that specialize in approximate nearest neighbour (ANN) algorithms for vector similarity. We decided to implement FAISS as it's easy to run in most environments.
We will likely add one of the heavier solutions (e.g. Jina or Milvus) later this year.

The FAISSDocumentStore uses FAISS to handle embeddings and SQL to store the actual texts and meta data.

Usage:

document_store = FAISSDocumentStore(sql_url="sqlite:///",   # SQL DB for text + meta data
                                    vector_size=768)        # dimensionality of your embeddings
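For context, a minimal end-to-end indexing sketch with the new store. The import paths below follow the module restructuring in this release but are assumptions; adjust them to your installed version.

# Minimal sketch -- import paths are assumptions based on the restructured modules
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = FAISSDocumentStore(sql_url="sqlite:///", vector_size=768)
document_store.write_documents([{"text": "Haystack scales dense retrieval with FAISS.",
                                 "meta": {"category": "release-notes"}}])

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base")
document_store.update_embeddings(retriever)   # compute vectors and store them in FAISS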

📃 More input file formats: Apache Tika File Converter (#314)

Thanks to @dany-nonstop, you can now extract text from many file formats (docx, pptx, html, epub, odf ...) via Apache Tika.

Usage:

  1. Start the Apache Tika server:
docker run -d -p 9998:9998 apache/tika
  2. Do the conversion in Haystack:
tika_converter = TikaConverter(
    tika_url="http://localhost:9998/tika",
    remove_numeric_tables=False,
    remove_whitespace=False,
    remove_empty_lines=False,
    remove_header_footer=False,
    valid_languages=None,
)
>>> doc = tika_converter.convert(file_path=Path("test/samples/pdf/sample_pdf_1.pdf"))
>>> doc
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
}
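The returned dict can then be indexed like any other document. A short illustrative follow-up; document_store stands for whichever store you use elsewhere and is not part of the converter API:

# Illustrative follow-up: index the converted text together with some custom meta data
document_store.write_documents([{"text": doc["text"],
                                 "meta": {"name": "sample_pdf_1.pdf"}}])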

Breaking changes

Restructuring / Renaming of modules (Breaking changes!) (#379)

We've restructured the package to make the usage more intuitive and terminology more consistent.

  1. Rename the database module -> document_store
  2. Split the indexing module into -> file_converter and preprocessor
  3. Move the Document, Label and MultiLabel classes into -> schema and simplify the import to from haystack import Document, Label, MultiLabel (see the import sketch below)
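A before/after of the imports. The old path is reconstructed from the module names above, and the new submodule names beyond the simplified from haystack import are assumptions; verify them against your install.

# Old (pre-0.4.0) layout, per the module names above -- for comparison only
# from haystack.database.elasticsearch import ElasticsearchDocumentStore

# New (0.4.0) layout -- exact submodule names are assumptions
from haystack import Document, Label, MultiLabel
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.file_converter.tika import TikaConverter
from haystack.preprocessor.utils import convert_files_to_dicts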

File converters (#393)

We refactored major parts of the file converters. They no longer return individual pages, but instead insert page break symbols (\f) into the text that can be accessed further down the pipeline.

Old:

>>> pages, meta = FileConverter.extract_pages(file_path=Path("..."))

New:

>>> doc = FileConverter.convert(file_path="...", meta={"name": "some_name", "category": "news"})
>>> doc
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"name": "..."}
}
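If you still need per-page access downstream, you can split on the page break symbol yourself. A small illustrative sketch, not part of the Haystack API:

# Illustrative only: recover pages from the page break symbols in the converted text
pages = doc["text"].split("\f")
for page_no, page_text in enumerate(pages, start=1):
    print(page_no, page_text.strip()[:60])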

DensePassageRetriever (#308)

We refactored the DensePassageRetriever from Facebook's original DPR code base to the transformers code base and now load the models from the Hugging Face model hub.
The signature has therefore changed to:

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  remove_sep_tok_from_untitled_passages=True)
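Building on that, a short usage sketch. retrieve is the usual query call in this release, but treat the exact signature and the printed Document attributes as assumptions:

# Hedged sketch: query with the refactored retriever
# (run document_store.update_embeddings(retriever) first so FAISS/ES holds the DPR vectors)
results = retriever.retrieve(query="What is dense passage retrieval?", top_k=5)
for doc in results:
    print(doc.text[:80], doc.meta)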

Deprecate Tags for Document Stores (#286)

We removed the "tags" field that could previously be associated with Documents and used for filtering your search.
Instead, we now use the more general concept of "meta", where you can supply any custom fields and filter on them at runtime.

Old:

dict = {"text": "some", "tags": ["category1", "category2"]}

New:

doc = {"text": "some", "meta": {"category": ["1", "2"]}}

Details

Document Stores

  • Add FAISS Document Store #253
  • Fix type casting for vectors in FAISS #399
  • Fix duplicate vector ids in FAISS #395
  • Fix document filtering in SQLDocumentStore #396
  • Move retriever probability calculations to document_store #389
  • Add FAISS query scores #368
  • Raise Exception if filters used for FAISSDocumentStore query #338
  • Add refresh_type arg to ElasticsearchDocumentStore #326
  • Improve speed for SQLDocumentStore #330
  • Fix indexing of metadata for FAISS/SQL Document Store #310
  • Ensure exact match when filtering by meta in Elasticsearch #311
  • Deprecate Tags for Document Stores #286
  • Add option to update existing documents when indexing #285
  • Cast document_ids as strings #284
  • Add method to update meta fields for documents in Elasticsearch #242
  • Custom mapping write doc fix #297

Retriever

  • DPR (Dense Retriever) for InMemoryDocumentStore #316 #332
  • Refactor DPR from FB to Transformers codebase #308
  • Restructure update embeddings #304
  • Added title during DPR passage embedding && ElasticsearchDocumentStore #298
  • Add eval for Dense Passage Retriever & Refactor handling of labels/feedback #243
  • Fix type of query_emb in DPR.retrieve() #247
  • Fix return type of EmbeddingRetriever to numpy array #245

Reader

  • More robust Reader eval by limiting max answers and creating no answer labels #331
  • Aggregate multiple no answers in MultiLabel #324
  • Add "no answer" aggregation to Transformersreader #259
  • Align TransformersReader with FARMReader #319
  • Datasilo use all cores for preprocessing #303
  • Batch prediction in evaluation #137
  • Aggregate label objects for same questions #292
  • Add num_processes to reader.train() to configure multiprocessing #271
  • Added support for unanswerable questions in TransformersReader #258

Preprocessing

  • Refactor file converter interface #393
  • Add Tika Converter #314

Finder

  • Add index arg to Finder.get_answers() and _via_similar_questions() #362

Documentation

  • Create documentation website #272
  • Use port 8000 in documentation #357
  • Documentation #343
  • Convert Documentation to markdown #386
  • Add logo to readme #384
  • Refactor the DPR tutorial to use FAISS #317
  • Make Tutorials Work on Colab GPUs #322

Other

  • Exclude embedding fields from the REST API #390
  • Fix test suite dependency issue on MacOS #374
  • Add Gunicorn timeout #364
  • Bump FARM version to 0.4.7 #340
  • Add Tests for MultiLabel #318
  • Modified search endpoints logs to dump json #290
  • Add export answers to CSV function #266

Big thanks to all contributors ♥️

@antoniolanza1996, @dany-nonstop, @philipp-bode, @lalitpagaria, @PiffPaffM, @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic, @maxupp, @kolk, @venuraja79, @karimjp
