github deepset-ai/haystack v0.4.0


Highlights

💥 New Project Website & Documentation

As the project is growing, we have more and more content that doesn't fit in GitHub.
In this first version of the website, we focused on documentation, including a quick start, usage guides, and the API reference.
In the future, we plan to extend this with benchmarks, FAQs, use cases, and other content that helps you build your QA system.

👉 https://haystack.deepset.ai

📈 Scalable dense retrieval: FAISSDocumentStore

With recent performance gains of dense retrieval methods (learn more about it here), we need document stores that efficiently store vectors and find the most similar ones at query time. While Elasticsearch can also handle vectors, it quickly reaches its limits when dealing with larger datasets. We evaluated a couple of projects (FAISS, Scann, Milvus, Jina ...) that specialize in approximate nearest neighbour (ANN) algorithms for vector similarity. We decided to implement FAISS as it's easy to run in most environments.
We will likely add one of the heavier solutions (e.g. Jina or Milvus) later this year.

The FAISSDocumentStore uses FAISS to handle embeddings and SQL to store the actual texts and meta data.

Usage:

document_store = FAISSDocumentStore(sql_url="sqlite:///",   # SQL DB for text + meta data
                                    vector_size=768)        # dimensionality of your embeddings
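For context, a minimal end-to-end indexing sketch with the new store. The import paths below follow the module restructuring in this release but are assumptions; adjust them to your installed version.

# Minimal sketch -- import paths are assumptions based on the restructured modules
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = FAISSDocumentStore(sql_url="sqlite:///", vector_size=768)
document_store.write_documents([{"text": "Haystack scales dense retrieval with FAISS.",
                                 "meta": {"category": "release-notes"}}])

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base")
document_store.update_embeddings(retriever)   # compute vectors and store them in FAISS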

📃 More input file formats: Apache Tika File Converter (#314)

Thanks to @dany-nonstop, you can now extract text from many file formats (docx, pptx, html, epub, odf ...) via Apache Tika.

Usage:

  1. Start the Apache Tika server:
docker run -d -p 9998:9998 apache/tika
  2. Do the conversion in Haystack:
tika_converter = TikaConverter(
    tika_url="http://localhost:9998/tika",
    remove_numeric_tables=False,
    remove_whitespace=False,
    remove_empty_lines=False,
    remove_header_footer=False,
    valid_languages=None,
)
>>> doc = tika_converter.convert(file_path=Path("test/samples/pdf/sample_pdf_1.pdf"))
>>> doc
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
}
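The returned dict can then be indexed like any other document. A short illustrative follow-up; document_store stands for whichever store you use elsewhere and is not part of the converter API:

# Illustrative follow-up: index the converted text together with some custom meta data
document_store.write_documents([{"text": doc["text"],
                                 "meta": {"name": "sample_pdf_1.pdf"}}])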

Breaking changes

Restructuring / Renaming of modules (Breaking changes!) (#379)

We've restructured the package to make the usage more intuitive and terminology more consistent.

  1. Rename the database module -> document_store
  2. Split the indexing module into -> file_converter and preprocessor
  3. Move the Document, Label and MultiLabel classes into -> schema and simplify the import to from haystack import Document, Label, MultiLabel (see the import sketch below)
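A before/after of the imports. The old path is reconstructed from the module names above, and the new submodule names beyond the simplified from haystack import are assumptions; verify them against your install.

# Old (pre-0.4.0) layout, per the module names above -- for comparison only
# from haystack.database.elasticsearch import ElasticsearchDocumentStore

# New (0.4.0) layout -- exact submodule names are assumptions
from haystack import Document, Label, MultiLabel
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.file_converter.tika import TikaConverter
from haystack.preprocessor.utils import convert_files_to_dicts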

File converters (#393)

We refactored major parts of the file converters. They no longer return individual pages, but instead insert page break symbols (\f) into the text that can be accessed further down the pipeline.

Old:

>>> pages, meta = FileConverter.extract_pages(file_path=Path("..."))

New:

>>> doc = FileConverter.convert(file_path="...", meta={"name": "some_name", "category": "news"})
>>> doc
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"name": "..."}
}
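If you still need per-page access downstream, you can split on the page break symbol yourself. A small illustrative sketch, not part of the Haystack API:

# Illustrative only: recover pages from the page break symbols in the converted text
pages = doc["text"].split("\f")
for page_no, page_text in enumerate(pages, start=1):
    print(page_no, page_text.strip()[:60])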

DensePassageRetriever (#308)

We refactored the DensePassageRetriever from Facebook's original DPR code base to the transformers code base and now load the models from the Hugging Face model hub.
The signature has therefore changed to:

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  remove_sep_tok_from_untitled_passages=True)
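Building on that, a short usage sketch. retrieve is the usual query call in this release, but treat the exact signature and the printed Document attributes as assumptions:

# Hedged sketch: query with the refactored retriever
# (run document_store.update_embeddings(retriever) first so FAISS/ES holds the DPR vectors)
results = retriever.retrieve(query="What is dense passage retrieval?", top_k=5)
for doc in results:
    print(doc.text[:80], doc.meta)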

Deprecate Tags for Document Stores (#286)

We removed the "tags" field that could previously be associated with Documents and used for filtering your search.
Instead, we now use the more general concept of "meta", where you can supply any custom fields and filter on them at runtime.

Old:

dict = {"text": "some", "tags": ["category1", "category2"]}

New:

doc = {"text": "some", "meta": {"category": ["1", "2"]}}

Details

Document Stores

  • Add FAISS Document Store #253
  • Fix type casting for vectors in FAISS #399
  • Fix duplicate vector ids in FAISS #395
  • Fix document filtering in SQLDocumentStore #396
  • Move retriever probability calculations to document_store #389
  • Add FAISS query scores #368
  • Raise Exception if filters used for FAISSDocumentStore query #338
  • Add refresh_type arg to ElasticsearchDocumentStore #326
  • Improve speed for SQLDocumentStore #330
  • Fix indexing of metadata for FAISS/SQL Document Store #310
  • Ensure exact match when filtering by meta in Elasticsearch #311
  • Deprecate Tags for Document Stores #286
  • Add option to update existing documents when indexing #285
  • Cast document_ids as strings #284
  • Add method to update meta fields for documents in Elasticsearch #242
  • Custom mapping write doc fix #297

Retriever

  • DPR (Dense Retriever) for InMemoryDocumentStore #316 #332
  • Refactor DPR from FB to Transformers codebase #308
  • Restructure update embeddings #304
  • Added title during DPR passage embedding && ElasticsearchDocumentStore #298
  • Add eval for Dense Passage Retriever & Refactor handling of labels/feedback #243
  • Fix type of query_emb in DPR.retrieve() #247
  • Fix return type of EmbeddingRetriever to numpy array #245

Reader

  • More robust Reader eval by limiting max answers and creating no answer labels #331
  • Aggregate multiple no answers in MultiLabel #324
  • Add "no answer" aggregation to Transformersreader #259
  • Align TransformersReader with FARMReader #319
  • Datasilo use all cores for preprocessing #303
  • Batch prediction in evaluation #137
  • Aggregate label objects for same questions #292
  • Add num_processes to reader.train() to configure multiprocessing #271
  • Added support for unanswerable questions in TransformersReader #258

Preprocessing

  • Refactor file converter interface #393
  • Add Tika Converter #314

Finder

  • Add index arg to Finder.get_answers() and _via_similar_questions() #362

Documentation

  • Create documentation website #272
  • Use port 8000 in documentation #357
  • Documentation #343
  • Convert Documentation to markdown #386
  • Add logo to readme #384
  • Refactor the DPR tutorial to use FAISS #317
  • Make Tutorials Work on Colab GPUs #322

Other

  • Exclude embedding fields from the REST API #390
  • Fix test suite dependency issue on MacOS #374
  • Add Gunicorn timeout #364
  • Bump FARM version to 0.4.7 #340
  • Add Tests for MultiLabel #318
  • Modified search endpoints logs to dump json #290
  • Add export answers to CSV function #266

Big thanks to all contributors ♥️

@antoniolanza1996, @dany-nonstop, @philipp-bode, @lalitpagaria, @PiffPaffM, @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic, @maxupp, @kolk, @venuraja79, @karimjp
