⭐ Highlights

New Slack Channel

Because many people in the community asked for it, we opened a Slack channel!
Join us to ask questions, show what you've built with Haystack, and exchange ideas with like-minded folks!

👉 https://haystack.deepset.ai/community/join

Optimizing Memory + CPU consumption of document stores for large datasets (#733)

Working with large datasets can exhaust local memory. Therefore, we ...

... add batch_size parameters to most methods of the document store, so that only a smaller chunk of documents is loaded at a time
... add a get_all_documents_generator() method that "streams" documents one by one from your document store.
Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
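To illustrate the idea behind batching and streaming, here is a minimal sketch in plain Python. It is not Haystack's implementation: ToyStore, fetch() and stream_documents() are hypothetical names made up for this example, and the list inside ToyStore only stands in for a real backend that would be queried page by page.

```python
from typing import Dict, Iterator, List


class ToyStore:
    """Hypothetical stand-in for a document store backend (not a Haystack class)."""

    def __init__(self, n_docs: int) -> None:
        self._docs = [{"id": i, "text": f"doc {i}"} for i in range(n_docs)]

    def fetch(self, offset: int, limit: int) -> List[Dict]:
        # Simulates a paginated query, e.g. LIMIT/OFFSET in SQL.
        return self._docs[offset:offset + limit]


def stream_documents(store: ToyStore, batch_size: int = 2) -> Iterator[Dict]:
    """Yield documents one by one, fetching at most batch_size of them per query."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        batch = store.fetch(offset, batch_size)
        if not batch:
            return
        yield from batch
        offset += len(batch)


store = ToyStore(n_docs=5)
streamed_ids = [doc["id"] for doc in stream_documents(store, batch_size=2)]
print(streamed_ids)
```

The caller only ever sees one document at a time, while the generator keeps a single batch in memory, which is the trade-off the batch_size parameters are meant to expose.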

Add Simple Demo UI (#671)

Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...

Support for summarization models (#698)

Thanks to another community contribution from @lalitpagaria, we now also support summarization models like PEGASUS in Haystack. You can use them ...

... standalone:

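from haystack.schema import Document
from haystack.summarizer.transformers import TransformersSummarizer

# Adjacent string literals are concatenated, so the text stays one valid Python string
docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                      "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                      "the shutoffs which were expected to last through at least midday tomorrow.")]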

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)

... as a node in your pipeline:

...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])

... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:

...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run()

We see many interesting search use cases for summarization, for example running semantic document search and displaying a summary of each doc as a "preview" in the results.

New Tutorials

Wondering how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
Proper preprocessing (cleaning, splitting, etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.

⚠️ Breaking Changes

Dropping `index_buffer_size` from FAISSDocumentStore

We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().

Renaming of PreProcessor arg

Old:

PreProcessor(..., split_stride=5)

New:

PreProcessor(..., split_overlap=5)
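For readers unsure what an overlap between splits means in practice, the following is a rough sketch in plain Python. It assumes that the overlap counts units (here: words) shared between consecutive chunks; split_with_overlap() is a hypothetical helper written for this example and not the PreProcessor code itself.

```python
from typing import List


def split_with_overlap(words: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Split words into chunks of split_length that share split_overlap words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    step = split_length - split_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


words = "a b c d e f g h".split()
chunks = split_with_overlap(words, split_length=4, split_overlap=2)
print(chunks)
# [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g', 'h']]
```

Each chunk repeats the last two words of the previous one, so content that straddles a chunk boundary still appears whole in at least one chunk.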

🤓 Detailed Changes

Preprocessing / File Conversion

Using PreProcessor functions on eval data #751

DocumentStore

Support filters for DensePassageRetriever + InMemoryDocumentStore #754
Use Path class in add_eval_data of haystack.document_store.base.py #745
Make batchwise adding of evaluation data possible #717
Change signature and docstring for ca_certs parameter #730
Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
Fix SQLite errors in tests #723
Add support for custom embedding field for InMemoryDocumentStore #640
Using Columns names instead of ORM to get all documents #620

Other

Generate docstrings and deploy to branches to Staging (Website) #731
Script for releasing docs #736
Increase FARM to Version 0.6.2 #755
Reduce memory consumption of fetch_archive_from_http #737
Add links to more resources #746
Fix Tutorial 9 #734
Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
Add ID to Label schema #727
Automate docstring and tutorial generation with every push to master #718
Pass custom label index name to REST API #724
Correcting pypi download badge #722
Fix GPU docker build #703
Remove sourcerer.io widget #702
Haystack logo is not visible on github mobile app #697
Update pipeline documentation and readme #693
Enable GPU args in tutorials #692
Add docs v0.6.0 #689

Big thanks to all contributors ❤️ !

@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch

deepset-ai/haystack v0.7.0 on GitHub