github deepset-ai/haystack v0.7.0

3 years ago

⭐ Highlights

New Slack Channel

Many people in the community asked for it, so we decided to open a Slack channel!
Join us to ask questions, show what you've built with Haystack, and exchange ideas with like-minded folks!

👉 https://haystack.deepset.ai/community/join

Optimizing Memory + CPU consumption of document stores for large datasets (#733)

Working with large datasets can exhaust local memory. Therefore, we ...

  1. ... add batch_size parameters to most methods of the document store, so that only smaller chunks of documents are loaded at a time
  2. ... add a get_all_documents_generator() method that "streams" documents one by one from your document store.

Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
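The idea behind both changes can be illustrated with a minimal, library-independent sketch. ToyDocumentStore and batched() below are hypothetical stand-ins written for this example, not Haystack classes or functions; only the method name get_all_documents_generator() is borrowed from the release notes. The sketch shows how a generator combined with a batch size keeps at most one batch of documents materialized at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


class ToyDocumentStore:
    """Hypothetical in-memory stand-in for illustration; not a Haystack class."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    def get_all_documents_generator(self) -> Iterator[Dict[str, str]]:
        # Yield one document at a time instead of building a full list first.
        for i, text in enumerate(self._texts):
            yield {"id": str(i), "text": text}


def batched(items: Iterable[Dict[str, str]], batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Group a stream of documents into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


store = ToyDocumentStore([f"doc {i}" for i in range(10)])

# Only one batch of document dicts exists at any time while iterating.
batch_sizes = [len(b) for b in batched(store.get_all_documents_generator(), batch_size=4)]
print(batch_sizes)
```

In a real setup, each batch would be passed to the expensive step (for example, computing embeddings) before the next batch is fetched.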

Add Simple Demo UI (#671)

Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...

Support for summarization models (#698)

Thanks to another community contribution from @lalitpagaria, we now also support summarization models like PEGASUS in Haystack. You can use them ...

... standalone:

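# Import paths as of Haystack 0.7.0
from haystack import Document
from haystack.summarizer import TransformersSummarizer

# Adjacent string literals are concatenated, which keeps the multi-line text valid Python
docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                      "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                      "the shutoffs which were expected to last through at least midday tomorrow.")]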

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)

... as a node in your pipeline:

...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])

... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:

...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run()

We see many interesting search use cases for summarization. For example, you can run semantic document search and display a summary of each doc as a "preview" in the results.
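The "preview" use case can be sketched without any model or document store. Everything below is a hypothetical stand-in for illustration: toy_retrieve ranks docs by shared query words, and toy_summarize keeps the first sentence where a real setup would call a summarization model such as PEGASUS. None of these functions are part of Haystack; the sketch only shows the retrieve-then-summarize order.

```python
from typing import Dict, List


def toy_retrieve(query: str, docs: List[str], top_k: int = 2) -> List[str]:
    """Hypothetical retriever: rank docs by the number of query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def toy_summarize(doc: str) -> str:
    """Hypothetical summarizer: keep only the first sentence of the doc."""
    return doc.split(". ")[0].rstrip(".") + "."


def search_with_previews(query: str, docs: List[str], top_k: int = 2) -> List[Dict[str, str]]:
    # Retrieve first, then summarize only the retrieved docs.
    return [{"preview": toy_summarize(d), "text": d} for d in toy_retrieve(query, docs, top_k)]


docs = [
    "Wildfires are more likely in dry conditions. High winds spread them quickly.",
    "The bakery opens at seven. It sells bread and cakes.",
    "PG&E scheduled blackouts to reduce wildfire risk. Shutoffs last until midday.",
]
results = search_with_previews("dry conditions and high winds", docs, top_k=1)
print(results[0]["preview"])
```

Summarizing only the retrieved docs, rather than the whole corpus, keeps the expensive summarization step proportional to the number of results shown.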

New Tutorials

  1. Wondering how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
  2. Proper preprocessing (cleaning, splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.

⚠️ Breaking Changes

Dropping index_buffer_size from FAISSDocumentStore

We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().

Renaming of PreProcessor arg

Old:

PreProcessor(..., split_stride=5)

New:

PreProcessor(..., split_overlap=5)
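To make the meaning of the renamed argument concrete, here is a minimal sketch of word-based splitting with an overlap. split_words is a hypothetical helper written for this example, and the assumption that the overlap counts shared words between consecutive chunks is ours; it is not Haystack's PreProcessor implementation, which may differ in details.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into chunks of split_length words.

    Consecutive chunks share split_overlap words.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_words("a b c d e f g h", split_length=4, split_overlap=2)
print(chunks)
```

With split_length=4 and split_overlap=2, each chunk repeats the last two words of the previous one, so content near a chunk boundary still appears with some context on both sides.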

🤓 Detailed Changes

Preprocessing / File Conversion

  • Using PreProcessor functions on eval data #751

DocumentStore

  • Support filters for DensePassageRetriever + InMemoryDocumentStore #754
  • use Path class in add_eval_data of haystack.document_store.base.py #745
  • Make batchwise adding of evaluation data possible #717
  • Change signature and docstring for ca_certs parameter #730
  • Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
  • Fix SQLite errors in tests #723
  • Add support for custom embedding field for InMemoryDocumentStore #640
  • Using Columns names instead of ORM to get all documents #620

Other

  • Generate docstrings and deploy to branches to Staging (Website) #731
  • Script for releasing docs #736
  • Increase FARM to Version 0.6.2 #755
  • Reduce memory consumption of fetch_archive_from_http #737
  • Add links to more resources #746
  • Fix Tutorial 9 #734
  • Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
  • Add ID to Label schema #727
  • Automate docstring and tutorial generation with every push to master #718
  • Pass custom label index name to REST API #724
  • Correcting pypi download badge #722
  • Fix GPU docker build #703
  • Remove sourcerer.io widget #702
  • Haystack logo is not visible on github mobile app #697
  • Update pipeline documentation and readme #693
  • Enable GPU args in tutorials #692
  • Add docs v0.6.0 #689

Big thanks to all contributors ❤️ !

@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch
