github deepset-ai/haystack v1.1.0


⭐ Highlights

Model Distillation for Reader Models

With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.

To distil your own model, just follow these steps:

  1. Call python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20 where augment_squad.py is our data augmentation script.
  2. Run student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json") where student is a small model and teacher is a highly accurate, larger reader model.
  3. Run student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json") with the same teacher and student.

For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
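The augmentation step in (1) multiplies each training sample before distillation. A minimal stdlib sketch of that idea (a toy illustration, not the actual `augment_squad.py` script, which generates varied samples via model-based word replacement rather than verbatim copies):

```python
import copy

def augment_squad(squad: dict, multiplication_factor: int) -> dict:
    """Toy augmentation: duplicate each paragraph `multiplication_factor`
    times. The real augment_squad.py produces varied copies by replacing
    words with model-predicted alternatives instead of copying verbatim."""
    augmented = {"version": squad.get("version", "v2.0"), "data": []}
    for article in squad["data"]:
        new_article = {"title": article["title"], "paragraphs": []}
        for paragraph in article["paragraphs"]:
            for _ in range(multiplication_factor):
                new_article["paragraphs"].append(copy.deepcopy(paragraph))
        augmented["data"].append(new_article)
    return augmented

squad = {"version": "v2.0",
         "data": [{"title": "t", "paragraphs": [{"context": "c", "qas": []}]}]}
out = augment_squad(squad, multiplication_factor=20)
print(len(out["data"][0]["paragraphs"]))  # 20
```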

Integrated vs. Isolated Pipeline Evaluation Modes

When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation mode (default) shows the result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows the maximum result quality a node could achieve if it received perfect input from the preceding node. This way, you can find out whether the retriever or the reader in an ExtractiveQAPipeline is the bottleneck.

eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)
================== Evaluation Report ==================
=======================================================
                      Query
                        |
                      Retriever
                        |
                        | recall_single_hit:   ...
                        |
                      Reader
                        |
                        | f1 upper bound:   0.78
                        | f1:   0.65
                        |
                      Output

As the reader's upper-bound F1-score differs considerably from its actual F1-score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need and their upper bounds for each individual node. The guide explains the two evaluation modes in detail.
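The difference between the two modes can be sketched in plain Python (a toy illustration, not the Haystack eval API): integrated mode scores the reader on whatever the retriever actually returned, while isolated mode scores it on the gold document, which yields the upper bound.

```python
def reader_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1, as in SQuAD-style evaluation."""
    pred_tokens, gold_tokens = predicted.split(), gold.split()
    common = set(pred_tokens) & set(gold_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def toy_reader(document: str) -> str:
    # Stand-in reader: "answers" with the first two words of its input.
    return " ".join(document.split()[:2])

gold_doc, gold_answer = "Paris is the capital of France", "Paris is"
retrieved_doc = "France borders Spain"  # imperfect retriever output

# Integrated: reader sees the retriever's (imperfect) output.
integrated_f1 = reader_f1(toy_reader(retrieved_doc), gold_answer)
# Isolated: reader sees the perfect input -> its upper bound.
isolated_f1 = reader_f1(toy_reader(gold_doc), gold_answer)
print(integrated_f1, isolated_f1)  # 0.0 1.0
```

A large gap between the two numbers points at the upstream node, just as in the report above.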

Row-Column-Intersection Model for TableQA

Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:

reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                   column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")

The RCIReader requires two separate models: one for rows and one for columns. Scoring each row and each column separately allows it to handle much larger tables, and unlike the TableReader it returns meaningful confidence scores.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
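The intersection idea itself is simple to sketch (toy code, not the RCIReader internals): each row and each column receives a score from its own model, and the answer is the cell where the best row and best column meet.

```python
def rci_answer(table, row_scores, col_scores):
    """Pick the cell at the intersection of the highest-scoring row and
    column. Confidence is modeled here as the sum of the two scores; the
    separate per-row/per-column scoring is what lets this approach scale
    to tables far larger than a single sequence model could encode."""
    best_row = max(range(len(row_scores)), key=row_scores.__getitem__)
    best_col = max(range(len(col_scores)), key=col_scores.__getitem__)
    confidence = row_scores[best_row] + col_scores[best_col]
    return table[best_row][best_col], confidence

table = [["Berlin", "3.6M"],
         ["Munich", "1.5M"]]
answer, conf = rci_answer(table, row_scores=[0.1, 0.9], col_scores=[0.8, 0.2])
print(answer)  # Munich
```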

Advanced File Converters

Given a file (PDF or DOCX), Haystack now offers two file converters to extract text and tables:
the ParsrConverter, based on the open-source Parsr tool by axa-group and introduced in this release, and the AzureConverter, which we have improved. Both return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary with the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).

converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")
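The shape of the returned list can be sketched as follows (illustrative values; the field names are assumptions following Haystack's document format, not verbatim converter output):

```python
# Assumed shape of the converter output: one dict per detected table
# plus one dict for the running text of the file.
docs = [
    {
        "content": [["Country", "Capital"],   # table as a list of rows
                    ["France", "Paris"]],
        "content_type": "table",
        "meta": {},
    },
    {
        "content": "The full text of the file ...",
        "content_type": "text",
        "meta": {},
    },
]

# Table documents can be filtered out and fed straight into TableQA.
tables = [d for d in docs if d["content_type"] == "table"]
print(len(tables))  # 1
```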

⚠️ Breaking Changes

  • Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
  • Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL by @ZanSara in #1990

🤓 Detailed Changes

Pipeline

Models

DocumentStores

REST API

  • Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
  • Bump version in REST api by @tholor in #1875

UI / Demo

Documentation

Other Changes

New Contributors

❤️ Big thanks to all contributors and the whole community!
