⭐ Highlights
Model Distillation for Reader Models
With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.
To distil your own model, just follow these steps:
- Call `python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20`, where `augment_squad.py` is our data augmentation script.
- Run `student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json")`, where `student` is a small model and `teacher` is a highly accurate, larger reader model.
- Run `student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json")` with the same teacher and student.
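The prediction-layer step trains the student on the teacher's soft output distribution rather than on hard labels alone. The actual training loop lives inside `distil_prediction_layer_from`; the snippet below is only a minimal pure-Python sketch of the underlying soft-target loss (temperature-scaled softmax plus cross-entropy against the teacher's distribution), with made-up logits for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher temperature yields softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# A student whose logits track the teacher's incurs a lower loss
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.2, 3.0, 1.0]
assert distillation_loss(close_student, teacher_logits) < distillation_loss(far_student, teacher_logits)
```

The temperature is a tunable hyperparameter: raising it exposes more of the teacher's ranking over non-argmax classes, which is what gives distillation its edge over training on hard labels.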
For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
Integrated vs. Isolated Pipeline Evaluation Modes
When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows the result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows the maximum result quality a node could achieve if it received perfect input from the preceding node. This way, you can find out whether the retriever or the reader is the bottleneck in an `ExtractiveQAPipeline`.
```python
eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)
```
```
================== Evaluation Report ==================
=======================================================
                      Query
                        |
                    Retriever
                        |
                        | recall_single_hit: ...
                        |
                      Reader
                        |
                        | f1 upper bound: 0.78
                        | f1: 0.65
                        |
                      Output
```
As the upper bound F1-score of the reader differs a lot from its actual F1-score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need, including the upper bounds of each individual node. The guide explains the two evaluation modes in detail.
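The `f1` values in the report are token-level F1 scores between predicted and gold answer spans. As a rough illustration of how that metric behaves (this is a common SQuAD-style formulation, not Haystack's exact evaluation code), it can be sketched as:

```python
def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span,
    as commonly used for extractive QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    # Count overlapping tokens, respecting multiplicity
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8: partial credit for extra token
```

A reader's isolated ("upper bound") F1 is this same metric computed as if the node had received perfect input from its predecessor, which is why the gap between the two numbers points at the retriever.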
Row-Column-Intersection Model for TableQA
Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:
```python
reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                   column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")
```
The RCIReader requires two separate models: one for rows and one for columns. Scoring each column and row separately allows it to be used on much larger tables. Unlike the `TableReader`, it is also able to return meaningful confidence scores.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
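In the Row-Column-Intersection approach, one model scores every row and the other scores every column; the answer is the cell where the best row and best column intersect. The scoring itself is done by the two ALBERT models above; the sketch below only illustrates the final intersection step, with made-up scores, and combining the scores by multiplication is a simplifying assumption for illustration:

```python
def rci_answer(row_scores, column_scores):
    """Pick the cell at the intersection of the highest-scoring row
    and the highest-scoring column; combine both scores as a confidence."""
    best_row = max(range(len(row_scores)), key=lambda i: row_scores[i])
    best_col = max(range(len(column_scores)), key=lambda j: column_scores[j])
    confidence = row_scores[best_row] * column_scores[best_col]
    return best_row, best_col, confidence

# Hypothetical per-row and per-column relevance scores for a 3x2 table
row, col, conf = rci_answer([0.1, 0.8, 0.3], [0.7, 0.2])
print(row, col)  # 1 0 -> the answer cell is at row 1, column 0
```

Because each model only ever sees one row or one column at a time, the table never has to fit into a single model input, which is what makes the approach scale to much larger tables. It also explains the noted limitation: picking a single intersection cell cannot express aggregations over multiple cells.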
Advanced File Converters
Given a file (PDF or DOCX), there are now two file converters in Haystack to extract text and tables: the `ParsrConverter`, based on the open-source Parsr tool by axa-group and introduced into Haystack in this release, and the `AzureConverter`, which we improved. Both of them return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).
```python
converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")
```
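Because the converter output separates text from tables, you can route the table dictionaries straight into a TableQA pipeline. The snippet below sketches that split on a hand-written stand-in for the converter's output; the `content`/`content_type` keys and the sample values here are assumptions for illustration, not the converter's exact schema:

```python
# Hypothetical converter output: one dict for the page text, one per detected table
docs = [
    {"content": "Page text ...", "content_type": "text"},
    {"content": [["col_a", "col_b"], ["1", "2"]], "content_type": "table"},
]

# Split tables from plain text before routing tables to a TableQA reader
tables = [d for d in docs if d.get("content_type") == "table"]
texts = [d for d in docs if d.get("content_type") == "text"]

print(len(tables), len(texts))  # 1 1
```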
⚠️ Breaking Changes
- Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
- Implement proper FK in `MetaDocumentORM` and `MetaLabelORM` to work on PostgreSQL by @ZanSara in #1990
🤓 Detailed Changes
Pipeline
- Extend TranslationWrapper to work with QA Generation by @julian-risch in #1905
- Add nDCG to `pipeline.eval()`'s document metrics by @tstadel in #2008
- Change column order for evaluation dataframe by @ju-gu in #1957
- Add isolated node eval mode in pipeline eval by @julian-risch in #1962
- introduce node_input param by @tstadel in #1854
- Add ParsrConverter by @bogdankostic in #1931
- Add improvements to AzureConverter by @bogdankostic in #1896
Models
- Prevent wrapping DataParallel in second DataParallel by @bogdankostic in #1855
- Enable batch mode for SAS cross encoders by @tstadel in #1987
- Add RCIReader for TableQA by @bogdankostic in #1909
- distinguish intermediate layer & prediction layer distillation phases with different parameters by @MichelBartels in #2001
- Add TinyBERT data augmentation by @MichelBartels in #1923
- Adding distillation loss functions from TinyBERT by @MichelBartels in #1879
DocumentStores
- Raise exception if Elasticsearch search_fields have wrong datatype by @tstadel in #1913
- Support custom headers per request in pipeline by @tstadel in #1861
- Fix retrieving documents in `WeaviateDocumentStore` with `content_type=None` by @bogdankostic in #1938
- Fix Numba TypingError in `normalize_embedding` for cosine similarity by @bogdankostic in #1933
- Fix loading a saved `FAISSDocumentStore` by @bogdankostic in #1937
- Propagate duplicate_documents to base class initialization by @yorickvanzweeden in #1936
- Fix vector_id collision in FAISS by @yorickvanzweeden in #1961
- Unify vector_dim and embedding_dim parameter in Document Store by @mathew55 in #1922
- Align similarity scores across document stores by @MichelBartels in #1967
- Bugfix - save_to_yaml for OpenSearchDocumentStore by @ArzelaAscoIi in #2017
- Fix elasticsearch scores if they are 0.0 by @tstadel in #1980
REST API
- Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
- Bump version in REST api by @tholor in #1875
UI / Demo
- Replace SessionState with Streamlit built-in by @yorickvanzweeden in #2006
- Fix demo deployment by @askainet in #1877
- Add models to demo docker image by @ZanSara in #1978
Documentation
- Update pydoc-markdown-file-classifier.yml by @brandenchan in #1856
- Create v1.0 docs by @brandenchan in #1862
- Fix typo by @amotl in #1869
- Correct bug with encoding when generating Markdown documentation issue #1880 by @albertovilla in #1881
- Minor typo by @javier in #1900
- Fixed the grammatical issue in optimization guides #1940 by @eldhoittangeorge in #1941
- update link to annotation tool docu by @julian-risch in #2005
- Extend Tutorial 5 with Upper Bound Reader Eval Metrics by @julian-risch in #1995
- Add distillation to finetuning tutorial by @MichelBartels in #2025
- Add ndcg and eval_mode to docs by @tstadel in #2038
- Remove hard-coded variables from the Tutorial 15 by @dmigo in #1984
Other Changes
- upgrade transformers to 4.13.0 by @julian-risch in #1659
- Fix typo in the Windows CI UI deps by @ZanSara in #1876
- Exchanged minimal with minimum in print_answers function call by @albertovilla in #1890
- Improved version of print_answers by @albertovilla in #1891
- Fix Windows CI by @bogdankostic in #1899
- Changed export to csv method to new answer format by @Johnny-KP in #1907
- Unpin ray version by @dmigo in #1906
- Fix Windows CI OOM by @tstadel in #1878
- Text for contributor license agreement by @PiffPaffM in #1766
- Fix issue #1925 - UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow by @AlonEirew in #1926
- Update Ray to version 1.9.1 by @bogdankostic in #1934
- Fix #1927 - RuntimeError when loading data using data_silo due to many open file descriptors from multiprocessing by @AlonEirew in #1928
- Add GitHub Action for Docker Build for GPU by @oryx1729 in #1916
- Use Commit ID for Docker tags by @oryx1729 in #1946
- Upgrade torch version by @oryx1729 in #1960
- check multiprocessing sharing strategy is available by @julian-risch in #1965
- fix UserWarning from slow tensor conversion by @mathislucka in #1948
- Fix Dockerfile-GPU by @oryx1729 in #1969
- Use scikit-learn, not sklearn, in requirements.txt by @benjamin-klara in #1974
- Upgrade pillow version to 9.0.0 by @mapapa in #1992
- Disable pip cache for Dockerfiles by @oryx1729 in #2015
New Contributors
- @amotl made their first contribution in #1869
- @fabiolab made their first contribution in #1871
- @albertovilla made their first contribution in #1881
- @javier made their first contribution in #1900
- @Johnny-KP made their first contribution in #1907
- @dmigo made their first contribution in #1906
- @eldhoittangeorge made their first contribution in #1941
- @yorickvanzweeden made their first contribution in #1936
- @benjamin-klara made their first contribution in #1974
- @mathew55 made their first contribution in #1922
- @mapapa made their first contribution in #1992
❤️ Big thanks to all contributors and the whole community!