⭐ Highlights
Model Distillation for Reader Models
With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.
To distil your own model, just follow these steps:
- Call `python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20`, where `augment_squad.py` is our data augmentation script.
- Run `student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json")`, where `student` is a small model and `teacher` is a highly accurate, larger reader model.
- Run `student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json")` with the same teacher and student.
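The prediction-layer step trains the student on the teacher's soft output distribution rather than on hard labels alone. The actual training loop lives inside `distil_prediction_layer_from`; the snippet below is only a minimal pure-Python sketch of the underlying soft-target loss (temperature-scaled softmax plus cross-entropy against the teacher's distribution), with made-up logits for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher temperature yields softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# A student whose logits track the teacher's incurs a lower loss
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.2, 3.0, 1.0]
assert distillation_loss(close_student, teacher_logits) < distillation_loss(far_student, teacher_logits)
```

The temperature is a tunable hyperparameter: raising it exposes more of the teacher's ranking over non-argmax classes, which is what gives distillation its edge over training on hard labels.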
For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
Integrated vs. Isolated Pipeline Evaluation Modes
When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows the result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows the maximum result quality a node could achieve if it received perfect input from the preceding node. This way, you can find out whether the retriever or the reader is the bottleneck in an `ExtractiveQAPipeline`.
```python
eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)
```
```
================== Evaluation Report ==================
=======================================================
                      Query
                        |
                    Retriever
                        |
                        | recall_single_hit: ...
                        |
                      Reader
                        |
                        | f1 upper bound: 0.78
                        | f1: 0.65
                        |
                      Output
```
As the upper bound F1-score of the reader differs a lot from its actual F1-score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need, including the upper bounds of each individual node. The guide explains the two evaluation modes in detail.
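The `f1` values in the report are token-level F1 scores between predicted and gold answer spans. As a rough illustration of how that metric behaves (this is a common SQuAD-style formulation, not Haystack's exact evaluation code), it can be sketched as:

```python
def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span,
    as commonly used for extractive QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    # Count overlapping tokens, respecting multiplicity
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8: partial credit for extra token
```

A reader's isolated ("upper bound") F1 is this same metric computed as if the node had received perfect input from its predecessor, which is why the gap between the two numbers points at the retriever.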
Row-Column-Intersection Model for TableQA
Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:
```python
reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                   column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")
```
The RCIReader requires two separate models: one for rows and one for columns. Scoring each column and row separately allows it to be used on much larger tables. Unlike the `TableReader`, it is also able to return meaningful confidence scores.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
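In the Row-Column-Intersection approach, one model scores every row and the other scores every column; the answer is the cell where the best row and best column intersect. The scoring itself is done by the two ALBERT models above; the sketch below only illustrates the final intersection step, with made-up scores, and combining the scores by multiplication is a simplifying assumption for illustration:

```python
def rci_answer(row_scores, column_scores):
    """Pick the cell at the intersection of the highest-scoring row
    and the highest-scoring column; combine both scores as a confidence."""
    best_row = max(range(len(row_scores)), key=lambda i: row_scores[i])
    best_col = max(range(len(column_scores)), key=lambda j: column_scores[j])
    confidence = row_scores[best_row] * column_scores[best_col]
    return best_row, best_col, confidence

# Hypothetical per-row and per-column relevance scores for a 3x2 table
row, col, conf = rci_answer([0.1, 0.8, 0.3], [0.7, 0.2])
print(row, col)  # 1 0 -> the answer cell is at row 1, column 0
```

Because each model only ever sees one row or one column at a time, the table never has to fit into a single model input, which is what makes the approach scale to much larger tables. It also explains the noted limitation: picking a single intersection cell cannot express aggregations over multiple cells.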
Advanced File Converters
Given a file (PDF or DOCX), there are now two file converters in Haystack to extract text and tables: the `ParsrConverter`, based on the open-source Parsr tool by axa-group and introduced into Haystack in this release, and the `AzureConverter`, which we improved. Both of them return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).
```python
converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")
```
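Because the converter output separates text from tables, you can route the table dictionaries straight into a TableQA pipeline. The snippet below sketches that split on a hand-written stand-in for the converter's output; the `content`/`content_type` keys and the sample values here are assumptions for illustration, not the converter's exact schema:

```python
# Hypothetical converter output: one dict for the page text, one per detected table
docs = [
    {"content": "Page text ...", "content_type": "text"},
    {"content": [["col_a", "col_b"], ["1", "2"]], "content_type": "table"},
]

# Split tables from plain text before routing tables to a TableQA reader
tables = [d for d in docs if d.get("content_type") == "table"]
texts = [d for d in docs if d.get("content_type") == "text"]

print(len(tables), len(texts))  # 1 1
```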
⚠️ Breaking Changes
- Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
- Implement proper FK in `MetaDocumentORM` and `MetaLabelORM` to work on PostgreSQL by @ZanSara in #1990
🤓 Detailed Changes
Pipeline
- Extend TranslationWrapper to work with QA Generation by @julian-risch in #1905
- Add nDCG to `pipeline.eval()`'s document metrics by @tstadel in #2008
- Change column order for evaluation dataframe by @ju-gu in #1957
- Add isolated node eval mode in pipeline eval by @julian-risch in #1962
- introduce node_input param by @tstadel in #1854
- Add ParsrConverter by @bogdankostic in #1931
- Add improvements to AzureConverter by @bogdankostic in #1896
Models
- Prevent wrapping DataParallel in second DataParallel by @bogdankostic in #1855
- Enable batch mode for SAS cross encoders by @tstadel in #1987
- Add RCIReader for TableQA by @bogdankostic in #1909
- distinguish intermediate layer & prediction layer distillation phases with different parameters by @MichelBartels in #2001
- Add TinyBERT data augmentation by @MichelBartels in #1923
- Adding distillation loss functions from TinyBERT by @MichelBartels in #1879
DocumentStores
- Raise exception if Elasticsearch search_fields have wrong datatype by @tstadel in #1913
- Support custom headers per request in pipeline by @tstadel in #1861
- Fix retrieving documents in `WeaviateDocumentStore` with `content_type=None` by @bogdankostic in #1938
- Fix Numba TypingError in `normalize_embedding` for cosine similarity by @bogdankostic in #1933
- Fix loading a saved `FAISSDocumentStore` by @bogdankostic in #1937
- Propagate duplicate_documents to base class initialization by @yorickvanzweeden in #1936
- Fix vector_id collision in FAISS by @yorickvanzweeden in #1961
- Unify vector_dim and embedding_dim parameter in Document Store by @mathew55 in #1922
- Align similarity scores across document stores by @MichelBartels in #1967
- Bugfix - save_to_yaml for OpenSearchDocumentStore by @ArzelaAscoIi in #2017
- Fix elasticsearch scores if they are 0.0 by @tstadel in #1980
REST API
- Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
- Bump version in REST api by @tholor in #1875
UI / Demo
- Replace SessionState with Streamlit built-in by @yorickvanzweeden in #2006
- Fix demo deployment by @askainet in #1877
- Add models to demo docker image by @ZanSara in #1978
Documentation
- Update pydoc-markdown-file-classifier.yml by @brandenchan in #1856
- Create v1.0 docs by @brandenchan in #1862
- Fix typo by @amotl in #1869
- Correct bug with encoding when generating Markdown documentation issue #1880 by @albertovilla in #1881
- Minor typo by @javier in #1900
- Fixed the grammatical issue in optimization guides #1940 by @eldhoittangeorge in #1941
- update link to annotation tool docu by @julian-risch in #2005
- Extend Tutorial 5 with Upper Bound Reader Eval Metrics by @julian-risch in #1995
- Add distillation to finetuning tutorial by @MichelBartels in #2025
- Add ndcg and eval_mode to docs by @tstadel in #2038
- Remove hard-coded variables from the Tutorial 15 by @dmigo in #1984
Other Changes
- upgrade transformers to 4.13.0 by @julian-risch in #1659
- Fix typo in the Windows CI UI deps by @ZanSara in #1876
- Exchanged minimal with minimum in print_answers function call by @albertovilla in #1890
- Improved version of print_answers by @albertovilla in #1891
- Fix Windows CI by @bogdankostic in #1899
- Changed export to csv method to new answer format by @Johnny-KP in #1907
- Unpin ray version by @dmigo in #1906
- Fix Windows CI OOM by @tstadel in #1878
- Text for contributor license agreement by @PiffPaffM in #1766
- Fix issue #1925 - UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow by @AlonEirew in #1926
- Update Ray to version 1.9.1 by @bogdankostic in #1934
- Fix #1927 - RuntimeError when loading data using data_silo due to many open file descriptors from multiprocessing by @AlonEirew in #1928
- Add GitHub Action for Docker Build for GPU by @oryx1729 in #1916
- Use Commit ID for Docker tags by @oryx1729 in #1946
- Upgrade torch version by @oryx1729 in #1960
- check multiprocessing sharing strategy is available by @julian-risch in #1965
- fix UserWarning from slow tensor conversion by @mathislucka in #1948
- Fix Dockerfile-GPU by @oryx1729 in #1969
- Use scikit-learn, not sklearn, in requirements.txt by @benjamin-klara in #1974
- Upgrade pillow version to 9.0.0 by @mapapa in #1992
- Disable pip cache for Dockerfiles by @oryx1729 in #2015
New Contributors
- @amotl made their first contribution in #1869
- @fabiolab made their first contribution in #1871
- @albertovilla made their first contribution in #1881
- @javier made their first contribution in #1900
- @Johnny-KP made their first contribution in #1907
- @dmigo made their first contribution in #1906
- @eldhoittangeorge made their first contribution in #1941
- @yorickvanzweeden made their first contribution in #1936
- @benjamin-klara made their first contribution in #1974
- @mathew55 made their first contribution in #1922
- @mapapa made their first contribution in #1992
❤️ Big thanks to all contributors and the whole community!