🎁 Haystack 1.0

We worked hard to bring you an early Christmas present: 1.0 is out! Over the last few months, we redesigned many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for the many exciting features we have planned. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us build a better Haystack. This journey has just started 🚀

⭐ Highlights

Improved Evaluation of Pipelines

Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.

In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.

eval_result = pipeline.eval(
    labels=labels,
    params={"Retriever": {"top_k": 5}},
)

The output is an EvaluationResult object which stores each Node's prediction for each sample in a Pandas DataFrame - so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. The EvaluationResult.calculate_metrics() method returns the relevant metrics for your evaluation, and you can print a convenient summary report via the new Pipeline.print_eval_report() method.

metrics = eval_result.calculate_metrics()

pipeline.print_eval_report(eval_result)
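
To make the distinction between Node level and Pipeline level evaluation more concrete, here is a minimal sketch in plain Python that does not use Haystack at all. The `recall` and `exact_match` helpers and the sample data are hypothetical and only illustrate the kind of per-node numbers an evaluation produces; the metrics that calculate_metrics() actually returns may be defined differently.

```python
from typing import Dict, List


def recall(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Fraction of the relevant documents that the retriever returned."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the predicted answer equals the gold answer (case-insensitive)."""
    return float(prediction.strip().lower() == gold.strip().lower())


# Hypothetical per-sample predictions, one dict per query.
samples: List[Dict] = [
    {"retrieved": ["d1", "d7"], "relevant": ["d1"],
     "predicted": "Eddard", "gold": "Eddard"},
    {"retrieved": ["d3", "d4"], "relevant": ["d9"],
     "predicted": "Robert", "gold": "Ned"},
]

# Node level view: average each metric over all samples.
retriever_recall = sum(
    recall(s["retrieved"], s["relevant"]) for s in samples
) / len(samples)
reader_exact_match = sum(
    exact_match(s["predicted"], s["gold"]) for s in samples
) / len(samples)

print(f"Retriever recall:   {retriever_recall:.2f}")
print(f"Reader exact match: {reader_exact_match:.2f}")
```

In this toy data both scores are 0.5, and the miss in the second sample starts at the retriever: the relevant document never reached the reader.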

If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!

Table QA

A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader our users have all the tools they need to query for relevant tables and perform Question Answering.

The TableTextRetriever is the result of our team's research into table retrieval methods which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders - one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it in for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model and, when handed Documents that contain tables, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.

retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"]
)

reader = TableReader(
    model_name_or_path="google/tapas-base-finetuned-wtq",
    max_seq_len=512
)
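
The aggregation step of the TableReader can be pictured with a small toy function in plain Python. This is only a sketch of the idea, not the TableReader's code: the `aggregate` helper is hypothetical, and the operator names NONE, SUM, AVERAGE and COUNT are assumed to match the ones used by TAPAS checkpoints fine-tuned on WTQ.

```python
from typing import List


def aggregate(cells: List[str], operator: str) -> str:
    """Combine the selected table cells into one answer string."""
    if operator == "NONE":
        # No aggregation: the selected cell(s) are the answer.
        return ", ".join(cells)
    if operator == "COUNT":
        return str(len(cells))

    # SUM and AVERAGE only make sense for numeric cells.
    numbers = [float(cell) for cell in cells]
    if operator == "SUM":
        return str(sum(numbers))
    if operator == "AVERAGE":
        return str(sum(numbers) / len(numbers))
    raise ValueError(f"Unknown aggregation operator: {operator}")


print(aggregate(["Berlin"], "NONE"))
print(aggregate(["3", "4"], "SUM"))
print(aggregate(["3", "4"], "AVERAGE"))
print(aggregate(["3", "4"], "COUNT"))
```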

Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.

Improved Debugging of Pipelines & Nodes

We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.

result = pipeline.run(
    query="Who is the father of Arya Stark?",
    params={
        "debug": True
    }
)

{'ESRetriever': {'input': {'debug': True,
                           'query': 'Who is the father of Arya Stark?',
                           'root_node': 'Query',
                           'top_k': 1},
                 'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
                            ...}

To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.
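
The idea behind the debug argument can be illustrated with a toy node in plain Python. This is a sketch of the concept, not Haystack's implementation: `run_toy_retriever` and the layout of its return value are made up for illustration, and the `_debug` key is borrowed from the naming used in the changelog below.

```python
from typing import Any, Dict, List


def run_toy_retriever(query: str, top_k: int = 1, debug: bool = False) -> Dict[str, Any]:
    """A stand-in node that optionally reports its own input and output."""
    documents: List[str] = [f"document about {query!r}"][:top_k]
    result: Dict[str, Any] = {"documents": documents}

    if debug:
        # Attach what the node received and produced, keyed by node name.
        result["_debug"] = {
            "ToyRetriever": {
                "input": {"query": query, "top_k": top_k, "debug": debug},
                "output": {"documents": documents},
            }
        }
    return result


plain = run_toy_retriever("Who is the father of Arya Stark?")
verbose = run_toy_retriever("Who is the father of Arya Stark?", debug=True)

print("_debug" in plain)
print(verbose["_debug"]["ToyRetriever"]["input"])
```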

FARM Migration

Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.

Haystack has always relied on FARM for much of its lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling package.

⚠️ Breaking Changes & Migration Guide

Migration to v1.0

With the release of v1.0, we decided to make some bold changes.
We believe this has brought a significant improvement in usability and makes the project more future-proof.
While this does come with a few breaking changes, we do our best to guide you on how to go from v0.x to v1.0.
For more details see the Migration Guide and if you need more guidance, just reach out via Slack.

New Package Structure & Changed Imports

Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack,
we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.

haystack.document_stores

All Document Stores can now be directly accessed from here
Note the pluralization of document_store to document_stores

haystack.nodes

This directory directly contains any class that can be used as a node
This includes File Converters and PreProcessors

haystack.pipelines

This contains all the base, custom and pre-made pipeline classes
Note the pluralization of pipeline to pipelines

haystack.utils

Any utility functions

➡️ For the large majority of imports, the old style still works but this will be deprecated in future releases!
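
If you want to update imports in bulk, a small rewrite helper along the following lines may be a starting point. It is a rough sketch in plain Python: only the two pluralizations (document_store to document_stores, pipeline to pipelines) come from the notes above, while the `reader` and `retriever` entries and the example v0.x paths are assumptions about the old layout that you should check against your own code.

```python
import re

# Old top-level subpackage -> new package.
PACKAGE_MAP = {
    "document_store": "haystack.document_stores",  # stated in the notes above
    "pipeline": "haystack.pipelines",              # stated in the notes above
    "reader": "haystack.nodes",                    # assumed v0.x module name
    "retriever": "haystack.nodes",                 # assumed v0.x module name
}

# Matches "from haystack.<package>[.<submodules>] import <names>".
IMPORT_RE = re.compile(r"^from haystack\.(\w+)(?:\.[\w.]+)? import (.+)$")


def rewrite_import(line: str) -> str:
    """Return the v1.0-style import for a v0.x-style line, or the line unchanged."""
    match = IMPORT_RE.match(line.strip())
    if match is None:
        return line
    package, names = match.groups()
    new_package = PACKAGE_MAP.get(package)
    if new_package is None:
        return line
    return f"from {new_package} import {names}"


# Example v0.x lines (assumed paths, for illustration only).
old_lines = [
    "from haystack.document_store.elasticsearch import ElasticsearchDocumentStore",
    "from haystack.pipeline import Pipeline",
    "import os",
]
for old in old_lines:
    print(rewrite_import(old))
```

The helper only handles single-line `from ... import ...` statements; anything else is passed through untouched so it can be reviewed by hand.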

Primitive Objects

Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using primitive classes such as Document, Answer, Label and MultiLabel.

With these, there is now support for data structures beyond text and the REST API schema is built around their structure.
Using these classes also allows for the autocompletion of fields in your IDE.

Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.

Many of the fields in these classes have also been renamed or removed.
You can see a more comprehensive list of them in this GitHub issue.
Below, we will go through a few cases that are likely to impact established workflows.

Input Document Format

This dictionary schema used to be the recommended way to prepare your data to be indexed.
Now we strongly recommend using our dedicated Document class as a replacement.
The text field has been renamed to content to accommodate cases where it is used for another data format,
for example in Table QA.

v0.x:

doc = {
    'text': 'DOCUMENT_TEXT_HERE',
    'meta': {'name': DOCUMENT_NAME, ...}
}

v1.0:

doc = Document(
    content='DOCUMENT_TEXT_HERE',
    meta={'name': DOCUMENT_NAME, ...}
)

From here, you can take the same steps to write Documents into your Document Store.

document_store.write_documents([doc])
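
If you already have a large number of v0.x style dictionaries, a small helper can rename the field before you build Document objects. The following is a minimal sketch in plain Python; `upgrade_doc_dict` is a hypothetical name, and it only handles the text to content rename described above, not any other renamed or removed fields.

```python
from typing import Any, Dict


def upgrade_doc_dict(old_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a v0.x document dict with 'text' renamed to 'content'."""
    new_doc = dict(old_doc)  # shallow copy, the input dict is left untouched
    if "text" in new_doc and "content" not in new_doc:
        new_doc["content"] = new_doc.pop("text")
    return new_doc


old_doc = {"text": "DOCUMENT_TEXT_HERE", "meta": {"name": "example.txt"}}
new_doc = upgrade_doc_dict(old_doc)
print(new_doc)

# With Haystack installed, the upgraded keys line up with the constructor
# arguments shown above, e.g. Document(content=new_doc["content"], meta=new_doc["meta"]).
```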

Response format of Reader

All Reader Nodes now return Answer objects instead of dictionaries.

v0.x:

[
    {
        'answer': 'Fang',
        'score': 13.26807975769043,
        'probability': 0.9657130837440491,
        'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
            ===Fang (Hagrid's dog)===
            *Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
    }
]

v1.0:

[
    <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'context': "s Nymeria after a legendary warrior queen. She travels...", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'King Robert', 'type': 'extractive', 'score': 0.9251320660114288, 'context': 'ordered by the Lord of Light. Melisandre later reveals to Gendry that...', 'offsets_in_document': [{'start': 1808, 'end': 1819}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.8103329539299011, 'context': " girl disguised as a boy all along and is surprised to learn she is Arya...", 'offsets_in_document': [{'start': 920, 'end': 923}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    ...
]
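
Code that used to index into answer dictionaries needs to switch to attribute access. While you migrate, a small accessor that tolerates both shapes can help. This is a sketch only: `answer_field` is a hypothetical helper, and a `SimpleNamespace` stands in for the Answer class here, on the assumption that Answer exposes the fields from the printout above as attributes.

```python
from types import SimpleNamespace
from typing import Any


def answer_field(answer: Any, name: str, default: Any = None) -> Any:
    """Read a field from a v0.x answer dict or a v1.0 style answer object."""
    if isinstance(answer, dict):
        return answer.get(name, default)
    return getattr(answer, name, default)


# v0.x style: a plain dictionary.
old_style = {"answer": "Fang", "score": 13.26807975769043,
             "probability": 0.9657130837440491}

# v1.0 style: an object with attributes (stand-in for Answer).
new_style = SimpleNamespace(answer="Eddard", type="extractive",
                            score=0.9946763813495636)

texts = [answer_field(a, "answer") for a in (old_style, new_style)]
print(texts)
print(answer_field(new_style, "probability"))  # field not present on the stand-in
```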

Label Structure

The attributes of the Label object have gone through some changes.
To see their current structure see Label.

v0.x:

label = Label(
    question=QUESTION_TEXT_HERE,
    answer=ANSWER_STRING_HERE,
    ...
)

v1.0:

label = Label(
    query=QUERY_TEXT_HERE,
    answer=Answer(...),
    ...
)

REST API Format

The response format of the /query endpoint matches that of the primitive objects, only in JSON form.
This means there are similar breaking changes to those described above for the Answer format of a Reader.
In particular, the names of the offset fields have changed and need to be aligned to the new format when coming from Haystack v0.x.
For detailed examples and guidance see the Migration Guide.
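
As an illustration of working with the new offset fields on the client side, the sketch below parses a JSON payload and uses offsets_in_context to mark the answer inside its context. The payload is hand-written for this example and its top-level layout (an "answers" list next to "query") is an assumption; only the field names answer, context and offsets_in_context are taken from the Answer printout above. The end offset is treated as exclusive, which is consistent with the spans shown there.

```python
import json
from typing import Any, Dict

# Hand-written example payload, not a recorded API response.
raw_response = """
{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Eddard",
      "context": "her father Eddard Stark",
      "offsets_in_context": [{"start": 11, "end": 17}]
    }
  ]
}
"""


def mark_answer(answer: Dict[str, Any]) -> str:
    """Wrap the answer span inside its context in square brackets."""
    span = answer["offsets_in_context"][0]
    start, end = span["start"], span["end"]
    context = answer["context"]
    return f"{context[:start]}[{context[start:end]}]{context[end:]}"


payload = json.loads(raw_response)
marked = [mark_answer(a) for a in payload["answers"]]
print(marked)
```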

Other breaking changes

Save/load of FAISSDocumentStore by @ZanSara in #1459
Add AzureConverter & change response format of FileConverter.convert() by @bogdankostic in #1813

🤓 Detailed Changes

Pipeline

Return intermediate nodes output in pipelines by @ZanSara in #1558
Add debug and debug_logs params to standard pipelines by @tholor in #1586
Pipeline node names validation by @ZanSara in #1601
Multi query eval by @tstadel in #1746
Pipelines now tolerate custom _debug content by @ZanSara in #1756
Adding yaml functionality to standard pipelines (save/load...) by @MichelBartels in #1735
Calculation of metrics and presentation of eval results by @tstadel in #1760
Fix loading and saving of EvaluationResult by @tstadel in #1831
remove queries param from pipeline.eval() by @tstadel in #1836
Deprecate old pipeline eval nodes: EvalDocuments and EvalAnswers by @tstadel in #1778

Models

Farm merging base by @Timoeller in #1422
Add inferencer for QA only by @julian-risch in #1484
Remove mentions of FARM from Ranker comments by @julian-risch in #1535
Remove NER and text classification from model conversion by @julian-risch in #1536
TransformersDocumentClassifier replacing FARMClassifier by @julian-risch in #1540
LFQA: Remove InferenceProcessor dependency by @vblagoje in #1559
Add BatchEncoding flatten by @vblagoje in #1562
Enable GPU usage for question generator by @tholor in #1571
Create EntityExtractor by @ZanSara in #1573
Add more flexible options for model downloads (Proxies, resume_download, local_files_only...) by @tholor in #1256
Add checkpointing for reader.train() to allow stopping + resuming training by @gak97 in #1554
DPR training: Rename TransformersAdamW to AdamW by @ZanSara in #1613
Add TableTextRetriever by @bogdankostic in #1529
Truncate too large tables for TableReader by @bogdankostic in #1662
ensure tf-idf matrix calculation before retrieval by @julian-risch in #1665
Add TableTextRetriever to nodes' __init__.py by @bogdankostic in #1678
Fix TableReader when model does not select any cells by @bogdankostic in #1703
Standardize initialisation of device settings by @bogdankostic in #1683
fix issue #1687 - DPR training fails on multiple GPU's by @AlonEirew in #1688
Allow TableReader models without aggregation classifier by @bogdankostic in #1772
Huggingface private model support via API tokens (FARMReader) by @ArzelaAscoIi in #1775
private hugging face models for retrievers by @ArzelaAscoIi in #1785
Model Distillation by @MichelBartels in #1758
Added max_seq_length and batch_size params to embeddingretriever by @AhmedIdr in #1817
Fix bug ranker: wrong lambda function by @gabinguo in #1824

DocumentStores

Fix bug when loading FAISS from supplied config file path by @ZanSara in #1506
Standardize delete_documents(filter=...) across all document stores by @ZanSara in #1509
Update sql.py to ignore multi thread issues. by @adithyaur99 in #1442
[fix] MySQL connection 'check_same_thread' error by @CandiceYu8 in #1585
Delete documents by ID in all document stores by @ZanSara in #1606
Fix Opensearch field type (flattened -> nested) by @tholor in #1609
Add delete_labels() except for weaviate doc store by @julian-risch in #1604
Experimental changes to support Milvus 2.x by @lalitpagaria in #1473
Fix import in Milvus2DocumentStore by @ZanSara in #1646
Allow setting of scroll param in ElasticsearchDocumentStore by @Timoeller in #1645
Rename every occurrence of 'embed_passages' with 'embed_documents' by @ZanSara in #1667
Cosine similarity for the rest of DocStores. by @fingoldo in #1569
Make weaviate more compliant to other doc stores (UUIDs and dummy embeddings) by @julian-risch in #1656
Make FAISSDocumentStore work with yaml by @tstadel in #1727
Capitalize starting letter in params by @nishanthcgit in #1750
Support Tables in all DocumentStores by @bogdankostic in #1744
Facilitate concurrent query / indexing in Elasticsearch with dense retrievers (new skip_missing_embeddings param) by @cvgoudar in #1762
Introduced an arg to add synonyms - Elasticsearch by @SjSnowball in #1625
Allow SQLDocumentStore to filter by many filters by @ZanSara in #1776

REST API

Add rest api endpoint to delete documents by filter by @ZanSara in #1546
Fix circular import in the REST API by @ZanSara in #1556
Add /documents/get_by_filters endpoint by @ZanSara in #1580
Add a restart policy on-failure to all containers by @ZanSara in #1664
Add execute permissions to file upload folder by @Timoeller in #1666
disable file upload for InMemoryDocStore by @julian-risch in #1677
Improve open api spec by @tholor in #1700
Fix usage of filters in /query endpoint in REST API by @tholor in #1774
ignore empty filters parameter by @julian-risch in #1783
Fix the REST API tests by @ZanSara in #1791

UI / Demo

Add "API is loading" message in the UI by @ZanSara in #1493
Fix answer format in ui by @tholor in #1591
Change 'ESRetriever' with 'Retriever' in the Streamlit app by @ZanSara in #1620
Public demo by @ZanSara in #1747
Small fixes to the public demo by @ZanSara in #1781
Add missing dependency to the Streamlit container by @ZanSara in #1798
Improve the Random Question functionality by @ZanSara in #1808
Add description to the demo by @ZanSara in #1809
Fix UI demo feedback by @ZanSara in #1816
Remove feedback from no-answers by @ZanSara in #1827
Demo UI add env vars & other small fixes by @ZanSara in #1828
More demo bugfixes by @ZanSara in #1832
Add backlink below the context, if available in the doc's meta by @ZanSara in #1834

Documentation

changed delete_all_documents to delete_documents in Tutorial5 by @ju-gu in #1477
Regenerate API and Tutorial md files by @brandenchan in #1480
Define SAS model in notebook by @brandenchan in #1485
Update Tutorial1_Basic_QA_Pipeline.ipynb by @julian-risch in #1489
Clarify PDF conversion, languages and encodings by @MarkusSagen in #1570
Fix Tutorials by @tholor in #1594
Update Crawler documentation by @ju-gu in #1588
add note on gpu runtime to tutorial 13 by @julian-risch in #1614
Update jobs link to personio by @julian-risch in #1611
Update jobs link in readme by @julian-risch in #1629
Bugfix Tutorial 5 parameters, adjust default split length by @Timoeller in #1635
Fix parameter names in tutorial 5 and 12 by @julian-risch in #1639
Link the logo to the website by @aantti in #1649
Replace Haystack banner for readme by @brandenchan in #1654
Update README.md by @brandenchan in #1653
fix typo in docstring of crawler by @ju-gu in #1673
Add TableQA tutorial by @bogdankostic in #1670
Add collapsing sections to readme by @brandenchan in #1663
Fix links in readme.md by @brandenchan in #1682
fixed typo by @julian-risch in #1680
Standardize similarity argument description by @brandenchan in #1684
Fix Typo in TableQA Tutorial by @Timoeller in #1690
Improve tutorials' output by @ZanSara in #1694
Tutorial for DocumentClassifier at Index Time by @tstadel in #1697
Update API Reference Pages for v1.0 by @brandenchan in #1729
Add debugging example to tutorial by @brandenchan in #1731
Fix a few details of some tutorials by @ZanSara in #1733
initialize doc store with doc and label index in tutorial 5 by @julian-risch in #1730
Fix Tutorial 11 on Google Colab by @bogdankostic in #1795
Fix link to colab notebook in tutorial 16 by @julian-risch in #1802

Other Changes

Redesign primitives by @tholor in #1398
Adding prediction head, trainer, evaluator from FARM by @julian-risch in #1419
Farm merging base bogdan by @Timoeller in #1424
Add data, add tests for qa processor, add dpr tests (some failing) by @Timoeller in #1425
Fix DPR tests + add Tokenizer tests by @bogdankostic in #1429
Farm merging base fix test by @Timoeller in #1444
Automate updates docstrings tutorials by @PiffPaffM in #1461
fixed workflow conflict with introducing new one by @PiffPaffM in #1472
feat: normalize embeddings for faiss cosine similarity by @mathislucka in #1352
Remove 'restart=always' from 'haystack-api' in both docker-compose files by @ZanSara in #1498
Add comment to tutorial notebooks about restarting runtime in colab by @bogdankostic in #1486
Feat: Download archive from url without temp file by @lalitpagaria in #1470
Add newline between paragraphs in DocxToTextConverter by @bogdankostic in #1500
Release Docs 0.10.0 by @PiffPaffM in #1460
Simplify tests & allow running on individual doc stores by @tholor in #1487
Replace FARM import statements; add dependencies by @julian-risch in #1492
Fix document_store_type flag for tests with multiple fixtures by @tholor in #1526
Remove double mentions from requirements by @bogdankostic in #1545
Format doc classifier usage example by @julian-risch in #1550
Adding TfidfRetriever to __init__.py of the retriever package by @mhamdan91 in #1575
Limit generator tests to memory doc store; split pipeline tests by @julian-risch in #1602
Add Table Reader by @bogdankostic in #1446
Switch from dataclass to pydantic dataclass & Fix Swagger API Docs by @tholor in #1598
Use smaller model for one generator test case by @julian-risch in #1622
Make EntityExtractor work when loaded from YAML by @ZanSara in #1636
Improve docker images: Add nltk download, add folder for file upload by @Timoeller in #1633
Refactoring of the haystack package by @ZanSara in #1624
Remove trailing comma in import statement by @julian-risch in #1655
Raise a warning if the 'query' param of the 'query' method of 'ElasticsearchDocumentStore' is not a string by @ZanSara in #1674
Add CI for windows runner by @lalitpagaria in #1458
rename text variable of document to content by @julian-risch in #1704
Simplify logs management by @ZanSara in #1696
Change answer aggregation key to (doc_id, query) instead of (label_id, query) by @julian-risch in #1726
Fix another self.device/s typo by @ZanSara in #1734
Fix print_answers by @ZanSara in #1743
Split pipeline tests into three suites by @ZanSara in #1755
Split summarizer tests in order to make windows CI work again by @tstadel in #1757
Update test_pipeline_extractive_qa.py by @julian-risch in #1763
Exclude test_summarizer_translation.py for windows_ci by @tstadel in #1759
Upgrade torch to v1.10.0 by @bogdankostic in #1789
Adapt docker-compose-gpu.yml to use DPR by default by @ZanSara in #1810
bugfix metadata extraction in form recognizer & split of surrounding content length by @ju-gu in #1829
Fix OOM in test_eval.py Windows CI by @tstadel in #1830
Update evaluation tutorial to cover the new pipeline.eval() by @julian-risch in #1765
Add config for github release notes by @tholor in #1840
Extend categories for release notes by @tholor in #1841

New Contributors

@mathislucka made their first contribution in #1352
@ZanSara made their first contribution in #1459
@ju-gu made their first contribution in #1477
@adithyaur99 made their first contribution in #1442
@mhamdan91 made their first contribution in #1575
@CandiceYu8 made their first contribution in #1585
@gak97 made their first contribution in #1554
@fingoldo made their first contribution in #1569
@AlonEirew made their first contribution in #1688
@tstadel made their first contribution in #1697
@nishanthcgit made their first contribution in #1750
@ArzelaAscoIi made their first contribution in #1775
@SjSnowball made their first contribution in #1625
@AhmedIdr made their first contribution in #1817
@gabinguo made their first contribution in #1824

❤️ Thanks to all contributors and the whole community!

deepset-ai/haystack v1.0.0 1.0.0 on GitHub
