⭐ Highlights

Brownfield Support of Existing Elasticsearch Indices

Do you have an existing Elasticsearch index from another project and want to try out Haystack? The newly added method es_index_to_document_store provides brownfield support for existing Elasticsearch indices: it converts each record in the given index to a Haystack Document object and writes it to the specified DocumentStore.

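# Import paths assume the Haystack 1.x package layout
from haystack.document_stores import InMemoryDocumentStore
from haystack.document_stores.utils import es_index_to_document_store

document_store = es_index_to_document_store(
    document_store=InMemoryDocumentStore(),  # or any other Haystack DocumentStore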
    original_index_name="existing_index",
    original_content_field="content",
    original_name_field="name",
    included_metadata_fields=["date_field"],
    index="new_index",
)

It can even be used on a regular basis in order to add new records of the Elasticsearch index to the DocumentStore! #2229
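
To make the conversion step more concrete, here is a minimal, self-contained sketch of how a single Elasticsearch hit could be mapped to a Document-like dictionary. It is not Haystack's implementation: the helper record_to_document, the sample hit, and the choice to reuse the Elasticsearch _id are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def record_to_document(
    hit: Dict[str, Any],
    content_field: str,
    name_field: Optional[str] = None,
    included_metadata_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Map one Elasticsearch hit to a plain dict shaped like a Document.

    Hypothetical helper for illustration only, not part of Haystack.
    """
    source = hit["_source"]
    meta: Dict[str, Any] = {}
    # Copy the name field (if requested and present) into the metadata
    if name_field is not None and name_field in source:
        meta["name"] = source[name_field]
    # Copy only the explicitly included metadata fields
    for field in included_metadata_fields or []:
        if field in source:
            meta[field] = source[field]
    # Assumption of this sketch: keep the Elasticsearch _id as the document id
    return {"id": hit["_id"], "content": source[content_field], "meta": meta}


# Made-up sample hit in the shape returned by an Elasticsearch search
hits = [
    {
        "_id": "1",
        "_source": {
            "content": "Berlin is a city.",
            "name": "berlin.txt",
            "date_field": "2022-03-01",
            "internal": "not copied",
        },
    }
]
documents = [record_to_document(h, "content", "name", ["date_field"]) for h in hits]
print(documents)
```

Fields that are neither the content field, the name field, nor listed in included_metadata_fields (here: "internal") are dropped in this sketch.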

Tapas Reader With Scores

The new model class TapasForScoredQA introduced in #1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:

from haystack.nodes import TableReader

reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512)  # or
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)
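
The two-level ordering described above (table score first, then answer span score) can be sketched in plain Python. This is only an illustration under stated assumptions: the ScoredAnswer type, its field names, the sample values, and the descending order are hypothetical and not taken from the TableReader code.

```python
from typing import List, NamedTuple


class ScoredAnswer(NamedTuple):
    # Hypothetical container; field names are assumptions for this sketch
    answer: str
    table_score: float
    span_score: float


def rank_answers(answers: List[ScoredAnswer]) -> List[ScoredAnswer]:
    # Sort by table score first; ties are broken by the span score.
    # Descending order (best first) is an assumption of this sketch.
    return sorted(answers, key=lambda a: (a.table_score, a.span_score), reverse=True)


candidates = [
    ScoredAnswer("1998", table_score=0.4, span_score=0.9),
    ScoredAnswer("Paris", table_score=0.8, span_score=0.3),
    ScoredAnswer("Lyon", table_score=0.8, span_score=0.7),
]
ranked = rank_answers(candidates)
print([a.answer for a in ranked])
```

Note that an answer with a high span score from a low-scoring table ("1998") still ranks below every answer from a higher-scoring table.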

Extended Meta Data Filtering

We extended the filter capabilities of all(*) document stores to support more complex filter expressions than before. Besides simple selections on multiple fields, you can now use comparison expressions and connect them with boolean operators. If you have used MongoDB, the new syntax should look familiar. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.

Logical operator keys take a dictionary of metadata field names and/or logical operators as their value. Metadata field names take a dictionary of comparison operators as their value. Comparison operator keys take a single value or, in the case of "$in", a list of values.

If no logical operator is provided, "$and" is used as default operation.
If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

Thanks to these defaults, the change is not breaking: you can keep using your existing filter expressions.

Example:

filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}

(*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.
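
To illustrate how the default operators expand, here is a small pure-Python evaluator for this filter syntax. It is a sketch written for these notes, not the code used inside the document stores: the functions matches and field_matches and the sample metadata are hypothetical, "$not" is left out for brevity, and dates are compared as ISO-formatted strings.

```python
from typing import Any, Dict

# Comparison operators from the filter syntax, mapped to Python predicates
COMPARISONS = {
    "$eq": lambda actual, expected: actual == expected,
    "$in": lambda actual, expected: actual in expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
}


def field_matches(actual: Any, condition: Any) -> bool:
    if isinstance(condition, dict):
        # Explicit comparison operators: all of them must hold
        return actual is not None and all(
            COMPARISONS[op](actual, expected) for op, expected in condition.items()
        )
    if isinstance(condition, list):
        # Default for list values: "$in"
        return actual in condition
    # Default for single values: "$eq"
    return actual == condition


def matches(meta: Dict[str, Any], filters: Dict[str, Any], logic: str = "$and") -> bool:
    # Default logical operator is "$and"; "$not" is not handled in this sketch
    results = []
    for key, value in filters.items():
        if key in ("$and", "$or"):
            results.append(matches(meta, value, logic=key))
        else:
            results.append(field_matches(meta.get(key), value))
    return any(results) if logic == "$or" else all(results)


explicit = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}
shorthand = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}

# Made-up metadata records
metas = [
    {"type": "article", "date": "2018-06-01", "rating": 4, "genre": "economy", "publisher": "guardian"},
    {"type": "article", "date": "2019-01-01", "rating": 5, "genre": "sports", "publisher": "nytimes"},
    {"type": "article", "date": "2021-06-01", "rating": 5, "genre": "economy", "publisher": "nytimes"},
    {"type": "blog", "date": "2018-06-01", "rating": 5, "genre": "economy", "publisher": "nytimes"},
]
explicit_results = [matches(m, explicit) for m in metas]
shorthand_results = [matches(m, shorthand) for m in metas]
print(explicit_results, shorthand_results)
```

Both filter dictionaries select the same records, which is the point of the defaults: the shorthand form is just the explicit form with "$and", "$eq", and "$in" left implicit.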

Code Style and Linting

In addition to mypy, which we already used for static type checking, we now use pylint for linting, and the Haystack code base now complies with Black formatting standards. As a result, the code is formatted consistently and is easier to read. If you would like to contribute to Haystack, you don't need to worry about this: our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. #2115 #2130

Installation with fewer dependencies

Installing Haystack has become easier and faster thanks to optional dependencies. From now on, you no longer have to install dependencies you don't need. For example, pip3 install farm-haystack installs the latest release together with only the small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a Colab notebook, you can install Haystack from the master branch together with the FAISS dependency by running: !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]. The installation guide reflects these updates and the full list of subsets of dependencies can be found here. Keep in mind, though, that this system works best with pip versions above 22. #1994

⚠️ Known Issues

Installing Haystack with all dependencies results in heavy pip backtracking that might never finish.
This is due to a dependency conflict that was introduced by a new release of one of our sub-dependencies.
To circumvent this problem, install Haystack like this:

pip install farm-haystack[all] "azure-core<1.23"

This might also be needed for other non-default dependencies (e.g. farm-haystack[dev] "azure-core<1.23").
See #2280 for more information.

⚠️ Breaking Changes

Improve dependency management by @ZanSara in #1994
Make ui and rest proper packages by @ZanSara in #2098
Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in #2140

🤓 Detailed Changes

Pipeline

Add top_k_join parameter to JoinDocuments.run by @adri1wald in #2065
✨ Add JSON Schema autogeneration for Pipeline YAML files by @tiangolo in #2020
Make FileTypeClassifier more flexible by @ZanSara in #2101
Query response without answers by @ZanSara in #2161
Generate JSON schema index for Schemastore by @ZanSara in #2225
Fix Pipeline.components by @tstadel in #2215
Join node should allow reciprocal rank fusion as additional merging method by @mathislucka in #2133
Apply filter in eval only if no gold docs are given as input by @julian-risch in #2154
pipeline.save_to_deepset_cloud() by @tstadel in #2145
Fix typo in save_to_deepset_cloud() by @tstadel in #2189
Generate code from pipeline (pipeline.to_code()) by @tstadel in #2214
Allow different filters per query in pipeline evaluation by @julian-risch in #2068
List all pipeline(_configs) on Deepset Cloud by @tstadel in #2102
Evaluating a pipeline consisting only of a reader node by @julian-risch in #2132
DC SDK - load pipeline from deepset cloud by @ArzelaAscoIi in #2013
YAML versioning by @ZanSara in #2209

Models

Add Tapas reader with scores by @bogdankostic in #1997
Fix finetuning notebook augmentation by @MichelBartels in #2071
Fix Seq2SeqGenerator return type by @tstadel in #2099
Distribute intermediate layer distillation loss calculation over multiple GPUs by @MichelBartels in #2090
Do not apply DataParallel twice by @MichelBartels in #2095

DocumentStores

Pin Milvus to <2.0.0 by @ZanSara in #2063
fix: get_documents_by_id should return docs for all passed ids by @mathislucka in #2064
Supported Highlighting in Elasticsearch by @SjSnowball in #1930
pass faiss batch_size to sqldocumentstore by @AhmedIdr in #2061
Fixed the Search Field mapping in ElasticSearch DocumentStore by @SjSnowball in #2080
Provide option to recreate es doc store on initialization by @mathislucka in #2084
Fixed performance bug. Using a list where a set is needed. by @baregawi in #2125
Extend metadata filtering support in ElasticsearchDocumentStore by @bogdankostic in #2108
OpenSearchDocumentStore: Extend similarity support by @tstadel in #2070
Speed up query_by_embedding in InMemoryDocumentStore. by @baregawi in #2091
Fix dependency management in Tutorial 6 by @ZanSara in #2148
Enable use of dot_product OpenSearch Script Scoring by @tstadel in #2168
Changed document_store to ElasticsearchDocumentStore by @mkkuemmel in #2192
Support more data types and extended filters in WeaviateDocStore by @bogdankostic in #2143
Adding extended meta data filtering support for InMemoryDocumentStore by @MichelBartels in #2120
Fix ef_search param for hnsw in OpenSearchDocumentStore by @tstadel in #2227
Add Brownfield Support of existing Elasticsearch indices by @bogdankostic in #2229
Introduce readonly DCDocumentStore (without labels support) by @tstadel in #1991
Extend meta data support for SQLDocumentStore by @MichelBartels in #2199
Fix missing embeddings not skipped if filters are used by @MichelBartels in #2230

REST API

Convert doc embedding from ndarray to list of float for REST API by @julian-risch in #1901
Autogenerate OpenAPI specs file by @ZanSara in #2047
Make openapi.json multiline so the diff is parsable by @ZanSara in #2163
Align REST API and Haystack versions by @ZanSara in #2164
Add DELETE /feedback for testing and make the label's id generate server-side by @ZanSara in #2159
Add type check for meta & add tests by @ZanSara in #2184
Update url in POST /file-upload by @ZanSara in #2193
Versioning openapi.json by @ZanSara in #2228

Docker

Change docstores_gpu into docstores-gpu in Dockerfile-GPU by @ZanSara in #2129
Remove run_docker_gpu.sh by @ZanSara in #2003
Remove rest extra from Dockerfile-GPU by @ZanSara in #2122
Fix dependency related build issues in Dockerfiles by @ZanSara in #2135
Add docker-compose override file for Traffic Monitoring by @tstadel in #2224
Adding a minimal haystack gpu build by @ArzelaAscoIi in #2185

Documentation

Remove stray requirements.txt files and update README.md by @ZanSara in #2075
Make the docstring bot work only on master by @ZanSara in #2078
Add who uses Haystack section by @dmigo in #1975
Rename image to fix link in CONTRIBUTING.md by @ZanSara in #2211
Add ADR template for transparent architecture decisions by @tholor in #2072
Update Readme to reflect changes to installation procedure by @brandenchan in #2157
Add REST API and UI installation info to readme by @brandenchan in #2160
Upgrade pydoc-markdown by @ZanSara in #2117

CI

Introduce pylint & other improvements on the CI by @ZanSara in #2130
Apply black formatting by @ZanSara in #2115
Pylint: solve or silence locally rare warnings by @ZanSara in #2170
Revert "Make the docstring bot work only on master" by @ZanSara in #2114
Fix CI build-cache issue causing code changes to take no effect by @tstadel in #2082
Disable cache on the CI by @ZanSara in #2083
Reintroduce push on master trigger for Linux CI by @ZanSara in #2127
Allow Linux CI to push changes to forks by @ZanSara in #2182
Fix windows ci tests by @tstadel in #2144
Disable autoformat.yml on master by @ZanSara in #2198
Testing actions (@ZanSara) by @hegyibalint in #2200

Other Changes

Add UnlabeledTextProcessor by @MichelBartels in #2054
fix answer is not subscriptable error by @julian-risch in #2069
Add faiss dependency to tutorial 12 by @julian-risch in #2109
Simplify SQuAD data to df conversion by @mathislucka in #2124
Remove requirements for json schema by @ZanSara in #2128
Move pytest configuration into pyproject.toml by @ZanSara in #2141
Fix MultiLabel creation with aggregate_by_meta by @tstadel in #2165
Add tests on MultiLabel's meta and filter aggregation by @tstadel in #2169
Improve Label and MultiLabel __str__ and __repr__ by @ZanSara in #2202

New Contributors

@adri1wald made their first contribution in #2065
@tiangolo made their first contribution in #2020
@baregawi made their first contribution in #2125
@mkkuemmel made their first contribution in #2192
@hegyibalint made their first contribution in #2200

❤️ Big thanks to all contributors and the whole community!

deepset-ai/haystack v1.2.0 on GitHub