github deepset-ai/haystack v1.3.0


⭐ Highlights

Pipeline YAML Syntax Validation

The syntax of pipeline configurations defined in YAML files can now be validated. If validation fails, the erroneous components and parameters are identified so they are easy to fix. Here is a code snippet to validate a file manually:

from pathlib import Path
from haystack.pipelines.config import validate_yaml
validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))
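Conceptually, the validator checks each entry in the configuration against a schema and reports what is wrong. Here is a toy sketch of that idea in plain Python (this is not Haystack's actual implementation, which validates the YAML against a published JSON schema):

```python
# Toy illustration of schema-style validation -- not Haystack's real validator,
# which checks the configuration against a published JSON schema.
def find_config_errors(config: dict) -> list:
    """Return a list of human-readable problems found in a pipeline config dict."""
    errors = []
    if "version" not in config:
        errors.append("missing top-level 'version' key")
    for i, component in enumerate(config.get("components", [])):
        for required_key in ("name", "type"):
            if required_key not in component:
                errors.append(f"component #{i} is missing '{required_key}'")
    return errors
```

A config missing its `version` and a component `type` would yield two errors, each pointing at the offending entry, which is the kind of feedback the real validator gives you.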

Your IDE can also validate pipeline YAML files as you edit them. The suffix *.haystack-pipeline.yml tells your IDE that the YAML contains a Haystack pipeline configuration, enabling checks and autocompletion features if the IDE is configured accordingly (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore, which points to the schema files for the different Haystack versions. Note that updating Haystack might sometimes require small changes to your pipeline YAML files. You can set version: 'unstable' in the pipeline YAML to circumvent the validation, or set it to the latest Haystack version if the components and parameters you use are compatible with it. #2226
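For reference, a minimal pipeline YAML might look like the sketch below. The component names and parameters here are illustrative examples, not a definitive configuration; consult the schema for your Haystack version:

```yaml
# Illustrative sketch of a pipeline YAML -- components and params are examples.
version: 'unstable'   # skips version validation; pin a concrete version when compatible

components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore
      top_k: 10

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
```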

Pinecone DocumentStore

We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for large-scale dense retrieval: embeddings and metadata are stored in a hosted Pinecone vector database, while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. To use the new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. #2254

import os
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])

BEIR Integration

Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool, BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, you can evaluate a DocumentSearchPipeline by calling Pipeline.eval_beir() after installing Haystack with the BEIR dependency via pip install farm-haystack[beir]. Cheers! #2333

from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

# Indexing: convert raw text files and write them to Elasticsearch
text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)

index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

# Querying: retrieve candidate documents for each BEIR query
query_pipeline = DocumentSearchPipeline(retriever=retriever)

# Download the BEIR dataset, run both pipelines, and compute retrieval metrics
ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)
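Pipeline.eval_beir() reports the standard retrieval metrics NDCG, MAP, recall, and precision. As a reminder of what the first of those measures, NDCG discounts the relevance of each retrieved document by the logarithm of its rank, then normalizes by the best achievable ordering. A minimal stdlib sketch (illustrative, not the BEIR implementation):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each graded relevance is discounted
    # by the log of its 1-based rank in the result list.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; pushing relevant documents down the ranking lowers the score.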

Breaking Changes

  • Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in #2126
  • Set provider parameter when instantiating onnxruntime.InferenceSession and make device a torch.device in internal methods by @cjb06776 in #1976

Models

  • Update LFQA with the latest LFQA seq2seq and retriever models by @vblagoje in #2210

❤️ Big thanks to all contributors and the whole community!
