⭐ Highlights

Generative Pseudo Labeling

Dense retrievers excel when finetuned on a labeled dataset of the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling solves this dilemma by creating labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from that domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can be used to adapt retriever models that domain. Here is a code example that shows how to do that in Haystack:

from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.question_generator.question_generator import QuestionGenerator
from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator

# Initialize any document store and fill it with documents from your domain - no labels needed.
document_store = InMemoryDocumentStore()
document_store.write_documents(...) 

# Calculate and store a dense embedding for each document
retriever = EmbeddingRetriever(document_store=document_store, 
                               embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b", 
                               max_seq_len=200)
document_store.update_embeddings(retriever)

# Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
psg = PseudoLabelGenerator(qg, retriever)
output, _ = psg.run(documents=document_store.get_all_documents()) 
retriever.train(output["gpl_labels"])

#2388

Batch Processing with Query Pipelines

Every query pipeline now has a run_batch() method, which allows to pass multiple queries to the pipeline at once.
Together with a list of queries, you can either provide a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to its corresponding list of documents based on same index in the list. A third option is to have a list containing a single query, which is then applied to each list of documents separately.
Here is an example with a pipeline:

from haystack.pipelines import ExtractiveQAPipeline
...
pipe = ExtractiveQAPipeline(reader, retriever)
predictions = pipe.pipeline.run_batch(
        queries=["Who is the father of Arya Stark?","Who is the mother of Arya Stark?"], params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
    )

And here is an example with a single reader node:

from haystack.nodes import FARMReader
from haystack.schema import Document

FARMReader.predict_batch(
    queries=["1st sample query", "2nd sample query"]
    documents=[[Document(content="sample doc1"), Document(content="sample doc2")], [Document(content="sample doc3"), Document(content="sample doc4")]]

{"queries": ["1st sample query", "2nd sample query"], "answers": [[Answers from doc1 and doc2], [Answers from doc3 and doc4]], ...]}

#2481 #2575

Pipeline Evaluation with Advanced Label Scopes

Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct.
For this reason, EvaluationResult.calculate_metrics() accepts the parameters answer_scope and document_scope.

As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. You can specify answer_scope="context" in calculate_metrics() in that case. See the updated docstrings with a description of the different label scopes or the updated tutorial on evaluation.

...
document_store.add_eval_data(
        filename="data/tutorial5/nq_dev_subset_v2.json",
        preprocessor=preprocessor,
    )
...
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
metrics = eval_result.calculate_metrics(answer_scope="context")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')

#2482

Support of DeBERTa Models

Haystack now supports DeBERTa models! These kind of models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. And compared to a RoBERTa model trained on that dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).

from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")

#2097

⚠️ Breaking Changes

Validation for Ray pipelines by @ZanSara in #2545
Add run_batch method to all nodes and Pipeline to allow batch querying by @bogdankostic in #2481
Support context matching in pipeline.eval() by @tstadel in #2482

Other Changes

Pipeline

Add sort arg to JoinAnswers by @brandenchan in #2436
Update run() and run_batch() params descriptions in API by @agnieszka-m in #2542
[CI refactoring] Avoid ray==1.12.0 on Windows by @ZanSara in #2562
Prevent losing names of utilized components when loaded from config by @tstadel in #2525
Do not copy _component_config in get_components_definitions by @ZanSara in #2574
Add run_batch for standard pipelines by @bogdankostic in #2595
Fix Pipeline.get_config() for forked pipelines by @tstadel in #2616
Remove wrong retriever top_1 metrics from print_eval_report by @tstadel in #2510
Handle transformers pipeline flattening lists of length 1 by @MichelBartels in #2531
Fix pipeline.eval with context matching for Table-QA by @tstadel in #2597
set top_k to 5 in SAS to be consistent by @ClaMnc in #2550

DocumentStores

Make DeepsetCloudDocumentStore work with non-existing index by @bogdankostic in #2513
[Weaviate] Exit the while loop when we query less documents than available by @masci in #2537
Fix knn params for aws managed opensearch by @tstadel in #2581
Fix number of returned values in get_metadata_values_by_key by @bogdankostic in #2614

Retriever

Simplify loading of EmbeddingRetriever by @bogdankostic in #2619
Add training checkpoint in retriever trainer by @dimitrisna in #2543
Include meta data when computing embeddings in EmbeddingRetriever by @MichelBartels in #2559

Documentation

fix small typo in Document doc string by @galtay in #2520
rearrange contributing guidelines by @masci in #2515
Documenting output score of JoinDocuments when using concatenation by @MichelBartels in #2561
Minor lg updates to doc strings by @agnieszka-m in #2585
Adjust pydoc markdown config so methods shown with classes by @brandenchan in #2511
Update Ray pipeline docs with validation info by @agnieszka-m in #2590

Other Changes

Upgrade transformers version to 4.18.0 by @bogdankostic in #2514
Upgrade torch version to 1.11 by @bogdankostic in #2538
Fix tutorials 4, 7 and 8 by @bogdankostic in #2526
Tutorial1: convert_files_to_dicts --> convert_files_to_docs by @ZanSara in #2546
Fix docker image tag with semantic version for releases by @askainet in #2548
added launch_tika method by @anakin87 in #2567
Remove encoding option from PDFToTextOCRConverter by @julian-risch in #2553
Fix StaleElementReferenceException in Crawler by @bogdankostic in #2591

New Contributors

@galtay made their first contribution in #2520
@masci made their first contribution in #2515
@ClaMnc made their first contribution in #2550
@anakin87 made their first contribution in #2567
@dimitrisna made their first contribution in #2543

❤️ Big thanks to all contributors and the whole community!

deepset-ai/haystack v1.5.0 on GitHub

⭐ Highlights

Generative Pseudo Labeling

Batch Processing with Query Pipelines

Pipeline Evaluation with Advanced Label Scopes

Support of DeBERTa Models

⚠️ Breaking Changes

Other Changes

Pipeline

DocumentStores

Retriever

Documentation

Other Changes

New Contributors

deepset-ai/haystack v1.5.0
on GitHub