Highlights
💬 Generative Question Answering via RAG (#484)
Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG").
Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast: GPT-3 relies on the web data seen during training)
Example:
question = "who got the first nobel prize in physics?"
# Retrieve related documents from retriever
retrieved_docs = retriever.retrieve(query=question)
# Now generate answer from question and retrieved documents
predicted_result = generator.predict(
question=question,
documents=retrieved_docs,
top_k=1
)
You can already play around with it in this minimal tutorial:
We are looking forward to improving this class of models further in the coming months and already plan a tighter integration into the Finder class.
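To make advantage (b) concrete, here is a minimal, dependency-free sketch of the retrieve-then-generate flow. It is not Haystack's implementation: `toy_retrieve`, `build_generator_inputs` and the " // " separator are hypothetical stand-ins for a real retriever and for the way a generator's input is conditioned on retrieved documents.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(query: str, docs: List[str], top_k: int = 2) -> List[str]:
    """Hypothetical retriever: rank documents by word overlap with the query."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_generator_inputs(query: str, retrieved: List[str]) -> List[str]:
    """Pair every retrieved document with the question.

    The separator is an assumption made for illustration only.
    """
    return [f"{doc} // {query}" for doc in retrieved]


docs = [
    "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Röntgen.",
    "Haystack is a framework for question answering.",
    "Marie Curie received the Nobel Prize in Physics in 1903.",
]
question = "who got the first nobel prize in physics?"

retrieved = toy_retrieve(question, docs, top_k=2)
generator_inputs = build_generator_inputs(question, retrieved)
```

Because the generator only sees what the retriever returns, replacing `docs` with your own domain documents changes the generator's inputs without any retraining.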
↗️ Better DPR (incl. training) (#527)
We migrated the existing DensePassageRetriever to its own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.
Example:
dense_passage_retriever.train(self,
data_dir: str,
train_filename: str,
dev_filename: str = None,
test_filename: str = None,
batch_size: int = 16,
embed_title: bool = True,
num_hard_negatives: int = 1,
n_epochs: int = 3)
Future improvements: At the moment training is only supported on single GPUs. We will add support for Multi-GPU Training via DDP soon.
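As a rough illustration of the training data described above, the following sketch writes a single-sample training file. The field names (`question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`, and `title` / `text` per passage) are assumed to match the JSON layout of the original DPR codebase; please verify them against the version you use. The file name and temporary directory are placeholders.

```python
import json
import tempfile
from pathlib import Path

# One training sample: a query, its positive passage(s) and optional negatives.
# Field names are assumed to follow the original DPR JSON layout.
train_samples = [
    {
        "question": "who got the first nobel prize in physics?",
        "answers": ["Wilhelm Conrad Röntgen"],
        "positive_ctxs": [
            {
                "title": "Nobel Prize in Physics",
                "text": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Röntgen.",
            }
        ],
        "negative_ctxs": [],
        "hard_negative_ctxs": [
            {
                "title": "Nobel Prize in Chemistry",
                "text": "The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van 't Hoff.",
            }
        ],
    }
]

# Placeholder location; in practice this is your own data directory.
data_dir = Path(tempfile.mkdtemp())
train_filename = "train.json"
train_path = data_dir / train_filename
train_path.write_text(json.dumps(train_samples, indent=2), encoding="utf-8")

# Read the file back to confirm it is valid JSON.
loaded = json.loads(train_path.read_text(encoding="utf-8"))
```

The resulting `data_dir` and `train_filename` correspond to the first two parameters of the `train()` signature shown above.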
📊 New Benchmarks
Happy to introduce a new benchmark section on our website!
Do you wonder if you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elastic's BM25? How would this impact speed and accuracy?
See the relevant metrics here to guide your decision:
👉 https://haystack.deepset.ai/bm/benchmarks
We will extend this section over time with more models, metrics and key parameters.
⚠️ Breaking Changes
Consistent parameter naming for TransformersReader #510
# old:
TransformersReader(model="distilbert-base-uncased-distilled-squad" ..)
# new
TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad" ...)
FAISS: Remove phi normalization, support more index types #467
New default index type is "Flat" and params have changed slightly:
# old
FAISSDocumentStore(
sql_url: str = "sqlite:///",
index_buffer_size: int = 10_000,
vector_size: int = 768,
faiss_index: Optional[IndexHNSWFlat] = None,
)
# new
FAISSDocumentStore(
sql_url: str = "sqlite:///",
index_buffer_size: int = 10_000,
vector_dim: int = 768,
faiss_index_factory_str: str = "Flat",
faiss_index: Optional[faiss.swigfaiss.Index] = None,
return_embedding: Optional[bool] = True,
**kwargs,
)
DPR signature
Splitting max_seq_len into two independent params (max_seq_len_query and max_seq_len_passage).
Removing the remove_sep_tok_from_untitled_passages param.
# old
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
remove_sep_tok_from_untitled_passages: bool = True
)
# new
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len_query: int = 64,
max_seq_len_passage: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
use_fast_tokenizers: bool = True,
similarity_function: str = "dot_product"
)
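If you keep constructor arguments in config dictionaries, the renames above can be sketched as a small migration helper. `migrate_kwargs` is a hypothetical function, not part of Haystack. It only covers the parameter changes listed in this section, and reusing the old `max_seq_len` value for both new DPR params is an assumption intended to preserve the previous behavior.

```python
from typing import Any, Dict


def migrate_kwargs(class_name: str, old_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Translate pre-release keyword arguments to the new parameter names.

    Hypothetical helper for illustration; returns a new dict and leaves
    the input untouched.
    """
    new_kwargs = dict(old_kwargs)

    if class_name == "TransformersReader":
        # model -> model_name_or_path (#510)
        if "model" in new_kwargs:
            new_kwargs["model_name_or_path"] = new_kwargs.pop("model")

    elif class_name == "FAISSDocumentStore":
        # vector_size -> vector_dim (#467)
        if "vector_size" in new_kwargs:
            new_kwargs["vector_dim"] = new_kwargs.pop("vector_size")

    elif class_name == "DensePassageRetriever":
        # The param was removed, so drop it if present.
        new_kwargs.pop("remove_sep_tok_from_untitled_passages", None)
        # Assumption: apply the old single limit to both queries and passages.
        if "max_seq_len" in new_kwargs:
            old_limit = new_kwargs.pop("max_seq_len")
            new_kwargs.setdefault("max_seq_len_query", old_limit)
            new_kwargs.setdefault("max_seq_len_passage", old_limit)

    return new_kwargs


reader_kwargs = migrate_kwargs(
    "TransformersReader", {"model": "distilbert-base-uncased-distilled-squad"}
)
store_kwargs = migrate_kwargs("FAISSDocumentStore", {"vector_size": 768})
dpr_kwargs = migrate_kwargs(
    "DensePassageRetriever",
    {"max_seq_len": 128, "remove_sep_tok_from_untitled_passages": True, "use_gpu": False},
)
```

Note that the new defaults differ from the old ones (for example `faiss_index_factory_str="Flat"` and `max_seq_len_query=64`), so it is worth reviewing the migrated arguments rather than relying on defaults.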
Detailed Changes
Preprocessing / File Conversion
- Add preprocessing pipeline #473
- Restructure checks in PreProcessor #504
- Updated the example code to Indexing PDF / Docx files #502
- Fix meta data = None in PreProcessor #496
- add explicit encoding mode to file_converter/txt.py #478
- Skip file conversion if file type is not supported #456
DocumentStore
- Add support for MySQL database #556
- Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
- Add support to return embedding #514
- Fix scoring in Elasticsearch for dot product #517
- Allow filters for get_document_count() #512
- Make creation of label index optional #490
- Fix update_embeddings function in FAISSDocumentStore #481
- FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
- Enable bulk operations on vector IDs for FAISSDocumentStore #460
- fixing ElasticsearchDocumentStore initialisation #415
- bug: filters on a query_by_embedding #464
Retriever
- DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
- Fix retriever evaluation metrics #547
- Add save and load method for DPR #550
- Typo in dense.py comment #545
- Make returning predictions in Finder & Retriever eval() possible #524
- Make title info optional when evaluating on QA data #494
- Make sentence-transformers usage more user-friendly #439
Reader
- Fix FARMReader.eval() handling of no_answers #531
- Added automatic mixed precision (AMP) support for reader training from Haystack side #463
- Update ONNX conversion for FARMReader #438
Other
- Fix sentencepiece dependencies in Dockerfiles #553
- Update Dockerfile #537
- Removing (deprecated) warnings from the Haystack codebase. #530
- Pytest fix memory leak and put pytest marker on slow tests #520
- [enhancement] Create deploy_website.yml #450
- Add Docker Images & Setup for the Annotation Tool #444
REST API
- Make filter value optional in REST API #497
- Add Elasticsearch Query DSL compliant Query API #471
- Allow configuration of log level in REST API #541
- Add create_index and similarity metric to api config #493
- Add deepcopy for meta dicts in answers #485
- Fix windows platform installation #480
- Update GPU docker & fix race condition with multiple workers #436
Documentation / Benchmarks / Tutorials
- New readme #534
- Add public roadmap #432
- Time and performance benchmarks for all readers and retrievers #339
- Added new formatting for examples in docstrings #555
- Update annotation docs for website #505
- Add annotation tool manual to README.md #523
- Change metric to queries per second on benchmarks webpage #529
- Add --ci and --update-json to CLI for benchmarks #522
- Add requirement to colab notebooks #509
- Update doc string for ElasticsearchDocumentStore.write_documents() & sync markdown files #501
- Add versioning docs #495
- READ.me for Docstring Generation #468
- Separate data and view for benchmarks #451
- Update DPR docstring for embed_title #459
- Update Tutorial4_FAQ_style_QA.py #416
❤️ Big thanks to all contributors!
@lalitpagaria @guillim @elyase @kolk @rsanjaykamath @antoniolanza1996 @Zenahr @Futurne @tanaysoni @tholor @Timoeller @PiffPaffM @bogdankostic