⭐ Highlights
This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:
Pipeline Evaluation in Batch Mode #2942
The evaluation of pipelines often uses large datasets, and with this new feature batches of queries can be processed at the same time on a GPU. This decreases the time needed for an evaluation run, and we are working on further speed improvements. To try it out, you only need to replace the call to `pipeline.eval()` with `pipeline.eval_batch()` when you evaluate your question answering pipeline:
```python
...
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
```
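The returned `EvaluationResult` can then be turned into per-node metrics as usual; a minimal sketch, assuming the default `Reader` and `Retriever` node names of the `ExtractiveQAPipeline`:

```python
# Aggregate metrics per node from the batch evaluation result.
metrics = eval_result.calculate_metrics()
print(metrics["Reader"])     # e.g. exact_match and f1 for the reader
print(metrics["Retriever"])  # e.g. recall and mrr for the retriever
```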
Early Stopping in Reader and Retriever Training #3071
When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't improve any further after the first few epochs, the training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including `loss`, `EM`, `f1`, and `top_n_accuracy` for `FARMReader`, or `loss`, `acc`, `f1`, and `average_rank` for `DensePassageRetriever`. For example, reader training can be stopped when `loss` doesn't decrease by at least 0.001 compared to the previous epoch:
```python
from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
reader.train(
    data_dir="data/squad20",
    train_filename="dev-v2.0.json",
    early_stopping=EarlyStopping(min_delta=0.001),
    use_gpu=True,
    n_epochs=8,
    save_dir="my_model",
)
```
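You can also monitor a different metric; the sketch below stops training once `top_n_accuracy` stops improving (the `metric`, `mode`, and `patience` parameter names are assumptions about the `EarlyStopping` API):

```python
from haystack.utils.early_stopping import EarlyStopping

# Illustrative assumption: EarlyStopping accepts metric/mode/patience parameters.
early_stopping = EarlyStopping(
    metric="top_n_accuracy",  # evaluation metric to monitor
    mode="max",               # stop when the metric stops increasing
    patience=2,               # tolerate two evaluations without improvement
)

# Pass it to reader.train(..., early_stopping=early_stopping) as in the example above.
```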
PineconeDocumentStore Without SQL Database #2749
Thanks to @jamescalam, the `PineconeDocumentStore` does not depend on a local SQL database anymore. So when you initialize a `PineconeDocumentStore` from now on, all you need to provide is a Pinecone API key:
```python
from haystack import Document
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(api_key="...")
docs = [Document(content="...")]
document_store.write_documents(docs)
```
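After writing documents, the store can be used like any other Haystack document store, for example together with an `EmbeddingRetriever`; a minimal sketch (the embedding model name is an illustrative assumption):

```python
from haystack.nodes import EmbeddingRetriever

# The embedding model below is an illustrative assumption; any supported
# sentence-embedding model works.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
document_store.update_embeddings(retriever)

results = retriever.retrieve(query="...", top_k=5)
```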
FAISS in OpenSearchDocumentStore #3101 #3029
OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's `OpenSearchDocumentStore` you can now set the `knn_engine` parameter to choose between `nmslib` and `faiss`. When loading an existing index, you can also specify a `knn_engine` and Haystack checks if the same engine was used to create the index. If not, it falls back to slow exact vector calculation.
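For example, you could create a FAISS-backed index like this (host and credential values are illustrative assumptions for a default local OpenSearch setup):

```python
from haystack.document_stores import OpenSearchDocumentStore

# Connection settings below are assumptions for a default local OpenSearch instance.
document_store = OpenSearchDocumentStore(
    host="localhost",
    port=9200,
    username="admin",
    password="admin",
    index="document",
    embedding_dim=768,
    knn_engine="faiss",  # choose between "nmslib" (default) and "faiss"
)
```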
Highlighted Bug Fixes
A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files affecting parameters of type `Optional[List[]]`, where validation failed if the parameter was explicitly set to `None`.
- fix: Use `use_auth_token` in all cases when loading from the HF Hub by @sjrl in #3094
- bug: handle `Optional` params in schema validation by @anakin87 in #2980
Other Changes
DocumentStores
Documentation
- refactor: rename `master` into `main` in documentation and links by @ZanSara in #3063
- docs: fixed typo (or old documentation) in ipynb tutorial 3 by @DavidGerva in #3033
- docs: Add OpenAI Answer Generator API by @brandenchan in #3050
Crawler
- fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in #3043
- fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in #3070
Other Changes
- fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in #3077
- test: Use `random_sample` instead of `ndarray` for random array in `OpenSearchDocumentStore` test by @bogdankostic in #3083
- feat: add progressbar to `upload_files()` for deepset Cloud client by @tholor in #3069
- refactor: update package metadata by @ofek in #3079
New Contributors
- @DavidGerva made their first contribution in #3033
- @ofek made their first contribution in #3079
❤️ Big thanks to all contributors and the whole community!
Full Changelog: v1.7.1...v1.8.0