⭐ Highlights
Make Your QA Pipelines Talk with Audio Nodes! (#2584)
Indexing pipelines can use a new `DocumentToSpeech` node, which generates an audio file for each indexed document and stores it alongside the text content in a `SpeechDocument`. A GPU is recommended for this step to increase indexing speed. During querying, `SpeechDocument`s allow accessing the stored audio version of the documents the answers are extracted from. There is also a new `AnswerToSpeech` node that can be used in QA pipelines to generate the audio of an answer on the fly. See the new tutorial for a step-by-step guide on how to make your QA pipelines talk!
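The pattern behind these nodes can be illustrated without Haystack: each indexed document keeps its text and gains a path to a generated audio file. The sketch below is purely conceptual; the `synthesize` stub stands in for a real text-to-speech model, and none of the names here are the actual Haystack API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SpeechDocumentSketch:
    """A document that carries its text plus a path to its audio rendering."""
    content: str
    audio_path: Path


def synthesize(text: str, out_dir: Path) -> Path:
    # Stand-in for a TTS model: writes a placeholder file per document.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{abs(hash(text))}.wav"
    path.write_bytes(b"RIFF")  # placeholder payload, not real audio
    return path


def index_with_audio(texts, out_dir: Path):
    # Generate audio at indexing time and store it alongside the text,
    # so querying can later surface the audio version of each document.
    return [SpeechDocumentSketch(content=t, audio_path=synthesize(t, out_dir)) for t in texts]
```

In the real pipeline, the expensive step is the speech synthesis itself, which is why the release notes recommend a GPU for indexing.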
Save Models to Remote (#2618)
A new `save_to_remote` method was introduced to the `FARMReader`, so that you can easily upload a trained model to the Hugging Face Model Hub. More of this to come in the following releases!
```python
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="roberta-base")
reader.train(data_dir="my_squad_data", train_filename="squad2.json", n_epochs=1, save_dir="my_model")
reader.save_to_remote(repo_id="your-user-name/roberta-base-squad2", private=True, commit_message="First version of my qa model trained with Haystack")
```
Note that you need to be logged in with `transformers-cli login`. Otherwise, there will be an error message with instructions on how to log in. Further, if you make your model private by setting `private=True`, others won't be able to use it, and you will need to pass an authentication token when you reload the model from the Model Hub; the token is also created via `transformers-cli login`.
```python
new_reader = FARMReader(model_name_or_path="your-user-name/roberta-base-squad2", use_auth_token=True)
```
Multi-Hop Dense Retrieval (#2571)
There is a new `MultihopEmbeddingRetriever` node that applies iterative retrieval steps and a shared encoder for the query and the documents. Used together with a reader node in a QA pipeline, it is suited for answering complex open-domain questions that require "hopping" across multiple relevant documents. See the original paper by Xiong et al. for more details: "Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval".
```python
from haystack.nodes import MultihopEmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
retriever = MultihopEmbeddingRetriever(
    document_store=document_store,
    embedding_model="deutschmann/mdr_roberta_q_encoder",
)
```
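The core idea of multi-hop retrieval is that after each hop, the query is reformulated with the evidence retrieved so far, so the next hop can reach documents the original query alone would miss. The toy sketch below illustrates only this concept; it is not the Haystack internals, and it uses a trivial bag-of-words "encoder" shared by queries and documents in place of a trained dense encoder.

```python
from collections import Counter


def encode(text):
    # Shared toy "encoder" for queries and documents: a bag-of-words vector.
    return Counter(text.lower().split())


def score(q_vec, d_vec):
    # Dot product of the sparse word-count vectors.
    return sum(q_vec[w] * d_vec[w] for w in q_vec)


def multihop_retrieve(query, docs, hops=2):
    retrieved = []
    for _ in range(hops):
        candidates = [d for d in docs if d not in retrieved]
        best = max(candidates, key=lambda d: score(encode(query), encode(d)))
        retrieved.append(best)
        query = query + " " + best  # reformulate the query with the new evidence
    return retrieved
```

After the first hop, words from the retrieved document enter the query, which is what lets the second hop find a document connected to the answer only indirectly.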
Big thanks to our community member @deutschmn for the PR!
InMemoryKnowledgeGraph (#2678)
Besides querying texts and tables, Haystack also allows querying knowledge graphs with the help of pre-trained models that translate text queries to graph queries. The latest Haystack release adds an `InMemoryKnowledgeGraph`, which allows storing knowledge graphs without setting up complex graph databases. Try out the tutorial as a notebook on Colab!
```python
from pathlib import Path
from haystack.nodes import Text2SparqlRetriever
from haystack.document_stores import InMemoryKnowledgeGraph
from haystack.utils import fetch_archive_from_http

# Fetch data represented as triples of subject, predicate, and object statements
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip", output_dir="data/tutorial10")

# Fetch a pre-trained BART model that translates text queries to SPARQL queries
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip", output_dir="../saved_models/tutorial10/")

# Initialize knowledge graph and import triples from a ttl file
kg = InMemoryKnowledgeGraph(index="tutorial10")
kg.create_index()
kg.import_from_ttl_file(index="tutorial10", path=Path("data/tutorial10/triples.ttl"))

# Initialize retriever from pre-trained model
kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path("../saved_models/tutorial10/hp_v3.4"))

# Translate a text query to a SPARQL query and execute it on the knowledge graph
print(kgqa_retriever.retrieve(query="In which house is Harry Potter?"))
```
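Stripped of the translation model, the underlying idea is a store of (subject, predicate, object) triples queried by pattern matching, where an unbound position plays the role of a SPARQL variable. The following minimal sketch is illustrative only and not the actual `InMemoryKnowledgeGraph` implementation.

```python
class TinyTripleStore:
    """An in-memory list of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = []

    def add(self, subj, pred, obj):
        self.triples.append((subj, pred, obj))

    def query(self, subj=None, pred=None, obj=None):
        # None acts as a wildcard, like a variable in a SPARQL pattern.
        return [t for t in self.triples
                if (subj is None or t[0] == subj)
                and (pred is None or t[1] == pred)
                and (obj is None or t[2] == obj)]


kg = TinyTripleStore()
kg.add("harry_potter", "house", "gryffindor")
kg.add("draco_malfoy", "house", "slytherin")

# "In which house is Harry Potter?" as a triple pattern with an unbound object:
print(kg.query(subj="harry_potter", pred="house"))  # [('harry_potter', 'house', 'gryffindor')]
```

The pre-trained BART model in the tutorial does the hard part: mapping free-form text to such a structured pattern query.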
Big thanks to our community member @anakin87 for the PR!
Torch 1.12 and Transformers 4.20.1 Support
Haystack is now compatible with the recent PyTorch v1.12 release, so you can take advantage of Apple silicon GPUs (Apple M1) for accelerated training and evaluation. PyTorch has shared an impressive analysis of the speedups compared to CPU-only execution.
Haystack is also compatible with the latest Transformers v4.20.1 release and we will continuously ensure that you can benefit from the latest features in Haystack!
Other Changes
Pipeline
- Fix JoinAnswer/JoinNode by @MichelBartels in #2612
- Reduce logging messages and simplify logging by @julian-risch in #2682
- Correct docstring parameter name by @julian-risch in #2757
- Add AnswerToSpeech by @ZanSara in #2584
- Fix params being changed during pipeline.eval() by @tstadel in #2638
- Make crawler extract also hidden text by @anakin87 in #2642
- Update document scores based on ranker node by @mathislucka in #2048
- Improved crawler support for dynamically loaded pages by @danielbichuetti in #2710
- Replace deprecated Selenium methods by @ZanSara in #2724
- Fix EvaluationSetClient.get_labels() by @tstadel in #2690
- Show warning in reader.eval() about differences compared to pipeline.eval() by @tstadel in #2477
- Fix using id_hash_keys as pipeline params by @tstadel in #2717
- Fix loading of tokenizers in DPR by @bogdankostic in #2755
- Add support for Multi-Hop Dense Retrieval by @deutschmn in #2571
- Create target folder if not exists in EvalResult.save() by @tstadel in #2647
- Validate max_seq_length in SquadProcessor by @francescocastelli in #2740
Models
- Use AutoTokenizer by default, to easily adapt to new models and token… by @apohllo in #1902
- first version of save_to_remote for HF from FarmReader by @TuanaCelik in #2618
DocumentStores
- Move Opensearch document store in its own module by @masci in #2603
- Extract common code for ES and OS into a base class by @masci in #2664
- Fix bugs in loading code from yaml by @masci in #2705
- fix error in log message by @anakin87 in #2719
- Pin es client to include bugfixes by @masci in #2735
- Make check of document & embedding count optional in FAISS and Pinecone by @julian-risch in #2677
- In memory knowledge graph by @anakin87 in #2678
- Pinecone unary queries upgrade by @jamescalam in #2657
- wait for postgres to be ready before data migrations by @masci in #2654
Documentation & Tutorials
- Update docstrings for GPL by @agnieszka-m in #2633
- Add GPL API docs, unit tests update by @vblagoje in #2634
- Add GPL adaptation tutorial by @vblagoje in #2632
- GPL tutorial - add GPU header and open in colab button by @vblagoje in #2736
- Add execute_eval_run example to Tutorial 5 by @tstadel in #2459
- Tutorial 14 edit by @robpasternak in #2663
Misc
- Replace question issue with link to discussions by @masci in #2697
- Upgrade transformers to 4.20.1 by @julian-risch in #2702
- Upgrade torch to 1.12 by @julian-risch in #2741
- Remove rapidfuzz version pin by @tstadel in #2730
New Contributors
- @ryanrussell made their first contribution in #2617
- @apohllo made their first contribution in #1902
- @robpasternak made their first contribution in #2663
- @danielbichuetti made their first contribution in #2710
- @francescocastelli made their first contribution in #2740
- @deutschmn made their first contribution in #2571
❤️ Big thanks to all contributors and the whole community!
Full Changelog: v1.5.0...v1.6.0