⭐️ Highlights
🪨 Amazon Bedrock support for PromptNode (#6226)
Haystack now supports Amazon Bedrock models, including all existing and previously announced
models, such as Llama-2-70b-chat. To use these models, pass the model ID in the
model_name_or_path parameter, just as you do for any other model. For details, see
the Amazon Bedrock documentation.
For example, the following code loads the Llama 2 Chat 13B model:
from haystack.nodes import PromptNode
prompt_node = PromptNode(model_name_or_path="meta.llama2-13b-chat-v1")
🗺️ Support for MongoDB Atlas Document Store (#6471)
With this release, we introduce support for MongoDB Atlas as a Document Store. Try it with:
from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string="mongodb+srv://USER:PASSWORD@HOST/?retryWrites=true&w=majority",
    database_name="database",
    collection_name="collection",
)
...
document_store.write_documents(...)
Note that you need MongoDB Atlas credentials to fill in the connection string. You can get such credentials by registering at https://www.mongodb.com/cloud/atlas/register.
🔎 Document Stores filter specification for Haystack 2.x (#6001)
With proposal #6001, we introduced a better specification to declare filters in Haystack 2.x.
The new syntax is a bit more verbose but less confusing and ambiguous, as there are no implicit operators.
This will simplify converting from this common syntax to each Document Store's specific filtering logic, easing
the development of new Document Stores.
Since everything must be declared explicitly, it will also make it easier for the user to understand the filters just
by reading them.
See the full specification below.
Filters at the top level must be a dictionary. There are two types of dictionaries: Comparison and Logic. The top level can be either a Comparison or a Logic dictionary.
- Comparison dictionaries must contain the keys: field, operator, value.
- Logic dictionaries must contain the keys: operator, conditions.
- The conditions key must be a list of dictionaries, either Comparison or Logic.
- The operator value in Comparison dictionaries must be one of: ==, !=, >, >=, <, <=, in, not in.
- The operator value in Logic dictionaries must be one of: NOT, OR, AND.
A simple filter:
filters = {"field": "meta.type", "operator": "==", "value": "article"}
A more complex filter:
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
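Once defined, filters in this format can be passed straight to a Document Store that implements the new specification. A minimal sketch, assuming the 2.x preview InMemoryDocumentStore (the exact import path may differ in your version) and documents already written:

from haystack.preview.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
# ... write documents whose metadata contains "type", "date", "rating", etc. ...
matching_docs = document_store.filter_documents(filters=filters)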
To avoid causing too much disruption for users relying on legacy filters, we'll keep supporting them for now.
We also provide a utility convert function so that developers implementing their own Document Store can do the same.
⚙️ Function to convert legacy Document Store filters to the new style (#6314)
Following the proposal to introduce a new way of declaring filters
in Haystack 2.x for Document Stores and all components that use them,
we introduce a utility function to convert the legacy style to the new style.
This will make life easier for developers implementing new Document Stores,
as they will only need to implement filtering logic for the new-style filters;
the conversion is handled entirely by the utility function.
An example usage:
legacy_filter = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {"genre": {"$in": ["economy", "politics"]}, "publisher": {"$eq": "nytimes"}},
    }
}
assert convert(legacy_filter) == {
    "operator": "AND",
    "conditions": [
        {"field": "type", "operator": "==", "value": "article"},
        {"field": "date", "operator": ">=", "value": "2015-01-01"},
        {"field": "date", "operator": "<", "value": "2021-01-01"},
        {"field": "rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
For more information on the new filters technical specification, see proposal #6001.
📃 Document serialization and backward compatibility for Haystack 2.0 (#6180)
The Document serialization has changed quite a bit, and this will make it easier to implement
new Document Stores.
The most notable change is that the Document.flatten() method has been removed.
Document.to_dict() now has a flatten parameter that defaults to True for backward compatibility:
it flattens metadata keys at the same level as other Document fields when converting to a dict.
to_json and from_json have been removed, as to_dict and from_dict already handle serialization
of the dataframe and blob fields.
The metadata must now contain only primitives that can be serialized to JSON with no custom encoders.
If a Document Store needs custom serialization, it can implement its own logic.
Document has also been made backward compatible, so it can be created from dictionaries
structured like the legacy 1.x Document. The legacy fields are automatically converted to
their new counterparts or ignored if there are none.
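A minimal sketch of the new behavior (the import path and the metadata field name assume the 2.x preview package layout and may differ in your version):

from haystack.preview import Document

doc = Document(content="Hello", metadata={"lang": "en"})

# flatten=True (the default) lifts metadata keys to the top level of the dict
assert doc.to_dict()["lang"] == "en"

# flatten=False keeps the metadata under its own key
assert doc.to_dict(flatten=False)["metadata"]["lang"] == "en"

# Legacy 1.x dictionaries are also accepted (assumed entry point: from_dict);
# legacy fields are converted to their new counterparts or ignored
legacy_doc = Document.from_dict({"content": "Hello", "content_type": "text", "meta": {"lang": "en"}})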
⬆️ Upgrade Notes
- Remove deprecated OpenAIAnswerGenerator, BaseGenerator, GenerativeQAPipeline, and related tests.
  Generative QA pipelines should use PromptNode instead (see the sketch below). See https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode.
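For instance, a minimal PromptNode setup for generative QA might look like this (the model name and the deepset/question-answering PromptHub template are illustrative choices, not the only option):

from haystack.nodes import PromptNode

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template="deepset/question-answering",  # fetched by name from the PromptHub
)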
🚀 New Features
- Add PptxConverter: a node to convert pptx files to Haystack Documents.
- Add split_length by token in PreProcessor (see the sketch after this list).
- Support for dense embedding instructions used in retrieval models such as BGE and LLM-Embedder.
- You can use Amazon Bedrock models in Haystack.
- Add MongoDBAtlasDocumentStore, providing support for MongoDB Atlas as a document store.
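A minimal sketch of token-based splitting (the split_by="token" value and the numbers below are illustrative assumptions):

from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="token",                       # assumed switch enabling token-based splitting
    split_length=200,                       # max tokens per resulting document
    split_overlap=20,                       # tokens shared between consecutive splits
    split_respect_sentence_boundary=False,  # keep the sketch independent of sentence splitting
)
docs = [Document(content="A long text to be split into smaller documents ...")]
split_docs = preprocessor.process(docs)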
⚡️ Enhancement Notes
- Change the PromptModel constructor parameter invocation_layer_class to accept a str too.
  If a str is used, the invocation layer class will be imported and used.
  This should ease serialization to YAML when using invocation_layer_class with PromptModel.
- Users can now define the number of pods and pod type directly when creating a PineconeDocumentStore instance.
- Add batch_size to the init method of FAISS Document Store. This works as the default value for all methods of
  FAISS Document Store that support batch_size.
- Introduce a new timeout keyword argument in PromptNode, addressing issue #5380 for enhanced control over individual calls to OpenAI (see the sketch after this list).
- Upgrade Transformers to the latest version, 4.35.2. This version adds support for DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, and Owl-v2.
- Upgrade openai-whisper to version 20231106 and simplify installation through the re-introduced audio extra.
  The latest openai-whisper version unpins its tiktoken dependency, which resolves a version conflict with Haystack's dependencies.
- Make it possible to load additional fields from the SQuAD format file into the meta field of the Labels.
- Add a new model_kwargs variable to the ExtractiveReader so we can pass different loading options supported by
  Hugging Face.
- Add a new token limit for the gpt-4-1106-preview model.
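For example, the new timeout argument can be set at construction time (the model name and value are illustrative):

from haystack.nodes import PromptNode

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    timeout=30,  # seconds to wait for each individual call to OpenAI
)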
🐛 Bug Fixes
- Fix Pipeline.load_from_deepset_cloud to work with the latest version of deepset Cloud.
- When using JoinDocuments with join_mode=concatenate (the default) and passing duplicate documents,
  including some with a null score, this node raised an exception.
  Now this case is handled correctly and the documents are joined as expected.
- Add LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py,
  thus ensuring that they are included in JSON schema generation.
🩵 Haystack 2.0 preview
- Refactor InMemoryDocumentStore and MetadataRouter filtering logic to support the new filters declaration.
- Integrate SearchApi as an additional websearch provider.
- Add CohereGenerator compatible with the Cohere generate endpoint.
- Make PyPDFToDocument accept a converter_name parameter instead of a converter instance,
  to allow a smooth serialization of the component.
- Add DynamicPromptBuilder to dynamically generate prompts from either a list of ChatMessage instances or a string
  template, leveraging Jinja2 templating for flexible and efficient prompt construction.
- Add ConditionalRouter component to enhance the conditional pipeline routing capabilities.
  The ConditionalRouter component orchestrates the flow of data by evaluating specified route conditions
  to determine the appropriate route among a set of provided route alternatives (see the sketch after this list).
- Improve the public interface of the Generators:
  - make generation_kwargs a dictionary
  - rename pipeline_kwargs (in HuggingFaceLocalGenerator) to huggingface_pipeline_kwargs
- Extend the input types of RemoteWhisperTranscriber from List[ByteStream] to List[Union[str,
  Path, ByteStream]] to make it possible to connect it to FileTypeRouter.
- Add support for adding additional metadata and utilizing metadata received from ByteStream sources when creating documents using HTMLToDocument.
- Introduce a function to convert legacy filters to the new style.
- Change the write_documents() method in the DocumentStore protocol to return the number
  of documents written.
- Add a new DocumentJoiner component so that hybrid retrieval pipelines can merge the document result lists of multiple retrievers.
  Similarly, indexing pipelines can use DocumentJoiner to merge multiple lists of documents created by different file converters.
- Make InMemoryDocumentStore.write_documents() return the number of docs actually written.
- Introduce a new Document representation, which includes meta, score, and embedding size.
- Remove most parameters from TextFileToDocument to make it match all other converters.
- Add support for ByteStreams.
- Upgrade Canals to 0.10.0.
- Refactor Document.__eq__() so it compares the Documents' dictionary representation instead of only their id.
  Previously, this comparison would have unexpectedly worked:

  first_doc = Document(id="1", content="Hey!")
  second_doc = Document(id="1", content="Hello!")
  assert first_doc == second_doc
  first_doc.content = "Howdy!"
  assert first_doc == second_doc

  With this change, the last comparison rightly fails.
- Fix a failure that occurred when creating a document passing the 'meta' keyword
  to the constructor. Raise a specific ValueError if the 'meta' keyword is passed
  along with metadata as keyword arguments; the two options are now mutually exclusive.
- Update the end-to-end test to use the DocumentLanguageClassifier with a MetadataRouter in a preprocessing pipeline.
- Remove routing functionality from DocumentLanguageClassifier and rename TextLanguageClassifier to TextLanguageRouter.
  Classifiers in Haystack 2.x change metadata values but do not route inputs to multiple outputs; the latter is reserved for routers.
  Use DocumentLanguageClassifier in combination with MetadataRouter to classify and route documents in indexing pipelines.
- Update the Reader documentation to explain that top_k+1 answers are returned if no_answer is enabled (the default).
- Remove the unused query parameter from the run method of MetaFieldRanker.
- Change the default value of scale_score to False for Retrievers.
  Users can still explicitly set scale_score to True to get relevance scores in the range [0, 1].
- Make Document's constructor fail when it is passed fields that are not present in the dataclass. An exception is made for "content_type" and "id_hash_keys": they are accepted in order to keep backward compatibility.
- Add a callable hook to PyPDFToDocument to enable easier customization of PDF to Document conversion.
- Add MetaFieldRanker, a component that ranks a list of Documents based on the value of a metadata field of choice.
- Add GPTChatGenerator, a chat-based OpenAI LLM component; ChatMessage(s) are used for input and output.
- Change the Document.blob field type from bytes to ByteStream and remove the Document.mime_type field.
- Adapt GPTGenerator to use strings for input and output.
- Move Text Language Classifier and Document Language Classifier to the classifiers package.
- Add HuggingFaceTGIChatGenerator for text and chat generation. This component supports remote inferencing for
  Hugging Face LLMs via the text-generation-inference (TGI) protocol.
- Rename TextDocumentSplitter to DocumentSplitter, to allow a better distinction between components that operate
  on text and those that operate on Documents.
- Add HuggingFaceTGIGenerator for text generation. This component supports remote inferencing for
  Hugging Face LLMs via the text-generation-inference (TGI) protocol.
- Allow passing generation_kwargs in the run method of the HuggingFaceLocalGenerator. This makes this common operation faster.
- Add MarkdownToTextDocument, a file converter that converts Markdown files into text Documents.
- Add DocumentLanguageClassifier component so that Documents can be routed to different components based on the detected language, for example during preprocessing.
- Refactor Document serialization and make it backward compatible with Haystack 1.x.
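As referenced in the ConditionalRouter note above, here is a minimal sketch of its usage. The import path assumes the 2.x preview package layout, and the route shape (condition, output, output_name, output_type keys with Jinja2 expressions) follows the component's documentation; both may differ in your version:

from haystack.preview.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{streams|length > 2}}",  # Jinja2 expression evaluated against the run inputs
        "output": "{{streams}}",
        "output_name": "enough_streams",
        "output_type": list,
    },
    {
        "condition": "{{streams|length <= 2}}",
        "output": "{{streams}}",
        "output_name": "insufficient_streams",
        "output_type": list,
    },
]
router = ConditionalRouter(routes)
result = router.run(streams=[1, 2, 3])
# The first matching condition wins: result == {"enough_streams": [1, 2, 3]}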