github deepset-ai/haystack v1.23.0-rc1

Pre-release, published 9 months ago

⭐️ Highlights

🪨 Amazon Bedrock support for PromptNode (#6226)

Haystack now supports Amazon Bedrock models, including all existing and previously announced
models, such as Llama-2-70b-chat. To use these models, simply pass the model ID in the
model_name_or_path parameter, as you do for any other model. For details, see the
Amazon Bedrock Documentation.

For example, the following code loads the Llama 2 Chat 13B model:

from haystack.nodes import PromptNode

prompt_node = PromptNode(model_name_or_path="meta.llama2-13b-chat-v1")

🗺️ Support for MongoDB Atlas Document Store (#6471)

With this release, we introduce support for MongoDB Atlas as a Document Store. Try it with:

from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string="mongodb+srv://USER:PASSWORD@HOST/?retryWrites=true&w=majority",
    database_name="database",
    collection_name="collection",
)
...
document_store.write_documents(...)

Note that you need MongoDB Atlas credentials to fill the connection string. You can get such credentials by registering here: https://www.mongodb.com/cloud/atlas/register

🔎 Document Stores filter specification for Haystack 2.x (#6001)

With proposal #6001, we introduced a better specification for declaring filters in Haystack 2.x.
The new syntax is a bit more verbose but less confusing and ambiguous, as there are no implicit operators.
This simplifies converting the common syntax into Document Store-specific filtering logic, easing
the development of new Document Stores.
Since everything must be declared explicitly, it also makes filters easier for users to understand just
by reading them.

See the full specification below.


Filters on the top level must be a dictionary.

There are two types of dictionaries:

  • Comparison
  • Logic

The top level can be either a Comparison or a Logic dictionary.

Comparison dictionaries must contain the keys:

  • field
  • operator
  • value

Logic dictionaries must contain the keys:

  • operator
  • conditions

The conditions key must be a list of dictionaries, either Comparison or Logic.

operator values in Comparison dictionaries must be one of:

  • ==
  • !=
  • >
  • >=
  • <
  • <=
  • in
  • not in

operator values in Logic dictionaries must be one of:

  • NOT
  • OR
  • AND

A simple filter:

filters = {"field": "meta.type", "operator": "==", "value": "article"}

A more complex filter:

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
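To make the semantics concrete, here is a minimal evaluator for the new filter style. This is a hypothetical sketch, not part of Haystack's API: the matches function and the treatment of NOT as "none of the conditions hold" are assumptions for illustration only.

```python
# Hypothetical sketch (not Haystack code): evaluate a new-style filter
# against a flat metadata dict, addressing keys as "meta.<name>".

def matches(filters: dict, meta: dict) -> bool:
    if "conditions" in filters:  # Logic dictionary
        results = [matches(c, meta) for c in filters["conditions"]]
        op = filters["operator"]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        if op == "NOT":  # assumed here to mean: none of the conditions hold
            return not any(results)
        raise ValueError(f"Unknown logic operator: {op}")

    # Comparison dictionary: field, operator, value
    field = filters["field"].removeprefix("meta.")
    value, target = meta.get(field), filters["value"]
    return {
        "==": lambda: value == target,
        "!=": lambda: value != target,
        ">": lambda: value > target,
        ">=": lambda: value >= target,
        "<": lambda: value < target,
        "<=": lambda: value <= target,
        "in": lambda: value in target,
        "not in": lambda: value not in target,
    }[filters["operator"]]()


doc_meta = {"type": "article", "date": 1500000000, "rating": 4, "genre": "economy"}
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.rating", "operator": ">=", "value": 3},
    ],
}
print(matches(filters, doc_meta))  # True
```

Because every operator is explicit, the evaluator is a direct transcription of the specification: one branch for Logic dictionaries, one lookup table for Comparison operators.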

To avoid causing too much disruption for users relying on legacy filters, we'll keep supporting them for now.
We also provide a utility conversion function so that developers implementing their own Document Stores can do the same.

⚙️ Function to convert legacy Document Store filters to the new style (#6314)

Following the proposal to introduce a new way of declaring filters
in Haystack 2.x for Document Stores and all Components that use them,
we introduce a utility function to convert the legacy style to the new style.

This will make life easier for developers when implementing new Document Stores,
as they will only need to implement filtering logic for the new-style filters;
conversion of legacy filters is handled entirely by the utility function.

Example usage would look similar to this:

legacy_filter = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {"genre": {"$in": ["economy", "politics"]}, "publisher": {"$eq": "nytimes"}},
    }
}
assert convert(legacy_filter) == {
    "operator": "AND",
    "conditions": [
        {"field": "type", "operator": "==", "value": "article"},
        {"field": "date", "operator": ">=", "value": "2015-01-01"},
        {"field": "date", "operator": "<", "value": "2021-01-01"},
        {"field": "rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
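The core of such a conversion is mechanical: map legacy operators to their new names and wrap sibling conditions in an implicit AND. The following is a simplified, hypothetical sketch; Haystack's actual convert function handles more cases (e.g. shorthand filters without explicit comparison operators), and the names convert_legacy and _conditions are invented for illustration.

```python
# Simplified, hypothetical sketch of legacy->new filter conversion.
# Haystack's real utility covers additional shorthand forms.

COMPARISON_OPS = {"$eq": "==", "$ne": "!=", "$gt": ">", "$gte": ">=",
                  "$lt": "<", "$lte": "<=", "$in": "in", "$nin": "not in"}
LOGIC_OPS = {"$and": "AND", "$or": "OR", "$not": "NOT"}


def _conditions(filters: dict) -> list:
    """Turn a legacy dict into a list of new-style condition dicts."""
    out = []
    for key, value in filters.items():
        if key in LOGIC_OPS:
            out.append({"operator": LOGIC_OPS[key], "conditions": _conditions(value)})
        else:
            # key is a field name mapped to {legacy_comparison_op: value, ...}
            for op, target in value.items():
                out.append({"field": key, "operator": COMPARISON_OPS[op], "value": target})
    return out


def convert_legacy(filters: dict) -> dict:
    conditions = _conditions(filters)
    # A single condition stands on its own; siblings get an implicit AND.
    return conditions[0] if len(conditions) == 1 else {"operator": "AND", "conditions": conditions}


legacy = {"$and": {"rating": {"$gte": 3}, "type": {"$eq": "article"}}}
print(convert_legacy(legacy))
# {'operator': 'AND', 'conditions': [{'field': 'rating', 'operator': '>=', 'value': 3},
#                                    {'field': 'type', 'operator': '==', 'value': 'article'}]}
```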

For more information on the new filters technical specification, see proposal #6001.

📃 Document serialization and backward compatibility for Haystack 2.0 (#6180)

The Document serialization has changed quite a bit, and this will make it easier to implement
new Document Stores.

The most notable change is that the Document.flatten() method has been removed.
Document.to_dict() now has a flatten parameter that defaults to True for backward compatibility.
When converting to a dict, it flattens metadata keys to the same level as the other Document fields.
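The effect of the flatten parameter can be illustrated with a small standalone sketch; this is not Haystack's implementation, and to_dict_sketch is an invented name used purely to show the shape of the two output dicts:

```python
# Hypothetical illustration of flattening: with flatten=True, metadata keys
# are merged at the same level as the other Document fields (1.x-style);
# with flatten=False, they stay nested under a "meta" key (2.x-style).

def to_dict_sketch(doc_fields: dict, meta: dict, flatten: bool = True) -> dict:
    if flatten:
        return {**doc_fields, **meta}
    return {**doc_fields, "meta": meta}


fields = {"id": "1", "content": "Hello"}
meta = {"type": "article", "rating": 3}
print(to_dict_sketch(fields, meta))
# {'id': '1', 'content': 'Hello', 'type': 'article', 'rating': 3}
print(to_dict_sketch(fields, meta, flatten=False))
# {'id': '1', 'content': 'Hello', 'meta': {'type': 'article', 'rating': 3}}
```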

to_json and from_json have been removed, as to_dict and from_dict already handle serialisation
of dataframe and blob fields.
Now metadata must only contain primitives that can be serialized to JSON with no custom encoders.
If any Document Store needs custom serialization, they can implement their own logic.

Document has also been made backward compatible so that it can be created from dictionaries
structured like the legacy 1.x Document. Legacy fields are converted automatically to their
new counterparts, or ignored if they have no counterpart.

🚀 New Features

  • Add PptxConverter: a node to convert pptx files to Haystack Documents.

  • Add split_length by token in PreProcessor.

  • Support for dense embedding instructions used in retrieval models such as BGE and LLM-Embedder.

  • You can use Amazon Bedrock models in Haystack.

  • Add MongoDBAtlasDocumentStore, providing support for MongoDB Atlas as a document store.

⚡️ Enhancement Notes

  • Change PromptModel constructor parameter invocation_layer_class to accept a str too.
    If a str is used the invocation layer class will be imported and used.
    This should ease serialisation to YAML when using invocation_layer_class with PromptModel.

  • Users can now define the number of pods and pod type directly when creating a PineconeDocumentStore instance.

  • Add batch_size to the init method of FAISS Document Store. This works as the default value for all methods of
    FAISS Document Store that support batch_size.

  • Introduce a new timeout keyword argument in PromptNode, addressing issue #5380 and giving enhanced control over individual calls to OpenAI.

  • Upgrade Transformers to the latest version, 4.35.2.
    This version adds support for DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, and Owl-v2.

  • Upgraded openai-whisper to version 20231106 and simplified installation through re-introduced audio extra.
    The latest openai-whisper version unpins its tiktoken dependency, which resolved a version conflict with Haystack's dependencies.

  • Make it possible to load additional fields from the SQuAD format file into the meta field of the Labels.

  • Add new variable model_kwargs to the ExtractiveReader so we can pass different loading options supported by
    HuggingFace.

  • Add new token limit for gpt-4-1106-preview model.

🐛 Bug Fixes

  • Fix Pipeline.load_from_deepset_cloud to work with the latest version of deepset Cloud.

  • When using JoinDocuments with join_mode=concatenate (default) and
    passing duplicate documents, including some with a null score, this
    node raised an exception. This case is now handled correctly, and
    the documents are joined as expected.

  • Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus
    ensures that they are included in JSON schema generation.

🩵 Haystack 2.0 preview

  • Refactored InMemoryDocumentStore and MetadataRouter filtering logic to support new filters declaration.

  • Integrate SearchApi as an additional websearch provider.

  • Add CohereGenerator compatible with Cohere generate endpoint.

  • Make PyPDFToDocument accept a converter_name parameter instead of
    a converter instance, to allow a smooth serialization of the component.

  • Add DynamicPromptBuilder to dynamically generate prompts from either a list of ChatMessage instances or a string
    template, leveraging Jinja2 templating for flexible and efficient prompt construction.

  • Add ConditionalRouter component to enhance the conditional pipeline routing capabilities.
    The ConditionalRouter component orchestrates the flow of data by evaluating specified route conditions
    to determine the appropriate route among a set of provided route alternatives.

  • Improve the public interface of the Generators:

    • make generation_kwargs a dictionary
    • rename pipeline_kwargs (in HuggingFaceLocalGenerator) to huggingface_pipeline_kwargs
  • Extends the input types of RemoteWhisperTranscriber from List[ByteStream] to List[Union[str,
    Path, ByteStream]] to make it possible to connect it to FileTypeRouter.

  • Adds support for adding additional metadata and utilizing metadata received from ByteStream sources when creating documents using HTMLToDocument.

  • Introduce a function to convert legacy filters to the new style.

  • Change the write_documents() method in the DocumentStore protocol to return the number
    of documents written.

  • Added a new DocumentJoiner component so that hybrid retrieval pipelines can merge the document result lists of multiple retrievers.
    Similarly, indexing pipelines can use DocumentJoiner to merge multiple lists of documents created by different file converters.

  • Make InMemoryDocumentStore.write_documents() return the number of docs actually
    written.

  • Introduce a new Document representation, which includes meta, score and
    embedding size.

  • Remove most parameters from TextFileToDocument to make it match all other converters.

  • Add support for ByteStreams.

  • Upgrade Canals to 0.10.0.

  • Refactor Document.__eq__() so it compares the Document's dictionary
    representation instead of only its id.
    Previously this comparison would have unexpectedly succeeded:

    first_doc = Document(id="1", content="Hey!")
    second_doc = Document(id="1", content="Hello!")
    assert first_doc == second_doc
    first_doc.content = "Howdy!"
    assert first_doc == second_doc

    With this change the last comparison would rightly fail.

  • Fix a failure that occurred when creating a Document passing the 'meta' keyword
    to the constructor. A specific ValueError is now raised if the 'meta' keyword is
    passed along with metadata as keyword arguments; the two options are now
    mutually exclusive.

  • Updated end-to-end test to use the DocumentLanguageClassifier with a MetadataRouter in a preprocessing pipeline.

  • Remove routing functionality from DocumentLanguageClassifier and rename TextLanguageClassifier to TextLanguageRouter.
    Classifiers in Haystack 2.x change metadata values but do not route inputs to multiple outputs; the latter is reserved for routers.
    Use DocumentLanguageClassifier in combination with MetadataRouter to classify and route documents in indexing pipelines.

  • Update Reader documentation to explain that top_k+1 answers are returned if no_answer is enabled (default).

  • Removes the unused query parameter from the run method of MetaFieldRanker.

  • Change the default value of scale_score to False for Retrievers.
    Users can still explicitly set scale_score to True to get relevance
    scores in the range [0, 1].

  • Make Document's constructor fail when it is passed fields that are not present in the dataclass. An exception is made for "content_type" and "id_hash_keys": they are accepted in order to keep backward compatibility.

  • Add callable hook to PyPDFToDocument to enable easier customization of pdf to Document conversion.

  • Adds MetaFieldRanker, a component that ranks a list of Documents based on the value of a metadata field of choice.

  • Adds GPTChatGenerator, a chat-based OpenAI LLM component; ChatMessage instances are used for input and output.

  • Change Document.blob field type from bytes to ByteStream and remove Document.mime_type field.

  • Adapt GPTGenerator to use strings for input and output

  • Move Text Language Classifier and Document Language Classifier to the
    classifiers package.

  • Adds HuggingFaceTGIChatGenerator for text and chat generation. This component supports remote inference for
    Hugging Face LLMs via the text-generation-inference (TGI) protocol.

  • Rename TextDocumentSplitter to DocumentSplitter, to allow a better
    distinction between Components that operate on text and those that operate
    on Documents.

  • Adds HuggingFaceTGIGenerator for text generation. This component supports remote inference for
    Hugging Face LLMs via the text-generation-inference (TGI) protocol.

  • Allow passing generation_kwargs in the run method of the HuggingFaceLocalGenerator.
    This makes this common operation faster.

  • Add MarkdownToTextDocument, a file converter that converts Markdown files into text Documents.

  • Added DocumentLanguageClassifier component so that Documents can be routed to different components based on the detected language, for example during preprocessing.

  • Refactor Document serialisation and make it backward compatible with Haystack 1.x.
