⭐ Highlights
🚀 Support for gpt-3.5-turbo-instruct
We are happy to announce that Haystack now supports OpenAI's new `gpt-3.5-turbo-instruct` model! Simply pass the model name to `PromptNode` to use it:
```python
pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))
```
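For a complete, runnable snippet, here is a minimal sketch assuming the `OPENAI_API_KEY` environment variable is set (the prompt text is just an example):

```python
import os

from haystack.nodes import PromptNode

# Instantiate the PromptNode with the new instruct model.
pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))

# Calling the node directly prompts the model; it returns a list of generated strings.
print(pn("Explain retrieval-augmented generation in one sentence."))
```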
2️⃣ Preview Installation Extra
Excited about the upcoming Haystack 2.0? We have introduced a new installation extra called `preview` that lets you try out the Haystack 2.0 preview! This extra also makes Haystack's core dependencies leaner and thus speeds up installation. If you would like to start experimenting with the new Haystack 2.0 components and pipeline design right away, run:
```bash
pip install farm-haystack[preview]
```
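Once installed, the 2.0 preview components live under the `haystack.preview` namespace. Below is a minimal sketch of the new component and pipeline design, assuming the preview API shipped with this release (the toy `Greeter` component is purely illustrative, and names may still change before the final 2.0):

```python
from haystack.preview import Pipeline, component


@component
class Greeter:
    """Toy component that turns a name into a greeting."""

    @component.output_types(greeting=str)
    def run(self, name: str):
        return {"greeting": f"Hello, {name}!"}


pipeline = Pipeline()
pipeline.add_component("greeter", Greeter())

# Inputs are passed per component; outputs come back keyed by component name.
result = pipeline.run({"greeter": {"name": "Haystack"}})
print(result["greeter"]["greeting"])  # Hello, Haystack!
```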
⚡️ WeaviateDocumentStore Performance
We fixed a bottleneck in `WeaviateDocumentStore` that was slowing down indexing. The fix brought a notable performance improvement: indexing one million documents is now six times faster!
🐣 PineconeDocumentStore Robustness
The `PineconeDocumentStore` now uses metadata instead of namespaces to distinguish between documents with embeddings, documents without embeddings, and labels. This is a breaking change, and it makes the `PineconeDocumentStore` more robust to use in Haystack pipelines. If you want to retrieve all documents with an embedding, specify the metadata instead of the namespace as follows:
```python
from haystack.document_stores.pinecone import DOCUMENT_WITH_EMBEDDING

# docs = doc_store.get_all_documents(namespace="vectors")  # old way using namespaces
docs = doc_store.get_all_documents(type_metadata=DOCUMENT_WITH_EMBEDDING)
```
Additionally, if you want to retrieve all documents without an embedding, specify the metadata instead of the namespace:
```python
# docs = doc_store.get_all_documents(namespace="no-vectors")  # old way using namespaces
docs = doc_store.get_all_documents(type_metadata="no-vector")
```
⬆️ Upgrade Notes
- `SklearnQueryClassifier` is removed; users should switch to the more powerful `TransformersQueryClassifier` instead (see the migration sketch after this list). #5447
- Refactor `PineconeDocumentStore` to use metadata instead of namespaces for the distinction between documents with embeddings, documents without embeddings, and labels.
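As a minimal migration sketch for the `SklearnQueryClassifier` removal, the snippet below uses `TransformersQueryClassifier` with its default question-vs-keyword model (the example query and the exact routing labels depend on the model you choose):

```python
from haystack.nodes import TransformersQueryClassifier

# The default model distinguishes natural-language questions/statements from keyword queries.
query_classifier = TransformersQueryClassifier()

# Like other decision nodes, run() returns an output payload and the edge the query is routed to.
output, edge = query_classifier.run(query="Who is the father of Arya Stark?")
print(edge)  # e.g. "output_1" for questions/statements, "output_2" for keyword queries
```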
✨ Enhancements
- ci: Fix typos discovered by codespell running in pre-commit.
- Support OpenAI's new `gpt-3.5-turbo-instruct` model.
🐛 Bug Fixes
- Fix `EntityExtractor` output not being JSON serializable.
- Fix `model_max_length` not being set in the tokenizer in `DefaultPromptHandler`.
- Fix a bottleneck in the Weaviate document store that was slowing down indexing.
- The `gpt-35-turbo-16k` model from Azure now integrates correctly.
- Upgrade `tiktoken` to 0.5.1 to account for a breaking release.
👁️ Haystack 2.0 preview
- Add the `AnswerBuilder` component for Haystack 2.0 that creates `Answer` objects from the string output of Generators.
- Add the `LinkContentFetcher` component to Haystack 2.0. `LinkContentFetcher` fetches content from a given URL and converts it into a `Document` object, which can then be used within a Haystack 2.0 pipeline.
- Add `MetadataRouter`, a component that routes documents to different edges based on the content of their fields.
- Add support for PDF files to the Document converter via the `pypdf` library.
- Add the `SerperDevWebSearch` component to retrieve URLs from the web. See https://serper.dev/ for more information.
- Add the `TikaDocumentConverter` component to convert files of different types to Documents.
- Add an `ExtractiveReader` for v2. It is meant as a replacement wherever `FARMReader` was previously used for inference. Confidence scores are calculated differently from `FARMReader` because each span is treated as an independent binary classification task.
- Introduce `GPTGenerator`, a class that can generate completions using OpenAI chat models such as GPT-3.5 and GPT-4.
- Remove the `id` parameter from the `Document` constructor, as it was ignored and a new one was generated anyway. This is a backwards-incompatible change.
- Add a `generators` module for LLM generator components.
- Add `GPT4Generator`, an LLM component based on `GPT35Generator`.
- Add an `embedding_retrieval` method to `MemoryDocumentStore`, which retrieves the most relevant Documents for a given query embedding. It will be called by the `MemoryEmbeddingRetriever`.
- Rename `MemoryRetriever` to `MemoryBM25Retriever`. Add `MemoryEmbeddingRetriever`, which takes a query embedding as input and retrieves the most relevant Documents from a `MemoryDocumentStore`.
- Add a proposal for an extended `Document` class in Haystack 2.0.
- Add the implementation of said class.
- Add the OpenAI Text Embedder, a component that uses OpenAI models to embed strings into vectors.
- Revert #5826 and optionally accept the `id` in the `Document` class constructor.
- Create a dedicated dependency list for the preview package, `farm-haystack[preview]`. Using `haystack-ai` is still the recommended way to test Haystack 2.0.
- Add the `PromptBuilder` component to render prompts from template strings (see the sketch after this list).
- Add `prefix` and `suffix` attributes to `SentenceTransformersDocumentEmbedder`. They can be used to add a prefix and a suffix to the Document text before embedding it, which is necessary to take full advantage of modern embedding models such as E5.
- Add support for dates in filters.
- Add `UrlCacheChecker` to support web retrieval pipelines. It checks whether documents coming from a given list of URLs are already present in the store and, if so, returns them. All URLs with no matching documents are returned on a separate connection.
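As referenced in the `PromptBuilder` item above, here is a minimal sketch of how the component can be used. The import path and the shape of the output are assumptions based on the preview package layout in this release and may change before the final 2.0:

```python
from haystack.preview.components.builders.prompt_builder import PromptBuilder

# The template is a Jinja2 string; variables are filled in at run time.
builder = PromptBuilder(template="Answer the following question: {{ question }}")

result = builder.run(question="What does the PromptBuilder component do?")
print(result["prompt"])  # Answer the following question: What does the PromptBuilder component do?
```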