⭐ Highlights
🚀 Support for gpt-3.5-turbo-instruct
We are happy to announce that Haystack now supports OpenAI's new `gpt-3.5-turbo-instruct` model! Simply pass the model name to `PromptNode` to use it:
```python
pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))
```
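For a complete, runnable snippet, here is a minimal sketch assuming the `OPENAI_API_KEY` environment variable is set (the prompt text is just an example):

```python
import os

from haystack.nodes import PromptNode

# Instantiate the PromptNode with the new instruct model.
pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))

# Calling the node directly prompts the model; it returns a list of generated strings.
print(pn("Explain retrieval-augmented generation in one sentence."))
```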
2️⃣ Preview Installation Extra
Excited about the upcoming Haystack 2.0? We have introduced a new installation extra called `preview` that lets you try out the Haystack 2.0 preview! This extra also makes Haystack's core dependencies leaner and thus speeds up installation. If you would like to start experimenting with the new Haystack 2.0 components and pipeline design right away, run:
```bash
pip install farm-haystack[preview]
```
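Once installed, the 2.0 preview components live under the `haystack.preview` namespace. Below is a minimal sketch of the new component and pipeline design, assuming the preview API shipped with this release (the toy `Greeter` component is purely illustrative, and names may still change before the final 2.0):

```python
from haystack.preview import Pipeline, component


@component
class Greeter:
    """Toy component that turns a name into a greeting."""

    @component.output_types(greeting=str)
    def run(self, name: str):
        return {"greeting": f"Hello, {name}!"}


pipeline = Pipeline()
pipeline.add_component("greeter", Greeter())

# Inputs are passed per component; outputs come back keyed by component name.
result = pipeline.run({"greeter": {"name": "Haystack"}})
print(result["greeter"]["greeting"])  # Hello, Haystack!
```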
⚡️ WeaviateDocumentStore Performance
We fixed a bottleneck in `WeaviateDocumentStore` that was slowing down indexing. The fix brought a notable performance improvement: indexing one million documents is now six times faster!
🐣 PineconeDocumentStore Robustness
The `PineconeDocumentStore` now uses metadata instead of namespaces to distinguish between documents with embeddings, documents without embeddings, and labels. This is a breaking change, and it makes the `PineconeDocumentStore` more robust to use in Haystack pipelines. If you want to retrieve all documents with an embedding, specify the metadata instead of the namespace as follows:
```python
from haystack.document_stores.pinecone import DOCUMENT_WITH_EMBEDDING

# docs = doc_store.get_all_documents(namespace="vectors")  # old way using namespaces
docs = doc_store.get_all_documents(type_metadata=DOCUMENT_WITH_EMBEDDING)
```
Additionally, if you want to retrieve all documents without an embedding, specify the metadata instead of the namespace:
```python
# docs = doc_store.get_all_documents(namespace="no-vectors")  # old way using namespaces
docs = doc_store.get_all_documents(type_metadata="no-vector")
```
⬆️ Upgrade Notes
- `SklearnQueryClassifier` is removed; users should switch to the more powerful `TransformersQueryClassifier` instead (see the migration sketch after this list). #5447
- Refactor `PineconeDocumentStore` to use metadata instead of namespaces for the distinction between documents with embeddings, documents without embeddings, and labels.
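As a minimal migration sketch for the `SklearnQueryClassifier` removal, the snippet below uses `TransformersQueryClassifier` with its default question-vs-keyword model (the example query and the exact routing labels depend on the model you choose):

```python
from haystack.nodes import TransformersQueryClassifier

# The default model distinguishes natural-language questions/statements from keyword queries.
query_classifier = TransformersQueryClassifier()

# Like other decision nodes, run() returns an output payload and the edge the query is routed to.
output, edge = query_classifier.run(query="Who is the father of Arya Stark?")
print(edge)  # e.g. "output_1" for questions/statements, "output_2" for keyword queries
```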
✨ Enhancements
- ci: Fix typos discovered by codespell running in pre-commit.
- Support OpenAI's new `gpt-3.5-turbo-instruct` model.
🐛 Bug Fixes
- Fix `EntityExtractor` output not being JSON serializable.
- Fix `model_max_length` not being set in the tokenizer in `DefaultPromptHandler`.
- Fix a bottleneck in the Weaviate document store that was slowing down indexing.
- The `gpt-35-turbo-16k` model from Azure now integrates correctly.
- Upgrade `tiktoken` to 0.5.1 to account for a breaking release.
👁️ Haystack 2.0 preview
- Add the `AnswerBuilder` component for Haystack 2.0 that creates `Answer` objects from the string output of Generators.
- Add the `LinkContentFetcher` component to Haystack 2.0. `LinkContentFetcher` fetches content from a given URL and converts it into a `Document` object, which can then be used within a Haystack 2.0 pipeline.
- Add `MetadataRouter`, a component that routes documents to different edges based on the content of their fields.
- Add support for PDF files to the Document converter via the `pypdf` library.
- Add the `SerperDevWebSearch` component to retrieve URLs from the web. See https://serper.dev/ for more information.
- Add the `TikaDocumentConverter` component to convert files of different types to Documents.
- Add an `ExtractiveReader` for v2. It is meant as a replacement wherever `FARMReader` was previously used for inference. Confidence scores are calculated differently from `FARMReader` because each span is treated as an independent binary classification task.
- Introduce `GPTGenerator`, a class that can generate completions using OpenAI chat models such as GPT-3.5 and GPT-4.
- Remove the `id` parameter from the `Document` constructor, as it was ignored and a new one was generated anyway. This is a backwards-incompatible change.
- Add a `generators` module for LLM generator components.
- Add `GPT4Generator`, an LLM component based on `GPT35Generator`.
- Add an `embedding_retrieval` method to `MemoryDocumentStore`, which retrieves the most relevant Documents for a given query embedding. It will be called by the `MemoryEmbeddingRetriever`.
- Rename `MemoryRetriever` to `MemoryBM25Retriever`. Add `MemoryEmbeddingRetriever`, which takes a query embedding as input and retrieves the most relevant Documents from a `MemoryDocumentStore`.
- Add a proposal for an extended `Document` class in Haystack 2.0.
- Add the implementation of said class.
- Add the OpenAI Text Embedder, a component that uses OpenAI models to embed strings into vectors.
- Revert #5826 and optionally accept the `id` in the `Document` class constructor.
- Create a dedicated dependency list for the preview package, `farm-haystack[preview]`. Using `haystack-ai` is still the recommended way to test Haystack 2.0.
- Add the `PromptBuilder` component to render prompts from template strings (see the sketch after this list).
- Add `prefix` and `suffix` attributes to `SentenceTransformersDocumentEmbedder`. They can be used to add a prefix and a suffix to the Document text before embedding it, which is necessary to take full advantage of modern embedding models such as E5.
- Add support for dates in filters.
- Add `UrlCacheChecker` to support web retrieval pipelines. It checks whether documents coming from a given list of URLs are already present in the store and, if so, returns them. All URLs with no matching documents are returned on a separate connection.
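As referenced in the `PromptBuilder` item above, here is a minimal sketch of how the component can be used. The import path and the shape of the output are assumptions based on the preview package layout in this release and may change before the final 2.0:

```python
from haystack.preview.components.builders.prompt_builder import PromptBuilder

# The template is a Jinja2 string; variables are filled in at run time.
builder = PromptBuilder(template="Answer the following question: {{ question }}")

result = builder.run(question="What does the PromptBuilder component do?")
print(result["prompt"])  # Answer the following question: What does the PromptBuilder component do?
```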