⭐ Highlights

Stop words for `PromptNode`

Stop words are now implemented at the PromptNode level and work for all models. You can pass `stop_words` to PromptNode as a list parameter to stop LLM text generation as soon as any of the stop words is encountered. Stop words are not included in the response.
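To make the behavior concrete, here is a minimal, library-independent sketch of what "stop at the first stop word and leave it out of the response" means. The helper name `truncate_at_stop_words` is hypothetical and is not part of Haystack. PromptNode applies stop words during generation, and its actual implementation may differ.

```python
from typing import List, Optional


def truncate_at_stop_words(text: str, stop_words: Optional[List[str]]) -> str:
    """Cut `text` at the earliest occurrence of any stop word.

    Hypothetical helper for illustration only: it mimics the documented
    outcome (the stop word itself is excluded from the result).
    """
    if not stop_words:
        return text
    cut = len(text)
    for word in stop_words:
        if not word:
            continue  # ignore empty strings, which would match at index 0
        idx = text.find(word)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


generated = "Answer: Paris. Observation: done"
trimmed = truncate_at_stop_words(generated, ["Observation:", "Question:"])
```

In this sketch, `trimmed` keeps everything before `Observation:` and drops the stop word along with the text after it.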

A dedicated GitHub repository for Haystack demo(s)

The source code for Haystack's Explore the World demo has been moved to a dedicated repository: https://github.com/deepset-ai/haystack-demos. Use this repository to check out the code, run it locally, fork, customize, and contribute!

New nodes: `ImageToText` and `CsvTextConverter`

This release sees two new nodes, both contributed by community members!

The first one is ImageToText (courtesy of our well-known @anakin87): an image captioning node that can generate descriptions of image files and create Haystack documents from them.

The second one is CsvTextConverter, from @Benvii: a small utility node that can load a CSV of FAQ question-answer pairs and correctly send them to your DocumentStore, making it super handy for FAQ matching pipelines.

Check out the docs to learn more about them and try them out!
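As a rough illustration of the kind of input CsvTextConverter is meant for, the sketch below reads a CSV of FAQ pairs with the standard library and turns each row into a plain dictionary. The column names `question` and `answer`, the `load_faq_pairs` helper, and the output layout are assumptions made for this example. It does not use or reproduce the node's actual code.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_faq_pairs(handle: TextIO) -> List[Dict]:
    """Read question-answer rows from a CSV file-like object.

    Assumes a header row with `question` and `answer` columns
    (an assumption for this sketch, see the docs for the real format).
    """
    docs = []
    for row in csv.DictReader(handle):
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question:
            continue  # skip rows without a question
        # The question becomes the searchable content, the answer is metadata.
        docs.append({"content": question, "meta": {"answer": answer}})
    return docs


SAMPLE_CSV = """question,answer
What is Haystack?,An open source NLP framework.
"How do I install it?","Use pip."
"""

faq_docs = load_faq_pairs(io.StringIO(SAMPLE_CSV))
```

Keeping the question as the content and the answer as metadata matches the usual FAQ matching setup, where the incoming query is compared against stored questions and the paired answer is returned.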

Faster tokenization for GPT models with `tiktoken`

Haystack now supports faster tokenization with OpenAI's tiktoken library, which can dramatically improve tokenization speed for GPT models. For unsupported architectures (Python < 3.8, arm64 and macOS), fallbacks are in place and the regular HuggingFace tokenizers are used. Thanks to @danielbichuetti for yet another amazing contribution!
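The fallback idea can be sketched as a small backend check: prefer tiktoken when the interpreter is new enough and the package is importable, otherwise fall back. The function name `pick_tokenizer_backend` and the returned labels are hypothetical, and the sketch deliberately leaves out the architecture checks mentioned above. Haystack's own detection logic may look different.

```python
import importlib.util
import sys
from typing import Tuple


def pick_tokenizer_backend(min_python: Tuple[int, int] = (3, 8)) -> str:
    """Return "tiktoken" when it looks usable, else "huggingface".

    Hypothetical helper for illustration: it only checks the Python
    version and whether the tiktoken package can be found.
    """
    if sys.version_info < min_python:
        return "huggingface"
    if importlib.util.find_spec("tiktoken") is None:
        return "huggingface"  # package not installed on this platform
    return "tiktoken"


backend = pick_tokenizer_backend()
```

Checking with `importlib.util.find_spec` avoids importing the package just to find out whether it is available.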

What's Changed

Breaking Changes

Migrating to use native Pytorch AMP by @sjrl in #2827
bug: consistent batch_size parameter names in distillation by @julian-risch in #3811
refactor: Move invocation_context from meta to own pipeline variable by @vblagoje in #3888

Pipeline

feat: Update cohere embedding models by @vblagoje in #3704
feat: add index parameter to TfidfRetriever by @anakin87 in #3666
feat: Use torch.inference_mode() for TableQA by @sjrl in #3731
feat: Enable text-embedding-ada-002 for EmbeddingRetriever by @vblagoje in #3721
refactor: improve monkey patch for SklearnQueryClassifier by @anakin87 in #3732
refactor: remove unused code in TfidfRetriever by @anakin87 in #3733
refactor: Remove duplicate code in TableReader by @sjrl in #3708
fix: Make InferenceProcessor thread safe by @bogdankostic in #3709
chore: adding template for prompt node by @TuanaCelik in #3738
fix: Fixed local reader model loading by @mayankjobanputra in #3663
fix: Fix predict_batch in TransformersReader for single nested Document list by @bogdankostic in #3748
feat: change PipelineConfigError to DocumentStoreError with more details by @julian-risch in #3783
bug: skip empty documents in reader by @julian-risch in #3773
fix: linefeeds in custom_query by @tstadel in #3813
fix: Convert table cells to strings for compatibility with TableReader by @sjrl in #3762
fix: Ensure eval mode for TableReader model for predictions by @sjrl in #3743
fix: gracefully handle FileExistsError during Preprocessor resource download by @wochinge in #3816
fix: make the crawler runnable and testable on Windows by @anakin87 in #3830
fix: ignore non-serializable params when hashing pipeline objects by @masci in #3842
feat: preprocessor raises warning when doc length exceeds threshold by @ZanSara in #3837
fix: remove string validation in YAML by @ZanSara in #3854
feat: Use truncate option for Cohere.embed by @sjrl in #3865
feat: ImageToText (caption generator) by @anakin87 in #3859
fix: Remove double super class init from ParsrConverter init by @silvanocerza in #3896
feat: store id_hash_keys in Document objects to make documents clonable by @ZanSara in #3697
feat: adding the ability to use Ray Serve async functionality by @zoltan-fedor in #3769
feat: support cl100k_base tokenization and increase performance for GPT2 by @danielbichuetti in #3897
fix: Fix number of concurrent requests in RequestLimiter by @bogdankostic in #3705
feat: Run commands inside docker container as a non root user by @vblagoje in #3702
fix: Removed overlooked torch scatter references by @sjrl in #3719
build: upgrade torch and let transformers pick the version by @julian-risch in #3727
feat: Expand LLM support with PromptModel, PromptNode, and PromptTemplate by @vblagoje in #3667
refactor: remove deprecated parameters from Summarizer by @anakin87 in #3740
refactor: Using with open() to read files by @sjrl in #3787
feat: Bump python to 3.10 for gpu docker image, use nvidia/cuda by @vblagoje in #3701
fix: pin protobuf version by @masci in #3789
fix(docker): Use IMAGE_NAME in api image by @FabianHertwig in #3786
bug: Fix launch_milvus() by cd'ing to milvus_dir by @t0r0id in #3795
refactor: Change PromptNode registered templates from per class to per instance by @vblagoje in #3810
bug: The PromptNode handles all parameters as lists without checking if they are in fact lists by @zoltan-fedor in #3820
feat: update the docker image for haystack-api service by @bilgeyucel in #3835
refactor: Simplify PromptTemplate substitution in PromptNode by @vblagoje in #3876
feat: PromptNode - implement stop words by @vblagoje in #3884
feat: Add retry with exponential back-off to PromptNode's OpenAI models by @vblagoje in #3886
chore: Add timeouts to external requests calls by @silvanocerza in #3895
feat: Add CsvTextConverter by @Benvii in #3587
refactor: Improve stop_words handling, add unit test cases by @vblagoje in #3918
refactor: Updated rest_api schema for tables to be consistent with Document.to_dict #3872

Models

fix: adjust max token size for openai ADA-v2 embeddings by @LeoGitGuy in #3793
feat: make new sklearn models default in QueryClassifier by @julian-risch in #3777

DocumentStores

Fixing broken BM25 support with Weaviate - fixes #3720 by @zoltan-fedor in #3723
feat: make score_script first class citizen via knn_engine param by @tstadel in #3284
bug: skip validating empty embeddings by @julian-risch in #3774
fix: Despite return_embedding=False SearchEngineDocumentStore.query retrieves embedding_field by @tstadel in #3662
fix: upgrade launch_es() to the version used in CI by @ZanSara in #3858
Adding condition to pinecone object. by @AI-Ahmed in #3768
fix: Allowing InMemStore and FAISSDocStore for indexing using single worker by @mayankjobanputra in #3868
fix: authenticate with aws4auth if set in OpenSearchDocumentStore by @FabianHertwig in #3741
Fixing the query_batch method of the deepsetcloud document store - … by @zoltan-fedor in #3724
feat: add HA support for Weaviate by @zoltan-fedor in #3764

UI / Demo

refactor: remove haystack demo along with deprecated Dockerfiles by @masci in #3829

Documentation

docs: Add info where the feedback is stored by @agnieszka-m in #3772
bug: fix the docs rest api reference url by @bilgeyucel in #3775
Docs: Update FAISSDocStore load and save descriptions by @agnieszka-m in #3808
fix: Add missing docstrings to PromptNode, PromptTemplate and PromptModel by @vblagoje in #3821
docs: OpensearchDocumentStore docstring by @ZanSara in #3904

Other Changes

proposal: Create a dedicated Github repository for Haystack demos by @masci in #3695
fix: build pdftotext from sources by @masci in #3746
fix: Trigger pipeline schema update on tagged releases by @askainet in #3752
ci: Add newline when generating OpenAPI specs by @bogdankostic in #3782
test: Improve robustness of PromptNode unit tests by @vblagoje in #3747
feat: utility function to explicitly invoke JSON schema generation by @masci in #3798
fix: prevent posthog from sending errors to stderr by @julian-risch in #4008

New Contributors

@FabianHertwig made their first contribution in #3786
@t0r0id made their first contribution in #3795
@LeoGitGuy made their first contribution in #3793
@Benvii made their first contribution in #3638

Full Changelog: v1.12.2...v1.13.0rc1

deepset-ai/haystack v1.13.0 on GitHub