⭐ Highlights

Expanding Haystack’s LLM support further with the new `CohereEmbeddingEncoder` (#3356)

You can now create document and query embeddings with Cohere’s large language models. If you have a Cohere account, set the name of one of the supported models (small, medium, or large) and add your API key in the EmbeddingRetriever component of your pipelines (see docs).
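As a rough illustration of the setup described above, here is a minimal sketch. The helper `build_cohere_retriever` and the constant `SUPPORTED_COHERE_MODELS` are hypothetical names introduced for this example, and the keyword names `embedding_model` and `api_key` are assumptions based on the description above, so check the docs for the exact signature.

```python
# Model names listed in the highlight above.
SUPPORTED_COHERE_MODELS = ("small", "medium", "large")


def build_cohere_retriever(document_store, api_key, model_name="large"):
    """Hypothetical helper: create an EmbeddingRetriever backed by a Cohere model."""
    if model_name not in SUPPORTED_COHERE_MODELS:
        raise ValueError(
            f"Unsupported Cohere model {model_name!r}; "
            f"expected one of {SUPPORTED_COHERE_MODELS}"
        )
    if not api_key:
        raise ValueError("A Cohere API key is required.")

    # Imported lazily so the argument checks above run without Haystack installed.
    from haystack.nodes import EmbeddingRetriever

    # Assumed keyword names (see the note above the code block).
    return EmbeddingRetriever(
        document_store=document_store,
        embedding_model=model_name,
        api_key=api_key,
    )
```

With a document store at hand, a call such as `build_cohere_retriever(document_store, api_key="<YOUR COHERE KEY>", model_name="small")` would return a retriever that can be added to a pipeline like any other.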

Extracting headlines from Markdown and PDF files (#3445 #3488)

With the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines from your files, together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:

{
    "headline": <THE HEADLINE STRING>,
    "start_idx": <IDX OF HEADLINE START IN document.content>,
    "level": <LEVEL OF THE HEADLINE>
}
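To show how this structure might be consumed, the sketch below finds the chain of headlines that encloses a given character offset in `document.content`. The function `headline_path_at` and the sample data are hypothetical and not part of Haystack, and the level values are made up for illustration; the sketch assumes only that a smaller level means a higher-ranking headline.

```python
from typing import Dict, List


def headline_path_at(headlines: List[Dict], position: int) -> List[str]:
    """Return the headlines (outermost first) that enclose a character offset."""
    path: List[Dict] = []
    for headline in sorted(headlines, key=lambda h: h["start_idx"]):
        if headline["start_idx"] > position:
            break  # every remaining headline starts after the offset
        # A new headline closes any open headline of the same or a deeper level.
        while path and path[-1]["level"] >= headline["level"]:
            path.pop()
        path.append(headline)
    return [h["headline"] for h in path]


# Sample data shaped like the "headlines" meta field described above.
content = "Guide\nWelcome.\nSetup\nInstall it.\nUsage\nRun it.\n"
headlines = [
    {"headline": "Guide", "start_idx": content.index("Guide"), "level": 0},
    {"headline": "Setup", "start_idx": content.index("Setup"), "level": 1},
    {"headline": "Usage", "start_idx": content.index("Usage"), "level": 1},
]

print(headline_path_at(headlines, content.index("Run it.")))
```

A lookup like this can be handy when a document is split into passages and each passage should keep a reference to the section it came from.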

Introducing the proposals design process (#3333)

We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.

⚠️ Breaking change: removing `Milvus1DocumentStore`

From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.

What's Changed

Breaking Changes

bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in #3368
BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in #3552

Pipeline

fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
bug: change type of split_by to Literal including None by @julian-risch in #3389
Fix: update pyworld pin by @anakin87 in #3435
feat: send event if number of queries exceeds threshold by @vblagoje in #3419
Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in #3392
feat: add __contains__ to Span by @ZanSara in #3446
Bug: Fix prompt length computation by @Timoeller in #3448
Add indexing pipeline type by @vblagoje in #3461
fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in #3455
feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3453
feat: Extraction of headlines in markdown files by @bogdankostic in #3445
bug: replace decorator with counter attribute for pipeline event by @julian-risch in #3462
feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in #3379
refactor: TableReader by @sjrl in #3456
fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in #3478
feat: Create the TextIndexingPipeline by @brandenchan in #3473
refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in #3443
fix: strip whitespaces safely from FARMReader's answers by @ZanSara in #3526

DocumentStores

Document Store test refactoring by @masci in #3449
fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in #3346
feat: add SQLDocumentStore tests by @masci in #3517
refactor: Refactor Weaviate tests by @masci in #3541
refactor: Pinecone tests by @masci in #3555
fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in #3548
fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in #3572
fix: discard metadata fields if not set in Weaviate by @masci in #3578

UI / Demo

refactor: update package strategy in ui by @anakin87 in #3396

Documentation

docs: Extend utils API docs coverage by @brandenchan in #3402
refactor: simplify Summarizer, add Document Merger by @anakin87 in #3452
feat: introduce proposal design process by @masci in #3333

Other Changes

fix: Update env variable for model caching timeout by @sjrl in #3405
feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
fix: improve Document __repr__ by @anakin87 in #3385
fix: disabling telemetry prevents writing config by @julian-risch in #3465
refactor: Change no_answer attribute by @anakin87 in #3411
feat: Speed up reader tests by @sjrl in #3476
fix: pattern to match tags push by @masci in #3469
fix: using onnx converter on XLMRoberta architecture by @sjrl in #3470
feat: Add headline extraction to ParsrConverter by @bogdankostic in #3488
refactor: upgrade actions version by @ZanSara in #3506
docs: Update docker readme by @brandenchan in #3531
refactor: refactor FAISS tests by @masci in #3537
feat: include error message in HaystackError telemetry events by @vblagoje in #3543
fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in #3551
bug: fix release number by @mayankjobanputra in #3559
refactor: Generate JSON schema when missing by @masci in #3533

New Contributors

@brunnurs made their first contribution in #3330
@mayankjobanputra made their first contribution in #3368

Full Changelog: v1.10.0...v1.11.0rc1

deepset-ai/haystack v1.11.0rc1 on GitHub