⭐ Highlights
Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3356)
Now you can easily create document and query embeddings using Cohere’s large language models. If you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
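As a rough illustration of the setup described above, the sketch below validates the model name and API key and then passes them to EmbeddingRetriever. The helper names (cohere_retriever_kwargs, build_cohere_retriever) are hypothetical, and the keyword arguments embedding_model and api_key are assumed from the EmbeddingRetriever docs; check them against your installed version.

```python
from typing import Any, Dict

# Model names listed in these release notes.
COHERE_MODELS = {"small", "medium", "large"}


def cohere_retriever_kwargs(model: str, api_key: str) -> Dict[str, Any]:
    """Hypothetical helper: validate the inputs and build retriever kwargs."""
    if model not in COHERE_MODELS:
        raise ValueError(f"Unsupported Cohere model: {model!r}")
    if not api_key:
        raise ValueError("A Cohere API key is required")
    # Assumption: EmbeddingRetriever accepts these two keyword arguments.
    return {"embedding_model": model, "api_key": api_key}


def build_cohere_retriever(document_store, model: str, api_key: str):
    """Hypothetical helper: create the retriever for an existing document store."""
    # Imported lazily so the validation helper works without Haystack installed.
    from haystack.nodes import EmbeddingRetriever

    return EmbeddingRetriever(
        document_store=document_store,
        **cohere_retriever_kwargs(model, api_key),
    )


# Placeholder key for illustration only.
kwargs = cohere_retriever_kwargs("large", "YOUR_COHERE_API_KEY")
```

With a real document store and key, build_cohere_retriever(document_store, "large", api_key) would return a retriever that can be added to a pipeline like any other.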
Extracting headlines from Markdown and PDF files (#3445 #3488)
Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines from your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines", and each entry is structured as follows:
{
"headline": <THE HEADLINE STRING>,
"start_idx": <IDX OF HEADLINE START IN document.content >,
"level": <LEVEL OF THE HEADLINE>
}
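One way to use this structure is to cut a document's content into sections, one per headline. The sketch below works on hand-built stand-ins for document.content and document.meta["headlines"]; the function name sections_from_headlines, the sample text, and the level values are made up for illustration.

```python
from typing import Dict, List


def sections_from_headlines(content: str, headlines: List[Dict]) -> List[Dict]:
    """Slice content into one section per headline, using each start_idx."""
    ordered = sorted(headlines, key=lambda h: h["start_idx"])
    sections = []
    for i, headline in enumerate(ordered):
        # A section runs until the next headline starts, or to the end of the text.
        end = ordered[i + 1]["start_idx"] if i + 1 < len(ordered) else len(content)
        sections.append(
            {
                "headline": headline["headline"],
                "level": headline["level"],
                "text": content[headline["start_idx"]:end].strip(),
            }
        )
    return sections


# Stand-ins for document.content and document.meta["headlines"].
content = "Intro\nSome intro text.\nSetup\nInstall the package.\n"
headlines = [
    {"headline": "Intro", "start_idx": 0, "level": 0},
    {"headline": "Setup", "start_idx": content.index("Setup"), "level": 1},
]

sections = sections_from_headlines(content, headlines)
for section in sections:
    print(section["level"], section["headline"], "->", repr(section["text"]))
```

Because start_idx points at the headline itself, each section's text begins with its headline string.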
Introducing the proposals design process (#3333)
We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.
⚠️ Breaking change: removing Milvus1DocumentStore
From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.
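If your pipeline configurations refer to these classes by name, the rename can be applied mechanically. The following is a minimal sketch, assuming the components are available as a list of dictionaries with a "type" key (as in a loaded pipeline YAML). The helper migrate_component_types is hypothetical and not part of Haystack.

```python
from typing import Dict, List

# Rename and removal as described in the breaking change above.
RENAMED = {"Milvus2DocumentStore": "MilvusDocumentStore"}
REMOVED = {"Milvus1DocumentStore"}


def migrate_component_types(components: List[Dict]) -> List[Dict]:
    """Hypothetical helper: return a copy of the components with updated type names."""
    migrated = []
    for component in components:
        type_name = component.get("type")
        if type_name in REMOVED:
            # There is no drop-in replacement: Milvus 1 is no longer supported.
            raise ValueError(f"{type_name} was removed; migrate to Milvus 2 first")
        migrated.append({**component, "type": RENAMED.get(type_name, type_name)})
    return migrated


# Example component list, shaped like the components section of a pipeline YAML.
components = [
    {"name": "DocumentStore", "type": "Milvus2DocumentStore"},
    {"name": "Retriever", "type": "EmbeddingRetriever"},
]
migrated = migrate_component_types(components)
```

In Python code, the equivalent change is replacing imports of Milvus2DocumentStore with MilvusDocumentStore.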
What's Changed
Breaking Changes
- bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in #3368
- BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in #3552
Pipeline
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- Fix: update pyworld pin by @anakin87 in #3435
- feat: send event if number of queries exceeds threshold by @vblagoje in #3419
- Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in #3392
- feat: add __contains__ to Span by @ZanSara in #3446
- Bug: Fix prompt length computation by @Timoeller in #3448
- Add indexing pipeline type by @vblagoje in #3461
- fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in #3455
- feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3453
- feat: Extraction of headlines in markdown files by @bogdankostic in #3445
- bug: replace decorator with counter attribute for pipeline event by @julian-risch in #3462
- feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in #3379
- refactor: TableReader by @sjrl in #3456
- fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in #3478
- feat: Create the TextIndexingPipeline by @brandenchan in #3473
- refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in #3443
- fix: strip whitespaces safely from FARMReader's answers by @ZanSara in #3526
DocumentStores
- Document Store test refactoring by @masci in #3449
- fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in #3346
- feat: add SQLDocumentStore tests by @masci in #3517
- refactor: Refactor Weaviate tests by @masci in #3541
- refactor: Pinecone tests by @masci in #3555
- fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in #3548
- fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in #3572
- fix: discard metadata fields if not set in Weaviate by @masci in #3578
UI / Demo
Documentation
- docs: Extend utils API docs coverage by @brandenchan in #3402
- refactor: simplify Summarizer, add Document Merger by @anakin87 in #3452
- feat: introduce proposal design process by @masci in #3333
Other Changes
- fix: Update env variable for model caching timeout by @sjrl in #3405
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
- fix: improve Document __repr__ by @anakin87 in #3385
- fix: disabling telemetry prevents writing config by @julian-risch in #3465
- refactor: Change no_answer attribute by @anakin87 in #3411
- feat: Speed up reader tests by @sjrl in #3476
- fix: pattern to match tags push by @masci in #3469
- fix: using onnx converter on XLMRoberta architecture by @sjrl in #3470
- feat: Add headline extraction to ParsrConverter by @bogdankostic in #3488
- refactor: upgrade actions version by @ZanSara in #3506
- docs: Update docker readme by @brandenchan in #3531
- refactor: refactor FAISS tests by @masci in #3537
- feat: include error message in HaystackError telemetry events by @vblagoje in #3543
- fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in #3551
- bug: fix release number by @mayankjobanputra in #3559
- refactor: Generate JSON schema when missing by @masci in #3533
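For readers unfamiliar with the exponential backoff technique mentioned in #3398 above, here is a generic sketch of such a decorator. It is not Haystack's implementation: the name retry_with_backoff, its parameters, and the simulated failing function are all assumptions made for illustration.

```python
import functools
import random
import time


def retry_with_backoff(max_retries=3, base_delay=0.01, errors=(Exception,)):
    """Retry the wrapped function, doubling the wait after each failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except errors:
                    if attempt == max_retries:
                        raise  # Out of retries: surface the last error.
                    # Wait base_delay * 2^attempt, plus a little random jitter.
                    delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                    time.sleep(delay)

        return wrapper

    return decorator


calls = {"count": 0}


@retry_with_backoff(max_retries=3, base_delay=0.001, errors=(ConnectionError,))
def flaky_request():
    """Simulated request that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


result = flaky_request()
```

The jitter spreads out retries from concurrent clients so they do not all hit the API again at the same moment.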
New Contributors
- @brunnurs made their first contribution in #3330
- @mayankjobanputra made their first contribution in #3368
Full Changelog: v1.10.0...v1.11.0rc1