This release introduces two new efficient computing backends for SparseEncoder embedding models, ONNX and OpenVINO, plus optimization & quantization, allowing for speedups up to 2x-3x; a new "n-tuple-scores" output format for hard negative mining for distillation; gathering across devices for a free lunch in multi-GPU training; Trackio support; MTEB documentation; and many small fixes and features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.1.0
# Inference only, use one of:
pip install sentence-transformers==5.1.0
pip install sentence-transformers[onnx-gpu]==5.1.0
pip install sentence-transformers[onnx]==5.1.0
pip install sentence-transformers[openvino]==5.1.0
Faster ONNX and OpenVINO backends for SparseEncoder models (#3475)
Introducing a new backend keyword argument to the SparseEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import SparseEncoder
# Load a SparseEncoder model with the ONNX backend
model = SparseEncoder("naver/splade-v3", backend="onnx")
query = "Which planet is known as the Red Planet?"
documents = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# torch.Size([30522]) torch.Size([4, 30522])
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.1450, 26.1040, 22.0025, 23.3877]])
decoded_query = model.decode(query_embeddings, top_k=5)
decoded_documents = model.decode(document_embeddings, top_k=5)
print(decoded_query)
# [('red', 3.0222), ('planet', 2.5001), ('planets', 1.9412), ('known', 1.8126), ('nasa', 0.9347)]
print(decoded_documents)
# [
# [('venus', 3.1980), ('twin', 2.7036), ('earth', 2.4310), ('twins', 2.0957), ('planet', 1.9462)],
# [('mars', 3.1443), ('planet', 2.4924), ('red', 2.4514), ('reddish', 2.2234), ('planets', 2.1976)],
# [('jupiter', 2.9604), ('red', 2.5507), ('planet', 2.3774), ('planets', 2.1641), ('spot', 2.1138)],
# [('saturn', 2.9354), ('red', 2.4548), ('planet', 2.3962), ('mistaken', 2.3361), ('cass', 2.2100)]
# ]
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
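For example, here's a minimal sketch of exporting once and saving the result so the next load reuses it (the local directory and repository names below are just illustrations):
from sentence_transformers import SparseEncoder

# The first load triggers an ONNX export if the repository doesn't contain an ONNX model yet
model = SparseEncoder("naver/splade-v3", backend="onnx")

# Save (or push) the exported ONNX file so it can be reused on the next load
model.save_pretrained("splade-v3-onnx")  # hypothetical local directory
# model.push_to_hub("my-username/splade-v3-onnx")  # hypothetical Hub repository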
All keyword arguments passed via model_kwargs will be passed on to ORTModelForMaskedLM.from_pretrained or OVModelForMaskedLM.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest available provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
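For example, a minimal sketch of passing these options through model_kwargs; the quantized file name below is hypothetical, use whatever ONNX file actually exists in your repository or directory:
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "naver/splade-v3",
    backend="onnx",
    model_kwargs={
        "provider": "CPUExecutionProvider",    # only used with backend="onnx"
        "file_name": "onnx/model_qint8.onnx",  # hypothetical quantized model file
    },
)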
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 3 datasets and numerous batch sizes. These findings resulted in the following recommendations:
For GPU, you can expect a 1.81x speedup with bf16 at no cost, and for CPU you can expect up to ~3x speedup at a minimal cost in accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
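As a reference for the bf16 recommendation above, here's a minimal sketch of loading the default torch backend in bfloat16, assuming your GPU supports bf16 and that torch_dtype is passed through to the underlying transformers model:
import torch
from sentence_transformers import SparseEncoder

# Load the default torch backend in bfloat16 for faster GPU inference
model = SparseEncoder(
    "naver/splade-v3",
    backend="torch",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)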
New n-tuple-scores output format from mine_hard_negatives (#3430, #3481)
The mine_hard_negatives utility function has been extended to support the n-tuple-scores output format, which outputs negatives into num_negatives + 3 columns:
- 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
where 'score' is a list of scores for the query-answer pair plus each query-negative pair.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
# Mine hard negatives into num_negatives + 3 columns: 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
# where 'score' is a list of scores for the query-answer plus each query-negative pair.
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
num_negatives=5,
sampling_strategy="top",
batch_size=128,
use_faiss=True,
output_format="n-tuple-scores",
)
print(dataset)
print(dataset[14])
"""
{
'query': 'when did jack and the beanstalk take place',
'answer': "Jack and the Beanstalk According to researchers at the universities in Durham and Lisbon, the story originated more than 5,000 years ago, based on a widespread archaic story form which is now classified by folklorists as ATU 328 The Boy Who Stole Ogre's Treasure.[7]",
'negative_1': 'Jack and the Beanstalk "Jack and the Beanstalk" is an English fairy tale. It appeared as "The Story of Jack Spriggins and the Enchanted Bean" in 1734[1] and as Benjamin Tabart\'s moralised "The History of Jack and the Bean-Stalk" in 1807.[2] Henry Cole, publishing under pen name Felix Summerly popularised the tale in The Home Treasury (1845),[3] and Joseph Jacobs rewrote it in English Fairy Tales (1890).[4] Jacobs\' version is most commonly reprinted today and it is believed to be closer to the oral versions than Tabart\'s because it lacks the moralising.[5]',
'negative_2': 'Jack and the Beanstalk Jack climbs the beanstalk twice more. He learns of other treasures and steals them when the giant sleeps: first a goose that lays golden eggs, then a magic harp that plays by itself. The giant wakes when Jack leaves the house with the harp and chases Jack down the beanstalk. Jack calls to his mother for an axe and before the giant reaches the ground, cuts down the beanstalk, causing the giant to fall to his death.',
'negative_3': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas. Food items include a variety of hamburger and cheeseburger sandwiches along with selections of internationally themed foods such as tacos and egg rolls. The company also operates the Qdoba Mexican Grill chain.[4][5]',
'negative_4': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas and the Charlotte metropolitan area. The company also formerly operated the Qdoba Mexican Grill chain until Apollo Global Management bought the chain in December 2017.[4]',
'negative_5': "Jack Box Jack Box (full name Jack I. Box; or simply known as Jack) is the mascot of American restaurant chain Jack in the Box. In the advertisements, he is the founder, CEO, and ad spokesman for the chain. According to the company's web site, he has the appearance of a typical male, with the exception of his huge spherical white head, blue dot eyes, conical black pointed nose, and a curvilinear red smile. He is most of the time seen wearing his yellow clown cap, and a business suit driving a red Viper convertible.",
'score': [0.7949077486991882, 0.8010389804840088, 0.6466549634933472, 0.5222680568695068, 0.5216285586357117, 0.47328776121139526]
}
"""
This format is directly usable in various distillation losses:
- MarginMSELoss for SentenceTransformer models
- DistillKLDivLoss for SentenceTransformer models
- SparseDistillKLDivLoss for SparseEncoder models
- SparseMarginMSELoss for SparseEncoder models
- MarginMSELoss for CrossEncoder models
Note that without applying any absolute_margin, relative_margin, max_score, etc., you can mine negatives that actually score better than your positive. With a distillation loss, this is totally fine. It will learn using the (margins between the) scores, so you don't have to worry about false negatives as much as when using e.g. MultipleNegativesRankingLoss.
For MarginMSELoss for CrossEncoder models, this release also adds support for 1) n-tuples instead of just triplets and 2) num_negatives + 1 scores, where the first score is the query-positive score.
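As an illustration, here's a minimal sketch (not taken from the release itself) of feeding the mined dataset from the snippet above into DistillKLDivLoss, assuming the trainer picks up the 'score' column as the labels:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import DistillKLDivLoss

# Student model to distill into; the 'score' column from mine_hard_negatives provides the teacher scores
student = SentenceTransformer("microsoft/mpnet-base")
loss = DistillKLDivLoss(student)

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=dataset,  # the n-tuple-scores dataset mined above
    loss=loss,
)
trainer.train()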
Gathering Across Devices (#3442, #3453)
Various loss functions in Sentence Transformers take advantage of so-called "in-batch negatives". With these losses, for each sample in a batch, all data for the other samples will be considered as negatives, because random inputs are likely unrelated to the sample. This pushes them further apart, ideally resulting in high similarity scores only for inputs that really are similar.
This release introduces a new gather_across_devices parameter for each of these losses. This parameter only works in a multi-GPU setting, and will pull the samples from other devices into the computation. In short: if you have the following setup:
- loss: CachedMultipleNegativesRankingLoss (a.k.a. InfoNCE with GradCache) with mini_batch_size=16
- per_device_train_batch_size=128 in the SentenceTransformerTrainingArguments
- 8 GPUs
- Training with triplets: query, positive, negative
Then each device will have the memory usage corresponding to a batch size of 16, while each sample has 1 positive and 255 negatives (1 hard negative for that sample, 127 other positive values as in-batch negatives and 127 other negative values as in-batch negatives). Your global batch size will be 128 * 8 = 1024, and your learning rate should be set according to that value.
Now, if you use the exact same setup, but with gather_across_devices=True, then your setting is suddenly:
Each device will have the memory usage corresponding to a batch size of 16, while each sample has 1 positive and 2047 negatives (1 hard negative for that sample, 1023 other positive values as in-batch negatives and 1023 other negative values as in-batch negatives). Your global batch size will be 128 * 8 = 1024, and your learning rate should be set according to that value.
The difference is that the in-batch negatives will now pull from other devices too! Because a larger batch size often results in stronger models with in-batch negatives losses, this should give stronger models at almost no overhead.
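In code, this is a one-line change to the loss from the setup above, a minimal sketch:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
# Gather in-batch negatives from all devices during multi-GPU training
loss = CachedMultipleNegativesRankingLoss(
    model,
    mini_batch_size=16,
    gather_across_devices=True,
)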
Here are the results from one of my simple experiments with finetuning mpnet-base on natural-questions with 8 GPUs:
baseline:
- Evaluation: 0.5111 NDCG@10
- Runtime: 89.4335 seconds
gather_across_devices=True:
- Evaluation: 0.5359 NDCG@10
- Runtime: 89.3699 seconds
Trackio support (#3467)
If your transformers version is high enough, and you have trackio installed (pip install trackio), then Sentence Transformers will also export logs to Trackio. It allows you to browse to localhost to track your experiments for free.
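If you prefer to opt in explicitly rather than rely on the default reporting integrations, a minimal sketch (assuming your transformers version ships the Trackio integration) is:
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    report_to="trackio",  # assumes a transformers version recent enough to support Trackio reporting
)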

MTEB Documentation (#3477)
If you're interested in evaluating your SentenceTransformer models on common benchmarks, then MTEB is your friend. However, there wasn't yet any documentation to guide you in the right direction. This release adds an MTEB evaluation guide to the documentation.
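As a quick taste, here's a minimal sketch of running a single MTEB task with the mteb library; the task name is just an example, see the new guide for the full workflow:
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on one retrieval task as an example; larger benchmarks work the same way
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")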
Minor Notable Changes
- Fix crashes with MarginMSELoss and SparseMarginMSELoss training when using anchor, positive, negative triplets with 1 score (i.e. the difference between negative and positive). (#3421)
- Update temperature parameter default value in DistillKLDivLoss to 1.0. The SparseDistillKLDivLoss default temperature stays at 2.0. (#3428)
- Avoid unneeded warning when calling encode_query/encode_document with prompt (#3444)
- Fix compatibility issues with Datasets v4.0 (#3445, #3455)
- Fix Router torch initialization, which resulted in issues with DataParallel and memory usage (#3454)
- No longer crash if CrossEncoder.predict is called with an empty list (#3466)
- More consistent output types when calling with an empty list as input (#3466)
- Reintroduce FIPS compatibility (#3479)
All Changes
- Add redirect for HPO training examples in .htaccess by @tomaarsen in #3412
- [docs] Fix link in README for training script name by @tomaarsen in #3417
- [docs] Fix arxiv link in SpladePooling docs by @tomaarsen in #3418
- Update README.md by @CharlesCNorton in #3419
- Adjust label shape handling in MarginMSELoss for single score inputs by @tomaarsen in #3421
- [tests] Reuse models more where possible by @tomaarsen in #3432
- Update temperature parameter default value in DistillKLDivLoss to 1.0 by @tomaarsen in #3428
- [model card] Avoid pipe characters that mess up table formatting by @tomaarsen in #3429
- [feat] Add "n-tuple-scores" output format to mine_hard_negatives function by @tomaarsen in #3430
- Fix ONNX/OV export; Avoid .transformers_model by @tomaarsen in #3439
- [feat] Avoid unneeded warning when calling encode_query/document with prompt by @tomaarsen in #3444
- [compat] Fix compatibility issues with datasets v4 by @tomaarsen in #3445
- Sync prompts type with documentation by @FremyCompany in #3427
- [feat] Add gather_across_devices parameter to some contrastive losses by @tomaarsen in #3442
- [chore] Redistribute util.py (and its tests) to separate directory by @tomaarsen in #3446
- [tests] Reduce the number of hub requests for the model card tests by @tomaarsen in #3447
- [fix] cast indexing numpy int to Python int by @emapco in #3455
- [fix] Fix Router torch initialization, fixes DP by @tomaarsen in #3454
- [fix] Patch gather_across_devices for in-batch negatives losses by @tomaarsen in #3453
- Revert changes to multi-GPU evaluator calls by @tomaarsen in #3463
- Update README.md (grammar mistakes) by @ddofer in #3458
- [feat] Update the trackio default project if not already defined by @tomaarsen in #3467
- Fix: prevent loading best model when PEFT adapters are active (#3056) by @sahibpreetsingh12 in #3470
- [docs] Fix dead link in ContrastiveLoss references by @tomaarsen in #3476
- [docs] Add splade_index semantic search example by @tomaarsen in #3473
- [feat] Add ONNX, OV support for SparseEncoder; refactor ONNX/OV by @tomaarsen in #3475
- chore: Handle error when predict is called with an empty sentence list by @nitin-nsp in #3466
- [fix] FIPS compatibility - use SHA256 with usedforsecurity=False in hard negatives caching by @tomaarsen in #3479
- docs: add MTEB evaluation guide and update usage.rst by @sahibpreetsingh12 in #3477
- [feat] Allow n-tuples for CE MarginMSE training by @tomaarsen in #3481
- [docs] Update main sbert.net page with v5.1 mention by @tomaarsen in #3482
New Contributors
- @CharlesCNorton made their first contribution in #3419
- @FremyCompany made their first contribution in #3427
- @ddofer made their first contribution in #3458
- @sahibpreetsingh12 made their first contribution in #3470
- @nitin-nsp made their first contribution in #3466
Also thanks to @Samoed and @KennethEnevoldsen for their reviews on the MTEB documentation, and thanks to @NohTow for the inspiration on gathering across devices.
Full Changelog: v5.0.0...v5.1.0