This release introduces two new efficient computing backends for SparseEncoder embedding models, ONNX and OpenVINO, plus optimization & quantization, allowing for speedups up to 2x-3x; a new "n-tuple-scores" output format for hard negative mining for distillation; gathering across devices for a free lunch in multi-GPU training; Trackio support; MTEB documentation; and many small fixes and features.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.1.0
# Inference only, use one of:
pip install sentence-transformers==5.1.0
pip install sentence-transformers[onnx-gpu]==5.1.0
pip install sentence-transformers[onnx]==5.1.0
pip install sentence-transformers[openvino]==5.1.0
Faster ONNX and OpenVINO backends for SparseEncoder models (#3475)
Introducing a new backend keyword argument to the SparseEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import SparseEncoder
# Load a SparseEncoder model with the ONNX backend
model = SparseEncoder("naver/splade-v3", backend="onnx")
query = "Which planet is known as the Red Planet?"
documents = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# torch.Size([30522]) torch.Size([4, 30522])
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.1450, 26.1040, 22.0025, 23.3877]])
decoded_query = model.decode(query_embeddings, top_k=5)
decoded_documents = model.decode(document_embeddings, top_k=5)
print(decoded_query)
# [('red', 3.0222), ('planet', 2.5001), ('planets', 1.9412), ('known', 1.8126), ('nasa', 0.9347)]
print(decoded_documents)
# [
# [('venus', 3.1980), ('twin', 2.7036), ('earth', 2.4310), ('twins', 2.0957), ('planet', 1.9462)],
# [('mars', 3.1443), ('planet', 2.4924), ('red', 2.4514), ('reddish', 2.2234), ('planets', 2.1976)],
# [('jupiter', 2.9604), ('red', 2.5507), ('planet', 2.3774), ('planets', 2.1641), ('spot', 2.1138)],
# [('saturn', 2.9354), ('red', 2.4548), ('planet', 2.3962), ('mistaken', 2.3361), ('cass', 2.2100)]
# ]
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
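For example, here's a minimal sketch of exporting once and saving the result so the next load reuses it (the local directory and repository names below are just illustrations):
from sentence_transformers import SparseEncoder

# The first load triggers an ONNX export if the repository doesn't contain an ONNX model yet
model = SparseEncoder("naver/splade-v3", backend="onnx")

# Save (or push) the exported ONNX file so it can be reused on the next load
model.save_pretrained("splade-v3-onnx")  # hypothetical local directory
# model.push_to_hub("my-username/splade-v3-onnx")  # hypothetical Hub repository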
All keyword arguments passed via model_kwargs will be passed on to ORTModelForMaskedLM.from_pretrained or OVModelForMaskedLM.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest available provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
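For example, a minimal sketch of passing these options through model_kwargs; the quantized file name below is hypothetical, use whatever ONNX file actually exists in your repository or directory:
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "naver/splade-v3",
    backend="onnx",
    model_kwargs={
        "provider": "CPUExecutionProvider",    # only used with backend="onnx"
        "file_name": "onnx/model_qint8.onnx",  # hypothetical quantized model file
    },
)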
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 3 datasets and numerous batch sizes. These findings resulted in the following recommendations:
For GPU, you can expect a 1.81x speedup with bf16 at no cost, and for CPU you can expect up to ~3x speedup at a minimal cost in accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
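As a reference for the bf16 recommendation above, here's a minimal sketch of loading the default torch backend in bfloat16, assuming your GPU supports bf16 and that torch_dtype is passed through to the underlying transformers model:
import torch
from sentence_transformers import SparseEncoder

# Load the default torch backend in bfloat16 for faster GPU inference
model = SparseEncoder(
    "naver/splade-v3",
    backend="torch",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)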
New n-tuple-scores output format from mine_hard_negatives (#3430, #3481)
The mine_hard_negatives utility function has been extended to support the n-tuple-scores output format, which outputs negatives into num_negatives + 3 columns:
- 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
where 'score' is a list of scores for the query-answer pair plus each query-negative pair.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
# Mine hard negatives into num_negatives + 3 columns: 'query', 'answer', 'negative_1', 'negative_2', ..., 'score'
# where 'score' is a list of scores for the query-answer plus each query-negative pair.
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
num_negatives=5,
sampling_strategy="top",
batch_size=128,
use_faiss=True,
output_format="n-tuple-scores",
)
print(dataset)
print(dataset[14])
"""
{
'query': 'when did jack and the beanstalk take place',
'answer': "Jack and the Beanstalk According to researchers at the universities in Durham and Lisbon, the story originated more than 5,000 years ago, based on a widespread archaic story form which is now classified by folklorists as ATU 328 The Boy Who Stole Ogre's Treasure.[7]",
'negative_1': 'Jack and the Beanstalk "Jack and the Beanstalk" is an English fairy tale. It appeared as "The Story of Jack Spriggins and the Enchanted Bean" in 1734[1] and as Benjamin Tabart\'s moralised "The History of Jack and the Bean-Stalk" in 1807.[2] Henry Cole, publishing under pen name Felix Summerly popularised the tale in The Home Treasury (1845),[3] and Joseph Jacobs rewrote it in English Fairy Tales (1890).[4] Jacobs\' version is most commonly reprinted today and it is believed to be closer to the oral versions than Tabart\'s because it lacks the moralising.[5]',
'negative_2': 'Jack and the Beanstalk Jack climbs the beanstalk twice more. He learns of other treasures and steals them when the giant sleeps: first a goose that lays golden eggs, then a magic harp that plays by itself. The giant wakes when Jack leaves the house with the harp and chases Jack down the beanstalk. Jack calls to his mother for an axe and before the giant reaches the ground, cuts down the beanstalk, causing the giant to fall to his death.',
'negative_3': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas. Food items include a variety of hamburger and cheeseburger sandwiches along with selections of internationally themed foods such as tacos and egg rolls. The company also operates the Qdoba Mexican Grill chain.[4][5]',
'negative_4': 'Jack in the Box Jack in the Box is an American fast-food restaurant chain founded February 21, 1951, by Robert O. Peterson in San Diego, California, where it is headquartered. The chain has 2,200 locations, primarily serving the West Coast of the United States and selected large urban areas in the eastern portion of the US including Texas and the Charlotte metropolitan area. The company also formerly operated the Qdoba Mexican Grill chain until Apollo Global Management bought the chain in December 2017.[4]',
'negative_5': "Jack Box Jack Box (full name Jack I. Box; or simply known as Jack) is the mascot of American restaurant chain Jack in the Box. In the advertisements, he is the founder, CEO, and ad spokesman for the chain. According to the company's web site, he has the appearance of a typical male, with the exception of his huge spherical white head, blue dot eyes, conical black pointed nose, and a curvilinear red smile. He is most of the time seen wearing his yellow clown cap, and a business suit driving a red Viper convertible.",
'score': [0.7949077486991882, 0.8010389804840088, 0.6466549634933472, 0.5222680568695068, 0.5216285586357117, 0.47328776121139526]
}
"""
This format is directly usable in various distillation losses:
- MarginMSELoss for SentenceTransformer models
- DistillKLDivLoss for SentenceTransformer models
- SparseDistillKLDivLoss for SparseEncoder models
- SparseMarginMSELoss for SparseEncoder models
- MarginMSELoss for CrossEncoder models
Note that without applying any absolute_margin, relative_margin, max_score, etc., you can mine negatives that actually score better than your positive. With a distillation loss, this is totally fine. It will learn using the (margins between the) scores, so you don't have to worry about false negatives as much as when using e.g. MultipleNegativesRankingLoss.
For MarginMSELoss for CrossEncoder models, this release also adds support for 1) n-tuples instead of just triplets and 2) num_negatives + 1 scores, where the first score is the query-positive score.
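As an illustration, here's a minimal sketch (not taken from the release itself) of feeding the mined dataset from the snippet above into DistillKLDivLoss, assuming the trainer picks up the 'score' column as the labels:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import DistillKLDivLoss

# Student model to distill into; the 'score' column from mine_hard_negatives provides the teacher scores
student = SentenceTransformer("microsoft/mpnet-base")
loss = DistillKLDivLoss(student)

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=dataset,  # the n-tuple-scores dataset mined above
    loss=loss,
)
trainer.train()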
Gathering Across Devices (#3442, #3453)
Various loss functions in Sentence Transformers take advantage of so-called "in-batch negatives". With these losses, for each sample in a batch, all data for the other samples will be considered as negatives, because random inputs are likely unrelated to the sample. This pushes them further apart, ideally resulting in high similarity scores only for inputs that really are similar.
This release introduces a new gather_across_devices parameter for each of these losses. This parameter only works in a multi-GPU setting, and will pull the samples from other devices into the computation. In short: if you have the following setup:
- loss: CachedMultipleNegativesRankingLoss (a.k.a. InfoNCE with GradCache) with mini_batch_size=16
- per_device_train_batch_size=128 in the SentenceTransformerTrainingArguments
- 8 GPUs
- Training with triplets: query, positive, negative
Then each device will have the memory usage corresponding to a batch size of 16, while each sample has 1 positive and 255 negatives (1 hard negative for that sample, 127 other positive values as in-batch negatives and 127 other negative values as in-batch negatives). Your global batch size will be 128 * 8 = 1024, and your learning rate should be set according to that value.
Now, if you use the exact same setup, but with gather_across_devices=True, then your setting is suddenly:
Each device will have the memory usage corresponding to a batch size of 16, while each sample has 1 positive and 2047 negatives (1 hard negative for that sample, 1023 other positive values as in-batch negatives and 1023 other negative values as in-batch negatives). Your global batch size will be 128 * 8 = 1024, and your learning rate should be set according to that value.
The difference is that the in-batch negatives will now pull from other devices too! Because a larger batch size often results in stronger models with in-batch negatives losses, this should give stronger models at almost no overhead.
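In code, this is a one-line change to the loss from the setup above, a minimal sketch:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
# Gather in-batch negatives from all devices during multi-GPU training
loss = CachedMultipleNegativesRankingLoss(
    model,
    mini_batch_size=16,
    gather_across_devices=True,
)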
Here are the results from one of my simple experiments with finetuning mpnet-base on natural-questions with 8 GPUs:
baseline:
- Evaluation: 0.5111 NDCG@10
- Runtime: 89.4335 seconds
gather_across_devices=True:
- Evaluation: 0.5359 NDCG@10
- Runtime: 89.3699 seconds
Trackio support (#3467)
If your transformers version is high enough, and you have trackio installed (pip install trackio), then Sentence Transformers will also export logs to Trackio. It allows you to browse to localhost to track your experiments for free.
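If you prefer to opt in explicitly rather than rely on the default reporting integrations, a minimal sketch (assuming your transformers version ships the Trackio integration) is:
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    report_to="trackio",  # assumes a transformers version recent enough to support Trackio reporting
)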

MTEB Documentation (#3477)
If you're interested in evaluating your SentenceTransformer models on common benchmarks, then MTEB is your friend. However, there wasn't yet any documentation to guide you in the right direction. This release adds an MTEB evaluation guide to the documentation.
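As a quick taste, here's a minimal sketch of running a single MTEB task with the mteb library; the task name is just an example, see the new guide for the full workflow:
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on one retrieval task as an example; larger benchmarks work the same way
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")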
Minor Notable Changes
- Fix crashes with MarginMSELoss and SparseMarginMSELoss training when using anchor, positive, negative triplets with 1 score (i.e. the difference between negative and positive). (#3421)
- Update temperature parameter default value in DistillKLDivLoss to 1.0. The SparseDistillKLDivLoss default temperature stays at 2.0. (#3428)
- Avoid unneeded warning when calling encode_query/encode_document with prompt (#3444)
- Fix compatibility issues with Datasets v4.0 (#3445, #3455)
- Fix Router torch initialization, which resulted in issues with DataParallel and memory usage (#3454)
- No longer crash if CrossEncoder.predict is called with an empty list (#3466)
- More consistent output types when calling with an empty list as input (#3466)
- Reintroduce FIPS compatibility (#3479)
All Changes
- Add redirect for HPO training examples in .htaccess by @tomaarsen in #3412
- [docs] Fix link in README for training script name by @tomaarsen in #3417
- [docs] Fix arxiv link in SpladePooling docs by @tomaarsen in #3418
- Update README.md by @CharlesCNorton in #3419
- Adjust label shape handling in MarginMSELoss for single score inputs by @tomaarsen in #3421
- [tests] Reuse models more where possible by @tomaarsen in #3432
- Update temperature parameter default value in DistillKLDivLoss to 1.0 by @tomaarsen in #3428
- [model card] Avoid pipe characters that mess up table formatting by @tomaarsen in #3429
- [feat] Add "n-tuple-scores" output format to mine_hard_negatives function by @tomaarsen in #3430
- Fix ONNX/OV export; Avoid .transformers_model by @tomaarsen in #3439
- [feat] Avoid unneeded warning when calling encode_query/document with prompt by @tomaarsen in #3444
- [compat] Fix compatibility issues with datasets v4 by @tomaarsen in #3445
- Sync prompts type with documentation by @FremyCompany in #3427
- [feat] Add gather_across_devices parameter to some contrastive losses by @tomaarsen in #3442
- [chore] Redistribute util.py (and its tests) to separate directory by @tomaarsen in #3446
- [tests] Reduce the number of hub requests for the model card tests by @tomaarsen in #3447
- [fix] cast indexing numpy int to Python int by @emapco in #3455
- [fix] Fix Router torch initialization, fixes DP by @tomaarsen in #3454
- [fix] Patch gather_across_devices for in-batch negatives losses by @tomaarsen in #3453
- Revert changes to multi-GPU evaluator calls by @tomaarsen in #3463
- Update README.md (grammar mistakes) by @ddofer in #3458
- [feat] Update the trackio default project if not already defined by @tomaarsen in #3467
- Fix: prevent loading best model when PEFT adapters are active (#3056) by @sahibpreetsingh12 in #3470
- [docs] Fix dead link in ContrastiveLoss references by @tomaarsen in #3476
- [docs] Add splade_index semantic search example by @tomaarsen in #3473
- [feat] Add ONNX, OV support for SparseEncoder; refactor ONNX/OV by @tomaarsen in #3475
- chore: Handle error when predict is called with an empty sentence list by @nitin-nsp in #3466
- [fix] FIPS compatibility - use SHA256 with usedforsecurity=False in hard negatives caching by @tomaarsen in #3479
- docs: add MTEB evaluation guide and update usage.rst by @sahibpreetsingh12 in #3477
- [feat] Allow n-tuples for CE MarginMSE training by @tomaarsen in #3481
- [docs] Update main sbert.net page with v5.1 mention by @tomaarsen in #3482
New Contributors
- @CharlesCNorton made their first contribution in #3419
- @FremyCompany made their first contribution in #3427
- @ddofer made their first contribution in #3458
- @sahibpreetsingh12 made their first contribution in #3470
- @nitin-nsp made their first contribution in #3466
Also thanks to @Samoed and @KennethEnevoldsen for their reviews on the MTEB documentation, and thanks to @NohTow for the inspiration on gathering across devices.
Full Changelog: v5.0.0...v5.1.0