This release introduces 2 new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, plus optimization and quantization support, allowing for speedups of up to 2x-3x. It also adds improved hard negatives mining strategies and minor improvements.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==4.1.0
# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0
Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)
Introducing a new backend keyword argument to the CrossEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
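For instance, here is a minimal sketch of exporting once and then reusing the exported ONNX model; the local directory and Hub repository names below are just examples:
from sentence_transformers import CrossEncoder

# If the repository or directory has no ONNX file yet, one is exported automatically on first load
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Save (or push) the model together with the exported ONNX file so it is reused next time
model.save_pretrained("ms-marco-MiniLM-L6-v2-onnx")
# model.push_to_hub("my-username/ms-marco-MiniLM-L6-v2-onnx")  # example repository name

# Subsequent loads from that directory pick up the exported ONNX model directly
model = CrossEncoder("ms-marco-MiniLM-L6-v2-onnx", backend="onnx")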
All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:
- provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. "CUDAExecutionProvider") will be used.
- file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
For example:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={
"file_name": "model_O3.onnx",
"provider": "CPUExecutionProvider",
}
)
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
Benchmarks
We ran benchmarks for CPU and GPU, averaging the findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. These findings resulted in the following recommendations: for GPU, you can expect a 1.88x speedup with fp16 at no cost, and for CPU you can expect a ~3x speedup at no cost to accuracy in our evaluation. Your mileage with the accuracy hit from quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
ONNX & OpenVINO Optimization and Quantization
In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models, as well as for quantizing OpenVINO models:
ONNX Optimization
export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
- model_name_or_path: The directory or model repository where the optimized model will be saved.
- push_to_hub: Whether to push the exported model to the Hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models in repositories that you don't have write access to.
- file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_optimized_onnx_model
onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
model=onnx_model,
optimization_config="O4",
model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
from sentence_transformers import CrossEncoder
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
ONNX Quantization
export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:
- model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
- quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from AutoQuantizationConfig, or a QuantizationConfig instance.
- model_name_or_path: The directory or model repository where the quantized model will be saved.
- push_to_hub: Whether to push the exported model to the Hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
- create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models in repositories that you don't have write access to.
- file_suffix: The suffix to add to the quantized model file name. Will use the quantization_config string or e.g. "int8_quantized" if not set.
The usage is like this:
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
model,
"avx512_vnni",
"sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
from sentence_transformers import CrossEncoder
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
revision=f"refs/pr/{pull_request_nr}",
)
or when it gets merged:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)
OpenVINO Quantization
OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference. To do this, you can use the export_static_quantized_openvino_model() function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects:
- model: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
- quantization_config: (Optional) The quantization configuration. This parameter accepts either: None for the default 8-bit quantization, a dictionary representing quantization configurations, or an OVQuantizationConfig instance.
- model_name_or_path: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
- dataset_name: (Optional) The name of the dataset to load for calibration. If not specified, defaults to the sst2 subset of the glue dataset.
- dataset_config_name: (Optional) The specific configuration of the dataset to load.
- dataset_split: (Optional) The split of the dataset to load (e.g., 'train', 'test').
- column_name: (Optional) The column name in the dataset to use for calibration.
- push_to_hub: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
- create_pr: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don't have write access to the repository.
- file_suffix: (Optional) a string to append to the model name when saving it. If not specified, "qint8_quantized" will be used.
The usage is like this:
from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
model,
quantization_config=None,
model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
from sentence_transformers import CrossEncoder
pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="openvino",
model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
Read the Speeding up Inference documentation for more details.
Relative Margin in Hard Negatives Mining (#3321)
This PR softly deprecates the margin option in mine_hard_negatives in favor of absolute_margin and relative_margin. In short:
- absolute_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity - absolute_margin. With an absolute_margin of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
- relative_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity * (1 - relative_margin). With a relative_margin of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).
This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:
from sentence_transformers.util import mine_hard_negatives
...
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
relative_margin=0.05, # 0.05 means that the negative is at most 95% as similar to the anchor as the positive
num_negatives=num_negatives, # 10 or less is recommended
sampling_strategy="top", # "top" means that we sample the top candidates as negatives
batch_size=batch_size, # Adjust as needed
use_faiss=True, # Optional: Use faiss/faiss-gpu for faster similarity search
)
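The same function also supports the absolute_margin strategy described above. As a minimal sketch under the same setup (dataset, model, num_negatives, and batch_size as in the snippet above):
from sentence_transformers.util import mine_hard_negatives
...
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    absolute_margin=0.1,  # negatives must score at least 0.1 below the positive, e.g. 0.86 positive -> at most 0.76 negative
    num_negatives=num_negatives,
    sampling_strategy="top",
    batch_size=batch_size,
    use_faiss=True,
)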
Minor Changes
- Add margin and margin_strategy to GISTEmbedLoss and CachedGISTEmbedLoss (#3299, #3323)
- Support activation_function=None in Dense module (#3316); see the sketch after this list
- Update how all_layer_embeddings outputs are determined (#3320)
- Avoid error with SentenceTransformer.encode if prompts are provided and output_value=None (#3327)
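For the Dense module change, here is a minimal sketch of what passing activation_function=None looks like; the base model and output dimension are just examples:
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Dense

transformer = Transformer("sentence-transformers/all-MiniLM-L6-v2")  # example base model
pooling = Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
# As of #3316, activation_function=None is accepted and means "no activation" (identity)
dense = Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=None,
)
model = SentenceTransformer(modules=[transformer, pooling, dense])
embeddings = model.encode(["An example sentence"])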
All Changes
- [docs] Update a removed article with a new source by @lakshminarasimmanv in #3309
- CachedGISTEmbedLoss Adding Margin by @daegonYu in #3299
- Support activation_function=None in Dense module by @OsamaS99 in #3316
- [typing] Fix typing for CrossEncoder.to by @tomaarsen in #3324
- Update (C)GIST losses to support "relative" margin instead of "percentage" by @tomaarsen in #3323
- [feat] hard neg mining: deprecate margin in favor of absolute_margin & relative margin by @tomaarsen in #3321
- [fix] Use return_dict=True in Transformer; improve how all_layer_embeddings are determined by @tomaarsen in #3320
- [fix] Avoid error if prompts & output_value=None by @tomaarsen in #3327
- [backend] Add ONNX & OpenVINO support for Cross Encoder (reranker) models by @tomaarsen in #3319
New Contributors
- @lakshminarasimmanv made their first contribution in #3309
Full Changelog: v4.0.2...v4.1.0