pypi sentence-transformers 4.1.0
v4.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; improved hard negatives mining


This release introduces two new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, together with optimization and quantization support, allowing for speedups of up to 2x-3x. It also introduces improved hard negatives mining strategies and several minor improvements.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==4.1.0

# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0

Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)

Introducing a new backend keyword argument to the CrossEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:

pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]

It's as simple as:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.
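
For example, a minimal sketch of saving the automatically exported ONNX model to a local directory so that later loads can reuse it (the local path is just an illustration):

from sentence_transformers import CrossEncoder

# Loading with backend="onnx" exports an ONNX model on the fly if the
# repository does not contain one yet
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Save the model, including the exported ONNX file, so the export does not
# have to be repeated on the next load
model.save_pretrained("local-ms-marco-MiniLM-L6-v2-onnx")

# Later loads from that directory reuse the existing ONNX file
model = CrossEncoder("local-ms-marco-MiniLM-L6-v2-onnx", backend="onnx")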

All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:

  • provider: (Only if backend="onnx") The ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest available provider (e.g. "CUDAExecutionProvider") will be used.
  • file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and to "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
  • export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.

For example:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={
        "file_name": "model_O3.onnx",
        "provider": "CPUExecutionProvider",
    },
)

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

Benchmarks

We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes.

[The benchmark result charts and the resulting recommendation table are included as images in the original release notes.]

For GPU, you can expect a 1.88x speedup with fp16 at no accuracy cost, and for CPU you can expect a ~3x speedup at no accuracy cost in our evaluation. Your mileage with the accuracy hit from quantization may vary, but it seems to remain very small.
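
One simple way to get the fp16 speedup on GPU with the default torch backend is to load the weights in half precision via model_kwargs (torch_dtype is forwarded to the underlying transformers from_pretrained call); a minimal sketch, assuming a CUDA device is available:

import torch
from sentence_transformers import CrossEncoder

# Load the reranker in half precision on GPU (default torch backend)
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16},
)

scores = model.predict([("Which planet is known as the Red Planet?", "Mars is often called the Red Planet.")])
print(scores)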

Read the Speeding up Inference documentation for more details.

ONNX & OpenVINO Optimization and Quantization

In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:

ONNX Optimization

export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options in the Optimum documentation. This function accepts:

  • model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
  • optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
  • model_name_or_path: The directory or model repository where the optimized model will be saved.
  • push_to_hub: Whether to push the exported model to the Hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.

The usage is like this:

from sentence_transformers import CrossEncoder, export_optimized_onnx_model

onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
    model=onnx_model,
    optimization_config="O4",
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
    revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
)

ONNX Quantization

export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:

  • model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
  • quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from AutoQuantizationConfig, or a QuantizationConfig instance.
  • model_name_or_path: The directory or model repository where the quantized model will be saved.
  • push_to_hub: Whether to push the exported model to the Hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the quantized model file name. Will use the quantization_config string (e.g. "qint8_avx512_vnni") if not set.

The usage is like this:

from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
    model,
    "avx512_vnni",
    "sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
    revision=f"refs/pr/{pull_request_nr}",
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)

OpenVINO Quantization

OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference. To do this, you can use the export_static_quantized_openvino_model() function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects:

  • model: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
  • quantization_config: (Optional) The quantization configuration. This parameter accepts either: None for the default 8-bit quantization, a dictionary representing quantization configurations, or an OVQuantizationConfig instance.
  • model_name_or_path: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
  • dataset_name: (Optional) The name of the dataset to load for calibration. If not specified, defaults to the sst2 subset of the glue dataset.
  • dataset_config_name: (Optional) The specific configuration of the dataset to load.
  • dataset_split: (Optional) The split of the dataset to load (e.g., "train", "test").
  • column_name: (Optional) The column name in the dataset to use for calibration.
  • push_to_hub: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
  • create_pr: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don’t have write access to the repository.
  • file_suffix: (Optional) a string to append to the model name when saving it. If not specified, "qint8_quantized" will be used.

The usage is like this:

from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    quantization_config=None,
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
    revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
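
If you want to calibrate on your own data instead of the default glue/sst2 split, the dataset arguments listed above can be passed directly; a minimal sketch, where the dataset name, split, column, and output directory are placeholders to adapt to your setup:

from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    quantization_config=None,            # default 8-bit quantization
    model_name_or_path="local-ms-marco-MiniLM-L6-v2-openvino",  # placeholder local directory
    dataset_name="my_calibration_dataset",   # placeholder calibration dataset
    dataset_config_name=None,
    dataset_split="train",
    column_name="text",                  # placeholder: column containing calibration texts
    push_to_hub=False,
)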

Read the Speeding up Inference documentation for more details.

Relative Margin in Hard Negatives Mining (#3321)

This PR softly deprecates the margin option in mine_hard_negatives in favor of absolute_margin and relative_margin. In short:

  • absolute_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity - absolute_margin. With an absolute_margin of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
  • relative_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity * (1 - relative_margin). With a relative_margin of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).

This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:

from sentence_transformers.util import mine_hard_negatives

...

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    relative_margin=0.05,         # 0.05 means that the negative is at most 95% as similar to the anchor as the positive
    num_negatives=num_negatives,  # 10 or less is recommended
    sampling_strategy="top",      # "top" means that we sample the top candidates as negatives
    batch_size=batch_size,        # Adjust as needed
    use_faiss=True,               # Optional: Use faiss/faiss-gpu for faster similarity search
)
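
The older absolute threshold behaviour remains available via the new absolute_margin argument; a minimal sketch mirroring the call above:

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    absolute_margin=0.1,          # negatives must score at least 0.1 below the anchor-positive similarity
    num_negatives=num_negatives,
    sampling_strategy="top",
    batch_size=batch_size,
    use_faiss=True,
)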

Minor Changes

  • Add margin and margin_strategy to GISTEmbedLoss and CachedGISTEmbedLoss (#3299, #3323)
  • Support activation_function=None in Dense module (#3316)
  • Update how all_layer_embeddings outputs are determined (#3320)
  • Avoid error with SentenceTransformer.encode if prompts are provided and output_value=None (#3327)

All Changes

  • [docs] Update a removed article with a new source by @lakshminarasimmanv in #3309
  • CachedGISTEmbedLoss Adding Margin by @daegonYu in #3299
  • Support activation_function=None in Dense module by @OsamaS99 in #3316
  • [typing] Fix typing for CrossEncoder.to by @tomaarsen in #3324
  • Update (C)GIST losses to support "relative" margin instead of "percentage" by @tomaarsen in #3323
  • [feat] hard neg mining: deprecate margin in favor of absolute_margin & relative margin by @tomaarsen in #3321
  • [fix] Use return_dict=True in Transformer; improve how all_layer_embeddings are determined by @tomaarsen in #3320
  • [fix] Avoid error if prompts & output_value=None by @tomaarsen in #3327
  • [backend] Add ONNX & OpenVINO support for Cross Encoder (reranker) models by @tomaarsen in #3319

Full Changelog: v4.0.2...v4.1.0
