This release introduces a hard negatives mining utility to get better models from your data, a new strong loss function for symmetric tasks, support for training with streaming datasets so you don't have to store full datasets on disk, custom modules that give model authors more creative freedom, and many bug fixes, small additions, and documentation improvements.
Install this version with
# Full installation:
pip install sentence-transformers[train]==3.1.0
# Inference only:
pip install sentence-transformers==3.1.0
Warning
Due to incompatibilities with Windows, we have set numpy<2 in the Sentence Transformers requirements. If you're not on Windows, you can still install numpy>=2 and everything should work as expected.
Hard Negatives Mining utility (#2768, #2848)
Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. For example:
- Anchor: "are red pandas actually pandas?"
- Positive: "Red pandas, like giant pandas, are bamboo eaters native to Asia's high forests. Despite these similarities and their shared name, the two species are not closely related. Red pandas are much smaller than giant pandas and are the only living member of their taxonomic family."
- Hard negative: "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo), also known as the panda bear or simply the panda, is a bear native to south central China."
These negatives are more difficult for a model to distinguish from the correct answer, leading to a stronger training signal and a stronger overall model when used with one of the Loss Functions that accept (anchor, positive, negative) triplets, such as the one above.
This release introduces a utility function called mine_hard_negatives that allows you to mine for these hard negatives given an (anchor, positive) dataset (and optionally a corpus of negative candidate texts).
It boasts the following features to give you fine-grained control over the similarity of the mined negatives relative to the anchor:
- CrossEncoder rescoring for higher quality negative selection.
- Skip the top $n$ negative candidates as these might be true positives.
- Consider only the top $n$ negative candidates.
- Skip negative candidates that are within some margin of the true similarity between anchor and positive.
- Skip negative candidates whose similarity is larger than some max_score.
- Two sampling strategies: pick the top negative candidates that satisfy the requirements, or pick them randomly.
- FAISS index for searching for negative candidates.
- Option to return data as triplets only, or as 2 + num_negatives-tuples.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
# Mine hard negatives
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
range_min=10,
range_max=50,
max_score=0.8,
margin=0.1,
num_negatives=5,
sampling_strategy="random",
batch_size=128,
use_faiss=True,
)
'''
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:33<00:00, 17.37it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 784/784 [00:07<00:00, 101.55it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00, 1.06s/it]
Metric Positive Negative Difference
Count 100,231 460,725 460,725
Mean 0.6866 0.4133 0.2917
Median 0.7010 0.4059 0.2873
Std 0.1125 0.0673 0.1006
Min 0.0303 0.1638 0.1029
25% 0.6221 0.3649 0.2112
50% 0.7010 0.4059 0.2873
75% 0.7667 0.4561 0.3647
Max 0.9584 0.7362 0.7073
Skipped 882722 potential negatives (17.27%) due to the margin of 0.1.
Skipped 27 potential negatives (0.00%) due to the maximum score of 0.8.
Could not find enough negatives for 40430 samples (8.07%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
features: ['query', 'answer', 'negative'],
num_rows: 460725
})
'''
print(dataset[0])
'''
{
'query': 'the first person to use the word geography was',
'answer': 'History of geography The history of geography includes many histories of geography which have differed over time and between different cultural and political groups. In more recent developments, geography has become a distinct academic discipline. \'Geography\' derives from the Greek γεωγραφία – geographia,[1] a literal translation of which would be "to describe or write about the Earth". The first person to use the word "geography" was Eratosthenes (276–194 BC). However, there is evidence for recognizable practices of geography, such as cartography (or map-making) prior to the use of the term geography.',
'negative': 'Terminology of the British Isles The word "Great" means "larger", in comparison with Brittany in modern-day France. One historical term for the peninsula in France that largely corresponds to the modern French province is Lesser or Little Britain. That region was settled by many British immigrants during the period of Anglo-Saxon migration into Britain, and named "Little Britain" by them. The French term "Bretagne" now refers to the French "Little Britain", not to the British "Great Britain", which in French is called Grande-Bretagne. In classical times, the Graeco-Roman geographer Ptolemy in his Almagest also called the larger island megale Brettania (great Britain). At that time, it was in contrast to the smaller island of Ireland, which he called mikra Brettania (little Britain).[62] In his later work Geography, Ptolemy refers to Great Britain as Albion and to Ireland as Iwernia. These "new" names were likely to have been the native names for the islands at the time. The earlier names, in contrast, were likely to have been coined before direct contact with local peoples was made.[63]'
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")
This dataset can immediately be used in conjunction with MultipleNegativesRankingLoss, likely resulting in a stronger model than if you had just used the natural-questions dataset outright.
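For instance, here is a minimal training sketch that continues from the snippet above (it reuses the model and the mined dataset variables, and leaves all training arguments at their defaults):
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# `model` and `dataset` are the SentenceTransformer and the mined
# (query, answer, negative) dataset from the snippet above
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,  # columns are read in (anchor, positive, negative) order
    loss=loss,
)
trainer.train()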
Here are some example datasets that I created using this new function:
- https://huggingface.co/datasets/tomaarsen/gooaq-hard-negatives
- https://huggingface.co/datasets/tomaarsen/natural-questions-hard-negatives
Big thanks to @ChrisGeishauser and @ArthurCamara for assisting with this feature.
Add CachedMultipleNegativesSymmetricRankingLoss loss function (#2879)
Let's break this down:
- MultipleNegativesRankingLoss (MNRL): Given (anchor, positive) text pairs or (anchor, positive, negative) text triplets, this loss trains for "Given an anchor (e.g. a query), which text out of a big lineup (all positives and negatives in the batch) is the true positive (e.g. the answer)?".
- MultipleNegativesSymmetricRankingLoss (MNSRL): Adaptation of MNRL that adds a second loss term: "Given a positive (e.g. a summary), which text out of a big lineup (all anchors) is the true anchor (e.g. the full article)?". This is useful for symmetric tasks such as clustering, classification, and finding similar texts, and a bit less useful for asymmetric tasks such as question-answer retrieval.
- CachedMultipleNegativesRankingLoss (CMNRL): Adaptation of MNRL such that the batch size can be increased to an arbitrary size at a flat 10-20% training speed cost. A higher batch size means a larger lineup for the model to find the true positive in, often resulting in a better training signal and model.
The v3.1 Sentence Transformers release introduces a new loss: CachedMultipleNegativesSymmetricRankingLoss (CMNSRL), which combines both of the previous adaptations. The result is a loss adept at symmetric training tasks for which you can pick an arbitrarily large batch size. It is now likely the strongest loss in Sentence Transformers for Semantic Textual Similarity (STS) tasks.
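As a rough usage sketch (the model and the mini_batch_size value are arbitrary examples, assuming CMNSRL mirrors CMNRL's mini_batch_size argument), the loss drops in like any other:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
# The large logical batch is processed in small mini-batches, so memory stays flat
# while the in-batch "lineup" grows with your per_device_train_batch_size
loss = CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)
The loss can then be passed to the SentenceTransformerTrainer together with an (anchor, positive) dataset and a large per_device_train_batch_size.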
Big thanks to @madhavthaker1 for working to include it.
Streaming Dataset support (#2792)
The v3.1 release introduces support for training with datasets.IterableDataset (Differences between Dataset and IterableDataset docs). This means that you can train without first downloading the full dataset to disk. For example:
from datasets import load_dataset
# Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)
# IterableDataset({
# features: ['question', 'answer'],
# n_shards: 2
# })
or
from datasets import IterableDataset, Value, Features
def dataset_generator_fn():
# Gather, fetch, load, or generate data here
for ... in ...:
yield ...
train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}))
(Read more about Dataset features here)
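As an illustration only (the data and column names below are made up), a concrete generator could look like this:
from datasets import Features, IterableDataset, Value

def dataset_generator_fn():
    # Replace this with your own data source, e.g. a database cursor or an API client
    pairs = [
        ("what is the capital of france", "Paris is the capital of France."),
        ("who wrote hamlet", "Hamlet was written by William Shakespeare."),
    ]
    for question, answer in pairs:
        yield {"question": question, "answer": answer}

train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({"question": Value("string"), "answer": Value("string")}))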
For a full example of training with a streaming dataset, consider this script:
import logging
from datasets import load_dataset
from sentence_transformers import (
SentenceTransformer,
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(
format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)
# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
"microsoft/mpnet-base",
model_card_data=SentenceTransformerModelCardData(
language="en",
license="apache-2.0",
model_name="MPNet base trained on GooAQ pairs",
),
)
name = "mpnet-base-gooaq-streaming"
# 2. Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)
# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)
# 4. (Optional) Specify training arguments
train_batch_size = 64
args = SentenceTransformerTrainingArguments(
# Required parameter:
output_dir=f"models/{name}",
# Optional training parameters:
num_train_epochs=1,
per_device_train_batch_size=train_batch_size,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=False, # Set to False if you get an error that your GPU can't run on FP16
bf16=True, # Set to True if you have a GPU that supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
# Optional tracking/debugging parameters:
save_strategy="steps",
save_steps=100,
save_total_limit=2,
logging_steps=250,
logging_first_step=True,
run_name=name, # Will be used in W&B if `wandb` is installed
)
# 5. Create a trainer & train
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
# 6. Save the trained model
model.save_pretrained(f"models/{name}/final")
# 7. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(name)
Advanced: Allow for Custom Modules (#2773)
Sentence Transformer models consist of several modules that are executed sequentially. Most models consist of a Transformer module, a Pooling module, and perhaps a Dense and/or Normalize module. However, as of the v3.1 release, model authors can create their own modules by writing some custom modeling code. This code can be uploaded to the Hugging Face Hub alongside the model itself, after which users can load the model like normal.
This allows authors to replace the Transformer module with one that includes model-specific quirks, or replace the Pooling module with an all-new pooling method. It even allows for multi-modal models, as authors can customize the preprocessing of the first module.
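As a rough illustration only (the real module interface has additional requirements for saving and loading; see the custom modules documentation), a custom pooling module is essentially a torch.nn.Module whose forward operates on the feature dictionary:
import torch
from torch import nn

class MaxPooling(nn.Module):
    """Hypothetical example module: max-pool the token embeddings instead of mean pooling."""

    def forward(self, features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        token_embeddings = features["token_embeddings"]
        mask = features["attention_mask"].unsqueeze(-1).bool()
        # Mask out padding tokens before taking the per-dimension maximum
        token_embeddings = token_embeddings.masked_fill(~mask, float("-inf"))
        features["sentence_embedding"] = token_embeddings.max(dim=1).values
        return features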
jinaai/jina-clip-v1 is the first model to take advantage of this new feature, allowing you to encode both texts and images (via paths to local images or URLs) due to their custom preprocessing. Try it out yourself:
from sentence_transformers import SentenceTransformer
# Load the model; must use trust_remote_code=True to run the custom module
model = SentenceTransformer("jinaai/jina-clip-v1", trust_remote_code=True)
# Texts and images of blue and red cats to embed
sentences = ['A blue cat', 'A red cat']
image_urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]
# Embed the texts and images like normal
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)
# Compute similarity between text embeddings:
print(model.similarity(text_embeddings[0], text_embeddings[1]))
# tensor([[✅0.5636]])
# or cross-modal text and image embeddings:
print(model.similarity(text_embeddings, image_embeddings))
# tensor([[✅0.2906, ❌0.0569],
#         [❌0.1277, ✅0.2916]])
Additionally, model authors can take advantage of keyword argument passthrough. By updating the modules.json file to include a list of kwargs, e.g.:
[
{
"idx": 0,
"name": "0",
"path": "",
"type": "custom_transformer.CustomTransformer",
"kwargs": ["task_type"]
},
...
]
then if a user provides the task_type keyword argument to model.encode, the value will be propagated to the forward of the custom module(s). This way, users can enable custom functionality on the fly at inference time (as well as at load time via the model_kwargs option when initializing a SentenceTransformer model).
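For example, assuming a hypothetical model repository whose custom module declares task_type in its modules.json as shown above:
from sentence_transformers import SentenceTransformer

# Hypothetical repository; trust_remote_code is required to run the custom module code
model = SentenceTransformer("your-username/custom-task-model", trust_remote_code=True)

# Because "task_type" is listed under "kwargs" in modules.json, this value is
# forwarded to the custom module's forward() during encoding
embeddings = model.encode(["What is the capital of France?"], task_type="query")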
Update dependency versions (#2757)
- Restrict numpy<2.0.0 due to issues with torch and numpy interoperability on Windows.
- Increment minimum transformers version to 4.38.0 & huggingface-hub to 0.19.3 to prevent a training crash related to the prefetch_factor option.
Smaller Highlights
Features
- Add show_progress_bar to encode_multi_process (#2762)
- Add revision to push_to_hub (#2902)
- Add cache_dir and config_args to CrossEncoder (#2784)
- Warn users if they might be passing training/evaluation columns in the wrong order, leading to worse training performance (#2928)
Bug fixes
- Prevent crash when encoding an empty list (#2759)
- Support training with GISTEmbedLoss with DataParallel (DP) and DistributedDataParallel (DDP) (#2772)
- Fix a bug in GroupByLabelBatchSampler resulting in some data not being used in training (#2788)
- Prevent crash if a datasets directory exists locally (#2859)
- Fix Matryoshka2dLoss not importing correctly (#2907)
- Resolve niche training bug when using multi-dataset, no-duplicates, and dataloader_drop_last=True (#2877)
- Fix torch_compile=True not working in SentenceTransformerTrainingArguments: should now work for faster training (#2884)
- Fix SoftmaxLoss performing worse since v3.0 as a Linear layer was ignored by the optimizer (#2881)
- Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) (#2918)
- Fix the evaluation using the training batch size (#2847)
- Fix encoding when passing model_kwargs={"torch_dtype": torch.float16} with models that use Dense layers (#2889)
Documentation
- New documentation for batch samplers (#2921, various PRs by @fpgmaas)
- New documentation for custom modules and model structure (#2773)
All changes
- [Typing] make device optional by @michaelfeil in #2731
- [Spelling] Docs by @michaelfeil in #2733
- [Spelling] Codespell readme by @michaelfeil in #2736
- [Spelling] update examples by @michaelfeil in #2734
- [versions] Increment transformers/hf-hub versions to prevent training crash by @tomaarsen in #2757
- Typo fixed in examples/training/sts/training_stsbenchmark.py by @akkefa in #2743
- spelling: code comment updates by @michaelfeil in #2735
- Update DenoisingAutoEncoderDataset.py by @sophia8844 in #2747
- [fix] Prevent crash when encoding empty list by @tomaarsen in #2759
- Fix syntax warning (issue #2687) by @wyattscarpenter in #2765
- [feat] Add show_progress_bar to encode_multi_process by @tomaarsen in #2762
- Typing overload by @janrito in #2763
- [fix] Fix retokenization on DDP/DP with GIST losses by @tomaarsen in #2775
- Cast predict scores to float before converting to numpy by @malteos in #2783
- Elasticsearch example: simplify setup by @maxjakob in #2778
- [chore] Enable ruff rules Warning (W) by @fpgmaas in #2789
- [fix] Add tests for 3.12 in cicd by @fpgmaas in #2785
- Allow inheriting the Transformer class by @mokha in #2810
- [feat] Add hard negatives mining utility by @tomaarsen in #2768
- [chore] add test for NoDuplicatesBatchSampler by @fpgmaas in #2795
- [chore] Add test for RoundrobinBatchSampler by @fpgmaas in #2798
- [feat] Improve GroupByLabelBatchSampler by @fpgmaas in #2788
- [chore] Clean-up .gitignore by @fpgmaas in #2799
- [chore] improve the use of ruff and pre-commit hooks by @fpgmaas in #2793
- [feat] Move from setup.py and setup.cfg to pyproject.toml by @fpgmaas in #2786
- [chore] Add pytest-cov and add test coverage command to the Makefile by @fpgmaas in #2794
- Move pytest config to pyproject.toml and remove pytest.ini by @fpgmaas in #2819
- [fix] Fix packages discovery in pyproject.toml by @fpgmaas in #2825
- Fix ruff pre-commit hook. by @fpgmaas in #2826
- [chore] Enable isort with ruff by @fpgmaas in #2828
- [chore] Enable ruff rules UP006 and UP007 to improve type hints. by @fpgmaas in #2830
- [chore] Enable ruff's pyupgrade (UP) ruleset by @fpgmaas in #2834
- update SoftmaxLoss arguments by @KiLJ4EdeN in #2894
- [feat] Added revision to push_to_hub argument. by @pesuchin in #2902
- Perform additional check for owner string in is_<library>_available functions by @leblancfg in #2859
- [style] Replace Huggingface with Hugging Face by @tomaarsen in #2905
- Fix typo: "comuptation" -> "computation" by @jeffwidman in #2909
- [ci] Attempt to fix CI disk space issues by @tomaarsen in #2906
- [docs] Fix typo and broken links in documentation by @ZiyiXia in #2861
- Add MNSRL with GradCache by @madhavthaker1 in #2879
- Fix 'module object is not callable' error in Matryoshka2dLoss by @pesuchin in #2907
- [chore] Add unittests for InformationRetrievalEvaluator by @fpgmaas in #2838
- [fix] Safely continue if ProportionalBatchSampler sub-batch sampler throws StopIteration by @tomaarsen in #2877
- [fix] Fix torch_compile=True by always inserting a wrapped model into the loss by @tomaarsen in #2884
- [fix] Fix SoftmaxLoss by initializing the optimizer over the loss(es) rather than the model by @tomaarsen in #2881
- [fix] Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) by @tomaarsen in #2918
- [docs] Heavily extend sampler documentation by @tomaarsen in #2921
- [feat] Add support for streaming datasets by @tomaarsen in #2792
- [fix] Change eval dataloader to use eval_batch_size by @akashd-2 in #2847
- [feat] Add cache_dir support to CrossEncoder by @RoyBA in #2784
- [deprecation] Push deprecation cycle for use_auth_token to v4 by @tomaarsen in #2926
- [security] Load weights only with torch.load & pytorch_model.bin by @tomaarsen in #2927
- [feat] Allow loading custom modules; encode kwargs passthrough to modules by @tomaarsen in #2773
- [fix] Add dtype cast for modules other than Transformer by @ir2718 in #2889
- [docs] Move losses up in the package reference; they're more important by @tomaarsen in #2929
- [feat] Add column order warnings to the data collator by @tomaarsen in #2928
New Contributors
- @akkefa made their first contribution in #2743
- @sophia8844 made their first contribution in #2747
- @wyattscarpenter made their first contribution in #2765
- @janrito made their first contribution in #2763
- @malteos made their first contribution in #2783
- @fpgmaas made their first contribution in #2789
- @KiLJ4EdeN made their first contribution in #2894
- @pesuchin made their first contribution in #2902
- @leblancfg made their first contribution in #2859
- @jeffwidman made their first contribution in #2909
- @ZiyiXia made their first contribution in #2861
- @madhavthaker1 made their first contribution in #2879
- @akashd-2 made their first contribution in #2847
- @RoyBA made their first contribution in #2784
Big thanks to @fpgmaas for the large number of valuable contributions surrounding tests, CI, config files, and overall project health.
Full Changelog: v3.0.1...v3.1.0