This release introduces a hard negatives mining utility to get better models from your data, a new strong loss function for symmetric tasks, support for training with streaming datasets so you don't have to store full datasets on disk, custom modules that give model authors more creative freedom, and many bug fixes, small additions, and documentation improvements.
Install this version with
# Full installation:
pip install sentence-transformers[train]==3.1.0
# Inference only:
pip install sentence-transformers==3.1.0
Warning
Due to incompatibilities with Windows, we have set numpy<2 in the Sentence Transformers requirements. If you're not on Windows, you can still install numpy>=2 and everything should work as expected.
Hard Negatives Mining utility (#2768, #2848)
Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. For example:
- Anchor: "are red pandas actually pandas?"
- Positive: "Red pandas, like giant pandas, are bamboo eaters native to Asia's high forests. Despite these similarities and their shared name, the two species are not closely related. Red pandas are much smaller than giant pandas and are the only living member of their taxonomic family."
- Hard negative: "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo), also known as the panda bear or simply the panda, is a bear native to south central China."
These negatives are more difficult for a model to distinguish from the correct answer, leading to a stronger training signal and a stronger overall model when used with one of the Loss Functions that accept (anchor, positive, negative) triplets, such as the one above.
This release introduces a utility function called mine_hard_negatives that allows you to mine for these hard negatives given an (anchor, positive) dataset (and optionally a corpus of negative candidate texts).
It boasts the following features to give you fine-grained control over the similarity of the mined negatives relative to the anchor:
- CrossEncoder rescoring for higher quality negative selection.
- Skip the top $n$ negative candidates as these might be true positives.
- Consider only the top $n$ negative candidates.
- Skip negative candidates that are within some margin of the true similarity between anchor and positive.
- Skip negative candidates whose similarity is larger than some max_score.
- Two sampling strategies: pick the top negative candidates that satisfy the requirements, or pick them randomly.
- FAISS index for searching for negative candidates.
- Option to return data as triplets only, or as 2 + num_negatives-tuples.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(dataset)
"""
Dataset({
features: ['query', 'answer'],
num_rows: 100231
})
"""
# Mine hard negatives
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
range_min=10,
range_max=50,
max_score=0.8,
margin=0.1,
num_negatives=5,
sampling_strategy="random",
batch_size=128,
use_faiss=True,
)
'''
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:33<00:00, 17.37it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 784/784 [00:07<00:00, 101.55it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00, 1.06s/it]
Metric Positive Negative Difference
Count 100,231 460,725 460,725
Mean 0.6866 0.4133 0.2917
Median 0.7010 0.4059 0.2873
Std 0.1125 0.0673 0.1006
Min 0.0303 0.1638 0.1029
25% 0.6221 0.3649 0.2112
50% 0.7010 0.4059 0.2873
75% 0.7667 0.4561 0.3647
Max 0.9584 0.7362 0.7073
Skipped 882722 potential negatives (17.27%) due to the margin of 0.1.
Skipped 27 potential negatives (0.00%) due to the maximum score of 0.8.
Could not find enough negatives for 40430 samples (8.07%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
features: ['query', 'answer', 'negative'],
num_rows: 460725
})
'''
print(dataset[0])
'''
{
'query': 'the first person to use the word geography was',
'answer': 'History of geography The history of geography includes many histories of geography which have differed over time and between different cultural and political groups. In more recent developments, geography has become a distinct academic discipline. \'Geography\' derives from the Greek γεωγραφία – geographia,[1] a literal translation of which would be "to describe or write about the Earth". The first person to use the word "geography" was Eratosthenes (276–194 BC). However, there is evidence for recognizable practices of geography, such as cartography (or map-making) prior to the use of the term geography.',
'negative': 'Terminology of the British Isles The word "Great" means "larger", in comparison with Brittany in modern-day France. One historical term for the peninsula in France that largely corresponds to the modern French province is Lesser or Little Britain. That region was settled by many British immigrants during the period of Anglo-Saxon migration into Britain, and named "Little Britain" by them. The French term "Bretagne" now refers to the French "Little Britain", not to the British "Great Britain", which in French is called Grande-Bretagne. In classical times, the Graeco-Roman geographer Ptolemy in his Almagest also called the larger island megale Brettania (great Britain). At that time, it was in contrast to the smaller island of Ireland, which he called mikra Brettania (little Britain).[62] In his later work Geography, Ptolemy refers to Great Britain as Albion and to Ireland as Iwernia. These "new" names were likely to have been the native names for the islands at the time. The earlier names, in contrast, were likely to have been coined before direct contact with local peoples was made.[63]'
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")
This dataset can immediately be used in conjunction with MultipleNegativesRankingLoss, likely resulting in a stronger model than if you had just used the natural-questions dataset outright.
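For instance, here is a minimal training sketch that continues from the snippet above (it reuses the model and the mined dataset variables, and leaves all training arguments at their defaults):
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# `model` and `dataset` are the SentenceTransformer and the mined
# (query, answer, negative) dataset from the snippet above
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,  # columns are read in (anchor, positive, negative) order
    loss=loss,
)
trainer.train()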
Here are some example datasets that I created using this new function:
- https://huggingface.co/datasets/tomaarsen/gooaq-hard-negatives
- https://huggingface.co/datasets/tomaarsen/natural-questions-hard-negatives
Big thanks to @ChrisGeishauser and @ArthurCamara for assisting with this feature.
Add CachedMultipleNegativesSymmetricRankingLoss loss function (#2879)
Let's break this down:
- MultipleNegativesRankingLoss (MNRL): Given (anchor, positive) text pairs or (anchor, positive, negative) text triplets, this loss trains for "Given an anchor (e.g. a query), which text out of a big lineup (all positives and negatives in the batch) is the true positive (e.g. the answer)?".
- MultipleNegativesSymmetricRankingLoss (MNSRL): Adaptation of MNRL that adds a second loss term: "Given a positive (e.g. a summary), which text out of a big lineup (all anchors) is the true anchor (e.g. the full article)?". This is useful for symmetric tasks such as clustering, classification, and finding similar texts, and a bit less useful for asymmetric tasks such as question-answer retrieval.
- CachedMultipleNegativesRankingLoss (CMNRL): Adaptation of MNRL such that the batch size can be increased to an arbitrary size at a flat 10-20% training speed cost. A higher batch size means a larger lineup for the model to find the true positive in, often resulting in a better training signal and model.
The v3.1 Sentence Transformers release introduces a new loss: CachedMultipleNegativesSymmetricRankingLoss (CMNSRL), which combines both of the previous adaptations. The result is a loss adept at symmetric training tasks for which you can pick an arbitrarily large batch size. It is now likely the strongest loss in Sentence Transformers for Semantic Textual Similarity (STS) tasks.
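As a rough usage sketch (the model and the mini_batch_size value are arbitrary examples, assuming CMNSRL mirrors CMNRL's mini_batch_size argument), the loss drops in like any other:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
# The large logical batch is processed in small mini-batches, so memory stays flat
# while the in-batch "lineup" grows with your per_device_train_batch_size
loss = CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)
The loss can then be passed to the SentenceTransformerTrainer together with an (anchor, positive) dataset and a large per_device_train_batch_size.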
Big thanks to @madhavthaker1 for working to include it.
Streaming Dataset support (#2792)
The v3.1 release introduces support for training with datasets.IterableDataset (Differences between Dataset and IterableDataset docs). This means that you can train without first downloading the full dataset to disk. For example:
from datasets import load_dataset
# Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)
# IterableDataset({
# features: ['question', 'answer'],
# n_shards: 2
# })
or
from datasets import IterableDataset, Value, Features
def dataset_generator_fn():
# Gather, fetch, load, or generate data here
for ... in ...:
yield ...
train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}))
(Read more about Dataset features here)
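As an illustration only (the data and column names below are made up), a concrete generator could look like this:
from datasets import Features, IterableDataset, Value

def dataset_generator_fn():
    # Replace this with your own data source, e.g. a database cursor or an API client
    pairs = [
        ("what is the capital of france", "Paris is the capital of France."),
        ("who wrote hamlet", "Hamlet was written by William Shakespeare."),
    ]
    for question, answer in pairs:
        yield {"question": question, "answer": answer}

train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({"question": Value("string"), "answer": Value("string")}))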
For a full example of training with a streaming dataset, consider this script:
import logging
from datasets import load_dataset
from sentence_transformers import (
SentenceTransformer,
SentenceTransformerTrainer,
SentenceTransformerTrainingArguments,
SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
logging.basicConfig(
format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)
# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
"microsoft/mpnet-base",
model_card_data=SentenceTransformerModelCardData(
language="en",
license="apache-2.0",
model_name="MPNet base trained on GooAQ pairs",
),
)
name = "mpnet-base-gooaq-streaming"
# 2. Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)
# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)
# 4. (Optional) Specify training arguments
train_batch_size = 64
args = SentenceTransformerTrainingArguments(
# Required parameter:
output_dir=f"models/{name}",
# Optional training parameters:
num_train_epochs=1,
per_device_train_batch_size=train_batch_size,
learning_rate=2e-5,
warmup_ratio=0.1,
fp16=False, # Set to False if you get an error that your GPU can't run on FP16
bf16=True, # Set to True if you have a GPU that supports BF16
batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
# Optional tracking/debugging parameters:
save_strategy="steps",
save_steps=100,
save_total_limit=2,
logging_steps=250,
logging_first_step=True,
run_name=name, # Will be used in W&B if `wandb` is installed
)
# 5. Create a trainer & train
trainer = SentenceTransformerTrainer(
model=model,
args=args,
train_dataset=train_dataset,
loss=loss,
)
trainer.train()
# 6. Save the trained model
model.save_pretrained(f"models/{name}/final")
# 7. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(name)
Advanced: Allow for Custom Modules (#2773)
Sentence Transformer models consist of several modules that are executed sequentially. Most models consist of a Transformer module, a Pooling module, and perhaps a Dense and/or Normalize module. However, as of the v3.1 release, model authors can create their own modules by writing some custom modeling code. This code can be uploaded to the Hugging Face Hub alongside the model itself, after which users can load the model like normal.
This allows authors to replace the Transformer module with one that includes model-specific quirks, or replace the Pooling module with an all-new pooling method. It even allows for multi-modal models, as authors can customize the preprocessing of the first module.
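As a rough illustration only (the real module interface has additional requirements for saving and loading; see the custom modules documentation), a custom pooling module is essentially a torch.nn.Module whose forward operates on the feature dictionary:
import torch
from torch import nn

class MaxPooling(nn.Module):
    """Hypothetical example module: max-pool the token embeddings instead of mean pooling."""

    def forward(self, features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        token_embeddings = features["token_embeddings"]
        mask = features["attention_mask"].unsqueeze(-1).bool()
        # Mask out padding tokens before taking the per-dimension maximum
        token_embeddings = token_embeddings.masked_fill(~mask, float("-inf"))
        features["sentence_embedding"] = token_embeddings.max(dim=1).values
        return features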
jinaai/jina-clip-v1 is the first model to take advantage of this new feature, allowing you to encode both texts and images (via paths to local images or URLs) due to their custom preprocessing. Try it out yourself:
from sentence_transformers import SentenceTransformer
# Load the model; must use trust_remote_code=True to run the custom module
model = SentenceTransformer("jinaai/jina-clip-v1", trust_remote_code=True)
# Texts and images of blue and red cats to embed
sentences = ['A blue cat', 'A red cat']
image_urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]
# Embed the texts and images like normal
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)
# Compute similarity between text embeddings:
print(model.similarity(text_embeddings[0], text_embeddings[1]))
# tensor([[✅0.5636]])
# or cross-modal text and image embeddings:
print(model.similarity(text_embeddings, image_embeddings))
# tensor([[✅0.2906, ❌0.0569],
#         [❌0.1277, ✅0.2916]])
Additionally, model authors can take advantage of keyword argument passthrough. By updating the modules.json file to include a list of kwargs, e.g.:
[
{
"idx": 0,
"name": "0",
"path": "",
"type": "custom_transformer.CustomTransformer",
"kwargs": ["task_type"]
},
...
]
then if a user provides the task_type keyword argument to model.encode, the value will be propagated to the forward of the custom module(s). This way, users can enable custom functionality on the fly at inference time (as well as at load time via the model_kwargs option when initializing a SentenceTransformer model).
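For example, assuming a hypothetical model repository whose custom module declares task_type in its modules.json as shown above:
from sentence_transformers import SentenceTransformer

# Hypothetical repository; trust_remote_code is required to run the custom module code
model = SentenceTransformer("your-username/custom-task-model", trust_remote_code=True)

# Because "task_type" is listed under "kwargs" in modules.json, this value is
# forwarded to the custom module's forward() during encoding
embeddings = model.encode(["What is the capital of France?"], task_type="query")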
Update dependency versions (#2757)
- Restrict numpy<2.0.0 due to issues with torch and numpy interoperability on Windows.
- Increment minimum transformers version to 4.38.0 & huggingface-hub to 0.19.3 to prevent a training crash related to the prefetch_factor option.
Smaller Highlights
Features
- Add show_progress_bar to encode_multi_process (#2762)
- Add revision to push_to_hub (#2902)
- Add cache_dir and config_args to CrossEncoder (#2784)
- Warn users if they might be passing training/evaluation columns in the wrong order, leading to worse training performance (#2928)
Bug fixes
- Prevent crash when encoding an empty list (#2759)
- Support training with GISTEmbedLoss with DataParallel (DP) and DistributedDataParallel (DDP) (#2772)
- Fix a bug in GroupByLabelBatchSampler resulting in some data not being used in training (#2788)
- Prevent crash if a datasets directory exists locally (#2859)
- Fix Matryoshka2dLoss not importing correctly (#2907)
- Resolve niche training bug when using multi-dataset, no-duplicates, and dataloader_drop_last=True (#2877)
- Fix torch_compile=True not working in SentenceTransformerTrainingArguments: should now work for faster training (#2884)
- Fix SoftmaxLoss performing worse since v3.0 as a Linear layer was ignored by the optimizer (#2881)
- Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) (#2918)
- Fix the evaluation using the training batch size (#2847)
- Fix encoding when passing model_kwargs={"torch_dtype": torch.float16} with models that use Dense layers (#2889)
Documentation
- New documentation for batch samplers (#2921, various PRs by @fpgmaas)
- New documentation for custom modules and model structure (#2773)
All changes
- [Typing] make device optional by @michaelfeil in #2731
- [Spelling] Docs by @michaelfeil in #2733
- [Spelling] Codespell readme by @michaelfeil in #2736
- [Spelling] update examples by @michaelfeil in #2734
- [versions] Increment transformers/hf-hub versions to prevent training crash by @tomaarsen in #2757
- Typo fixed in examples/training/sts/training_stsbenchmark.py by @akkefa in #2743
- spelling: code comment updates by @michaelfeil in #2735
- Update DenoisingAutoEncoderDataset.py by @sophia8844 in #2747
- [fix] Prevent crash when encoding empty list by @tomaarsen in #2759
- Fix syntax warning (issue #2687) by @wyattscarpenter in #2765
- [feat] Add show_progress_bar to encode_multi_process by @tomaarsen in #2762
- Typing overload by @janrito in #2763
- [fix] Fix retokenization on DDP/DP with GIST losses by @tomaarsen in #2775
- Cast predict scores to float before converting to numpy by @malteos in #2783
- Elasticsearch example: simplify setup by @maxjakob in #2778
- [chore] Enable ruff rules Warning (W) by @fpgmaas in #2789
- [fix] Add tests for 3.12 in cicd by @fpgmaas in #2785
- Allow inheriting the Transformer class by @mokha in #2810
- [feat] Add hard negatives mining utility by @tomaarsen in #2768
- [chore] add test for NoDuplicatesBatchSampler by @fpgmaas in #2795
- [chore] Add test for RoundrobinBatchSampler by @fpgmaas in #2798
- [feat] Improve GroupByLabelBatchSampler by @fpgmaas in #2788
- [chore] Clean-up .gitignore by @fpgmaas in #2799
- [chore] improve the use of ruff and pre-commit hooks by @fpgmaas in #2793
- [feat] Move from setup.py and setup.cfg to pyproject.toml by @fpgmaas in #2786
- [chore] Add pytest-cov and add test coverage command to the Makefile by @fpgmaas in #2794
- Move pytest config to pyproject.toml and remove pytest.ini by @fpgmaas in #2819
- [fix] Fix packages discovery in pyproject.toml by @fpgmaas in #2825
- Fix ruff pre-commit hook. by @fpgmaas in #2826
- [chore] Enable isort with ruff by @fpgmaas in #2828
- [chore] Enable ruff rules UP006 and UP007 to improve type hints. by @fpgmaas in #2830
- [chore] Enable ruff's pyupgrade (UP) ruleset by @fpgmaas in #2834
- update SoftmaxLoss arguments by @KiLJ4EdeN in #2894
- [feat] Added revision to push_to_hub argument. by @pesuchin in #2902
- Perform additional check for owner string in is_<library>_available functions by @leblancfg in #2859
- [style] Replace Huggingface with Hugging Face by @tomaarsen in #2905
- Fix typo: "comuptation" -> "computation" by @jeffwidman in #2909
- [ci] Attempt to fix CI disk space issues by @tomaarsen in #2906
- [docs] Fix typo and broken links in documentation by @ZiyiXia in #2861
- Add MNSRL with GradCache by @madhavthaker1 in #2879
- Fix 'module object is not callable' error in Matryoshka2dLoss by @pesuchin in #2907
- [chore] Add unittests for InformationRetrievalEvaluator by @fpgmaas in #2838
- [fix] Safely continue if ProportionalBatchSampler sub-batch sampler throws StopIteration by @tomaarsen in #2877
- [fix] Fix torch_compile=True by always inserting a wrapped model into the loss by @tomaarsen in #2884
- [fix] Fix SoftmaxLoss by initializing the optimizer over the loss(es) rather than the model by @tomaarsen in #2881
- [fix] Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) by @tomaarsen in #2918
- [docs] Heavily extend sampler documentation by @tomaarsen in #2921
- [feat] Add support for streaming datasets by @tomaarsen in #2792
- [fix] Change eval dataloader to use eval_batch_size by @akashd-2 in #2847
- [feat] Add cache_dir support to CrossEncoder by @RoyBA in #2784
- [deprecation] Push deprecation cycle for use_auth_token to v4 by @tomaarsen in #2926
- [security] Load weights only with torch.load & pytorch_model.bin by @tomaarsen in #2927
- [feat] Allow loading custom modules; encode kwargs passthrough to modules by @tomaarsen in #2773
- [fix] Add dtype cast for modules other than Transformer by @ir2718 in #2889
- [docs] Move losses up in the package reference; they're more important by @tomaarsen in #2929
- [feat] Add column order warnings to the data collator by @tomaarsen in #2928
New Contributors
- @akkefa made their first contribution in #2743
- @sophia8844 made their first contribution in #2747
- @wyattscarpenter made their first contribution in #2765
- @janrito made their first contribution in #2763
- @malteos made their first contribution in #2783
- @fpgmaas made their first contribution in #2789
- @KiLJ4EdeN made their first contribution in #2894
- @pesuchin made their first contribution in #2902
- @leblancfg made their first contribution in #2859
- @jeffwidman made their first contribution in #2909
- @ZiyiXia made their first contribution in #2861
- @madhavthaker1 made their first contribution in #2879
- @akashd-2 made their first contribution in #2847
- @RoyBA made their first contribution in #2784
Big thanks to @fpgmaas for the large number of valuable contributions surrounding tests, CI, config files, and overall project health.
Full Changelog: v3.0.1...v3.1.0