This release resolves a memory leak when deleting a model & trainer, adds compatibility between the Cached... losses and the Matryoshka loss modifier, fixes numerous bugs, and adds several small features.
Install this version with:

```shell
# Training + Inference
pip install sentence-transformers[train]==3.4.0

# Inference only, use one of:
pip install sentence-transformers==3.4.0
pip install sentence-transformers[onnx-gpu]==3.4.0
pip install sentence-transformers[onnx]==3.4.0
pip install sentence-transformers[openvino]==3.4.0
```

## Matryoshka & Cached loss compatibility (#3068, #3107)
It is now possible to combine the strong Cached losses (`CachedMultipleNegativesRankingLoss`, `CachedGISTEmbedLoss`, `CachedMultipleNegativesSymmetricRankingLoss`) with the Matryoshka loss modifier:
```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

See for example [tomaarsen/mpnet-base-gooaq-cmnrl-mrl](https://huggingface.co/tomaarsen/mpnet-base-gooaq-cmnrl-mrl), which was trained with CachedMultipleNegativesRankingLoss (CMNRL) under the Matryoshka loss modifier (MRL).
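The point of the Matryoshka modifier is that the resulting embeddings can be truncated with little quality loss. Below is a minimal sketch of inference-time truncation via the existing `truncate_dim` argument, using the model linked above:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of every embedding; a Matryoshka-trained
# model retains most of its performance under this truncation.
model = SentenceTransformer("tomaarsen/mpnet-base-gooaq-cmnrl-mrl", truncate_dim=256)

embeddings = model.encode(["It's nice weather outside today.", "He drove to work."])
print(embeddings.shape)  # (2, 256)
```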
## Resolve memory leak when Model and Trainer are reinitialized (#3144)
Due to a circular dependency in `SentenceTransformerTrainer` -> `SentenceTransformer` -> `SentenceTransformerModelCardData` -> `SentenceTransformerTrainer`, deleting the trainer and model did not actually free them during garbage collection. I've moved a lot of components around, and `SentenceTransformerModelCardData` no longer needs to store the `SentenceTransformerTrainer`, breaking the cycle.
We ran the seed optimization script, which frequently creates and deletes models and trainers (the sketch below shows the shape of such a loop):

- Before: approximate highest recorded VRAM: 16332MiB / 24576MiB
- After: approximate highest recorded VRAM: 8222MiB / 24576MiB
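For reference, a minimal sketch of such a create-and-delete loop (not the actual benchmark script; it assumes a CUDA device is available):

```python
import gc

import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

for run in range(10):
    model = SentenceTransformer("microsoft/mpnet-base")
    loss = losses.MultipleNegativesRankingLoss(model)
    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()

    # Before this fix, the trainer -> model -> model card data -> trainer
    # reference cycle kept these objects (and their CUDA tensors) alive.
    del trainer, loss, model
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Run {run}: {torch.cuda.memory_allocated() / 1024**2:.0f}MiB still allocated")
```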
## Small Features
- Add the Matthews Correlation Coefficient to the `BinaryClassificationEvaluator` in #3051.
- Add a triplet `margin` parameter to the `TripletEvaluator` in #2862 (see the sketch after this list).
- Put dataset information in the automatically generated model card in "expanding sections" blocks if there are many datasets, in #3088.
- Add multi-GPU (and CPU multi-process) support for `mine_hard_negatives` in #2967.
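For example, the new `margin` can be passed as a single float (per #2862 it should also accept a per-metric mapping). A minimal sketch with made-up data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

# A triplet only counts as correct if the anchor-positive score beats
# the anchor-negative score by at least the margin.
evaluator = TripletEvaluator(
    anchors=["It's nice weather outside today."],
    positives=["It's so sunny."],
    negatives=["He drove to work."],
    margin=0.1,
)
print(evaluator(model))
```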
## Notable Bug Fixes
- Subsequent batches were identical when using the `no_duplicates` Batch Sampler (#3069). This has been resolved in #3073.
- The old-style `model.fit()` training with `write_csv` on an evaluator would crash (#3062). This has been resolved in #3066.
- The output types of some evaluators were `np.float` instead of `float` (#3075). This has been resolved in #3076 and #3096.
- It was not possible to specify a `revision` or `cache_dir` when loading a PEFT Adapter model (#3061). This has been resolved in #3079 and #3174.
- The CrossEncoder was lazily placed on the incorrect device and did not respond to `model.to` (#3078). This has been resolved in #3104.
- If a model used a custom module with custom kwargs, those `kwargs` keys were not saved in `modules.json` correctly, e.g. relevant for jina-embeddings-v3 (#3111). This has been resolved in #3112.
- `HfArgumentParser(SentenceTransformerTrainingArguments)` would crash due to the typing of `prompts` (#3090). This has been resolved in #3178 (see the sketch after this list).
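The last fix matters when exposing training arguments on the command line. With #3178, the standard `transformers` pattern below works again (a minimal sketch; run it with e.g. `--output_dir=runs/my-model`):

```python
from transformers import HfArgumentParser
from sentence_transformers import SentenceTransformerTrainingArguments

# Previously, this crashed while inspecting the type annotation
# of the `prompts` argument (#3090).
parser = HfArgumentParser(SentenceTransformerTrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()
print(training_args.output_dir)
```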
## Example Updates
- Update the quantization script in #3070.
- Update the seed optimization script in #3092.
- Update the TSDAE scripts in #3137.
- Add PEFT Adapter script in #3180.
## Documentation Updates
- Add PEFT Adapter documentation in #3180.
- Add links to backend-export in Speeding up Inference.
## All Changes
- [`training`] Pass `steps`/`epoch`/`output_path` to Evaluator during training by @tomaarsen in #3066
- [`examples`] Update the quantization script by @tomaarsen in #3070
- [`fix`] Fix different batches per epoch in NoDuplicatesBatchSampler by @tomaarsen in #3073
- [`docs`] Add links to backend-export in Speeding up Inference by @tomaarsen in #3071
- add MCC to BinaryClassificationEvaluator by @JINO-ROHIT in #3051
- support cached losses in combination with matryoshka loss by @Marcel256 in #3068
- align model_card_templates.py with code by @amitport in #3081
- converting np float result to float in binary classification evaluator by @JINO-ROHIT in #3076
- Add triplet margin for distance functions in TripletEvaluator by @zivicmilos in #2862
- [`model_card`] Keep the model card readable even with many datasets by @tomaarsen in #3088
- [`docs`] Add NanoBEIR to the Training Overview evaluators by @tomaarsen in #3089
- [fix] revision of the adapter model can now be specified by @pesuchin in #3079
- [`docs`] Update from Sphinx==3.5.4 to 8.1.3, recommonmark -> myst-parser by @tomaarsen in #3099
- normalize to float in NanoBEIREvaluator, InformationRetrievalEvaluator, MSEEvaluator by @JINO-ROHIT in #3096
- [`docs`] List 'prompts' as a key training argument by @tomaarsen in #3101
- revert float type cast manually in BinaryClassificationEvaluator by @JINO-ROHIT in #3102
- update train_sts_seed_optimization with SentenceTransformerTrainer by @JINO-ROHIT in #3092
- Fix cross encoder device issue by @susnato in #3104
- [`enhancement`] Make MultipleNegativesRankingLoss easier to understand by @tomaarsen in #3100
- [`fix`] Fix breaking change in PyLate when loading modules by @tomaarsen in #3110
- multi-GPU support for mine_hard_negatives by @alperctnkaya in #2967
- raises error when dataset is an empty list in NanoBEIREvaluator by @JINO-ROHIT in #3122
- Added a note to the documentation stating that the similarity method does not support embeddings other than non-quantized ones by @pesuchin in #3131
- [`typo`] Add missing space between sentences in error message by @tomaarsen in #3125
- raises ValueError when num_label != 1 when using CrossEncoder.rank() by @JINO-ROHIT in #3126
- fix backward pass for cached losses by @Marcel256 in #3114
- Adding evaluation checks to prevent Transformer ValueError by @stsfaroz in #3105
- [typo] Fix incorrect spelling for "corpus" by @ignasgr in #3154
- [`fix`] Save custom module `kwargs` if specified by @tomaarsen in #3112
- [`memory`] Avoid storing trainer in ModelCardCallback and SentenceTransformerModelCardData by @tomaarsen in #3144
- Support for embedded representation by @Radu1999 in #3156
- [DRAFT] tests for nanobeir evaluator by @JINO-ROHIT in #3127
- Update TSDAE examples with SentenceTransformerTrainer by @JINO-ROHIT in #3137
- [`docs`] Update the Static Embedding example snippet by @tomaarsen in #3177
- fix: propagate cache dir to find adapter by @lauralehoczki11 in #3174
- [`fix`] Use HfArgumentParser-compatible typing for prompts by @tomaarsen in #3178
- testcases for community detection by @JINO-ROHIT in #3163
- [`docs`] Add PEFT documentation + training example by @tomaarsen in #3180
- [`tests`] Make TripletEvaluator test more consistent by @tomaarsen in #3183
- [`deprecation`] Clarify that datasets and readers are deprecated since v3 by @tomaarsen in #3184
- [`docs`] Update the documentation surrounding Matryoshka + Cached losses by @tomaarsen in #3190
## New Contributors
- @JINO-ROHIT made their first contribution in #3051
- @Marcel256 made their first contribution in #3068
- @amitport made their first contribution in #3081
- @zivicmilos made their first contribution in #2862
- @susnato made their first contribution in #3104
- @alperctnkaya made their first contribution in #2967
- @stsfaroz made their first contribution in #3105
- @ignasgr made their first contribution in #3154
- @Radu1999 made their first contribution in #3156
- @lauralehoczki11 made their first contribution in #3174
An explicit thanks to @JINO-ROHIT, who has made a large number of contributions in this release.
**Full Changelog**: v3.3.1...v3.4.0