github UKPLab/sentence-transformers v3.1.0
v3.1.0 - Hard Negatives Mining utility; new loss function for symmetric tasks; streaming datasets; custom modules

This release introduces a hard negatives mining utility to get better models out of your data, a new strong loss function for symmetric tasks, training with streaming datasets to avoid having to store datasets fully on disk, custom modules to allow for more creativity from model authors, and many bug fixes, small additions and documentation improvements.

Install this version with

# Full installation:
pip install sentence-transformers[train]==3.1.0

# Inference only:
pip install sentence-transformers==3.1.0

Warning

Due to incompatibilities with Windows, we have set numpy<2 in the Sentence Transformers requirements. If you're not on Windows, you can still install numpy>=2 and everything should work as expected.

Hard Negatives Mining utility (#2768, #2848)

Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. For example:

  • Anchor: "are red pandas actually pandas?"
  • Positive: "Red pandas, like giant pandas, are bamboo eaters native to Asia's high forests. Despite these similarities and their shared name, the two species are not closely related. Red pandas are much smaller than giant pandas and are the only living member of their taxonomic family."
  • Hard negative: "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo), also known as the panda bear or simply the panda, is a bear native to south central China."

These negatives are more difficult for a model to distinguish from the correct answer, leading to a stronger training signal and a stronger overall model when used with one of the loss functions that accept (anchor, positive, negative) triplets such as the one above.

This release introduces a utility function called mine_hard_negatives that allows you to mine for these hard negatives given an (anchor, positive) dataset (and optionally a corpus of negative candidate texts).

It boasts the following features to give you fine-grained control over the similarity of the mined negatives relative to the anchor:

  • CrossEncoder rescoring for higher quality negative selection.
  • Skip the top $n$ negative candidates as these might be true positives.
  • Consider only the top $n$ negative candidates.
  • Skip negative candidates that are within some margin of the true similarity between anchor and positive.
  • Skip negative candidates whose similarity is larger than some max_score.
  • Two sampling strategies: pick the top negative candidates that satisfy the requirements, or pick them randomly.
  • FAISS index for searching for negative candidates.
  • Option to return data as triplets only, or as tuples of 2 + num_negatives texts (anchor, positive, and all mined negatives).

from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(dataset)
"""
Dataset({
    features: ['query', 'answer'],
    num_rows: 100231
})
"""

# Mine hard negatives
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    range_min=10,
    range_max=50,
    max_score=0.8,
    margin=0.1,
    num_negatives=5,
    sampling_strategy="random",
    batch_size=128,
    use_faiss=True,
)
'''
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:33<00:00, 17.37it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 784/784 [00:07<00:00, 101.55it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00,  1.06s/it]
Metric       Positive       Negative     Difference
Count         100,231        460,725        460,725
Mean           0.6866         0.4133         0.2917
Median         0.7010         0.4059         0.2873
Std            0.1125         0.0673         0.1006
Min            0.0303         0.1638         0.1029
25%            0.6221         0.3649         0.2112
50%            0.7010         0.4059         0.2873
75%            0.7667         0.4561         0.3647
Max            0.9584         0.7362         0.7073
Skipped 882722 potential negatives (17.27%) due to the margin of 0.1.
Skipped 27 potential negatives (0.00%) due to the maximum score of 0.8.
Could not find enough negatives for 40430 samples (8.07%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
    features: ['query', 'answer', 'negative'],
    num_rows: 460725
})
'''
print(dataset[0])
'''
{
    'query': 'the first person to use the word geography was',
    'answer': 'History of geography The history of geography includes many histories of geography which have differed over time and between different cultural and political groups. In more recent developments, geography has become a distinct academic discipline. \'Geography\' derives from the Greek γεωγραφία – geographia,[1] a literal translation of which would be "to describe or write about the Earth". The first person to use the word "geography" was Eratosthenes (276–194 BC). However, there is evidence for recognizable practices of geography, such as cartography (or map-making) prior to the use of the term geography.',
    'negative': 'Terminology of the British Isles The word "Great" means "larger", in comparison with Brittany in modern-day France. One historical term for the peninsula in France that largely corresponds to the modern French province is Lesser or Little Britain. That region was settled by many British immigrants during the period of Anglo-Saxon migration into Britain, and named "Little Britain" by them. The French term "Bretagne" now refers to the French "Little Britain", not to the British "Great Britain", which in French is called Grande-Bretagne. In classical times, the Graeco-Roman geographer Ptolemy in his Almagest also called the larger island megale Brettania (great Britain). At that time, it was in contrast to the smaller island of Ireland, which he called mikra Brettania (little Britain).[62] In his later work Geography, Ptolemy refers to Great Britain as Albion and to Ireland as Iwernia. These "new" names were likely to have been the native names for the islands at the time. The earlier names, in contrast, were likely to have been coined before direct contact with local peoples was made.[63]'
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")

This dataset can immediately be used in conjunction with MultipleNegativesRankingLoss, likely resulting in a stronger model than if you had just used the natural-questions dataset outright.
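
To make the filtering options listed earlier more concrete, here is a minimal pure-Python sketch of how a rank range, a margin, a maximum score and a sampling strategy could interact for a single anchor. The function name select_negatives, its seed parameter and the order in which the filters are applied are assumptions made for illustration. This is not the code of mine_hard_negatives, which works on embeddings in batches and can additionally use FAISS and CrossEncoder rescoring.

```python
import random


def select_negatives(
    candidate_scores,
    positive_score,
    range_min=0,
    range_max=None,
    margin=None,
    max_score=None,
    num_negatives=3,
    sampling_strategy="top",
    seed=0,
):
    """Pick negatives for one anchor from {text: similarity_to_anchor}.

    Hypothetical helper for illustration only; not part of sentence-transformers.
    """
    # Rank candidates from most to least similar to the anchor.
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    # Skip the top `range_min` candidates and ignore everything past `range_max`.
    ranked = ranked[range_min:range_max]

    kept = []
    for text, score in ranked:
        # Too similar in absolute terms: might be an unlabeled positive.
        if max_score is not None and score > max_score:
            continue
        # Too close to the anchor-positive similarity.
        if margin is not None and score > positive_score - margin:
            continue
        kept.append(text)

    if sampling_strategy == "random":
        rng = random.Random(seed)
        return rng.sample(kept, min(num_negatives, len(kept)))
    # "top": the most similar candidates that survived the filters.
    return kept[:num_negatives]


# Toy similarity scores between one anchor and six candidate texts.
scores = {"a": 0.95, "b": 0.85, "c": 0.62, "d": 0.55, "e": 0.40, "f": 0.30}
top_negatives = select_negatives(
    scores, positive_score=0.70, range_min=1, range_max=5,
    margin=0.1, max_score=0.8, num_negatives=2,
)
print(top_negatives)
```

In this toy run, "a" is skipped by range_min, "f" falls outside range_max, "b" exceeds max_score and "c" is within the margin of the positive, so only "d" and "e" remain as negatives.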

Here are some example datasets that I created using this new function:

Big thanks to @ChrisGeishauser and @ArthurCamara for assisting with this feature.

Add CachedMultipleNegativesSymmetricRankingLoss loss function (#2879)

Let's break this down:

  • MultipleNegativesRankingLoss (MNRL): Given (anchor, positive) text pairs or (anchor, positive, negative) text triplets, this loss trains for "Given an anchor (e.g. a query), which text out of a big lineup (all positives and negatives in the batch) is the true positive (e.g. the answer)?".
  • MultipleNegativesSymmetricRankingLoss (MNSRL): Adaptation of MNRL that adds a second loss term which means: "Given a positive (e.g. a summary), which text out of a big lineup (all anchors) is the true anchor (e.g. the full article)?". This is useful for symmetric tasks, such as clustering, classification, and finding similar texts, and a bit less useful for asymmetric tasks such as question-answer retrieval.
  • CachedMultipleNegativesRankingLoss (CMNRL): Adaptation of MNRL such that the batch size can be increased to an arbitrary size at a flat 10-20% training speed cost. A higher batch size means a larger lineup for the model to find the true positive in, often resulting in a better training signal and model.

The v3.1 Sentence Transformers release now introduces a new loss: CachedMultipleNegativesSymmetricRankingLoss (CMNSRL), which combines both of the previous adaptations. The result is a loss adept at symmetric training tasks for which you can pick an arbitrarily large batch size. It is likely the strongest loss for Semantic Textual Similarity (STS) tasks in Sentence Transformers now.

Big thanks to @madhavthaker1 for working to include it.
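
As a rough illustration of the "second loss term" idea, the sketch below computes a symmetric in-batch ranking loss from a square anchor-by-positive similarity matrix in plain Python. It assumes the two directions are scored with cross-entropy over scaled similarities and then averaged. The function names and the scale value are assumptions for this example, and the caching that allows large batch sizes is not shown.

```python
import math


def cross_entropy_rows(scores):
    """Mean over rows i of -log softmax(scores[i])[i] (the match sits on the diagonal)."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(similarities, scale=20.0):
    """similarities[i][j] = similarity between anchor i and positive j."""
    scaled = [[scale * s for s in row] for row in similarities]
    transposed = [list(column) for column in zip(*scaled)]
    forward = cross_entropy_rows(scaled)       # anchor -> which positive?
    backward = cross_entropy_rows(transposed)  # positive -> which anchor?
    return (forward + backward) / 2


# Well-separated pairs should give a lower loss than mismatched pairs.
good = [[0.9, 0.1], [0.2, 0.8]]
bad = [[0.1, 0.9], [0.8, 0.2]]
print(symmetric_ranking_loss(good) < symmetric_ranking_loss(bad))
```

Dropping the backward term from this sketch leaves the one-directional, in-batch-negatives objective described for MNRL above.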

Streaming Dataset support (#2792)

The v3.1 release introduces support for training with datasets.IterableDataset (see the Datasets documentation on the differences between Dataset and IterableDataset). This means that you can train without first downloading the full dataset to disk. For example:

from datasets import load_dataset

# Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)
# IterableDataset({
#     features: ['question', 'answer'],
#     n_shards: 2
# })

or

from datasets import IterableDataset, Value, Features

def dataset_generator_fn():
    # Gather, fetch, load, or generate data here
    for ... in ...:
        yield ...

train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}))

(Read more about Dataset features in the Datasets documentation.)
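
As one possible way to fill in the generator placeholder above, the sketch below lazily reads question-answer pairs from a JSON Lines file, so only one record is held in memory at a time. The file name qa_pairs.jsonl and its "question"/"answer" keys are hypothetical and chosen only to match the features used in the snippet above.

```python
import json


def dataset_generator_fn(path="qa_pairs.jsonl"):
    """Yield one {'question', 'answer'} dict per non-empty line of a JSONL file.

    The default path is a hypothetical placeholder.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Keep only the two columns the loss expects, in a fixed order.
            yield {"question": record["question"], "answer": record["answer"]}
```

A generator like this can be handed to IterableDataset.from_generator, using its gen_kwargs argument to supply the path.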

For a full example of training with a streaming dataset, consider this script:

import logging
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)

# 1. Load a model to finetune, with (optional) model card data
model = SentenceTransformer(
    "microsoft/mpnet-base",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on GooAQ pairs",
    ),
)

name = "mpnet-base-gooaq-streaming"

# 2. Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 4. (Optional) Specify training arguments
train_batch_size = 64
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=f"models/{name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=train_batch_size,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to True instead of bf16 if your GPU supports FP16 but not BF16
    bf16=True,  # Set to False if your GPU does not support BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=250,
    logging_first_step=True,
    run_name=name,  # Will be used in W&B if `wandb` is installed
)

# 5. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

# 6. Save the trained model
model.save_pretrained(f"models/{name}/final")

# 7. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(name)

Advanced: Allow for Custom Modules (#2773)

Sentence Transformer models consist of several modules that are executed sequentially. Most models consist of a Transformer module, a Pooling module, and perhaps a Dense and/or Normalize module. However, as of the v3.1 release, model authors can create their own modules by writing some custom modeling code. This code can be uploaded to the Hugging Face Hub alongside the model itself, after which users can load the model like normal.

This allows authors to replace the Transformer module with one that includes model-specific quirks, or replace the Pooling module with an all-new pooling method. This even allows for multi-modal models, as authors can customize the preprocessing of the first module.

jinaai/jina-clip-v1 is the first model to take advantage of this new feature, allowing you to encode both texts and images (via paths to local images or URLs) due to their custom preprocessing. Try it out yourself:

from sentence_transformers import SentenceTransformer

# Load the model; must use trust_remote_code=True to run the custom module
model = SentenceTransformer("jinaai/jina-clip-v1", trust_remote_code=True)

# Texts and images of blue and red cats to embed
sentences = ['A blue cat', 'A red cat']
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Embed the texts and images like normal
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)

# Compute similarity between text embeddings:
print(model.similarity(text_embeddings[0], text_embeddings[1]))
# tensor([[✅0.5636]])

# or cross-modal text and image embeddings:
print(model.similarity(text_embeddings, image_embeddings))
# tensor([[✅0.2906, ❌0.0569],
#         [❌0.1277, ✅0.2916]])

Additionally, model authors can take advantage of keyword argument passthrough. By updating the modules.json file to include a list of kwargs, e.g.:

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "custom_transformer.CustomTransformer",
    "kwargs": ["task_type"]
  },
  ...
]

then if a user provides the task_type keyword argument in model.encode, this value will be propagated to the forward method of the custom module(s). This way, users can specify some custom functionality on the fly at inference time (as well as at load time via the model_kwargs option when initializing a SentenceTransformer model).
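
The routing idea behind keyword argument passthrough can be pictured with a small toy pipeline in plain Python: each module declares the keyword arguments it accepts, and only those are forwarded to it. ToyPipeline, custom_transformer and pooling are hypothetical stand-ins written for this sketch. They are not the SentenceTransformer class or its modules, and the real implementation may differ.

```python
class ToyPipeline:
    """Toy stand-in for a sequence of modules; not the SentenceTransformer class."""

    def __init__(self, modules, module_kwargs):
        self.modules = modules              # ordered mapping: name -> callable
        self.module_kwargs = module_kwargs  # name -> list of accepted kwarg names

    def encode(self, features, **kwargs):
        for name, module in self.modules.items():
            accepted = self.module_kwargs.get(name, [])
            # Forward only the kwargs this module declared (cf. "kwargs" in modules.json).
            passed = {key: value for key, value in kwargs.items() if key in accepted}
            features = module(features, **passed)
        return features


def custom_transformer(features, task_type=None):
    # Pretend the task type changes preprocessing, e.g. by adding a prefix.
    prefix = f"{task_type}: " if task_type else ""
    return {**features, "text": prefix + features["text"]}


def pooling(features):
    # Declares no kwargs, so it never receives task_type.
    return {**features, "length": len(features["text"])}


pipeline = ToyPipeline(
    modules={"0": custom_transformer, "1": pooling},
    module_kwargs={"0": ["task_type"]},
)
result = pipeline.encode({"text": "A blue cat"}, task_type="retrieval.query")
print(result["text"])
```

Because pooling takes no task_type parameter, the call would fail if the pipeline forwarded every keyword argument to every module. Declaring the accepted names per module avoids that.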

Update dependency versions (#2757)

  • Restrict numpy<2.0.0 due to issues with torch and numpy interoperability on Windows.
  • Increment minimum transformers version to 4.38.0 & huggingface-hub to 0.19.3 to prevent a training crash related to the prefetch_factor option

Smaller Highlights

Features

  • Add show_progress_bar to encode_multi_process (#2762)
  • Add revision to push_to_hub (#2902)
  • Add cache_dir and config_args to CrossEncoder (#2784)
  • Warn users if they might be passing training/evaluation columns in the wrong order, leading to worse training performance (#2928)

Bug fixes

  • Prevent crash when encoding an empty list (#2759)
  • Support training with GISTEmbedLoss with DataParallel (DP) and DistributedDataParallel (DDP) (#2772)
  • Fix a bug in GroupByLabelBatchSampler resulting in some data not being used in training (#2788)
  • Prevent crash if a datasets directory exists locally (#2859)
  • Fix Matryoshka2dLoss not importing correctly (#2907)
  • Resolve a niche training bug when using multi-dataset training, the no-duplicates batch sampler, and dataloader_drop_last=True (#2877)
  • Fix torch_compile=True not working in SentenceTransformerTrainingArguments; it should now work for faster training (#2884)
  • Fix SoftmaxLoss performing worse since v3.0 as a Linear layer was ignored by the optimizer (#2881)
  • Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) (#2918)
  • Fix evaluation incorrectly using the training batch size (#2847)
  • Fix encoding when passing model_kwargs={"torch_dtype": torch.float16} with models that use Dense layers (#2889)

Documentation

All changes

New Contributors

Big thanks to @fpgmaas for the large number of valuable contributions surrounding tests, CI, config files, and overall project health.

Full Changelog: v3.0.1...v3.1.0
