v5.5.0 - Training Agent Skill, EmbedDistillLoss, and ADRMSELoss

This release ships the train-sentence-transformers Agent Skill, adds two new training losses, and brings a long list of robustness and correctness fixes.

The new train-sentence-transformers Agent Skill lets AI coding agents (Claude Code, Codex, Cursor, Gemini CLI, ...) drive end-to-end training and fine-tuning across all three model types. EmbedDistillLoss is a new embedding-level knowledge distillation loss for SentenceTransformer: it aligns a student model's embeddings with pre-computed teacher embeddings, an alternative to the score-based distillation provided by MarginMSELoss and DistillKLDivLoss. ADRMSELoss is a new listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper. encode() and predict() also gain a per-call processing_kwargs override, and more.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.5.0

# Inference only, use one of:
pip install sentence-transformers==5.5.0
pip install sentence-transformers[onnx-gpu]==5.5.0
pip install sentence-transformers[onnx]==5.5.0
pip install sentence-transformers[openvino]==5.5.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.0
pip install sentence-transformers[audio]==5.5.0
pip install sentence-transformers[video]==5.5.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.0

The train-sentence-transformers Agent Skill (#3752)

If you use an AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, OpenCode, ...), you can now install the train-sentence-transformers Agent Skill and ask your agent to fine-tune a model on your data:

hf skills add train-sentence-transformers              # installs under ./.agents/skills/
hf skills add train-sentence-transformers --global     # installs under ~/.agents/skills/
hf skills add train-sentence-transformers --claude     # also symlinks into .claude/skills/

The skill gives the agent curated, version-aware guidance for training SentenceTransformer (bi-encoder), CrossEncoder (reranker), and SparseEncoder/SPLADE models, covering base model selection, loss and evaluator choice, hard-negative mining, distillation, LoRA, Matryoshka, multilingual training, and static embeddings, plus a set of production-ready training template scripts. You can then prompt your agent with things like:

"Train a multilingual sentence-transformer on Dutch legal pairs."

"Fine-tune a cross-encoder reranker on (question, answer) pairs from my dataset, mine hard negatives, and push to my Hub repo."

"Train a German sparse embedding model with high sparsity."

"Can you train a static embedding model on 100k code triplets?"

The skill lives in the repository under skills/train-sentence-transformers/ and is mirrored to the huggingface/skills marketplace on each release.

New loss: EmbedDistillLoss (#3665)

Introduces EmbedDistillLoss (Kim et al., 2023), an embedding-level knowledge distillation loss for SentenceTransformer. Rather than distilling teacher scores (as MarginMSELoss and DistillKLDivLoss do), it directly aligns the student's sentence_embedding with a pre-computed teacher embedding passed via the dataset's label column. The comparison uses a configurable distance_metric: "cosine" (the default), "l2", or "mse". When the student and teacher dimensions differ, pass projection_dim=<teacher_dim> to add a learnable projection from the student's embedding space into the teacher's. That projection lives on the loss rather than on the saved model, so use loss.save_projection(...) / loss.load_projection(...) to reuse it across stages (as done in Arkam et al. for Jina v5, for example). As part of this change, MSELoss is now a thin subclass of EmbedDistillLoss with distance_metric="mse", and it also gains the optional projection_dim argument.

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import EmbedDistillLoss

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")

train_dataset = Dataset.from_dict({
    "sentence": ["It's nice weather outside today.", "He drove to work."],
})

# Pre-compute teacher embeddings once and store them as the `label` column
def add_teacher_embeddings(batch):
    return {"label": teacher_model.encode(batch["sentence"]).tolist()}

train_dataset = train_dataset.map(add_teacher_embeddings, batched=True)

loss = EmbedDistillLoss(student_model, distance_metric="cosine")
# If the student and teacher dimensions differ, add a learnable projection:
# loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)

trainer = SentenceTransformerTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
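
If you added a projection via projection_dim, remember that it is stored on the loss, not on the saved student, so persist it yourself between stages. Below is a minimal sketch, assuming save_projection / load_projection take a plain file path (the file name is illustrative; check the loss documentation for the exact signature):

# Persist the learnable student-to-teacher projection between distillation stages:
loss.save_projection("teacher_projection.pt")

# In a later stage, recreate the loss with the same projection_dim and reload it:
loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)
loss.load_projection("teacher_projection.pt")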

See the updated model distillation examples and the loss overview for more.

New loss: ADRMSELoss for Cross Encoders (#3690)

Introduces ADRMSELoss (Approx Discounted Rank Mean Squared Error), a listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper (Schlatt et al., ECIR 2025). It computes a differentiable approximation of each document's rank via pairwise sigmoids and minimizes the nDCG-discounted squared error against the true ranks derived from the labels. It expects listwise inputs: a (query, [doc1, ..., docN]) pair plus a [score1, ..., scoreN] label list per sample (binary or continuous labels, variable document counts allowed). It's designed for LLM-distillation reranking, where the per-document scores come from a strong LLM's ordering.

from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import ADRMSELoss

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "scores": [[0.95, 0.1], [0.98, 0.92, 0.2]],
})
loss = ADRMSELoss(model)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
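
Conceptually, the loss for one (query, documents) sample boils down to a pairwise-sigmoid rank approximation plus a discounted squared rank error. The sketch below illustrates that idea only; the temperature and the exact discount form are assumptions here, not the library's implementation:

import torch

def approx_ranks(scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Differentiable 1-based rank: rank_i ~ 1 + sum over j != i of sigmoid((s_j - s_i) / T)
    diffs = scores.unsqueeze(0) - scores.unsqueeze(1)  # diffs[i, j] = s_j - s_i
    off_diagonal = 1.0 - torch.eye(scores.numel(), dtype=scores.dtype, device=scores.device)
    return 1.0 + (torch.sigmoid(diffs / temperature) * off_diagonal).sum(dim=1)

def adr_mse(pred_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # True ranks come from sorting the labels; errors on lower-ranked documents
    # are down-weighted with an nDCG-style log discount (form assumed here).
    true_ranks = 1.0 + labels.argsort(descending=True).argsort().to(pred_scores.dtype)
    discount = 1.0 / torch.log2(1.0 + true_ranks)
    return (discount * (approx_ranks(pred_scores) - true_ranks) ** 2).mean()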

There's a full MS MARCO example at training_ms_marco_adrmse.py. Note that LambdaLoss generally remains the strongest loss in the listwise family. See the Cross Encoder loss overview for guidance on picking a loss.

Per-call processing_kwargs override (#3753)

SentenceTransformer.encode() / encode_query() / encode_document(), SparseEncoder.encode(), CrossEncoder.predict(), and model.preprocess() now accept a processing_kwargs argument that overrides the processor/tokenizer kwargs configured at construction time, for a single call. It has the same nested structure as the processing_kwargs constructor argument (top-level keys text, audio, image, video, common, chat_template) and is shallow-merged on top of the instance-level settings, so you can override just one setting (e.g. max_length) and leave the rest intact.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Override processor kwargs (e.g. max_length, truncation) for this call only:
embeddings = model.encode(
    ["a short text", "a much longer text that you want truncated more aggressively ..."],
    processing_kwargs={"text": {"max_length": 256, "truncation": True}},
)

This is especially handy for vision-language models, where you can change the image resolution per call, e.g. model.encode(images, processing_kwargs={"image": {"max_pixels": 256 * 256}}).
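
The same per-call override works for rerankers. A minimal sketch (the model name is just an example):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# Tighten tokenization for this predict() call only:
scores = reranker.predict(
    [("What are pandas?", "The giant panda is a bear species endemic to China.")],
    processing_kwargs={"text": {"max_length": 128, "truncation": True}},
)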

Smaller Features

  • Allow CrossEncoder module stacks that don't start with a Transformer, and recognize a trailing Dense(module_output_name="scores") as the scoring head, by @tomaarsen in #3742: num_labels now reads that head's out_features, and model.config / model.model return None when there's no underlying transformers model.
  • Infer that a model is an IR model on its generated model card when an InformationRetrievalEvaluator / NanoBEIREvaluator (or their sparse variants) was used during training, by @tomaarsen in #3741: the usage snippet then shows encode_query / encode_document, even without IR prompt names or a Router architecture.
  • Warn at model-load time when the installed transformers version is too old to honor use_bidirectional_attention / is_causal flags in a model's config (e.g. for google/embeddinggemma-300m), rather than silently ignoring them, by @tomaarsen in #3726.

Bug Fixes

  • Use the first non-pad token for CLS pooling with left-padding tokenizers by @tomaarsen in #3767: pooling_mode="cls" previously returned the embedding at position 0, which is a [PAD] token for left-padded inputs (common with decoder-only models), silently producing incorrect sentence embeddings. It now uses the attention mask to find the first real token per sequence (see the sketch after this list). Resolves #3208.
  • Don't upcast bf16/fp16 embeddings to fp32 in the Flash Attention 2 mean-pooling path by @tomaarsen in #3751: the int64-derived divisor in mean / mean_sqrt_len_tokens pooling forced the pooled output to fp32, which could crash the downstream Dense / scoring head with a dtype mismatch.
  • Unwrap DistributedDataParallel / torch.compile wrappers in AdaptiveLayerLoss (and Matryoshka2dLoss) by @tomaarsen in #3768: training with these losses under DDP or torch.compile previously crashed with TypeError: 'DistributedDataParallel' object is not subscriptable. Resolves #3170.
  • Expose preprocess / get_embedding_dimension on DDP-wrapped models in losses by @tomaarsen in #3746: training a CrossEncoder (or using MatryoshkaLoss) under DDP crashed with AttributeError: 'DistributedDataParallel' object has no attribute 'preprocess'.
  • Push the full Sentence Transformers layout from every checkpoint by @tomaarsen in #3740: mid-training Hub pushes with hub_strategy="every_save" / "checkpoint" / "all_checkpoints" were previously missing modules.json, config_sentence_transformers.json, README.md, and module subfolders, leaving those revisions unloadable.
  • Inherit model_type from the archetype class on user subclasses by @tomaarsen in #3763: a plain subclass like class MyModel(SentenceTransformer): pass would silently load checkpoints via the conversion path (e.g. defaulting CLS-pooling models to mean pooling), producing wrong embeddings with no error. Resolves #3536. Note: a model previously saved through a subclass has the subclass name in its config and should be re-saved (or its config_sentence_transformers.json edited) under this fix.
  • Add a model.config property that delegates to the underlying transformers model's PretrainedConfig (or None if there is none) by @tomaarsen in #3764: this restores DeepSpeed ZeRO and other transformers integrations that read model.config.hidden_size, which previously crashed with AttributeError: 'SentenceTransformer' object has no attribute 'config'. Resolves #3531.
  • Robust file_io error handling for local paths and Hub failures by @tomaarsen in #3765: an incomplete local model path no longer raises a confusing HFValidationError (e.g. on Windows absolute paths), and transient Hub errors (auth, rate-limit, network) on critical files now propagate instead of silently falling back to a default architecture. Resolves #3370. A local directory whose name collides with a Hub repo id now takes precedence even if incomplete.
  • Allow Router children to load via the dynamic-module mechanism by @tomaarsen in #3749: a model whose architecture uses a Router with a repository-local custom child module class now loads with trust_remote_code=True instead of raising an ImportError.
  • Forward the Hub auth token (plus cache dir and local_files_only) to the dynamic-module loader by @tomaarsen in #3766: private Hub repos with trust_remote_code=True repo-local custom modules now load on the first try instead of failing with a misleading ModuleNotFoundError. Resolves #3367.
  • Unwrap dict- and torchcodec-decoder-wrapped audio/video that appears inside a multimodal dict input (e.g. {"audio": {"array": ..., "sampling_rate": ...}, "text": ...}) so it reaches the processor correctly by @tomaarsen in #3736. Resolves #3732.
  • Fix a crash on malformed URL-like strings (e.g. broken Markdown links) in the multimodal input parser by catching ValueError from urlparse by @forhim007 in #3760: strings like "https://www.google.com)[google.com]" raised ValueError: Invalid IPv6 URL inside modality detection. They are now treated as plain text. Resolves #3758.
  • Avoid CPU OOM in the automatic model-card dataset metrics when a dataset stores media as file-path strings by @yjoonjang in #3733: multimodal training (e.g. Qwen3-VL-Embedding) on ColPali/VDR-style datasets previously loaded ~1000 media files per text-like column per process during Trainer.__init__. Such columns are now detected by modality and skipped, the stats sample is bounded (and reduced from 1000 to 100 rows), and a modality row was added to the model-card dataset stats table.
  • Fix unescaped newlines in auto-generated model card dataset examples for long texts by @tomaarsen in #3750: example strings over 1000 characters skipped the table-safe escaping, producing a broken Markdown table on the Hub.
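
For reference, left-padding-aware CLS pooling amounts to something like the following minimal sketch (an illustration of the idea, not the library's exact code):

import torch

def first_token_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) with 1 for real tokens.
    # The first non-pad position is 0 for right-padded inputs and > 0 for left-padded
    # inputs (common with decoder-only models), so look it up via the attention mask.
    first_real = attention_mask.argmax(dim=1)
    batch_indices = torch.arange(token_embeddings.size(0), device=token_embeddings.device)
    return token_embeddings[batch_indices, first_real]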

All Changes

  • Warn when transformers is too old to honor bidir. attention flags in model config by @tomaarsen in #3726
  • [ci] Reduce hub calls in tests by @tomaarsen in #3727
  • [enh] The Qwen3 integrations are merged, no need for revision anymore by @tomaarsen in #3729
  • [tests] Future-proof getting model keys as MODEL_MAPPING_NAMES is being removed by @tomaarsen in #3730
  • [examples] Fix training dataset creation by @tomaarsen in #3728
  • Link to blogposts where relevant by @tomaarsen in #3735
  • Unwrap audio/video inside multimodal dict inputs by @tomaarsen in #3736
  • docs(SimCSE): migrate README example to SentenceTransformerTrainer by @MukundaKatta in #3734
  • Avoid OOM in compute_dataset_metrics for multimodal datasets with path columns by @yjoonjang in #3733
  • Be less specific in CE model card template by @tomaarsen in #3738
  • [feat] Add ADRMSELoss by @sky-2002 in #3690
  • [model card] stats are computed over 100 samples, not 1000 by @tomaarsen in #3739
  • [trainer] Push full Sentence Transformers layout from each checkpoint by @tomaarsen in #3740
  • [model card] Set ir_model on the model card based on evaluators by @tomaarsen in #3741
  • [feat] Allow Dense as CrossEncoder scoring head by @tomaarsen in #3742
  • Update model card link format by @matthewhaynesonline in #3744
  • [fix] Expose preprocess/get_embedding_dimension on DDP-wrapped models in losses by @tomaarsen in #3746
  • [fix] Allow Router children to load via dynamic-module mechanism by @tomaarsen in #3749
  • [model_card] Fix newlines in datasets with large texts by @tomaarsen in #3750
  • [fix] Don't upcast bf16/fp16 to fp32 in flash-attention pooling path by @tomaarsen in #3751
  • [feat] Per-call processing_kwargs override in Transformer.preprocess by @tomaarsen in #3753
  • Consolidate project configuration into pyproject.toml by @Samoed in #3745
  • Add training skill: train-sentence-transformers by @tomaarsen in #3752
  • [docs] Fix MTEB links + broken 'note' by @tomaarsen in #3754
  • [examples] Modernize the MSMARCO training scripts, add MNRL + MarginMSE recipe by @tomaarsen in #3761
  • Fix Invalid Markdown URL crash by catching ValueError from urlparse by @forhim007 in #3760
  • [fix] Inherit model_type from archetype on user subclasses by @tomaarsen in #3763
  • [fix] Delegate model.config to underlying transformers model by @tomaarsen in #3764
  • [feat] Add EmbedDistillLoss by @yjoonjang in #3665
  • [fix] Forward Hub auth to dynamic-module loader for private trust_remote_code models by @tomaarsen in #3766
  • [fix] Robust file_io error handling for local paths and Hub failures by @tomaarsen in #3765
  • [fix] Use first non-pad token for CLS pooling with left-padding by @tomaarsen in #3767
  • [fix] Unwrap DDP/torch.compile wrappers in AdaptiveLayerLoss by @tomaarsen in #3768
  • [docs] Use direct class imports in examples & docs (drop losses.MSELoss(...) style) by @tomaarsen in #3770
  • docs: fix grammar in parallel-sentence-mining README by @Karthikkolli17 in #3769
  • [examples] Avoid LoggingHandler, silence httpx in examples by @tomaarsen in #3771
  • [docs] Use modality-neutral terms (input, document) in loss docs & docstrings by @tomaarsen in #3772
  • [docs] Load models in float32 in the training examples & docs by @tomaarsen in #3773

New Contributors

Full Changelog: v5.4.1...v5.5.0
