This release ships the train-sentence-transformers Agent Skill, adds two new training losses, and brings a long list of robustness and correctness fixes.
The new train-sentence-transformers Agent Skill lets AI coding agents (Claude Code, Codex, Cursor, Gemini CLI, ...) drive end-to-end training and fine-tuning across all three model types. EmbedDistillLoss is a new embedding-level knowledge distillation loss for SentenceTransformer: it aligns a student model's embeddings with pre-computed teacher embeddings, an alternative to the score-based distillation provided by MarginMSELoss and DistillKLDivLoss. ADRMSELoss is a new listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper. encode() and predict() also gain a per-call processing_kwargs override, and more.
Install this version with
# Training + Inference
pip install sentence-transformers[train]==5.5.0
# Inference only, use one of:
pip install sentence-transformers==5.5.0
pip install sentence-transformers[onnx-gpu]==5.5.0
pip install sentence-transformers[onnx]==5.5.0
pip install sentence-transformers[openvino]==5.5.0
# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.0
pip install sentence-transformers[audio]==5.5.0
pip install sentence-transformers[video]==5.5.0
# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.0
The train-sentence-transformers Agent Skill (#3752)
If you use an AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, OpenCode, ...), you can now install the train-sentence-transformers Agent Skill and ask your agent to fine-tune a model on your data:
hf skills add train-sentence-transformers # installs under ./.agents/skills/
hf skills add train-sentence-transformers --global # installs under ~/.agents/skills/
hf skills add train-sentence-transformers --claude # also symlinks into .claude/skills/
The skill gives the agent curated, version-aware guidance for training SentenceTransformer (bi-encoder), CrossEncoder (reranker), and SparseEncoder/SPLADE models, covering base model selection, loss and evaluator choice, hard-negative mining, distillation, LoRA, Matryoshka, multilingual training, static embeddings, plus a set of production-ready training template scripts. Then you can prompt your agent with things like:
"Train a multilingual sentence-transformer on Dutch legal pairs."
"Fine-tune a cross-encoder reranker on
(question, answer)pairs from my dataset, mine hard negatives, and push to my Hub repo.""Train a German sparse embedding model with high sparsity."
"Can you train a static embedding model on 100k code triplets?"
The skill lives in the repository under skills/train-sentence-transformers/ and is mirrored to the huggingface/skills marketplace on each release.
New loss: EmbedDistillLoss (#3665)
Introduces EmbedDistillLoss (Kim et al., 2023), an embedding-level knowledge distillation loss for SentenceTransformer. Rather than distilling teacher scores (MarginMSELoss, DistillKLDivLoss), it directly aligns the student's sentence_embedding with a pre-computed teacher embedding passed via the dataset's label column. The comparison uses a configurable distance_metric, one of "cosine" (the default), "l2", or "mse". When the student and teacher dimensions differ, pass projection_dim=<teacher_dim> to add a learnable projection from the student's embedding space into the teacher's. That projection lives on the loss rather than on the saved model, so use loss.save_projection(...) / loss.load_projection(...) to reuse it across stages (e.g. as done in Arkam et al. for Jina v5). As part of this change, MSELoss is now a thin subclass of EmbedDistillLoss with distance_metric="mse", and it also gains the optional projection_dim argument.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sentence_transformer.losses import EmbedDistillLoss
student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")
train_dataset = Dataset.from_dict({
    "sentence": ["It's nice weather outside today.", "He drove to work."],
})

# Pre-compute teacher embeddings once and store them as the `label` column
def add_teacher_embeddings(batch):
    return {"label": teacher_model.encode(batch["sentence"]).tolist()}

train_dataset = train_dataset.map(add_teacher_embeddings, batched=True)

loss = EmbedDistillLoss(student_model, distance_metric="cosine")
# If the student and teacher dimensions differ, add a learnable projection:
# loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)

trainer = SentenceTransformerTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
See the updated model distillation examples and the loss overview for more.
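Because the projection is stored on the loss object rather than inside the saved student model, it can be persisted and reused between distillation stages. A minimal sketch of that workflow (the file path passed to save_projection / load_projection is illustrative):

loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)
# ... run the first distillation stage with this loss ...
loss.save_projection("stage1_projection")  # persist the learnable student -> teacher projection

# In a later stage, create a fresh loss and continue from the trained projection:
loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)
loss.load_projection("stage1_projection")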
New loss: ADRMSELoss for Cross Encoders (#3690)
Introduces ADRMSELoss (Approx Discounted Rank Mean Squared Error), a listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper (Schlatt et al., ECIR 2025). It computes a differentiable approximation of each document's rank via pairwise sigmoids and minimizes the nDCG-discounted squared error against the true ranks derived from the labels. It expects listwise inputs: a (query, [doc1, ..., docN]) pair plus a [score1, ..., scoreN] label list per sample (binary or continuous labels, variable document counts allowed). It's designed for LLM-distillation reranking, where the per-document scores come from a strong LLM's ordering.
from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import ADRMSELoss
model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "scores": [[0.95, 0.1], [0.98, 0.92, 0.2]],
})

loss = ADRMSELoss(model)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
There's a full MS MARCO example at training_ms_marco_adrmse.py. Note that LambdaLoss generally remains the strongest loss in the listwise family. See the Cross Encoder loss overview for guidance on picking a loss.
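Conceptually, the loss works as follows; the snippet below is a rough, self-contained sketch of the pairwise-sigmoid rank approximation and nDCG-style discounting described above, not the library's implementation (the temperature argument is illustrative):

import torch

def adr_mse_sketch(pred_scores: torch.Tensor, true_scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Differentiable rank: rank_i ~= 1 + sum_{j != i} sigmoid((s_j - s_i) / T)
    diff = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)  # [i, j] = s_j - s_i
    approx_rank = 0.5 + torch.sigmoid(diff / temperature).sum(dim=1)  # the 0.5 accounts for the j == i term
    # True ranks derived from the labels (the most relevant document gets rank 1)
    true_rank = 1.0 + (true_scores.unsqueeze(0) > true_scores.unsqueeze(1)).sum(dim=1).float()
    # Apply the nDCG-style discount, then take the mean squared error between discounted ranks
    discount = lambda rank: 1.0 / torch.log2(1.0 + rank)
    return ((discount(approx_rank) - discount(true_rank)) ** 2).mean()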
Per-call processing_kwargs override (#3753)
SentenceTransformer.encode() / encode_query() / encode_document(), SparseEncoder.encode(), CrossEncoder.predict(), and model.preprocess() now accept a processing_kwargs argument that overrides the processor/tokenizer kwargs configured at construction time, for a single call. It has the same nested structure as the processing_kwargs constructor argument (top-level keys text, audio, image, video, common, chat_template) and is shallow-merged on top of the instance-level settings, so you can override just one setting (e.g. max_length) and leave the rest intact.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Override processor kwargs (e.g. max_length, truncation) for this call only:
embeddings = model.encode(
    ["a short text", "a much longer text that you want truncated more aggressively ..."],
    processing_kwargs={"text": {"max_length": 256, "truncation": True}},
)
This is especially handy for vision-language models, where you can change the image resolution per call, e.g. model.encode(images, processing_kwargs={"image": {"max_pixels": 256 * 256}}).
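The same mechanism lets you combine construction-time defaults with one-off overrides; a small sketch (the constructor settings here are purely illustrative):

from sentence_transformers import SentenceTransformer

# Instance-level processing defaults, set once at construction time
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    processing_kwargs={"text": {"max_length": 512, "truncation": True}},
)

# One-off override: this call uses max_length=128; later calls fall back to the instance-level max_length=512
embeddings = model.encode(
    ["a long document ..."],
    processing_kwargs={"text": {"max_length": 128, "truncation": True}},
)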
Smaller Features
- Allow CrossEncoder module stacks that don't start with a Transformer, and recognize a trailing Dense(module_output_name="scores") as the scoring head, by @tomaarsen in #3742: num_labels now reads that head's out_features, and model.config / model.model return None when there's no underlying transformers model.
- Infer that a model is an IR model on its generated model card when an InformationRetrievalEvaluator / NanoBEIREvaluator (or their sparse variants) was used during training, by @tomaarsen in #3741: the usage snippet then shows encode_query / encode_document, even without IR prompt names or a Router architecture.
- Warn at model-load time when the installed transformers version is too old to honor use_bidirectional_attention / is_causal flags in a model's config (e.g. for google/embeddinggemma-300m), rather than silently ignoring them, by @tomaarsen in #3726.
Bug Fixes
- Use the first non-pad token for CLS pooling with left-padding tokenizers by @tomaarsen in #3767: pooling_mode="cls" previously returned the embedding at position 0, which is a [PAD] token for left-padded inputs (common with decoder-only models), silently producing incorrect sentence embeddings. It now uses the attention mask to find the first real token per sequence (see the sketch after this list). Resolves #3208.
- Don't upcast bf16/fp16 embeddings to fp32 in the Flash Attention 2 mean-pooling path by @tomaarsen in #3751: the int64-derived divisor in mean / mean_sqrt_len_tokens pooling forced the pooled output to fp32, which could crash the downstream Dense / scoring head with a dtype mismatch.
- Unwrap DistributedDataParallel / torch.compile wrappers in AdaptiveLayerLoss (and Matryoshka2dLoss) by @tomaarsen in #3768: training with these losses under DDP or torch.compile previously crashed with TypeError: 'DistributedDataParallel' object is not subscriptable. Resolves #3170.
- Expose preprocess / get_embedding_dimension on DDP-wrapped models in losses by @tomaarsen in #3746: training a CrossEncoder (or using MatryoshkaLoss) under DDP crashed with AttributeError: 'DistributedDataParallel' object has no attribute 'preprocess'.
- Push the full Sentence Transformers layout from every checkpoint by @tomaarsen in #3740: mid-training Hub pushes with hub_strategy="every_save" / "checkpoint" / "all_checkpoints" were previously missing modules.json, config_sentence_transformers.json, README.md, and module subfolders, leaving those revisions unloadable.
- Inherit model_type from the archetype class on user subclasses by @tomaarsen in #3763: a plain subclass like class MyModel(SentenceTransformer): pass would silently load checkpoints via the conversion path (e.g. defaulting CLS-pooling models to mean pooling), producing wrong embeddings with no error. Resolves #3536. Note: a model previously saved through a subclass has the subclass name in its config and should be re-saved (or its config_sentence_transformers.json edited) under this fix.
- Add a model.config property that delegates to the underlying transformers model's PretrainedConfig (or None if there is none) by @tomaarsen in #3764: this restores DeepSpeed ZeRO and other transformers integrations that read model.config.hidden_size, which previously crashed with AttributeError: 'SentenceTransformer' object has no attribute 'config'. Resolves #3531.
- Robust file_io error handling for local paths and Hub failures by @tomaarsen in #3765: an incomplete local model path no longer raises a confusing HFValidationError (e.g. on Windows absolute paths), and transient Hub errors (auth, rate-limit, network) on critical files now propagate instead of silently falling back to a default architecture. Resolves #3370. A local directory whose name collides with a Hub repo id now takes precedence even if incomplete.
- Allow Router children to load via the dynamic-module mechanism by @tomaarsen in #3749: a model whose architecture uses a Router with a repository-local custom child module class now loads with trust_remote_code=True instead of raising an ImportError.
- Forward the Hub auth token (plus cache dir and local_files_only) to the dynamic-module loader by @tomaarsen in #3766: private Hub repos with trust_remote_code=True repo-local custom modules now load on the first try instead of failing with a misleading ModuleNotFoundError. Resolves #3367.
- Unwrap dict- and torchcodec-decoder-wrapped audio/video that appears inside a multimodal dict input (e.g. {"audio": {"array": ..., "sampling_rate": ...}, "text": ...}) so it reaches the processor correctly by @tomaarsen in #3736. Resolves #3732.
- Fix a crash on malformed URL-like strings (e.g. broken Markdown links) in the multimodal input parser by catching ValueError from urlparse by @forhim007 in #3760: strings like "https://www.google.com)[google.com]" raised ValueError: Invalid IPv6 URL inside modality detection. They are now treated as plain text. Resolves #3758.
- Avoid CPU OOM in the automatic model-card dataset metrics when a dataset stores media as file-path strings by @yjoonjang in #3733: multimodal training (e.g. Qwen3-VL-Embedding) on ColPali/VDR-style datasets previously loaded ~1000 media files per text-like column per process during Trainer.__init__. Such columns are now detected by modality and skipped, the stats sample is bounded (and reduced from 1000 to 100 rows), and a modality row was added to the model-card dataset stats table.
- Fix unescaped newlines in auto-generated model card dataset examples for long texts by @tomaarsen in #3750: example strings over 1000 characters skipped the table-safe escaping, producing a broken Markdown table on the Hub.
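For the CLS-pooling fix above, the core idea can be sketched in a few lines of plain PyTorch (illustrative tensors, not the library's code):

import torch

token_embeddings = torch.randn(2, 5, 8)          # (batch, seq_len, hidden_dim)
attention_mask = torch.tensor([[0, 0, 1, 1, 1],  # left-padded sequence: [PAD] at positions 0-1
                               [1, 1, 1, 1, 1]]) # unpadded sequence
# With left padding, the first real token comes after all pad tokens,
# so its index equals the number of pad tokens in that sequence
first_token_idx = (attention_mask == 0).sum(dim=1)
cls_embeddings = token_embeddings[torch.arange(token_embeddings.size(0)), first_token_idx]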
All Changes
- Warn when transformers is too old to bidir. attention flags in model config by @tomaarsen in #3726
- [ci] Reduce hub calls in tests by @tomaarsen in #3727
- [enh] The Qwen3 integrations are merged, no need for revision anymore by @tomaarsen in #3729
- [tests] Future-proof getting model keys as MODEL_MAPPING_NAMES is being removed by @tomaarsen in #3730
- [examples] Fix training dataset creation by @tomaarsen in #3728
- Link to blogposts where relevant by @tomaarsen in #3735
- Unwrap audio/video inside multimodal dict inputs by @tomaarsen in #3736
- docs(SimCSE): migrate README example to SentenceTransformerTrainer by @MukundaKatta in #3734
- Avoid OOM in compute_dataset_metrics for multimodal datasets with path columns by @yjoonjang in #3733
- Be less specific in CE model card template by @tomaarsen in #3738
- [feat] Add ADRMSELoss by @sky-2002 in #3690
- [model card] stats are computed over 100 samples, not 1000 by @tomaarsen in #3739
- [trainer] Push full Sentence Transformers layout from each checkpoint by @tomaarsen in #3740
- [model card] Set ir_model on the model card based on evaluators by @tomaarsen in #3741
- [feat] Allow Dense as CrossEncoder scoring head by @tomaarsen in #3742
- Update model card link format by @matthewhaynesonline in #3744
- [fix] Expose preprocess / get_embedding_dimension on DDP-wrapped models in losses by @tomaarsen in #3746
- [fix] Allow Router children to load via dynamic-module mechanism by @tomaarsen in #3749
- [model_card] Fix newlines in datasets with large texts by @tomaarsen in #3750
- [fix] Don't upcast bf16/fp16 to fp32 in flash-attention pooling path by @tomaarsen in #3751
- [feat] Per-call processing_kwargs override in Transformer.preprocess by @tomaarsen in #3753
- Consolidate project configuration into pyproject.toml by @Samoed in #3745
- Add training skill: train-sentence-transformers by @tomaarsen in #3752
- [docs] Fix MTEB links + broken 'note' by @tomaarsen in #3754
- [examples] Modernize the MSMARCO training scripts, add MNRL + MarginMSE recipe by @tomaarsen in #3761
- Fix Invalid Markdown URL crash by catching ValueError from urlparse by @forhim007 in #3760
- [fix] Inherit model_type from archetype on user subclasses by @tomaarsen in #3763
- [fix] Delegate model.config to underlying transformers model by @tomaarsen in #3764
- [feat] Add EmbedDistillLoss by @yjoonjang in #3665
- [fix] Forward Hub auth to dynamic-module loader for private trust_remote_code models by @tomaarsen in #3766
- [fix] Robust file_io error handling for local paths and Hub failures by @tomaarsen in #3765
- [fix] Use first non-pad token for CLS pooling with left-padding by @tomaarsen in #3767
- [fix] Unwrap DDP/torch.compile wrappers in AdaptiveLayerLoss by @tomaarsen in #3768
- [docs] Use direct class imports in examples & docs (drop losses.MSELoss(...) style) by @tomaarsen in #3770
- docs: fix grammar in parallel-sentence-mining README by @Karthikkolli17 in #3769
- [examples] Avoid LoggingHandler, silence httpx in examples by @tomaarsen in #3771
- [docs] Use modality-neutral terms (input, document) in loss docs & docstrings by @tomaarsen in #3772
- [docs] Load models in float32 in the training examples & docs by @tomaarsen in #3773
New Contributors
- @MukundaKatta made their first contribution in #3734
- @matthewhaynesonline made their first contribution in #3744
- @Karthikkolli17 made their first contribution in #3769
Full Changelog: v5.4.1...v5.5.0