github huggingface/sentence-transformers v5.4.0
v5.4.0 - Multimodal Embeddings and Reranking, Modular CrossEncoder, Flash Attention Input Flattening


This large release introduces first-class multimodal support for both SentenceTransformer and CrossEncoder, making it easy to compute embeddings and rerank across text, images, audio, and video. The CrossEncoder class has been fully modularized, allowing for generative rerankers (CausalLM-based models) via a new LogitScore module. Flash Attention 2 now automatically skips padding for text-only inputs, providing significant speedups and memory reductions, especially when input lengths vary.

Blog post: Multimodal Embedding & Reranker Models with Sentence Transformers: a walkthrough of the new multimodal capabilities with some practical examples.

Migration guide: Migrating from v5.x to v5.4+: covers updated import paths, renamed parameters, and other softly breaking changes with deprecation warnings. Note that there are no hard deprecations; all existing code should continue to work, with warnings at worst.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.4.0

# Inference only, use one of:
pip install sentence-transformers==5.4.0
pip install sentence-transformers[onnx-gpu]==5.4.0
pip install sentence-transformers[onnx]==5.4.0
pip install sentence-transformers[openvino]==5.4.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.4.0
pip install sentence-transformers[audio]==5.4.0
pip install sentence-transformers[video]==5.4.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.4.0

Multimodal Embeddings with SentenceTransformer (#3554)

SentenceTransformer now natively supports vision-language models (VLMs) and other multimodal architectures. You can encode and compare across text, images, audio, videos, or combinations of these, with automatic modality detection and preprocessing. Models advertise which modalities they support via the new model.modalities property and model.supports() method.

Using a pretrained multimodal embedding model

from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
    revision="refs/pr/23",
)

# Check supported modalities
print(model.modalities)
# ['text', 'image', 'video', 'message']
print(model.supports("image"))
# True

# Encode text
text_embeddings = model.encode(["A photo of a cat", "A pollinator on a flower"])

# Encode images (PIL images, file paths, or URLs all work)
image_embeddings = model.encode([
    Image.open("cat.jpg"),
    "https://example.com/flower.jpg",
])

# Encode mixed text+image inputs
multimodal_embeddings = model.encode([
    {"text": "Describe this image", "image": Image.open("cat.jpg")},
])

# Compute cross-modal similarity
similarity = model.similarity(text_embeddings, image_embeddings)

Building multimodal models with Router

You can also compose separate encoders for different modalities using the new Router module. Unlike the single-backbone VLM approach, Router lets you combine any existing text and image encoders and route inputs based on detected modality:

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Text encoder: MiniLM with mean pooling, projected to 768 dims to match image encoder
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# Image encoder: SigLIP outputs pooled embeddings directly
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs to the appropriate encoder based on detected modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)

model = SentenceTransformer(modules=[router])

# Text and image inputs are automatically routed to the correct encoder
text_embeddings = model.encode(["A photo of a cat"])
image_embeddings = model.encode(["https://example.com/cat.jpg"])
similarity = model.similarity(text_embeddings, image_embeddings)

Multimodal Reranking with CrossEncoder (#3554)

CrossEncoder now supports multimodal inputs for reranking, enabling cross-modal scoring of query-document pairs where either side can be text, images, audio, video, or mixed-modality content. This works with both generative rerankers (CausalLM-based, via the new LogitScore module) and encoder-based models. See the pretrained multimodal rerankers for models you can use right away.

from sentence_transformers import CrossEncoder

# Load a multimodal reranker
model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B", revision="refs/pr/11")

# Rank text documents against an image query (or vice versa)
results = model.rank(
    query="https://example.com/product.jpg",
    documents=["A red sneaker", "A blue dress", "A leather bag"],
)

Two training approaches are provided in the multimodal training examples:

  • Any-to-Any + LogitScore: Uses the full causal LM to generate a single token, scoring via log-odds of "1" vs "0".
  • Feature Extraction + Pooling + Dense: More memory-efficient alternative that skips the LM head.
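The log-odds scoring used by the first approach can be sketched in a few lines. This is an illustrative reimplementation, not the library's actual LogitScore module; the token ids and logit values are made up:

```python
# Sketch of LogitScore's scoring rule: the generative reranker produces one
# token, and the relevance score is the difference between the logits of the
# "true" and "false" tokens at the last position (i.e. their log-odds).

def logit_score(last_token_logits, true_token_id, false_token_id):
    """Score = logit("1") - logit("0") at the final token position."""
    return last_token_logits[true_token_id] - last_token_logits[false_token_id]

# Fake logits over a 6-token vocabulary; ids 3 ("1") and 4 ("0") are invented.
logits = [0.1, -2.0, 0.5, 3.5, -1.5, 0.0]
score = logit_score(logits, true_token_id=3, false_token_id=4)
print(score)  # 5.0 (positive -> the model leans towards "relevant")
```

A higher score means the model assigns more probability mass to the "true" token than the "false" token, so scores are directly usable for ranking.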

Modular CrossEncoder Architecture (#3554)

CrossEncoder has been fully modularized, inheriting from BaseModel (which is a torch.nn.Sequential). You can now inspect, customize, and compose module chains, just like SentenceTransformer. See the custom models guide for full details.

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", revision="refs/pr/11")
print(model)
"""
CrossEncoder(
  (0): Transformer({'transformer_task': 'text-generation', ...})
  (1): LogitScore({'true_token_id': 9693, 'false_token_id': 2152, ...})
)
"""

Generative reranker support

Thanks to the modular architecture, generative rerankers like mixedbread-ai/mxbai-rerank-base-v2 now work out of the box. These models ship with a modules.json that configures the Transformer + LogitScore chain automatically:

from sentence_transformers import CrossEncoder

model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v2")
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 in 2022."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# array([ 9. , -0.5], dtype=float32)

Module chain patterns

The Transformer module now supports multiple task types that determine how the underlying model is loaded and what outputs it produces:

  • "sequence-classification": Loads via AutoModelForSequenceClassification, returns classification logits directly.
  • "text-generation": Loads via AutoModelForCausalLM, returns raw logits from the language model head.
  • "any-to-any": Loads via AutoModelForMultimodalLM (transformers v5+), for multimodal causal LMs that accept interleaved image/text inputs.
  • "feature-extraction": Loads via AutoModel (no task-specific head), returns hidden states.

Various module chains are now possible; here are some common ones:

  1. Encoder-based (Sequence Classification): A single Transformer module with transformer_task="sequence-classification", the traditional BERT/RoBERTa approach. This was previously the only option for CrossEncoder models.

  2. CausalLM-based (Text Generation + LogitScore): For generative rerankers (Qwen, Llama, mxbai-rerank-v2, etc.), a Transformer with transformer_task="text-generation" followed by a LogitScore module that computes logit["yes"] - logit["no"] at the last token position. For multimodal rerankers, transformer_task="any-to-any" is used instead.

  3. Feature Extraction + Pooling + Dense: A memory-efficient alternative that uses the base model without LM head, pools the last token, and projects to a single score via a Dense layer.

When loading a model without a modules.json, CrossEncoder automatically selects the right chain: if the architecture ends with ForCausalLM, it uses text-generation + LogitScore (with "yes"/"no" tokens); otherwise it uses sequence-classification. You can also construct custom module chains explicitly:

from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.modules import Transformer, LogitScore

transformer = Transformer("Qwen/Qwen3-Reranker-0.6B", transformer_task="text-generation", revision="refs/pr/11")
true_id = transformer.tokenizer.convert_tokens_to_ids("1")
false_id = transformer.tokenizer.convert_tokens_to_ids("0")

model = CrossEncoder(modules=[transformer, LogitScore(true_token_id=true_id, false_token_id=false_id)])

API improvements

  • All CrossEncoder.__init__ arguments after model_name_or_path are now keyword-only.
  • tokenizer_args/tokenizer_kwargs -> processor_kwargs (with deprecation warnings).
  • max_length -> max_seq_length (with deprecation warning).
  • default_activation_function -> activation_fn (with deprecation warning).

Flash Attention 2 Input Flattening (#3554)

When using Flash Attention 2, Sentence Transformers now automatically skips padding for text-only inputs by concatenating all sequences into a single flat tensor. This eliminates wasted computation on padding tokens and is especially beneficial when input lengths vary widely within a batch. See the efficiency docs for more details.
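The underlying idea can be illustrated without the library: a ragged batch is concatenated into one flat token sequence, with cumulative lengths marking the boundaries so attention never crosses them (the cu_seqlens convention used by variable-length Flash Attention kernels). The helper name and token values below are illustrative only:

```python
# Illustration only: how padding-free batching flattens a ragged batch.
# Real varlen Flash Attention kernels take the flat token tensor plus
# cumulative sequence lengths ("cu_seqlens") instead of a padded matrix.

def flatten_batch(sequences):
    """Concatenate ragged sequences; return flat tokens + boundary offsets."""
    flat, cu_seqlens = [], [0]
    for seq in sequences:
        flat.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return flat, cu_seqlens

batch = [[101, 2, 3], [101, 7], [101, 9, 9, 9, 9]]
flat, cu_seqlens = flatten_batch(batch)
print(flat)        # [101, 2, 3, 101, 7, 101, 9, 9, 9, 9]
print(cu_seqlens)  # [0, 3, 5, 10]
# A padded batch would need 3 * 5 = 15 slots; the flat layout uses 10.
```

The savings grow with the length variance within a batch, which is why mixed-length workloads benefit the most.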

The feature is enabled automatically when all prerequisites are met:

  • transformers >= 5.0.0
  • Flash Attention with variable-length support is installed, either via pip install kernels (recommended) or via pip install flash-attn
  • The model uses attn_implementation="flash_attention_2"
  • The model uses the "feature-extraction" transformer task
  • The model uses the "torch" backend and supports the "text" modality
  • The inputs are text-only (multimodal inputs are always padded normally)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
)

# Padding is automatically skipped for text inputs
embeddings = model.encode(["short", "a much longer sentence that would normally cause padding"])

You can manually control the behavior via the unpad_inputs property on the Transformer module:

model[0].unpad_inputs = False   # Force padding (e.g. for architectures that don't support unpadded inputs)
model[0].unpad_inputs = True    # Explicitly request unpadding
model[0].unpad_inputs = None    # Auto-detect (default)

Benchmarks

The following benchmark compares throughput and VRAM usage across three attention configurations using BAAI/bge-base-en-v1.5, averaged across batch sizes. Four datasets with varying text lengths are tested: stsb (avg 10 tokens), natural-questions (avg 139 tokens), imdb (avg 304 tokens), and a shuffled mix of all three.

[Benchmark chart: Flash Attention 2 input flattening — throughput and VRAM usage across the three attention configurations]

Flash Attention 2 with input flattening consistently outperforms standard Flash Attention 2, while using considerably less VRAM. The gains grow with the variance in input length; the mixed dataset, whose lengths vary wildly (10-500 tokens), benefits the most.

Behavior Changes

  • Default pooling for CausalLM models: When a SentenceTransformer automatically adds a Pooling module for a causal language model (e.g. Llama, Qwen), it now defaults to last-token pooling instead of mean pooling. This better matches how decoder-only models represent sequences. Existing models with an explicit pooling configuration are unaffected.
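For intuition, last-token pooling versus mean pooling can be sketched with plain Python over toy hidden states. This mirrors the behavior conceptually; it is not the library's Pooling module, and the 2-dimensional hidden states are made up:

```python
# Sketch of last-token pooling vs. mean pooling for one sequence.
# Decoder-only (causal) models accumulate context left-to-right, so the
# last real token's hidden state summarizes the whole sequence.

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token."""
    last_index = sum(attention_mask) - 1  # number of real tokens - 1
    return hidden_states[last_index]

def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of the non-padding tokens."""
    n = sum(attention_mask)
    dims = len(hidden_states[0])
    return [sum(h[d] for h, m in zip(hidden_states, attention_mask) if m) / n
            for d in range(dims)]

states = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]]  # last row is padding
mask = [1, 1, 0]
print(last_token_pool(states, mask))  # [3.0, 2.0]
print(mean_pool(states, mask))        # [2.0, 1.0]
```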

Bug Fixes

  • Fix inverted Euclidean/Manhattan distance in TripletLoss by @tomaarsen in #3704: The Euclidean and Manhattan distance metrics in TripletLoss were missing a negation, causing them to behave as similarity metrics rather than distance metrics, which inverted the loss. Cosine distance was not affected.
  • Separate tokenize and forward kwargs in SparseEncoder.encode to prevent Router misrouting by @ratatouille-plat in #3695: The max_active_dims parameter was leaking into tokenize kwargs, which could cause the Router to misroute inputs.
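The sign issue in the first fix is easy to see in a toy version of the loss. The sketch below uses plain Python and made-up vectors, not the library's TripletLoss:

```python
import math

# Toy illustration of why the metric's sign matters in TripletLoss.
# The loss is max(d(anchor, positive) - d(anchor, negative) + margin, 0);
# d must be a *distance* (larger = farther apart), not a similarity.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0, dist=euclidean):
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)

anchor, positive, negative = [0.0, 0.0], [1.0, 0.0], [5.0, 0.0]
print(triplet_loss(anchor, positive, negative))  # 0.0: positive is far closer
# With the sign flipped (a similarity-like metric), the same easy triplet
# would be wrongly penalized: -1 - (-5) + 1 = 5.0
bad = max(-euclidean(anchor, positive) + euclidean(anchor, negative) + 1.0, 0.0)
print(bad)  # 5.0
```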

All Changes

  • [chore] Increment dev version to v5.4.0.dev0 following the v5.3.0 release by @tomaarsen in #3688
  • [ci] Remove cache, switch to uv for CI by @tomaarsen in #3689
  • [ci] Install model2vec with the distill extra by @tomaarsen in #3698
  • 🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #3697
  • 🔒 Pin GitHub Actions to commit SHAs (tests.yml) by @paulinebm in #3700
  • [v5.4] Introduce cross-modality and multi-modality support; modularize CrossEncoder class by @tomaarsen in #3554
  • [fix] Fix inverted Euclidean/Manhattan distance in TripletLoss by @tomaarsen in #3704
  • [docs] Jina-reranker-m0 doesn't require a revision anymore by @tomaarsen in #3705
  • fix: Separate tokenize and forward kwargs in SparseEncoder.encode to prevent Router misrouting by @ratatouille-plat in #3695
  • Extend the migration guide for completeness by @tomaarsen in #3707

Full Changelog: v5.3.0...v5.4.0
