pypi sentence-transformers 5.5.1
v5.5.1 - Small Multimodal patch

5 hours ago

This patch release fixes a small quirk with multimodal inference when using single-key multimodal inputs like model.encode({"image": ...}).

Install this version with:

# Training + Inference
pip install sentence-transformers[train]==5.5.1

# Inference only, use one of:
pip install sentence-transformers==5.5.1
pip install sentence-transformers[onnx-gpu]==5.5.1
pip install sentence-transformers[onnx]==5.5.1
pip install sentence-transformers[openvino]==5.5.1

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.1
pip install sentence-transformers[audio]==5.5.1
pip install sentence-transformers[video]==5.5.1

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.1

Bug fixed

Previously, calls like model.encode({"image": ...}) or model.encode([{"image": ...}, ...]) were inferred as the ("image",) modality, whereas model.encode(my_image) or model.encode([my_image, my_image_2, ...]) were inferred as the bare "image" modality.

This resulted in confusing errors when the model had no modality_config mapping for ("image",) in addition to "image". A single-key multimodal dict is now collapsed to the bare modality (just "image" in this example).

For example, the following code was affected:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/BGE-VL-base', trust_remote_code=True)
embedding = model.encode({"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ettin-reranker/mteb_ndcg10_all-MiniLM-L6-v2.png"})
print(embedding.shape)

This previously failed because the model only implements paths for "text", "image", and ("image", "text").
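
The collapsing behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: normalize_modality, modality_of_dict, and the supported set are hypothetical names invented for illustration, and sorting the dict keys is an assumption made so the result does not depend on key order.

```python
# Hypothetical sketch; these names are not part of sentence-transformers.

def normalize_modality(modality):
    """Collapse a one-element tuple such as ("image",) to the bare string "image"."""
    if isinstance(modality, tuple) and len(modality) == 1:
        return modality[0]
    return modality


def modality_of_dict(inputs):
    """Derive a modality from the keys of a multimodal input dict.

    Sorting the keys is an assumption made here so that the result is
    independent of the order in which the keys were written.
    """
    return normalize_modality(tuple(sorted(inputs)))


# Stand-in for the set of modalities a model implements (assumed for illustration).
supported = {"text", "image", ("image", "text")}

single = modality_of_dict({"image": "photo.png"})
pair = modality_of_dict({"text": "a caption", "image": "photo.png"})

print(single, single in supported)
print(pair, pair in supported)
print(("image",) in supported)
```

In this sketch the single-key dict resolves to "image", which is in the supported set, while the uncollapsed ("image",) tuple is not. That mismatch is the kind of lookup failure the patch avoids.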

All Changes

  • [fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779

Full Changelog: v5.5.0...v5.5.1
