This patch release fixes a small quirk with multimodal inference when using single-key multimodal inputs like model.encode({"image": ...}).
Install this version with:
# Training + Inference
pip install sentence-transformers[train]==5.5.1
# Inference only, use one of:
pip install sentence-transformers==5.5.1
pip install sentence-transformers[onnx-gpu]==5.5.1
pip install sentence-transformers[onnx]==5.5.1
pip install sentence-transformers[openvino]==5.5.1
# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.1
pip install sentence-transformers[audio]==5.5.1
pip install sentence-transformers[video]==5.5.1
# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.1
Bug fixed
Previously, calls like model.encode({"image": ...}) or model.encode([{"image": ...}, ...]) were inferred as the ("image",) modality, whereas model.encode(my_image) or model.encode([my_image, my_image_2, ...]) were inferred as "image".
This resulted in confusing errors if the model had no modality_config mapping for ("image",) in addition to "image". A single-key multimodal dict is now collapsed to the bare modality (just "image" in this example).
For example, the following code was affected:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/BGE-VL-base', trust_remote_code=True)
embedding = model.encode({"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ettin-reranker/mteb_ndcg10_all-MiniLM-L6-v2.png"})
print(embedding.shape)
This previously failed because the model only implements paths for "text", "image", and ("image", "text").
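The collapse described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: `modality_of` and `SUPPORTED` are hypothetical names, `SUPPORTED` is a toy stand-in for a model's modality_config keys, and key ordering for multi-key dicts is simply taken from the dict as given.

```python
from typing import Mapping, Tuple, Union

Modality = Union[str, Tuple[str, ...]]

# Toy stand-in (assumption) for the modalities a model supports,
# mirroring the three paths named in the text above.
SUPPORTED = {"text", "image", ("image", "text")}


def modality_of(inputs: Mapping[str, object]) -> Modality:
    """Hypothetical helper: derive a modality key from a multimodal dict."""
    keys = tuple(inputs)
    if len(keys) == 1:
        # Single-key dict: collapse ("image",) to the bare "image".
        return keys[0]
    # Multi-key dict: keep the tuple of keys, in the order given.
    return keys


single = modality_of({"image": "cat.png"})
pair = modality_of({"image": "cat.png", "text": "a cat"})

print(single, single in SUPPORTED)  # image True
print(pair, pair in SUPPORTED)      # ('image', 'text') True
print(("image",) in SUPPORTED)      # False: the un-collapsed key has no match
```

Without the collapse, the single-key dict would map to ("image",), which has no entry in the toy set, matching the kind of lookup failure described above.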
All Changes
- [fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779
Full Changelog: v5.5.0...v5.5.1