github UKPLab/sentence-transformers v1.0.0
v1.0.0 - Improvements, New Models, Text-Image Models


This release brings many improvements and new features. The version numbering scheme has also been updated: we now use the format x.y.z, where x marks major releases, y marks smaller releases with new features, and z marks bugfixes.

Text-Image-Model CLIP

You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
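To pick the best-matching caption for the image, one can take the argmax over the similarity scores. A small self-contained sketch (the score values below are made up for illustration; a real cos_scores tensor would come from the snippet above):

```python
import numpy as np

# Illustrative only: given a 1 x N similarity matrix like cos_scores above,
# the best-matching text for the image is the column with the highest score.
captions = ['Two dogs in the snow', 'A cat on a table', 'A picture of London at night']
cos_scores = np.array([[0.30, 0.12, 0.08]])  # made-up scores for illustration

best = int(np.argmax(cos_scores[0]))
print(captions[best])  # 'Two dogs in the snow'
```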

  • More Information
  • IPython Demo
  • Colab Demo

Examples of how to train the CLIP model on your own data will be added soon.

New Models

New Features

  • The Asym model can now be used as the first module in a SentenceTransformer modules list.
  • Sorting during encoding has changed: previously, sentences were encoded from shortest to longest; now they are encoded from longest to shortest. As a result, out-of-memory errors occur at the start of encoding, and the estimated duration of the encoding process is more accurate.
  • Improved util.semantic_search method: it now uses the much faster torch.topk function, and you can define which scoring function should be used.
  • New util methods: util.dot_score computes the dot product between two embedding matrices; util.normalize_embeddings normalizes embeddings to unit length.
  • New parameter for the SentenceTransformer.encode method: if normalize_embeddings is set to True, embeddings are normalized to unit length. In that case, the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
  • If you specify do_lower_case=True in models.Transformer when creating a new SentenceTransformer, all input will be lowercased.
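The relationship between these new utilities can be sketched in plain numpy. This is a simplified illustration, not the library's implementation: it shows why dot_score on unit-length vectors equals cosine similarity, and the top-k retrieval idea behind semantic_search:

```python
import numpy as np

def normalize_embeddings(x):
    """Scale each row of x to unit L2 norm (like util.normalize_embeddings)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def dot_score(a, b):
    """Dot product between every row of a and every row of b (like util.dot_score)."""
    return a @ b.T

def semantic_search(query_emb, corpus_emb, top_k=2):
    """Return (indices, scores) of the top_k corpus rows for each query row.
    The library version does this with the much faster torch.topk."""
    scores = dot_score(query_emb, corpus_emb)
    idx = np.argsort(-scores, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(scores, idx, axis=1)
    return idx, top_scores

# On normalized (unit-length) embeddings, the dot product IS the cosine similarity
corpus = normalize_embeddings(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
query = normalize_embeddings(np.array([[2.0, 0.1]]))

idx, scores = semantic_search(query, corpus)
print(idx[0])  # nearest corpus entries first
```

This is why normalize_embeddings=True in encode pairs naturally with util.dot_score: once vectors are unit length, the cheaper dot product gives the same ranking as cosine similarity.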

New Examples

Bugfixes

  • The encode method now correctly returns token embeddings if output_value='token_embeddings' is specified
  • Bugfix in the LabelAccuracyEvaluator
  • Bugfix: when encode(sent, convert_to_tensor=True) was specified, tensors were moved off the GPU; they now stay on the GPU

Breaking changes:

  • SentenceTransformer.encode method: removed deprecated parameters is_pretokenized and num_workers
