This release brings many improvements and new features. The version numbering scheme has also been updated: we now use the format x.y.z, where x denotes major releases, y releases with new features, and z bugfix releases.
## Text-Image Model CLIP
You can now encode text and images into the same vector space using the OpenAI CLIP model. You can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
```
More information:
- IPython Demo
- Colab Demo
Examples of how to train the CLIP model on your own data will be added soon.
## New Models
- Add v3 models trained for semantic search on MS MARCO: MS MARCO Models v3
- First models trained on Natural Questions dataset for Q&A Retrieval: Natural Questions Models v1
- Add DPR Models from Facebook for Q&A Retrieval: DPR-Models
## New Features
- The Asym Model can now be used as the first model in a SentenceTransformer modules list.
- Sorting during encoding has changed: previously, sentences were encoded from shortest to longest; now they are encoded from longest to shortest. Out-of-memory errors therefore occur at the start of encoding rather than near the end, and the estimated duration of the encoding process is more accurate.
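The new encoding order can be pictured as a simple sort by length, longest first (a plain-Python sketch of the idea, not the library internals):

```python
sentences = ["A cat", "A very long sentence about two dogs in the snow", "Hi"]

# Encode the longest sentences first: the largest batches are processed at
# the start, so an out-of-memory error surfaces immediately rather than late.
order = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))
print([sentences[i] for i in order])
# → ['A very long sentence about two dogs in the snow', 'A cat', 'Hi']
```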
- Improved `util.semantic_search` method: it now uses the much faster `torch.topk` function, and you can define which scoring function should be used.
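To illustrate the idea of semantic search with a pluggable scoring function, here is a minimal pure-Python sketch (the real `util.semantic_search` operates on torch tensors and uses `torch.topk`; the helper below is illustrative only):

```python
# Illustrative sketch: score a query against a corpus with a swappable
# scoring function and return the best-matching corpus entries.

def dot_score(a, b):
    """Dot product between a query vector and one corpus vector."""
    return sum(x * y for x, y in zip(a, b))

def semantic_search_sketch(query, corpus, score_function=dot_score, top_k=2):
    """Score `query` against every corpus vector and return the top_k
    hits as (corpus_id, score) pairs, best first."""
    scores = [(i, score_function(query, vec)) for i, vec in enumerate(corpus)]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

query = [1.0, 0.0]
corpus = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(semantic_search_sketch(query, corpus, top_k=2))
# → [(0, 0.9), (2, 0.5)]
```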
- New util methods: `util.dot_score` computes the dot product of two embedding matrices; `util.normalize_embeddings` normalizes embeddings to unit length.
- New parameter for the `SentenceTransformer.encode` method: `normalize_embeddings`. If set to `True`, embeddings are normalized to unit length. In that case, the faster `util.dot_score` can be used instead of `util.cos_sim` to compute cosine similarity scores.
- If you specify `do_lower_case=True` in `models.Transformer()` when creating a new SentenceTransformer, all input will be lowercased.
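To see why `normalize_embeddings` lets you swap `util.cos_sim` for the faster `util.dot_score`: on unit-length vectors, the dot product equals the cosine similarity. A plain-Python sketch of this property (not the library code):

```python
import math

def normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings does)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
a_n, b_n = normalize(a), normalize(b)

# After normalization, the dot product equals the cosine similarity.
assert abs(dot(a_n, b_n) - cos_sim(a, b)) < 1e-12
print(round(dot(a_n, b_n), 4))
# → 0.9839
```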
## New Examples
- Added an example for model quantization on CPUs (smaller models, faster run-time): model_quantization.py
- Started adding examples of how to train SBERT models without training data (unsupervised learning), beginning with an example of Query Generation to train a semantic search model.
## Bugfixes
- The encode method now correctly returns token embeddings if `output_value='token_embeddings'` is specified.
- Bugfix of the `LabelAccuracyEvaluator`.
- Tensors are no longer moved to the CPU when you specify `encode(sent, convert_to_tensor=True)`; they now stay on the GPU.
## Breaking Changes
- `SentenceTransformer.encode` method: removed the deprecated parameters `is_pretokenized` and `num_workers`.