github UKPLab/sentence-transformers v1.2.0
v1.2.0 - Unsupervised Learning, New Training Examples, Improved Models


Unsupervised Sentence Embedding Learning

New methods have been integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.

New methods:

Pre-Training Methods

  • MLM: An example script to run Masked Language Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. MLM also works well for domain transfer: first train on your custom data, then train with e.g. NLI or STS data.
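The core idea behind the MLM objective can be sketched in plain Python. This is a hedged, self-contained illustration (not the actual example script from the repository): a fraction of tokens is replaced with a `[MASK]` token, and the model's job is to predict the original token at exactly those positions. The function name `mask_tokens` and the string-token representation are assumptions for illustration; real MLM pipelines operate on token IDs.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with [MASK] for MLM training.

    Returns the masked sequence and the labels: the original token at
    masked positions, None elsewhere (positions ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # model must predict the original token here
        else:
            masked.append(tok)
            labels.append(None)  # not masked: excluded from the MLM loss
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

Because MLM needs no labels beyond the text itself, this is what makes it usable on unlabeled in-domain data before the supervised NLI/STS stage.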

Training Examples

New models

New Functions

  • SentenceTransformer.fit() Checkpoints: The fit() method can now save checkpoints during training at a fixed number of steps. More info
  • Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
    Valid values are mean/max/cls.
  • NoDuplicatesDataLoader: When using the MultipleNegativesRankingLoss, you should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
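The deduplication idea behind NoDuplicatesDataLoader can be sketched as follows. This is a simplified, assumed re-implementation for illustration, not the library's actual class: with MultipleNegativesRankingLoss, every other sentence in a batch serves as a negative, so a duplicate inside one batch would act as a false negative; the sketch therefore defers duplicates to a later batch.

```python
def no_duplicates_batches(examples, batch_size):
    """Yield batches in which no sentence appears twice.

    examples: iterable of (sentence_a, sentence_b) pairs (simplified).
    A pair that would repeat a sentence already in the current batch
    is deferred to a later batch instead of being dropped.
    """
    pending = list(examples)
    while pending:
        batch, seen, leftover = [], set(), []
        for pair in pending:
            if len(batch) == batch_size:
                leftover.append(pair)            # batch full: keep for later
            elif any(s in seen for s in pair):
                leftover.append(pair)            # defer: duplicate sentence
            else:
                batch.append(pair)
                seen.update(pair)
        yield batch
        pending = leftover

# Example: ("a", "c") repeats "a", so it moves to the second batch.
batches = list(no_duplicates_batches([("a", "b"), ("a", "c"), ("d", "e")], 2))
# → [[("a", "b"), ("d", "e")], [("a", "c")]]
```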
