Unsupervised Sentence Embedding Learning
New methods were integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.
New methods:
- CT: Integration of Semantic Re-Tuning With Contrastive Tension (CT) to tune models without labeled data
- CT_In-Batch_Negatives: A modification of CT using in-batch negatives
- SimCSE: An unsupervised sentence embedding learning method by Gao et al.
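As a rough illustration of the SimCSE recipe (the base checkpoint and sentence list below are placeholders, not part of the release), each sentence is simply paired with itself so that dropout produces two different encodings, which MultipleNegativesRankingLoss then contrasts against the other sentences in the batch:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Hypothetical unlabeled sentences from your target domain
train_sentences = ["A plane is taking off.", "A man is playing a large flute.", "A woman is reading a book."]

# Build a fresh model: transformer encoder + mean pooling
word_embedding_model = models.Transformer("distilbert-base-uncased", max_seq_length=64)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# SimCSE trick: pair each sentence with itself; dropout yields two different
# encodings, and the other sentences in the batch act as negatives
train_examples = [InputExample(texts=[s, s]) for s in train_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, show_progress_bar=True)
```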
Pre-Training Methods
- MLM: An example script to run Masked Language Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: you first train on your custom data and then train with e.g. NLI or STS data.
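A minimal sketch of such an MLM run with Hugging Face transformers is shown below; the checkpoint name, the toy sentence list, and the training arguments are illustrative placeholders rather than the settings of the actual example script:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # assumption: any BERT-style checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical in-domain sentences; in practice read them from your own corpus file
sentences = ["First domain-specific sentence.", "Second domain-specific sentence."]
encodings = tokenizer(sentences, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# The collator masks 15% of the tokens on the fly and creates the MLM labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(output_dir="mlm-output", num_train_epochs=1,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args, data_collator=data_collator,
                  train_dataset=train_dataset)
trainer.train()
trainer.save_model("mlm-adapted-model")  # later usable as the word embedding model for supervised training
```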
Training Examples
- Paraphrase Data: In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we showed that training on paraphrase data is powerful. That folder provides collections of different paraphrase datasets and scripts to train on them.
- NLI with MultipleNegativesRankingLoss: A dedicated example of how to use MultipleNegativesRankingLoss for training with NLI data, which leads to a significant performance boost.
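The following sketch shows the general shape of that setup with made-up NLI triplets (premise, entailed hypothesis, contradicted hypothesis); the base checkpoint and data are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilroberta-base")  # placeholder base checkpoint

# Hypothetical NLI-style triplets: (premise, entailed hypothesis, contradicted hypothesis)
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating.", "A man is running."]),
    InputExample(texts=["Two dogs play in the park.", "Dogs are playing outside.", "The dogs are sleeping."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Each (premise, entailment) pair is a positive; the contradiction and all other
# sentences in the batch serve as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```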
New models
- New NLI & STS models: Following the Paraphrase Data training example, we published new models trained on NLI and NLI+STS data. Training code is available: training_nli_v2.py.

| Model name | STSb-test performance |
| --- | --- |
| **Previous best models** | |
| nli-bert-large | 79.19 |
| stsb-roberta-large | 86.39 |
| **New v2 models** | |
| nli-mpnet-base-v2 | 86.53 |
| stsb-mpnet-base-v2 | 88.57 |
- New MS MARCO model for Semantic Search: Hofstätter et al. optimized the training procedure on the MS MARCO dataset. The resulting model is integrated as msmarco-distilbert-base-tas-b and improves the performance on the MS MARCO dataset from 33.13 to 34.43 MRR@10.
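A small retrieval sketch with the new model (query and passages are made up); since TAS-B was trained for dot-product similarity, scoring uses a plain dot product rather than cosine similarity:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-base-tas-b")

query = "How many people live in London?"  # made-up example query
passages = ["Around 9 million people live in London.",
            "London is known for its financial district."]

# Score query against passages with a dot product, as the model expects
query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)
scores = torch.matmul(passage_embs, query_emb)

for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}\t{passage}")
```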
New Functions
SentenceTransformer.fit()
- Checkpoints: The fit() method now allows saving checkpoints during training after a fixed number of steps (see the sketch below).
- Pooling mode as string: You can now pass the pooling mode to models.Pooling() as a string. Valid values are mean/max/cls:

```python
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
```
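A minimal sketch of checkpointing during fit(), assuming the new checkpoint_path, checkpoint_save_steps, and checkpoint_save_total_limit arguments; the training data below is a placeholder:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilroberta-base")
train_examples = [InputExample(texts=["A sentence.", "A paraphrase of that sentence."])]  # placeholder data
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Assumed new fit() arguments: where to store checkpoints, how often, and how many to keep
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    checkpoint_path="output/checkpoints",
    checkpoint_save_steps=1000,
    checkpoint_save_total_limit=3,
)
```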
- NoDuplicatesDataLoader: When using MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
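A short sketch of how the loader slots into training, using made-up paraphrase pairs and a deliberately tiny batch size:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("distilroberta-base")  # placeholder base checkpoint

# Made-up paraphrase pairs; note that the same question appears in two different pairs
train_examples = [
    InputExample(texts=["How do I learn Python?", "What is the best way to learn Python?"]),
    InputExample(texts=["How do I learn Python?", "How can I start learning Python quickly?"]),
    InputExample(texts=["What is the capital of France?", "Which city is France's capital?"]),
    InputExample(texts=["How tall is Mount Everest?", "What is the height of Mount Everest?"]),
]

# Guarantees that no sentence occurs twice within a batch, which would otherwise
# create false negatives for MultipleNegativesRankingLoss (tiny batch size for illustration)
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```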