- The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
- The dataset used to hold training InputExamples (`dataset.SentencesDataset`) now uses lazy tokenization, i.e., examples are tokenized only when they are needed for a batch. If you set `num_workers` to a positive integer in your `DataLoader`, tokenization will happen in a background thread. This substantially reduces the start-up time for training (see the training sketch after this list).
- `model.encode()` also uses a PyTorch Dataset + DataLoader. If you set `num_workers` to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora (see the encoding sketch after this list).
- Added functions and an example for multi-GPU encoding - This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet (see the multi-GPU sketch after this list).
- Removed the `parallel_tokenization` parameter from `encode()` & `SentencesDataset` - No longer needed with lazy tokenization and DataLoader worker threads.
- Smaller bugfixes
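
A minimal training sketch showing where `num_workers` goes. The model name and example pairs are placeholders, and `SentencesDataset(examples, model)` is assumed to follow the API as it existed around this release:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name

# Training examples are no longer tokenized up-front, but lazily per batch
train_examples = [
    InputExample(texts=['A man is eating food.', 'Someone is eating a meal.'], label=0.9),
    InputExample(texts=['A plane is taking off.', 'A dog runs through a field.'], label=0.1),
]
train_dataset = SentencesDataset(train_examples, model)

# num_workers > 0 moves tokenization into background DataLoader workers
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, num_workers=2)

train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```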
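An encoding sketch with background tokenization. The `num_workers` argument to `encode()` is assumed here from the note above and may not exist in other versions of the library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name

# Stand-in for a large corpus
corpus = ['This is sentence number {}.'.format(i) for i in range(50000)]

# num_workers > 0 lets tokenization run in the background while the model encodes
embeddings = model.encode(corpus, batch_size=32, num_workers=2, show_progress_bar=True)
print('Encoded {} sentences'.format(len(embeddings)))
```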
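A multi-GPU encoding sketch. The function names below (`start_multi_process_pool`, `encode_multi_process`, `stop_multi_process_pool`) are taken from the library's multi-process API in later versions and are assumed to match the functions added here:

```python
from sentence_transformers import SentenceTransformer

if __name__ == '__main__':  # required because worker processes are spawned
    model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name
    sentences = ['This is sentence number {}.'.format(i) for i in range(100000)]

    # Start one worker process per GPU (or pass a list such as ['cuda:0', 'cuda:1'])
    pool = model.start_multi_process_pool()

    # The corpus is split into chunks and encoded in parallel by the workers
    embeddings = model.encode_multi_process(sentences, pool)
    print('Encoded {} sentences'.format(len(embeddings)))

    # Clean up the worker processes
    model.stop_multi_process_pool(pool)
```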
Breaking changes:
- Renamed `evaluation.BinaryEmbeddingSimilarityEvaluator` to `evaluation.BinaryClassificationEvaluator`
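
Code that imported the old class only needs the new name. A small sketch with made-up data; the constructor arguments follow the current `BinaryClassificationEvaluator` API and are assumptions:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator  # formerly BinaryEmbeddingSimilarityEvaluator

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name

# Sentence pairs with binary labels: 1 = similar/duplicate, 0 = dissimilar
sentences1 = ['A man is eating food.', 'A plane is taking off.']
sentences2 = ['Someone is eating a meal.', 'A dog runs through a field.']
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels)
score = evaluator(model)
print('Score:', score)
```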