github UKPLab/sentence-transformers v0.3.1
v0.3.1 - Updates on Multilingual Training

latest releases: v2.7.0, v2.6.1, v2.6.0...
3 years ago

This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

The following classes/files have been changed:

  • datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configure to store previously computed sentence embeddings during training.

New evaluation files:

  • evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
  • evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
  • evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
  • evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

Bugfixes:

  • model.encode() failed to sort sentences by length. This function has been fixed to boost encoding speed by reducing overhead of padding tokens.

Don't miss a new sentence-transformers release

NewReleases is sending notifications on new releases.