Refactored Tokenization
- Faster tokenization: training and inference now use batched tokenization, so all sentences in a batch are tokenized simultaneously.
- SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
- If you use a custom torch Dataset class: the dataset class must now return InputExample objects instead of tokenized texts.
- Class SentenceLabelDataset has been updated to the new tokenization flow: it always returns two or more InputExamples with the same label.
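The new flow described above can be pictured with a small, library-free sketch: the dataset hands out raw examples, and tokenization happens once per batch at collation time. Everything here is an illustrative assumption: Example is a hypothetical stand-in for InputExample, toy_tokenize_batch is a whitespace tokenizer rather than the real one, and collate only mimics the idea, not the library's internals.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    """Hypothetical stand-in for InputExample: raw texts plus a label."""
    texts: List[str]
    label: float


class PairDataset:
    """Map-style dataset that returns raw examples instead of token ids."""

    def __init__(self, examples: List[Example]):
        self.examples = examples

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Example:
        return self.examples[idx]


def toy_tokenize_batch(sentences: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Toy tokenizer: map a whole batch to ids in one call and pad to the longest row."""
    ids = [[vocab.setdefault(tok, len(vocab)) for tok in s.lower().split()]
           for s in sentences]
    width = max(len(row) for row in ids)
    return [row + [0] * (width - len(row)) for row in ids]  # 0 is the padding id


def collate(batch: List[Example], vocab: Dict[str, int]) -> Tuple[List[List[List[int]]], List[float]]:
    """Group texts by position, tokenize each group once, and collect the labels."""
    num_texts = len(batch[0].texts)
    columns = [[ex.texts[i] for ex in batch] for i in range(num_texts)]
    features = [toy_tokenize_batch(column, vocab) for column in columns]
    labels = [ex.label for ex in batch]
    return features, labels


dataset = PairDataset([
    Example(texts=["My first sentence", "My second sentence"], label=0.8),
    Example(texts=["Another pair", "Unrelated sentence"], label=0.3),
])
vocab = {"<pad>": 0}
features, labels = collate([dataset[i] for i in range(len(dataset))], vocab)
```

Because the dataset no longer tokenizes anything itself, a custom dataset only has to implement __len__ and __getitem__ and return untokenized examples.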
Asymmetric Models
Added a new models.Asym class that allows different encoding of sentences based on some tag (e.g. query vs. paragraph). Minimal example:
from torch import nn
from sentence_transformers import SentenceTransformer, InputExample, models

# base_model is a placeholder for the name or path of your transformer model
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (Note: Mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.
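Until that documentation is available, the routing rule can be illustrated with a conceptual sketch in plain Python. It is not the library's implementation: route_by_key is a hypothetical helper, and the two lambdas are stand-ins for the d1 and d2 dense layers. It only shows the dispatch by key and the rejection of mixed inputs.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], str]


def route_by_key(inputs: List[Dict[str, str]], branches: Dict[str, Encoder]) -> List[str]:
    """Send every input through the branch named by the key shared by the whole batch."""
    if not inputs:
        return []
    keys = {key for item in inputs for key in item}
    if len(keys) != 1:
        raise ValueError(f"Mixed inputs are not allowed, got keys: {sorted(keys)}")
    key = next(iter(keys))
    if key not in branches:
        raise KeyError(f"No branch registered for key {key!r}")
    return [branches[key](item[key]) for item in inputs]


# Hypothetical stand-ins for the d1 and d2 dense layers.
branches: Dict[str, Encoder] = {
    "QRY": lambda text: f"d1({text})",
    "DOC": lambda text: f"d2({text})",
}

queries = route_by_key([{"QRY": "your query1"}, {"QRY": "your query2"}], branches)
documents = route_by_key([{"DOC": "your document text"}], branches)
```

In this sketch, a batch that combines 'QRY' and 'DOC' items raises a ValueError, mirroring the note that mixed inputs are not allowed when encoding.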
New Namespace & Models for Cross-Encoder
Cross-Encoder models are now hosted at https://huggingface.co/cross-encoder. New pre-trained models have also been added for NLI and QNLI.
Logging
Log messages now use a custom logger from logging thanks to PR #623. This allows you to choose which log messages you want to see from which components.
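A minimal sketch of such per-component filtering with the standard logging module follows. It assumes the library's loggers are named after its modules under the sentence_transformers namespace; the evaluation component name is used only as an illustration, so check the logger names that appear in your own output.

```python
import logging

# Print records to the console; loggers without their own level stay at WARNING.
logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    level=logging.WARNING,
)

# Assumption: library loggers are children of the "sentence_transformers" logger.
logging.getLogger("sentence_transformers").setLevel(logging.INFO)

# Illustrative component name: show only errors from this part of the library.
logging.getLogger("sentence_transformers.evaluation").setLevel(logging.ERROR)
```

Child loggers inherit the level of their closest configured ancestor, so a single call on a parent name controls a whole component.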
Unit tests
Many more unit tests have been added to cover the different components of the framework.