v1.0.0 - Name change, new models (XLNet, XLM), unified API for models and tokenizers, access to model internals, TorchScript


Name change: welcome PyTorch-Transformers 👾

pytorch-pretrained-bert => pytorch-transformers

Install with pip install pytorch-transformers

New models

New pretrained weights

The number of pretrained model weights grew from ten (in pytorch-pretrained-bert 0.6.2) to twenty-seven (in pytorch-transformers 1.0).

The newly added model weights are, in summary:

  • Two Whole-Word-Masking weights for Bert (cased and uncased)
  • Three fine-tuned Bert models (on SQuAD and MRPC)
  • One German model for Bert provided and trained by Deepset.ai (@tholor and @Timoeller) as detailed in their nice blogpost
  • One OpenAI GPT-2 model (medium size model)
  • Two models (base and large) for the newly added XLNet model
  • Eight models for the newly added XLM model

The documentation lists all the models with their shortcut names, and we are currently adding full details of the associated pretraining/fine-tuning parameters.
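
For instance, some of the new weights can be loaded directly by shortcut name. A minimal sketch (the identifiers shown here are illustrative; the documentation lists the exact names):

from pytorch_transformers import XLNetModel, XLMModel, GPT2Model

xlnet = XLNetModel.from_pretrained('xlnet-base-cased')
xlm = XLMModel.from_pretrained('xlm-mlm-en-2048')
gpt2 = GPT2Model.from_pretrained('gpt2-medium')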

New documentation

New documentation is currently being created at https://huggingface.co/pytorch-transformers/ and should be finalized over the coming days.

Standard API across models

See the readme for a quick tour of the API.

Main points:

  • All models now return tuples, with elements that vary depending on the model and the configuration. The docstrings and documentation list all the expected outputs in order.
  • All models can now return the full list of hidden-states (the embeddings output plus the output hidden-states of each layer).
  • All models can now return the full list of attention weights (one tensor of attention weights for each layer), as in the example below.

import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
# The last two elements of the output tuple are the hidden-states and the attentions
all_hidden_states, all_attentions = model(input_ids)[-2:]
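
For a 12-layer model like bert-base-uncased, all_hidden_states holds thirteen tensors (the embedding output plus one per layer), each of shape (batch_size, sequence_length, hidden_size), and all_attentions holds twelve tensors of shape (batch_size, num_heads, sequence_length, sequence_length).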

Standard API to add tokens to the vocabulary and the model

Using tokenizer.add_tokens() and tokenizer.add_special_tokens(), one can now easily add tokens to each model's vocabulary. The model's input embeddings can be resized accordingly, adding associated word embeddings (to be trained), using model.resize_token_embeddings(len(tokenizer)).

tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
# Resize the embedding matrix; the new rows are freshly initialized and need to be trained
model.resize_token_embeddings(len(tokenizer))
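
tokenizer.add_special_tokens() works similarly but takes a dictionary of special-token attributes, so the tokenizer also knows never to split these tokens. A minimal sketch, assuming the dictionary-based signature described in the tokenizer docstrings and using hypothetical <ENT> markers:

# 'additional_special_tokens' is one of the special-token attribute keys; see the docstrings
special_tokens_dict = {'additional_special_tokens': ['<ENT>', '</ENT>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))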

Serialization

The serialization methods have been standardized: if you were using any other serialization method before, you should switch to the new save_pretrained(save_directory) method.

# Save a model and its tokenizer
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

# Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
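
Both calls write everything needed to rebuild the objects (the model weights and configuration, the vocabulary and any added tokens) into the directory, so from_pretrained() can be pointed at that single path.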

TorchScript

All models are now compatible with TorchScript.

# model_class and pretrained_weights stand for any model/weights pair,
# e.g. model_class = BertModel, pretrained_weights = 'bert-base-uncased'
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
traced_model = torch.jit.trace(model, (input_ids,))
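
The traced model can then be saved and reloaded independently of the original Python class. A minimal sketch (the file name is illustrative):

traced_model.save('traced_bert.pt')
loaded_model = torch.jit.load('traced_bert.pt')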

Examples scripts

The example scripts have been refactored and gathered into three main examples (run_glue.py, run_squad.py and run_generation.py) which are common to several models and are designed to offer SOTA performance on the respective tasks while remaining clean starting points for designing your own scripts.

Other example scripts (like run_bertology.py) will be added in the coming weeks.

Breaking changes

The migration section of the readme lists the breaking changes when switching from pytorch-pretrained-bert to pytorch-transformers.

The main breaking change is that all models now return a tuple of results, as sketched below.
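
A minimal sketch of the change in calling convention, reusing the input_ids and BertModel setup from the examples above:

# pytorch-pretrained-bert (old): BertModel returned two values directly
# encoded_layers, pooled_output = model(input_ids)

# pytorch-transformers (new): every model returns a tuple
outputs = model(input_ids)
sequence_output = outputs[0]  # the first element is the main model output
pooled_output = outputs[1]    # further elements depend on the model and configuration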
