Release 0.4.3 includes a host of new features, including transformer-based embeddings (RoBERTa, XLNet, XLM, etc.), fine-tuneable FlairEmbeddings, crosslingual MUSE embeddings, new data loading/sampling methods, speed/memory optimizations, bug fixes and enhancements. It also begins a refactoring of interfaces that prepares Flair for more general applicability to other types of downstream tasks.
Embeddings
Transformer embeddings (#941 #972 #993)
Updates the old pytorch-pretrained-BERT library to the latest version of pytorch-transformers to support various new Transformer-based architectures for embeddings.
A total of 7 (new/updated) transformer-based embeddings can be used in Flair now:
from flair.embeddings import (
    BertEmbeddings,
    OpenAIGPTEmbeddings,
    OpenAIGPT2Embeddings,
    TransformerXLEmbeddings,
    XLNetEmbeddings,
    XLMEmbeddings,
    RoBERTaEmbeddings,
)
bert_embeddings = BertEmbeddings()
gpt1_embeddings = OpenAIGPTEmbeddings()
gpt2_embeddings = OpenAIGPT2Embeddings()
txl_embeddings = TransformerXLEmbeddings()
xlnet_embeddings = XLNetEmbeddings()
xlm_embeddings = XLMEmbeddings()
roberta_embeddings = RoBERTaEmbeddings()
Detailed benchmarks on the downsampled CoNLL-2003 NER dataset for English can be found in #873.
Crosslingual MUSE Embeddings (#853)
Use the new MuseCrosslingualEmbeddings class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes, the class first detects the language of the sentence to be embedded and then embeds it with the embeddings for that language. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may vary widely.
Here's how to embed:
import torch
from flair.data import Sentence
from flair.embeddings import MuseCrosslingualEmbeddings
# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()
# two sentences in different languages
sentence_1 = Sentence("This red shoe is new .")
sentence_2 = Sentence("Dieser rote Schuh ist neu .")
# language code is auto-detected
print(sentence_1.get_language_code())
print(sentence_2.get_language_code())
# embed sentences
embeddings.embed([sentence_1, sentence_2])
# print similarities of the word-aligned token pairs
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token_1, token_2 in zip(sentence_1, sentence_2):
    print(f"'{token_1.text}' and '{token_2.text}' similarity: {cos(token_1.embedding, token_2.embedding)}")
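The dispatch idea described above (detect the language first, then look up vectors from that language's table in a shared space) can be illustrated with a toy sketch. Everything in it is hypothetical: the two-dimensional vectors are invented, and `detect_language`, `embed_token` and `cosine` are illustrative helpers, not Flair's implementation or real MUSE vectors.

```python
import math

# Toy per-language lookup tables that share one vector space.
# The numbers are invented for illustration only.
SHARED_SPACE = {
    "en": {"red": [0.9, 0.1], "shoe": [0.1, 0.9]},
    "de": {"rote": [0.8, 0.2], "schuh": [0.2, 0.8]},
}


def detect_language(words):
    """Hypothetical detector: pick the language whose table covers the most words."""
    return max(
        SHARED_SPACE,
        key=lambda lang: sum(w.lower() in SHARED_SPACE[lang] for w in words),
    )


def embed_token(word, lang):
    """Look a word up in the detected language's table (zero vector if unknown)."""
    return SHARED_SPACE[lang].get(word.lower(), [0.0, 0.0])


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


english = ["red", "shoe"]
german = ["rote", "Schuh"]
lang_en = detect_language(english)
lang_de = detect_language(german)

# Compare word-aligned tokens across the two languages.
for w1, w2 in zip(english, german):
    sim = cosine(embed_token(w1, lang_en), embed_token(w2, lang_de))
    print(f"'{w1}' and '{w2}' similarity: {sim:.3f}")
```

Because both tables live in the same space, translations end up closer to each other than to unrelated words, which is what lets a model trained on one language transfer to another.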
FastTextEmbeddings (#879)
Adds FastTextEmbeddings, which can handle out-of-vocabulary (OOV) words. Be warned, though, that these embeddings are huge. BytePairEmbeddings are much smaller and reportedly of similar quality, so it is probably advisable to use those instead.
Fine-tuneable FlairEmbeddings (#922)
You can now fine-tune FlairEmbeddings on downstream tasks. To fine-tune an existing LM, simply pass the fine_tune parameter to the FlairEmbeddings constructor, like this:
embeddings = FlairEmbeddings('news-forward', fine_tune=True)
You can also use this option to task-train a wholly new language model by passing an empty LanguageModel and the fine_tune parameter to the FlairEmbeddings constructor, like this:
from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.models import LanguageModel
# make an empty language model
language_model = LanguageModel(
    Dictionary.load('chars'),
    is_forward_lm=True,
    hidden_size=256,
    nlayers=1)
# init FlairEmbeddings to task-train this model
embeddings = FlairEmbeddings(language_model, fine_tune=True)
Optimizations
Automatic mixed precision support (#934)
Mixed precision training can significantly speed up training. It can now be enabled by setting use_amp=True in the trainer classes. For instance, for training language models you can do:
# train your language model
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model',
              sequence_length=256,
              mini_batch_size=256,
              max_epochs=10,
              use_amp=True)
In our experiments, we saw a 3x speedup when training large language models, though results vary depending on your model size and experimental setup.
Control memory / speed tradeoff during training (#891 #809)
This release introduces the embeddings_storage_mode parameter to the ModelTrainer class and predict() methods. This parameter can be one of 'none', 'cpu' and 'gpu' and allows you to control the tradeoff between memory usage and speed during training:
- If set to 'none', all embeddings are deleted after usage. This has the lowest memory requirements but means that embeddings need to be recomputed at each epoch of training, potentially causing a slowdown.
- If set to 'cpu', all embeddings are moved to CPU memory after usage. During training, this means that they only need to be moved back to GPU for the forward pass and are not recomputed, so in many cases this is faster, but it requires more CPU memory.
- If set to 'gpu', all embeddings stay in GPU memory after computation. This eliminates memory shuffling during training, causing a speedup. However, this option requires enough GPU memory to be available for all embeddings of the dataset.
To use this option during training, simply set the parameter:
# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/your/model",
    embeddings_storage_mode='gpu',
)
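The tradeoff between the three storage modes can be made concrete with a small simulation. This is only a sketch under stated assumptions: `ToyEmbedder` and `simulate_training` are hypothetical names, plain dictionaries stand in for CPU and GPU memory, and no real tensors or devices are involved.

```python
class ToyEmbedder:
    """Counts how often an (expensive) embedding has to be computed."""

    def __init__(self):
        self.compute_calls = 0

    def compute(self, sentence):
        self.compute_calls += 1
        return [float(len(word)) for word in sentence.split()]


def simulate_training(sentences, epochs, storage_mode):
    """Return (number of embedding computations, number of CPU-to-GPU transfers)."""
    embedder = ToyEmbedder()
    cpu_store, gpu_store = {}, {}  # stand-ins for CPU and GPU memory
    transfers = 0
    for _ in range(epochs):
        for sentence in sentences:
            if sentence in gpu_store:
                embedding = gpu_store[sentence]  # already on the "GPU"
            elif sentence in cpu_store:
                embedding = cpu_store[sentence]  # must be moved back for the forward pass
                transfers += 1
            else:
                embedding = embedder.compute(sentence)
            # ... the forward/backward pass would use `embedding` here ...
            if storage_mode == "gpu":
                gpu_store[sentence] = embedding
            elif storage_mode == "cpu":
                cpu_store[sentence] = embedding
            # storage_mode == "none": nothing is kept, so the next epoch recomputes
    return embedder.compute_calls, transfers


data = ["the red shoe", "a new release"]
for mode in ("none", "cpu", "gpu"):
    print(mode, simulate_training(data, epochs=3, storage_mode=mode))
```

In this simulation 'none' recomputes every embedding in every epoch, while 'cpu' and 'gpu' compute each embedding once; 'cpu' additionally pays one transfer per reuse, and 'gpu' pays nothing extra but holds everything in the scarcer memory.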
This release also removes the FlairEmbeddings-specific disk-caching mechanism. In the future, a more general caching mechanism applicable to all embedding types may be added as a fourth memory management option.
Speed-ups on in-memory datasets (#792)
A new DataLoader abstract base class used in Flair will speed up data loading for in-memory datasets.
Refactoring of interfaces (#891 #843)
This release also slims down the interfaces of flair.nn.Model and adds a new DataPoint interface that is currently implemented by the Token and Sentence classes. The idea is to widen the applicability of Flair to other data types and other tasks. In the future, the DataPoint interface will, for example, also be implemented by an Image object, and new downstream tasks will be added to Flair.
The release also slims down the evaluate() method in the flair.nn.Model interface to take a DataLoader instead of a group of parameters, and refactors the logging header logic. Both refactorings prepare for adding new downstream tasks to Flair in the near future.
Other features
Training Classifiers with CSV files (#826 #952 #967)
Adds the CSVClassificationCorpus so you can train classifiers directly from CSVs instead of first having to convert to FastText format. To load a CSV, you need to pass a column_name_map (like in ColumnCorpus), which indicates which column(s) in the CSV hold the text and which column(s) hold the label(s):
corpus = CSVClassificationCorpus(
    # path to the data folder containing train / test / dev files
    data_folder='path/to/data',
    # indicates which columns are text and labels
    column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"},
    # if CSV has a header, you can skip it
    skip_header=True)
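To show what such a column_name_map expresses, here is a minimal standard-library sketch that applies the same mapping to an in-memory CSV. The helper `read_classification_rows`, the sample data, and the rule that label columns are those whose mapped name starts with "label" are assumptions made for illustration; this is not Flair's loader.

```python
import csv
import io


def read_classification_rows(csv_file, column_name_map, skip_header=False):
    """Yield (text, labels) pairs from a CSV, using a column-index-to-name map."""
    reader = csv.reader(csv_file)
    if skip_header:
        next(reader, None)  # drop the header row if there is one
    text_columns = [i for i, name in column_name_map.items() if name == "text"]
    # Assumption: every mapped name that starts with "label" is a label column.
    label_columns = [i for i, name in column_name_map.items() if name.startswith("label")]
    for row in reader:
        text = " ".join(row[i] for i in text_columns)
        labels = [row[i] for i in label_columns]
        yield text, labels


# Invented sample data: column 4 holds the text, columns 1 and 2 hold labels.
sample = io.StringIO(
    "id,topic,subtopic,date,body\n"
    "1,sports,cycling,2019-07-26,Pinot abandons the race\n"
    "2,tech,nlp,2019-08-26,New embeddings released\n"
)
rows = list(
    read_classification_rows(
        sample,
        column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"},
        skip_header=True,
    )
)
for text, labels in rows:
    print(labels, text)
```

Columns that are not mentioned in the map (here the id and date columns) are simply ignored.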
Data sampling (#908)
We added the first (of many) data samplers that can be passed to the ModelTrainer to influence training. The ImbalancedClassificationDatasetSampler, for instance, will upsample rare classes and downsample common classes in a classification dataset. This may help with imbalanced datasets. Call it like this:
# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/folder',
    learning_rate=0.1,
    mini_batch_size=32,
    sampler=ImbalancedClassificationDatasetSampler,
)
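One common way to upsample rare classes and downsample common ones is to draw examples with a probability inversely proportional to the frequency of their class. The sketch below illustrates that idea with the standard library only; `imbalanced_sample` is a hypothetical helper, and the inverse-frequency weighting is an assumption about the general technique rather than a description of the sampler's internals.

```python
import random
from collections import Counter


def imbalanced_sample(labels, num_samples, seed=0):
    """Draw example indices with replacement, weighting each by 1 / class frequency."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# A toy dataset with a 90/10 class imbalance.
labels = ["common"] * 90 + ["rare"] * 10
indices = imbalanced_sample(labels, num_samples=10_000)
drawn = Counter(labels[i] for i in indices)
print(drawn)
```

Since every class receives the same total weight, the drawn examples are split roughly evenly between the two classes even though the underlying data is not.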
There are also two experimental chunk samplers (ChunkSampler and ExpandingChunkSampler) that split a dataset into chunks and shuffle them. This preserves some ordering of the original data while also randomizing the data.
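The chunking idea can be sketched in a few lines. The example assumes that chunks are kept intact internally and that only their order is shuffled; `chunk_shuffle` and its parameters are hypothetical, and the actual samplers may differ in detail.

```python
import random


def chunk_shuffle(indices, chunk_size, seed=0):
    """Split indices into consecutive chunks, shuffle the chunks, and flatten again."""
    chunks = [indices[i:i + chunk_size] for i in range(0, len(indices), chunk_size)]
    random.Random(seed).shuffle(chunks)  # only the chunk order changes
    return [index for chunk in chunks for index in chunk]


order = chunk_shuffle(list(range(12)), chunk_size=4)
print(order)
```

Within each chunk the original order survives, so neighbouring examples mostly stay neighbours, while the position of each chunk in the epoch is random.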
Visualization
- Adds HTML visualization of sequence labeling (#933). Call it like this:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.visual.ner_html import render_ner_html
tagger = SequenceTagger.load('ner')
sentence = Sentence(
    "Thibaut Pinot's challenge ended on Friday due to injury, and then Julian Alaphilippe saw "
    "his lead fall away. The BBC's Hugh Schofield in Paris reflects on 34 years of hurt."
)
tagger.predict(sentence)
html = render_ner_html(sentence)
with open("sentence.html", "w") as writer:
    writer.write(html)
- Plotter now returns images for use in IPython notebooks (#943)
- Initial TensorBoard support (#924)
- Add pointer to Flair Visualizer (#1014)
Additional parameterization options
- CharacterEmbeddings now lets you specify the number of hidden states and the embedding size (#834)
    embedding = CharacterEmbeddings(char_embedding_dim=64, hidden_size_char=64)
- Adds configuration option for minimal learning rate stopping criterion (#871)
- num_workers is now a parameter of LanguageModelTrainer (#962)
Bug fixes / enhancements
- Updates old pretrained models to remove old bugs / performance issues (#1017)
- Fix error in RNN initialization in DocumentRNNEmbeddings (#793)
- ELMoEmbeddings now use flair.device param (#825)
- Fix download of TREC_6 dataset (#896)
- Fix download of UD_GERMAN-HDT (#980)
- Fix download of WikiNER_German (#1006)
- Fix error in ColumnCorpus in which words that begin with hashtags were skipped as comments (#956)
- Fix max_tokens_per_doc param in ClassificationCorpus (#991)
- Simplify split rule in ColumnCorpus (#990)
- Fix import error message for ELMoEmbeddings (#1019)
- References to Persian language unified across embeddings (#773)
- Updates most pre-trained models fixing quality issues / bugs (#800)
- Clarifications in documentation (#803 #860 #868)
- Fixes infinite loop for tokens without startpos (#1030)
Enhancements
- Adds a learnable initial hidden state to SequenceTagger (#899)
- Now keeps order of sentences in mini-batch when embedding (#866)
- SequenceTagger now optionally returns a distribution of tag probabilities over all classes (#782 #949 #1016)
- The model trainer now outputs a 'test.tsv' file that contains the predictions of the final model when done training (#771)
- Releases logging handler when finishing training a model (#799)
- Fixes bad_epochs in training logs and no longer evaluates on test data at each epoch by default (#818)
- Convenience method to remove all empty sentences from a corpus (#795)