Release 0.4.4 introduces dramatic improvements in inference speed for taggers (thanks to many contributions by @pommedeterresautee), Flair embeddings in 300 languages (thanks @stefan-it), modular tokenization and many new features and refactorings.

Speed optimizations

@pommedeterresautee contributed many refactorings that speed up inference in the sequence tagger (#1038 #1053 #1068 #1093 #1130), Flair embeddings (#1074 #1095 #1107 #1132 #1145), word embeddings (#1084), embeddings memory management (#1082 #1117) and classification (#1187), along with general optimizations (#1112).

The combined improvements increase inference speed by a factor of 2-3!

New features

Modular tokenization (#1022)

You can now pass custom tokenizers to Sentence objects and Dataset loaders, so you are no longer limited to the included segtok library. To use a different tokenizer, implement a tokenizer method and pass it in. Built-in support currently exists for whitespace tokenization, segtok tokenization and Japanese tokenization with mecab (requires mecab to be installed). We expect support for additional external tokenizers to be added in the future.

For instance, if you wish to use Japanese tokenization performed by mecab, you can instantiate the Sentence object like this:

from flair.data import build_japanese_tokenizer
from flair.data import Sentence

# instantiate Japanese tokenizer
japanese_tokenizer = build_japanese_tokenizer()

# init sentence and pass this tokenizer
sentence = Sentence("私はベルリンが好きです。", use_tokenizer=japanese_tokenizer)
print(sentence)
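
As a rough sketch of what a custom tokenizer could look like, the snippet below writes a whitespace tokenizer as a plain function. It assumes a tokenizer is simply a callable that takes a string and returns a list of tokens. `SimpleToken` and `whitespace_tokenizer` are hypothetical stand-ins for illustration, not Flair classes, so check the Flair source for the exact token type that `use_tokenizer` expects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleToken:
    # hypothetical stand-in for a token: surface text plus character offset
    text: str
    start_pos: int


def whitespace_tokenizer(text: str) -> List[SimpleToken]:
    """Split text on whitespace and record where each token starts."""
    tokens = []
    offset = 0
    for word in text.split():
        # search from the current offset so repeated words get distinct positions
        start = text.index(word, offset)
        tokens.append(SimpleToken(text=word, start_pos=start))
        offset = start + len(word)
    return tokens


tokens = whitespace_tokenizer("I love  Berlin .")
print([(t.text, t.start_pos) for t in tokens])
```

Keeping the character offsets alongside the text is a design choice that lets downstream code map tokens back to the original string.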

Flair Embeddings for 300 languages (#1146)

Thanks to @stefan-it, there is now a massively multilingual Flair embeddings model that covers 300 languages. See #1099 for more info on these embeddings and this repo for more details.

This replaces the old multilingual Flair embeddings that were trained for 6 languages. Load them with:

from flair.embeddings import FlairEmbeddings

embeddings_fw = FlairEmbeddings('multi-forward')
embeddings_bw = FlairEmbeddings('multi-backward')

Multilingual Character Dictionaries (#1157)

Adds two multilingual character dictionaries computed by @stefan-it.

Load them with:

from flair.data import Dictionary

dictionary = Dictionary.load('chars-large')
print(len(dictionary.idx2item))

dictionary = Dictionary.load('chars-xl')
print(len(dictionary.idx2item))

Batch-growth annealing (#1138)

The paper Don't Decay the Learning Rate, Increase the Batch Size makes the case for increasing the batch size over time instead of annealing the learning rate.

This version adds the possibility to have arbitrarily large mini-batch sizes with an accumulating gradient strategy. It introduces the parameter mini_batch_chunk_size that you can set to break down large mini-batches into smaller chunks for processing purposes.

So let's say you want to have a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set large mini-batch size
    mini_batch_size=128,
    # set chunk size to lower memory requirements
    mini_batch_chunk_size=32,
)
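
To illustrate why chunking does not change the result, here is a minimal, framework-free sketch of gradient accumulation on a toy one-parameter model. The function names are hypothetical and this is not Flair's training loop. It only shows that gradients accumulated over chunks of 32 match the gradient of the full mini-batch of 128.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float]


def gradient_sum(w: float, chunk: List[Sample]) -> float:
    """Summed gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in chunk)


def full_batch_gradient(w: float, batch: List[Sample]) -> float:
    """Mean gradient computed over the whole mini-batch at once."""
    return gradient_sum(w, batch) / len(batch)


def chunked_gradient(w: float, batch: List[Sample], chunk_size: int) -> float:
    """Mean gradient accumulated chunk by chunk, touching only chunk_size samples at a time."""
    accumulated = 0.0
    for start in range(0, len(batch), chunk_size):
        chunk = batch[start:start + chunk_size]
        accumulated += gradient_sum(w, chunk)
    # a single parameter update would follow here, using the accumulated gradient
    return accumulated / len(batch)


# toy data: 128 samples of y = 3x, processed in chunks of 32
batch = [(float(i), 3.0 * i) for i in range(128)]
g_full = full_batch_gradient(0.5, batch)
g_chunked = chunked_gradient(0.5, batch, chunk_size=32)
print(math.isclose(g_full, g_chunked))
```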

Because we can now raise the mini-batch size arbitrarily, we can apply the annealing strategy from the paper above. Do it like this:

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set initial mini-batch size
    mini_batch_size=32,
    # choose batch growth annealing
    batch_growth_annealing=True,
)
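
The idea behind batch growth can be sketched without any framework. The function below is a hypothetical illustration, not Flair's implementation. It assumes the mini-batch size doubles whenever the dev score has not improved for `patience` epochs, which is the point where a conventional schedule would reduce the learning rate instead.

```python
from typing import List


def batch_growth_schedule(
    dev_scores: List[float],
    initial_batch_size: int = 32,
    patience: int = 2,
    growth_factor: int = 2,
) -> List[int]:
    """Return the mini-batch size used in each epoch.

    After `patience` consecutive epochs without a new best dev score,
    the batch size is multiplied by `growth_factor`.
    """
    batch_size = initial_batch_size
    best = float("-inf")
    bad_epochs = 0
    sizes = []
    for score in dev_scores:
        # this epoch was trained with the current batch size
        sizes.append(batch_size)
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            # plateau detected: grow the batch instead of shrinking the learning rate
            batch_size *= growth_factor
            bad_epochs = 0
    return sizes


scores = [0.70, 0.75, 0.75, 0.74, 0.80, 0.79, 0.79]
print(batch_growth_schedule(scores))
```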

Document-level sequence labeling (#1194)

Introduces the option of reading entire documents into one Sentence object for sequence labeling. This option is now supported for the CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets, which indicate document boundaries.

Here's how to train a model on CoNLL-03 on the document level:

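from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# read CoNLL-03 with document_as_sequence=True
corpus = CONLL_03(in_memory=True, document_as_sequence=True)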

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# init simple tagger with GloVe embeddings
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)
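
To make the difference between sentence-level and document-level reading concrete, here is a small standalone sketch. It is not Flair's loader, and `read_conll` is a hypothetical helper. It relies only on the CoNLL-03 convention that blank lines separate sentences and `-DOCSTART-` lines mark document boundaries.

```python
from typing import List

DOC_MARKER = "-DOCSTART-"


def read_conll(lines: List[str], document_as_sequence: bool = False) -> List[List[str]]:
    """Group the tokens of column-formatted lines into sequences.

    Blank lines end a sentence and a -DOCSTART- line begins a new document.
    With document_as_sequence=True, blank lines are ignored and only
    document markers split the data.
    """
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        is_blank = not fields
        is_doc_start = bool(fields) and fields[0] == DOC_MARKER
        if is_doc_start or (is_blank and not document_as_sequence):
            # boundary reached: close the sequence collected so far
            if current:
                sequences.append(current)
            current = []
        elif not is_blank:
            # the first column holds the token text
            current.append(fields[0])
    if current:
        sequences.append(current)
    return sequences


lines = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "",
    "Peter NNP I-NP I-PER",
    "",
    "-DOCSTART- -X- O O",
    "",
    "Berlin NNP I-NP I-LOC",
]
print(read_conll(lines))
print(read_conll(lines, document_as_sequence=True))
```

Document-level sequences are much longer than sentences, which is why the training example above uses a very small mini-batch size.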

Option to evaluate on training split (#1202)

Previously, the ModelTrainer only allowed monitoring of the dev and test splits during training. Now you can also monitor the train split to better check whether your model is overfitting.

Support for Danish tagging (#1183)

Adds support for Danish POS and NER thanks to @AmaliePauli!

Use like this:

from flair.data import Sentence
from flair.models import SequenceTagger

# example sentence
sentence = Sentence("København er en fantastisk by .")

# load Danish NER model and predict
ner_tagger = SequenceTagger.load('da-ner')
ner_tagger.predict(sentence)

# print annotations (NER)
print(sentence.to_tagged_string())

# load Danish POS model and predict
pos_tagger = SequenceTagger.load('da-pos')
pos_tagger.predict(sentence)

# print annotations (NER + POS)
print(sentence.to_tagged_string())

Support for DistilBERT embeddings (#1044)

You can use them like this:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("distilbert-base-uncased")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
    print(token.embedding.shape)

MongoDataset for reading text classification data from a Mongo database (#1192)

Adds the option of reading data from MongoDB. See this documentation on how to use this feature.

Feidegger corpus (#1199)

Adds a dataset downloader for the Feidegger corpus consisting of text-image pairs. Instantiate the corpus like this:

from flair.datasets import FeideggerCorpus

# instantiate Feidegger corpus
corpus = FeideggerCorpus()

# print a text-image pair
print(corpus.train[0])

Refactorings

Refactor checkpointing mechanism (#1101)

Refactored the checkpointing mechanism and slimmed down interfaces / code required to load checkpoints.

In detail:

The methods save_checkpoint and load_checkpoint are no longer part of the flair.nn.Model interface. Instead, saving and restoring checkpoints is now (fully) performed by the ModelTrainer.
The optimizer state and scheduler state are removed from the ModelTrainer constructor since they are no longer required here.
Loading a checkpoint is now one line of code (previously two lines).

# 1. initialize trainer as always with a model and a corpus
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(model, corpus)

# 2. train your model for 2 epochs
trainer.train(
    'experiment/folder',
    max_epochs=2,
    # example checkpointing
    checkpoint=True,
)

# 3. load last checkpoint with one line of code
trainer = ModelTrainer.load_checkpoint('experiment/folder/checkpoint.pt', corpus)

# 4. continue training for 2 extra epochs
trainer.train('experiment/folder_2', max_epochs=4)

Refactor data sampling during training (#1154)

Adds a FlairSampler interface to better enable passing custom samplers to the ModelTrainer.

For instance, if you want to always shuffle your dataset in chunks of 5 to 10 sentences, you provide a sampler like this:

from flair.samplers import ChunkSampler

# your trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# execute training run
trainer.train('path/to/experiment/folder',
              max_epochs=150,
              # sample data in chunks of 5 to 10
              sampler=ChunkSampler(block_size=5, plus_window=5)
              )
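
For intuition, chunk-wise shuffling can be sketched in a few lines of plain Python. This is a hypothetical illustration rather than Flair's ChunkSampler code. It assumes one block length between `block_size` and `block_size + plus_window` is drawn per pass over the data, and that whole blocks are then shuffled while the order inside each block is kept.

```python
import random
from typing import List, Optional


def chunk_shuffle(
    num_samples: int,
    block_size: int = 5,
    plus_window: int = 5,
    seed: Optional[int] = None,
) -> List[int]:
    """Return sample indices shuffled in contiguous blocks."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    # pick a block length between block_size and block_size + plus_window
    size = block_size + rng.randint(0, plus_window)
    # cut the index list into contiguous blocks of that length
    blocks = [indices[i:i + size] for i in range(0, num_samples, size)]
    # shuffle whole blocks, keeping the order inside each block
    rng.shuffle(blocks)
    return [i for block in blocks for i in block]


order = chunk_shuffle(40, block_size=5, plus_window=5, seed=0)
print(order)
```

Shuffling in blocks keeps neighbouring sentences together, which can matter when consecutive sentences share context.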

Other refactorings

Switch everything to batch first mode (#1077)
Refactor classification to be more consistent with SequenceTagger (#1151)
PyTorch-Transformers -> Transformers (#1163)
In-place transpose of tensors (#1047)

Enhancements

Documentation fixes (#1045 #1098 #1121 #1157 #1160 #1168)

Add option to set `rnn_type` used in `SequenceTagger` (#1113)

Accept string as input in NER predict (#1142)

Example usage:

from flair.models import SequenceTagger

# init tagger
tagger = SequenceTagger.load('ner')

# predict over list of strings
sentences = tagger.predict(
    [
        'George Washington went to Berlin .',
        'George Berlin lived in Washington .'
    ]
)

# output predictions
for sentence in sentences:
    print(sentence.to_tagged_string())

Enable One-hot Embeddings of other Tags (#1191)

Bug fixes

Fix the learning rate finder (#1119)
Fix OneHotEmbeddings on Cuda (#1147)
Fix encoding error in CSVClassificationDataset (#1055)
Fix encoding errors related to old windows chars (#1135)
Fix length error in CharacterEmbeddings (#1088)
Fix tokenizer inserting empty tokens into Sentence objects (#1226)
Ensure StackedEmbeddings always has the same embedding order (#1114)
Use $HOME instead of ~ for cache_root (#1134)

flairNLP/flair v0.4.4 Release 0.4.4 on GitHub