This is an enhancement release that slims down Flair for quicker/easier installation and smaller library size. It also makes Flair compatible with torch 1.4.0 and adds enhancements that reduce model size and improve runtime speed for some embeddings. New features include the ability to steer the precision/recall tradeoff during training of models and support for CamemBERT embeddings.

Memory, Runtime and Dependency Improvements

Slim down dependency tree (#1296 #1299 #1335 #1336)

We want to keep Flair's list of dependencies small to avoid errors like #1245 and to keep the library small and quick to set up. So we removed dependencies that were each used for only one particular feature, namely:

ipython and ipython-genutils, only used for visualization settings in IPython notebooks
tiny_tokenizer, used for Japanese tokenization (replaced with instructions for how to install for all users who want to use Japanese tokenizers)
pymongo, used for MongoDB datasets (replaced with instructions for how to install for all users who want to use MongoDB datasets)
torchvision, now only loaded when needed

We also relaxed version requirements for easier installation on Google Colab (#1335 #1336).
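
Loading a dependency only when needed, as described above for torchvision and the now-optional packages, is commonly done by importing inside the code path that uses it. The following is a minimal sketch of that pattern, not Flair's actual implementation; the helper names require_optional and load_mongo_client are hypothetical.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional module at the point of use.

    Hypothetical helper: raises ImportError with an install hint
    if the module is not available.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is not installed; {install_hint}"
        ) from err


def load_mongo_client(uri):
    # pymongo is imported only when a MongoDB dataset is actually requested,
    # so users who never touch MongoDB do not need it installed.
    pymongo = require_optional("pymongo", 'install it with "pip install pymongo"')
    return pymongo.MongoClient(uri)
```

With this pattern, importing the library itself stays cheap, and a missing optional package only surfaces as an error (with installation instructions) when the corresponding feature is used.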

Dramatic speed-up of BERT embeddings (#1308)

@shoarora optimized the BERTEmbeddings implementation by removing redundant calls. This was shown to lead to dramatic speed improvements.

Reduce size of models that use WordEmbeddings (#1315)

@timnon added a method to replace the word embeddings in a trained model with an SQLite database, which dramatically reduces memory usage. The new class WordEmbeddingsStore can replace a WordEmbeddings instance in a Flair model via duck-typing. With it, @timnon reduced our NER server's memory consumption from 6 GB to 600 MB (a 10x decrease) by adding a few lines of code. It can be tested using the following lines (also in the docstring). First create a headless version of a model without word embeddings:

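import pickle
from flair.inference_utils import WordEmbeddingsStore
from flair.models import SequenceTagger
# load the full model and build the sqlite stores for its word embeddings
tagger = SequenceTagger.load("multi-ner-fast")
WordEmbeddingsStore.create_stores(tagger)
# save the headless model; the with-block closes the file reliably
with open("multi-ner-fast-headless.pickle", "wb") as f:
    pickle.dump(tagger, f)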

Then, to run the stored headless model without word embeddings, use:

import pickle
from flair.data import Sentence
from flair.inference_utils import WordEmbeddingsStore
# load the headless model and attach the sqlite-backed stores
with open("multi-ner-fast-headless.pickle", "rb") as f:
    tagger = pickle.load(f)
WordEmbeddingsStore.load_stores(tagger)
text = "Schade um den Ameisenbären. Lukas Bärfuss veröffentlicht Erzählungen aus zwanzig Jahren."
sentence = Sentence(text)
tagger.predict(sentence)
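
To illustrate why moving word vectors into SQLite saves memory, here is a minimal sketch of a disk-backed vector lookup using only the Python standard library. It is not the WordEmbeddingsStore implementation: the class name SqliteVectorStore, its methods, the 32-bit float packing, and the zero vector returned for unknown words are all assumptions made for illustration.

```python
import sqlite3
import struct


class SqliteVectorStore:
    """Hypothetical disk-backed replacement for an in-memory embedding table."""

    def __init__(self, path, dim):
        self.dim = dim
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)"
        )

    def add(self, word, vector):
        # Pack the vector as 32-bit floats so each row is a compact blob.
        blob = struct.pack(f"{self.dim}f", *vector)
        self.db.execute(
            "INSERT OR REPLACE INTO vectors VALUES (?, ?)", (word, blob)
        )
        self.db.commit()

    def lookup(self, word):
        # Only the requested row is read; the full table never sits in RAM.
        row = self.db.execute(
            "SELECT vec FROM vectors WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            return [0.0] * self.dim  # assumption: unknown words map to zeros
        return list(struct.unpack(f"{self.dim}f", row[0]))


store = SqliteVectorStore(":memory:", dim=3)
store.add("berlin", [0.5, -1.0, 0.25])
print(store.lookup("berlin"))
```

Because lookups fetch one row at a time, memory use depends on the sentences being processed rather than on the vocabulary size. An object with the same lookup interface as the in-memory embeddings can then be swapped in via duck-typing.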

New Features

Prioritize precision/recall or specific classes during training (#1345)

@klasocki added ways to steer the precision/recall tradeoff during training of models, as well as prioritize certain classes. This option was added to the SequenceTagger and the TextClassifier.

You can steer the precision/recall tradeoff with the beta parameter, which indicates how many times more important recall is than precision. If you set beta=0.5, precision becomes twice as important as recall. If you set beta=2, recall becomes twice as important as precision. Do it like this:

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    beta=0.5)
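
To see how beta shifts the balance, the sketch below computes the standard F-beta score for a fixed precision and recall. It assumes that the beta parameter above corresponds to the beta of this standard score; the precision and recall values are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A hypothetical model with high precision but low recall.
p, r = 0.8, 0.4
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

For this high-precision model the score is 0.667 at beta=0.5, 0.533 at beta=1 and 0.444 at beta=2, so a smaller beta rewards precision and a larger beta penalizes the low recall.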

If you want to prioritize classes, you can pass a loss_weights dictionary to the model classes. For instance, to prioritize learning the NEGATIVE class in a sentiment classifier, do:

tagger = TextClassifier(
    document_embeddings=embeddings,
    label_dictionary=tag_dictionary,
    loss_weights={'NEGATIVE': 10.})

which will increase the importance of class NEGATIVE by a factor of 10.
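
One common way to realize such class weights is to scale each example's loss by the weight of its gold class. The sketch below shows this for a negative log-likelihood loss in plain Python. It is an illustration, not Flair's code: the function weighted_nll is hypothetical, and both the default weight of 1.0 for unlisted classes and the normalization by the total weight are assumptions.

```python
import math


def weighted_nll(predictions, gold_labels, loss_weights):
    """Weighted negative log-likelihood over a batch.

    predictions: list of dicts mapping class name to predicted probability
    gold_labels: list of gold class names
    loss_weights: dict mapping class name to weight (assumed default 1.0)
    """
    total, weight_sum = 0.0, 0.0
    for probs, gold in zip(predictions, gold_labels):
        w = loss_weights.get(gold, 1.0)
        total += -w * math.log(probs[gold])
        weight_sum += w
    return total / weight_sum


predictions = [
    {"POSITIVE": 0.9, "NEGATIVE": 0.1},  # confident and correct
    {"POSITIVE": 0.6, "NEGATIVE": 0.4},  # wrong on a NEGATIVE example
]
gold_labels = ["POSITIVE", "NEGATIVE"]

print(weighted_nll(predictions, gold_labels, {}))
print(weighted_nll(predictions, gold_labels, {"NEGATIVE": 10.0}))
```

With the weight in place, the misclassified NEGATIVE example dominates the batch loss, so the optimizer is pushed harder to correct errors on that class.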

CamemBERT Embeddings (#1297)

@stefan-it added support for the recently proposed French language model: CamemBERT.

Thanks to the awesome 🤗/Transformers library, CamemBERT can be used in Flair like in this example:

from flair.data import Sentence
from flair.embeddings import CamembertEmbeddings

embedding = CamembertEmbeddings()

sentence = Sentence("J'aime le camembert !")
embedding.embed(sentence)

for token in sentence.tokens:
    print(token.embedding)

Bug fixes and enhancements

Fix new RNN format for torch 1.4.0 (#1360 #1382)
Fix memory issue in PooledFlairEmbeddings (#1337 #1339)
Correct subtoken mapping function for GPT-2 and RoBERTa (#1242)
Update the transformers library to the latest 2.3 version (#1333)
Add staticmethod decorator to some functions (#1257)
Add a warning if validation data is too small (#1115)
Remove leftover printline from MUSE embeddings (#1224)
Correct generate_text() UTF-8 conversion (#1238)
Clarify documentation (#1295 #1332)
Replace sklearn by scikit-learn (#1321)
Fix off-by-one error in progress logging (#1334)
Fix typo and annotation (#1341)
Various improvements (#1347)
Make load_big_file work with read-only file (#1353)
Rename tiny_tokenizer to konoha (#1363)
Make test loss plotting optional (#1372)
Add pretty print function for Dictionary (#1375)

flairNLP/flair v0.4.5 Release 0.4.5 on GitHub