Release 0.5.1 with new features, datasets and models, including support for sentence transformers, transformer embeddings for arbitrary length sentences, new Dutch NER models, new tasks and more refactorings of evaluation and training routines to better organize the code!

New Features and Enhancements:

TransformerWordEmbeddings can now process long sentences (#1680)

Adds a heuristic that works around the maximum sequence length of some transformer embeddings. If you set allow_long_sentences=True, you can now embed sequences of arbitrary length, like so:

from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    allow_long_sentences=True,  # set to True to enable this feature
)
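
The release notes do not spell out the heuristic. One common way to work around a fixed maximum sequence length is to split the subtoken sequence into overlapping windows, embed each window separately, and keep each position's embedding from a window where it has context on both sides. The sketch below illustrates only the window-splitting step in plain Python. The names split_into_windows, max_len and stride are hypothetical, and this is not claimed to be Flair's implementation.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split token_ids into windows of at most max_len items.

    Consecutive windows overlap by `stride` items, so positions away
    from the sequence edges are seen with context on both sides.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")

    step = max_len - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window reached the end of the sequence
        start += step
    return windows


# a "sentence" of 10 subtoken ids, a model limit of 4 and an overlap of 2
windows = split_into_windows(list(range(10)), max_len=4, stride=2)
print(windows)
```

A larger overlap gives each position more context at the cost of embedding more windows.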

Setting random seeds (#1671)

It is now possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

# set a random seed
import random
from flair.datasets import SENTEVAL_MR

random.seed(4)

# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test 
print(corpus.dev[0])
print(corpus.test[0])
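
Why seeding makes the downsampled corpus reproducible can be shown with the standard library alone. This is a minimal sketch, assuming the downsampling draws from Python's global random generator (as the random.seed(4) call above implies). The downsample function here is a hypothetical stand-in, not Flair's method.

```python
import random
from typing import List


def downsample(items: List[str], percentage: float) -> List[str]:
    """Hypothetical stand-in: keep a random fraction of the items."""
    sample_size = round(len(items) * percentage)
    return random.sample(items, sample_size)


sentences = [f"sentence {i}" for i in range(100)]

# setting the same seed before each call yields identical samples
random.seed(4)
first = downsample(sentences, 0.1)
random.seed(4)
second = downsample(sentences, 0.1)

print(first == second)  # True
```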

Make reprojection layer optional (#1676)

Makes the reprojection layer optional in SequenceTagger. You can control this behavior through the reproject_embeddings parameter. If you set it to True, embeddings are reprojected via a linear map to vectors of the same size. If you set it to False, no reprojection happens. If you set it to an integer, the linear map projects embedding vectors to vectors of that size.

# tagger with standard reprojection
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=True,
)

# tagger without reprojection
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=False,
)

# reprojection to vectors of length 128
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=128,
)
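
The three settings can be summarized as a rule for the output size of the linear map. The helper below is a hypothetical illustration of that rule in plain Python, not code taken from SequenceTagger. The embedding length of 768 is an assumed example value.

```python
from typing import Optional, Union


def reprojection_output_size(embedding_length: int,
                             reproject_embeddings: Union[bool, int]) -> Optional[int]:
    """Return the output size of the linear map, or None if there is no map."""
    # check bool first: in Python, bool is a subclass of int
    if isinstance(reproject_embeddings, bool):
        return embedding_length if reproject_embeddings else None
    if reproject_embeddings <= 0:
        raise ValueError("reprojection size must be a positive integer")
    return reproject_embeddings


print(reprojection_output_size(768, True))   # 768
print(reprojection_output_size(768, False))  # None
print(reprojection_output_size(768, 128))    # 128
```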

Set label name when predicting (#1671)

You can now optionally specify the "label name" of the predicted label. This is useful if, for instance, you want to run two different NER models on the same sentence:

sentence = Sentence('I love Berlin')

# load two NER taggers
tagger_1 = SequenceTagger.load('ner')
tagger_2 = SequenceTagger.load('ontonotes-ner')

# specify label name of tagger_1 to be 'conll03_ner'
tagger_1.predict(sentence, label_name='conll03_ner')

# specify label name of tagger_2 to be 'onto_ner'
tagger_2.predict(sentence, label_name='onto_ner')

print(sentence)

Because each tagger writes its predictions under its own label name, you can tell which tags came from which tagger. Note that it is no longer possible to give the predict method a string: you must now pass a Sentence.

Sentence Transformers (#1696)

Adds the SentenceTransformerDocumentEmbeddings class, which gives you embeddings from the sentence-transformers library. Use as follows:

from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings

# init embedding
embedding = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)

You can find a full list of their pretrained models here.

Other enhancements

Update to transformers 3.0.0 (#1727)
Better Memory mode presets for classification corpora (#1701)
ClassificationDataset now also accepts lines with a "\t" separator in addition to blank spaces (#1654)
Change default fine-tuning in DocumentPoolEmbeddings to "none" (#1675)
Short-circuit the embedding loop (#1684)
Add option to pass kwargs into transformer models when initializing model (#1694)
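
To picture the ClassificationDataset separator change listed above: classification files use FastText-style lines in which one or more `__label__` tokens precede the text. The sketch below is a hypothetical plain-Python parser (parse_line is not a Flair function) that treats tabs and spaces alike.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_line(line: str) -> Tuple[List[str], str]:
    """Split a FastText-style line into its labels and its text."""
    # str.split() with no argument splits on any run of whitespace,
    # so tab-separated and space-separated lines are handled alike
    words = line.split()
    labels = []
    index = 0
    while index < len(words) and words[index].startswith(LABEL_PREFIX):
        labels.append(words[index][len(LABEL_PREFIX):])
        index += 1
    return labels, " ".join(words[index:])


print(parse_line("__label__positive\tA great movie"))
print(parse_line("__label__positive A great movie"))
```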

New Datasets and Models

Two new Dutch NER models (#1687)

The new default model is a BERT-based RNN model with the highest accuracy:

from flair.data import Sentence
from flair.models import SequenceTagger

# load the default BERT-based model
tagger = SequenceTagger.load('nl-ner')

# tag sentence
sentence = Sentence('Ik hou van Amsterdam')
tagger.predict(sentence)

You can also load a Flair-based RNN model (might be faster on some setups):

# load the Flair-based RNN model
tagger = SequenceTagger.load('nl-ner-rnn')

Corpus of communicative functions (#1683) and pre-trained model (#1706)

Adds a corpus of communicative functions in scientific literature, described in this LREC paper and available here. Load with:

corpus = COMMUNICATIVE_FUNCTIONS()
print(corpus)

We also ship a pre-trained model on this corpus, which you can load with:

# load communicative function tagger
tagger = TextClassifier.load('communicative-functions')

# create an example sentence
sentence = Sentence("However, previous approaches are limited in scalability .")

# predict and print labels
tagger.predict(sentence)
print(sentence.labels)

Keyword Extraction Corpora (#1629) and pre-trained model (#1689)

Adds three datasets for keyphrase extraction via sequence labeling: Inspec, SemEval-2017 and Processed SemEval-2010.

Load like this:

inspec_corpus = INSPEC()
semeval_2010_corpus = SEMEVAL2010()
semeval_2017_corpus = SEMEVAL2017()

We also ship a pre-trained keyphrase extraction model, which you can load with:

# load keyphrase tagger
tagger = SequenceTagger.load('keyphrase')

# create an example sentence
sentence = Sentence("Here, we describe the engineering of a new class of ECHs through the "
                    "functionalization of non-conductive polymers with a conductive choline-based "
                    "bio-ionic liquid (Bio-IL).", use_tokenizer=True)

# predict and print labels
tagger.predict(sentence)
print(sentence)

Swedish NER (#1652)

Adds a corpus for Swedish NER based on the dataset at https://github.com/klintan/swedish-ner-corpus/. Load with:

corpus = NER_SWEDISH()
print(corpus)

German Legal Named Entity Recognition (#1697)

Adds corpus of legal named entities for German. Load with:

corpus = LER_GERMAN()
print(corpus)

Refactoring of evaluation

We made a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn (instead of our own implementations, which repeatedly ran into issues). This applies to text classification and (most) sequence tagging.

A notable exception is "span-F1", which is used to evaluate NER. It keeps its own implementation because there is no good way of counting true negatives for spans. After this PR, our implementation should exactly mirror the original conlleval script of the CoNLL-02 challenge. In addition to using our reimplementation, an output file is now automatically generated that can be used directly with the conlleval script.
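
To make the span-F1 idea concrete, here is a minimal sketch of span-level scoring over BIO tags. A predicted span counts as correct only if its boundaries and type both match a gold span, and precision, recall and F1 are computed from true positives, false positives and false negatives alone, so no true negatives are needed. The helper names extract_spans and span_f1 are hypothetical, and the sketch assumes well-formed BIO input rather than reproducing every conlleval edge case.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a well-formed BIO tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(predicted)
    tp = len(gold_spans & pred_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
precision, recall, f1 = span_f1(gold, pred)
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

In this example the tagger finds the person span but misses the location, so precision is perfect while recall is one half.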

In more detail, this PR makes the following changes:

Span is now a list of Token and can now be iterated like a sentence
flair.DataLoader is now used throughout
The evaluate() interface in the Model base class is changed so that it no longer requires a data loader, but can run over either a list of Sentence or a Dataset
SequenceTagger.evaluate() now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted (#1663) and a non-sklearn implementation is used.
In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method.

Bug fixes:

Fix figsize issue (#1622)
Allow strings to be passed instead of Path (#1637)
Fix segtok tokenization issue (#1653)
Serialize dropout in SequenceTagger (#1659)
Fix serialization error in DocumentPoolEmbeddings (#1671)
Fix subtokenization issues in transformers (#1674)
Add new datasets to __init__.py (#1677)
Fix deprecation warnings due to invalid escape sequences. (#1678)
Fix PooledFlairEmbeddings deserialization error (#1604)
Fix transformer tokenizer deserialization (#1686)
Fix issues caused by embedding mode and lambda functions in ELMoEmbeddings (#1692)
Fix serialization error in PooledFlairEmbeddings (#1593)
Fix mean pooling in PooledFlairEmbeddings (#1698)
Fix condition to assign whitespace_after attribute in the build_spacy_tokenizer wrapper (#1700)
Fix WIKINER encoding for Windows (#1713)
Detect and ignore empty sentences in BERT embeddings (#1716)
Fix error in returning multiple classes (#1717)

flairNLP/flair v0.5.1 Release 0.5.1 on GitHub