github flairNLP/flair v0.8
Release 0.8


Release 0.8 adds major new features to Flair, including our best named entity recognition (NER) models yet and the ability to host, share and test Flair models on the HuggingFace model hub! In addition, there is a host of improvements, new features and new datasets to check out!

FLERT (#2031 #2032 #2104)

This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in our recent paper. This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.

Pre-trained FLERT models (#2130)

We add 5 new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). For instance, load the large English 4-class model with:

from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("ner-large")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub.

Contextualized Sentences

In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:

  1. Sentence objects now have next_sentence() and previous_sentence() methods that are set automatically if loaded through ColumnCorpus. This is a pointer system to navigate through sentences in a corpus:
from flair.datasets import MIT_MOVIE_NER_SIMPLE

# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)
# get the previous sentence
print(sentence.previous_sentence())
# get the sentence after that
print(sentence.next_sentence())
# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())

This allows dynamic computation of contexts in the embedding classes.

  2. Sentence objects now have the is_document_boundary field, which is set through the ColumnCorpus. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object (see the sketch below).
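A minimal sketch of checking this field (CONLL_03 is just an example of a corpus whose data files contain -DOCSTART- markers and must be downloaded manually; any ColumnCorpus behaves the same way):

from flair.datasets import CONLL_03

# load a column corpus whose files contain -DOCSTART- boundary lines
corpus = CONLL_03()

# count the boundary "sentences" in the training split
n_boundaries = sum(
    1 for i in range(len(corpus.train)) if corpus.train[i].is_document_boundary
)
print(f"{n_boundaries} document boundaries in {len(corpus.train)} training sentences")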

Refactored TransformerWordEmbeddings (breaking)

TransformerWordEmbeddings was refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: pooling_operation is now subtoken_pooling (to make clear that we pool subtokens), use_scalar_mean is now layer_mean (we only do a simple layer mean), and use_context can now optionally take an integer to indicate the length of the context. Default arguments have also changed.

For instance, to create embeddings with a document-level context of 64 subtokens, init like this:

from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)

Train your Own FLERT Models

You can train a FLERT-model like this:

import torch

from torch.optim.lr_scheduler import OneCycleLR

from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer


corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# train with fine-tuning parameters (AdamW, 20 epochs, small LR, OneCycleLR schedule)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

context_string = '+context' if use_context else ''

trainer.train(f"resources/flert{context_string}",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )

We recommend training FLERT this way if accuracy is by far your most important requirement: FLERT is quite slow since it works at the document level.

HuggingFace model hub integration (#2040 #2108 #2115)

We now host Flair sequence tagging models on the HF model hub (thanks for all the support @huggingface!).

There is a dedicated 'Flair' tag on the hub, so you can filter by it to get an overview of all Flair models.

The hub allows all users to upload and share their own models. Even better, you can enable the Inference API and test all models online without downloading and running them. For instance, you can try our new, very powerful English 18-class NER model directly on its model page.
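As a rough sketch, you can also query a hosted model over plain HTTP (the endpoint and payload format below belong to the HF Inference API, not to Flair, and may change; the token is a placeholder):

import requests

# placeholder token; create one in your Hugging Face account settings
API_URL = "https://api-inference.huggingface.co/models/flair/ner-english-ontonotes-large"
headers = {"Authorization": "Bearer YOUR_HF_API_TOKEN"}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "On September 1st George won 1 dollar while watching Game of Thrones."},
)
print(response.json())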

To load any sequence tagger from the model hub, use its string identifier when instantiating a model. For instance, to load our English Ontonotes model with the id "flair/ner-english-ontonotes-large", do:

from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Other New Features

New Task: Recognizing Textual Entailment (#2123)

Thanks to @marcelmmm we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.

For instance, if you want to train on the RTE task of the GLUE benchmark, use this script:

import torch

from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings

# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()

# 2. make the label dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()

# 3. initialize text pair tagger
from flair.models import TextPairClassifier

tagger = TextPairClassifier(
    document_embeddings=TransformerDocumentEmbeddings(),
    label_dictionary=label_dictionary,
)

# 4. train trainer with AdamW
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 5. run training
trainer.train('resources/taggers/glue-rte-english',
              learning_rate=2e-5,
              mini_batch_chunk_size=2, # this can be removed if you have a big GPU
              train_with_dev=True,
              max_epochs=3)

Add possibility to specify empty label name to CSV corpora (#2068)

Some CSV classification datasets contain a value that means "no class". We now extend the CSVClassificationDataset so that it is possible to specify which value should be skipped using the no_class_label argument.

For instance:

from flair.datasets import CSVClassificationCorpus

# load corpus
corpus = CSVClassificationCorpus(
    data_folder='resources/tasks/code/',
    train_file='java_io.csv',
    skip_header=True,
    column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
    no_class_label='NONE',
)

This causes all entries of NONE in one of the label columns to be skipped.

More options for splits in corpora and training (#2034)

For various reasons, we might want to have a Corpus that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.

We add several ways of doing so.

  1. If a dataset has predefined splits, like most NLP datasets, you can pass the arguments train_with_test and train_with_dev to the ModelTrainer. This causes the trainer to train over all three splits (and do no evaluation):
trainer.train(f"path/to/your/folder",
    learning_rate=0.1,
    mini_batch_size=16,
    train_with_dev=True,
    train_with_test=True,
)
  2. You can also now create a Corpus with fewer splits without having all three splits automatically sampled. Pass sample_missing_splits=False as an argument to do this. For instance, to load the SemCor WSD corpus only as training data, do:
from flair.datasets import WSD_UFSAC

semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)

Add TFIDF Embeddings (#2086)

We added some old-school embeddings (thanks @yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.

To initialize these embeddings, you must pass the train split of your training corpus, i.e.

from flair.embeddings import DocumentTFIDFEmbeddings

embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)

This fits a TF-IDF vectorizer on the training split, so the most common words are used to featurize documents.
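Once fitted, these embeddings work like any other document embeddings. A minimal usage sketch (TREC_6 is just an example corpus used here to fit the vocabulary):

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import DocumentTFIDFEmbeddings

# fit the TF-IDF vocabulary on the training split of any classification corpus
corpus = TREC_6()
embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)

# embed a document as with any other DocumentEmbeddings
sentence = Sentence("How far is it from Denver to Aspen ?")
embeddings.embed(sentence)

# the sentence now carries a TF-IDF vector
print(sentence.get_embedding().shape)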

New Datasets

Hungarian NER Corpus (#2045)

Added the Hungarian business news corpus annotated with NER information (thanks to @alibektas).

from flair.datasets import BUSINESS_HUN

# load Hungarian business NER corpus
corpus = BUSINESS_HUN()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

StackOverflow NER Corpus (#2052)

from flair.datasets import STACKOVERFLOW_NER

# load StackOverflow NER corpus
corpus = STACKOVERFLOW_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

Added GermEval 18 Offensive Language dataset (#2102)

from flair.datasets import GERMEVAL_2018_OFFENSIVE_LANGUAGE

# load GermEval 2018 offensive language corpus
corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()
print(corpus)
print(corpus.make_label_dictionary())

Added RTE corpora of GLUE and SuperGLUE

from flair.datasets import GLUE_RTE

# load the recognizing textual entailment corpus of the GLUE benchmark
corpus = GLUE_RTE()
print(corpus)
print(corpus.make_label_dictionary())

Improvements

Allow newlines as Tokens in a Sentence (#2070)

Newlines and tabs can now become Tokens in a Sentence:

# make sentence with newlines and tabs
sentence: Sentence = Sentence(["I", "\t", "ich", "\n", "you", "\t", "du", "\n"], use_tokenizer=True)

# Alternatively: sentence: Sentence = Sentence("I \t ich \n you \t du \n", use_tokenizer=False)

# print sentence and each token
print(sentence)
for token in sentence:
    print(token)

Improve transformer serialization (#2046)

We improved the serialization of the TransformerWordEmbeddings class such that you can now train a model with one version of the transformers library and load it with another version. Previously, if you trained a model with transformers 3.5.1 and loaded it with 3.1.0, or trained with 3.5.1 and loaded with 4.1.1, or hit other version mismatches, you would get either errors or bad predictions.

Migration guide: If you have a model trained with an older version of Flair that uses TransformerWordEmbeddings you can save it in the new version-independent format by loading the model with the same transformers version you used to train it, and then saving it again. The newly saved model is then version-independent:

# load old model, but use the *same transformer version you used when training this model*
tagger = SequenceTagger.load('path/to/old-model.pt')

# save the model. It is now version-independent and can for instance be loaded with transformers 4.
tagger.save('path/to/new-model.pt')

Fix regression prediction errors (#2067)

Fix of two problems in the regression model:

  • the predict() method was unable to set labels and threw errors (see #2056)
  • predicted labels had no label name

Now, you can set a label name either in the predict method or during instantiation of the regression model you want to train. So the full code for training a regression model and using it to predict is:

from flair.data import Sentence
from flair.datasets import WASSA_JOY
from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings
from flair.models.text_regression_model import TextRegressor
from flair.trainers import ModelTrainer

# load regression dataset
corpus = WASSA_JOY()

# make simple document embeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')], fine_tune_mode='linear')

# init model and give name to label
model = TextRegressor(embeddings, label_name='happiness')

# target folder
output_folder = 'resources/taggers/regression_test/'

# run training
trainer = ModelTrainer(model, corpus)
trainer.train(
    output_folder,
    mini_batch_size=16,
    max_epochs=10,
)

# load model
model = TextRegressor.load(output_folder + 'best-model.pt')

# predict for sentence
sentence = Sentence('I am so happy')
model.predict(sentence)

# print sentence and prediction
print(sentence)

In my example run, this prints the following sentence + predicted value:

Sentence: "I am so happy"   [− Tokens: 4  − Sentence-Labels: {'happiness': [0.9239126443862915 (1.0)]}]

Do not shuffle first epoch during training (#2058)

Normally, we shuffle sentences at each epoch during training in the ModelTrainer class. However, in some cases it makes sense to see sentences in their natural order during the first epoch, and shuffle only from the second epoch onward.

Bug Fixes and Improvements

  • Update to transformers 4 (#2057)
  • Fix the evaluate() method in the SimilarityLearner class (#2113)
  • Fix memory leak in WordEmbeddings (#2018)
  • Add support for Transformer-XL Embeddings (#2009)
  • Restrict numpy version to <1.20 for Python 3.6 (#2014)
  • Small formatting and variable declaration changes (#2022)
  • Fix document boundary offsets for Dutch CoNLL-03 (#2061)
  • Changed the torch version in requirements.txt to torch>=1.5.0 (#2063)
  • Fix linear layer input dimension when reprojecting embeddings (#2073)
  • Various improvements for TARS (#2090 #2128)
  • Added a link to the interpret-flair repo (#2096)
  • Improve documentation (#2110)
  • Update sentencepiece and gdown version (#2131)
  • Add to_plain_string method to Span class (#2091)
