Release 0.5 with tons of new models, embeddings and datasets, support for fine-tuning transformers, greatly improved sentiment analysis models for English, tons of new features and big internal refactorings to better organize the code!
New Fine-tuneable Transformers (#1494 #1544)
Flair 0.5 adds support for transformers and fine-tuning with two new embedding classes: TransformerWordEmbeddings and TransformerDocumentEmbeddings, for word- and document-level transformer embeddings respectively. Both classes can be initialized with a model name that indicates what type of transformer (BERT, XLNet, RoBERTa, etc.) you wish to use (check the full list here).
Transformer Word Embeddings
If you want to embed the words in a sentence with transformers, do it like this:
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings
# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings
# init embedding
embedding = TransformerWordEmbeddings('roberta-base')
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
Transformer Document Embeddings
To get a single embedding for the whole document with BERT, do:
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings
# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')
# create a sentence
sentence = Sentence('The grass is green .')
# embed the sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings
# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')
# create a sentence
sentence = Sentence('The grass is green .')
# embed the sentence
embedding.embed(sentence)
Text classification by fine-tuning a transformer
Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks.
Use TransformerDocumentEmbeddings for this and set fine_tune=True. Then, use the following example code:
from torch.optim.adam import Adam
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# 1. get the corpus
corpus: Corpus = TREC_6()
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()
# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)
# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)
# 6. start the training
trainer.train('resources/taggers/trec',
              learning_rate=3e-5,  # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4,  # optionally set this if transformer is too much for your machine
              max_epochs=5,  # terminate after 5 epochs
              )
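To illustrate what mini_batch_chunk_size is for, here is a minimal, library-free sketch. It assumes the trainer splits each mini-batch into smaller chunks, runs the forward and backward pass per chunk while accumulating gradients, and only then takes one optimizer step. The helper split_into_chunks is hypothetical and not part of Flair.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(batch: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split one mini-batch into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [list(batch[i:i + chunk_size]) for i in range(0, len(batch), chunk_size)]


# a mini-batch of 16 items (stand-ins for sentences), as with mini_batch_size=16
mini_batch = list(range(16))

# with a chunk size of 4, only 4 items need to fit into memory at a time
chunks = split_into_chunks(mini_batch, chunk_size=4)

for chunk in chunks:
    # in real training, a forward and backward pass would run here,
    # accumulating gradients without updating the weights yet
    pass

# after the last chunk, a single optimizer step would cover the whole mini-batch
print(len(chunks), [len(chunk) for chunk in chunks])
```

Under this assumption the effective batch size stays at 16, while peak memory is governed by the chunk size.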
New Taggers, Embeddings and Datasets
Flair 0.5 adds a ton of new taggers, embeddings and datasets.
New Taggers
New sentiment models (#1613)
We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews, so they should be applicable to more domains than the old sentiment models, which were trained only on movie reviews.
There are two new models, a transformer-based model you can load like this:
from flair.data import Sentence
from flair.models import TextClassifier
# load tagger
classifier = TextClassifier.load('sentiment')
# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)
# check prediction
print(sentence)
And a faster, slightly less accurate model based on RNNs you can load like this:
classifier = TextClassifier.load('sentiment-fast')
Fine-grained POS models for English (#1625)
Adds fine-grained POS models for English, so you can now choose between 'pos' and 'upos' models for fine-grained and universal POS tags respectively. Load like this:
from flair.models import SequenceTagger
# Fine-grained POS model
tagger = SequenceTagger.load('pos')
# Fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')
# Universal POS model
tagger = SequenceTagger.load('upos')
# Universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')
Added Malayalam POS and XPOS tagger model (#1522)
Added taggers for historical German speech and thought (#1532)
New Embeddings
Added language models for historical German by @redewiedergabe (#1507)
Load the language models with:
from flair.embeddings import FlairEmbeddings
embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')
Added Malayalam flair embeddings models (#1458)
embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')
Added Flair Embeddings from CLEF HIPE Shared Task (#1554)
Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the CLEF HIPE shared task.
New Datasets
Added NER dataset for Finnish (#1620)
You can now load a Finnish NER corpus with
ner_finnish = flair.datasets.NER_FINNISH()
Added DaNE dataset (#1425)
You can now load a Danish NER corpus with
dane = flair.datasets.DANE()
Added SentEval classification datasets (#1454)
Adds 6 SentEval classification datasets to Flair:
senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()
Added Sentiment Datasets (#1545)
Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.
amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()
Added BIOfid dataset (#1589)
biofid = flair.datasets.BIOFID()
Refactorings
Any DataPoint can now be labeled (#1450)
Refactored the DataPoint class and classes that inherit from it (Token, Sentence, Image, Span, etc.) so that all have the same methods for adding and accessing labels.
- DataPoint base class now defines labeling methods (closes #1449)
- Labels can no longer be passed to the Sentence constructor, so instead of:
sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])
you should now do:
sentence_1 = Sentence("this is great")
sentence_1.add_label('sentiment', 'POSITIVE')
or:
sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')
Note that Sentence labels now have a label_type (in the example that's 'sentiment').
- The Corpus method _get_class_to_count is renamed to _count_sentence_labels
- The Corpus method _get_tag_to_count is renamed to _count_token_labels
- Span is now a DataPoint (so it has an embedding and labels)
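The idea behind this refactoring can be sketched without Flair. The toy classes below are hypothetical stand-ins, not the actual Flair implementation: a shared base class stores labels per label type, so every subclass gets the same methods for adding and accessing labels.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyDataPoint:
    """Hypothetical base class that holds labels grouped by label type."""

    def __init__(self) -> None:
        self._labels: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def add_label(self, label_type: str, value: str, score: float = 1.0):
        self._labels[label_type].append((value, score))
        # returning self allows chaining directly after the constructor
        return self

    def get_labels(self, label_type: str) -> List[Tuple[str, float]]:
        # .get avoids creating an empty entry for unknown label types
        return list(self._labels.get(label_type, []))


class ToyToken(ToyDataPoint):
    def __init__(self, text: str) -> None:
        super().__init__()
        self.text = text


class ToySentence(ToyDataPoint):
    def __init__(self, text: str) -> None:
        super().__init__()
        self.tokens = [ToyToken(word) for word in text.split()]


# sentence-level and token-level labels go through the same two methods
sentence = ToySentence("this is great").add_label("sentiment", "POSITIVE")
sentence.tokens[2].add_label("pos", "ADJ")

print(sentence.get_labels("sentiment"))
print(sentence.tokens[2].get_labels("pos"))
```

Keying labels by type is what lets one data point carry several independent annotation layers at once.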
Embeddings module was split into smaller submodules (#1588)
Split the previously huge embeddings.py into several submodules organized in an embeddings/ folder. The submodules are:
- token.py for all TokenEmbeddings classes
- document.py for all DocumentEmbeddings classes
- image.py for all ImageEmbeddings classes
- legacy.py for embeddings that are now deprecated
- base.py for remaining basic classes
All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:
from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')
so specifying the submodule is not needed.
Datasets module was split into smaller submodules (#1510)
Split the previously huge datasets.py into several submodules organized in a datasets/ folder. The submodules are:
- sequence_labeling.py for all sequence labeling datasets
- document_classification.py for all document classification datasets
- treebanks.py for all dependency parsed corpora (UD treebanks)
- text_text.py for all bi-text datasets (currently only parallel corpora)
- text_image.py for all paired text-image datasets (currently only Feidegger)
- base.py for remaining basic classes
All datasets are still exposed through the datasets package, so it is still possible to load corpora with
from flair.datasets import TREC_6
without specifying the submodule.
Other refactorings
- Refactor datasets for code legibility (#1394)
Small refactorings on flair.datasets for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent Corpus class. You can now instantiate a Corpus with only a train file, which will trigger the sampling. (2) Moved the default logic for identifying train, dev and test files into a dedicated method to avoid duplicated code.
- Extend string output of Sentence (#1452)
Other
New Features
Add option to specify document delimiter for language model training (#1541)
You now have the option of specifying a document_delimiter when training a LanguageModel. Say, you have a corpus of textual lists and use "[SEP]" to mark boundaries between two lists, like this:
Colors:
- blue
- green
- red
[SEP]
Cities:
- Berlin
- Munich
[SEP]
...
Then you can now train a language model by setting the document_delimiter in the TextCorpus and LanguageModel objects. This will make sure only documents as a whole will get shuffled during training (i.e. the lists in the above example):
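from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
# train a forward language model over the default character dictionary
is_forward_lm = True
dictionary = Dictionary.load('chars')
# your document delimiter
delimiter = '[SEP]'
# set it when you load the corpus
corpus = TextCorpus(
    "data/corpora/conala-corpus/",
    dictionary,
    is_forward_lm,
    character_level=True,
    document_delimiter=delimiter,
)
# set it when you init the language model
language_model = LanguageModel(
    dictionary,
    is_forward_lm=is_forward_lm,
    hidden_size=512,
    nlayers=1,
    document_delimiter=delimiter,
)
# train your language model as always
trainer = LanguageModelTrainer(language_model, corpus)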
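As a rough illustration of what shuffling "documents as a whole" means, the following library-free sketch splits raw text at delimiter lines and shuffles only the resulting documents. It is not Flair's implementation. The helper split_documents is hypothetical, and it assumes the delimiter stands on a line of its own.

```python
import random
from typing import List


def split_documents(text: str, delimiter: str) -> List[str]:
    """Split raw text into documents at lines that consist only of the delimiter."""
    documents: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == delimiter:
            if current:
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        documents.append("\n".join(current))
    return documents


raw_text = "Colors:\n- blue\n- green\n- red\n[SEP]\nCities:\n- Berlin\n- Munich\n[SEP]\n"

documents = split_documents(raw_text, "[SEP]")

# shuffle whole documents; the lines inside each document keep their order
shuffled = documents[:]
random.Random(0).shuffle(shuffled)

for document in shuffled:
    print(document)
    print("[SEP]")
```

Without a document delimiter, a shuffle at line level could separate list items from their header, which is what the new option is meant to avoid.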
Allow column delimiter to be set in ColumnCorpus (#1526)
Added the option to set a different column delimiter for ColumnCorpus, e.g.
from pathlib import Path
from flair.datasets import ColumnCorpus
corpus = ColumnCorpus(
    Path("/path/to/corpus/"),
    column_format={0: 'text', 1: 'ner'},
    column_delimiter='\t',  # set a different delimiter
)
if you want to read a tab-separated column corpus.
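To see why the delimiter matters, consider a token that itself contains a space. The sketch below is a simplified stand-in for column parsing, not Flair's reader. The function parse_column_line is hypothetical, and treating any whitespace run as the default delimiter is an assumption made for the comparison.

```python
import re
from typing import Dict


def parse_column_line(
    line: str,
    column_format: Dict[int, str],
    column_delimiter: str = r"\s+",  # assumed default: split on any whitespace run
) -> Dict[str, str]:
    """Split one corpus line into named columns using a regex delimiter."""
    fields = re.split(column_delimiter, line.rstrip("\n"))
    return {name: fields[index] for index, name in column_format.items()}


column_format = {0: "text", 1: "ner"}
line = "New York\tB-LOC"  # the token text itself contains a space

# splitting on any whitespace breaks the token apart and misreads the tag
by_whitespace = parse_column_line(line, column_format)

# splitting on tabs only keeps the token intact
by_tab = parse_column_line(line, column_format, column_delimiter="\t")

print(by_whitespace)
print(by_tab)
```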
Improvements in classification corpus datasets (#1545)
There are a number of improvements for the ClassificationCorpus and ClassificationDataset classes:
- It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if they don't, and use 'disk' if even 'partial' does not fit.
- It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while some others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
- You can now choose which splits to downsample (for instance you might want to downsample 'train' and 'dev' but not 'test')
- You can now specify the option "filter_if_longer_than" to filter out all sentences that contain more than the given number of whitespaces. This is useful to limit corpus size as some sentiment analysis datasets are gigantic.
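The name-map and length-filter options can be sketched in plain Python. This is only an illustration of the behavior described above, not Flair's code. The function prepare_examples and the parameter name label_name_map are assumptions, as is the convention that a negative filter_if_longer_than disables the filter.

```python
from typing import Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (text, label)


def prepare_examples(
    examples: List[Example],
    label_name_map: Optional[Dict[str, str]] = None,
    filter_if_longer_than: int = -1,
) -> List[Example]:
    """Drop overly long texts and rename labels so they match across datasets."""
    prepared: List[Example] = []
    for text, label in examples:
        # skip texts with more whitespaces than allowed (negative disables the filter)
        if 0 <= filter_if_longer_than < text.count(" "):
            continue
        if label_name_map is not None:
            # labels missing from the map are kept unchanged
            label = label_name_map.get(label, label)
        prepared.append((text, label))
    return prepared


examples = [
    ("great movie", "1"),
    ("a very long and rambling review of a product", "0"),
    ("terrible", "0"),
]

prepared = prepare_examples(
    examples,
    label_name_map={"0": "NEGATIVE", "1": "POSITIVE"},
    filter_if_longer_than=3,
)
print(prepared)
```

Here the second review contains 8 whitespaces and is dropped, while the remaining labels are mapped to 'POSITIVE' and 'NEGATIVE'.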
Added different ways to combine ELMo layers (#1547)
Improved default annealing scheme to anneal against score and loss (#1570)
Adds a new scheduler that uses dev score as the main metric to anneal against, but additionally uses dev loss as a tie-breaker when two epochs have the same dev score.
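The comparison rule can be sketched as follows. This is a simplified illustration rather than the scheduler's actual code. The function is_improvement and the bad-epoch counter are hypothetical, and the sketch assumes higher dev score and lower dev loss are better.

```python
from typing import List, Optional, Tuple

EpochResult = Tuple[float, float]  # (dev_score, dev_loss)


def is_improvement(current: EpochResult, best: Optional[EpochResult]) -> bool:
    """Compare by dev score first; fall back to dev loss only on equal scores."""
    if best is None:
        return True
    score, loss = current
    best_score, best_loss = best
    if score != best_score:
        return score > best_score
    return loss < best_loss


history: List[EpochResult] = [
    (0.90, 0.40),
    (0.92, 0.35),
    (0.92, 0.30),  # same score, lower loss: still counts as an improvement
    (0.92, 0.33),  # same score, higher loss: a bad epoch
    (0.91, 0.20),  # lower score: a bad epoch, even though the loss is lower
]

best: Optional[EpochResult] = None
bad_epochs = 0
for epoch_result in history:
    if is_improvement(epoch_result, best):
        best = epoch_result
        bad_epochs = 0
    else:
        bad_epochs += 1

print(best, bad_epochs)
```

A scheduler built on this rule would anneal the learning rate once bad_epochs exceeds its patience.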
Added option for hidden state position in FlairEmbeddings (#1571)
Adds the option to choose which hidden state to use in FlairEmbeddings: either the state at the end of each word, or the state at the whitespace after. Default is the state at the whitespace after.
You can change the default like this:
embeddings = FlairEmbeddings('news-forward', with_whitespace=False)
Setting with_whitespace=False seems to work better for syntactic tasks such as POS tagging. For instance, on UD_ENGLISH POS tagging, we get 96.56 +- 0.03 with whitespace and 96.72 +- 0.04 without, averaged over three runs.
See the discussion in #1362 for more details.
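The two positions can be made concrete with a small sketch. It assumes a forward character-level language model that produces one hidden state per character read, and tokens separated by single spaces. It ignores any start or end markers Flair adds internally, and the function state_positions is hypothetical.

```python
from typing import List, Tuple


def state_positions(text: str, with_whitespace: bool = True) -> List[Tuple[str, int]]:
    """Return (word, index of the character whose hidden state is used) per word."""
    positions: List[Tuple[str, int]] = []
    offset = 0
    for word in text.split(" "):
        offset += len(word)
        # offset now points at the whitespace following the word;
        # offset - 1 points at the word's last character
        index = offset if with_whitespace else offset - 1
        positions.append((word, index))
        offset += 1  # move past the whitespace
    return positions


text = "The grass is green ."

# state read at the whitespace after each word (the last word is assumed
# to be followed by a trailing whitespace)
after_word = state_positions(text, with_whitespace=True)

# state read at the last character of each word
end_of_word = state_positions(text, with_whitespace=False)

print(after_word)
print(end_of_word)
```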
Other features
- Added the option of passing different tokenizers when loading classification datasets (#1579)
- Added option for true whitespaces in ColumnCorpus (#1583)
- Configurable cache_root from environment variable (#507)
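A configurable cache root typically follows the pattern sketched below. Both the environment variable name FLAIR_CACHE_ROOT and the default directory ~/.flair are assumptions here, and the function resolve_cache_root is hypothetical. Check the linked issue for the exact names.

```python
import os
from pathlib import Path


def resolve_cache_root(env_var: str = "FLAIR_CACHE_ROOT") -> Path:
    """Use the directory from the environment if set, else a default in the home folder."""
    default_root = Path.home() / ".flair"  # assumed default location
    return Path(os.getenv(env_var, str(default_root)))


# with the variable set, models and datasets would be cached in the given directory
os.environ["FLAIR_CACHE_ROOT"] = "/tmp/flair_cache"
custom_root = resolve_cache_root()

# without it, the default location is used
del os.environ["FLAIR_CACHE_ROOT"]
default_root = resolve_cache_root()

print(custom_root)
print(default_root)
```

In practice the variable would need to be set before the library is imported, since such settings are usually read once at import time.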
Performance improvements
- Improve performance for loading not-in-memory corpus (#1413)
- A new lmdb based alternative backend for word embeddings (#1515 #1536)
- Slim down requirements (#1419)