flairNLP/flair v0.11 on GitHub

Release 0.11 takes us ever closer to the 1.0 release! It brings large internal refactorings and code quality and efficiency improvements in preparation for Flair 1.0. We also add new features such as text clustering, a regular expression tagger, more dataset manipulation options, and some preview features like a prototype decoder.

New Features

Regular Expression Tagger (#2533)

You can now do sequence labeling in Flair with regular expressions! Simply define a RegexpTagger and add some regular expressions, like in the example below:

from flair.data import Sentence
from flair.models import RegexpTagger

# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')

# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))

# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))

# tag sentence
tagger.predict(sentence)

# check out matches
for entity in sentence.get_labels():
    print(entity)
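
To check what the two patterns match before handing them to the tagger, you can try them with Python's standard re module. This is only a sketch of the pattern behaviour on the raw string: it does not use Flair and does not reproduce how the RegexpTagger maps matches onto token spans. The names PATTERNS and find_matches are made up for illustration.

```python
import re

# The two patterns from the example above, keyed by the label they should produce.
PATTERNS = {
    "QUOTE": r'(["\'])(?:(?=(\\?))\2.)*?\1',
    "NUMBER": r"\b\d+\b",
}


def find_matches(text):
    """Return (label, matched_text) pairs for every pattern match in `text`."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((label, match.group(0)))
    return found


text = 'Figure 11 is both "too colorful" and "not informative enough".'
for label, matched in find_matches(text):
    print(label, "->", matched)
```

On this sentence the quote pattern should find the two quoted phrases (including their quotation marks) and the number pattern should find 11.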

Clustering with Flair (#2573 #2619)

Flair now supports clustering by way of sklearn. Embed your sentences with a pre-trained embedding, then cluster them with any algorithm. The example below uses sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:

from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)

Dataset Manipulations

You can now change label names, ignore labels and add custom preprocessing when loading a dataset.

For instance, the standard WNUT_17 dataset comes with 7 NER labels:

from flair.datasets import WNUT_17

corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))

which prints:

Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work

With the following code, you rename some labels ('person' becomes 'PER' and 'location' becomes 'LOC'), merge two labels into one ('group' and 'corporation' are merged into 'ORG'), and ignore two other labels ('creative-work' and 'product'):

corpus = WNUT_17(in_memory=False, label_name_map={
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O', # by renaming to 'O' this tag gets ignored
})
print(corpus.make_label_dictionary('ner'))

which prints:

Dictionary with 4 tags: <unk>, PER, LOC, ORG
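
The effect of label_name_map can be pictured with a few lines of plain Python. This is a rough sketch of the semantics described above (rename, merge, drop anything mapped to 'O'), not Flair's actual implementation; the helper apply_label_name_map is hypothetical.

```python
def apply_label_name_map(labels, label_name_map):
    """Rename labels via the map and drop those mapped to 'O'."""
    mapped = []
    for label in labels:
        new_label = label_name_map.get(label, label)  # unmapped labels stay as-is
        if new_label == "O":
            continue  # 'O' means "outside", so the annotation is discarded
        mapped.append(new_label)
    return mapped


label_name_map = {
    "person": "PER",
    "location": "LOC",
    "group": "ORG",
    "corporation": "ORG",
    "product": "O",
    "creative-work": "O",
}

original = ["person", "location", "group", "corporation", "product", "creative-work"]
remaining = apply_label_name_map(original, label_name_map)
print(sorted(set(remaining)))
```

Six original labels collapse to three, which together with <unk> gives the four dictionary entries shown above.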

You can manipulate the data even more with custom preprocessing functions. See the example in #2708.

Other New Features and Data Sets

A new WordTagger class for simple word-level predictions (#2607)
Classic WordEmbeddings can now be fine-tuned in Flair (#2491) by setting fine_tune=True. This also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens".
Add NER_MULTI_CONER Dataset (#2507)
Add support for HIPE 2022 (#2675)
Allow trainer to work with multiple learning rates (#2641)
Update hyperparameter tuning (#2633)

Preview Features

The following preview features are in beta stage; use at your own risk.

Prototypical networks in Flair (#2627)

Prototype networks learn prototypes for each target class. For each data point to be classified, the network predicts a vector in class-prototype-space, which is then compared to all class prototypes. The prediction is then the closest class prototype. See the paper Prototypical Networks for Few-shot Learning for more info.
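
As a toy illustration of the "closest class prototype" step, here is a minimal sketch in plain Python. The two-dimensional prototypes and the predict helper are invented for this example; in a real model the prototypes are learned and live in the embedding space, and this is not the code of the PrototypicalDecoder.

```python
import math

# Hypothetical 2-d prototypes, one per class; a real model learns these.
prototypes = {
    "PER": (0.0, 1.0),
    "LOC": (1.0, 0.0),
    "ORG": (1.0, 1.0),
}


def predict(point):
    """Return the class whose prototype is closest (Euclidean) to `point`."""
    return min(prototypes, key=lambda label: math.dist(point, prototypes[label]))


print(predict((0.9, 0.2)))
```

The point (0.9, 0.2) lies nearest to the LOC prototype, so LOC is predicted.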

@plonerma implemented a custom decoder that can be added to any Flair model that inherits from DefaultClassifier (i.e. nearly all Flair models). For instance, use this script:

from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer

# what tag do we want to predict?
tag_type = 'frame'

# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
                                       fine_tune=True,
                                       layers='-1')

# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
                              embeddings_size=embeddings.embedding_length,
                              distance_function='euclidean',
                              normal_distributed_initial_prototypes=True,
                              )

# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings,
                    tag_dictionary,
                    tag_type,
                    decoder=decoder)

# initialize trainer
trainer = ModelTrainer(tagger, corpus)

# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')

Other Beta features

Dependency Parsing in Flair (#2486 #2579)
Lemmatization in Flair (#2531)
Initial implementation of JsonCorpora and Datasets (#2653)

Major Refactorings

With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.

Major refactoring of Label Logic in Flair (#2607 #2609 #2645)

The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of label. The Sentence object will now be automatically aware of all labels added to it.

To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.

Before:

# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])

# make a Span-label
span_label = SpanLabel(span=span, value='University')

# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)

Now:

# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")

So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.

Refactoring of printouts (#2704)

We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.

Unified classes to reduce redundancy

Besides too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes, and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:

Unify Corpora (#2607): Unifies several corpora into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNLLUCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets.
Unify Transformer Embeddings (#2558, #2584, #2586): There was too much redundancy and inconsistency between the two Transformer-based embeddings classes TransformerWordEmbeddings and TransformerDocumentEmbeddings. Thanks to @helpmefindaname, they now both inherit from the same base object and share all features.
Unify Tokenizers (#2607): The Tokenizer classes no longer return lists of Token objects but lists of strings, which the Sentence object converts to tokens. This centralizes the offset and whitespace_after detection in one place.
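
To make the tokenizer change more concrete: once a tokenizer only returns strings, character offsets and the whitespace flag can be recovered in a single pass over the original text. The sketch below shows one way this could work, assuming each token appears verbatim in the text; the function locate_tokens is hypothetical and is not Flair's implementation.

```python
def locate_tokens(text, tokens):
    """Return (token, start_position, whitespace_after) for each token string."""
    located = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # search from the end of the previous token
        end = start + len(token)
        # True when the character right after the token is whitespace
        whitespace_after = end < len(text) and text[end].isspace()
        located.append((token, start, whitespace_after))
        cursor = end
    return located


text = "Berlin is nice."
tokens = ["Berlin", "is", "nice", "."]  # plain strings, as a tokenizer would return them
for entry in locate_tokens(text, tokens):
    print(entry)
```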

Simplifications to DefaultClassifier

The DefaultClassifier is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular.

forward_pass simplified to return 3 instead of 4 values
forward_pass returns embeddings instead of logits, allowing us to easily switch out the decoder (see the preview feature on prototypical networks above)
removed the unintuitive spawn logic we no longer need due to Label refactoring
unify dropouts across all classes (#2669)

Sequence tagger refactoring (#2361 #2550 #2561 #2564 #2585 #2565)

Major refactoring of SequenceTagger for better modularity and code readability.

Refactoring of Span Logic (#2607 #2609 #2645)

Spans are no longer stored as word-level 'bioes' tags, but rather directly stored as span-level annotations. The SequenceTagger will still internally use BIO/BIOES tags, but the corpora and sentences no longer explicitly store this information.

So you now choose the labeling format when instantiating the SequenceTagger, for example:

    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type="ner",
        tag_format="BIOES", # choose if you want to use BIOES or BIO internally
    )
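
For readers unfamiliar with the two formats, the sketch below shows how span-level annotations could be expanded into per-token BIO or BIOES tags. It is an illustration only: the helper spans_to_tags and the (start, end, label) tuple layout are assumptions made for this example and do not mirror the internals of the SequenceTagger.

```python
def spans_to_tags(num_tokens, spans, tag_format="BIOES"):
    """Convert (start, end_exclusive, label) spans into per-token tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        length = end - start
        for i in range(start, end):
            if tag_format == "BIOES" and length == 1:
                prefix = "S"  # single-token span
            elif i == start:
                prefix = "B"  # beginning of span
            elif tag_format == "BIOES" and i == end - 1:
                prefix = "E"  # end of span
            else:
                prefix = "I"  # inside span
            tags[i] = f"{prefix}-{label}"
    return tags


tokens = "Humboldt Universität zu Berlin is located in Berlin .".split()
spans = [(0, 4, "ORG"), (7, 8, "LOC")]
print(spans_to_tags(len(tokens), spans, "BIOES"))
print(spans_to_tags(len(tokens), spans, "BIO"))
```

BIOES marks span ends and single-token spans explicitly, while BIO only distinguishes the first token of a span from the rest.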

Internally, this refactoring makes a number of changes and simplifications:

a number of fields have been added or moved up to the DataPoint class, for convenience, including properties to get start_position and end_position of datapoints, their text, their tag and score (if they have only one tag) and an unlabeled_identifier
set_embedding() and to() have been moved up from the data point classes (Sentence, Token, etc.) to their parent DataPoint
a number of methods like get_tag and add_tag have been removed from Token in favor of the get_label and add_label method of the parent DataPoint class
The ColumnCorpus will automatically identify which columns are span labels and treat them accordingly

Code Quality Checks (#2611)

They are back and stricter than ever! Thanks to @helpmefindaname, we now include mypy and formatting tests as part of our build process, which led to many changes in the code and a much greater chance of catching errors early.

Speed and Memory Improvements

EntityLinker class refactored for speed (#2607)
Performance improvements in standard evaluate() method, especially for large datasets (#2607)
ColumnCorpus no longer does disk reads when in_memory=False. Instead, it simply stores the raw data in memory, leading to significant speed-ups on large datasets (#2607)
Memory management improvements for embeddings (#2645)
Efficiency improvements for WordEmbeddings (#2491) and OneHotEmbeddings (#2490)

Bug Fixes and Improvements

Add equality method to Dictionary (#2532)
Fix encoding error in lemmatizer (#2539)
Fixed printing and logging inconsistencies. (#2665)
Readme (#2525 #2618 #2617 #2662)
Fix bug in WSD_UFSAC corpus (#2521)
Change position of model saving in between epochs (#2548)
Fix loss weights in TextPairClassifier and RelationExtractor models (#2576)
Fix token positions on column corpus (#2440)
Support long sequences in transformers of any kind (#2599)
The deprecated data_fetcher is finally removed (#2607)
Small lm training improvements (#2590)
Remove minor bug in NEL_ENGLISH_AIDA corpus (#2615)
Fix module import bug (#2616)
Fix reloading fast tokenizers (#2622)
Fix two small bugs (#2634)
Fix .pre-commit-config.yaml (#2651)
Patch the missing document_delimiter for lm.get_state() (#2658)
DocumentPoolEmbeddings class can now be instantiated with only a single embedding (#2645)
You can now specify a min_count when computing the label dictionary. Labels below that count will be UNK'ed. (e.g. tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)) (#2607)
The Dictionary will now compute count statistics for labels in a corpus (#2607)
The ColumnCorpus can now handle relation annotation, dependency tree information and UD feats and misc (#2607)
Embeddings are stored as a torch Embedding instead of a gensim keyedvector. This avoids version issues in case gensim does not ensure backwards compatibility
Make transformer offset calculation more robust (#2714)
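
As a rough picture of the min_count option mentioned in the list above, the following sketch counts labels and separates those that are frequent enough from those that would be folded into <unk>. It assumes that a label seen exactly min_count times is kept; the helper labels_to_keep and the toy counts are made up, and this is not the code behind make_label_dictionary.

```python
from collections import Counter


def labels_to_keep(labels, min_count):
    """Return labels seen at least `min_count` times and those folded into <unk>."""
    counts = Counter(labels)
    kept = sorted(label for label, n in counts.items() if n >= min_count)
    unked = sorted(label for label, n in counts.items() if n < min_count)
    return kept, unked


# Toy label occurrences: PER 12 times, LOC 10 times, ORG 3 times
observed = ["PER"] * 12 + ["LOC"] * 10 + ["ORG"] * 3
kept, unked = labels_to_keep(observed, min_count=10)
print(kept, unked)
```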
