Release 0.11 takes us ever closer to that 1.0 release! This release makes large internal refactorings and code quality / efficiency improvements to prepare for Flair 1.0. We also add new features such as text clustering, a regular expression tagger, more dataset manipulation options, and some preview features like a prototype decoder.
New Features
Regular Expression Tagger (#2533)
You can now do sequence labeling in Flair with regular expressions! Simply define a RegexpTagger and add some regular expressions, like in the example below:
# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')
# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))
# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))
# tag sentence
tagger.predict(sentence)
# check out matches
for entity in sentence.get_labels():
    print(entity)
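To make the matching behaviour concrete, here is a minimal plain-Python sketch of what regex-based span tagging amounts to. It is not Flair's implementation: the helper `regex_spans` is a hypothetical name, and it works on the raw string with character offsets rather than on Flair tokens. It reuses the two patterns from the example above.

```python
import re


def regex_spans(text, mapping):
    """Return (start, end, matched_text, label) for every match of every pattern."""
    spans = []
    for pattern, label in mapping:
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), match.group(0), label))
    # sort by character offset so spans come out in reading order
    return sorted(spans)


text = 'Figure 11 is both "too colorful" and "not informative enough".'
mapping = [
    (r'(["\'])(?:(?=(\\?))\2.)*?\1', "QUOTE"),  # quoted passages
    (r"\b\d+\b", "NUMBER"),                      # standalone numbers
]

spans = regex_spans(text, mapping)
for start, end, matched, label in spans:
    print(f"{label}: {matched} [{start}:{end}]")
```

This yields one NUMBER span for `11` and two QUOTE spans, one per quoted passage.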
Clustering with Flair (#2573 #2619)
Flair now supports clustering by way of sklearn. Embed your sentences with a pre-trained embedding, then cluster them with any algorithm. Check the example below, where we use sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:
from sklearn.cluster import KMeans
from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel
embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)
clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)
# fit the model on a corpus
clustering_model.fit(corpus)
# save the model
clustering_model.save(model_file="clustering_model.pt")
# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")
# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
# predict for sentence
model.predict(sentence)
# print sentence with prediction
print(sentence)
Dataset Manipulations
You can now change label names, ignore labels and add custom preprocessing when loading a dataset.
For instance, the standard WNUT_17 dataset comes with 7 NER labels:
corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))
which prints:
Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work
With the following code, you rename some labels ('person' is renamed to 'PER' and 'location' to 'LOC'), merge 2 labels into 1 ('group' and 'corporation' are merged into 'ORG'), and ignore 2 other labels ('creative-work' and 'product' are ignored):
corpus = WNUT_17(in_memory=False, label_name_map={
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O',  # by renaming to 'O' this tag gets ignored
})
print(corpus.make_label_dictionary('ner'))
which prints:
Dictionary with 4 tags: <unk>, PER, LOC, ORG
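The effect of the label map can be sketched in plain Python. This is only an illustration of the renaming, merging, and ignoring semantics described above, not Flair's loading code; `apply_label_name_map` is a hypothetical helper.

```python
def apply_label_name_map(labels, label_name_map):
    """Rename labels via the map and drop every label that is mapped to 'O'."""
    renamed = []
    for label in labels:
        new_label = label_name_map.get(label, label)  # unmapped labels stay unchanged
        if new_label != "O":
            renamed.append(new_label)
    return renamed


label_name_map = {
    "person": "PER",
    "location": "LOC",
    "group": "ORG",
    "corporation": "ORG",
    "product": "O",
    "creative-work": "O",
}

original = ["person", "location", "group", "corporation", "product", "creative-work"]
mapped = apply_label_name_map(original, label_name_map)

# drop duplicates while keeping first-seen order
distinct = list(dict.fromkeys(mapped))
print(distinct)  # ['PER', 'LOC', 'ORG']
```

Six original labels collapse to three, which together with `<unk>` gives the 4 tags in the dictionary printed above.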
You can manipulate the data even more with custom preprocessing functions. See the example in #2708.
Other New Features and Data Sets
- A new WordTagger class for simple word-level predictions (#2607)
- Classic WordEmbeddings can now be fine-tuned in Flair (#2491) by setting fine_tune=True. Also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
- Add NER_MULTI_CONER Dataset (#2507)
- Add support for HIPE 2022 (#2675)
- Allow trainer to work with multiple learning rates (#2641)
- Update hyperparameter tuning (#2633)
Preview Features
The following preview features are in beta stage; use them at your own risk.
Prototypical networks in Flair (#2627)
Prototypical networks learn a prototype for each target class. For each data point to be classified, the network predicts a vector in class-prototype space, which is then compared to all class prototypes. The prediction is the class whose prototype is closest. See the paper Prototypical Networks for Few-shot Learning for more info.
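The decision rule can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Flair decoder: the prototypes and the query vector are made-up 2D values, whereas in a real model the prototypes are learned and the vector is predicted by the network. `nearest_prototype` is a hypothetical helper name.

```python
import math


def nearest_prototype(vector, prototypes):
    """Return the class label whose prototype has the smallest Euclidean distance."""
    distances = {label: math.dist(vector, proto) for label, proto in prototypes.items()}
    return min(distances, key=distances.get)


# hypothetical class prototypes in a 2-dimensional prototype space
prototypes = {
    "PER": [1.0, 0.0],
    "LOC": [0.0, 1.0],
    "ORG": [-1.0, -1.0],
}

# hypothetical vector predicted for one data point
predicted = nearest_prototype([0.9, 0.2], prototypes)
print(predicted)  # PER
```

Other distance functions could be swapped in for `math.dist` without changing the rest of the rule.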
@plonerma implemented a custom decoder that can be added to any Flair model that inherits from DefaultClassifier (i.e. nearly all Flair models). For instance, use this script:
from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer
# what tag do we want to predict?
tag_type = 'frame'
# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)
# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
                                       fine_tune=True,
                                       layers='-1')
# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
                              embeddings_size=embeddings.embedding_length,
                              distance_function='euclidean',
                              normal_distributed_initial_prototypes=True,
                              )
# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings,
                    tag_dictionary,
                    tag_type,
                    decoder=decoder)
# initialize trainer
trainer = ModelTrainer(tagger, corpus)
# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')
Other Beta features
- Dependency Parsing in Flair (#2486 #2579)
- Lemmatization in Flair (#2531)
- Initial implementation of JsonCorpora and Datasets (#2653)
Major Refactorings
With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.
Major refactoring of Label Logic in Flair (#2607 #2609 #2645)
The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of label. The Sentence object will now be automatically aware of all labels added to it.
To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.
Before:
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")
# create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])
# make a Span-label
span_label = SpanLabel(span=span, value='University')
# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)
Now:
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")
# directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")
So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.
Refactoring of printouts (#2704)
We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.
Unified classes to reduce redundancy
Next to too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:
- Unify Corpora (#2607): Unifies several corpora into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNNLuCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets.
- Unify Transformer Embeddings (#2558, #2584, #2586): There was too much redundancy and inconsistency between the two Transformer-based embeddings classes TransformerWordEmbeddings and TransformerDocumentEmbeddings. Thanks to @helpmefindaname, they now both inherit from the same base object and share all features.
- Unify Tokenizers (#2607): The Tokenizer classes no longer return lists of Token, but rather lists of strings that the Sentence object converts to tokens, centralizing the offset and whitespace_after detection in one place.
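The tokenizer change can be illustrated with a small plain-Python sketch: given the original text and the strings a tokenizer returns, one central routine recovers each token's start offset and whether whitespace follows it. This is an assumption about the general approach, not Flair's actual code, and `align_tokens` is a hypothetical helper.

```python
def align_tokens(text, token_strings):
    """Locate each token string in the text; return (token, start, whitespace_after)."""
    aligned = []
    cursor = 0
    for token in token_strings:
        start = text.index(token, cursor)  # raises ValueError if the token is not found
        end = start + len(token)
        whitespace_after = end < len(text) and text[end].isspace()
        aligned.append((token, start, whitespace_after))
        cursor = end  # continue searching after the current token
    return aligned


text = "Berlin is nice."
token_strings = ["Berlin", "is", "nice", "."]  # what a tokenizer might return

aligned = align_tokens(text, token_strings)
for token, start, whitespace_after in aligned:
    print(token, start, whitespace_after)
```

Because the tokenizer only produces strings, every tokenizer gets the same offset handling for free.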
Simplifications to DefaultClassifier
The DefaultClassifier is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular.
- forward_pass simplified to return 3 instead of 4 values
- forward_pass returns embeddings instead of logits, allowing us to easily switch out the decoder (see the preview feature on prototypical networks above)
- removed the unintuitive spawn logic we no longer need due to Label refactoring
- unify dropouts across all classes (#2669)
Sequence tagger refactoring (#2361 #2550 #2561 #2564 #2585 #2565)
Major refactoring of SequenceTagger for better modularity and code readability.
Refactoring of Span Logic (#2607 #2609 #2645)
Spans are no longer stored as word-level 'bioes' tags, but rather directly stored as span-level annotations. The SequenceTagger will still internally use BIO/BIOES tags, but the corpora and sentences no longer explicitly store this information.
So you now choose the labeling format when instantiating the SequenceTagger, i.e.:
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    tag_format="BIOES",  # choose if you want to use BIOES or BIO internally
)
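To show how span-level annotations relate to the word-level tags the tagger uses internally, here is a minimal plain-Python sketch of the conversion. It is not Flair's internal code: `spans_to_tags` is a hypothetical helper, and representing spans as (start, exclusive end, label) over token indices is an assumption made for the illustration.

```python
def spans_to_tags(num_tokens, spans, tag_format="BIOES"):
    """Convert (start, end_exclusive, label) token spans to per-token BIO or BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if tag_format == "BIOES" and end - start == 1:
                prefix = "S"  # single-token span
            elif i == start:
                prefix = "B"  # beginning of span
            elif tag_format == "BIOES" and i == end - 1:
                prefix = "E"  # end of span
            else:
                prefix = "I"  # inside span
            tags[i] = f"{prefix}-{label}"
    return tags


tokens = "Humboldt Universität zu Berlin is located in Berlin .".split()
spans = [(0, 4, "ORG"), (7, 8, "LOC")]  # hypothetical span annotations

bioes = spans_to_tags(len(tokens), spans, tag_format="BIOES")
bio = spans_to_tags(len(tokens), spans, tag_format="BIO")
print(bioes)
print(bio)
```

The same span annotations produce either tagging scheme, which is why the format only needs to be chosen once, when the tagger is created.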
Internally, this refactoring makes a number of changes and simplifications:
- a number of fields have been added or moved up to the DataPoint class, for convenience, including properties to get start_position and end_position of datapoints, their text, their tag and score (if they have only one tag) and an unlabeled_identifier
- moves up set_embedding() and to() from the data point classes (Sentence, Token, etc.) to their parent DataPoint
- a number of methods like get_tag and add_tag have been removed from Token in favor of the get_label and add_label method of the parent DataPoint class
- The ColumnCorpus will automatically identify which columns are span labels and treat them accordingly
Code Quality Checks (#2611)
They are back and stricter than ever! Thanks to @helpmefindaname, we now include mypy and formatting tests as part of our build process, which led to many changes in the code and a much greater chance of catching errors early.
Speed and Memory Improvements:
- EntityLinker class refactored for speed (#2607)
- Performance improvements in standard evaluate() method, especially for large datasets (#2607)
- ColumnCorpus no longer does disk reads when in_memory=False; it simply stores the raw data in memory, leading to significant speed-ups on large datasets (#2607)
- Memory management improvements for embeddings (#2645)
- Efficiency improvements for WordEmbeddings (#2491) and OneHotEmbeddings (#2490)
Bug Fixes and Improvements
- Add equality method to Dictionary (#2532)
- Fix encoding error in lemmatizer (#2539)
- Fixed printing and logging inconsistencies. (#2665)
- Readme (#2525 #2618 #2617 #2662)
- Fix bug in WSD_UFSAC corpus (#2521)
- change position of model saving in between epochs (#2548)
- Fix loss weights in TextPairClassifier and RelationExtractor models (#2576)
- Fix token positions on column corpus (#2440)
- long sequence transformers of any kind (#2599)
- The deprecated data_fetcher is finally removed (#2607)
- Small lm training improvements (#2590)
- Remove minor bug in NEL_ENGLISH_AIDA corpus (#2615)
- Fix module import bug (#2616)
- Fix reloading fast tokenizers (#2622)
- Fix two small bugs (#2634)
- Fix .pre-commit-config.yaml (#2651)
- patch the missing document_delimiter for lm.get_state() (#2658)
- DocumentPoolEmbeddings class can now be instantiated with only a single embedding (#2645)
- You can now specify a min_count when computing the label dictionary. Labels below that count will be UNK'ed (e.g. tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)) (#2607)
- The Dictionary will now compute count statistics for labels in a corpus (#2607)
- The ColumnCorpus can now handle relation annotation, dependency tree information and UD feats and misc (#2607)
- Embeddings are stored as a torch Embedding instead of a gensim keyedvector. This avoids version issues if gensim doesn't ensure backwards compatibility
- Make transformer offset calculation more robust (#2714)
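The min_count option mentioned in the list above can be pictured with a short plain-Python sketch: count how often each label occurs and keep only labels that reach the threshold, so that everything else falls back to `<unk>`. This is an illustration of the idea rather than Flair's Dictionary code; `build_label_vocab` and the label counts are hypothetical.

```python
from collections import Counter


def build_label_vocab(labels, min_count=1):
    """Count labels and keep only those seen at least min_count times."""
    counts = Counter(labels)
    vocab = ["<unk>"] + [label for label, n in counts.items() if n >= min_count]
    return vocab, counts


# hypothetical label occurrences in a corpus
labels = ["PER"] * 12 + ["LOC"] * 10 + ["MISC"] * 3

vocab, counts = build_label_vocab(labels, min_count=10)
print(vocab)           # ['<unk>', 'PER', 'LOC']
print(counts["MISC"])  # 3, below the threshold, so MISC maps to <unk>
```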