Release 0.7 adds major few-shot and zero-shot learning capabilities to Flair with our new TARS approach, plus support for the Universal Proposition Banks, new NER datasets and lots of other new features!
Few-Shot and Zero-Shot Classification with TARS (#1917 #1926)
With TARS we add a major new feature to Flair for zero-shot and few-shot classification. Details on the approach can be found in our paper Halder et al. (2020). Our approach allows you to classify text in cases in which you have little or even no training data at all.
This example illustrates how you predict new classes without training data:
# imports (module path for TARSClassifier as of this release)
from flair.data import Sentence
from flair.models.text_classification_model import TARSClassifier

# 1. Load our pre-trained TARS model for English
tars = TARSClassifier.load('tars-base')
# 2. Prepare a test sentence
sentence = Sentence("I am so glad you liked it!")
# 3. Define some classes that you want to predict using descriptive names
classes = ["happy", "sad"]
# 4. Predict for these classes
tars.predict_zero_shot(sentence, classes)
# Print sentence with predicted labels
print(sentence)
For a full overview of TARS features, please refer to our new TARS tutorial.
Other New Features
Option to set Flair seed (#1979)
Adds the option to set a seed by wrapping the helper method of the Hugging Face Transformers library (thanks @stefan-it).
By specifying a seed with:
import flair
flair.set_seed(42)
you can make experimental runs reproducible. The wrapped set_seed method sets seeds for random, numpy and torch. More details here.
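The effect of a fixed seed can be illustrated with the standard library alone. The sketch below is not Flair's implementation: seed_everything and sample_run are hypothetical helpers, and numpy and torch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed random, plus numpy and torch if installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch


def sample_run(seed: int) -> list:
    """Draw a few random numbers after seeding, as a stand-in for an experiment."""
    seed_everything(seed)
    return [random.randint(0, 100) for _ in range(5)]


# Two runs with the same seed draw identical numbers.
print(sample_run(42) == sample_run(42))
```

Without the seeding call, the two runs would generally draw different numbers, which is what makes unseeded experiments hard to compare.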
Control multi-word behavior in UD datasets (#1981)
To better handle multi-words in UD corpora, we introduce the split_multiwords constructor argument to all UD corpora, which is set to True by default. It controls the handling of multiwords that are split into different tokens. For instance, the German "am" is split into two different tokens: "am" -> "an" + "dem". Or the French "aux" -> "à" + "les".
If split_multiwords is set to True, they are split as in UD. If set to False, we keep the original multiword as a single token. Example:
from flair.datasets import UD_GERMAN

# default mode: multiwords are split
corpus = UD_GERMAN(split_multiwords=True)
# print sentence 179
print(corpus.dev[179].to_plain_string())
# alternative mode: multiwords are kept as original
corpus = UD_GERMAN(split_multiwords=False)
# print sentence 179
print(corpus.dev[179].to_plain_string())
This prints
Ein Hotel zu dem Wohlfühlen.
Ein Hotel zum Wohlfühlen.
The latter is how the sentence appears in the original text; the former is the result of splitting multiwords.
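The two modes can be mimicked on a toy example without downloading a corpus. This is only a sketch of the idea, not how Flair reads UD files: the MULTIWORDS table and the render function are made up for illustration, whereas real UD corpora annotate multiword tokens directly in the data.

```python
# Toy multiword table (illustrative only; UD corpora annotate this per token).
MULTIWORDS = {"zum": ["zu", "dem"], "am": ["an", "dem"]}


def render(words, split_multiwords=True):
    """Join words into a sentence, optionally expanding multiword tokens."""
    tokens = []
    for word in words:
        if split_multiwords and word in MULTIWORDS:
            tokens.extend(MULTIWORDS[word])
        else:
            tokens.append(word)
    text = " ".join(tokens)
    # Attach the sentence-final period to the preceding word.
    return text.replace(" .", ".")


words = ["Ein", "Hotel", "zum", "Wohlfühlen", "."]
print(render(words, split_multiwords=True))   # Ein Hotel zu dem Wohlfühlen.
print(render(words, split_multiwords=False))  # Ein Hotel zum Wohlfühlen.
```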
Pass pretokenized sentence to Sentence object (#1965)
You can now pass a pretokenized sequence as a list of words (thanks @ulf1):
from flair.data import Sentence
sentence = Sentence(['The', 'grass', 'is', 'green', '.'])
print(sentence)
This should print:
Sentence: "The grass is green ." [− Tokens: 5]
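Pretokenized input is mainly useful when the tokens come from your own tokenizer. As a minimal sketch, the hypothetical simple_tokenize function below splits text with a regular expression and produces the kind of word list shown above. It is not part of Flair, and the Flair call is left as a comment so the block runs on its own.

```python
import re


def simple_tokenize(text: str) -> list:
    """Hypothetical tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("The grass is green.")
print(tokens)  # ['The', 'grass', 'is', 'green', '.']

# With Flair installed, the list could then be passed on unchanged:
# from flair.data import Sentence
# sentence = Sentence(tokens)
```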
Map label names in sequence labeling datasets (#1988)
You can now pass a label map to sequence labeling datasets to change label names (thanks @pharnisch).
from flair.datasets import CONLL_03_DUTCH

# print tag dictionary with mapped names
corpus = CONLL_03_DUTCH(label_name_map={'PER': 'person', 'ORG': 'organization', 'LOC': 'location', 'MISC': 'other'})
print(corpus.make_tag_dictionary('ner'))
# print tag dictionary with original names
corpus = CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))
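To make the effect of a label map concrete, here is a small standalone sketch. It assumes BIO-style tags such as B-PER and uses a hypothetical remap_tag helper; it illustrates the renaming idea only and is not Flair's internal code.

```python
LABEL_NAME_MAP = {"PER": "person", "ORG": "organization", "LOC": "location", "MISC": "other"}


def remap_tag(tag: str, label_name_map: dict) -> str:
    """Rename the label part of a BIO-style tag, keeping any 'B-'/'I-' prefix."""
    if tag == "O":
        return tag
    prefix, sep, label = tag.partition("-")
    if not sep:
        # Tag has no prefix, so the whole tag is the label.
        return label_name_map.get(tag, tag)
    return f"{prefix}-{label_name_map.get(label, label)}"


tags = ["B-PER", "I-PER", "O", "B-LOC", "B-MISC"]
print([remap_tag(tag, LABEL_NAME_MAP) for tag in tags])
# ['B-person', 'I-person', 'O', 'B-location', 'B-other']
```

Labels that are missing from the map are passed through unchanged in this sketch.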
Data Sets
Universal Proposition Banks (#1870 #1866 #1888)
Flair 0.7 adds support for 7 Universal Proposition Banks to train your own multilingual semantic role labelers (thanks to @Dabendorf).
Load for instance with:
# load English Universal Proposition Bank
corpus = UP_ENGLISH()
print(corpus)
# make dictionary of frames
frame_dictionary = corpus.make_tag_dictionary('frame')
print(frame_dictionary)
Now available for Finnish, Chinese, Italian, French, German, Spanish and English.
NER Corpora
We add support for 6 new NER corpora:
Arabic NER Corpus (#1901)
Added the ANER corpus for Arabic NER (thanks to @megantosh).
# load Arabic NER corpus
corpus = ANER_CORP()
print(corpus)
Movie NER Corpora (#1912)
Added the MIT movie reviews corpora annotated with NER information, in the simple and complex variants (thanks to @pharnisch).
# load simple movie NER corpus
corpus = MITMovieNERSimple()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
# load complex movie NER corpus
corpus = MITMovieNERComplex()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
Added SEC Filings NER corpus (#1922)
Added a corpus of SEC filings annotated with 4-class NER tags (thanks to @samahakk).
# load SEC filings corpus (the class name keeps the SEC_FILLINGS spelling)
corpus = SEC_FILLINGS()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
WNUT 2020 NER dataset support (#1942)
Added a corpus of wet lab protocols annotated with NER information, used for the WNUT 2020 challenge (thanks to @aynetdia).
# load wet lab protocol data
corpus = WNUT_2020_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
Weibo NER dataset support (#1944)
Added a dataset for NER on Chinese social media (thanks to @87302380).
# load Weibo NER data
corpus = WEIBO_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
Added Finnish NER corpus (#1946)
Added the TURKU corpus for Finnish NER (thanks to @melvelet).
# load Finnish NER data
corpus = TURKU_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
Universal Dependency Treebanks
We add support for 11 new UD treebanks:
- Greek UD Treebank (#1933, thanks @malamasn)
- Livvi UD Treebank (#1953, thanks @hebecked)
- Naija UD Treebank (#1952, thanks @teddim420)
- Buryat UD Treebank (#1954, thanks @MaxDall)
- North Sami UD Treebank (#1955, thanks @dobbersc)
- Maltese UD Treebank (#1957, thanks @phkuep)
- Marathi UD Treebank (#1958, thanks @polarlyset)
- Afrikaans UD Treebank (#1959, thanks @QueStat)
- Gothic UD Treebank (#1961, thanks @wjSimon)
- Old French UD Treebank (#1964, thanks @Weyaaron)
- Wolof UD Treebank (#1967, thanks @LukasOpp)
Load each by its language name, for instance:
# load Gothic UD treebank data
corpus = UD_GOTHIC()
print(corpus)
print(corpus.test[0])
Added GoEmotions text classification corpus (#1914)
Added GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories. Load with:
# load GoEmotions corpus
corpus = GO_EMOTIONS()
print(corpus)
print(corpus.make_label_dictionary())
Enhancements and bug fixes
- Add handling for micro-average precision and recall (#1935)
- Make dev and test splits in treebanks optional (#1951)
- Updated communicative functions model (#1857)
- Biomedical Data: Explicit encodings for Windows Support (#1893)
- Fix wrong abstract method (#1923 #1940)
- Improve tutorial (#1939)
- Fix requirements (#1971)
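For readers unfamiliar with the micro-averaged metrics mentioned in the list above: micro-averaging pools true positives, false positives and false negatives over all classes before computing precision and recall. The sketch below shows that arithmetic with made-up counts. The function name and data layout are assumptions for illustration, and it is not Flair's evaluation code.

```python
def micro_precision_recall(counts):
    """counts maps class name -> (true positives, false positives, false negatives)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up counts for illustration only.
counts = {"PER": (8, 2, 1), "LOC": (1, 1, 3)}
precision, recall = micro_precision_recall(counts)
print(precision, recall)  # precision is 9 / 12 = 0.75, recall is 9 / 13
```

Because counts are pooled, frequent classes dominate the result; macro-averaging would instead average the per-class scores.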