Release 0.6 is a major biomedical NLP upgrade for Flair, adding state-of-the-art models for biomedical NER, support for 31 biomedical NER corpora, clinical POS tagging, speculation and negation detection in biomedical literature, and many other features such as multi-tagging and one-cycle learning.
Biomedical Models and Datasets:
Most of the biomedical models and datasets were developed together with the Knowledge Management in Bioinformatics group at HU Berlin, in particular @leonweber and @mariosaenger. This page gives an overview of the new models, datasets, and example tutorials. Some highlights:
Biomedical NER models (#1790)
Flair now has pre-trained models for biomedical NER, trained over unified versions of 31 different biomedical corpora. Because they are trained on so many different datasets, the models generalize robustly to new text and outperform previously available off-the-shelf tools. For instance, to load a model that detects diseases in text, do:
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load disease tagger and predict
tagger = SequenceTagger.load("hunflair-disease")
tagger.predict(sentence)
Done! Let's print the diseases found by the tagger:
for entity in sentence.get_spans():
    print(entity)
This should print:
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
You can also get one model that finds 5 biomedical entity types (diseases, genes, species, chemicals and cell lines), like this:
from flair.models import MultiTagger

# load bio-NER tagger and predict
tagger = MultiTagger.load("hunflair")
tagger.predict(sentence)
Printing the entities as before should now give:
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
Span [5]: "Fmr1" [− Labels: Gene (0.838)]
Span [7]: "Mouse" [− Labels: Species (0.9979)]
So it now also finds genes and species. As explained here, these models work best if you use them together with a biomedical tokenizer.
Biomedical NER datasets (#1790)
Flair now supports 31 biomedical NER datasets out of the box, both in their standard versions and in the "HUNER" splits for reproducibility of experiments. For a full list of datasets, refer to this page.
You can load a dataset like this:
from flair.datasets import JNLPBA

# load one of the bioinformatics corpora
corpus = JNLPBA()
# print statistics and one sentence
print(corpus)
print(corpus.train[0])
We also include "huner" corpora that combine many different biomedical datasets into a single corpus. For instance, if you execute the following line:
from flair.datasets import HUNER_CHEMICAL

# load combined chemicals corpus
corpus = HUNER_CHEMICAL()
This loads a combination of 6 different corpora that contain annotations of chemicals into a single corpus. This allows you to train stronger cross-corpus models, since you now combine training data from many sources. See more info here.
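The idea behind these combined corpora can be sketched in plain Python: each source corpus may annotate chemicals under its own tag name, so pooling training data first requires mapping everything onto one unified label. The corpus contents and tag names below are purely illustrative, not the actual HUNER internals:

```python
# Sketch: pooling several NER corpora under one unified label scheme.
# Corpus contents and tag names here are illustrative, not HUNER internals.

# each source corpus tags chemicals under its own label name
corpus_a = [("aspirin", "B-Chemical")]
corpus_b = [("ibuprofen", "B-CHEM")]

# map every corpus-specific tag onto one unified tag
TAG_MAP = {"B-Chemical": "B-Chemical", "B-CHEM": "B-Chemical"}

def unify(corpus):
    return [(token, TAG_MAP.get(tag, tag)) for token, tag in corpus]

# the pooled corpus combines training data from all sources
pooled = unify(corpus_a) + unify(corpus_b)
print(pooled)  # → [('aspirin', 'B-Chemical'), ('ibuprofen', 'B-Chemical')]
```

Once all sources share one tag set, a single model can be trained on the concatenation, which is what makes the cross-corpus models stronger.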
POS model for Portuguese clinical text (#1789)
Thanks to @LucasFerroHAILab, we now include a model for part-of-speech tagging in Portuguese clinical text. Run this model like this:
from flair.data import Sentence
from flair.models import SequenceTagger

# load the Portuguese clinical PoS tagger
tagger = SequenceTagger.load('pt-pos-clinical')
# example sentence
sentence = Sentence('O vírus Covid causa fortes dores .')
tagger.predict(sentence)
print(sentence)
You can find more details in their paper here.
Model for negation and speculation in biomedical literature (#1758)
Using the BioScope corpus, we trained a model to recognize negation and speculation in biomedical literature. Use it like this:
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence("The picture most likely reflects airways disease")

# load tagger and predict
tagger = SequenceTagger.load("negation-speculation")
tagger.predict(sentence)

# print identified spans
for entity in sentence.get_spans():
    print(entity)
This should print:
Span [4,5,6,7]: "likely reflects airways disease" [− Labels: SPECULATION (0.9992)]
This indicates that the tagger considers this portion of the sentence to be speculation.
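Span output like "Span [4,5,6,7]" above is derived from token-level BIO tags. A minimal decoder sketching how such spans arise (this is not Flair's actual decoding code):

```python
# Sketch: turning token-level BIO tags into labeled spans.
# This mirrors how output like "Span [4,5,6,7]" arises from per-token
# predictions, but it is not Flair's actual implementation.

def bio_to_spans(tokens, tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):          # a new span begins
            if start is not None:
                spans.append((start, i - 1, tags[start][2:]))
            start = i
        elif tag.startswith("I-") and start is not None:
            continue                      # span continues
        else:                             # "O" closes any open span
            if start is not None:
                spans.append((start, i - 1, tags[start][2:]))
            start = None
    if start is not None:                 # close span at sentence end
        spans.append((start, len(tags) - 1, tags[start][2:]))
    return spans

tokens = ["The", "picture", "most", "likely", "reflects", "airways", "disease"]
tags = ["O", "O", "O", "B-SPECULATION", "I-SPECULATION", "I-SPECULATION", "I-SPECULATION"]
print(bio_to_spans(tokens, tags))  # → [(3, 6, 'SPECULATION')]
```

The returned tuple covers tokens 3 to 6 (0-indexed), i.e. "likely reflects airways disease", matching the span printed by the tagger above.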
Other New Features:
MultiTagger (#1791)
We added support for tagging text with multiple models at the same time. This can save memory usage and increase tagging speed.
For instance, if you want to POS tag, chunk, NER and detect frames in your text at the same time, do:
from flair.data import Sentence
from flair.models import MultiTagger

# load taggers for fine-grained POS, universal POS, chunking, NER and frame detection
tagger = MultiTagger.load(['pos', 'upos', 'chunk', 'ner', 'frame'])
# example sentence
sentence = Sentence("George Washington was born in Washington")
# predict
tagger.predict(sentence)
print(sentence)
This will give you a sentence annotated with 5 different layers of annotation.
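One likely source of the savings is shared embeddings: if the loaded models use the same embeddings, the costly embedding step runs only once per sentence and every tagging head reuses the result. A minimal sketch of that idea, with purely illustrative names (not Flair internals):

```python
# Sketch: why tagging with several models at once can be cheaper.
# If the models share the same embeddings, the expensive embedding
# step runs once and every tagging head reuses it. All names below
# are illustrative, not Flair internals.

def embed(tokens):
    # stand-in for an expensive embedding computation
    return [float(len(t)) for t in tokens]

def pos_head(embeddings):
    return ["TAG" for _ in embeddings]   # dummy per-token POS layer

def ner_head(embeddings):
    return ["O" for _ in embeddings]     # dummy per-token NER layer

tokens = "George Washington was born in Washington".split()
embeddings = embed(tokens)               # computed once
layers = {
    "pos": pos_head(embeddings),         # both heads reuse the
    "ner": ner_head(embeddings),         # same embeddings
}
print(sorted(layers))  # → ['ner', 'pos']
```

Each entry in `layers` is one annotation layer over the same sentence, analogous to the 5 layers the MultiTagger call above produces.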
Sentence splitting
Flair now includes convenience methods for sentence splitting. For instance, to use segtok to split and tokenize a text into sentences, use the following code:
from flair.tokenization import SegtokSentenceSplitter
# example text with many sentences
text = "This is a sentence. This is another sentence. I love Berlin."
# initialize sentence splitter
splitter = SegtokSentenceSplitter()
# use splitter to split text into list of sentences
sentences = splitter.split(text)
We also ship other splitters, such as SpacySentenceSplitter (requires spaCy to be installed).
Japanese tokenization (#1786)
Thanks to @himkt, we now have expanded support for Japanese tokenization in Flair. For instance, use the following code to tokenize a Japanese sentence without installing extra libraries:
from flair.data import Sentence
from flair.tokenization import JapaneseTokenizer
# init japanese tokenizer
tokenizer = JapaneseTokenizer("janome")
# make sentence (and tokenize)
sentence = Sentence("私はベルリンが好き", use_tokenizer=tokenizer)
# output tokenized sentence
print(sentence)
One-Cycle Learning (#1776)
Thanks to @lucaventurini2, Flair now supports one-cycle learning, which may give quicker convergence. For instance, train a model in 20 epochs using the code below:
from torch.optim.lr_scheduler import OneCycleLR
from flair.trainers import ModelTrainer

# train as always
trainer = ModelTrainer(tagger, corpus)

# set one-cycle LR as scheduler
trainer.train('onecycle_ner',
              scheduler=OneCycleLR,
              max_epochs=20)
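For intuition, the one-cycle policy ramps the learning rate up to a peak and then anneals it back down over the course of training. Below is a simplified linear sketch of that schedule shape; PyTorch's OneCycleLR uses cosine annealing by default, so the exact curve differs, and the function and parameters here are illustrative only:

```python
# Sketch of a one-cycle learning-rate shape: ramp up to a peak,
# then anneal back down. This is a simplified linear variant;
# torch.optim.lr_scheduler.OneCycleLR defaults to cosine annealing.

def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3):
    warmup = int(total_steps * pct_start)
    if step < warmup:
        return max_lr * step / warmup              # ramp up
    frac = (step - warmup) / (total_steps - warmup)
    return max_lr * (1 - frac)                     # anneal down

lrs = [one_cycle_lr(s, 100) for s in range(100)]
print(max(lrs))  # → 0.1, the peak, reached ~30% into training
```

The single up-then-down cycle is what lets training use large learning rates briefly, which is the usual explanation for the quicker convergence.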
Improvements:
Changes in convention
Turn on tokenizer by default in Sentence object (#1806)
The Sentence object now executes tokenization (use_tokenizer=True) by default:
# Tokenizes by default
sentence = Sentence("I love Berlin.")
print(sentence)
# i.e. this is equivalent to
sentence = Sentence("I love Berlin.", use_tokenizer=True)
print(sentence)
# i.e. if you don't want to use tokenization, set it to False
sentence = Sentence("I love Berlin.", use_tokenizer=False)
print(sentence)
TransformerWordEmbeddings now handle long documents by default
Previously, you had to set allow_long_sentences=True to enable handling of long sequences (greater than 512 subtokens) in TransformerWordEmbeddings. This is no longer necessary, as this value is now set to True by default.
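Handling sequences beyond a transformer's 512-subtoken limit generally means splitting them into overlapping windows and embedding each window separately. A conceptual sketch of such windowing (not Flair's actual implementation; the window and stride parameters are illustrative):

```python
# Sketch: splitting a long subtoken sequence into overlapping
# windows of at most `max_len` items, the usual strategy for
# feeding >512-subtoken documents through a transformer.
# This is illustrative, not Flair's actual code.

def windows(subtokens, max_len=512, stride=256):
    if len(subtokens) <= max_len:
        return [subtokens]                        # fits in one pass
    out = []
    start = 0
    while start < len(subtokens):
        out.append(subtokens[start:start + max_len])
        if start + max_len >= len(subtokens):
            break                                 # last window reached the end
        start += stride                           # overlap keeps context
    return out

long_doc = list(range(1000))                      # pretend subtoken ids
chunks = windows(long_doc)
print([len(c) for c in chunks])  # → [512, 512, 488]
```

The overlap between consecutive windows gives every subtoken some surrounding context, so per-token embeddings near window edges stay usable.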
Bug fixes
- Fix serialization of BytePairEmbeddings (#1802)
- Fix issues with loading models that use ELMoEmbeddings (#1803)
- Allow longer lengths in transformers that can handle more than 512 subtokens (#1804)
- Fix encoding for WASSA datasets (#1766)
- Update BPE package (#1764)
- Improve documentation (#1752 #1778)
- Fix evaluation of TextClassifier if no label_type is passed (#1748)
- Remove torch version checks that throw errors (#1744)
- Update DaNE dataset URL (#1800)
- Fix weight extraction error for empty sentences (#1805)