CoreNLP 4.5.0
The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.
- All PTB and German tokens are now normalized in PTBLexer (previously only German umlauts were). This makes the tokenizer 2% slower, but should avoid issues with tokens such as resume': d46fecd
- log4j removed entirely from public CoreNLP (the internal "research" branch still has a use for it): f05cb54
- Fix NumberFormatException showing up in NER models: #547 5ee2c39
- Fix "seconds" in the lemmatizer: e7a073b
- Fix double escaping of & in the online demos: 8413fa1
- Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0
- Merge ssplit and cleanxml into the tokenize annotator (done in a backwards-compatible manner): #1259
- Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263
- Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64
- Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, and invisible commas: 9476a8e 6193934 afb1ea8 7c84960
- Significant lemmatizer improvements for adjectives and adverbs, along with various other special cases: #1266
- Include graph and semgrex indices in the results for a semgrex query, which should make the results more usable: 45b47e2
- Trim words in the NER training process. Spaces can still appear inside a word, but stray whitespace won't ruin the performance of the models: 0d9e9c8
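
For readers wondering what the PTBLexer token normalization item is guarding against, here is a minimal sketch using only the JDK. It assumes the normalization in question is Unicode composition (NFC), which the notes do not state explicitly. The class name TokenNormalizer and the method toNfc are hypothetical and are not part of CoreNLP.

```java
import java.text.Normalizer;

// Hypothetical helper, not CoreNLP code: illustrates NFC composition of a token.
final class TokenNormalizer {

    // Compose base letters and combining marks into single code points (NFC).
    static String toNfc(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "resume" where each accented e is a plain 'e' followed by a combining acute accent
        String decomposed = "re\u0301sume\u0301";
        String composed = toNfc(decomposed);

        // The two strings render the same but differ in length before normalization.
        System.out.println(decomposed.length() + " -> " + composed.length()); // prints "8 -> 6"
        System.out.println(composed.equals("r\u00e9sum\u00e9"));              // prints "true"
    }
}
```

Without a step like this, the composed and decomposed spellings of the same word would look like two different tokens to any downstream model or dictionary lookup.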
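
The NER training change can be pictured with a small sketch: whitespace around a word is stripped, while spaces inside the word are left alone. The two-column, tab-separated line format and the NerTrainingLine class below are assumptions made for illustration, not the actual CoreNLP training reader.

```java
// Hypothetical illustration, not CoreNLP code: trims the columns of one training line.
final class NerTrainingLine {
    final String word;
    final String label;

    NerTrainingLine(String word, String label) {
        this.word = word;
        this.label = label;
    }

    // Parse one "word<TAB>label" line (assumed format), stripping whitespace
    // around each column only. Spaces inside the word are preserved.
    static NerTrainingLine parse(String line) {
        String[] columns = line.split("\t", -1);
        if (columns.length != 2) {
            throw new IllegalArgumentException("Expected 2 tab-separated columns, got " + columns.length);
        }
        return new NerTrainingLine(columns[0].strip(), columns[1].strip());
    }

    public static void main(String[] args) {
        NerTrainingLine parsed = parse("  New York \tLOCATION ");
        System.out.println("[" + parsed.word + "] [" + parsed.label + "]"); // prints "[New York] [LOCATION]"
    }
}
```

Stripping per column rather than per line is the design point here: the tab separator stays intact, and a word such as "New York" keeps its internal space.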