github stanfordnlp/CoreNLP v4.5.0

23 months ago

CoreNLP 4.5.0

The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

  • All PTB and German tokens are now normalized in PTBLexer (previously only German umlauts were).
    This makes the tokenizer 2% slower, but should avoid issues with tokens such as resume', for example

  • log4j removed entirely from public CoreNLP (internal "research" branch still has a use)

  • Fix NumberFormatException showing up in NER models: #547 5ee2c39

  • Fix "seconds" in the lemmatizer: e7a073b

  • Fix double escaping of & in the online demos: 8413fa1

  • Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0

  • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): #1259

  • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263

  • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64

  • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas 9476a8e 6193934 afb1ea8 7c84960

  • Significant lemmatizer improvements: adjectives & adverbs, along with various other special cases #1266

  • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) 45b47e2

  • Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models 0d9e9c8
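
Two of the items above, token normalization in PTBLexer and word trimming in NER training, can be pictured with a few lines of plain Java. This is only an illustrative sketch, not the CoreNLP code. It assumes that "normalized" means Unicode canonical composition (NFC), which the notes do not state, and the class and method names (TokenCleanupSketch, normalizeToken, trimWord) are hypothetical.

```java
import java.text.Normalizer;

// Illustrative sketch only; none of these names come from CoreNLP.
class TokenCleanupSketch {

    // Assumption: normalization means NFC, i.e. a base letter followed by a
    // combining mark is composed into a single code point where one exists.
    static String normalizeToken(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFC);
    }

    // Removes leading and trailing whitespace but leaves interior spaces
    // alone, so a multi-word token such as "New York" stays one word.
    static String trimWord(String word) {
        return word.trim();
    }

    public static void main(String[] args) {
        // "resume" followed by a separate combining acute accent
        String decomposed = "resume\u0301";
        String composed = normalizeToken(decomposed);

        // The composed form is one character shorter than the decomposed form
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println("[" + trimWord("  New York \t") + "]");
    }
}
```

With normalization applied up front, two spellings of the same word that differ only in how the accent is encoded end up as identical strings, which is the kind of mismatch the first item appears to target.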
