pypi stanza 1.10.0
v1.10.0 - rebuild with UD 2.15

19 hours ago

In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.
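
As a rough illustration of how one of the newly supported languages might be loaded, the sketch below wraps the usual download-then-load steps in a helper. The helper name `build_pipeline` is hypothetical, and the language code `"ka"` for Georgian mentioned afterwards is an assumption that should be checked against the published model list.

```python
def build_pipeline(lang: str, package: str = "default"):
    """Download the models for `lang` if needed, then load a pipeline.

    `build_pipeline` is a hypothetical helper for this sketch, not part of Stanza.
    """
    # Imported lazily so this module can be imported without Stanza installed.
    import stanza

    # Fetch the requested model package for the language (needs network access).
    stanza.download(lang, package=package)

    # Build a pipeline from the downloaded models.
    return stanza.Pipeline(lang, package=package)
```

Under those assumptions, `nlp = build_pipeline("ka")` followed by `doc = nlp(text)` would annotate Georgian text once the models have been downloaded.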

Other notable changes:

  • Include a contextual lemmatizer for English in the default_accurate package, used to resolve 's to be or have. A Hindi (HI) model has also been built, and others may follow. #1422
  • Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
  • PyTorch compatibility: set weights_only=True when loading models. #1430 #1429
  • Augment MWT tokenization to accommodate unexpected ' characters, including " used in "s: #1437 #1436
  • When training the lemmatizer, take advantage of CorrectForm annotations in the UD treebanks: dbdf429
  • Add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
  • Add the VLSP 2023 constituency dataset: 1159d0d
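
The weights_only=True item above refers to the `weights_only` argument of `torch.load`, which restricts what a checkpoint file may instantiate while it is unpickled. The idea can be sketched with the standard library alone. This is not PyTorch's implementation, and the allowlist `SAFE_GLOBALS` and the helper `restricted_loads` are hypothetical names for this illustration.

```python
import io
import pickle

# Hypothetical allowlist for this sketch; PyTorch maintains its own, larger list.
SAFE_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not on the allowlist."""

    def find_class(self, module, name):
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes):
    """Unpickle `data`, permitting only plain containers and allowlisted classes."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

A payload that contains only basic containers and allowlisted classes loads normally, while one that references an arbitrary callable is rejected with `pickle.UnpicklingError` instead of being resolved.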

Bugfixes:

  • Call raise_for_status earlier when a download fails, so that the proper error is displayed.
    Thank you @pattersam #1432
  • Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
  • Reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
  • Similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
  • Missing text for a Document no longer causes the NER model to crash: 0732628 #1428
  • Tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423
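
To make the URL item above concrete, here is a small rule-based sketch of keeping a URL as one token regardless of its TLD. It does not reproduce the change in f59ccd8 or the way Stanza's tokenizer works. The patterns and the `tokenize` helper are hypothetical and only illustrate the intended behavior.

```python
import re

# Hypothetical pattern: a scheme or "www." followed by non-space characters.
# It deliberately does not check the TLD against a known list.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
WORD_OR_PUNCT_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, keeping each URL whole."""
    tokens = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Tokenize the ordinary text that precedes this URL.
        tokens.extend(WORD_OR_PUNCT_RE.findall(text[pos:match.start()]))
        url = match.group()
        # Peel trailing sentence punctuation off the URL into separate tokens.
        trailing = []
        while url and url[-1] in ".,;:!?)":
            trailing.append(url[-1])
            url = url[:-1]
        tokens.append(url)
        tokens.extend(reversed(trailing))
        pos = match.end()
    # Tokenize whatever follows the last URL.
    tokens.extend(WORD_OR_PUNCT_RE.findall(text[pos:]))
    return tokens
```

With this sketch, a URL whose TLD is not on any known list still comes out as a single token, and the sentence-final period becomes a token of its own.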
