Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.
Old English
MWT improvements
- Fix words ending with -nna being split into MWT stanfordnlp/handparsed-treebank@2c48d40 #1366
- Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) #1371 #1378
- Mark start_char and end_char on an MWT if it is composed of exactly its subwords 2384089 #1361
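
The "pieces add up to the whole" constraint and the character offsets it makes possible can be illustrated with a small sketch. This is not Stanza's implementation: the function name `align_pieces` and its return shape are hypothetical, and the sketch assumes only that a split is acceptable when the pieces concatenate exactly to the token text.

```python
from typing import List, Optional, Tuple


def align_pieces(token: str, pieces: List[str],
                 start_char: int = 0) -> Optional[List[Tuple[str, int, int]]]:
    """Return (piece, start_char, end_char) for each piece, or None if the
    pieces do not concatenate exactly to the token text.

    Hypothetical helper for illustration only.
    """
    if "".join(pieces) != token:
        # The split invented or dropped characters, so reject it.
        return None
    spans = []
    offset = start_char
    for piece in pieces:
        # Each piece covers a contiguous slice of the original token.
        spans.append((piece, offset, offset + len(piece)))
        offset += len(piece)
    return spans


print(align_pieces("can't", ["ca", "n't"], start_char=10))
print(align_pieces("can't", ["can", "not"]))
```

The second call returns `None` because "can" + "not" does not spell "can't", which is the kind of split the constraint is meant to rule out.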
Peft memory management
- Previous versions loaded multiple copies of the transformer in order to use adapters. To save memory, we can instead use Peft's ability to attach multiple adapters to the same transformer, as long as the adapters have different names. This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models. huggingface/peft#1523 #1381 #1384
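
The sharing pattern described above can be sketched without any real transformer. The class below is a toy registry, not Stanza or Peft code: `SharedTransformerCache`, `fake_load`, and the name `example-transformer` are all hypothetical, and the loader is a stand-in for an expensive model load.

```python
from typing import Callable, Dict, Set


class SharedTransformerCache:
    """Toy registry: one base model per transformer name, many named adapters."""

    def __init__(self, load_base: Callable[[str], object]):
        self._load_base = load_base
        self._bases: Dict[str, object] = {}
        self._adapters: Dict[str, Set[str]] = {}

    def attach(self, transformer_name: str, adapter_name: str) -> object:
        # Load the base transformer only the first time it is requested.
        if transformer_name not in self._bases:
            self._bases[transformer_name] = self._load_base(transformer_name)
            self._adapters[transformer_name] = set()
        # Adapters sharing one base must have distinct names.
        if adapter_name in self._adapters[transformer_name]:
            raise ValueError(f"adapter name already in use: {adapter_name}")
        self._adapters[transformer_name].add(adapter_name)
        return self._bases[transformer_name]


load_calls = []


def fake_load(name: str) -> object:
    # Stand-in for an expensive transformer load (hypothetical).
    load_calls.append(name)
    return object()


cache = SharedTransformerCache(fake_load)
pos_base = cache.attach("example-transformer", "pos")
dep_base = cache.attach("example-transformer", "depparse")
print(pos_base is dep_base, len(load_calls))
```

Both processors end up holding the same base object, and the loader runs once. The sketch models only the bookkeeping; in a real setup the adapter weights themselves would be attached through Peft under those distinct names.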
Other bugfixes and minor upgrades
- Fix crash when trying to load previously unknown language #1360 381736f
- Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: d180ae0 #1367
- Try to avoid OOM in the POS in the Pipeline by reducing its max batch length 4271813
- Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) 597d48f
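
The isatty check in the list above guards against a replaced sys.stderr that does not implement the full stream interface. A minimal sketch of such a guard follows; the helper `stream_is_tty` and the `MinimalStream` class are hypothetical and are not taken from the Stanza source.

```python
import io
import sys


def stream_is_tty(stream) -> bool:
    """True only if the stream has a callable isatty that reports a terminal."""
    isatty = getattr(stream, "isatty", None)
    return callable(isatty) and bool(isatty())


class MinimalStream:
    """A monkeypatched stand-in that only implements write()."""

    def write(self, text):
        return len(text)


# A stream without isatty is treated as "not a terminal" instead of crashing.
print(stream_is_tty(MinimalStream()))
# StringIO has isatty, but it is not attached to a terminal.
print(stream_is_tty(io.StringIO()))
# The real answer for sys.stderr depends on how the program was launched.
print(stream_is_tty(sys.stderr))
```

Using `getattr` with a default keeps the check to a single lookup and avoids an `AttributeError` on objects that were never meant to be full file objects.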
Other upgrades
- Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation 57bfa8b #1356 (comment)
- Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: 4048cae 15b136b
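
Dropping functional tags from constituency labels can be pictured as cutting each label at its first separator character. The sketch below is an assumption about the general technique, not the parser's actual code: the separator set, the name `strip_function_tags`, and the example labels are illustrative, with `*` included as the newly added separator.

```python
import re

# Hypothetical separator set; "*" is the addition described above.
FUNCTION_TAG_SPLIT = re.compile(r"[-=#*]")


def strip_function_tags(label: str) -> str:
    """Keep only the part of a constituent label before the first separator."""
    # Labels such as -LRB- start with a separator; leave those untouched.
    if len(label) <= 1 or FUNCTION_TAG_SPLIT.match(label):
        return label
    return FUNCTION_TAG_SPLIT.split(label, maxsplit=1)[0]


for label in ["NP-SBJ", "NP*X", "-LRB-", "VP"]:
    print(label, "->", strip_function_tags(label))
```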