allenai/allennlp v1.2.0rc1


What's new

Added 🎉

  • Added a warning when batches_per_epoch for the validation data loader is inherited from
    the train data loader.
  • Added a build-vocab subcommand that can be used to build a vocabulary from a training config file.
  • Added tokenizer_kwargs argument to PretrainedTransformerMismatchedIndexer.
  • Added tokenizer_kwargs and transformer_kwargs arguments to PretrainedTransformerMismatchedEmbedder
    (a usage sketch covering both the indexer and the embedder follows this list).
  • Added official support for Python 3.8.
  • Added a script: scripts/release_notes.py, which automatically prepares markdown release notes from the
    CHANGELOG and commit history.
  • Added a flag --predictions-output-file to the evaluate command, which tells AllenNLP to write the
    predictions from the given dataset to the file as JSON lines.
  • Added the ability to ignore certain missing keys when loading a model from an archive. This is done
    by adding a class-level variable called authorized_missing_keys to any PyTorch module that a Model uses.
    If defined, authorized_missing_keys should be a list of regex string patterns (a sketch follows this list).
  • Added FBetaMultiLabelMeasure, a multi-label F-beta metric. This is a subclass of the existing FBetaMeasure
    (a usage sketch follows this list).
  • Added the ability to pass additional keyword arguments to cached_transformers.get(), which will be passed on to AutoModel.from_pretrained().
  • Added an overrides argument to Predictor.from_path() (a usage sketch follows this list).
  • Added a cached-path command.
  • Added a function inspect_cache to common.file_utils that prints useful information about the cache. This can also
    be used from the cached-path command with allennlp cached-path --inspect.
  • Added a function remove_cache_entries to common.file_utils that removes any cache entries matching the given
    glob patterns. This can be used from the cached-path command with allennlp cached-path --remove some-files-*.
  • Added logging for the main process when running in distributed mode.
  • Added a TrainerCallback object to support state sharing between batch and epoch-level training callbacks.
  • Added support for .tar.gz in PretrainedModelInitializer.
  • Added nn/samplers/samplers.py with the MultinomialSampler, TopKSampler, and TopPSampler classes for
    sampling indices from log probabilities.
  • Made BeamSearch registrable.
  • Added top_k_sampling and top_p_sampling BeamSearch implementations.
  • Pass serialization_dir to Model and DatasetReader.
  • Added an optional include_in_archive parameter to the top-level of configuration files. When specified, include_in_archive should be a list of paths relative to the serialization directory which will be bundled up with the final archived model from a training run.
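
A minimal sketch of the new tokenizer_kwargs / transformer_kwargs arguments. The model name and the specific kwargs shown here are illustrative assumptions, not requirements:

```python
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder

# tokenizer_kwargs is forwarded to the underlying Hugging Face tokenizer;
# transformer_kwargs (embedder only) is forwarded when loading the transformer model.
indexer = PretrainedTransformerMismatchedIndexer(
    model_name="bert-base-uncased",              # illustrative model choice
    tokenizer_kwargs={"do_lower_case": True},
)
embedder = PretrainedTransformerMismatchedEmbedder(
    model_name="bert-base-uncased",
    tokenizer_kwargs={"do_lower_case": True},
    transformer_kwargs={"output_attentions": False},
)
```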
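
A sketch of the authorized_missing_keys mechanism; the module and the regex patterns below are purely illustrative:

```python
import torch

class SpanScorer(torch.nn.Module):
    # State-dict keys matching any of these regex patterns may be absent from a
    # loaded archive without raising an error when the Model is restored.
    authorized_missing_keys = [r"^extra_projection\.weight$", r"^extra_projection\.bias$"]

    def __init__(self, hidden_dim: int = 64) -> None:
        super().__init__()
        self.extra_projection = torch.nn.Linear(hidden_dim, hidden_dim)
```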
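
A minimal sketch of FBetaMultiLabelMeasure. It follows the usual AllenNLP metric call pattern; the exact tensor shapes and constructor arguments here are assumptions for illustration:

```python
import torch
from allennlp.training.metrics import FBetaMultiLabelMeasure

metric = FBetaMultiLabelMeasure(beta=1.0, average="micro")

# Predicted label probabilities and multi-hot gold labels for 2 examples over 3 labels.
predictions = torch.tensor([[0.9, 0.1, 0.8], [0.2, 0.7, 0.4]])
gold_labels = torch.tensor([[1, 0, 1], [0, 1, 0]])

metric(predictions, gold_labels)
print(metric.get_metric(reset=True))  # e.g. {"precision": ..., "recall": ..., "fscore": ...}
```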
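
A sketch of the new overrides argument to Predictor.from_path(); the archive path and the override key are hypothetical:

```python
from allennlp.predictors import Predictor

# Tweak the archived configuration at load time via overrides.
predictor = Predictor.from_path(
    "path/to/model.tar.gz",                    # hypothetical archive path
    overrides={"dataset_reader.lazy": False},  # hypothetical override key
)
```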

Changed ⚠️

  • Subcommands that don't require plugins will no longer cause plugins to be loaded or have an --include-package flag.
  • Allow overrides to be a JSON string or a dict (a sketch follows this list).
  • transformers dependency updated to version 3.1.0.
  • When cached_path is called on a local archive with extract_archive=True, the archive is now extracted into a unique subdirectory of the cache root instead of a subdirectory of the archive's directory. The extraction directory is also unique to the modification time of the archive, so if the file changes, subsequent calls to cached_path will know to re-extract the archive (a sketch follows this list).
  • Removed the truncation_strategy parameter from PretrainedTransformerTokenizer. Given the way we call the tokenizer, the truncation strategy had no effect anyway.
  • Don't use initializers when loading a model, as they are not needed.
  • Distributed training will now automatically search for a local open port if the master_port parameter is not provided.
  • In training, save model weights before evaluation.
  • allennlp.common.util.peak_memory_mb renamed to peak_cpu_memory, and allennlp.common.util.gpu_memory_mb renamed to peak_gpu_memory;
    both now return their results in bytes as integers. The peak_gpu_memory function also uses PyTorch functions to find the memory
    usage instead of shelling out to the nvidia-smi command, which is more efficient and more accurate because it only takes
    into account the tensor allocations of the current PyTorch process.
  • Make sure weights are first loaded to the CPU when using PretrainedModelInitializer, preventing wasted GPU memory.
  • Load dataset readers in load_archive.
  • Updated the AllenNlpTestCase docstring to remove the reference to unittest.TestCase.
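
To illustrate the overrides change above, a sketch using load_archive; the archive path and the override key are hypothetical:

```python
from allennlp.models.archival import load_archive

# The two calls are equivalent: overrides may now be a dict as well as a JSON string.
archive = load_archive("path/to/model.tar.gz", overrides={"trainer.cuda_device": -1})
archive = load_archive("path/to/model.tar.gz", overrides='{"trainer.cuda_device": -1}')
```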
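
And a sketch of the new cached_path extraction behavior; the archive path is hypothetical:

```python
from allennlp.common.file_utils import cached_path

# A local archive is now extracted into a unique subdirectory of the cache root,
# keyed in part on the archive's modification time, so a changed file is
# re-extracted on the next call.
extracted_dir = cached_path("/data/dataset.tar.gz", extract_archive=True)
```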

Removed 👋

  • Removed common.util.is_master function.

Fixed ✅

  • Fixed a bug where the reported batch_loss metric was incorrect when training with gradient accumulation.
  • Class decorators are now displayed in the API docs.
  • Fixed up the documentation for the allennlp.nn.beam_search module.
  • Ignore *args when constructing classes with FromParams.
  • Ensured some consistency in the types of the values that metrics return.
  • Fixed a PyTorch warning by explicitly providing the as_tuple argument (leaving
    it at its default value of False) to Tensor.nonzero().
  • The temporary directory created when extracting a model archive in load_archive
    is now removed at the end of the function rather than via atexit.
  • Fixed a bug where using cached_path() offline could return a cached resource's lock file instead
    of the cache file.
  • Fixed a bug where cached_path() would fail if passed a cache_dir with the user home shortcut ~/.
  • Fixed a bug in our doc building script where markdown links did not render properly
    if the "href" part of the link (the part inside the ()) was on a new line.
  • Changed how gradients are zeroed out, following an optimization recommended in an NVIDIA presentation.
  • Fixed a bug where parameters to a FromParams class that are dictionaries wouldn't get logged
    when an instance is instantiated with from_params.
  • Fixed a bug in distributed training where the vocab would be saved from every worker, when it should have been saved by only the local master process.
  • Fixed a bug in the calculation of rouge metrics during distributed training where the total sequence count was not being aggregated across GPUs.
  • Fixed allennlp.nn.util.add_sentence_boundary_token_ids() to use device parameter of input tensor.
  • Be sure to close the TensorBoard writer even when training doesn't finish.
  • Fixed the docstring for PyTorchSeq2VecWrapper.

Commits

01644ca Pass serialization_dir to Model, DatasetReader, and support include_in_archive (#4713)
1f29f35 Update transformers requirement from <3.4,>=3.1 to >=3.1,<3.5 (#4741)
6bb9ce9 warn about batches_per_epoch with validation loader (#4735)
00bb6c5 Be sure to close the TensorBoard writer (#4731)
3f23938 Update mkdocs-material requirement from <6.1.0,>=5.5.0 to >=5.5.0,<6.2.0 (#4738)
10c11ce Fix typo in PretrainedTransformerMismatchedEmbedder docstring (#4737)
0e64b4d fix docstring for PyTorchSeq2VecWrapper (#4734)
006bab4 Don't use PretrainedModelInitializer when loading a model (#4711)
ce14bdc Allow usage of .tar.gz with PretrainedModelInitializer (#4709)
c14a056 avoid defaulting to CPU device in add_sentence_boundary_token_ids() (#4727)
24519fd fix typehint on checkpointer method (#4726)
d3c69f7 Bump mypy from 0.782 to 0.790 (#4723)
cccad29 Updated AllenNlpTestCase docstring (#4722)
3a85e35 add reasonable timeout to gpu checks job (#4719)
1ff0658 Added logging for the main process when running in distributed mode (#4710)
b099b69 Add top_k and top_p sampling to BeamSearch (#4695)
bc6f15a Fixes rouge metric calculation corrected for distributed training (#4717)
ae7cf85 automatically find local open port in distributed training (#4696)
321d4f4 TrainerCallback with batch/epoch/end hooks (#4708)
001e1f7 new way of setting env variables in GH Actions (#4700)
c14ea40 Save checkpoint before running evaluation (#4704)
40bb47a Load weights to cpu with PretrainedModelInitializer (#4712)
327188b improve memory helper functions (#4699)
90f0037 fix reported batch_loss (#4706)
39ddb52 CLI improvements (#4692)
edcb6d3 Fix a bug in saving vocab during distributed training (#4705)
3506e3f ensure parameters that are actual dictionaries get logged (#4697)
eb7f256 Add StackOverflow link to README (#4694)
17c3b84 Fix small typo (#4686)
e0b2e26 display class decorators in API docs (#4685)
b9a9284 Update transformers requirement from <3.3,>=3.1 to >=3.1,<3.4 (#4684)
d9bdaa9 add build-vocab command (#4655)
ce604f1 Update mkdocs-material requirement from <5.6.0,>=5.5.0 to >=5.5.0,<6.1.0 (#4679)
c3b5ed7 zero grad optimization (#4673)
9dabf3f Add missing tokenizer/transformer kwargs (#4682)
9ac6c76 Allow overrides to be JSON string or dict (#4680)
55cfb47 The truncation setting doesn't do anything anymore (#4672)
990c9c1 clarify conda Python version in README.md
97db538 official support for Python 3.8 🐍 (#4671)
1e381bb Clean up the documentation for beam search (#4664)
11def8e Update bug_report.md
97fe88d Cached path command (#4652)
c9f376b Update transformers requirement from <3.2,>=3.1 to >=3.1,<3.3 (#4663)
e5e3d02 tick version for nightly releases
b833f90 fix multi-line links in docs (#4660)
d7c06fe Expose from_pretrained keyword arguments (#4651)
175c76b fix confusing distributed logging info (#4654)
fbd2ccc fix numbering in RELEASE_GUIDE
2d5f24b improve how cached_path extracts archives (#4645)
824f97d smooth out release process (#4648)
c7b7c00 Feature/prevent temp directory retention (#4643)
de5d68b Fix tensor.nonzero() function overload warning (#4644)
e8e89d5 add flag for saving predictions to 'evaluate' command (#4637)
e4fd5a0 Multi-label F-beta metric (#4562)
f0e7a78 Create Dependabot config file (#4635)
0e33b0b Return consistent types from metrics (#4632)
2df364f Update transformers requirement from <3.1,>=3.0 to >=3.0,<3.2 (#4621)
6d480aa Improve handling of **kwargs in FromParams (#4629)
bf3206a Workaround for Python not finding imports in spawned processes (#4630)
