Lightning-AI/pytorch-lightning 1.3.0
Lightning CLI, PyTorch Profiler, Improved Early Stopping

Today we are excited to announce Lightning 1.3, containing highly anticipated new features including a new Lightning CLI, improved TPU support, integrations such as the PyTorch profiler, new early stopping strategies, predict and validate Trainer routines, and more.

https://medium.com/pytorch/pytorch-lightning-1-3-lightning-cli-pytorch-profiler-improved-early-stopping-6e0ffd8deb29
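
For a quick taste of the headline feature, here is a minimal sketch of a LightningCLI script as we understand the new API; MyModel, MyDataModule and the my_project imports are hypothetical placeholders for your own LightningModule and LightningDataModule:

```python
# cli_demo.py -- minimal LightningCLI sketch (MyModel / MyDataModule are
# hypothetical placeholders, not part of this release).
from pytorch_lightning.utilities.cli import LightningCLI

from my_project.models import MyModel       # hypothetical import
from my_project.data import MyDataModule    # hypothetical import

if __name__ == "__main__":
    # Builds an argument parser from the signatures of the model, the
    # datamodule and the Trainer, then runs trainer.fit(model, datamodule).
    LightningCLI(MyModel, MyDataModule)
```

Arguments can then come from the command line or a YAML config, e.g. `python cli_demo.py --trainer.max_epochs=10 --config config.yaml`; the exact flags are derived from the `__init__` signatures of your model, datamodule and the Trainer.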

[1.3.0] - 2021-05-06

Added

  • Added support for the EarlyStopping callback to run at the end of the training epoch (#6944)
  • Added synchronization points before and after setup hooks are run (#7202)
  • Added a teardown hook to ClusterEnvironment (#6942)
  • Added utilities for converting metrics to scalars (#7180)
  • Added utilities for NaN/Inf detection in gradients and parameters (#6834)
  • Added more explicit exception message when trying to execute trainer.test() or trainer.validate() with fast_dev_run=True (#6667)
  • Added the LightningCLI class to provide a reproducible training CLI with minimal boilerplate (#4492, #6862, #7156, #7299)
  • Added gradient_clip_algorithm argument to Trainer for gradient clipping by value (#6123)
  • Added a way to print to terminal without breaking up the progress bar (#5470)
  • Added support to checkpoint after training steps in ModelCheckpoint callback (#6146)
  • Added TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED} (#7173)
  • Added Trainer.validate() method to perform one evaluation epoch over the validation set (#4948)
  • Added LightningEnvironment for Lightning-specific DDP (#5915)
  • Added teardown() hook to LightningDataModule (#4673)
  • Added auto_insert_metric_name parameter to ModelCheckpoint (#6277)
  • Added arg to self.log that enables users to give custom names when dealing with multiple dataloaders (#6274)
  • Added teardown method to BaseProfiler to enable subclasses defining post-profiling steps outside of __del__ (#6370)
  • Added setup method to BaseProfiler to enable subclasses defining pre-profiling steps for every process (#6633)
  • Added a warning when predict returns no output (#6139)
  • Added Trainer.predict config validation (#6543)
  • Added AbstractProfiler interface (#6621)
  • Added support for including module names for forward in the autograd trace of PyTorchProfiler (#6349)
  • Added support for the PyTorch 1.8.1 autograd profiler (#6618)
  • Added outputs parameter to callback's on_validation_epoch_end & on_test_epoch_end hooks (#6120)
  • Added configure_sharded_model hook (#6679)
  • Added support for precision=64, enabling training with double precision (#6595)
  • Added support for DDP communication hooks (#6736)
  • Added artifact_location argument to MLFlowLogger which will be passed to the MlflowClient.create_experiment call (#6677)
  • Added model parameter to precision plugins' clip_gradients signature (#6764, #7231)
  • Added is_last_batch attribute to Trainer (#6825)
  • Added LightningModule.lr_schedulers() for manual optimization (#6567)
  • Added MpModelWrapper in TPU Spawn (#7045)
  • Added max_time Trainer argument to limit training time (#6823)
  • Added on_predict_{batch,epoch}_{start,end} hooks (#7141)
  • Added new EarlyStopping parameters stopping_threshold and divergence_threshold (#6868) (see the sketch after this list)
  • Added debug flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
  • Added new UnrepeatedDistributedSampler and IndexBatchSamplerWrapper for tracking distributed predictions (#7215)
  • Added trainer.predict(return_predictions=None|False|True) (#7215)
  • Added BasePredictionWriter callback to implement prediction saving (#7127)
  • Added trainer.tune(scale_batch_size_kwargs, lr_find_kwargs) arguments to configure the tuning algorithms (#7258)
  • Added tpu_distributed check for TPU Spawn barrier (#7241)
  • Added device updates to TPU Spawn for Pod training (#7243)
  • Added a warning when a Callback is missing and resume_from_checkpoint is used (#7254)
  • Added DeepSpeed single-file checkpoint saving (#6900)
  • Added a training type plugins registry (#6982, #7063, #7214, #7224)
  • Added ignore parameter to save_hyperparameters (#6056)
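
To make a few of these additions concrete, here is a small, hedged sketch combining the new max_time Trainer argument, gradient clipping by value, the new EarlyStopping thresholds and the new Trainer.validate() routine; the tiny LitRegressor model and the random data are purely illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


class LitRegressor(pl.LightningModule):
    """Tiny illustrative model; not part of the release itself."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader():
    # Random data, just enough to make the script runnable end to end.
    return DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=16)


early_stop = EarlyStopping(
    monitor="val_loss",
    stopping_threshold=0.05,    # stop as soon as the metric is good enough
    divergence_threshold=10.0,  # stop immediately if the metric blows up
)

trainer = pl.Trainer(
    max_epochs=3,
    max_time="00:00:10:00",           # additionally cap training at 10 minutes (DD:HH:MM:SS)
    gradient_clip_val=0.5,
    gradient_clip_algorithm="value",  # clip gradients by value instead of by norm
    callbacks=[early_stop],
)

model = LitRegressor()
trainer.fit(model, train_dataloader=make_loader(), val_dataloaders=make_loader())

# New in 1.3: run a standalone evaluation epoch over the validation set.
trainer.validate(model, val_dataloaders=make_loader())
```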

Changed

  • Changed LightningModule.truncated_bptt_steps to be a property (#7323)
  • Changed the EarlyStopping callback so that it no longer runs EarlyStopping.on_validation_end by default when only training is run; set check_on_train_epoch_end to run the callback at the end of the training epoch instead of at the end of the validation epoch (#7069) (see the sketch after this list)
  • Renamed pytorch_lightning.callbacks.swa to pytorch_lightning.callbacks.stochastic_weight_avg (#6259)
  • Refactor RunningStage and TrainerState usage (#4945, #7173)
    • Added RunningStage.SANITY_CHECKING
    • Added TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
    • Changed trainer.evaluating to return True if validating or testing
  • Changed setup() and teardown() stage argument to take any of {fit,validate,test,predict} (#6386)
  • Changed profilers to save separate report files per state and rank (#6621)
  • The trainer no longer tries to save a checkpoint on exception or run callback's on_train_end functions (#6864)
  • Changed PyTorchProfiler to use torch.autograd.profiler.record_function to record functions (#6349)
  • Disabled lr_scheduler.step() in manual optimization (#6825)
  • Changed warnings and recommendations for dataloaders in ddp_spawn (#6762)
  • pl.seed_everything will now also set the seed on the DistributedSampler (#7024)
  • Changed default setting for communication of multi-node training using DDPShardedPlugin (#6937)
  • trainer.tune() now returns the tuning result (#7258)
  • LightningDataModule.from_datasets() now accepts IterableDataset instances as training datasets (#7503)
  • Changed resume_from_checkpoint warning to an error when the checkpoint file does not exist (#7075)
  • Automatically set sync_batchnorm for training_type_plugin (#6536)
  • Allowed training type plugin to delay optimizer creation (#6331)
  • Removed ModelSummary validation from train loop on_trainer_init (#6610)
  • Moved save_function to accelerator (#6689)
  • Updated DeepSpeed ZeRO (#6546, #6752, #6142, #6321)
  • Improved verbose logging for EarlyStopping callback (#6811)
  • Run ddp_spawn dataloader checks on Windows (#6930)
  • Updated MLFlowLogger to use resolve_tags (#6746)
  • Moved save_hyperparameters to its own function (#7119)
  • Replaced _DataModuleWrapper with __new__ (#7289)
  • Reset current_fx properties on lightning module in teardown (#7247)
  • Auto-set DataLoader.worker_init_fn with seed_everything (#6960)
  • Removed model.trainer call inside the dataloading mixin (#7317)
  • Split profilers module (#6261)
  • Ensured the accelerator is valid when running interactively (#5970)
  • Disabled batch transfer in DP mode (#6098)
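
As an illustration of the changed EarlyStopping default and the new seed handling mentioned above, here is a minimal sketch, assuming the check_on_train_epoch_end flag and the workers argument to seed_everything introduced in this release:

```python
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.callbacks import EarlyStopping

# Per the entries above, seed_everything now also seeds the DistributedSampler
# and, with workers=True, configures a worker_init_fn that seeds DataLoader workers.
seed_everything(42, workers=True)

# Previously, the training loop explicitly called EarlyStopping.on_validation_end
# when no validation was run; opting in via check_on_train_epoch_end now runs the
# check at the end of each training epoch instead.
early_stop = EarlyStopping(
    monitor="train_loss",           # assumed metric logged in your training_step
    patience=5,
    check_on_train_epoch_end=True,
)

trainer = Trainer(callbacks=[early_stop])
```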

Deprecated

  • Deprecated outputs in both LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#7339)
  • Deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#7323)
  • Deprecated LightningModule.grad_norm in favor of pytorch_lightning.utilities.grads.grad_norm (#7292)
  • Deprecated the save_function property from the ModelCheckpoint callback (#7201)
  • Deprecated LightningModule.write_predictions and LightningModule.write_predictions_dict (#7066)
  • Deprecated TrainerLoggingMixin in favor of a separate utilities module for metric handling (#7180)
  • Deprecated TrainerTrainingTricksMixin in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
  • Deprecated period in favor of every_n_val_epochs in the ModelCheckpoint callback (#6146) (see the sketch after this list)
  • Deprecated trainer.running_sanity_check in favor of trainer.sanity_checking (#4945)
  • Deprecated Profiler(output_filename) in favor of dirpath and filename (#6621)
  • Deprecated PytorchProfiler(profiled_functions) in favor of record_functions (#6349)
  • Deprecated @auto_move_data in favor of trainer.predict (#6993)
  • Deprecated Callback.on_load_checkpoint(checkpoint) in favor of Callback.on_load_checkpoint(trainer, pl_module, checkpoint) (#7253)
  • Deprecated metrics in favor of torchmetrics (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
  • Deprecated the LightningModule.datamodule getter and setter methods; access them through Trainer.datamodule instead (#7168)
  • Deprecated the use of Trainer(gpus="i") (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
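
A short migration sketch for a few of these deprecations; all values and paths are illustrative only:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.profiler import SimpleProfiler

# ModelCheckpoint: `period` is deprecated in favor of `every_n_val_epochs`.
# Before: ModelCheckpoint(period=2)
checkpoint = ModelCheckpoint(every_n_val_epochs=2)

# Profilers: `output_filename` is deprecated in favor of `dirpath`/`filename`.
# Before: SimpleProfiler(output_filename="logs/profile.txt")
profiler = SimpleProfiler(dirpath="logs", filename="profile")

# GPU selection: passing a string index is deprecated; from v1.5 the string will
# set the number of GPUs. Use a list to select a specific device instead.
# Before: Trainer(gpus="3")  # selected GPU index 3
trainer = Trainer(gpus=[3], callbacks=[checkpoint], profiler=profiler)
```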

Removed

  • Removed the exp_save_path property from the LightningModule (#7266)
  • Removed training loop explicitly calling EarlyStopping.on_validation_end if no validation is run (#7069)
  • Removed automatic_optimization as a property from the training loop in favor of LightningModule.automatic_optimization (#7130)
  • Removed evaluation loop legacy returns for *_epoch_end hooks (#6973)
  • Removed support for passing a bool value to profiler argument of Trainer (#6164)
  • Removed no return warning from val/test step (#6139)
  • Removed passing a ModelCheckpoint instance to Trainer(checkpoint_callback) (#6166)
  • Removed deprecated Trainer argument enable_pl_optimizer and automatic_optimization (#6163)
  • Removed deprecated metrics (#6161)
    • from pytorch_lightning.metrics.functional.classification removed to_onehot, to_categorical, get_num_classes, roc, multiclass_roc, average_precision, precision_recall_curve, multiclass_precision_recall_curve
    • from pytorch_lightning.metrics.functional.reduction removed reduce, class_reduce
  • Removed deprecated ModelCheckpoint arguments prefix, mode="auto" (#6162)
  • Removed mode='auto' from EarlyStopping (#6167)
  • Removed epoch and step arguments from ModelCheckpoint.format_checkpoint_name(), these are now included in the metrics argument (#7344)
  • Removed legacy references for magic keys in the Result object (#6016)
  • Removed deprecated LightningModule hparams setter (#6207)
  • Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the "log"/"progress_bar" magic keys. Use self.log instead (#6734)
  • Removed trainer.fit() return value of 1; it no longer returns anything (#7237)
  • Removed logger_connector legacy code (#6733)
  • Removed unused mixin attributes (#6487)

Fixed

  • Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
  • Fixed attaching train and validation dataloaders when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 (#7207)
  • Added a barrier in the accelerator teardown to synchronize processes before execution finishes (#6814)
  • Fixed multi-node DDP sub-process launch by using local_rank instead of global_rank for main process assertion (#7061)
  • Fixed incorrect removal of WORLD_SIZE environment variable in DDP training when launching with torch distributed/torchelastic (#6942)
  • Made the Plugin.reduce method more consistent across all Plugins to reflect a mean-reduction by default (#6011)
  • Moved the LightningModule to the correct device type when using LightningDistributedWrapper (#6070)
  • Do not print top-k verbose log with ModelCheckpoint(monitor=None) (#6109)
  • Fixed ModelCheckpoint(save_top_k=0, save_last=True) not saving the last checkpoint (#6136)
  • Fixed .teardown(stage='fit') and .on_fit_{start,end}() getting called during trainer.test (#6386)
  • Fixed LightningModule all_gather on cpu tensors (#6416)
  • Fixed torch distributed not available in setup hook for DDP (#6506)
  • Fixed trainer.tuner.{lr_find,scale_batch_size} not setting the Trainer state properly (#7258)
  • Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
  • Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
  • Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_epoch_end hook (#6969)
  • Fixed a bug where the outputs passed to train_batch_end would be listed even when using a single optimizer and no truncated backprop through time steps (#6969)
  • Fixed a bug in trainer error handling that would cause a hang in distributed training (#6864)
  • Fixed self.device not returning the correct device in replicas of data-parallel (#6414)
  • Fixed lr_find trying beyond num_training steps and suggesting too high a learning rate (#7076)
  • Fixed logger creating incorrect version folder in DDP with repeated Trainer.fit calls (#7077)
  • Fixed metric objects passed directly to self.log not being reset correctly (#7055)
  • Fixed CombinedLoader in distributed settings for validation / testing (#7102)
  • Fixed the save_dir in WandbLogger when the run was initiated externally (#7106)
  • Fixed num_sanity_val_steps affecting reproducibility of training data shuffling (#7014)
  • Fixed resetting device after fitting/evaluating/predicting (#7188)
  • Fixed bug where trainer.tuner.scale_batch_size(max_trials=0) would not return the correct batch size result (#7262)
  • Fixed metrics not being properly logged with precision=16 and manual_optimization (#7228)
  • Fixed BaseFinetuning to properly reload optimizer_states when using resume_from_checkpoint (#6891)
  • Fixed parameters_to_ignore not being properly set on the DDP wrapper (#7239)
  • Fixed parsing of fast_dev_run=True with the built-in ArgumentParser (#7240)
  • Fixed handling an IterableDataset that fails to produce a batch at the beginning of an epoch (#7294)
  • Fixed LightningModule.save_hyperparameters() when attempting to save an empty container (#7268)
  • Fixed apex not properly instantiated when running with ddp (#7274)
  • Fixed optimizer state not moved to GPU (#7277)
  • Fixed custom init args for WandbLogger (#6989)
  • Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
  • Fixed examples (#6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398)
  • Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
  • Updated logic for checking TPU availability (#6767)
  • Resolved missed rendezvous in TPU training (#6781)
  • Fixed auto-scaling mode when calling the tune method on the Trainer (#7321)
  • Fixed fine-tuning so that complex models are correctly unfrozen (#6880)
  • Ensured the eval/train flag is set correctly on the accelerator model (#6877)
  • Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
  • Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
  • Fixed gradient_clip_algorithm having no effect (#6928)
  • Fixed CUDA OOM detection and handling (#6934)
  • Fixed unfreeze_and_add_param_group expecting modules rather than a single module (#6822)
  • Fixed DDP + SyncBN when moving the model to the device (#6838)
  • Fixed missing arguments in lr_find call (#6784)
  • Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
  • Fixed NeptuneLogger.log_text(step=None) (#7194)
  • Fixed importing torchtext batch (#6365, #6323, #6211)

Contributors

@akihironitta, @alessiobonfiglio, @amisev, @amogkam, @ananthsub, @ArvinZhuang, @ashleve, @asnorkin, @awaelchli, @BloodAxe, @bmahlbrand, @Borda, @borisdayma, @camruta, @carmocca, @ceshine, @dbonner, @dhkim0225, @EdwardJB, @EliaCereda, @EricCousineau-TRI, @ethanwharris, @FlorianMF, @hemildesai, @ifsheldon, @kaushikb11, @mauvilsa, @maxfrei750, @mesejo, @ramonemiliani93, @rohitgr7, @s-rog, @sadiqj, @scart97, @SeanNaren, @shuyingsunshine21, @SkafteNicki, @SpontaneousDuck, @stllfe, @tchaton, @THasthika, @vballoli

If we forgot someone because their commit email doesn't match their GitHub account, let us know :]
