Today we are excited to announce Lightning 1.3, containing highly anticipated new features including a new Lightning CLI, improved TPU support, integrations such as PyTorch profiler, new early stopping strategies, predict and validate trainer routines, and more.
[1.3.0] - 2021-05-06
Added
- Added support for the EarlyStopping callback to run at the end of the training epoch (#6944)
- Added synchronization points before and after setup hooks are run (#7202)
- Added a teardown hook to ClusterEnvironment (#6942)
- Added utils for metrics to scalar conversions (#7180)
- Added utils for NaN/Inf detection for gradients and parameters (#6834)
- Added more explicit exception message when trying to execute trainer.test() or trainer.validate() with fast_dev_run=True (#6667)
- Added LightningCLI class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299)
- Added gradient_clip_algorithm argument to Trainer for gradient clipping by value (#6123)
- Added a way to print to terminal without breaking up the progress bar (#5470)
- Added support to checkpoint after training steps in ModelCheckpoint callback (#6146)
- Added TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED} (#7173)
- Added Trainer.validate() method to perform one evaluation epoch over the validation set (#4948)
- Added LightningEnvironment for Lightning-specific DDP (#5915)
- Added teardown() hook to LightningDataModule (#4673)
- Added auto_insert_metric_name parameter to ModelCheckpoint (#6277)
- Added arg to self.log that enables users to give custom names when dealing with multiple dataloaders (#6274)
- Added teardown method to BaseProfiler to enable subclasses defining post-profiling steps outside of __del__ (#6370)
- Added setup method to BaseProfiler to enable subclasses defining pre-profiling steps for every process (#6633)
- Added no return warning to predict (#6139)
- Added Trainer.predict config validation (#6543)
- Added AbstractProfiler interface (#6621)
- Added support for including module names for forward in the autograd trace of PyTorchProfiler (#6349)
- Added support for the PyTorch 1.8.1 autograd profiler (#6618)
- Added outputs parameter to callback's on_validation_epoch_end & on_test_epoch_end hooks (#6120)
- Added configure_sharded_model hook (#6679)
- Added support for precision=64, enabling training with double precision (#6595)
- Added support for DDP communication hooks (#6736)
- Added artifact_location argument to MLFlowLogger which will be passed to the MlflowClient.create_experiment call (#6677)
- Added model parameter to precision plugins' clip_gradients signature (#6764, #7231)
- Added is_last_batch attribute to Trainer (#6825)
- Added LightningModule.lr_schedulers() for manual optimization (#6567)
- Added MpModelWrapper in TPU Spawn (#7045)
- Added max_time Trainer argument to limit training time (#6823)
- Added on_predict_{batch,epoch}_{start,end} hooks (#7141)
- Added new EarlyStopping parameters stopping_threshold and divergence_threshold (#6868)
- Added debug flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
- Added new UnrepeatedDistributedSampler and IndexBatchSamplerWrapper for tracking distributed predictions (#7215)
- Added trainer.predict(return_predictions=None|False|True) (#7215)
- Added BasePredictionWriter callback to implement prediction saving (#7127)
- Added trainer.tune(scale_batch_size_kwargs, lr_find_kwargs) arguments to configure the tuning algorithms (#7258)
- Added tpu_distributed check for TPU Spawn barrier (#7241)
- Added device updates to TPU Spawn for Pod training (#7243)
- Added warning when missing Callback and using resume_from_checkpoint (#7254)
- DeepSpeed single file saving (#6900)
- Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
- Added ignore param to save_hyperparameters (#6056)
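
The two new EarlyStopping parameters listed above (stopping_threshold and divergence_threshold) can be pictured with a small stand-alone sketch. This is a simplified illustration written in plain Python, not Lightning's implementation: the function name check_thresholds, its return shape, and the reason strings are hypothetical. It assumes the semantics that training stops as soon as the monitored value is better than stopping_threshold, or worse than divergence_threshold.

```python
from typing import Optional, Tuple


def check_thresholds(
    value: float,
    mode: str = "min",
    stopping_threshold: Optional[float] = None,
    divergence_threshold: Optional[float] = None,
) -> Tuple[bool, str]:
    """Return (should_stop, reason) for one monitored value.

    Hypothetical helper for illustration; it does not import Lightning.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")

    # In "min" mode a smaller value is better; in "max" mode a larger one is.
    def better(a: float, b: float) -> bool:
        return a < b if mode == "min" else a > b

    # Good enough: the monitored value has passed the stopping threshold.
    if stopping_threshold is not None and better(value, stopping_threshold):
        return True, "stopping threshold reached"
    # Diverged: the divergence threshold is better than the monitored value.
    if divergence_threshold is not None and better(divergence_threshold, value):
        return True, "divergence threshold crossed"
    return False, "keep training"
```

For a validation loss monitored in "min" mode with stopping_threshold=0.1 and divergence_threshold=3.0, a value of 0.05 stops training because the goal is met, a value of 5.0 stops it because the run has diverged, and a value of 0.5 lets training continue.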
Changed
- Changed LightningModule.truncated_bptt_steps to be property (#7323)
- Changed EarlyStopping callback from by default running EarlyStopping.on_validation_end if only training is run. Set check_on_train_epoch_end to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069)
- Renamed pytorch_lightning.callbacks.swa to pytorch_lightning.callbacks.stochastic_weight_avg (#6259)
- Refactor RunningStage and TrainerState usage (#4945, #7173)
  - Added RunningStage.SANITY_CHECKING
  - Added TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
  - Changed trainer.evaluating to return True if validating or testing
- Changed setup() and teardown() stage argument to take any of {fit,validate,test,predict} (#6386)
- Changed profilers to save separate report files per state and rank (#6621)
- The trainer no longer tries to save a checkpoint on exception or run callback's on_train_end functions (#6864)
- Changed PyTorchProfiler to use torch.autograd.profiler.record_function to record functions (#6349)
- Disabled lr_scheduler.step() in manual optimization (#6825)
- Changed warnings and recommendations for dataloaders in ddp_spawn (#6762)
- pl.seed_everything will now also set the seed on the DistributedSampler (#7024)
- Changed default setting for communication of multi-node training using DDPShardedPlugin (#6937)
- trainer.tune() now returns the tuning result (#7258)
- LightningModule.from_datasets() now accepts IterableDataset instances as training datasets (#7503)
- Changed resume_from_checkpoint warning to an error when the checkpoint file does not exist (#7075)
- Automatically set sync_batchnorm for training_type_plugin (#6536)
- Allowed training type plugin to delay optimizer creation (#6331)
- Removed ModelSummary validation from train loop on_trainer_init (#6610)
- Moved save_function to accelerator (#6689)
- Updated DeepSpeed ZeRO (#6546, #6752, #6142, #6321)
- Improved verbose logging for EarlyStopping callback (#6811)
- Run ddp_spawn dataloader checks on Windows (#6930)
- Updated mlflow with using resolve_tags (#6746)
- Moved save_hyperparameters to its own function (#7119)
- Replaced _DataModuleWrapper with __new__ (#7289)
- Reset current_fx properties on lightning module in teardown (#7247)
- Auto-set DataLoader.worker_init_fn with seed_everything (#6960)
- Remove model.trainer call inside of dataloading mixin (#7317)
- Split profilers module (#6261)
- Ensure accelerator is valid if running interactively (#5970)
- Disabled batch transfer in DP mode (#6098)
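
To make the widened stage argument of setup() and teardown() (#6386) more concrete, here is a minimal sketch of how a data-module-like class might branch on the four stage values. It assumes nothing beyond the stage names given in the list above: the class ToyDataModule, its splits attribute, the index ranges, and the treatment of stage=None as "prepare everything" are all hypothetical, and the class deliberately does not import or subclass anything from Lightning.

```python
from typing import Dict, List, Optional

VALID_STAGES = ("fit", "validate", "test", "predict")


class ToyDataModule:
    """Hypothetical stand-in for a data module; plain Python only."""

    def __init__(self) -> None:
        self.splits: Dict[str, List[int]] = {}

    def setup(self, stage: Optional[str] = None) -> None:
        if stage is not None and stage not in VALID_STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        # Assumption for this sketch: stage=None prepares every split.
        if stage in ("fit", None):
            self.splits["train"] = list(range(0, 80))
        if stage in ("fit", "validate", None):
            # Fitting also validates, so both stages need the val split.
            self.splits["val"] = list(range(80, 90))
        if stage in ("test", None):
            self.splits["test"] = list(range(90, 100))
        if stage in ("predict", None):
            self.splits["predict"] = list(range(100, 110))

    def teardown(self, stage: Optional[str] = None) -> None:
        # Release whatever setup() created, regardless of stage.
        self.splits.clear()
```

With this layout, calling setup("validate") prepares only the validation split, which is the kind of per-routine work the separate validate and predict stages make possible.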
Deprecated
- Deprecated outputs in both LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#7339)
- Deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#7323)
- Deprecated LightningModule.grad_norm in favor of pytorch_lightning.utilities.grads.grad_norm (#7292)
- Deprecated the save_function property from the ModelCheckpoint callback (#7201)
- Deprecated LightningModule.write_predictions and LightningModule.write_predictions_dict (#7066)
- Deprecated TrainerLoggingMixin in favor of a separate utilities module for metric handling (#7180)
- Deprecated TrainerTrainingTricksMixin in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
- period has been deprecated in favor of every_n_val_epochs in the ModelCheckpoint callback (#6146)
- Deprecated trainer.running_sanity_check in favor of trainer.sanity_checking (#4945)
- Deprecated Profiler(output_filename) in favor of dirpath and filename (#6621)
- Deprecated PytorchProfiler(profiled_functions) in favor of record_functions (#6349)
- Deprecated @auto_move_data in favor of trainer.predict (#6993)
- Deprecated Callback.on_load_checkpoint(checkpoint) in favor of Callback.on_load_checkpoint(trainer, pl_module, checkpoint) (#7253)
- Deprecated metrics in favor of torchmetrics (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
- Deprecated the LightningModule.datamodule getter and setter methods; access them through Trainer.datamodule instead (#7168)
- Deprecated the use of Trainer(gpus="i") (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
Removed
- Removed the exp_save_path property from the LightningModule (#7266)
- Removed training loop explicitly calling EarlyStopping.on_validation_end if no validation is run (#7069)
- Removed automatic_optimization as a property from the training loop in favor of LightningModule.automatic_optimization (#7130)
- Removed evaluation loop legacy returns for *_epoch_end hooks (#6973)
- Removed support for passing a bool value to profiler argument of Trainer (#6164)
- Removed no return warning from val/test step (#6139)
- Removed passing a ModelCheckpoint instance to Trainer(checkpoint_callback) (#6166)
- Removed deprecated Trainer argument enable_pl_optimizer and automatic_optimization (#6163)
- Removed deprecated metrics (#6161)
  - from pytorch_lightning.metrics.functional.classification removed to_onehot, to_categorical, get_num_classes, roc, multiclass_roc, average_precision, precision_recall_curve, multiclass_precision_recall_curve
  - from pytorch_lightning.metrics.functional.reduction removed reduce, class_reduce
- Removed deprecated ModelCheckpoint arguments prefix, mode="auto" (#6162)
- Removed mode='auto' from EarlyStopping (#6167)
- Removed epoch and step arguments from ModelCheckpoint.format_checkpoint_name(), these are now included in the metrics argument (#7344)
- Removed legacy references for magic keys in the Result object (#6016)
- Removed deprecated LightningModule hparams setter (#6207)
- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the "log"/"progress_bar" magic keys. Use self.log instead (#6734)
- Removed trainer.fit() return value of 1. It has no return now (#7237)
- Removed logger_connector legacy code (#6733)
- Removed unused mixin attributes (#6487)
Fixed
- Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
- Fixed attaching train and validation dataloaders when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 (#7207)
- Added a barrier in the accelerator teardown to synchronize processes before execution finishes (#6814)
- Fixed multi-node DDP sub-process launch by using local_rank instead of global_rank for main process assertion (#7061)
- Fixed incorrect removal of WORLD_SIZE environment variable in DDP training when launching with torch distributed/torchelastic (#6942)
- Made the Plugin.reduce method more consistent across all Plugins to reflect a mean-reduction by default (#6011)
- Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
- Do not print top-k verbose log with ModelCheckpoint(monitor=None) (#6109)
- Fixed ModelCheckpoint(save_top_k=0, save_last=True) not saving the last checkpoint (#6136)
- Fixed .teardown(stage='fit') and .on_fit_{start,end}() getting called during trainer.test (#6386)
- Fixed LightningModule all_gather on cpu tensors (#6416)
- Fixed torch distributed not available in setup hook for DDP (#6506)
- Fixed trainer.tuner.{lr_find,scale_batch_size} not setting the Trainer state properly (#7258)
- Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
- Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
- Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_epoch_end hook (#6969)
- Fixed a bug where the outputs passed to train_batch_end would be listed even when using a single optimizer and no truncated backprop through time steps (#6969)
- Fixed bug for trainer error handling which would cause hang for distributed training (#6864)
- Fixed self.device not returning the correct device in replicas of data-parallel (#6414)
- Fixed lr_find trying beyond num_training steps and suggesting a too high learning rate (#7076)
- Fixed logger creating incorrect version folder in DDP with repeated Trainer.fit calls (#7077)
- Fixed metric objects passed directly to self.log not being reset correctly (#7055)
- Fixed CombinedLoader in distributed settings for validation / testing (#7102)
- Fixed the save_dir in WandbLogger when the run was initiated externally (#7106)
- Fixed num_sanity_val_steps affecting reproducibility of training data shuffling (#7014)
- Fixed resetting device after fitting/evaluating/predicting (#7188)
- Fixed bug where trainer.tuner.scale_batch_size(max_trials=0) would not return the correct batch size result (#7262)
- Fixed metrics not being properly logged with precision=16 and manual_optimization (#7228)
- Fixed BaseFinetuning properly reloading optimizer_states when using resume_from_checkpoint (#6891)
- Fixed parameters_to_ignore not properly set to DDPWrapper (#7239)
- Fixed parsing of fast_dev_run=True with the built-in ArgumentParser (#7240)
- Fixed handling an IterableDataset that fails to produce a batch at the beginning of an epoch (#7294)
- Fixed LightningModule.save_hyperparameters() when attempting to save an empty container (#7268)
- Fixed apex not properly instantiated when running with ddp (#7274)
- Fixed optimizer state not moved to GPU (#7277)
- Fixed custom init args for WandbLogger (#6989)
- Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
- Fixed examples (#6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398)
- Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
- Updated logic for checking TPUs availability (#6767)
- Resolve TPU miss rendezvous (#6781)
- Fixed auto-scaling mode when calling tune method on trainer (#7321)
- Fixed unfreezing when finetuning complex models (#6880)
- Ensure we set the eval/train flag correctly on accelerator model (#6877)
- Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
- Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
- Fixed gradient_clip_algorithm having no effect (#6928)
- Fixed CUDA OOM detection and handling (#6934)
- Fixed unfreeze_and_add_param_group expecting modules rather than module (#6822)
- Fixed DDP + SyncBN when moving to device (#6838)
- Fixed missing arguments in lr_find call (#6784)
- Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
- Fixed NeptuneLogger.log_text(step=None) (#7194)
- Fixed importing torchtext batch (#6365, #6323, #6211)
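
Several entries in this release concern gradient_clip_algorithm, which adds clipping by value next to clipping by norm (#6123, #6928). The difference between the two strategies is easy to see on a plain list of numbers. The sketch below is a generic illustration of the two techniques, not code from Lightning or PyTorch: the function names and the use of Python lists instead of tensors are assumptions made for this example.

```python
import math
from typing import List


def clip_by_value(grads: List[float], clip_val: float) -> List[float]:
    """Clamp every gradient element into [-clip_val, clip_val]."""
    return [max(-clip_val, min(clip_val, g)) for g in grads]


def clip_by_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already small enough: return an unchanged copy.
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Clipping by value treats each element independently, so it can change the direction of the gradient vector. Clipping by norm scales all elements by the same factor, so the direction is preserved and only the length shrinks.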
Contributors
@akihironitta, @alessiobonfiglio, @amisev, @amogkam, @ananthsub, @ArvinZhuang, @ashleve, @asnorkin, @awaelchli, @BloodAxe, @bmahlbrand, @Borda, @borisdayma, @camruta, @carmocca, @ceshine, @dbonner, @dhkim0225, @EdwardJB, @EliaCereda, @EricCousineau-TRI, @ethanwharris, @FlorianMF, @hemildesai, @ifsheldon, @kaushikb11, @mauvilsa, @maxfrei750, @mesejo, @ramonemiliani93, @rohitgr7, @s-rog, @sadiqj, @scart97, @SeanNaren, @shuyingsunshine21, @SkafteNicki, @SpontaneousDuck, @stllfe, @tchaton, @THasthika, @vballoli
If we forgot someone due to not matching commit email with GitHub account, let us know :]