Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data-Parallel and more!
https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9
[1.4.0] - 2021-07-27
Added
- Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
- Added support for named parameter groups in `LearningRateMonitor` (#7987)
- Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (#7935)
- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (#7617)
- Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
- Added LightningCLI support for config files on object stores (#7521)
- Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (#7704)
- Added support for checkpointing based on a provided time interval during training (#7515)
- Progress tracking
- Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (#7431)
- Added argument `trainer.predict(ckpt_path)` (#7430)
- Added `clip_grad_by_value` support for TPUs (#7025)
- Added support for passing any class to `is_overridden` (#7918)
- Added `sub_dir` parameter to `TensorBoardLogger` (#6195)
- Added correct `dataloader_idx` to batch transfer hooks (#6241)
- Added `include_none=bool` argument to `apply_to_collection` (#7769)
- Added `apply_to_collections` to apply a function to two zipped collections (#7769)
- Added `ddp_fully_sharded` support (#7487)
- Added `should_rank_save_checkpoint` property to Training Plugins (#7684)
- Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (#7873)
- Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (#7741)
- Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (#8059)
- Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
- Added trainer stage hooks for Training Plugins and Accelerators (#7864)
- Added the `on_before_optimizer_step` hook (#8048)
- Added IPU Accelerator (#7867)
- Fault-tolerant training
    - Added `{,load_}state_dict` to `ResultCollection` (#7948)
    - Added `{,load_}state_dict` to `Loops` (#8197)
    - Set `Loop.restarting=False` at the end of the first iteration (#8362)
    - Save the loops state with the checkpoint (opt-in) (#8362)
    - Save a checkpoint to restore the state on exception (opt-in) (#8362)
    - Added `state_dict` and `load_state_dict` utilities for `CombinedLoader` + utilities for dataloader (#8364)
- Added `rank_zero_only` to `LightningModule.log` function (#7966)
- Added `metric_attribute` to `LightningModule.log` function (#7966)
- Added a warning if `Trainer(log_every_n_steps)` is a value too high for the training dataloader (#7734)
- Added LightningCLI support for argument links applied on instantiation (#7895)
- Added LightningCLI support for configurable callbacks that should always be present (#7964)
- Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
- Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (#7642)
- Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (#7992)
- Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using `accumulate_grad_batches` (#7980)
- Added `logger` boolean flag to `save_hyperparameters` (#7960)
- Added support for calling scripts using the module syntax (`python -m package.script`) (#8073)
- Added support for optimizers and learning rate schedulers to `LightningCLI` (#8093)
- Added XLA Profiler (#8014)
- Added `PrecisionPlugin.{pre,post}_backward` (#8328)
- Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (#7831)
- Added `max_depth` parameter in `ModelSummary` (#8062)
- Added `XLAStatsMonitor` callback (#8235)
- Added `restore` function and `restarting` attribute to base `Loop` (#8247)
- Added `FastForwardSampler` and `CaptureIterableDataset` (#8307)
- Added support for `save_hyperparameters` in `LightningDataModule` (#3792)
- Added the `ModelCheckpoint(save_on_train_epoch_end)` to choose when to run the saving logic (#8389)
- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (#5102)
- Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (#7808)
- Added `tpu_spawn_debug` to plugin registry (#7933)
- Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (#7480)
- Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (#8464)
- Added experimental support for loop specialization (#8226)
- Added support for `devices` flag to Trainer (#8440)
- Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (#8472)
- Added support for providing callables to the Lightning CLI instead of types (#8400)
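To illustrate the `dataclass` support and the `include_none` argument listed above for `apply_to_collection`, here is a minimal standard-library sketch of the idea. It is not the library's implementation: the function name `apply_to_collection_sketch` and the `Batch` dataclass are hypothetical, and the real utility handles more container types and arguments.

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass, fields, is_dataclass, replace


def apply_to_collection_sketch(data, dtype, function, include_none=True):
    """Recursively apply `function` to every element of type `dtype` (illustrative only)."""
    if isinstance(data, dtype):
        return function(data)
    if isinstance(data, Mapping):
        out = {}
        for key, value in data.items():
            result = apply_to_collection_sketch(value, dtype, function, include_none)
            # With include_none=False, entries that end up as None are dropped.
            if include_none or result is not None:
                out[key] = result
        return type(data)(out)
    if isinstance(data, Sequence) and not isinstance(data, (str, bytes)):
        out = []
        for value in data:
            result = apply_to_collection_sketch(value, dtype, function, include_none)
            if include_none or result is not None:
                out.append(result)
        return type(data)(out)
    if is_dataclass(data) and not isinstance(data, type):
        # Dataclass instances are rebuilt field by field (init fields only).
        updates = {
            f.name: apply_to_collection_sketch(getattr(data, f.name), dtype, function, include_none)
            for f in fields(data)
            if f.init
        }
        return replace(data, **updates)
    # Anything else is returned unchanged.
    return data


@dataclass
class Batch:  # hypothetical container used only for this example
    inputs: list
    label: int
    name: str


doubled = apply_to_collection_sketch(Batch([1, 2], 3, "a"), int, lambda x: x * 2)
kept = apply_to_collection_sketch({"a": 1, "b": None}, int, lambda x: x + 1)
dropped = apply_to_collection_sketch({"a": 1, "b": None}, int, lambda x: x + 1, include_none=False)
```

In this sketch `doubled` is a new `Batch` whose integer fields are doubled while the string field is untouched, and `dropped` omits the `"b"` key that `kept` retains.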
Changed
- Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
- Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values (#7539)
- Log epoch metrics before the `on_evaluation_end` hook (#7272)
- Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (#7874)
- Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
- Changed `metrics_to_scalars` to work with any collection or value (#7888)
- Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (#7025)
- Validation is now always run inside the training epoch scope (#7357)
- `ModelCheckpoint` now runs at the end of the training epoch by default (#8389)
- `EarlyStopping` now runs at the end of the training epoch by default (#8286)
- Refactored Loops
    - Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to TrainLoop (#7437)
    - Refactored result handling in training loop (#7506)
    - Moved attributes `hiddens` and `split_idx` to TrainLoop (#7507)
    - Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    - Simplified "should run validation" logic (#7682)
    - Simplified logic for updating the learning rate for schedulers (#7682)
    - Removed the `on_epoch` guard from the "should stop" validation check (#7701)
    - Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (#7871, #8077)
    - Removed `pytorch_lightning/trainer/training_loop.py` (#7985)
    - Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (#7990, #8077)
    - Removed `pytorch_lightning/trainer/evaluation_loop.py` (#8056)
    - Restricted public access to several internal functions (#8024)
    - Refactored trainer `_run_*` functions and separate evaluation loops (#8065)
    - Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (#7700, #8077)
    - Removed `pytorch_lightning/trainer/predict_loop.py` (#8094)
    - Moved result teardown to the loops (#8245)
    - Improve `Loop` API to better handle children `state_dict` and `progress` (#8334)
- Refactored logging
    - Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (#7736)
    - Dramatically simplify the `LoggerConnector` (#7882)
    - `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (#7882)
    - Completely overhaul the `Result` object in favor of `ResultMetric` (#7882)
    - Improve epoch-level reduction time and overall memory usage (#7882)
    - Allow passing `self.log(batch_size=...)` (#7891)
    - Each of the training loops now keeps its own results collection (#7891)
    - Remove `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (#7909)
    - Remove `MetricsHolder` (#7909)
- Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (#7421)
- Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (#7351)
- Raised `ValueError` when a `None` value is `self.log`-ed (#7771)
- Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from `Trainer` setting (#7026)
- Default `seed_everything(workers=True)` in the `LightningCLI` (#7504)
- Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model's `state_dict()` (#7474)
- `MLflowLogger` now uses the env variable `MLFLOW_TRACKING_URI` as default tracking URI (#7457)
- Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (#5043)
- Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (#6231)
- MLFlowLogger now accepts `run_name` as a constructor argument (#7622)
- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (#7579)
- `Trainer.fit` now raises an error when using manual optimization with unsupported features such as `gradient_clip_val` or `accumulate_grad_batches` (#7788)
- Accelerator hooks are called regardless of whether `LightningModule` overrides the same hooks (#7826)
- Moved profilers to their own file (#7822)
- The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
- The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
- The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments (#8328)
- The `PrecisionPlugin.backward` hook no longer returns a value (#8328)
- The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument (#8328)
- Added the `on_before_backward` hook (#7865)
- `LightningCLI` now aborts with a clearer message if config already exists and disables save config during `fast_dev_run` (#7963)
- Saved the `LightningCLI` config on `setup` and only on the main process (#8017)
- Dropped the `LightningCLI` `ArgumentParser` when pickling (#8017)
- Skip `broadcast` if distributed is not initialized for the spawn plugins (#8017)
- `Trainer(resume_from_checkpoint=...)` now restores the model directly after `LightningModule.setup()`, which is before `LightningModule.configure_sharded_model()` (#7652)
- Moved `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312)
- Used XLA utility API to move data to CPU (Single TPU core) (#8078)
- Improved error messages in `replace_sampler` when the `DataLoader` attributes are not included in the signature or the signature is missing optional arguments (#8519)
- Moved `DeviceDtypeModuleMixin` and `HyperparametersMixin` mixin to `core` (#8396)
- Return the `default_root_dir` as the `log_dir` when the logger is a `LoggerCollection` (#8187)
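The change to `on_after_backward` listed above (it now fires on accumulating iterations, while `on_before_optimizer_step` fires only when the optimizer actually steps) can be pictured with a toy simulation. This is a simplified sketch written for illustration, not Lightning's loop code: the function `simulate_hooks` is hypothetical, and it assumes one optimizer step every `accumulate_grad_batches` batches with no special handling of a trailing partial window.

```python
def simulate_hooks(num_batches, accumulate_grad_batches):
    """Return the ordered (hook_name, batch_idx) calls of a toy training epoch."""
    calls = []
    for batch_idx in range(num_batches):
        # A backward pass runs on every batch, including accumulating ones,
        # so on_after_backward is recorded each time.
        calls.append(("on_after_backward", batch_idx))
        # Assumption of this sketch: the optimizer steps every N-th batch.
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            calls.append(("on_before_optimizer_step", batch_idx))
            calls.append(("optimizer_step", batch_idx))
    return calls


calls = simulate_hooks(num_batches=4, accumulate_grad_batches=2)
backward_batches = [i for name, i in calls if name == "on_after_backward"]
step_batches = [i for name, i in calls if name == "on_before_optimizer_step"]
```

Under these assumptions `on_after_backward` is recorded for all four batches, whereas `on_before_optimizer_step` is recorded only for the second and fourth, which is why the latter hook is the one to use for logic that should run once per optimizer step.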
Deprecated
- Deprecated `LightningModule.loaded_optimizer_states_dict` (#8229)
- Standardized the dataloaders arguments of `trainer.{fit,validate,test,tune}` (#7431)
- Deprecated `DataModule` properties: `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test`, `has_teardown_predict` (#7657)
- Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` (#7422)
- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` (#7026)
- Deprecated `self.log(sync_dist_op)` in favor of `self.log(reduce_fx)` (#7891)
- Deprecated `is_overridden(model=...)` in favor of `is_overridden(instance=...)` (#7918)
- Deprecated automatically detaching returned extras with grads (#7994)
- Deprecated default value of `monitor` argument in EarlyStopping callback to enforce `monitor` as a required argument (#7907)
- Deprecated importing `rank_zero_{warn,deprecation}` directly from `pytorch_lightning.utilities.distributed` (#8085)
- Deprecated the use of `CheckpointConnector.hpc_load()` in favor of `CheckpointConnector.restore()` (#7652)
- Deprecated `ModelCheckpoint(every_n_val_epochs)` in favor of `ModelCheckpoint(every_n_epochs)` (#8383)
- Deprecated `DDPPlugin.task_idx` in favor of `DDPPlugin.local_rank` (#8203)
- Deprecated the `Trainer.train_loop` property in favor of `Trainer.fit_loop` (#8025)
- Deprecated the `Trainer.disable_validation` property in favor of `not Trainer.enable_validation` (#8291)
- Deprecated `mode` parameter in `ModelSummary` in favor of `max_depth` (#8062)
- Deprecated `reload_dataloaders_every_epoch` argument of `Trainer` in favor of `reload_dataloaders_every_n_epochs` (#5043)
- Deprecated `distributed_backend` argument for `Trainer` (#8575)
Removed
- Dropped official support/testing for PyTorch <1.6 (#8288)
- Removed `ProfilerConnector` (#7654)
- Pruned deprecated classification metrics from `pytorch_lightning.metrics.functional.classification` (#7499)
- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` (#7510)
- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` (#7502)
- Removed support for automatically monitoring the `val_loss` key with `ModelCheckpoint`. Pass your `monitor` of choice to the `ModelCheckpoint` instance instead (#8293)
- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please open a discussion explaining your use case if you relied on these (#7644)
- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` (#7503)
- Removed `RPCPlugin` and `RPCSequentialPlugin`. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
- Removed deprecated trainer attributes - `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` (#7501)
- Removed deprecated `optimizer` argument in `LightningModule.manual_backward()`; toggling optimizers in manual optimization should be done using `LightningModule.{un}toggle_optimizer()` (#8287)
- Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
- Removed environment variable `PL_EXP_VERSION` from DDP subprocesses (#7403)
Fixed
- Fixed the `GPUStatsMonitor` callbacks to use the correct GPU IDs if `CUDA_VISIBLE_DEVICES` is set (#8260)
- Fixed `lr_scheduler` checkpointed state by calling `update_lr_schedulers` before saving checkpoints (#7877)
- Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
- Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
- Fixed `None` loss keys getting added in `training_epoch_end` when using manual optimization and not returning a loss (#7772)
- Fixed a bug where `precision=64` with `accelerator='ddp_spawn'` would throw a pickle error (#6924)
- Do not override the existing `epoch` value in `logged_metrics` when already logged by the user (#7982)
- Support for manual optimization with DeepSpeed (#7970)
- Fixed `dataloader_idx` argument value when predicting with only one `DataLoader` (#7941)
- Fixed passing the `stage` argument of `Callback.{setup,teardown}` as a keyword (#7973)
- Fixed metrics generated during `validation sanity checking` so that they are cleaned on end (#8171)
- Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged (#8174)
- Fixed a bug where calling `log` with a `Metric` instance would raise an error if it was a nested attribute of the model (#8181)
- Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real (#8208)
- Fixed `is_overridden` returning true for wrapped functions with no changes (#8296)
- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states (#8145)
- Fixed `self.optimizers()` not returning a single optimizer if it had been wrapped (#8326)
- Fixed the `on_after_backward` hook not getting called when using manual optimization and no plugins (#8328)
- Fixed the `LightningModule.backward` hook only getting called with the `apex` plugin when using manual optimization (#8328)
- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks (#7378)
- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator (#6208)
- Fixed missing call to `LightningModule.untoggle_optimizer` in training loop when running gradient accumulation with multiple optimizers (#8284)
- Fixed hash of LightningEnum to work with value instead of name (#8421)
- Fixed a bug where an extra checkpoint was saved at the end of training if the `val_check_interval` did not align with the number of training batches (#7724)
- Fixed `move_data_to_device` to return the batch if the object's `to` function didn't return `self` (#8433)
- Fixed progress bar updates for Pod Training (#8258)
- Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (#8442)
- Fixed memory leaks on GPU by moving `optimizer_states`, `ResultCollection.extra`, `ResultMetric` attributes, and `LoggerConnector` metrics to `cpu`. Also, delete the DDP wrapper on `teardown` (#8490)
- Fixed `SWA` callback using LightningModule `prevent_trainer_and_dataloaders_deepcopy` to avoid OOM (#8472)
- Fixed `ModelPruning` callback `on_save_checkpoint` to avoid making a `deepcopy` potentially leading to OOM (#8472)
- Fixed the sampler replacement logic for `DataLoader`s which do not define all `DataLoader` attributes as `__init__` parameters (#8519)
- Fixed DeepSpeed Windows support (#8488)
- Fixed DeepSpeed not properly setting the trainer `lr_schedulers` attribute (#8527)
- Fixed experiment version and log-dir divergence in DDP when using multiple `Trainer` instances in sequence (#7403)
- Enabled manual optimization for TPUs (#8458)
- Fixed `accumulate_grad_batches` not being recomputed during model reload (#5334)
- Fixed a `TypeError` when wrapping optimizers in the `HorovodPlugin` and running `Trainer.test` (#7840)
- Fixed `BackboneFinetuning` restoration (#8501)
- Fixed `lr_scheduler` with metric (e.g. `torch.optim.lr_scheduler.ReduceLROnPlateau`) when using `automatic_optimization = False` (#7643)
- Fixed `DeepSpeed` breaking with no schedulers (#8580)
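The LightningEnum hashing fix listed above (#8421) concerns a general Python rule: a class that overrides `__eq__` must define a matching `__hash__`, otherwise objects that compare equal can land in different dictionary buckets. The following sketch shows the idea with a hypothetical `BackendSketch` enum; it is an illustration of hashing by value under the assumption of case-insensitive comparison, not the library's source.

```python
from enum import Enum


class BackendSketch(str, Enum):
    """Hypothetical string enum that compares case-insensitively by value."""

    DDP = "ddp"
    DDP_SPAWN = "ddp_spawn"

    def __eq__(self, other):
        # Compare against another enum member's value or against a plain string.
        other = other.value if isinstance(other, Enum) else str(other)
        return self.value.lower() == other.lower()

    def __hash__(self):
        # Hash the value (not the member name) so hashing agrees with __eq__.
        return hash(self.value.lower())


registry = {BackendSketch.DDP_SPAWN: "spawn plugin"}
found = registry["ddp_spawn"]  # lookup by the plain string value succeeds
```

Had `__hash__` used the member name (`"DDP_SPAWN"`) instead, `BackendSketch.DDP_SPAWN == "ddp_spawn"` would still be true, but the dictionary lookup by `"ddp_spawn"` would generally miss.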
Contributors
@00sapo @AffineParameter @ajtritt @akihironitta @ananthsub @aniketmaurya @aslisabanci @awaelchli @bamblebam @Borda @borisdayma @carmocca @dalek-who @DavidMChan @davors72 @dcfidalgo @ddrevicky @deepsource-autofix @djthegr8 @edenlightning @edgarriba @eladsegal @ethanwharris @eugeneh101 @fepegar @gaoteng-git @gtauzin @i-aki-y @janhenriklambrechts @jiwidi @justusschock @karthikrangasai @kaushikb11 @loic-beheshti @Lucklyric @ManuelPalermo @mauvilsa @maxoppelt @neggert @nikvaessen @nisheethlahoti @pre-commit-ci @rohitgr7 @ruotianluo @satishjasthi @SeanNaren @shirayu @shuyingsunshine21 @sid-sundrani @Sileadim @simran2905 @stancld @t-vi @tchaton @theblackfly @theodumont @tilman151 @tomy0000000 @tshu-w @vatch123 @WrRan @yifuwang
If we forgot someone, let us know :]