Lightning-AI/pytorch-lightning 0.10.0 on GitHub

This release is a buffer in case 1.0 breaks compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release will follow within the next 24 hours.

Overview

The major changes are:

Results objects are deprecated (we hated them too haha)
This means dataflow and logging have been decoupled

To log:

def any_step(self, batch, batch_idx):
    # any_step stands for any step method (training_step, validation_step, ...)
    self.log('something', i_computed)

Separately, return whatever you want from methods:

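# Either return the loss directly ...
def training_step(self, batch, batch_idx):
    return loss

# ... or return a dict with the loss plus anything else you want
def training_step(self, batch, batch_idx):
    return {'loss': loss, 'whatever': [1, 'want']}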

Detail changes

Added

Added new Metrics API. (#3868, #3921)
Enable PyTorch 1.7 compatibility (#3541)
Added LightningModule.to_torchscript to support exporting as ScriptModule (#3258)
Added warning when dropping unpicklable hparams (#2874)
Added EMB similarity (#3349)
Added ModelCheckpoint.to_yaml method (#3048)
Allow ModelCheckpoint monitor to be None, meaning it will always save (#3630)
Disabled optimizers setup during testing (#3059)
Added support for datamodules to save and load checkpoints when training (#3563)
Added support for datamodule in learning rate finder (#3425)
Added gradient clip test for native AMP (#3754)
Added dist lib to enable syncing anything across devices (#3762)
Added broadcast to TPUBackend (#3814)
Added XLADeviceUtils class to check XLA device type (#3274)
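
The warning for unpicklable hparams (#2874) can be pictured with a small standard-library sketch. This is not Lightning's implementation: drop_unpicklable is a hypothetical helper, and it assumes hparams arrive as a plain dict and that "unpicklable" means pickle.dumps fails on the value.

```python
import pickle
import warnings


def drop_unpicklable(hparams):
    """Return a copy of `hparams` without values that cannot be pickled.

    Hypothetical helper for illustration only; not part of Lightning.
    """
    kept = {}
    for key, value in hparams.items():
        try:
            pickle.dumps(value)
        except (pickle.PicklingError, AttributeError, TypeError) as err:
            # Warn instead of failing so the remaining hparams are still saved.
            warnings.warn(f"Dropping unpicklable hparam {key!r}: {err}")
        else:
            kept[key] = value
    return kept


# A lambda cannot be pickled, so it is dropped with a warning.
hparams = {"lr": 1e-3, "layers": [64, 32], "activation": lambda x: x}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = drop_unpicklable(hparams)
```

After running this, clean holds only the picklable entries and caught holds one warning naming the dropped key.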

Changed

Refactored accelerator backends:
- moved TPU xxx_step to backend (#3118)
- refactored DDP backend forward (#3119)
- refactored GPU backend __step (#3120)
- refactored Horovod backend (#3121, #3122)
- remove obscure forward call in eval + CPU backend ___step (#3123)
- reduced all simplified forward (#3126)
- added hook base method (#3127)
- refactor eval loop to use hooks - use test_mode for if so we can split later (#3129)
- moved ___step_end hooks (#3130)
- training forward refactor (#3134)
- training AMP scaling refactor (#3135)
- eval step scaling factor (#3136)
- add eval loop object to streamline eval loop (#3138)
- refactored dataloader process hook (#3139)
- refactored inner eval loop (#3141)
- final inner eval loop hooks (#3154)
- clean up hooks in run_evaluation (#3156)
- clean up data reset (#3161)
- expand eval loop out (#3165)
- moved hooks around in eval loop (#3195)
- remove _evaluate fx (#3197)
- Trainer.fit hook clean up (#3198)
- DDPs train hooks (#3203)
- refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
- reduced accelerator selection (#3211)
- group prepare data hook (#3212)
- added data connector (#3285)
- modular is_overridden (#3290)
- adding Trainer.tune() (#3293)
- move run_pretrain_routine -> setup_training (#3294)
- move train outside of setup training (#3297)
- move prepare_data to data connector (#3307)
- moved accelerator router (#3309)
- train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
- duplicate data interface definition up into DataHooks class (#3344)
- inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
- all logging related calls in a connector (#3395)
- device parser (#3400, #3405)
- added model connector (#3407)
- moved eval loop logging to loggers (#3408)
- moved eval loop (#3412, #3408)
- trainer/separate argparse (#3421, #3428, #3432)
- move lr_finder (#3434)
- organize args (#3435, #3442, #3447, #3448, #3449, #3456)
- move specific accelerator code (#3457)
- group connectors (#3472)
- accelerator connector methods x/n (#3469, #3470, #3474)
- merge backends (#3476, #3477, #3478, #3480, #3482)
- apex plugin (#3502)
- precision plugins (#3504)
- Result - make monitor default to checkpoint_on to simplify (#3571)
- reference to the Trainer on the LightningDataModule (#3684)
- add .log to lightning module (#3686, #3699, #3701, #3704, #3715)
- enable tracking original metric when step and epoch are both true (#3685)
- deprecated results obj, added support for simpler comms (#3681)
- move backends back to individual files (#3712)
- fixes logging for eval steps (#3763)
- decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
- remove weight loading hack for ddp_cpu (#3808)
- separate torchelastic from DDP (#3810)
- separate SLURM from DDP (#3809)
- decoupled DDP2 (#3816)
- bug fix with logging val epoch end + monitor (#3812)
- decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
- callback system and init DDP (#3836)
- adding compute environments (#3837, #3842)
- epoch can now log independently (#3843)
- test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
- fixed init_slurm_connection causing hostname errors (#3856)
- moves init apex from LM to apex connector (#3923)
- moves sync bn to each backend (#3925)
- moves configure ddp to each backend (#3924)
Deprecation warning (#3844)
Changed LearningRateLogger to LearningRateMonitor (#3251)
Used fsspec instead of gfile for all IO (#3320)
- Swapped torch.load for fsspec load in DDP spawn backend (#3787)
- Swapped torch.load for fsspec load in cloud_io loading (#3692)
- Added support for to_disk() to use remote filepaths with fsspec (#3930)
- Updated model_checkpoint's to_yaml to use fsspec open (#3801)
- Fixed fsspec being inconsistent when doing fs.ls (#3805)
Refactor GPUStatsMonitor to improve training speed (#3257)
Changed IoU score behavior for classes absent in target and pred (#3098)
Changed IoU remove_bg bool to ignore_index optional int (#3098)
Changed defaults of save_top_k and save_last to None in ModelCheckpoint (#3680)
row_log_interval and log_save_interval are now based on training loop's global_step instead of epoch-internal batch index (#3667)
Silenced some warnings. verified ddp refactors (#3483)
Cleaning up stale logger tests (#3490)
Allow ModelCheckpoint monitor to be None (#3633)
Enable None model checkpoint default (#3669)
Skipped best_model_path if checkpoint_callback is None (#2962)
Used raise .. from .. to explicitly chain exceptions (#3750)
Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
Write predictions in LightningModule instead of EvalResult (#3882)
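
The "raise .. from .." item (#3750) refers to standard Python exception chaining. A minimal sketch of the pattern follows; load_config and ConfigError are hypothetical names used for illustration, not Lightning code.

```python
class ConfigError(Exception):
    """Hypothetical domain error used only for this illustration."""


def load_config(raw):
    """Parse `raw` as an integer, re-raising failures as ConfigError."""
    try:
        return int(raw)
    except ValueError as err:
        # `from err` stores the original exception on __cause__, so the
        # traceback reports it as the direct cause of the new error.
        raise ConfigError(f"invalid value: {raw!r}") from err


try:
    load_config("abc")
except ConfigError as exc:
    cause = exc.__cause__  # the original ValueError
```

Without the from clause, Python would still attach the original error as implicit context, but the traceback would describe it as happening "during handling" rather than as the direct cause.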

Deprecated

Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
Deprecate early_stop_callback Trainer argument (#3845)
Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
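
Basing these intervals on global_step rather than the epoch-internal batch index (#3667) changes which steps get logged whenever the epoch length is not a multiple of the interval. The sketch below is a plain-Python illustration that assumes a simple "counter % interval == 0" check. Lightning's actual condition may differ, and logged_steps is a hypothetical helper.

```python
def logged_steps(num_epochs, batches_per_epoch, interval, use_global_step):
    """Return the global steps at which logging would fire.

    Hypothetical helper; assumes logging fires when counter % interval == 0.
    """
    fired = []
    global_step = 0
    for _ in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            # Old behavior counts within the epoch; new behavior counts overall.
            counter = global_step if use_global_step else batch_idx
            if counter % interval == 0:
                fired.append(global_step)
            global_step += 1
    return fired


# 2 epochs of 5 batches each, logging every 3 steps.
per_epoch = logged_steps(2, 5, 3, use_global_step=False)  # [0, 3, 5, 8]
per_run = logged_steps(2, 5, 3, use_global_step=True)     # [0, 3, 6, 9]
```

With the batch-index counter the spacing resets at every epoch boundary, while the global_step counter keeps logging evenly spaced across the whole run.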

Removed

Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
- Added EmbeddingSimilarity metric (#3349, #3358)
- Added hooks to metric module interface (#2528)
- Added error when AUROC metric is used for multiclass problems (#3350)
- Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
- Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
- Fixed aggregation of metrics (#3517)
- Fixed Metric aggregation (#3321)
- Fixed RMSLE metric (#3188)
- Renamed reduction to class_reduction in classification metrics (#3322)
- Changed class_reduction similar to sklearn for classification metrics (#3322)
- Renaming of precision recall metric (#3308)

Fixed

Fixed on_train_batch_start hook to end epoch early (#3700)
Fixed num_sanity_val_steps so that it is clipped to limit_val_batches (#2917)
Fixed ONNX model save on GPU (#3145)
Fixed GpuUsageLogger to work on different platforms (#3008)
Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
Fixed batch_outputs with optimizer frequencies (#3229)
Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
Fixed Horovod distributed backend compatibility with native AMP (#3404)
Fixed batch size auto scaling exceeding the size of the dataset (#3271)
Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
Fixed gradient norm tracking for row_log_interval > 1 (#3489)
Fixed ModelCheckpoint name formatting (#3164)
Fixed auto-scale batch size (#3151)
Fixed example implementation of AutoEncoder (#3190)
Fixed invalid paths when remote logging with TensorBoard (#3236)
Fixed XLA compatibility by changing t() to transpose(), as XLA devices do not support .t() on 1-dim tensors (#3252)
Fixed (weights only) checkpoints loading without PL (#3287)
Fixed gather_all_tensors cross GPUs in DDP (#3319)
Fixed CometML save dir (#3419)
Fixed forward key metrics (#3467)
Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
Fixed ModelCheckpoint period to actually save every period epochs (#3630)
Fixed val_progress_bar total with num_sanity_val_steps (#3751)
Fixed Tuner dump: add current_epoch to dumped_params (#3261)
Fixed current_epoch and global_step properties mismatch between Trainer and LightningModule (#3785)
Fixed learning rate scheduler for optimizers with internal state (#3897)
Fixed tbptt_reduce_fx when non-floating tensors are logged (#3796)
Fixed model checkpoint frequency (#3852)
Fixed logging a non-tensor scalar with result breaking subsequent epoch aggregation (#3855)
Fixed TrainerEvaluationLoopMixin activates model.train() at the end (#3858)
Fixed overfit_batches when using with multiple val/test_dataloaders (#3857)
Fixed training_step so that it can return None (#3862)
Fixed init nan for checkpointing (#3863)
Fixed for load_from_checkpoint (#2776)
Fixed incorrect batch_sizes when Dataloader returns a dict with multiple tensors (#3668)
Fixed unexpected signature for validation_step (#3947)
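
For the confusion-matrix fix (#3465), NaNs appear when a row with no samples is divided by its zero sum. A minimal plain-Python sketch of the idea follows. It assumes row-wise normalization with rows indexed by target class, and normalize_rows is a hypothetical helper, not the library function.

```python
def normalize_rows(matrix):
    """Row-normalize a confusion matrix, mapping empty rows to zeros.

    Hypothetical helper; assumes rows correspond to target classes.
    """
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # 0 / 0 would be NaN in tensor arithmetic; use zeros instead.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([count / total for count in row])
    return normalized


cm = [
    [3, 1, 0],
    [0, 0, 0],  # class 1 never appears in the targets
    [1, 0, 3],
]
result = normalize_rows(cm)
```

Here the empty middle row becomes all zeros instead of NaNs, and the other rows each sum to 1.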

Contributors

@abrahambotros, @akihironitta, @ananthsub, @ananyahjha93, @awaelchli, @Borda, @c00k1ez, @carmocca, @f4hy, @GimmickNG, @jbschiratti, @justusschock, @LeeJZh, @lezwon, @Lucas-Steinmann, @maxjeblick, @monney, @mpariente, @nateraw, @nrupatunga, @patrickorlando, @PhilJd, @rohitgr7, @s-rog, @ShomyLiu, @SkafteNicki, @Sordie, @teddykoker, @tgaddair, @Vozf, @williamFalcon, @XDynames, @ydcjeff

If we forgot someone due to not matching the commit email with GitHub account, let us know :]
