Great writeup, thanks for sharing.
My impression is that the backward pass is in fact the synchronization barrier, since that is where the workers exchange gradients.
When your "main" worker is busy doing stuff like running the validation or writing checkpoints, for example at the end of the nth epoch, the other workers already start the next epoch.
The other workers will get only as far as the backward pass of one mini-batch and then get "stuck", since they have to wait for the "main" worker to exchange gradients with them.
They will wait up to the timeout you set on the process group, so I would set it to less than 5 minutes.
If something takes longer than that, probably one of the processes has really failed.
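For what it's worth, here is a minimal sketch of how I would set that timeout. It assumes PyTorch's `torch.distributed` and a launcher such as `torchrun` that sets the rank and world-size environment variables. The helper name `init_distributed` and the 4-minute value are only illustrative, not anything from the writeup.

```python
import datetime

# Assumed value: anything below the 5-minute ceiling suggested above.
PROCESS_GROUP_TIMEOUT = datetime.timedelta(minutes=4)


def init_distributed(backend: str = "nccl") -> None:
    """Hypothetical helper: create the default process group with a short timeout.

    Assumes the launcher (e.g. torchrun) has already set RANK, WORLD_SIZE,
    MASTER_ADDR and MASTER_PORT in the environment.
    """
    # Imported here so this module can be loaded on a machine without torch.
    import torch.distributed as dist

    # Collectives that wait longer than `timeout` raise an error
    # instead of hanging forever.
    dist.init_process_group(backend=backend, timeout=PROCESS_GROUP_TIMEOUT)
```

Each worker would call `init_distributed()` once at startup, before wrapping the model. If validation or checkpointing on the "main" worker can take longer than the timeout, the waiting workers will error out, so the value has to cover the longest legitimate pause.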
This is mainly based on my experience running distributed training on multiple spot instances, so there is actually a good chance that one of the workers did fail...