This is part II of a two-part series, describing our solution for running distributed training on spot instances using TorchElastic and Kubernetes.
Part I introduced our overall technology selection, design principles and benchmarks.
In Part II, we will walk you through the process of creating a simplified version of the training environment on AWS.
Note: This post assumes you’ve read Part I of this series, and that you have a decent level of experience working with Kubernetes and AWS infrastructure.
Deep Learning development is becoming more and more about minimizing the time from idea to trained model.
To shorten this lead time, researchers need access to a training environment that supports running multiple experiments concurrently, each utilizing several GPUs.
Until recently, training environments with tens or hundreds of GPUs were the sole property of the largest and richest technology companies. However, recent advances in the open-source community have helped close this gap, making this technology accessible even for small startups.
In this series, we will share our experience in building out a scalable training environment using TorchElastic and Kubernetes, utilizing…
Great writeup, thanks for sharing.
I am under the impression that the backwards pass is in fact the synchronization barrier.
When your "main" worker is busy with things like running validation or writing checkpoints, for example at the end of the nth epoch, the other workers already start the next epoch.
The other workers will complete exactly one mini-batch, and then get "stuck" since they will wait for the "main" worker to exchange gradients with them.
They will wait for the timeout you set on the process group, so I would set it to under 5 minutes.
If something takes longer than that, probably one of the processes has really failed.
This is mainly based on my experience running distributed training on multiple spot instances, so there is actually a good chance one of the workers did fail...
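To make the timeout suggestion concrete, here is a minimal sketch. The 4-minute value is illustrative, not prescriptive; the `torch.distributed.init_process_group` call is shown in a comment because it only makes sense inside an actual multi-worker launch:

```python
from datetime import timedelta

# Illustrative sketch: pick a process-group timeout under 5 minutes,
# long enough to cover normal "main" worker pauses (validation,
# checkpointing at epoch end), short enough to detect real failures fast.
PG_TIMEOUT = timedelta(minutes=4)

# In a real multi-worker setup you would pass it when creating the group:
#   torch.distributed.init_process_group(backend="nccl", timeout=PG_TIMEOUT)
# Workers blocked on gradient exchange in the backward pass will then
# raise after PG_TIMEOUT instead of hanging indefinitely.
print(PG_TIMEOUT.total_seconds())
```

If a collective call exceeds this timeout, the stuck workers error out, which is exactly what lets an elastic setup (e.g. TorchElastic) tear down and restart the job instead of hanging forever.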
Almost every leader you talk to will quote building trust as one of the most important things for creating a highly functional team.
However, one of the most common mistakes leaders make on a weekly basis is not providing their team with good “whys”.
When I say “why”, I am not talking about the company’s mission to transform X or make the world a better place.
“Providing a why” in this context simply means frequent communication around why decisions are being made, which is at least:
2. Aligned with a shared value/belief which is reinforced often
Deep learning experimentation speed is important for delivering high-quality solutions on time.
The data loading path, i.e. getting training examples from storage into the GPU, is an often overlooked area, despite having a major impact on experimentation speed.
In this post, I will provide an introduction to components of the data loading path, how to formalize their performance goals, and how to design a solution which meets these performance goals.
GPUs are expensive to buy or lease, so during training you want to keep them at 100% utilization, crunching those matrices and updating weights.
GPUs work most efficiently on mini-batches…
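One way to formalize the performance goal mentioned above is as a throughput bound on the loading path. A back-of-envelope sketch, with purely hypothetical numbers:

```python
# Hypothetical back-of-envelope: how fast must the data loading path be
# so the GPU never sits idle waiting for data?
gpu_step_ms = 50   # assumed: the GPU processes one mini-batch in 50 ms
batch_size = 64    # assumed mini-batch size

# The loader must sustain at least this many batches per second...
required_batches_per_sec = 1000 / gpu_step_ms          # 20.0
# ...which translates to this many training examples per second:
required_examples_per_sec = required_batches_per_sec * batch_size  # 1280.0

print(required_batches_per_sec, required_examples_per_sec)
```

If the storage + decode + augmentation pipeline cannot deliver examples at that rate, the GPU stalls between steps and utilization drops below 100%, which is the situation the rest of the post aims to design away.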
Thanks for sharing!
I spent a few years going through this learning curve with my teams at a startup doing AI for medical imaging.
It makes me proud to see yet another validation for the path we took to identify and overcome these issues.
Here is my detailed summary of Google's retinopathy journey. Notice the similarity to DeepMind's learning curve…
Deploying machine learning models to production in order to perform inference, i.e. predict results on new data points, has proved to be a confusing and risky area of engineering.
Many projects fail to make the transition from the lab to production, partially because these difficulties are not addressed in time.
In my opinion, there are three major factors which make deployment challenging:
Google Health’s recent paper provides a fresh set of insights about what it takes to bring an AI solution to real clinical use. Below is a summary of some key takeaways which should be applicable to many companies and solutions in this space.
For the last few years, Google Health has been making headlines with its AI-based solution for detecting diabetic retinopathy (and diabetic macular edema) from retinal fundus images. The latest episode of their journey covers a post-development phase of clinical deployment, and a prospective study combined with a “human centered evaluation” of their solution.
Having worked on…
Over the last few years, many software organisations have started developing products which leverage machine learning.
Like every new technology, ML involves a certain learning curve, and managing this learning curve is critical for a successful ML adoption.
Interestingly, many organisations focus so much on the areas they need to learn that they miss opportunities to leverage assets they already have, assets which are critical for their success.
A great example of this phenomenon is the “Data Science Unicorn” myth.
Some companies expect to hire a single data scientist who will deliver their ML project practically single-handedly.