This is part II of a two-part series, describing our solution for running distributed training on spot instances using TorchElastic and Kubernetes.

Part I introduced our overall technology selection, design principles and benchmarks.

In Part II, we will walk you through the process of creating a simplified version of the training environment on AWS.

Note: This post assumes you’ve read Part I of this series, and that you have a decent level of experience working with Kubernetes and AWS infrastructure.

Target architecture

Image by the author


  1. Build a training container
  2. Set up an EKS cluster
  3. Install TorchElastic infra on the cluster
  4. Set up an EFS PersistentVolumeClaim
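As a rough sketch of step 2, an EKS cluster with a GPU spot node group can be described with an eksctl config file along these lines. The cluster name, region, instance type and capacities below are illustrative assumptions, not values from this series:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: training-cluster   # hypothetical name
  region: us-east-1

nodeGroups:
  - name: gpu-spot-workers
    instancesDistribution:
      instanceTypes: ["p3.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0   # 100% spot capacity
    desiredCapacity: 2
    minSize: 0
    maxSize: 4
```

Running `eksctl create cluster -f cluster.yaml` against a config like this would create the cluster and the spot-backed node group in one shot.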

Deep Learning development is becoming more and more about minimizing the time from idea to trained model.

To shorten this lead time, researchers need access to a training environment that supports running multiple experiments concurrently, each utilizing several GPUs.

Until recently, training environments with tens or hundreds of GPUs were the sole property of the largest and richest technology companies. However, recent advances in the open-source community have helped close this gap, making this technology accessible even for small startups.

In this series, we will share our experience in building out a scalable training environment using TorchElastic and Kubernetes, utilizing…

Great writeup, thanks for sharing.

I am under the impression that the backwards pass is in fact the synchronization barrier.

When your "main" worker is busy with tasks like running validation or writing checkpoints, for example at the end of the nth epoch, the other workers have already started the next epoch.

The other workers will complete exactly one mini-batch, and then get "stuck" since they will wait for the "main" worker to exchange gradients with them.

They will wait for the timeout you set on the process group - so I would set it to <5 minutes.

If something takes longer than that, probably one of the processes has really failed.
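To illustrate the timeout suggestion above, here is a minimal, self-contained sketch of setting a timeout on the process group. It uses the gloo backend and a single-process world so it runs standalone; in a real spot-instance job the backend, rank and world size would come from your launcher (e.g. TorchElastic):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Single-process "cluster" so the example runs standalone;
# in a real job these come from the launcher's environment.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# The timeout bounds how long workers block waiting for a peer
# (e.g. while the main worker runs validation or writes a checkpoint).
dist.init_process_group(
    backend="gloo",          # use "nccl" on GPU nodes
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=5),
)

t = torch.tensor([5.0])
dist.all_reduce(t)           # a no-op with one worker, but this is the barrier
result = t.item()
dist.destroy_process_group()
```

If a collective call blocks for longer than the timeout, the process group raises an error instead of hanging indefinitely, which is exactly the behavior you want when a spot worker may have been reclaimed.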

This is mainly based on my experience running distributed training on multiple spot instances, so there is actually a good chance one of the workers did fail...


Almost every leader you talk to will quote building trust as one of the most important things for creating a highly functional team.

However, one of the most common mistakes leaders make on a weekly basis is failing to provide their team with good "whys".

When I say “why”, I am not talking about the company’s mission to transform X or make the world a better place.

"Providing a why" in this context simply means frequent communication around why decisions are being made, communication that is at least:

1. Honest

2. Aligned with a shared value/belief which is reinforced often

Great “whys”…

Optimize your deep learning training process by understanding and tuning data loading from disk to GPU memory


Deep learning experimentation speed is important for delivering high-quality solutions on time.

The data loading path, i.e. getting training examples from storage into the GPU, is an often-overlooked area, despite having a major impact on experimentation speed.

In this post, I will introduce the components of the data loading path, show how to formalize their performance goals, and explain how to design a solution that meets those goals.

Working backwards — from the GPU to the storage

Rationale — Keeping the GPU busy

GPUs are expensive to buy or lease, so during training you want to ensure they are at 100% utilization, crunching those matrices and updating weights.

GPUs work most efficiently on mini-batches…
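As a minimal sketch of keeping the GPU fed, the standard PyTorch DataLoader knobs for the data loading path look like this (the dataset and the specific numbers are illustrative, not tuned values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for real examples read from storage.
dataset = TensorDataset(
    torch.randn(256, 3, 32, 32),
    torch.randint(0, 10, (256,)),
)

# Knobs that keep the GPU busy: worker processes decode/augment examples
# in parallel with GPU compute, and pinned memory speeds host-to-device copies.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,     # tune to hide disk + preprocessing latency
    pin_memory=True,   # enables faster .to("cuda", non_blocking=True) transfers
)

images, labels = next(iter(loader))
```

The right values for `batch_size` and `num_workers` depend on your storage throughput and preprocessing cost, which is exactly what the performance goals below help you reason about.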

Thanks for sharing!

I spent a few years going through this learning curve with my teams at a startup doing AI for medical imaging.

It makes me proud to see yet another validation for the path we took to identify and overcome these issues.

Here is my detailed summary of Google's retinopathy journey. Notice the similarity to DeepMind's learning curve...


Why you should read this post

Deploying machine learning models to production in order to perform inference, i.e. predicting results on new data points, has proved to be a confusing and risky area of engineering.

Many projects fail to make the transition from the lab to production, partially because these difficulties are not addressed in time.

In my opinion, there are three major factors which make deployment challenging:

  1. Starting the process late — partially due to cultural gaps, and partially due to the nature of the research cycle, many teams leave the subject of deployment till very late in the game, resulting in an array of…


Google Health's recent paper provides a fresh set of insights about what it takes to bring an AI solution into real clinical use. Below is a summary of key takeaways that should apply to many companies and solutions in this space.


For the last few years, Google Health has been making headlines with its AI-based solution for detecting diabetic retinopathy (and diabetic macular edema) from retinal fundus images. The latest episode of their journey covers the post-development phase of clinical deployments, and a prospective study combined with a "human-centered evaluation" of their solution.

Having worked on…

Why ML needs its own flavour of development methodology, plus a partial draft proposal for such a methodology



  • ML teams working on complex projects need to battle the intrinsic challenges of ML, as well as the friction that arises from their multi-disciplinary nature, which makes decisions hard to reach.
    A solid development methodology can help teams improve execution.
  • ML is different from standard software in its level of uncertainty, and in the fact that models are influenced indirectly rather than engineered by design.
  • Agile doesn't work out of the box for ML, despite being a useful mindset. For example, it assumes that small features can be built by design and plan with low risk…

… But ML teams need to overcome a big cultural gap

Over the last few years, many software organisations have started developing products which leverage machine learning.

Like every new technology, ML involves a certain learning curve, and managing this learning curve is critical for a successful ML adoption.

Interestingly, many organisations focus so much on the areas they need to learn that they miss the opportunities to leverage assets they already have, assets which are critical for their success.

A great example of this phenomenon is the “Data Science Unicorn” myth.

The “Data Science Unicorn” myth

Some companies expect to hire a single data scientist who will deliver their ML project practically single-handedly.

This person…

Assaf Pinhasi

Machine Learning and Engineering Leader and consultant.
