What Digital Health AI startups can learn from Google Health’s attempt to create a real-life AI solution
Google Health’s recent paper provides a fresh set of insights about what it takes to bring an AI solution to real clinical use. Below is a summary of some key takeaways which should be applicable to many companies and solutions in this space.
Introduction
For the last few years, Google Health has been making headlines over its AI-based solution for detecting diabetic retinopathy (and diabetic macular edema) from retinal fundus images. The latest episode of their journey covers the post-development phase of clinical deployments, and a prospective study combined with a “human-centered evaluation” of their solution.
Having worked on commercial AI-based solutions for healthcare (radiology and more) for a number of years, I find that these publications resonate deeply with my own experience and learning.
I believe that with some extrapolation and a lot of reading between the lines, Google’s publications can offer startups and other companies starting to develop products in this space an opportunity to learn some seriously valuable lessons.
Below is my attempt at summarising, extrapolating, and expanding on some of these learnings, based on my own experience, focusing on what can be learnt from the process (rather than an attempt to accurately dissect Google’s scientific efforts or commercial offering).
Can you even learn from Google?
Entrepreneurs reading Google’s publications may be tempted to dismiss them, thinking that:
- Google might be working on this as an academic research project; or
- Google’s size, resources and public profile mean that Google is shooting unnecessarily high with their solution (i.e. a lean startup can aim a lot lower and still succeed)
Reading the publications from the last 2 years, you can rest assured that this is not the case:
- It’s true that Google started off treating this as a research project, but at least for the last 2–3 years, they seem very focused on deploying it and making real clinical impact.
- As for their standards — at the moment they seem focused on getting the solution to work in the most basic way… no bells and whistles whatsoever
Let’s look at their high-level picture, and then delve into how they approached each phase of the project, and into some of the practices and issues you can expect to encounter working on a similar project.
AI for healthcare is an ultra marathon, not a sprint
Google’s 5+ year journey, and counting…
A quick search surfaces papers on the DR algorithm from as early as 2016, for which they annotated data during 2015.
In 2018, they published a paper describing improvements to the algorithm, and a significant re-annotation effort.
In 2019, they built a device, in collaboration with Verily (an Alphabet company), and deployed it as part of a solution in India, and later in Thailand.
Reading the latest paper from 2020, it seems that Google is still trying to get Thailand to work. I’d say they’re not out of the woods yet…
…and requires significant resources
Google clearly has access to the best talent and resources money can buy.
It’s not entirely clear how large a team they put on this project.
According to one source, in 2019 there were more than 18 people on the team:
Google’s team is more than 18 people strong — the size of a startup after an A round.
And this team has support from Google’s top-notch legal, business development, political liaisons, highly paid clinicians and annotators, machine learning infrastructure, travel budget, and brand appeal — all the surrounding assets which most startups will never have!
Enough said…
Google’s development and evaluation methodology
There’s a lot to learn from Google’s publications about their development efforts and how they evolved their understanding of the problem over time.
Here are the main points:
1. Defining the solution is the first step you need to take, as it guides numerous decisions on how to build an algorithm which will actually solve a problem (vs. boast about high performance in the lab)
2. Failing to do so means you may reach amazing results in the lab, but fail completely in the context of your use-case - since the data distribution may be completely different from the one you trained and tested on in the lab.
3. Clinical input data seems homogeneous at first, until it doesn’t, and you realize you need to dig really deep clinically and technically if you want to understand it and be able to predict whether your algorithm will generalize or not
4. Clinical experts (and non-experts) don’t agree with each other most of the time, which makes it hard to determine what the “truth” is for training and evaluation
5. Visualisation helps build empathy between the users and your solution, and also helps detect systematic issues
6. Decision support systems seem like an easier goal to achieve with AI, but in fact expose your algorithm to continuous subjective evaluation, if you manage to win some of the practitioners’ trust at all (which most algorithms don’t)
Let’s dive deeper into Google’s journey:
2016 — Develop an algorithm and publish a paper with great-looking results
Google published their first serious paper about DR in 2016, with some very impressive results.
Without being an expert on ophthalmology (to say the least), you can detect a number of good practices in their paper:
- Their dataset was certainly no small feat to achieve — 128,175 images, annotated 3–7 times each! That’s an incredible number.
- They used data from diverse institutions, countries, and camera models in training
- They evaluated themselves on data from a source they left out of the training entirely, which also originated from a different country — good practice
- They addressed at least one nuance of the clinical protocol — diversity in the clinical process of obtaining the images (pupil dilation: yes/no)
- There are hints that they dealt with difficult-to-define annotation tasks by specifying very detailed instructions for their annotators (”Referable diabetic macular edema was defined as any hard exudates within 1 disc diameter of the macula, which is a proxy for macular edema when stereoscopic views are not available”). These definitions are another way to create agreement amongst annotators, who sometimes (often) apply somewhat subjective criteria or interpretations to certain findings.
- Continuing along this line, they noticed and began to address the incredibly irritating fact that medical experts rarely agree with each other, even on well-defined tasks. They used majority votes, rated the level of inter- and intra-reader consistency, and chose the most consistent annotators for the validation set. These practices help clean up the noise in the lab, but do not prepare you for the moment you get exposed to subjective evaluations — which resurfaces these disagreements in a very painful way.
- They published the distribution of their dataset across several dimensions. Here you can immediately notice that the more severe cases were under-represented — not a great practice.
- You can also notice that there is not much reference to other co-morbidities of the retina in this analysis — which suggests that they did not go very deep clinically at this stage.
- Regarding the algorithm design: Google built a binary classifier to output whether “referable DR, defined as levels moderate or worse, OR referable macular edema” is present. Per the paper, there are 5 levels of DR (none, mild, moderate, severe, or proliferative), which they used when annotating the dataset, yet they chose to train a binary classifier. This decision seems innocuous — but it shows that Google was not yet sure where and how the algorithm would be used. They later changed the model to predict the specific level once they finally decided how to apply it to a clinical problem (the sketch after this list shows how a consensus grade and such a binary “referable” label can be derived from multiple graders’ 5-level grades).
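To make two of the points above concrete — the majority-vote adjudication and the binary “referable” target — here is a minimal sketch. This is my own illustration, not Google’s actual pipeline: the example grades, the tie-breaking rule, and the threshold are all assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades: rows = images, columns = graders.
# Values follow the 5-level DR scale: 0=none, 1=mild, 2=moderate, 3=severe, 4=proliferative.
grades = np.array([
    [0, 0, 1],
    [2, 3, 2],
    [1, 2, 2],
    [4, 4, 3],
])

REFERABLE_THRESHOLD = 2  # "moderate or worse" counts as referable DR

def consensus_grade(row):
    """Majority vote; ties broken towards the more severe grade (an assumed, conservative rule)."""
    counts = Counter(row.tolist())
    top_count = counts.most_common(1)[0][1]
    return max(g for g, c in counts.items() if c == top_count)

consensus = np.array([consensus_grade(r) for r in grades])
referable = (consensus >= REFERABLE_THRESHOLD).astype(int)  # binary training target

# Pairwise inter-grader agreement signals how noisy the "ground truth" really is.
kappa = cohen_kappa_score(grades[:, 0], grades[:, 1])
print("consensus grades:", consensus)
print("binary referable labels:", referable)
print("Cohen's kappa (grader 0 vs 1): %.2f" % kappa)
```

Low kappa values here are an early warning that any single “ground truth” label is partly a modelling choice, which is exactly the pain the later sections describe.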
Conclusion:
Solid work, but clinical depth is not the focus.
A lot of work to address issues caused by inter-expert disagreements.
Good results and a nice paper — in the lab and without a clinical use-case in sight
2018 — Fixing model issues; starting to partner with clinics to deploy a solution
In 2018, Google published a paper describing some serious improvements they made to the data and the model.
You have to wonder: if their 2016 results were that impressive (human level performance), then why invest so much in improving the algorithm?
I will speculate that the trigger was Google receiving some external feedback. Probably still not in a full clinical setting, but from clinicians who attempted to tell if they could “trust” the algorithm.
This is a frustrating point in the project: You achieve great results in the lab, but in an early external evaluation, your precious algorithm receives lukewarm feedback claiming that “the algorithm has many misses”, or “makes very obvious mistakes which are unacceptable”.
This type of feedback causes a trust crisis:
“Can it be that the data was lying?”
This is what happens next:
- The inter-expert disagreement is the first to be blamed. Using ground-truth methods like consensus/majority/quorum is a typical move.
- Then, there’s the more productive (IMHO) effort of understanding what makes the algorithm’s mistakes “obvious”: are there systematic (vs. random) issues we can address in the algorithm, which have simply eluded us thus far? This marks the early stages of the journey to really understand the data.
- One area to investigate is performance on inputs with different quality, different technical characteristics, or which were acquired under different clinical protocols (see the stratified-evaluation sketch after this list). For example, it’s possible the research team eliminated unclear images from the datasets in an effort to create a stronger signal for learning — but these are often images which practitioners can and do read without much trouble, so you can expect to lose points there (Google ran into this issue in Thailand, it seems).
- Another area of systematic misses is co-morbidities and pathologies — clinical phenomena which either appear often alongside the condition you are predicting, or are visually quite similar to it; the former is a problem because DL algorithms often mix up correlation and causation.
- Another step is to develop visualisation techniques for the model’s results, which can be shown to experts in order to:
Create empathy towards the algorithm’s decision
Help detect systematic errors in the algorithm by pointing you to what it’s “looking at”
Increase agreement with the algorithm and between readers
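To illustrate the stratified-evaluation idea mentioned above, here is a minimal sketch, assuming a hypothetical per-image results table with the model’s score, the adjudicated label, and acquisition metadata. The file and column names are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-image evaluation table with acquisition metadata.
# Columns: site, camera_model, pupil_dilated, label, model_score
df = pd.read_csv("eval_results.csv")

# A single headline metric hides the failure modes; slice by how and where
# the image was acquired to see where the algorithm actually breaks.
for keys, group in df.groupby(["site", "camera_model", "pupil_dilated"]):
    if group["label"].nunique() < 2:
        continue  # AUC is undefined when a slice contains only one class
    auc = roc_auc_score(group["label"], group["model_score"])
    print(f"{keys}: n={len(group)}, AUC={auc:.3f}")
```

Slices with few samples or sharply lower AUC are exactly the inputs that later produce “obvious mistakes” in front of clinicians.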
Here is a valuable point: practitioners will develop a lot more empathy towards an algorithm and forgive a mistake, if they can “understand how it made the decision”.
First, this can help them dismiss it as a mistake without wasting time over it; second, it enables them to generate explanations regarding the source of the mistake (some of these explanations attribute a lot more sense to the algorithm than it deserves).
This lesson is very applicable to real-life deployments too: not just for improving the algorithm, but for establishing trust and allowing the practitioner to “work around” model idiosyncrasies and leverage it when it makes sense.
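Google’s papers describe heatmap-style visualisations of the model’s output. Their exact method isn’t reproduced here, but a common way to generate this kind of “what is it looking at” overlay is Grad-CAM; below is a minimal sketch assuming a Keras CNN classifier, with a placeholder layer name rather than anything from Google’s implementation.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    """Return a coarse heatmap of the regions that drove the score for `class_index`."""
    # Map the input to the last conv layer's activations and the predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)       # d(score) / d(activations)
    weights = tf.reduce_mean(grads, axis=(1, 2))       # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                           # keep only positive evidence
    cam = cam / (tf.reduce_max(cam) + 1e-8)            # normalise to [0, 1]
    return cam.numpy()                                 # upsample and overlay on the fundus image

# Hypothetical usage: heatmap = grad_cam(model, fundus_image, "last_conv", class_index=1)
```

Overlaying such a heatmap on the fundus image gives the reviewing clinician something to agree or argue with, which is where the empathy described above starts to form.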
This is what Google said they did to improve the algorithm:
- They added more data
- They added more annotators
- They changed their annotation protocol somewhat
- They changed the model to predict the 5 sub-categories of DR vs. just binary
- They tightened their definitions of ground truth by employing experts who needed to reach a consensus (moving further away from the pain of subjective evaluation)
- They drilled down into co-morbidities (i.e. various clinical phenomena which co-appear with the one they were trying to detect, and cause disagreement/confusion for experts or the algorithm)
- They developed a visualisation mechanism to explain the model’s result
The last chapter of 2018 was that Google Health researchers partnered with Verily, an Alphabet company, to bring their algorithm to a device and deploy it in India, where there is a shortage of specialists able to diagnose the population.
Interestingly, in the same year, another company (IDx) received FDA clearance for a device to detect DR and beat Google to this milestone.
From what I can gather, Google met with the FDA at some point, but didn’t seem to take this further and pursue FDA approval, despite Verily being on the FDA pre-certification path.
Conclusions:
Inter-reader disagreement is a huge pain during development, and establishing ground truth in Healthcare can be both expensive and elusive;
Algorithms often face a very cold welcome the first time a semi-objective practitioner attempts to evaluate whether the algorithm actually “makes clinical sense” (vs. achieve a high score on an engineered dataset)
To win practitioners’ trust, you need to ensure that the algorithm can distinguish between similar / co-occurring findings, handle most of the inputs practitioners can handle, and provide a way for them to “understand why” the algorithm thought what it did
2019 — deployment in India not going great
In a blog post from 2019, Google and Verily announced that they had deployed at the Aravind Eye Hospital in India, where the algorithm was put to clinical use.
It seems that the solution works as follows:
As part of a diabetes checkup, nurses/technicians take an eye scan from each patient. They then need to send it to an ophthalmologist for evaluation, and the results take weeks to arrive. The ophthalmologist’s diagnosis determines the next steps in caring for that patient.
So the plan is that Google’s algorithm will replace the ophthalmologist, shortening the waiting time and speeding up care (they expected similar time savings in Thai hospitals).
However, it seems that not all was going well for Google in their first major clinical deployment (later it became evident that things certainly weren’t going smoothly in Thailand either).
The WSJ wrote a piece about Google’s work in India, which included the following paragraph:
“While ARDA is effective working with sample data, according to three studies including one published in the Journal of the American Medical Association, a recent visit to a hospital in India where it is being tested showed it can struggle with images taken in field clinics. Often they are of such poor quality that the Google tool stops short of producing a diagnosis — an obstacle that ARDA researchers are trying to overcome”
It turns out that image quality and acquisition protocols vary wildly across institutions, often producing images which are substantially different from anything the algorithm has seen.
It gets even more frustrating than that: Google did have data from Aravind Eye Hospital all the way back in 2016 when they were training their original model! But mysteriously, they chose not to use that data in the evaluation they included in the paper… My guess is that they saw the bad performance and decided the data wasn’t telling the whole story.
On the bright side, Google pursued a CE mark with Verily for their device.
Conclusions: this is the first sign of the price Google paid for developing an algorithm in the lab and deciding only in retrospect how it would be used (and hence how it would actually be evaluated).
More of this in 2020...
2020 — more difficulties in Thailand
A few days ago, Google published more information about what they were doing in Thailand.
The hypothesis remains that Google will try to replace the remote ophthalmologist, provide the staff in the clinic with immediate feedback, and help them determine the clinical path for patients with DR.
However, in Thailand, Google hit a new bump in the road: the nurses who deal with the patients in the clinic sometimes read the images themselves, and decide to issue referrals for followup “on the spot”, without waiting for the ophthalmologist’s input.
It looks like these nurses had a go at Google’s system and disagreed with its results so much that they didn’t trust it at all, to the point that they were reluctant to onboard patients to Google’s prospective study!
Google would first need to get the nurses to trust their solution to perform well enough — at least to perform their “on the spot” evaluation. And, just like that, Google’s solution started being evaluated as a decision support tool for nurses (at least temporarily).
Most of Google’s latest paper is about what nurses thought about their solution (in short, they didn’t love it).
This is a huge setback: The original plan to replace the ophthalmologist (an expert) required a complex but objective, scientific prospective study proving statistically that their algorithm was comparable to the experts; once they’d done that, they could claim that the algorithm can replace the human experts entirely (perhaps aside from some random quality checks by experts).
However, now, Google’s solution is exposed as a decision support tool for nurses — which means a continuous, highly subjective evaluation by non-experts, who first need to choose to pay attention to the tool and can decide to ditch it whenever their patience runs out…
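To make the contrast concrete, the “objective” half of that comparison ultimately boils down to estimating the algorithm’s sensitivity and specificity against the adjudicated reference standard, with confidence intervals tight enough to support the claim. A minimal sketch with purely illustrative counts (not Google’s study numbers) might look like this:

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical prospective-study counts against the adjudicated reference standard.
true_pos, false_neg = 182, 18   # referable cases caught / missed by the algorithm
true_neg, false_pos = 740, 60   # non-referable cases correctly passed / over-called

sens = true_pos / (true_pos + false_neg)
spec = true_neg / (true_neg + false_pos)

# Wilson score intervals for each proportion.
sens_ci = proportion_confint(true_pos, true_pos + false_neg, alpha=0.05, method="wilson")
spec_ci = proportion_confint(true_neg, true_neg + false_pos, alpha=0.05, method="wilson")

print(f"Sensitivity: {sens:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
print(f"Specificity: {spec:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```

A study like that can be designed, powered and finished; the subjective, day-to-day evaluation by nurses never ends, which is what makes the decision-support framing so much harder.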
Google’s stories from Thailand teach us another important lesson: an overworked healthcare professional will quickly lose patience and stop using a decision support tool if they don’t develop trust towards it quickly, or can’t see its value.
For the professional, it’s usually significantly more work to use the tool, provide feedback, and double-check its results — vs. just ignoring it and going about their job the way they are used to and trained for.
Google didn’t make this easier for themselves either — it seems that again, like in India, they chose to deploy their algorithm to clinics where the conditions are very different from those the algorithm was evaluated under (the paper is full of details about where it fell short on the Thai data).
This probably led to a bad first impression among the nurses and wiped out their patience very quickly.
There is a lot to say about the difficulty of developing a useful decision support tool in Healthcare, but we’ll leave it for another post.
Conclusion
It’s fascinating to observe Google’s long, slow and difficult journey to deliver a working, impactful solution to the Healthcare domain.
The tech giant is still not out of the woods, but I bet that if anyone can do it, they can…
Their journey seems to follow a similar path to that taken by other, smaller companies — a path full of lessons learnt, mostly the hard way — which is at the same time exhilarating, full of learning, frustrating and humbling.
Personally, I love working on AI in healthcare exactly because of these challenges.
I hope this post can help others on this path feel like they are not alone, and perhaps even to avoid costly mistakes.
If you feel like discussing your challenges in delivering an AI solution in the healthcare space, I would be happy to try and assist — or just share experiences.
Cheers,
Assaf