
Big Data, Machine Learning and Healthcare – An increasingly significant interplay

The field of healthcare is undergoing a revolution with the increasing adoption of technology – devices, sensors, software, insights and artificial intelligence. Unfortunately, buzzwords such as machine learning and Big Data have clouded this conversation, and people who throw them around do a major disservice to any real adoption of new technologies. I am not the only one who is bothered by this; others are voicing their concerns too, for example in Machine Learning – Can We Please Just Agree What This Means? Yet even though machine learning and Big Data have become buzzwords that carry a negative connotation these days, in healthcare they are making significant inroads. So what is Big Data, what is machine learning, and how are they changing healthcare?

Defining Big Data is like defining what life is – it depends on who you ask. The simplest definition, and perhaps the closest to a consensus, is attributed to Gartner: Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. But not everybody agrees. According to The Big Data Conundrum: How to Define It, published by MIT Technology Review, Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland surveyed what the term means to different organizations, got very different answers, and bravely finished their survey with a definition of their own: Big Data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning. One significant evolution of the term seems to be the marriage of the concept with the enabling technology or algorithmic framework – specifically databases, optimization algorithms and machine learning.

So let’s now turn our focus to machine learning, which suffers from similar definitional problems. To me, machine learning is simply the process by which a computer can learn to do something. That something might be as simple as reading the alphabet or as complex as driving a car on its own. Explaining this to a non-technical audience is not easy, but valiant efforts have been made by some lost souls – for example, Pararth Shah, ex-Stanford student currently at Google Research, in How do you explain Machine Learning and Data Mining to a layman?, and Daniel Tunkelang, a data scientist who has led teams at LinkedIn and Google, in How do you explain machine learning to a child? How this learning can take place, however, is harder to explain. Before attempting that, let me clarify some other technical jargon people may have thrown at you, such as AI, soft computing and computational intelligence. AI, which stands for Artificial Intelligence, is the generic study of how human intelligence can be incorporated into computers. Machine learning, a sub-area of AI, concentrates on the theoretical foundations of algorithms that learn from data; the computational aspects of these algorithms are generally considered to belong to the fields of Computational Intelligence and Soft Computing, examples of which are neural networks, fuzzy systems and evolutionary algorithms. Put more simply, a machine-learning algorithm determines the relationship between a system’s inputs and outputs from a learning data set that is representative of the behavior found in the system, using various data modeling techniques. This learning can be either supervised or unsupervised.
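To make the supervised case concrete, here is a minimal sketch in Python using scikit-learn. Everything in it is hypothetical – the data is randomly generated and the feature meanings are made up – but it shows the basic recipe of fitting a model to a learning data set of inputs and outputs and then checking it on examples it has never seen.

```python
# A minimal sketch of supervised learning on entirely synthetic data.
# The feature meanings and the outcome are hypothetical; the point is
# only to show a model learning an input-output relationship.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical inputs (say, age and a baseline lab value) and a yes/no
# outcome (say, responded to treatment or not), generated at random.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The "learning data set" is the training portion; the test portion
# stands in for data the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # the supervised learning step
print("Accuracy on unseen data:", model.score(X_test, y_test))
```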

The interesting reality is that, whether we are aware of it or not, machine learning based solutions are already part of our daily lives – so much so that the BBC thought it would be fun to point it out in Eight ways intelligent machines are already in your life. Not surprisingly, one of the eight areas mentioned there is healthcare.

Now we come to the hard part of this discussion. There are numerous interplays between Big Data, machine learning and healthcare. Thousands of books are being written about them – my most recent search on Amazon for “machine learning” yielded 14,389 matches! Dedicated conferences on the topic are being organized. There are so many courses that David Venturi from Udacity was inspired to research and publish Every single Machine Learning course on the internet, ranked by your reviews. Virtually all healthcare startups are now expected to use some form of machine learning – VC funding to healthcare startups that use some form of AI increased 29% year-over-year to hit 88 deals in 2016, and is already on track to reach a 6-year high in 2017. A good starting point is “Top 4 Machine Learning Use Cases for Healthcare Providers” by Jennifer Bresnick. She broadly identifies the areas where machine learning has already made significant inroads: imaging analytics and pathology, natural language processing and free-text data, clinical decision support and predictive analytics, and finally, cyber-security and ransomware. If you are looking for something more specific, check out 7 Applications of Machine Learning in Pharma and Medicine by Daniel Faggella, who identifies the machine learning applications forecast to have the most impact on healthcare: disease identification/diagnosis, personalized treatment/behavioral modification, drug discovery/manufacturing, clinical trial research, radiology and radiotherapy, smart electronic health records and, finally, epidemic outbreak prediction. In reality, these days it is becoming harder and harder to find any area of healthcare that is untouched by machine learning in some way.

Why is this happening? It’s because we are realizing that machine learning has enormous potential to make healthcare more efficient, smarter and more cost effective. My prediction is that in the future we will not talk about machine learning as a separate tool; it will become so ubiquitous that we will automatically assume it is part of a solution, much in the same way we no longer think of internet search as a separate tool – we automatically assume it is available. The more important question is, now that the genie is out of the bottle, where will it end? Will we one day have completely autonomous artificial systems, such as the famous Emergency Medical Hologram Mark I, replace human doctors as the creators of Star Trek imagined? Or will healthcare prove to be so complex that no machine can ever replace humans completely? Only time will tell.

When is a small sample size too small for statistical reporting?

It has been a fairly well-known assumption in statistics that a sample size of 30 is a so-called magic number for estimating distributions or statistical errors. The problem is, firstly, that according to Andrew Messing of the Center for Brain Science, Harvard University, like a lot of rule-of-thumb commonsense measures, this assumption does not have a solid theoretical basis to prove its veracity. Secondly, the number 30 is itself arbitrary, and some textbooks give alternative magic numbers of 50 or 20; examples can be found in Is 30 the magic number? Issues in sample size estimation or Shamanism as Statistical Knowledge: Is a Sample Size of 30 All You Need? The funny thing is that there is no formal proof that any of these numbers are useful, because they all rely on assumptions that can fail to hold in one or more ways, and as a result the adequate sample size cannot be derived using the methods typically taught (and used) in the medical, social, cognitive, and behavioral sciences.

Like so many others before me, I got to thinking. One of my domains is healthcare data analytics, a field that is perpetually inundated with data. Should I test this rule of thumb and see if there is any truth to it?

Let’s set the background first. The data I was looking at centers around the treatment of Hepatitis C. The goal of Hepatitis C therapy is to clear the patient’s blood of the Hepatitis C virus (HCV). During treatment, doctors routinely monitor the level of virus in the patient’s blood – a measurement known as viral load – typically in International Units per milliliter (IU/mL). When I was slicing and dicing the data by criteria such as age, sex, genotype and so on to report the effectiveness of treatment, the sample sizes of some of these cohorts became quite small. So what should the minimum size of my sample set be before I can confidently report a result?

Let’s look at a fairly simple mathematical model now. In this specific case, we assumed a t-distribution for our data. The t-distribution is almost engineered to give a better estimate of confidence intervals when the sample size is small. It looks very similar to a normal distribution: it has a mean, which is the mean of our sampling distribution, but it has fatter tails. Normally, what we do is estimate the true standard deviation, and then say that the standard deviation of the sampling distribution – the standard error – is equal to the true standard deviation of the population divided by the square root of n, the sample size.

This is especially useful since we never – or at least seldom – know the true standard deviation. If we don’t know it, the best thing we can plug in is the sample standard deviation.
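In symbols, the estimate described in the last two paragraphs is simply the familiar standard error of the mean:

SE = σ / √n ≈ s / √n

where σ is the (typically unknown) population standard deviation, s is the sample standard deviation we actually compute, and n is the sample size.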

We do not call this estimate a probability interval; rather, it is a “confidence interval”, because we are making some assumptions. The interval will change from sample to sample, and in particular we expect it to be a particularly poor estimate when the sample size is really small.

Accordingly, we calculated confidence intervals for the data using the above procedure. If the size of the sample was more than a cutoff, say 30, we used Z-scores; otherwise we used the t-table. (A detailed treatment of t-tables and Z-scores is outside the scope of this article.) This basically means that we first find the mean, then the standard deviation, and then the standard error, which is equal to the standard deviation divided by the square root of the sample size. We then find the margin of error using either the t-table or the Z-score, as mentioned above. Finally, the interval at a given % confidence is the mean +/- that margin.
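Here is a minimal sketch of that procedure in Python, assuming numpy and scipy are available. The cohort values are made up purely for illustration; they are not the Hepatitis C data discussed above.

```python
# A sketch of the procedure described above, using made-up numbers.
# For n >= 30 we use a Z critical value; for smaller samples we use the
# t-distribution, exactly as in the text: mean +/- critical * (s / sqrt(n)).
import numpy as np
from scipy import stats

def confidence_interval(sample, confidence=0.95, cutoff=30):
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mean = sample.mean()
    std_err = sample.std(ddof=1) / np.sqrt(n)   # s / sqrt(n)
    tail = (1 + confidence) / 2                 # 0.975 for a 95% interval
    if n >= cutoff:
        crit = stats.norm.ppf(tail)             # Z-score
    else:
        crit = stats.t.ppf(tail, df=n - 1)      # t-table value
    margin = crit * std_err
    return mean - margin, mean + margin

# Illustrative only: a small synthetic "cohort" of log10 viral-load values.
cohort = np.random.default_rng(1).normal(loc=6.0, scale=0.8, size=28)
print(confidence_interval(cohort))
```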

The graph below shows the results where the sample size was 1,720 patients.

The next graph shows the results where the sample size is 28.

In both of these cases, it appears that a sample size of around 30 gives us enough statistical confidence in the results we are presenting. In both cases, however, we are bound by the fact that comparing effectiveness across treatments will probably depend most on the size of the sample (cohort) itself, the closest metric being the utilization factor. For the values calculated within each category, however, we should be able to report the numbers with a prescribed confidence interval. In all the calculations presented above, that confidence level was 95%.

So we picked a set of “arbitrary” healthcare data, and a sample size of around 30 turned out to be large enough to generate dependable statistics. But the question remains: why?

The best rationale I have come across for why this is such a popular number was given by Christopher C. Rout of the University of KwaZulu-Natal, Department of Anesthetics and Critical Care, Durban, KwaZulu-Natal, South Africa. According to him, it is not that 30 is “enough”, but rather that we need “at least” 30 samples before we can reasonably expect an analysis based on the normal distribution (i.e. a Z test) to be valid. That is, it represents a threshold above which the sample size is no longer considered small. It may have to do with the difference between 1/(n-1) and 1/n, the divisors that appear in the sample and population variance estimates. At about 30 (actually between 32 and 33), this difference drops below 0.001, so the intuitive sense is that at or around that sample size, moving to larger samples no longer contributes much to the probability distribution calculation, and the estimated error falls to acceptable levels.
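A few lines of Python are enough to sanity-check that arithmetic; this is just a numerical illustration of the threshold, not part of Rout’s argument itself:

```python
# Quick numerical check: the gap between 1/(n-1) and 1/n first drops
# below 0.001 between n = 32 and n = 33.
for n in range(30, 36):
    print(n, round(1 / (n - 1) - 1 / n, 6))
# n = 32 -> 0.001008, n = 33 -> 0.000947
```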

Well, sounds about right to me.
