50% of Machine Learning studies using [particular medical dataset] fundamentally flawed

By David Röthlisberger. Tweet your comments @drothlis.

Published 17 Feb 2020.

Gilles Vandewiele et al’s 2020 paper Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling provides a sobering reminder to take Machine Learning studies with a grain of salt: Almost 50% of the 24 peer-reviewed studies that applied machine learning to a particular publicly-available dataset were fundamentally flawed. These studies claimed near-perfect accuracy at predicting a patient’s risk of pre-term birth; after correcting the methodological flaw, Vandewiele et al found that the actual accuracy was between 52% and 65%. Read on for a layman’s description of the flaw.

In Machine Learning, you train your model (e.g. a Neural Network) using some of your dataset. In this case the dataset is electrohysterography measurements of the uterine muscle; each measurement is labelled with the outcome “pre-term birth” or “full-term birth”. The model’s job is to predict, for a new measurement it hasn’t seen before (a new patient not in the training dataset), whether the patient is at high risk for pre-term birth.

I said “some” of your dataset: You only use part of it to train the model, because you reserve the rest, say 20%, to test the model after training. This tells you whether your model can “generalise”, that is, give accurate results for datapoints it didn’t see during training. Machine Learning models can easily memorise or “overfit” to the training data and then generalise poorly, especially when there is very little training data relative to the size of the model. This particular dataset has only 300 datapoints (300 patients).
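To make that split concrete, here’s a minimal sketch using scikit-learn (the studies themselves don’t necessarily use this library, and the random data below is just a stand-in for the real EHG measurements):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the real dataset: 300 patients, 20 features each,
# with ~13% labelled 1 (pre-term birth) and the rest 0 (full-term birth).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (rng.random(300) < 0.13).astype(int)

# Hold back 20% of the patients; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```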

Now, many medical applications of Machine Learning have the problem of “imbalanced classes”: Most births are full-term, so only 13% of the datapoints are for pre-term births. Machine Learning tends to perform better if each batch (iteration) of the training algorithm contains roughly 50% of each “class”. So we have to over-sample the “pre-term birth” class, for example by providing each pre-term datapoint twice, or 3 or 4 times.
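The crudest way to over-sample, just to make the idea concrete, is to duplicate the minority-class rows until the classes are balanced. This is my own sketch, not code from any of the studies:

```python
import numpy as np

def naive_oversample(X, y, minority_label=1):
    """Re-sample the minority class (with replacement) to match the majority class."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    # Draw minority indices, with replacement, until there are as many as majority rows.
    extra = np.random.default_rng(0).choice(
        minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, extra])
    return X[idx], y[idx]
```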

The flaw in the studies is this: They over-sampled (copied the pre-term datapoints) before splitting the dataset into training vs. test sets. This means that copies of datapoints in the test set were also in the training set, so the model had already seen them during training. No wonder they measured such high accuracy!
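Continuing from the snippets above, the difference between the flawed pipeline and the corrected one is just the order of two steps (again, my own illustration, not code from the paper):

```python
from sklearn.model_selection import train_test_split

# Flawed: over-sample first, then split. Copies of the same pre-term
# patients end up in both the training set and the test set.
X_all, y_all = naive_oversample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Correct: split first, then over-sample only the training set.
# The test set is untouched, so it really does measure generalisation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, y_train = naive_oversample(X_train, y_train)
```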

(In practice over-sampling uses more sophisticated methods to interpolate new “synthetic” datapoints from the real datapoints, but the flaw remains the same: information leakage between the training & test sets.)
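For example the SMOTE algorithm (shown here via the imbalanced-learn package, purely for illustration) interpolates new synthetic minority-class points from the real ones. The fix is the same: apply it only to the training set, after the split.

```python
from imblearn.over_sampling import SMOTE

# Generate synthetic pre-term datapoints from the training set only,
# after the train/test split, so nothing leaks into the test set.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
```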