Pradyumna Lanka1, D Rangaprakash1, and Gopikrishna Deshpande1,2,3
1AU MRI Research Center, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL, United States, 2Department of Psychology, Auburn University, Auburn, AL, United States, 3Alabama Advanced Imaging Consortium, Auburn University and University of Alabama, Birmingham, AL, United States
Synopsis
In this study, we highlight the fact that cross-validation accuracy may not be a good estimate of classifier performance in neuroimaging-based diagnostic classification, especially with the small sample sizes typically encountered in neuroimaging. We trained an array of classifiers using resting-state fMRI-based functional connectivity measures from subjects in a particular age group using cross-validation, and then tested them on an independent set of subjects with the same diagnoses (mild cognitive impairment and Alzheimer's disease) but from a different age group. We demonstrate that cross-validation accuracy can give an inflated estimate of the true performance of the classifiers.
INTRODUCTION
Resting-state functional connectivity magnetic resonance imaging (rs-fcMRI) models the interactions between brain regions. Because these interactions are sensitive to disease states, machine learning (ML) algorithms have been used to classify mental/neurological disorders based on rs-fcMRI [1], and there has been considerable interest in using ML to develop diagnostic tools and biomarkers. Unfortunately, owing to factors such as subjective diagnostic criteria, heterogeneity of clinical populations, small sample sizes of neuroimaging data and the complexity of the underlying altered brain networks, such classification models are often unreliable or inaccurate [2]. Given the large variations in accuracy estimates and the difficulty of generalizing classification results to larger populations, broad conclusions about the validity of ML classifiers for disease classification based on rs-fcMRI metrics are premature. Given these uncertainties, we used a large array of popular ML classifiers with functional connectivity (FC) features to investigate whether the often-used cross-validation accuracy reflects the true predictive performance of a classifier, by measuring accuracy on a truly independent test dataset that was not used in training or cross-validation.
METHODS
We first used simulated data to validate the classifiers against a known ground truth. 1500 normally distributed features with means of 0.2, 0.5 and 0.8 were simulated, and the standard deviation (SD) of each feature was incremented from 0.1 to 0.8 in steps of 0.1 to test the classifiers' robustness to noise.
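As an illustration, a minimal Python sketch of how such simulated data could be generated; the grouping of features by mean into three classes and the per-group sample size of 50 are assumptions made here for the sketch and are not specified in the abstract:

# Minimal sketch of the simulated data: 1500 normally distributed features whose
# means (0.2, 0.5, 0.8) define the groups; the SD is swept from 0.1 to 0.8.
# The three-group structure and per-group sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_per_group, means = 1500, 50, [0.2, 0.5, 0.8]

def simulate(sd):
    # Stack one block of samples per group, all features sharing the same SD
    X = np.vstack([rng.normal(mu, sd, size=(n_per_group, n_features)) for mu in means])
    y = np.repeat(np.arange(len(means)), n_per_group)
    return X, y

# SD incremented in steps of 0.1 to probe robustness to noise
datasets = {round(sd, 1): simulate(sd) for sd in np.arange(0.1, 0.81, 0.1)}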
Rs-fcMRI data from 132 subjects were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). 100 subjects were used for training/validation and 32 subjects were kept aside entirely for independent testing. Note that the training/validation and test data contained subjects from different age groups, in order to test the generalizability of the classifiers (Fig-1). A standard preprocessing pipeline for rs-fcMRI data was implemented using the Data Processing Assistant for Resting-State fMRI (DPARSF) toolbox [3].
Mean time series were extracted from 200 functionally homogeneous brain regions (CC200 template [4]) and FC was calculated for all region pairs. An initial "feature-filtering" step was performed wherein only the connectivity paths that were significantly different between the groups (p<0.05, FDR corrected) in the training/validation data were retained (after controlling for possible confounding factors), thereby reducing the number of features from 19,100 to 655. No statistical tests were performed on the independent test data, to avoid introducing any bias.
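A minimal sketch of this FC feature construction and feature-filtering step is given below; random placeholder time series stand in for the real data, and plain two-sample t-tests stand in for the group comparison with confound control used in the study:

# FC features: upper-triangular Pearson correlations between all ROI pairs,
# filtered by an FDR-corrected group comparison on the training data only.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
ts_train = rng.standard_normal((100, 140, 200))   # placeholder: 100 subjects, 140 TRs, 200 ROIs
labels = np.repeat([0, 1], 50)                    # placeholder diagnostic labels

def fc_features(ts):
    # One row of pairwise ROI correlations per subject
    iu = np.triu_indices(ts.shape[2], k=1)
    return np.array([np.corrcoef(subj.T)[iu] for subj in ts])

X_train = fc_features(ts_train)                   # training/validation subjects only
t, p = stats.ttest_ind(X_train[labels == 0], X_train[labels == 1], axis=0)
keep = multipletests(p, alpha=0.05, method='fdr_bh')[0]   # FDR-corrected feature filter
X_train_filt = X_train[:, keep]                   # the same mask is later applied to the test set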
To further reduce the number of features while retaining discriminative information, some of the classifiers were embedded in the recursive cluster elimination (RCE) framework [5].
The classifiers used (Fig-4) were implemented both within (Fig-2) and outside (Fig-3) the RCE framework. Using FC features from the 100 training/validation subjects, classification accuracy was calculated using cross-validation, and the classifier models obtained from the different partitionings of the training data (repeats × folds) were saved. Test accuracy was then calculated on the 32 independent test subjects using the saved classifier models via a voting procedure: the classifier model obtained in each training iteration cast a vote towards the decision on each test subject, and accuracy was the percentage of correct votes. The classification procedure was identical for the simulated and experimental data.
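A minimal sketch of this cross-validate-then-vote procedure, with placeholder data and a linear SVM standing in for the array of classifiers actually used; the 10 × 10 repeats × folds setting is illustrative and not taken from the abstract:

# Cross-validation on training data, saving every fitted model, then majority
# voting of the saved models on the independent test subjects.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((100, 655)), np.repeat([0, 1], 50)   # placeholder features/labels
X_test,  y_test  = rng.standard_normal((32, 655)),  np.repeat([0, 1], 16)   # independent test set

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
models, cv_scores = [], []
for tr, va in cv.split(X_train, y_train):
    clf = SVC(kernel='linear').fit(X_train[tr], y_train[tr])
    cv_scores.append(clf.score(X_train[va], y_train[va]))   # cross-validation accuracy
    models.append(clf)                                       # model saved for later voting

# Every saved model votes on each independent test subject; the majority vote is the prediction.
votes = np.array([m.predict(X_test) for m in models])
pred = np.array([np.bincount(v).argmax() for v in votes.T])
print("cross-validation accuracy:", np.mean(cv_scores), "independent test accuracy:", np.mean(pred == y_test))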
RESULTS & DISCUSSION
Results for the simulated and experimental fMRI data are summarized in Fig-4 and plotted in Fig-5. Accuracy obtained from the simulated data was close to 100% for both training/validation and test data at both smaller and larger SDs (except for a few classifiers whose performance degraded with larger SDs), thus validating the classifiers used. The results for the fMRI data indicate that, in disease classification tasks, the actual predictive performance obtained from independent test data is significantly lower than the cross-validation accuracy obtained from the training data. Since we used a voting procedure over the saved models, a test subject would be assigned to the correct class if the majority of the classifier models predicted it accurately. Despite this liberal strategy, performance was only slightly above chance (>25%) for most classifiers on the fMRI data. The performance measures obtained from a small sample did not generalize to a population with the same symptoms but from a different age group. Moreover, using several levels of cross-validation reduces the amount of data available at each level, making feature selection and performance estimation unreliable in smaller datasets. Given the variability in fMRI data and the possibility of selection bias, it is better to use a completely independent test set, rather than cross-validation, to infer the predictive power of a classifier for smaller datasets [7]. Given that cross-validation is widely used in neuroimaging to assess classification performance, our results demonstrate the perils of such a strategy. Validation using independent test data is the gold standard in many other scientific fields, and we urge the neuroimaging community to adopt the same standard.
Acknowledgements
No acknowledgement found.
References
1. Lee M.H. et al. Resting-state fMRI: a review of methods and clinical applications. AJNR 2013; 34: 1866-1872.
2. Demirci O. et al. A review of challenges in the use of fMRI for disease classification/characterization and a projection pursuit application from a multi-site fMRI schizophrenia study. Brain Imaging and Behavior 2008; 2(3): 207-226.
3. Yan C. and Zang Y. DPARSF: a MATLAB toolbox for "pipeline" data analysis of resting-state fMRI. Front. Syst. Neurosci. 2010; 4: 13.
4. Craddock R.C. et al. A whole brain fMRI atlas generated via spatially constrained spectral clustering. Hum. Brain Mapp. 2012; 33(8): 1914-1928.
5. Deshpande G. et al. Recursive cluster elimination based support vector machine for disease state prediction using resting state functional and effective brain connectivity. PLoS One 2010; 5(12): e14277.
6. Yamashita O. et al. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. Neuroimage 2008; 42(4): 1414-1429.
7. Isaksson A. et al. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters 2008; 29(14): 1960-1965.