Perils in the Use of Cross-validation for Performance Estimation in Neuroimaging-based Diagnostic Classification
Pradyumna Lanka1, D Rangaprakash1, and Gopikrishna Deshpande1,2,3

1AU MRI Research Center, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL, United States, 2Department of Psychology, Auburn University, Auburn, AL, United States, 3Alabama Advanced Imaging Consortium, Auburn University and University of Alabama, Birmingham, AL, United States


In this study, we highlight that cross-validation accuracy may not be a good measure of performance in neuroimaging-based diagnostic classification, especially at the small sample sizes typically encountered in neuroimaging. We trained an array of classifiers using resting-state fMRI-based functional connectivity measures from subjects in a particular age group using cross-validation, and then tested them on an independent set of subjects with the same diagnoses (mild cognitive impairment and Alzheimer’s disease) but from a different age group. We demonstrate that cross-validation accuracy can give an inflated estimate of the true performance of the classifiers.


Resting-state functional connectivity magnetic resonance imaging (rs-fcMRI) models the interactions between brain regions. These interactions are sensitive to disease states; hence, machine learning (ML) algorithms have been used to classify mental/neurological disorders based on rs-fcMRI1. There has been considerable interest in using ML to develop diagnostic tools and biomarkers. Unfortunately, owing to factors such as subjective diagnostic criteria, the heterogeneity of clinical populations, the small sample sizes of neuroimaging data, and the complexity of the underlying altered brain networks, classification models are less reliable and accurate2. Given the large variation in accuracy estimates and the difficulty of generalizing classification results to larger populations, broad conclusions about the validity of ML classifiers for disease classification based on rs-fcMRI metrics are premature. Given such uncertainties, we used a large array of popular ML classifiers with functional connectivity (FC) features to investigate whether the commonly used cross-validation accuracy reflects the true predictive performance of a classifier, by also measuring accuracy on a truly independent test dataset that was used in neither training nor cross-validation.


We first used simulated data to validate the classifiers against known ground truth. We simulated 1500 normally distributed features with means of 0.2, 0.5, and 0.8; the standard deviation (SD) of each feature was incremented from 0.1 to 0.8 in steps of 0.1 to test the classifiers’ robustness to noise. Rs-fcMRI data from 132 subjects were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database: 100 subjects were used for training/validation, and 32 subjects were kept entirely aside for independent diagnosis/testing. Note that the training/validation and testing data contained subjects from different age groups, in order to test the generalizability of the classifiers (Fig-1). A standard preprocessing pipeline for rs-fcMRI data was implemented using the Data Processing Assistant for Resting-State fMRI toolbox (DPARSF)3. Mean time series were extracted from 200 functionally homogeneous brain regions (CC200 template4), and FC was calculated for all region pairs. An initial “feature-filtering” step retained only the connectivity paths that were significantly different between the groups (p<0.05, FDR corrected) in the training/validation data (after controlling for possible confounding factors), reducing the number of features from 19,100 to 655. No statistical tests were performed on the independent test data, to avoid introducing any bias. To further reduce the number of features while retaining discriminative information, some of the classifiers were embedded in the recursive cluster elimination (RCE) framework5. The classifiers used (Fig-4) were implemented both within (Fig-2) and outside (Fig-3) the RCE framework. Using FC from the 100 training subjects, classification accuracy was calculated using cross-validation, and the classifier models obtained from the different partitionings of the training data (repeats×folds) were saved.
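The feature-filtering step described above can be sketched as follows. This is a minimal illustration, not the study’s actual code: the group comparison is shown as a two-sample t-test with Benjamini-Hochberg FDR correction (implemented inline so the sketch is self-contained), the toy data dimensions are arbitrary, and the confound-regression step is omitted. Crucially, the mask is computed from training data only and then merely applied to the held-out test set.

```python
import numpy as np
from scipy.stats import ttest_ind

def fdr_bh(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up FDR procedure; returns a boolean
    mask of p-values that survive correction."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, n + 1) / n
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

def filter_features(X_train, y_train, alpha=0.05):
    """Keep only features that differ significantly between the two
    training groups (FDR-corrected t-tests). Statistics use training
    data only, so the independent test set introduces no bias."""
    _, pvals = ttest_ind(X_train[y_train == 0], X_train[y_train == 1], axis=0)
    return fdr_bh(pvals, alpha=alpha)

# toy example: 100 training subjects, 500 features, first 20 discriminative
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
X = rng.normal(0.0, 1.0, (100, 500))
X[y == 1, :20] += 1.5           # inject a group difference
mask = filter_features(X, y)     # apply the same mask to test data later
```

The same boolean mask would then index the columns of the independent test subjects’ FC matrix, without re-running any statistics on them.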
Test accuracy was calculated on the 32 independent test subjects using the saved classifier models via a voting procedure: each classifier model obtained during training voted towards a decision on each test subject (accuracy was the percentage of correct votes). The classification procedure was identical for the simulated and experimental data.
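A minimal sketch of this save-and-vote scheme is shown below. The classifier (logistic regression), the synthetic features from make_classification, and the 5-fold × 10-repeat partitioning are stand-ins chosen for illustration; the study used its own array of classifiers and repeats×folds settings on FC features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# stand-in for FC features: 132 "subjects", 100 train + 32 held out
X, y = make_classification(n_samples=132, n_features=50,
                           n_informative=10, random_state=0)
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# train one model per (repeat, fold) partition and save all of them
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
models, cv_scores = [], []
for tr, va in cv.split(X_train, y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train[tr], y_train[tr])
    cv_scores.append(clf.score(X_train[va], y_train[va]))
    models.append(clf)

# every saved model votes on every independent test subject
votes = np.stack([m.predict(X_test) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)

cv_accuracy = float(np.mean(cv_scores))        # reported by cross-validation
test_accuracy = float(np.mean(majority == y_test))  # truly independent estimate
```

Comparing cv_accuracy with test_accuracy on real, heterogeneous data is exactly the contrast the study draws; on well-separated synthetic data the two tend to agree.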


Results for the simulated and experimental fMRI data are summarized in Fig-4 and plotted in Fig-5. Accuracy on the simulated data was close to 100% for both training/validation and test data at both smaller and larger SDs (except for a few classifiers whose performance degraded at larger SDs), validating the classifiers used. Results for the fMRI data indicate that, in disease classification tasks, the actual predictive performance obtained from independent test data is significantly lower than the cross-validation accuracy obtained from training data. Since we used a voting procedure on the test data with the saved models, the test data would be assigned to the correct class if the majority of the classifier models predicted accurately. Despite this liberal strategy, performance was only slightly above chance (>25%) for most classifiers on the fMRI data. The performance measures obtained from a small sample did not generalize to a population with the same symptoms but from a different age group. In addition, using several cross-validation levels reduces the amount of data available at each level, making feature selection and performance estimation unreliable in smaller datasets. Given the variability in fMRI data and the possibility of selection bias, it is better to use a completely independent test set rather than cross-validation to infer the predictive power of a classifier for smaller datasets7. Given that cross-validation is widely used in neuroimaging to assess classification performance, our results demonstrate the perils of such a strategy. Validation using independent test data is the gold standard in many other scientific fields, and we urge the neuroimaging community to adopt the same standard.




1. Lee M.H. et al, Resting-State fMRI: A Review of Methods and Clinical Applications, AJNR 2013; 34: 1866-1872.

2. Demirci O. et al, A Review of Challenges in the Use of fMRI for Disease Classification / Characterization and A Projection Pursuit Application from A Multi-site fMRI Schizophrenia Study, Brain Imaging and Behavior 2008; 2(3): 207-226.

3. Yan C and Zang Y, DPARSF: a MATLAB toolbox for “pipeline” data analysis of resting-state fMRI. Front. Syst. Neurosci., 2010; 4:13.

4. Craddock R.C. et al, A whole brain fMRI atlas generated via spatially constrained spectral clustering, Hum. Brain Mapp. 2012; 33(8):1914-1928.

5. Deshpande G. et al, Recursive cluster elimination based support vector machine for disease state prediction using resting state functional and effective brain connectivity, PLoS One 2010; 5(12): e14277.

6. Yamashita O. et al, Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns, Neuroimage 2008; 42(4):1414-1429.

7. Isaksson A. et al, Cross-validation and bootstrapping are unreliable in small sample classification, Pattern Recognition Letters 2008; 29(14): 1960-1965.


Fig-1: Diagnostic composition of the sample. Note that the training and testing data had subjects from different age ranges in order to test the generalizability of classifiers.

Fig-2: The flowchart depicting the RCE framework for feature reduction and performance estimation.

Fig-3: The flowchart depicting the classification procedure for classifiers implemented outside the RCE framework. We used two-level cross-validation for parameter optimization and performance estimation on the training/validation data.

Fig-4: (Top) Cross-validation and test accuracy on the simulated dataset as the SD in the features is increased from 0.1 to 0.8. (Bottom) The overall cross-validation and test accuracy for ADNI data, along with the total accuracy for each subgroup.

Fig-5: A: Plot illustrating the cross-validation accuracy (Top) and test accuracy (Bottom) for several classifiers as a function of the SD for the simulated data. The cross-validation and test accuracies for most classifiers were 100%. B: Plot illustrating the cross-validation accuracy (Top) and test accuracy (Bottom) for the ADNI dataset.

Proc. Intl. Soc. Mag. Reson. Med. 24 (2016)