3757

What reference for reference ranges? How scanner and subject data heterogeneity impact MR hippocampal volumetry statistics in Alzheimer’s Disease

Jonas Richiardi¹,², Bénédicte Maréchal¹,²,³, Alexis Roche⁴, Reto Meuli¹, and Tobias Kober¹,²,³

¹Department of Radiology, CHUV, Lausanne, Switzerland, ²Advanced Clinical Imaging Technology, Siemens Healthineers, Lausanne, Switzerland, ³LTS5, EPFL, Lausanne, Switzerland, ⁴CoVii Ltd, Porto, Portugal

Synopsis

The volume of specific brain structures is of clinical interest in many brain diseases. By using a volumetric reference range for healthy subjects, radiologists can contribute to refining diagnosis. However, both scanner and subject characteristics impact the construction and use of these reference ranges. Using a diverse dataset with 80 MRI scanners and 302 subjects, we show that Alzheimer’s disease detection from hippocampal volume is largely robust to mismatch between training (development) and testing (deployment) environments, although the mismatch has a measurable influence, whereas estimates of atrophy rates can vary considerably depending on the training set used. Radiologists should interpret volumetry statistical results accordingly.

Introduction

Hippocampal volumetry provides a well-recognized marker for Alzheimer’s disease (AD), correlating with other biomarkers and clinical symptoms¹. By using reference ranges for healthy subjects, radiologists can contribute to refining diagnosis. However, the impact of the distributional characteristics of the data used to build and evaluate reference ranges is not clear: how much variability in estimates and predictions can we expect just by changing the training and testing datasets? Here, we investigate the consequences of scanner and subject sample variability.

Materials and Methods

We used 563 ADNI scans from 228 healthy controls (HC) and 74 AD patients, acquired on 80 MRI scanners (Figure 1), including subjects with cross-sectional and longitudinal (up to 4 measurements) data. We computed bilateral hippocampal volume (HV) and total intracranial volume (TIV) with the MorphoBox prototype², and normalized bilateral HV as $NHV=\frac{HV}{TIV}$. Scanner identity was defined by the unique combination of site, vendor, system, and head coil.
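The two derived quantities described above can be illustrated with a minimal Python sketch: normalizing HV by TIV, and building a scanner identifier from the combination of site, vendor, system, and head coil. The field names and example values are hypothetical and are not taken from ADNI or MorphoBox.

```python
from typing import NamedTuple


class Scan(NamedTuple):
    # All field names are illustrative assumptions, not ADNI column names.
    site: str
    vendor: str
    system: str
    head_coil: str
    hv_ml: float   # bilateral hippocampal volume, millilitres
    tiv_ml: float  # total intracranial volume, millilitres


def normalized_hv(scan: Scan) -> float:
    """Return NHV = HV / TIV (dimensionless)."""
    if scan.tiv_ml <= 0:
        raise ValueError("TIV must be positive")
    return scan.hv_ml / scan.tiv_ml


def scanner_id(scan: Scan) -> tuple[str, str, str, str]:
    """Scanner identity as the (site, vendor, system, head coil) combination."""
    return (scan.site, scan.vendor, scan.system, scan.head_coil)


# Made-up values for illustration only.
scans = [
    Scan("site01", "VendorA", "SystemX", "8ch", 7.0, 1400.0),
    Scan("site01", "VendorA", "SystemX", "8ch", 6.3, 1500.0),
    Scan("site01", "VendorA", "SystemX", "32ch", 7.5, 1500.0),
]
# Two scans share a scanner; the third differs only by head coil.
n_scanners = len({scanner_id(s) for s in scans})
```

Under this definition, changing only the head coil at a site counts as a different scanner, which is why the toy example contains two scanners rather than one.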

Experiment 1: Training variability and atrophy rates

We formed 1000 training subsamples by randomly picking 1/2 of the scanners from the full dataset without replacement. For each subsample, we trained the mixed-effects model $NHV = 1+Age+Sex+Diagnosis+(1|ScannerID/SubjectID)$, which accounts for repeated measurements for some subjects.
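The scanner-level resampling can be sketched as follows. This is a minimal illustration that assumes each scan is stored as a hypothetical `(scanner_id, payload)` pair; it covers only the subsampling step, not the mixed-effects model fit, and the function names are not from the original work.

```python
import random


def subsample_scanners(scanner_ids, fraction, rng):
    """Pick a fraction of the unique scanners without replacement."""
    unique_ids = sorted(set(scanner_ids))
    k = round(len(unique_ids) * fraction)
    return set(rng.sample(unique_ids, k))


def training_subsamples(scans, n_subsamples, fraction=0.5, seed=0):
    """Yield lists of scans restricted to a random subset of scanners.

    `scans` is a list of (scanner_id, payload) pairs; this layout is an
    assumption made for the sketch.
    """
    rng = random.Random(seed)
    ids = [sid for sid, _ in scans]
    for _ in range(n_subsamples):
        keep = subsample_scanners(ids, fraction, rng)
        # All scans from a selected scanner are kept together.
        yield [scan for scan in scans if scan[0] in keep]


# Toy data: 80 hypothetical scanners with two scans each.
toy_scans = [(f"scanner{i:02d}", j) for i in range(80) for j in range(2)]
subsamples = list(training_subsamples(toy_scans, n_subsamples=5))
```

Sampling whole scanners, rather than individual scans, keeps every scan from a selected scanner in the same subsample, so the number of subjects varies from one subsample to the next.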

We then assessed annual atrophy rates (ΔNHV) for HC and AD.

Experiment 2: Testing variability and discriminative performance

We formed a single training sample X_tr by randomly sampling 1/3 of the scanners from the full dataset without replacement, assigning the rest to a testing sample X_te (no subject overlap). The same model as in experiment 1 was trained on X_tr. To assess sampling-related variations in discrimination, we formed S=1000 test subsamples X_te,s, each sampling 1/2 of X_te without replacement. For each image X_te,s,n, an imaging marker $f(X_{te,s,n}) = NHV_n-NHV_{predicted}$ was obtained, with the prediction computed by setting Diagnosis to HC. For each X_te,s we computed the Area Under the Curve (AUC) of a ROC curve on f(X_te,s,n) stratified by diagnosis.
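A minimal sketch of the imaging marker and its AUC is given below. It assumes the HC-based predictions are already available, and it computes the ROC area through the pairwise (Mann-Whitney) formulation, treating lower residuals as indicating AD. The function names and numbers are illustrative and not part of the original pipeline.

```python
def imaging_marker(nhv_observed, nhv_predicted_hc):
    """Residual f = NHV_n - NHV_predicted, with the prediction made as if HC."""
    return nhv_observed - nhv_predicted_hc


def auc_lower_is_positive(markers_ad, markers_hc):
    """AUC as the probability that an AD marker is below an HC marker.

    Ties count one half (Mann-Whitney formulation of the ROC area).
    """
    if not markers_ad or not markers_hc:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in markers_ad:
        for h in markers_hc:
            if a < h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(markers_ad) * len(markers_hc))


# Made-up (observed, predicted) NHV pairs for illustration only.
hc_pairs = [(0.0051, 0.0050), (0.0049, 0.0050), (0.0050, 0.0050)]
ad_pairs = [(0.0040, 0.0050), (0.0046, 0.0049)]
hc = [imaging_marker(o, p) for o, p in hc_pairs]
ad = [imaging_marker(o, p) for o, p in ad_pairs]
auc = auc_lower_is_positive(ad, hc)
```

The pairwise loop is quadratic in the number of scans, which is acceptable at this sample size; a rank-based implementation would be preferable for larger test sets.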

We measured mismatch between X_tr and each X_te,s by standardized mean differences in scanner characteristics (Vendor; Field strength; Number of coils; Voxel size; CNR_GC, the grey matter-CSF Contrast-to-Noise Ratio, squared and divided by voxel size in order to decorrelate it from Voxel size) and subject characteristics (Sex, Age, Diagnosis). We tested for associations between mismatch and discrimination performance by linearly regressing the AUC on the standardized mean differences of these variables, ignoring the non-independence of the subsamples.

We repeated experiment 2 with a confound-corrected model including Voxel size and CNR_GC.
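As an illustration of the mismatch measure, the sketch below computes a standardized mean difference for one variable between a training and a test sample. The pooled standard deviation used here, sqrt((var_train + var_test) / 2), is one common convention and is an assumption; the exact variant used in this work is not stated.

```python
import math
import statistics


def standardized_mean_difference(train_values, test_values):
    """SMD = (mean_train - mean_test) / pooled SD.

    Pooled SD is sqrt((var_train + var_test) / 2), one common convention;
    this choice is an assumption of the sketch.
    """
    mean_diff = statistics.fmean(train_values) - statistics.fmean(test_values)
    pooled_sd = math.sqrt(
        (statistics.variance(train_values) + statistics.variance(test_values)) / 2
    )
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    return mean_diff / pooled_sd


# Made-up ages (years) for a training and a test sample.
age_train = [70.0, 72.0, 74.0]
age_test = [73.0, 75.0, 77.0]
smd_age = standardized_mean_difference(age_train, age_test)
```

Binary variables such as Sex or Diagnosis can be passed through the same function after coding them as 0 and 1.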

Results

Experiment 1

Over 1000 random resamplings of 40 scanners (median N(HC)=210, median N(AD)=76), estimated ΔNHV for HC had a median of -0.19% (IQR -0.28% to -0.13%), while estimates for AD had a median of -0.76% (IQR -0.88% to -0.66%) (Figure 2). These rates are slightly lower than, but consistent with, the literature, and the larger variance of AD rates was previously reported¹.

Experiment 2

On the training sample X_tr, the estimated ΔNHV was -0.20% (HC) and -0.79% (AD). Scanner random effects had a standard deviation roughly 4 times smaller than that of subject random effects (Figure 3).

The 1000 X_te,s each had 26 scanners, with a median N(HC) of 138 (IQR 127-149) and a median N(AD) of 52 (IQR 44-60). All subsets gave good discriminative performance with a median AUC of 0.92 (IQR 0.91-0.93) (Figure 4). Note that this figure is in terms of scans, not individual subjects, and that there were repeated measurements for some subjects.

All measures of subject differences between X_tr and X_te,s, plus differences in CNR_GC, Voxel size, and Vendor, were significantly associated with AUC, although together they explained only 23.4% of its variance. With the confound-corrected model, the variance explained dropped to 15%, and the AUC was slightly lower (median 0.9, IQR 0.89-0.92).

Conclusions

We showed that ΔNHV estimates fluctuate with dataset variability and should be interpreted cautiously. In practice, however, AD detection from NHV performs relatively well and is stable, even though scanner and subject differences explain some of the variation in performance.

This is promising since volume estimates from new hardware could be compared to reference ranges without unduly affecting diagnostic accuracy.

References

1 Frisoni, G. B., Fox, N. C., Jack, C. R., Jr., Scheltens, P. & Thompson, P. M. The clinical use of structural MRI in Alzheimer disease. Nat Rev Neurol 6, 67-77, doi:10.1038/nrneurol.2009.215 (2010).

2 Schmitter, D. et al. An evaluation of volume-based morphometry for prediction of mild cognitive impairment and Alzheimer's disease. Neuroimage Clin 7, 7-17, doi:10.1016/j.nicl.2014.11.001 (2015).

3 Jack, C. R., Jr. et al. Update on the magnetic resonance imaging core of the Alzheimer's disease neuroimaging initiative. Alzheimers Dement 6, 212-220, doi:10.1016/j.jalz.2010.03.004 (2010).

4 ADNI. ADNI MRI protocols, <http://adni.loni.usc.edu/methods/documents/mri-protocols/>

5 Nakagawa, S., Schielzeth, H. & O'Hara, R. B. A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution 4, 133-142, doi:10.1111/j.2041-210x.2012.00261.x (2013).

Figures

Figure 1: Scanner and subject characteristics (whole-sample). Structural imaging followed the ADNI-2 MP-RAGE protocol³,⁴.

Figure 2: Experiment 1: Distribution of atrophy rate (ΔNHV) estimates depending on random resamplings of the training data. Red: AD. Green: HC. Median estimates: -0.19 % (HC) and -0.76 % (AD). In most cases, AD atrophy rates are higher than HC rates, showing a certain robustness to sampling differences, although exact rates can vary significantly from dataset to dataset.

Figure 3: Experiment 2: Reference range random effects model on the training set. This model passed regression diagnostics, including for random effects, and had marginal and conditional R-squared values⁵ of 0.53 and 0.95 respectively. The thick black line is the population-level fit, while each of the coloured lines represents the fit for one of the 27 scanners. Note that the scanner variation captured by the random-effects model is much smaller than the range of subject variation. Atrophy rates were estimated at -0.20% (HC) and -0.79% (AD), in line with the random resampling results of experiment 1.

Figure 4: Experiment 2: Distribution of Area Under Curve of a Receiver Operating Characteristic (ROC) curve for the discrimination of healthy controls and Alzheimer patients based on the imaging marker (distance from reference range prediction for normalised hippocampal volume, corrected for age and sex). This prediction residual is more negative for atrophied subjects farther from the reference range model. Note that relatively good discriminative performance is maintained throughout the random subsamplings, and that most test sets contain repeated scans for the same subject.

Proc. Intl. Soc. Mag. Reson. Med. 26 (2018)
