3512

Test-retest repeatability of convolutional neural networks in detecting prostate cancer regions on diffusion weighted imaging in 112 patients
Amogh Hiremath1, Rakesh Shiradkar1, Harri Merisaari1,2, Prateek Prasanna1, Otta Ettala3, Pekka Taimen4, Hannu J Aronen5, Peter J Boström3, Ivan Jambor2,6, and Anant Madabhushi1
1Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, United States, 2Department of Diagnostic Radiology, University of Turku and Turku University Hospital, Turku, Finland, 3Department of Urology, University of Turku and Turku University Hospital, Turku, Finland, 4Institute of Biomedicine, University of Turku and Department of Pathology, Turku University Hospital, Turku, Finland, 5Medical Imaging Centre of Southwest Finland, Turku University Hospital, Turku, Finland, 6Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, United States

Synopsis

We evaluated the short-term repeatability of convolutional neural networks (CNNs) in detecting prostate cancer (PCa) on DWI collected from patients who underwent same-day test-retest MRI scans. DWI was post-processed using a monoexponential fit (ADCm). Two models with the same architecture were trained on the test and retest scans, and the short-term repeatability of the network predictions was evaluated in terms of the intra-class correlation coefficient (ICC(3,1)). Although the observed ICC(3,1) was high for the CNN when optimized for classification performance, our results suggest that network optimization with respect to classification performance might not yield the best repeatability. Higher repeatability was observed at lower learning rates.

Introduction

Recently, several convolutional neural network (CNN) based deep learning (DL) methods have been developed for cancer detection and diagnosis1–3, especially for prostate cancer (PCa) detection on MRI1,4. However, these methods need to be assessed for test-retest repeatability to enable wide clinical use: high predictive power at a single time point combined with low test-retest repeatability may be misleading. Recent studies have applied test-retest reliability analysis to select stable radiomic features5,6 and have examined how different pre-processing schemes affect the repeatability of radiomic features7. The few studies that have looked at the repeatability of DL predictions8,9 trained only a single model and evaluated its repeatability on test-retest data. In this work, we evaluate the test-retest repeatability of CNN predictions on unseen data in the context of PCa detection by training two separate models on the test and retest data, respectively. Additionally, the hyper-parameters of the network (learning rates, batch sizes) were varied and analyzed for their effect on repeatability. We used AlexNet10 to evaluate the repeatability of CNNs, as it is a widely used architecture in the medical imaging community11–13.

Methods

One hundred and twelve patients with PCa scheduled for prostatectomy underwent two prostate 3T MR examinations (SA and SB) performed on the same day, approximately 15 minutes apart, following repositioning on the MR scanner table. Diffusion weighted imaging (DWI) was performed using a single-shot spin-echo-based sequence with a monopolar diffusion gradient and echo-planar readout, with acquisition parameters: TR/TE 3141/51 ms, 12 b-values (0, 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, 1900, 2000 s/mm2), acquisition voxel size 1.16×1.16×5.0 mm3, and acquisition time 8.48 minutes. A monoexponential fit (ADCm) of the DWI data was performed using two b-values (0 and 2000 s/mm2). A radiologist working in consensus with a pathologist annotated cancer regions on the MRI data, using whole-mount prostatectomy sections as ground truth. The test-retest ADCm maps (SA and SB) were divided into a training set (Atrain, Btrain; N=40), a validation set (Aval, Bval; N=38), and a test set (Atest, Btest; N=34).
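
With only two b-values, the monoexponential fit has a closed-form solution, ADCm = -ln(S(2000)/S(0))/2000. A minimal NumPy sketch of this two-point fit (illustrative only, with assumed variable names; not the fitting code used in the study):

import numpy as np

def adc_monoexponential(s_b0, s_bmax, b_max=2000.0):
    # Two-point monoexponential model: S(b) = S(0) * exp(-b * ADC)
    # s_b0, s_bmax: signal intensities at b = 0 and b = b_max (s/mm^2)
    eps = 1e-6  # guard against log(0) and division by zero
    ratio = np.clip(s_bmax / np.maximum(s_b0, eps), eps, None)
    return -np.log(ratio) / b_max  # ADCm in mm^2/s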

ADCm image slices containing a lesion delineation were labeled positive, while the remaining slices were considered negative samples. Images were cropped to contain only the prostate region and resized to 224×224 pixels. Augmentations such as random rotation, flipping, and cropping were applied to Atrain to increase the size of the training data. An AlexNet10 model NA was trained using the Adam optimizer and optimized for the highest Aval AUC across learning rates (1e-06, 5e-06, 1e-05, 5e-05) and batch sizes (16, 32, 64, 128, 256). Likewise, a second model NB was trained on Btrain with the same hyper-parameters as NA. The models NA and NB were compared in terms of AUC on Atest and Btest, respectively. ICC(3,1)14 was evaluated on the combined test set (Stest: Atest+Btest). Figure 1 shows the overall experimental setup for evaluating repeatability.
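
A minimal PyTorch/torchvision sketch of this training and hyper-parameter selection procedure is given below. It is illustrative only: the number of epochs, the helper names (validation_auc, select_model), and details such as the 2-class head are assumptions rather than the exact implementation.

import itertools
import torch
import torchvision
from sklearn.metrics import roc_auc_score

LEARNING_RATES = [1e-6, 5e-6, 1e-5, 5e-5]
BATCH_SIZES = [16, 32, 64, 128, 256]

def validation_auc(model, val_ds, device="cpu"):
    # Slice-level AUC of the positive-class probability on the validation set
    model.eval()
    loader = torch.utils.data.DataLoader(val_ds, batch_size=64)
    scores, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            p = torch.softmax(model(x.to(device)), dim=1)[:, 1]
            scores.extend(p.cpu().tolist())
            labels.extend(y.tolist())
    return roc_auc_score(labels, scores)

def train_alexnet(train_ds, lr, batch_size, epochs=50, device="cpu"):
    # AlexNet backbone with a 2-class head (lesion-bearing vs. benign slice);
    # assumes the dataset yields 3-channel 224x224 tensors and integer labels
    model = torchvision.models.alexnet(pretrained=True)
    model.classifier[6] = torch.nn.Linear(4096, 2)
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            optimizer.step()
    return model

def select_model(train_ds, val_ds):
    # Grid search: keep the model with the highest validation AUC
    # (as done for NA on Aval; NB reuses the selected hyper-parameters on Btrain)
    candidates = [train_alexnet(train_ds, lr, bs)
                  for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES)]
    return max(candidates, key=lambda m: validation_auc(m, val_ds))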

Results

The models NA and NB yielded AUCs (PCa detection) of 0.77 (95% confidence interval (CI): 0.74–0.80) and 0.78 (95% CI: 0.75–0.81) on Atest and Btest, respectively, with the optimized learning rate and batch size being 1e-05 and 256, respectively. The repeatability, ICC(3,1), between the NA and NB model predictions on Stest was 0.88 (95% CI14: 0.86–0.89). Figure 2 shows the ICC(3,1) trends of AlexNet evaluated on Stest over a range of learning rates and batch sizes. The ICC(3,1) trends for learning rates on Stest clearly show that repeatability increases as the learning rate is reduced. Additionally, an intermediate batch size of 64 yields slightly better repeatability than the smaller (16) and larger (256) batch sizes. The best repeatability was obtained between NA* and NB*, with a learning rate of 5e-05 and a batch size of 64, where the ICC(3,1) was 0.97 (95% CI: 0.96–0.98).
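
For reference, ICC(3,1) (two-way mixed effects, single measurement14) can be computed directly from the ANOVA mean squares of the paired prediction columns. A minimal NumPy sketch (illustrative; variable names are assumptions and the confidence-interval computation is omitted):

import numpy as np

def icc_3_1(ratings):
    # ratings: (n_targets, k_raters) array; here each row is a test slice in Stest
    # and the two columns are the prediction scores of NA and NB
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between targets
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between models
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# e.g. icc_3_1(np.column_stack([p1_scores, p2_scores])) for predictions P1 and P2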

Discussion

Lower learning rates may slow network training convergence, whereas larger learning rates may result in a sub-optimal solution or an unstable learning process. Similarly, very small batch sizes may lead to noisy gradient updates, whereas larger batch sizes may lead to poor generalization of the network15. In our study, we observed a similar trend in terms of repeatability, where larger learning rates led to lower ICC(3,1) scores and an intermediate batch size of 64 yielded slightly better repeatability than the smaller (16) and larger (256) batch sizes. Most previous studies have performed hyper-parameter tuning by optimizing the network for the best classification performance8,16. However, our work suggests that optimizing the network with respect to classification performance might not yield the best repeatability. The scatter plots in Figure 3 show that the predictions of models NA* and NB* (with the highest repeatability score) have lower variance than those of NA and NB (optimized for classification performance). Additionally, the activation maps in Figure 4 show that models NA* and NB* focus on identical regions when making decisions. We acknowledge that our work has limitations. The repeatability of CNNs was analyzed only on ADCm maps, and we report only slice-level rather than aggregated patient-level results, since all patients in the cohort had PCa with Gleason score ≥3+3. The repeatability analysis of other commonly used CNNs such as VGGNet, ResNet, and GoogLeNet, and of hyper-parameters such as batch normalization, dropout, activation functions, and optimizers, is left for future research.
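
The abstract does not state how the activation maps in Figure 4 were generated; the sketch below shows one plausible way to produce such maps with a Grad-CAM-style approach over the last convolutional layer of a torchvision AlexNet, and should not be read as the method actually used.

import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class=1):
    # Grad-CAM heat map from the last convolutional layer of torchvision's AlexNet;
    # x: a single preprocessed slice as a (1, 3, 224, 224) tensor
    feats, grads = [], []
    layer = model.features[10]  # last Conv2d in AlexNet's feature extractor
    h_fwd = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h_bwd = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.eval()
    model.zero_grad()
    model(x)[0, target_class].backward()  # gradient of the malignant-class score
    h_fwd.remove(); h_bwd.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)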

Conclusion

Even though the CNN (AlexNet) yielded a high ICC(3,1) of 0.88 when optimized for classification performance over a range of hyper-parameters, our results suggest that network optimization with respect to classification performance might not yield the best repeatability. Higher repeatability was observed at lower learning rates.

Acknowledgements

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award numbers 1U24CA199374-01, R01CA202752-01A1, R01CA208236-01A1, R01CA216579-01A1, R01CA220581-01A1, and 1U01CA239055-01; the National Institute for Biomedical Imaging and Bioengineering under award number 1R43EB028736-01; the National Center for Research Resources under award number 1 C06 RR12463-01; and VA Merit Review Award IBX004121A from the United States Department of Veterans Affairs Biomedical Laboratory Research and Development Service.

The DOD Prostate Cancer Idea Development Award (W81XWH-15-1-0558) and the DOD Peer Reviewed Cancer Research Program (W81XWH-16-1-0329).

The Ohio Third Frontier Technology Validation Fund, the Wallace H. Coulter Foundation Program in the Department of Biomedical Engineering, and the Clinical and Translational Science Award Program (CTSA) at Case Western Reserve University.

Funding from the Instrumentarium Science Foundation, the Sigrid Jusélius Foundation, Turku University Hospital, and TYKS-SAPA research funds was used to cover the cost of the MRI examinations.

References

1. Chen, Q. et al. A Transfer Learning Approach for Malignant Prostate Lesion Detection on Multiparametric MRI. Technol. Cancer Res. Treat. 18, 1533033819858363 (2019).

2. Sumathipala, Y. et al. Prostate cancer detection from multi-institution multiparametric MRIs using deep convolutional neural networks. J Med Imaging (Bellingham) 5, 044507 (2018).

3. Maicas, G., Bradley, A. P., Nascimento, J. C., Reid, I. & Carneiro, G. Pre and post-hoc diagnosis and interpretation of malignancy from breast DCE-MRI. Med Image Anal 58, 101562 (2019).

4. Aldoj, N., Lukas, S., Dewey, M. & Penzkofer, T. Semi-automatic classification of prostate cancer on multi-parametric MR imaging using a multi-channel 3D convolutional neural network. Eur Radiol (2019) doi:10.1007/s00330-019-06417-z.

5. van Timmeren, J. E. et al. Test–Retest Data for Radiomics Feature Stability Analysis: Generalizable or Study-Specific? Tomography 2, 361–365 (2016).

6. Gu, J. et al. Prediction of Immunohistochemistry of Suspected Thyroid Nodules by Use of Machine Learning-Based Radiomics. AJR Am J Roentgenol 1–10 (2019) doi:10.2214/AJR.19.21626.

7. Schwier, M. et al. Repeatability of Multiparametric Prostate MRI Radiomics Features. Sci Rep 9 (2019). https://www.nature.com/articles/s41598-019-45766-z.

8. Cole, J. H. et al. Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. Neuroimage 163, 115–124 (2017).

9. Hosny, A. et al. Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study. PLoS Med 15, (2018).

10. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 1097–1105 (Curran Associates Inc., 2012).

11. Maqsood, M. et al. Transfer Learning Assisted Classification and Detection of Alzheimer’s Disease Stages Using 3D MRI Scans. Sensors (Basel) 19, (2019).

12. Zhang, C. et al. A visual encoding model based on deep neural networks and transfer learning for brain activity measured by functional magnetic resonance imaging. J. Neurosci. Methods 325, 108318 (2019).

13. Wang, S.-H. et al. Alcoholism Identification Based on an AlexNet Transfer Learning Model. Front Psychiatry 10, 205 (2019).

14. Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 86, 420–428 (1979).

15. LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient BackProp. in Neural Networks: Tricks of the Trade: Second Edition (eds. Montavon, G., Orr, G. B. & Müller, K.-R.) 9–48 (Springer Berlin Heidelberg, 2012). doi:10.1007/978-3-642-35289-8_3.

16. Krois, J. et al. Deep Learning for the Radiographic Detection of Periodontal Bone Loss. Sci Rep 9, (2019).

Figures

Figure 1. Overall experimental design for evaluating the repeatability of AlexNet in detecting malignant prostate regions on ADCm maps. Two different models, NA and NB, were trained on the ADCm scans Atrain and Btrain, respectively, and evaluated on a combined test set, Stest (Atest + Btest). We combined the test sets Atest and Btest because the SA and SB scans were not co-registered with each other and there was no correspondence between the images. The predictions P1 and P2 made by NA and NB, respectively, on Stest are used to calculate the repeatability, ICC(3,1), of the CNN.

Figure 2. Trends of ICC(3,1) of AlexNet in detecting malignant regions on ADCm maps on Stest. Column 1 shows the trend of mean ICC(3,1) over learning rates, calculated across the range of batch sizes, while column 2 shows the trend of mean ICC(3,1) over batch sizes, calculated across the range of learning rates.

Figure 3. Scatter plots depicting the correlation (Pearson correlation coefficient r and ICC(3,1)) between AlexNet predictions in detecting malignant regions on prostate ADCm maps, trained on Atrain and Btrain and evaluated on Stest. Column 1 shows the scatter plots corresponding to the predictions of NA and NB with hyper-parameters optimized with respect to Aval AUC. Column 2 shows the scatter plots corresponding to the network instances with the highest ICC(3,1) score (NA* and NB*) over the range of learning rates (LR: 5e-5, 1e-5, 5e-6, 1e-6) and batch sizes (BS: 16, 32, 64, 128, 256).

Figure 4. Activation maps of AlexNet for NA* and NB* on (a,b) malignant regions and (c,d) benign regions on prostate ADCm maps. The activation maps show that the network's attention is directed towards darker ADCm regions for malignant slices, whereas for benign slices the attention is directed more towards brighter areas. We can also observe that NA* and NB* draw inferences from identical regions.

Proc. Intl. Soc. Mag. Reson. Med. 28 (2020)