3631

Prostate Cancer Diagnosis Using an Explainable Credibility Estimation Network Incorporating a Rejection Mechanism

Rong Wei¹, Yu Xia¹, Yi Zhu², Jinyu Yang¹, Ge Gao³, Xiaoying Wang³, Jue Zhang¹, and Jianxiu Lian²
¹Peking University, Beijing, China, ²Philips Healthcare, Beijing, Beijing, China, ³Peking University First Hospital, Beijing, China

Synopsis

Keywords: Prostate

Motivation: Prostate cancer diagnosis would benefit from a better understanding of lesion characteristics and from fewer false positives.

Goal(s): To create a pioneering integrated system using deep learning, capable of accurately assessing the benignity or malignancy of prostate MRI images, whilst reducing labeling costs and enhancing the reliability of classifications.

Approach: The approach involves training a convolutional network with multi-parametric MRI images, incorporating credibility analysis to provide visually interpretable prostate cancer prediction results and reject low-credibility predictions.

Results: Rejecting low-credibility predictions improved reliability and efficacy (AUC 0.93 vs. 0.87 for the original VGG-16 network, P<0.05), mitigating potential risks associated with prediction failures.

Impact: This study equips clinical practitioners with the ability to comprehend the decision-making process of the CAD system and manage the output results through an intuitive display. This may enhance diagnostic accuracy, potentially impacting clinicians' decision-making and patient outcomes.

Introduction

Prostate cancer (PCa) is the second most common malignant tumor in men worldwide, with an estimated 1.4 million new cases diagnosed in 2020 [1]. Deep learning models have made significant progress in the field of prostate Magnetic Resonance (MR) computer-aided diagnosis (CAD) systems [2] [3]. However, such image-level classification makes it challenging to understand the general characteristics of the lesion, leading to several false positives. More crucially, the cancer target area must be precisely described in the segmentation labels for these bottom-up detection algorithms, which substantially drives up the labeling cost.
In this study, we introduce a pioneering system capable of evaluating the benignity or malignancy of PCa multiparametric MRI (mp-MRI) images as a single integrated unit, effectively reducing labeling costs. More importantly, our model provides a visually interpretable basis and a credibility analysis of the results, significantly boosting the reliability of the classification.

Methods

A cohort of 163 patients (from 2013 to 2016), both with and without PCa, was selected for the study. Images were obtained using T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) imaging on a 3.0 Tesla magnetic resonance scanner (Ingenia; Philips Healthcare), following standardized protocols.
These images were then subjected to preprocessing steps, including segmentation and normalization. Subsequently, the preprocessed DWI, T2WI, and ADC images were amalgamated into three-channel image groups. These groups were used to train a VGG-16 network [4] as shown in Figure 1, with the aid of transfer learning. Gradient-weighted Class Activation Mapping (Grad-CAM) [5] and Monte Carlo (MC) Dropout [6] techniques were employed to interpret the output of our classification network and to assess the credibility of the model. Outputs with low credibility, as defined by a predetermined credibility index threshold, were rejected.
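The MC Dropout step can be illustrated with a minimal sketch. It assumes a toy logistic classifier in place of the VGG-16 network; the function names, weights, input features and dropout rate of 0.5 are all hypothetical. Only the idea of keeping dropout active at test time and the 50 repeated passes (see the Figure 1 caption) come from this work.

```python
import math
import random


def stochastic_forward(features, weights, bias, drop_rate, rng):
    """One forward pass of a toy logistic classifier with dropout kept active."""
    keep = 1.0 - drop_rate
    total = bias
    for x, w in zip(features, weights):
        # Inverted dropout: keep each input with probability `keep`
        # and rescale the survivors so the expected sum is unchanged.
        if rng.random() < keep:
            total += w * x / keep
    return 1.0 / (1.0 + math.exp(-total))


def mc_dropout_votes(features, weights, bias, n_passes=50, drop_rate=0.5, seed=0):
    """Repeat the stochastic pass and binarize each output into a 0/1 vote."""
    rng = random.Random(seed)
    return [
        1 if stochastic_forward(features, weights, bias, drop_rate, rng) >= 0.5 else 0
        for _ in range(n_passes)
    ]


# Hypothetical feature vector and weights, for illustration only.
votes = mc_dropout_votes([1.2, -0.4, 0.9], [2.0, 1.0, 1.5], bias=-1.0)
print(len(votes), sum(votes) / len(votes))
```

Each pass samples a different sub-network, so the spread of the votes reflects how strongly the prediction depends on which units are dropped.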
In the most extreme binary classification scenario, half of the networks sampled by MC Dropout predict 0 while the other half predict 1, resulting in the maximum variance of 0.25. Given this, we can express the credibility index C as: $$$C=1-\frac{D}{0.25}$$$, where D signifies the variance of the model's predictions, so that C ranges from 0 (evenly split predictions) to 1 (unanimous predictions). The mean value is then computed pixel by pixel for the activation map produced by the model, creating the final credibility estimation map. To enhance the model's reliability, prediction results that fall below the established credibility threshold are discarded.
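As a worked illustration of the credibility index, the sketch below assumes that each MC Dropout pass contributes a binary (0/1) vote and that D is the population variance of those votes, which for a fraction p of votes equal to 1 is p(1 - p). The function name is hypothetical.

```python
def credibility_index(votes):
    """Return C = 1 - D / 0.25 for a list of binary (0/1) votes."""
    if not votes:
        raise ValueError("at least one vote is required")
    p = sum(votes) / len(votes)   # fraction of passes voting 1
    variance = p * (1.0 - p)      # population variance D of 0/1 votes
    return 1.0 - variance / 0.25


print(credibility_index([1] * 50))             # unanimous votes
print(credibility_index([1] * 25 + [0] * 25))  # evenly split votes
print(credibility_index([1] * 45 + [0] * 5))   # 45 of 50 votes agree
```

Because p(1 - p) peaks at p = 0.5, dividing by 0.25 normalizes C to the range from 0 to 1.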
To evaluate the model's performance, receiver operating characteristic (ROC) curves were plotted and metrics such as area under the curve (AUC), false positive rate (FPR), and negative predictive value (NPV) were computed.
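For reference, these metrics can be computed as in the following sketch. The labels and scores are made-up illustrative values, not data from this study, and the helper names and the 0.5 decision cutoff are assumptions. AUC is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 1:
            fp += 1
        elif y == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(labels, predictions)
fpr = fp / (fp + tn)   # false positive rate
npv = tn / (tn + fn)   # negative predictive value
print(fpr, npv, auc(labels, scores))
```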

Results

A representative set of high-credibility prediction results is shown in Figure 2, while Figure 3 displays a set of visualization results with low credibility. Figure 2 shows that the activation regions of the credibility estimation network (CEN) match the lesion regions annotated by the radiologists. This suggests that the network can learn the main lesion features from image-level prostate cancer classification labels alone, thus substantially reducing the radiologists' labeling burden. Even though the model predicts some high feature activation areas in Figure 3, the credibility map of these regions is not activated, demonstrating that the model is unsure whether these regions contain PCa. Therefore, such results need to be rejected. In fact, this is an image of a healthy prostate that was misclassified, and such errors can be effectively eliminated by rejecting low-credibility results from the model.
In the validation stage, our model yielded optimal results with a credibility index threshold set at 0.80. In the test stage, 280 images from 50 patients in the test dataset were classified, and the output results were accepted or rejected according to the calculated credibility index and the set threshold value. As demonstrated in Table 1, the model mainly dismissed images leading to false positives. More notably, there are no false-negative samples in the rejected set, which is of great significance for clinical diagnosis. Overall, there was a notable improvement in the AUC value of the classification network incorporating a rejection function, compared to the original VGG-16 classification network (0.93 vs. 0.87, P<0.05). This highlights the enhanced reliability and efficacy of our method.
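The accept/reject step and the per-set error counts summarized in Table 1 can be sketched as follows. The records are hypothetical examples, not the study's data; the field names and helper functions are assumptions, as is the choice to accept a prediction whose credibility index exactly equals the threshold.

```python
THRESHOLD = 0.80  # credibility index threshold reported in the text


def split_by_credibility(records, threshold=THRESHOLD):
    """Accept predictions at or above the threshold, reject the rest."""
    accepted = [r for r in records if r["credibility"] >= threshold]
    rejected = [r for r in records if r["credibility"] < threshold]
    return accepted, rejected


def error_counts(records):
    """Count false positives and false negatives in a set of predictions."""
    fp = sum(1 for r in records if r["label"] == 0 and r["prediction"] == 1)
    fn = sum(1 for r in records if r["label"] == 1 and r["prediction"] == 0)
    return {"n": len(records), "false_positives": fp, "false_negatives": fn}


# Hypothetical per-image results (label 1 = PCa, 0 = no PCa).
records = [
    {"label": 1, "prediction": 1, "credibility": 1.00},
    {"label": 0, "prediction": 0, "credibility": 0.92},
    {"label": 0, "prediction": 1, "credibility": 0.36},
    {"label": 1, "prediction": 1, "credibility": 0.85},
    {"label": 0, "prediction": 1, "credibility": 0.64},
    {"label": 0, "prediction": 0, "credibility": 0.77},
]

accepted, rejected = split_by_credibility(records)
print("accepted:", error_counts(accepted))
print("rejected:", error_counts(rejected))
```

If the index is computed from 50 binary votes as C = 1 - D/0.25, a threshold of 0.80 corresponds to accepting a prediction only when at most 2 of the 50 votes dissent (C is about 0.85 with 2 dissenting votes and about 0.77 with 3).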

Conclusions

In conclusion, our explainable credibility estimation network, which includes a rejection option, provides physicians with a comprehensive understanding of the decision-making process and the ability to regulate output. Moreover, the proposed method leverages a credibility analysis technique to discard uncertain predictions, thereby mitigating the potential risks associated with prediction failures. Our proposed model deploys credibility analysis as a means of providing reliable and stable predictions that satisfy the rigorous safety standards of clinical settings.

Acknowledgements

No acknowledgement found.

References

[1] Rawla, P. (2019). Epidemiology of prostate cancer. World Journal of Oncology, 10(2), 63.

[2] Reda, I., Khalil, A., Elmogy, M., Abou El-Fetouh, A., Shalaby, A., Abou El-Ghar, M., ... & El-Baz, A. (2018). Deep learning role in early diagnosis of prostate cancer. Technology in Cancer Research & Treatment, 17, 1533034618775530.

[3] Abraham, B., & Nair, M. S. (2019). Automated grading of prostate cancer using convolutional neural network and ordinal class classifier. Informatics in Medicine Unlocked, 17, 100256.

[4] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[5] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 618-626).

[6] Gal, Y., & Ghahramani, Z. (2016, June). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050-1059). PMLR.

Figures

Figure 1 presents the network structure diagram proposed in this study. In comparison to the original VGG network, our study introduces MC Dropout in the testing phase (testing was repeated 50 times in the experiment), enabling the network to vote during this phase. Credibility in the model is obtained by measuring the variance in the votes. Finally, Grad-CAM is used to visualize the feature activation map and the credibility map.

Figure 2 presents three high-credibility instances using the proposed method. The top row exhibits the segmented prostate T2WI images. The middle row demonstrates the feature activation map, highlighting the regions the network utilized for determining benign or malignant conditions. The bottom row shows the proposed credibility map, unveiling the network's varying credibility levels for different regions.

Figure 3 depicts three low-credibility examples from the proposed method. The first row presents the segmented prostate T2WI images. The second row shows the feature activation map, which indicates the areas the network relies on for benign or malignant decisions. The third row features the credibility map proposed in this study, illustrating the network's varying levels of credibility across different regions.

Table 1 presents the false-negative and false-positive rates for both accepted and rejected samples. Samples exceeding the threshold of credibility are categorized as accepted, whereas those falling below are designated as rejected. These rejected samples are reserved for physicians to conduct a more thorough analysis.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

DOI: https://doi.org/10.58530/2024/3631