
Comparison of whole-prostate radiomics models of disease severity derived from expert and AI based prostate segmentations
Paul E Summers1, Lars Johannes Isaksson2, Matteo Johannes Pepa2, Mattia Zaffaroni2, Maria Giulia Vincini2, Giulia Corrao2,3, Giovanni Carlo Mazzola2,3, Marco Rotondi2,3, Sara Raimondi4, Sara Gandini4, Stefania Volpe2,3, Zaharudin Haron5, Sarah Alessi1, Paola Pricolo1, Francesco Alessandro Mistretta6, Stefano Luzzago6, Federico Cattani7, Gennaro Musi3,6, Ottavio De Cobelli3,6, Marta Cremonesi8, Roberto Orecchia9, Giulia Marvaso2,3, Barbara Alicja Jereczek-Fossa2,3, and Giuseppe Petralia3,10
1Division of Radiology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 2Division of Radiation Oncology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 3Department of Oncology and Hemato-oncology, University of Milan, Milano, Italy, 4Department of Experimental Oncology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 5Radiology Department, National Cancer Institute, Putrajaya, Malaysia, 6Division of Urology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 7Unit of Medical Physics, IEO, European Institute of Oncology IRCCS, Milano, Italy, 8Radiation Research Unit, IEO, European Institute of Oncology IRCCS, Milano, Italy, 9Scientific Directorate, IEO, European Institute of Oncology IRCCS, Milano, Italy, 10Precision Imaging and Research Unit, IEO, European Institute of Oncology IRCCS, Milano, Italy

Synopsis

A persisting concern is that downstream models of clinical endpoints may depend on whether the contours were drawn by an expert or an AI. Prediction models for surgical margin status and for pathology-based lymph node status, tumor stage, and ISUP grade group were formed using clinical and radiological features along with whole-prostate radiomic features based on manual and AI segmentations of the prostate in 100 patients who proceeded to prostatectomy after multiparametric MRI. The models based on AI-segmented prostates differed from those based on manual segmentation, but had similar if not better performance. Further testing of the generalizability of the models is required.

Introduction

It is recognized that differences in the acquisition and processing of medical images can affect the performance of machine learning models that predict clinical endpoints from those images. For prostate cancer, a series of challenges has driven considerable improvement in the segmentation of the prostate gland.1-3 As yet, however, there is little certainty about whether the small differences that remain between ground-truth and automatically defined prostate contours lead to significant differences between models that use radiomic features derived from those segmentations. We therefore examined the differences between models formed with radiomic features from prostates defined manually and by deep learning-based segmentation, and compared the contributions of the leading radiomic features in the prediction models.

Methods

One hundred (100) patients who had undergone PI-RADS compliant MRI and subsequent prostatectomy at our Institution since 2015 were included in this study, and the prostate of each patient was segmented from the axial T2-weighted MRI images by an expert radiologist. A second set of segmentations was created by training a custom deep learning architecture on the radiologist’s contours,4-6 relative to which it had a Dice similarity index of 0.910. For each segmentation in each patient, one thousand eight hundred and ten (1810) prostate radiomic features were calculated independently with the pyradiomics python package (v3.0.1). The target variables for prediction were surgical margin status (R0 vs R1), pathology-based lymph node status (pN0 vs pN1), pathology tumor stage (pT2 vs pT3), and pathology ISUP grade group (≤3 vs ≥4). The combined set of clinical features (age, iPSA, biopsy total Gleason score, ISUP grade, and risk class), radiological features (prostate volume, PI-RADS category, and EPE score), and radiomic features was reduced via a hierarchical clustering procedure based on absolute rank correlation to a group of 50 archetypal features for each target variable and segmentation type combination. Gradient-boosted decision-tree models for each target variable were separately trained with features from the manual and automatic segmentation feature clusters, then compared in terms of their AUC values from 32 repeats of 5-fold cross-validation.
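
A minimal sketch of the feature extraction and clustering steps is given below, assuming the extracted features are collected into a samples-by-features table. The pyradiomics calls are standard for v3.0.1, but the image-type/filter settings that produce the 1810 features, the average-linkage criterion, and the rule for choosing one representative ("archetypal") feature per cluster are illustrative assumptions, since the abstract specifies only hierarchical clustering on absolute rank correlation with 50 archetypal features.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from radiomics import featureextractor

# --- Radiomic feature extraction (pyradiomics v3.0.1) ---
# Illustrative settings; the parameter file yielding the 1810 features per
# prostate is not specified in the abstract.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.enableAllImageTypes()   # original, wavelet, LoG, ...
extractor.enableAllFeatures()

def extract_features(image_path, mask_path):
    """Return a dict of radiomic features, dropping pyradiomics diagnostics."""
    result = extractor.execute(image_path, mask_path)
    return {k: float(v) for k, v in result.items()
            if not k.startswith('diagnostics_')}

# --- Correlation-based hierarchical clustering to 50 archetypal features ---
def select_archetypes(X: pd.DataFrame, n_clusters: int = 50) -> list:
    """Cluster features with distance 1 - |Spearman rho| and keep one
    representative feature per cluster."""
    rho, _ = spearmanr(X.values)             # feature-by-feature rank correlation
    dist = 1.0 - np.abs(rho)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')

    archetypes = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # representative = member most correlated, on average, with its cluster
        sub = np.abs(rho[np.ix_(members, members)])
        archetypes.append(X.columns[members[sub.mean(axis=1).argmax()]])
    return archetypes
```

In this sketch the representative of each cluster is the member with the highest mean absolute correlation to the other members; any similarly defined archetype would serve the same purpose.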

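The model training and evaluation step can be sketched in the same spirit. GradientBoostingClassifier is used here as a stand-in for the gradient-boosted decision-tree implementation, and the default hyperparameters and the per-repeat pooling of out-of-fold predictions are assumptions; the abstract specifies only 32 repeats of 5-fold cross-validation with AUC as the comparison metric.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_auc(X, y, n_splits=5, n_repeats=32, seed=0):
    """Return one AUC per repeat, computed from pooled out-of-fold predictions.

    X: (n_samples, n_features) array of archetypal radiomic plus
       clinical/radiological features; y: binary target (e.g. pT2 vs pT3).
    """
    rkf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                  random_state=seed)
    aucs = []
    oof = np.zeros(len(y), dtype=float)
    for i, (train, test) in enumerate(rkf.split(X, y)):
        model = GradientBoostingClassifier()   # stand-in GBDT; tuning not shown
        model.fit(X[train], y[train])
        oof[test] = model.predict_proba(X[test])[:, 1]
        if (i + 1) % n_splits == 0:            # one full 5-fold pass completed
            aucs.append(roc_auc_score(y, oof))
    return np.array(aucs)                      # length n_repeats
```
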
Results

The ranges of AUC values obtained in the repeated 5-fold cross-validations for the models based on radiomic features derived from expert and AI-based segmentations are shown in Figure 1 and summarized in Table 1. Except for pathological lymph node status, performance was significantly better with features derived from the AI-based segmentations. Although the differences were significant in some cases, their magnitudes are relatively small and are likely to vary relative to those seen here in an independent test population. We note, however, that at the clustering stage the representative features chosen for a given target variable differed depending on the segmentation used. Similarly, the leading features in terms of feature importance for prediction of a given target variable also depended on the segmentation used.
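
For a given target variable, the significance test reported in Figure 1 can be reproduced from the two AUC distributions with SciPy; auc_manual and auc_ai are placeholder names for the arrays returned by the cross-validation step for the manual- and AI-segmentation feature sets.

```python
from scipy.stats import mannwhitneyu

# auc_manual and auc_ai: per-repeat AUC arrays for one target variable,
# obtained from the manual- and AI-segmentation feature sets, respectively.
stat, p = mannwhitneyu(auc_manual, auc_ai, alternative='two-sided')
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```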

Discussion

Overall, the prediction models based on radiomic features derived from AI-based segmentation of the prostate tended to perform slightly better than those derived from manual segmentation. This was not a universal finding, however, as prediction of pathological lymph node status was in fact non-significantly worse with the AI segmentation. Importantly, the generally small differences in the segmentation contours were sufficient to affect the downstream processes of feature clustering and model formation, such that while model performance remained similar, the features contained within the models differed according to the type of segmentation used. Several weaknesses of this study should be recognized. First and foremost, the same relatively small number of subjects was used for training and validation of both the segmentation and prediction models, and the generalizability of the results to a wider population needs to be tested. The cross-validation procedures used for both the segmentation and radiomics models should provide a degree of protection against over-fitting; nonetheless, the number of available features remains relatively large.

Conclusion

Our results illustrate a relative equivalence between radiomics models formed with either manual or automatic segmentation, despite differences in the specific features adopted within the models. This is reassuring insofar as the agreement (Dice score) between the AI and manual segmentations is comparable to that reported between radiologists; consequently, the performance of predictive radiomic models based on AI segmentations should be indistinguishable from that of models based on a human reader’s contours, even though the models differ in the features they use. Generalization to larger and wider populations remains to be tested.

Acknowledgements

No acknowledgement found.

References

1. Litjens G, Toth R, van de Ven W, et al. Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Med Image Anal. 2014;18(2):359-373.

2. Farahani K, Jaffe C, Bloch N, et al. NCI-ISBI 2013 Challenge - Automated Segmentation of Prostate Structures. https://wiki.cancerimagingarchive.net/display/Public/NCI-ISBI+2013+Challenge+-+Automated+Segmentation+of+Prostate+Structures. Accessed November 8, 2021.

3. Armato SG, Hadjiyski L, Drukke K. SPIE-AAPM-NCI Prostate MR Classification Challenge. https://prostatex.grand-challenge.org/. Accessed November 8, 2021.

4. Gugliandolo SG, Pepa M, Isaksson LJ, et al. MRI-based radiomics signature for localized prostate cancer: a new clinical tool for cancer aggressiveness prediction? Sub-study of prospective phase II trial on ultra-hypofractionated radiotherapy (AIRC IG-13218). Eur Radiol. 2020 Aug 27.

5. Isaksson LJ, Raimondi S, Botta F, et al. Effects of MRI image normalization techniques in prostate cancer radiomics. Phys Med. 2020;71:7-13.

6. Isaksson LJ, Summers P, Raimondi S, et al. Mixup (sample pairing) can improve performance of deep segmentation networks. J Artif Intell Soft Comput Res. In press.

Figures

Figure 1: Comparison of the AUC performance of models trained on features from the manual (blue) and automatic (orange) segmentations. Each bar represents the range of AUC values from the out-of-bag predictions of 32 repeated 5-fold cross-validations. P-values indicate the significance of the difference according to the nonparametric Mann-Whitney U-test.


Proc. Intl. Soc. Mag. Reson. Med. 30 (2022) 1519
DOI: https://doi.org/10.58530/2022/1519