Paul E Summers1, Lars Johannes Isaksson2, Matteo Johannes Pepa2, Mattia Zaffaroni2, Maria Giulia Vincini2, Giulia Corrao2,3, Giovanni Carlo Mazzola2,3, Marco Rotondi2,3, Sara Raimondi4, Sara Gandini4, Stefania Volpe2,3, Zaharudin Haron5, Sarah Alessi1, Paola Pricolo1, Francesco Alessandro Mistretta6, Stefano Luzzago6, Federico Cattani7, Gennaro Musi3,6, Ottavio De Cobelli3,6, Marta Cremonesi8, Roberto Orecchia9, Giulia Marvaso2,3, Barbara Alicja Jereczek-Fossa2,3, and Giuseppe Petralia3,10
1Division of Radiology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 2Division of Radiation Oncology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 3Department of Oncology and Hemato-oncology, University of Milan, Milano, Italy, 4Department of Experimental Oncology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 5Radiology Department, National Cancer Institute, Putrajaya, Malaysia, 6Division of Urology, IEO, European Institute of Oncology IRCCS, Milano, Italy, 7Unit of Medical Physics, IEO, European Institute of Oncology IRCCS, Milano, Italy, 8Radiation Research Unit, IEO, European Institute of Oncology IRCCS, Milano, Italy, 9Scientific Directorate, IEO, European Institute of Oncology IRCCS, Milano, Italy, 10Precision Imaging and Research Unit, IEO, European Institute of Oncology IRCCS, Milano, Italy
Synopsis
A persistent concern is that models predicting clinical endpoints downstream of segmentation may depend on whether the contours were drawn by an expert or by an AI. Prediction models for surgical margin status and for pathology-based lymph node status, tumor stage, and ISUP grade group were built using clinical and radiological features together with whole-prostate radiomic features derived from manual and AI segmentations of the prostate in 100 patients who proceeded to prostatectomy after multiparametric MRI. The models based on AI-segmented prostates differed from those based on manual segmentation, but showed similar, if not better, performance. Further testing of the generalizability of the models is required.
Introduction
It is recognized that differences in the acquisition and processing of medical images can affect the performance of machine learning models that predict clinical endpoints from those images. For prostate cancer, a series of challenges has driven considerable improvement in automated segmentation of the prostate gland.1-3 As yet, however, there is little certainty about whether the small differences that remain between ground-truth and automatically defined prostate contours lead to significant differences between models that use radiomic features derived from those segmentations. We therefore examined the differences between models built with radiomic features from prostates segmented manually and by a deep-learning-based method, and further compared the contributions of the leading radiomic features in the resulting prediction models.
Methods
One hundred (100) patients who had undergone PI-RADS-compliant MRI and subsequent prostatectomy at our Institution since 2015 were included in this study. The prostate of each patient was segmented from the axial T2-weighted MRI images by an expert radiologist. A second set of segmentations was created by training a custom deep learning architecture on the radiologist's contours,4-6 relative to which it achieved a Dice similarity coefficient of 0.910.
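As an illustration, the Dice similarity coefficient between a ground-truth and a predicted binary mask can be computed along the following lines (a minimal sketch in Python; the function and array names are ours, not taken from the study's code):

import numpy as np

def dice_coefficient(mask_a, mask_b):
    # Dice = 2 * |A intersect B| / (|A| + |B|) for two binary segmentation masks
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom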
For each segmentation in each patient, one thousand eight hundred and ten (1810) prostate radiomic features were calculated independently with the pyradiomics Python package (v3.0.1).
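For illustration, a per-patient extraction with pyradiomics might look like the sketch below; the file paths and the extractor settings shown (all image types and feature classes enabled) are assumptions rather than the exact configuration used in the study:

from radiomics import featureextractor

# Default extractor; enabling all image types (original, wavelet, LoG, ...) and all
# feature classes yields on the order of a thousand or more features per mask.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.enableAllImageTypes()
extractor.enableAllFeatures()

# Hypothetical file names for one patient's axial T2-weighted volume and prostate mask
features = extractor.execute("t2w_axial.nii.gz", "prostate_mask.nii.gz")
radiomic_values = {k: v for k, v in features.items() if not k.startswith("diagnostics_")}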
The target variables for prediction were surgical margin status (R0 vs R1), pathology-based lymph node status (pN0 vs pN1), pathology tumor stage (pT2 vs pT3), and pathology ISUP grade group (≤3 vs ≥4). The combined set of clinical (age, iPSA, biopsy total Gleason score, ISUP grade, and risk class), radiological (prostate volume, PI-RADS category, and EPE score), and radiomic features was reduced via a hierarchical clustering procedure based on absolute rank correlation to a group of 50 archetypal features for each combination of target variable and segmentation type.
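The feature-reduction step can be sketched as follows, using Spearman rank correlation and average-linkage hierarchical clustering from SciPy; the choice of cluster representative shown here (the member most correlated with the rest of its cluster) is illustrative rather than a description of the exact procedure used:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_archetypes(features, n_clusters=50):
    # features: pandas DataFrame with one column per feature and one row per patient
    rho, _ = spearmanr(features.values)      # feature-by-feature rank correlation
    dist = 1.0 - np.abs(rho)                 # dissimilarity = 1 - |rank correlation|
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    archetypes = []
    for cluster_id in np.unique(labels):
        in_cluster = labels == cluster_id
        members = features.columns[in_cluster]
        # keep the member most correlated, on average, with the rest of its cluster
        mean_abs_rho = np.abs(rho)[np.ix_(in_cluster, in_cluster)].mean(axis=1)
        archetypes.append(members[int(np.argmax(mean_abs_rho))])
    return archetypes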
Gradient-boosted decision-tree models for each target variable were trained separately with features from the manual and automatic segmentation feature clusters, and then compared in terms of their AUC values over 32 repetitions of 5-fold cross-validation.
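A minimal sketch of this comparison in scikit-learn is shown below, with GradientBoostingClassifier standing in for whichever gradient-boosted tree implementation was actually used; the variable names are illustrative:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def repeated_cv_auc(X, y, seed=0):
    # AUC of each of the 32 x 5 = 160 validation folds for one feature set
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=32, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed)
    return cross_val_score(model, X, y, scoring="roc_auc", cv=cv)

# e.g. aucs_manual = repeated_cv_auc(X_manual, y); aucs_ai = repeated_cv_auc(X_ai, y)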
Results
The ranges of AUC values obtained in the 5-fold cross-validations for the different models based on radiomic features derived from the expert or AI-based segmentations are shown in Figure 1 and summarized in Table 1. Except for pathological lymph node status, performance was significantly better with features derived from the AI-based segmentations. Although the differences were significant in some cases, their magnitudes are relatively small and are likely to vary relative to those seen here in an independent test population.
We note, however, that in the clustering stage the choice of representative features for a given target variable differed depending on the segmentation used. Similarly, the leading features in terms of feature importance for prediction of a given target variable also depended on the segmentation used.
Discussion
Overall, the prediction models based on radiomic features derived from the AI-based segmentation of the prostate tended to perform slightly better than those derived from the manual segmentation. This was not a universal finding, however, as prediction of pathological lymph node status was in fact non-significantly worse with the AI segmentation. Importantly, the generally small differences in the segmentation contours were sufficient to affect the downstream processes of feature clustering and model formation, such that while model performance remained similar, the features contained within a model may differ according to the type of segmentation used.
There are several weaknesses to recognize in this study. First and foremost, the same relatively small number of subjects was involved in the training and validation of both the segmentation and prediction models, and the generalizability of the results to a wider population needs to be tested. The cross-validation procedures used for both the segmentation and radiomics models should provide a degree of protection against over-fitting; nonetheless, the number of features available remains relatively large.
Conclusion
Our results illustrate a relative equivalence between radiomics models built with either manual or automatic segmentation, despite differences in the specific features adopted within the models. This is reassuring insofar as the agreement (Dice score) between the AI and manual segmentations is comparable to that reported between radiologists, and consequently the performance of predictive radiomic models based on the AI segmentations would be expected to be indistinguishable from that of models based on a human reader's contours, despite the models differing in the features they use. Generalization to larger and wider populations remains to be tested.
Acknowledgements
No acknowledgement found.
References
1. Litjens G, Toth R, van de Ven W, et al. Evaluation
of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Med Image
Anal. 2014;18(2):359-373.
2. Farahani K, Jaffe C, Bloch N, et al. NCI-ISBI 2013 Challenge - Automated Segmentation
of Prostate Structures.
https://wiki.cancerimagingarchive.net/display/Public/NCI-ISBI+2013+Challenge+-+Automated+Segmentation+of+Prostate+Structures.
Accessed November 8, 2021.
3. Armato SG, Hadjiiski L, Drukker K. SPIE-AAPM-NCI Prostate MR Classification Challenge. https://prostatex.grand-challenge.org/. Accessed November 8, 2021.
4. Gugliandolo SG, Pepa M,
Isaksson LJ, et al. MRI-based radiomics
signature for localized prostate cancer: a new clinical tool for cancer aggressiveness
prediction? Sub-study of prospective phase II trial on ultra-hypofractionated
radiotherapy (AIRC IG-13218). Eur Radiol. 2020 Aug 27.
5. Isaksson LJ, Raimondi S, Botta F, et al. Effects
of MRI image normalization techniques in prostate cancer radiomics. Phys Med.
2020;71:7-13.
6. Isaksson LJ, Summers P, Raimondi S, et al. Mixup (sample pairing) can improve performance of deep segmentation networks. Accepted for publication in J Artif Intell Soft Comput Res.