4106

Assessing the variability of contours performed by DL algorithms in prostate MRI
Jeremiah Sanders1, Henry Mok2, Alexander Hanania3, Aradhana Venkatesan4, Chad Tang2, Teresa Bruno2, Howard Thames5, Rajat Kudchadker6, and Steven Frank2
1Imaging Physics, UT MD Anderson Cancer Center, Houston, TX, United States, 2Radiation Oncology, UT MD Anderson Cancer Center, Houston, TX, United States, 3Radiation Oncology, Baylor College of Medicine, Houston, TX, United States, 4Diagnostic Radiology, UT MD Anderson Cancer Center, Houston, TX, United States, 5Biostatistics, UT MD Anderson Cancer Center, Houston, TX, United States, 6Radiation Physics, UT MD Anderson Cancer Center, Houston, TX, United States

Synopsis

Quantitative techniques for characterizing deep learning (DL) algorithms are necessary to inform their clinical application, use, and quality assurance. This work analyzes the performance of DL algorithms for segmentation in prostate MRI at a population level. We performed computational observer studies and spatial entropy mapping for characterizing the variability of DL segmentation algorithms and evaluated them on a clinical MRI task that informs the treatment and management of prostate cancer patients. Specifically, we analyzed the task of prostate and peri-prostatic anatomy segmentation in prostate MRI and compared human and computer observer populations against one another.

Introduction

Quantitative techniques for characterizing deep learning (DL) algorithms are necessary to inform their clinical application, use, and quality assurance. Furthermore, predictions from DL models should be understood in the context of human predictions for the same clinical tasks if they are expected to assist or substitute humans in clinical decision making. A limitation of DL algorithms is their lack of generality due to the limited number of data sets available from retrospective databases. As a result, some DL algorithms are expected to make highly variable predictions on clinical tasks and therefore need to be sufficiently characterized prior to clinical use. Methods to evaluate multiple algorithms are needed to gain more insight into the variability of these algorithms. The purpose of this work was to investigate the variability of DL algorithms for contouring at a population level by training several DL algorithms for prostate and organ at risk (OAR) segmentation in prostate MRI. Each DL algorithm was treated as an independent observer of the development MRIs and predictions from populations of DL algorithms were compared with target annotations from a population of human observers performing the same task.

Methods

Twenty-five patients underwent low-dose-rate prostate brachytherapy (LDRPBT) and were scanned post-implant with a fully balanced SSFP pulse sequence (Table 1). A human observer study with 7 analysts was conducted. The analysts were 4 radiation oncologists (ROs), 1 radiologist, 1 dosimetrist, and 1 imaging physicist, all of whom are involved in the treatment workflow for LDRPBT. Several techniques were implemented to minimize human observer bias. The analysts contoured the prostate, external urinary sphincter (EUS), seminal vesicles (SV), rectum, and bladder in the 25 MRIs.

Fifty-four DL algorithms were successively trained on a common set of development MRIs and human segmentation masks. Each of the 54 resulting algorithms had unique convolution kernels, and the population was produced in an attempt to replicate the variation in the human segmentation masks. The development set comprised 295 retrospectively collected 3D prostate MRIs acquired with 2 pulse sequences that yielded T2w and T2/T1w image contrast, respectively (250/45 training/cross validation).

All DL experiments were performed on an NVIDIA DGX-1 workstation (Figure 1). Each of the DL models made predictions on the 25 post-implant MRIs. A 50% confidence threshold on each network’s softmax output was used to produce the binary mask of each organ from that network. Global similarity metrics were computed between the most experienced RO and all 54 DL algorithms. They were also computed among the human observers using the most experienced RO as the reference. The similarity metrics included precision (P=TP/(TP+FP)), recall (R=TP/(TP+FN)), Dice similarity coefficient (DSC=2*P*R/(P+R)), and Matthews correlation coefficient (MCC=(TP*TN-FP*FN)/√((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))), where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false negative voxels. Additionally, spatial distributions of the entropy of the segmentation masks were computed and compared across human and computer populations for the 25 patients. The entropy H of the segmentation masks for an observer population was computed voxel-wise: H(xi,yj,zk)=-∑p(xi,yj,zk)log(p(xi,yj,zk)), where p(xi,yj,zk) is the probability distribution of the tissue class of voxel (i,j,k) for the observer population and the sum runs over the tissue classes. The algorithms were incorporated into a commercial treatment planning system (TPS) for visualization.
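As an illustration of the thresholding and global similarity metrics defined above, the following is a minimal sketch in plain Python and not the implementation used in this work. It assumes masks are flattened to sequences of 0/1 voxel values. The function names (`binarize`, `confusion_counts`, `similarity_metrics`), the use of `>=` at the 50% threshold, and the convention of returning 0.0 when a denominator is zero are all assumptions made for the example.

```python
import math


def binarize(probabilities, threshold=0.5):
    """Turn per-voxel softmax probabilities for one organ into a 0/1 mask.

    Assumption: voxels at exactly the threshold are counted as foreground.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_counts(reference, prediction):
    """Count TP, FP, FN, TN voxels between two flattened binary masks."""
    if len(reference) != len(prediction):
        raise ValueError("masks must contain the same number of voxels")
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, prediction):
        if ref and pred:
            tp += 1
        elif pred:
            fp += 1
        elif ref:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def similarity_metrics(reference, prediction):
    """Precision, recall, DSC, and MCC as defined in the Methods.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    tp, fp, fn, tn = confusion_counts(reference, prediction)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    dsc = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "dsc": dsc, "mcc": mcc}


# Toy example: an 8-voxel reference mask and hypothetical softmax outputs.
reference_mask = [1, 1, 1, 0, 0, 0, 0, 0]
softmax_output = [0.9, 0.8, 0.2, 0.7, 0.1, 0.0, 0.3, 0.4]
predicted_mask = binarize(softmax_output)
metrics = similarity_metrics(reference_mask, predicted_mask)
```

In a setting like the one described here, `reference_mask` would correspond to the reference observer's contour for one organ, and the metrics would be computed once per observer and organ before being summarized across the population.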

Results

Global similarity metrics are reported in Table 2 for the experiments. Spatial entropy maps for 8 example patients compared between the human and computer observer populations revealed patterns of clustering in specific anatomic regions of the MRIs (Figure 2). The clusters with the highest entropy were located around the circumference of the target organ (prostate), especially at junctions between the target and surrounding OARs. The most common regions of high entropy clusters were observed at the bladder neck, the junction between the prostate and SV, the region along the prostate and anterior rectal wall, and the junction between the prostate apex and EUS. The computer observer population was less variable than the human observer population for some patients (e.g. Figure 2, patients 3 and 4), but was also highly variable on other patients (e.g. Figure 2, patients 7 and 8). The autocontouring predictions and spatial entropy maps computed in the TPS for one example patient (Figure 3) showed regions of high entropy clusters overlaid on their corresponding prostate MRI, providing feedback on anatomical regions of high variability in contouring for review and refinement.
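The voxel-wise entropy behind these maps can be sketched as follows. This is a hedged illustration in plain Python, not the code used in this work. It assumes each observer's segmentation is a flattened sequence of integer tissue-class labels, that the class probabilities at a voxel are estimated as the fraction of observers assigning each label, and that the natural logarithm is used (the logarithm base is not stated in the Methods). The names `voxel_entropy` and `entropy_map` are hypothetical.

```python
import math
from collections import Counter


def voxel_entropy(labels):
    """Shannon entropy of the tissue-class labels given to one voxel.

    `labels` holds one label per observer. Probabilities are estimated as
    label frequencies (an assumption). The sum of p*log(1/p) is
    algebraically equal to -sum of p*log(p).
    """
    n_observers = len(labels)
    if n_observers == 0:
        raise ValueError("at least one observer is required")
    counts = Counter(labels)
    return sum(
        (c / n_observers) * math.log(n_observers / c) for c in counts.values()
    )


def entropy_map(observer_label_maps):
    """Voxel-wise entropy across a population of observers.

    `observer_label_maps` is a list with one flattened label map per observer;
    all maps must cover the same voxels in the same order.
    """
    n_voxels = len(observer_label_maps[0])
    if any(len(m) != n_voxels for m in observer_label_maps):
        raise ValueError("all label maps must have the same number of voxels")
    return [
        voxel_entropy([m[v] for m in observer_label_maps])
        for v in range(n_voxels)
    ]


# Toy example: 4 observers, 3 voxels, labels 0=background, 1 and 2=organs.
population = [
    [0, 1, 1],
    [0, 1, 2],
    [0, 1, 1],
    [0, 0, 2],
]
h_map = entropy_map(population)
```

In this toy population the first voxel has zero entropy because all observers agree, while the third voxel, split evenly between two classes, has the highest entropy. This is the kind of disagreement the maps highlight at organ junctions.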

Discussion/Conclusions

Global similarity metrics are commonly used to characterize the performance of DL algorithms for segmentation tasks. However, these metrics are limited in that they do not provide information about local segmentation performance. Moreover, the clinical context for these metrics is often absent in reporting of segmentation algorithms (e.g. what does a prostate DSC of 0.87 really mean?). This work attempts to help address these questions for the task of prostate and OAR segmentation in prostate MRI. We compared four global similarity metrics between human and computer observer populations to provide context on the numerical values of these metrics. We also presented spatial entropy mapping case studies for 8 example patients, demonstrating that populations of DL algorithms produced patterns of lower, similar, and higher spatial entropy of tissue classes as compared to a group of analysts involved in LDRPBT clinical workflow. These findings have several potential implications for the use of autocontouring algorithms for MRI-based LDRPBT.

Figures

Table 1. MRI acquisition parameters for post-implant LDRPBT MRIs. All MRIs were acquired with a fully balanced SSFP pulse sequence (Siemens: CISS, GE: FIESTA-C). Images were acquired on a 3T Prisma using two 18-channel external array coils and no endorectal coil.

Figure 1. Schematic illustrating the workflow for creating computer observers and comparing them against human observers.

Table 2. Comparison of global similarity metrics between computer and human observer populations. σ=standard deviation, μ=mean, x=median, max=maximum.

Figure 2. Spatial entropy maps for human and computer observer populations in 8 example patients. Spatial entropy maps were computed by first grouping the predictions from DL models trained with the same loss function. Four loss functions were investigated: cross-entropy, DSC, Jaccard, and focal.

Figure 3. Autocontouring predictions (prostate: thick white line, OARs: blue lines) displayed with spatial entropy maps (transparent regions; yellow: 30% of max entropy, orange: 70% of max entropy, red: 90% of max entropy) overlaid on post-implant prostate MRIs in a commercial treatment planning system. Arrows indicate regions of high spatial entropy clusters at organ junctions where physicians can review and potentially refine the automated predictions.

Proc. Intl. Soc. Mag. Reson. Med. 29 (2021)