1417

Clinical Evaluation of a Fully Convolutional Neural Network for Automatic MS Lesion Segmentation on MRI

Amalie Monberg Hindsholm¹, Claes Nøhr Ladefoged¹, Flemming Littrup Andersen¹, Stig Præstekjær Cramer¹, Liselotte Højgaard¹, and Ulrich Lindberg¹
¹Clinical Physiology, Nuclear Medicine and PET, Rigshospitalet University, Copenhagen, Denmark

Synopsis

Automatic segmentation of MRI-visible multiple sclerosis (MS) lesions could reduce assessment time as well as inter- and intra-rater variability. Recently, automatic methods using deep convolutional neural networks (CNN) have achieved strong results in image segmentation. This work implements a state-of-the-art 2D CNN-based segmentation method from the literature and extends and recalibrates it to a local MS dataset of 91 patients. A clinical evaluation is performed on an independent MS dataset of 53 patients, where 94% of predicted segmentation masks were deemed valuable for clinical use.

Introduction

Automatic detection of MRI-visible multiple sclerosis (MS) lesions has been researched for more than twenty years, as an automatic process could increase robustness and decrease inter-rater variability¹. In recent years, deep learning techniques utilising convolutional neural networks (CNN) have been applied to image segmentation with great success, and deep-learning lesion segmentation tools obtained the highest ranks in the three most recent international MS lesion segmentation challenges²⁻⁴. So far, the evaluation framework of most automatic methods has been purely computational, with no direct evaluation of clinical value.
In this study we present an implementation and clinical evaluation of a state-of-the-art segmentation model on a local, clinical dataset.

Methods

We implemented the automatic white matter hyperintensity (WMH) segmentation model proposed by Li et al.⁵, which at the time ranked first in the 2017 MICCAI WMH segmentation challenge⁶. The model is a 2D CNN with a U-net structure and a two-channel input of axial T1-weighted and T2-weighted FLAIR images. Its weights are initialised with a He initialiser, and it is trained with a Dice loss. Starting from the model trained on the challenge data, we explored a series of adjustments to improve segmentation performance on the MS dataset. One of these was recalibration on the MS dataset to different degrees, in which the original model was used for transfer learning, either by retraining only the last convolutional layer or by retraining the whole network end-to-end. Furthermore, two regularisation techniques were implemented: a sequence-of-slices input, to increase the spatial information in the model input, and dropout with a drop fraction of 0.2 in all convolutional layers, in an effort to decrease overfitting. The performance of the initial implementation and of the extended versions was evaluated by computing four metrics measuring agreement between model prediction and reference masks: Dice similarity coefficient (DSC), average volume distance (AVD), precision and recall.
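As an illustration of two of the ingredients named above, the following is a minimal NumPy sketch of a soft Dice loss and of a sequence-of-slices input. It is not the implementation used in this work: the function names, the smoothing constant `eps`, the number of neighbouring slices and the edge handling (clamping to the first or last slice) are all assumptions made for the example.

```python
import numpy as np


def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss, 1 - 2*sum(p*t) / (sum(p) + sum(t)).

    `pred` holds probabilities in [0, 1], `target` a binary mask.
    `eps` (an assumed value) avoids division by zero for empty masks.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def stack_neighbouring_slices(volume, index, n_neighbours=1):
    """Build a sequence-of-slices input for one axial slice.

    `volume` has shape (slices, height, width). The result has shape
    (height, width, 2 * n_neighbours + 1), with the neighbouring slices
    as channels. Indices outside the volume are clamped to the first or
    last slice (an assumption; other padding schemes are possible).
    """
    n_slices = volume.shape[0]
    idx = np.arange(index - n_neighbours, index + n_neighbours + 1)
    idx = np.clip(idx, 0, n_slices - 1)
    return np.moveaxis(volume[idx], 0, -1)
```

With two modalities, one plausible arrangement would be to build such a stack for the T1-weighted and FLAIR volumes separately and concatenate them along the channel axis.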
The local MS dataset consisted of 91 patients divided into 71 for training, 10 for validation and 10 for final testing, with manual delineation masks as references. To evaluate the clinical value of the automatic segmentation model, predicted masks were assessed by a clinical expert for their applicability in clinical practice. A test dataset of 53 clinical patients without manual references was delineated by the segmentation model, and all masks were presented to the rater in a blinded fashion. Each mask was given a score from 1 to 3, with 1 being perfect and 3 being unacceptable for clinical use (figure 1). Two sets of manual delineation masks of the ten test patients, produced by two individual specialists, were likewise presented to the clinical raters together with one set of predicted masks (figure 1). This was done in order to compare the clinical rating of manual references and predicted masks.
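A patient-level split of this kind can be sketched as follows. The abstract does not state how patients were assigned to the three groups, so the seeded random shuffle, the function name `split_patients` and the integer patient identifiers are assumptions for illustration only.

```python
import random


def split_patients(patient_ids, n_train=71, n_val=10, n_test=10, seed=0):
    """Split patients (not slices) into training, validation and test sets.

    Splitting at patient level keeps all slices of one patient in a
    single set. The random assignment and the seed are assumptions.
    """
    ids = list(patient_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("group sizes must add up to the number of patients")
    rng = random.Random(seed)
    rng.shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers 0..90 standing in for the 91 patients.
train_ids, val_ids, test_ids = split_patients(range(91))
```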

Results

The highest metric performance was obtained by retraining the network end-to-end on the local dataset, using the original segmentation model for transfer learning, combined with both regularisation techniques. This combination resulted in the following average segmentation performance across the ten test patients: DSC: 0.53 (0.14), AVD: 133.75 (109.75), precision: 0.63 (0.19) and recall: 0.88 (0.08).
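For reference, the four metrics can be computed from a predicted and a reference mask roughly as in the sketch below. Two assumptions are made here that the abstract does not confirm: precision and recall are computed voxel-wise (challenge evaluations often compute them per lesion), and AVD is taken to be the absolute volume difference expressed as a percentage of the reference volume.

```python
import numpy as np


def segmentation_metrics(pred, ref):
    """Agreement metrics between a binary prediction and a reference mask.

    Assumptions: voxel-wise precision and recall, and AVD defined as
    100 * |V_pred - V_ref| / V_ref. Undefined ratios are returned as NaN.
    """
    pred = np.asarray(pred).astype(bool)
    ref = np.asarray(ref).astype(bool)
    tp = np.count_nonzero(pred & ref)    # voxels marked in both masks
    fp = np.count_nonzero(pred & ~ref)   # predicted only
    fn = np.count_nonzero(~pred & ref)   # reference only

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "avd_percent": ratio(100.0 * abs(fp - fn), tp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy example: the prediction covers the whole reference plus two extra voxels.
reference = [[0, 1, 1, 0], [0, 1, 0, 0]]
prediction = [[1, 1, 1, 0], [0, 1, 1, 0]]
metrics = segmentation_metrics(prediction, reference)
```

In the toy example every reference voxel is found, so recall is 1.0 while precision is 0.6 and the DSC is 0.75. An over-segmenting prediction therefore shows up as high recall with lower precision and an increased volume difference.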
In the clinical evaluation of the extended and recalibrated model, 94% of the 53 masks were rated at least acceptable for clinical practice, with 34% rated as perfect (figure 2). For two of the three masks rated unacceptable, the cause was large false-positive delineation in the plexus choroideus (figure 3). Generally, the model performed well at detecting lesions but was oversensitive towards hyperintensities, delineating larger areas than the reference and producing a number of false positives (figure 4). This is also evident from the high average recall and lower precision. None of the ten test-patient masks delineated by the model were rated clinically unacceptable, whereas 3/10 masks delineated by rater 1 were rated unacceptable.
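The reported percentages follow from a simple tally of the 1 to 3 scores, sketched below. Only the total of 53 masks and the three unacceptable masks are stated directly. The counts of 18 perfect and 32 acceptable masks are inferred here from the reported 34% and 94% and should be read as an assumption, as should the function and label names.

```python
from collections import Counter

# Score 2 is labelled "acceptable" following the figure captions.
LABELS = {1: "perfect", 2: "acceptable", 3: "unacceptable"}


def summarise_ratings(scores):
    """Count the 1-3 clinical scores and derive acceptance percentages."""
    if not scores:
        raise ValueError("no ratings given")
    counts = Counter(scores)
    total = len(scores)
    summary = {label: counts[score] for score, label in LABELS.items()}
    summary["accepted_percent"] = 100.0 * (counts[1] + counts[2]) / total
    summary["perfect_percent"] = 100.0 * counts[1] / total
    return summary


# Inferred counts (assumption): 18 perfect, 32 acceptable, 3 unacceptable.
scores = [1] * 18 + [2] * 32 + [3] * 3
summary = summarise_ratings(scores)
```

With these counts, 50 of 53 masks are at least acceptable (94.3%) and 18 of 53 are perfect (34.0%), which matches the rounded figures above.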

Discussion

Results of the clinical evaluation show that this automatic segmentation model can produce segmentation maps with a high degree of clinical value. The algorithm does, however, suffer from a moderate false-positive rate. This can possibly be alleviated by subsequent manual correction by an expert with limited time expenditure, which would still be faster than entirely manual delineation. In the comparison between the automatic method and the two clinical experts, the algorithm obtained a higher share of accepted masks. Although this comparison was conducted on a minimal test dataset, it highlights the difficulty of MS lesion delineation, as a mask produced by one clinical expert can be deemed unacceptable by another. It also exemplifies the problem of using overlap-based computational metrics, such as the DSC, as the primary evaluation tool, since the manual references are heavily influenced by inter-rater variability. All patients were in early stages of MS, and the model should therefore be trained on a larger, more diverse dataset.

Conclusion

A clinical evaluation showed that the model, after extension and recalibration, had a high degree of clinical applicability. With subsequent assessment by clinical experts, the algorithm will be able to add robustness to the delineation process and decrease assessment time.

References

1. García-Lorenzo D, Francis S, Narayanan S, Arnold DL, Collins DL. Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med Image Anal. 2013;17(1):1-18. doi:10.1016/j.media.2012.09.004

2. van Ginneken B, Heimann T, Styner M. 3D segmentation in the clinic: A grand challenge. Int Conf Med Image Comput Comput Assist Interv. 2007;10(May 2014):7-15. http://grand-challenge2008.bigr.nl/proceedings/pdfs/msls08/Styner.pdf.

3. Carass A, Roy S, Jog A, et al. Longitudinal multiple sclerosis lesion segmentation: Resource and challenge. NeuroImage. 2017;148:77-102. doi:10.1016/j.neuroimage.2016.12.064

4. Styner M, Lee J, Chin B, et al. Editorial: 3D Segmentation in the Clinic: A Grand Challenge II: MS Lesion Segmentation. In: Grand Challenge Workshop: Multiple Sclerosis Lesion Segmentation Challenge. Midas. 2008:1-6.

5. Li H, Jiang G, Zhang J, et al. Fully Convolutional Network Ensembles for White Matter Hyperintensities Segmentation in MR Images. 2018.

6. Kuijf HJ, Biesbroek JM, de Bresser J, et al. Standardized Assessment of Automatic Segmentation of White Matter Hyperintensities; Results of the WMH Segmentation Challenge. IEEE Trans Med Imaging. 2019. doi:10.1109/tmi.2019.2905770

Figures

Figure 1. Setup for the clinical evaluation of MS lesion segmentation masks. Masks generated by the extended model for 53 clinical patients as well as generated and manually delineated masks of the ten independent test patients were all rated on a scale from perfect to unacceptable.

Figure 2. Results of the clinical evaluation of a total of 63 predicted segmentation masks and 20 manual masks. The evaluation of clinical data results in a high acceptance rate. The comparison of manual and generated delineations shows that none of the ten generated masks were deemed unacceptable, whereas one and three masks, respectively, of the two manually delineated sets were.

Figure 3. Examples of three axial slices from three patients with a perfect (A), clinically acceptable (B) and clinically unacceptable (C) delineation mask respectively. In B a false-negative delineation in the brain stem is visible in the first slice. In C large false-positive delineations in the plexus choroideus can be observed in the second slice.

Figure 4. Delineation masks of four representative slices from three patients. A relative correspondence between the automatically generated masks and manual reference masks is observed. The generated mask delineates a larger volume in D, detecting one large lesion as opposed to two smaller in the manual mask.

Proc. Intl. Soc. Mag. Reson. Med. 28 (2020)
