Amalie Monberg Hindsholm¹, Claes Nøhr Ladefoged¹, Flemming Littrup Andersen¹, Stig Præstekjær Cramer¹, Liselotte Højgaard¹, and Ulrich Lindberg¹
¹Clinical Physiology, Nuclear Medicine and PET, Rigshospitalet University, Copenhagen, Denmark
Synopsis
Automatic
segmentation of MRI-visible multiple sclerosis (MS) lesions could
potentially reduce assessment time and inter- and intra-rater
variability. Recently, automatic methods using deep convolutional
neural networks (CNN) have obtained great results in image
segmentation. This work implements a state-of-the-art 2D CNN-based
segmentation method from literature and extends and recalibrates it
to a local MS dataset of 91 patients. A clinical evaluation is
performed on an independent MS dataset of 53 patients, where 94% of
predicted segmentation masks were deemed valuable for clinical use.
Introduction
Automatic detection of MRI-visible multiple sclerosis (MS) lesions has
been researched for more than twenty years, as an automatic process
could increase robustness and decrease inter-rater variability¹.
In recent years, deep learning techniques using convolutional
neural networks (CNN) have been applied to image segmentation with
great success, and deep-learning lesion segmentation tools have
obtained the highest ranks in the three most recent international MS
lesion segmentation challenges²⁻⁴.
So far, the evaluation framework of most automatic methods has been
purely computational, with no direct evaluation of clinical value.
In this study we present an implementation and clinical evaluation of
a state-of-the-art segmentation model on a local, clinical dataset.
Methods
We implemented the automatic white matter hyperintensity (WMH)
segmentation model proposed by Li et al.⁵, which at the time ranked
first in the 2017 MICCAI WMH segmentation challenge⁶.
The model is a 2D CNN with a U-net structure and a two-channel input
of axial T1-weighted and T2-weighted FLAIR images. It is initialised
with a He initialiser and uses Dice loss as its loss function.
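The two ingredients named above can be written out in a few lines. This is a minimal sketch in plain Python, not the implementation used in the study: the function names `he_normal_std` and `dice_loss`, the smoothing term `smooth`, and the choice of the normal variant of He initialisation are assumptions made for illustration.

```python
import math


def he_normal_std(fan_in):
    """Standard deviation of He (normal) initialisation: sqrt(2 / fan_in).

    fan_in is the number of inputs feeding one unit, e.g.
    kernel_height * kernel_width * input_channels for a convolution.
    """
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return math.sqrt(2.0 / fan_in)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss over flat sequences of voxel values.

    pred   : predicted foreground probabilities in [0, 1]
    target : reference labels (0 or 1)
    smooth : assumed smoothing term that avoids division by zero
             when both masks are empty
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


if __name__ == "__main__":
    # Hypothetical 3x3 kernel on the two-channel (T1, FLAIR) input.
    print(he_normal_std(3 * 3 * 2))
    print(dice_loss([0.9, 0.1, 0.8, 0.0], [1, 0, 1, 0]))
```

The loss is 0 when prediction and reference coincide and approaches 1 as the overlap vanishes, which makes it less sensitive than voxel-wise cross-entropy to the small share of lesion voxels in a slice.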
The model trained on the challenge data was then extended to improve
segmentation performance on the MS dataset by exploring a series of
adjustments. One of these was recalibration on the MS dataset at
different depths: the original model served as the starting point for
transfer learning, with either retraining of the last convolutional
layer only or end-to-end retraining of the whole network.
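The difference between the two recalibration modes comes down to which layers are allowed to update their weights. The sketch below illustrates this with a toy stand-in for a network; the `Layer` class, the layer names and the function `configure_recalibration` are hypothetical and not taken from the study. In a framework such as Keras, the same idea is expressed by setting each layer's `trainable` attribute before compiling.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Toy stand-in for a network layer holding pretrained weights."""
    name: str
    trainable: bool = True


def configure_recalibration(layers, mode):
    """Mark which layers are updated when recalibrating on new data.

    mode="last_layer" : freeze everything except the final layer
    mode="end_to_end" : leave every layer trainable
    """
    if mode not in ("last_layer", "end_to_end"):
        raise ValueError("mode must be 'last_layer' or 'end_to_end'")
    last_index = len(layers) - 1
    for index, layer in enumerate(layers):
        layer.trainable = mode == "end_to_end" or index == last_index
    return layers


if __name__ == "__main__":
    # Hypothetical layer names for a small U-net-like model.
    network = [Layer("encoder_conv"), Layer("decoder_conv"), Layer("output_conv")]
    for mode in ("last_layer", "end_to_end"):
        configure_recalibration(network, mode)
        print(mode, [(layer.name, layer.trainable) for layer in network])
```

Retraining only the last layer keeps the pretrained features fixed and adapts just the final mapping, whereas end-to-end retraining lets all features adapt to the local data at the cost of more trainable parameters.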
Furthermore, two regularisation techniques were implemented: a
sequence-of-slices input, to increase the spatial information in the
model input, and dropout with a drop-out fraction of 0.2 in all
convolutional layers, in an effort to decrease overfitting. The
performance of the initial implementation and the extended versions
was evaluated by computing four metrics measuring agreement between
model prediction and reference masks: Dice similarity coefficient
(DSC), average volume distance (AVD), precision and recall.
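The four metrics can be illustrated on binary masks as follows. This is a sketch under stated assumptions rather than the evaluation code used here: the metrics are computed voxel-wise, and AVD is taken to be the absolute difference between predicted and reference volume as a percentage of the reference volume, as in the WMH challenge; the abstract does not spell out either definition. The function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(pred, ref):
    """Voxel-wise agreement between two flat binary masks.

    Returns a dict with DSC, AVD (percent, assumed definition),
    precision and recall.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")

    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)

    pred_volume = tp + fp
    ref_volume = tp + fn

    # Two empty masks agree perfectly by convention.
    total = pred_volume + ref_volume
    dsc = 2.0 * tp / total if total else 1.0
    precision = tp / pred_volume if pred_volume else 0.0
    recall = tp / ref_volume if ref_volume else 0.0

    if ref_volume:
        avd = 100.0 * abs(pred_volume - ref_volume) / ref_volume
    else:
        avd = 0.0 if pred_volume == 0 else float("inf")

    return {"dsc": dsc, "avd": avd, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # An over-segmenting prediction: every reference voxel is found,
    # but twice as many voxels are marked as lesion.
    print(segmentation_metrics([1, 1, 1, 1], [1, 1, 0, 0]))
```

An over-segmenting model of this kind shows high recall, lower precision and a large AVD at the same time.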
The local MS dataset consisted of 91 patients divided into 71 for
training, 10 for validation and 10 for final testing, with manual
delineation masks as references. To evaluate the clinical value of
the automatic segmentation model, predicted masks were assessed by a
clinical expert for their applicability in clinical practice. A test
dataset of 53 clinical patients without manual references was
delineated by the segmentation model, and all masks were presented
to the rater in blinded fashion. Each mask was given a score from
1 to 3, with 1 being perfect and 3 being unacceptable for clinical use
(figure 1). Two sets of manual delineation masks of ten test
patients, drawn by two individual specialists, were likewise
presented to the clinical raters together with one set of predicted
masks (figure 1), in order to compare the clinical rating of manual
references and predicted masks.
Results
The highest metric performance was obtained by retraining the
network end-to-end on the local dataset, using the original
segmentation model for transfer learning, combined with both
regularisation techniques. This combination resulted in the
following average segmentation performance across the ten test
patients: DSC: 0.53 (0.14), AVD: 133.75 (109.75), precision:
0.63 (0.19) and recall: 0.88 (0.08).
In the clinical evaluation of the extended and recalibrated model, 94%
of the 53 masks were rated at least acceptable for clinical practice,
with 34% rated as perfect (figure 2). In two of the three masks
rated unacceptable, the cause was a large false positive delineation
in the choroid plexus (figure 2). Generally, the model performed well
at detecting lesions but was oversensitive towards hyperintensities,
delineating larger areas than the reference and producing a number of
false positives (figure 4). This is also evident from the high
average recall and lower precision. None of the ten test-patient
masks delineated by the model were rated clinically unacceptable,
whereas 3/10 masks delineated by rater 1 were rated unacceptable.
Discussion
Results of the clinical evaluation show that this automatic
segmentation model can produce segmentation maps with a high degree
of clinical value. The algorithm does, however, suffer from a
moderate false positive rate. This can possibly be alleviated by
subsequent manual correction by an expert at limited time
expenditure, which would still be faster than entirely manual
delineation. In the
comparison between the automatic method and the two clinical experts,
the algorithm obtained a higher share of accepted masks. Although
this comparison was conducted on a very small test dataset, it
highlights the difficulty of MS lesion delineation, as a mask
produced by one clinical expert can be deemed unacceptable by
another. It also exemplifies the problem of using overlap-based
computational metrics, such as the DSC, as the primary evaluation
tool, since the manual references are heavily influenced by
inter-rater variability. All patients were in early stages of MS, and
the model should therefore be trained on a larger, more diverse
dataset.
Conclusion
After
extension and recalibration, a clinical evaluation showed that the
model had a high degree of clinical applicability. With subsequent
assessment by clinical experts, the algorithm will be able to
contribute a higher robustness to the delineation process as well as
a decreased assessment time.
References
1. García-Lorenzo D,
Francis S, Narayanan S, Arnold DL, Collins DL. Review of automatic
segmentation methods of multiple sclerosis white matter lesions on
conventional magnetic resonance imaging. Med
Image Anal.
2013;17(1):1-18. doi:10.1016/j.media.2012.09.004
2. van Ginneken B,
Heimann T, Styner M. 3D segmentation in the clinic: A grand challenge.
Int
Conf Med Image Comput Comput Assist Interv.
2007;10(May 2014):7-15.
http://grand-challenge2008.bigr.nl/proceedings/pdfs/msls08/Styner.pdf.
3. Carass A, Roy S, Jog
A, et al. Longitudinal multiple sclerosis lesion segmentation:
Resource and challenge. NeuroImage. 2017;148:77-102.
doi:10.1016/j.neuroimage.2016.12.064
4. Styner M, Lee J, Chin B, et al. Editorial: 3D Segmentation in the
clinic: A grand Challenge II: MS Lesion Segmentation. In: Grand
Challenge Workshop: Multiple Sclerosis Lesion Segmentation Challenge.
Midas. 2008:1-6.
5. Li H, Jiang G, Zhang
J, et al. Fully
Convolutional Network Ensembles for White Matter Hyperintensities
Segmentation in MR Images. 2018.
6. Kuijf HJ, Biesbroek JM, de Bresser J,
et al. Standardized
Assessment of Automatic Segmentation of White Matter
Hyperintensities; Results of the WMH Segmentation Challenge. IEEE
Trans Med Imaging.
2019. doi:10.1109/tmi.2019.2905770