4353

Better Inter-observer agreement for Stroke Segmentation on DWI in Deep Learning Models than Human Experts
Shao Chieh Lin1,2, Chun-Jung Juan2,3,4, Ya-Hui Li2, Ming-Ting Tsai2, Chang-Hsien Liu2, Hsu-Hsia Peng5, Teng-Yi Huang6, Yi-Jui Liu7, and Chia-Ching Chang2,8
1Ph.D. program in Electrical and Communication Engineering, Feng Chia University, Taichung, Taiwan, 2Department of Medical Imaging, China Medical University Hsinchu Hospital, Hsinchu, Taiwan, 3Department of Radiology, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan, 4Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 5Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan, 6Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, 7Department of Automatic Control Engineering, Feng Chia University, Taichung, Taiwan, 8Department of Management Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Synopsis

Inter-observer agreement is commonly used to evaluate the consistency of clinical diagnosis between two or more doctors. However, it is seldom used to evaluate the consistency between two or more deep learning models. In this study, four deep learning models for segmentation of stroke lesions were trained using ground truths (GTs) defined by two neuroradiologists with two ADC thresholds. We found that adding an ADC threshold (0.6 × 10⁻³ mm²/s) helps eliminate inter-observer variation and achieves the best segmentation performance. Agreement between the two deep learning models was higher than agreement between the two neuroradiologists.

Introduction

Automatic segmentation of stroke lesions using deep learning models trained with the hyperintense lesion manually contoured on DWI as ground truth (GT) has been performed for ischemic stroke at the hyper-acute [1], acute [2], subacute [3], and chronic [4] stages. The performance of automatic segmentation by a deep learning model has been found to depend on the ADC threshold used to define the GT from the hyperintense lesion on DWI [5]. Inter-observer agreement is commonly used to evaluate the consistency of clinical diagnosis between two doctors. However, it has rarely been used to evaluate the consistency between two deep learning models. In this study, four deep learning models for automatic segmentation of stroke lesions were trained, each using a GT defined by one of two neuroradiologists with one of two ADC thresholds. The agreement between the two neuroradiologists and the agreement between the two deep learning models were then analyzed to investigate the diagnostic consistency of doctors and of deep learning models.

Materials and Methods

Patients: This study recruited a convenience sample of 266 patients (121 in the training dataset and 145 in the test dataset) with clinical symptoms of acute ischemic infarction within 7 days of last known onset.
MRI scans: MR studies were performed on either 1.5T (MR450w and Signa HDxt; GE Healthcare) or 3.0T (MR750w and Signa HDxt; GE Healthcare) scanners. Single-shot spin-echo DWI was acquired in the axial plane with diffusion gradients (b factors) of 0 and 1000 s/mm² applied in each of three orthogonal directions.
Data processing: Image pre-processing and Unet model training are summarized in Fig. 1.
Step 1. ADC maps were generated by pixel-by-pixel computation from the b0 and b1000 images based on a mono-exponential model, SI_b1000 = SI_b0 × e^(−bD).
Step 2. A brain mask was applied to trim non-brain structures and noise using a DWI threshold of <250 a.u., erosion and dilation operations, and an ADC threshold of >1.8 × 10⁻³ mm²/s [6].
Steps 3-4. For the training data, ischemic infarction was segmented independently by two neuroradiologists (observers A and B), who manually contoured the hyperintense lesion on DWI. For the test data, the stroke lesions were semiautomatically defined by an MR scientist and verified by the two neuroradiologists (observers A and B) in consensus.
Step 5. An ADC threshold (0.6 × 10⁻³ mm²/s) was added, yielding four GTs: those of observers A and B on DWI alone and on DWI with the ADC threshold.
Step 6. DWI and ADC maps were sent for data input.
Step 7. One set of DWI and two sets of ADC maps were combined as ensemble images.
Steps 8-9. The ensemble images and the four GTs were passed to the Unet as input images and ground truth, yielding four trained Unet models (Models A and B, trained on the GTs of observers A and B, each from DWI alone and from DWI with the ADC threshold).
Step 10. The test data were sent to the four Unet models for automatic segmentation of stroke lesions.
Step 11. Four prediction maps were generated.
Step 12. The Dice similarity coefficient (DSC) was used to evaluate the similarity of the GTs defined by the different neuroradiologists and the performance of the proposed semantic segmentation models.
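Steps 1, 5, and 12 can be illustrated with a minimal sketch, given here only as one possible reading of the text and not as the authors' implementation. It inverts the mono-exponential model to obtain ADC, restricts a manual contour to low-ADC pixels, and computes the DSC between two masks. The function names, the use of nested lists, the handling of non-positive signals, the "at or below" direction of the 0.6 × 10⁻³ mm²/s threshold, and the convention that two empty masks give a DSC of 1 are all assumptions; the 2×2 values are made-up toy data.

```python
import math

B_VALUE = 1000.0   # s/mm^2, the diffusion weighting stated in the text
ADC_CORE = 0.6e-3  # mm^2/s, the ADC threshold stated in the text


def adc_map(b0, b1000, b=B_VALUE):
    """Pixel-wise ADC from SI_b1000 = SI_b0 * exp(-b * D), i.e. D = ln(SI_b0 / SI_b1000) / b."""
    out = []
    for row0, row1 in zip(b0, b1000):
        out_row = []
        for s0, s1 in zip(row0, row1):
            # Assumption: pixels with non-positive signal get ADC 0 instead of raising.
            out_row.append(math.log(s0 / s1) / b if s0 > 0 and s1 > 0 else 0.0)
        out.append(out_row)
    return out


def apply_adc_threshold(mask, adc, threshold=ADC_CORE):
    """Keep contoured pixels whose ADC is at or below the threshold (assumed direction)."""
    return [[bool(m) and a <= threshold for m, a in zip(mrow, arow)]
            for mrow, arow in zip(mask, adc)]


def dice(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A and B| / (|A| + |B|)."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    inter = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Assumption: two empty masks are treated as perfect agreement.
    return 2.0 * inter / total if total else 1.0


# Toy 2x2 example (made-up values, not study data).
b0 = [[1000.0, 1000.0], [1000.0, 1000.0]]
true_adc = [[0.5e-3, 0.8e-3], [0.5e-3, 1.0e-3]]
b1000 = [[s * math.exp(-B_VALUE * d) for s, d in zip(srow, drow)]
         for srow, drow in zip(b0, true_adc)]
adc = adc_map(b0, b1000)

gt_a = [[1, 1], [0, 0]]  # hypothetical contour by observer A
gt_b = [[1, 0], [1, 0]]  # hypothetical contour by observer B

print("DSC, DWI alone:", dice(gt_a, gt_b))
print("DSC, DWI with ADC threshold:",
      dice(apply_adc_threshold(gt_a, adc), apply_adc_threshold(gt_b, adc)))
```

In practice an array library would replace the nested loops; plain lists are used here only to keep the sketch dependency-free.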
Statistical analysis: Bland-Altman plots and the intraclass correlation coefficient (ICC) were used to assess interobserver agreement [7]. The nonparametric Mann-Whitney U test was used for group comparisons. A P value less than 0.05 was considered statistically significant.
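The agreement statistics named above can be sketched as follows. The text does not state which ICC form was used or which per-patient quantity was compared, so this example assumes a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1), and paired measurements from two raters (for example, hypothetical lesion volumes). The Bland-Altman part returns the bias and the conventional 95% limits of agreement (mean difference ± 1.96 SD). Function names are illustrative.

```python
from statistics import mean, stdev


def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) for paired measurements x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_2_1(x, y):
    """ICC(2,1) for two raters: two-way random effects, absolute agreement, single measure.

    Assumption: this form is chosen for illustration; the source does not name one.
    """
    x, y = list(x), list(y)
    n, k = len(x), 2
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]  # per-subject means
    col_means = [mean(x), mean(y)]                   # per-rater means
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data (made up, not study data): rater B reads 1 unit higher than rater A.
rater_a = [1.0, 2.0, 3.0, 4.0]
rater_b = [2.0, 3.0, 4.0, 5.0]
print(bland_altman(rater_a, rater_b))
print(icc_2_1(rater_a, rater_b))
```

Because the assumed ICC form measures absolute agreement, a constant offset between raters lowers it even when the two sets of readings are perfectly correlated, which is why it complements the bias reported by the Bland-Altman analysis.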

Results

Box-and-whisker plots of the DSC of stroke lesions in the test data, by observer, for DWI alone and for DWI combined with the ADC threshold are shown in Fig. 2. DSC was higher for DWI combined with the ADC threshold than for DWI alone, both for the deep learning models trained on the GTs of observer A (all P < .001) and for those trained on the GTs of observer B (all P < .05). Box-and-whisker plots of the inter-observer agreement between the two neuroradiologists and between the deep learning models, using DWI alone and DWI with the ADC threshold, are shown in Fig. 3. The interobserver DSC on DWI alone was significantly lower than that on DWI combined with an ADC threshold of 0.6 × 10⁻³ mm²/s (P < .001). Bland-Altman plots of the inter-observer agreement between the two neuroradiologists and between the deep learning models, for stroke lesions defined on DWI alone and on DWI with the ADC threshold (0.6 × 10⁻³ mm²/s), are shown in Fig. 4 and Fig. 5. ICCs between the two neuroradiologists were 0.980 and 0.997 (P < .001) for DWI alone and DWI with the ADC threshold, respectively, and ICCs between the two deep learning models were 0.996 and 0.999 (P < .001) for DWI alone and DWI with the ADC threshold, respectively.

Discussion

Our results showed that adding an ADC threshold of 0.6 × 10⁻³ mm²/s outperformed DWI alone by reducing interobserver variation and achieving better segmentation performance for acute stroke lesions. In addition, agreement between the predictions of the two deep learning models was better than agreement between the ground truths of the two experts. In conclusion, artificial intelligence outperforms human experts in interobserver agreement and achieves better segmentation performance when an ADC threshold of 0.6 × 10⁻³ mm²/s is combined with DWI.

References

  1. Nag MK, Koley S, China D et al (2017) Computer-assisted delineation of cerebral infarct from diffusion-weighted MRI using Gaussian mixture model. Int J Comput Assist Radiol Surg 12:539-552.
  2. Chen L, Bentley P, Rueckert D (2017) Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks. Neuroimage Clin 15:633-643.
  3. Kamnitsas K, Ledig C, Newcombe VFJ et al (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 36:61-78.
  4. Pustina D, Coslett HB, Turkeltaub PE, Tustison N, Schwartz MF, Avants B (2016) Automated segmentation of chronic stroke lesions using LINDA: Lesion identification with neighborhood data analysis. Hum Brain Mapp 37:1405-1421.
  5. Juan CJ, Liu YJ, Lin SC, Jeng YH (2021) ADC and Size Dependent Segmentation Performance using Deep Learning. ISMRM 2021, 4131.
  6. Purushotham A, Campbell BC, Straka M et al (2015) Apparent diffusion coefficient threshold for delineation of ischemic core. Int J Stroke 10:348-353.
  7. Juan CJ, Chang HC, Hsueh CJ et al (2009) Salivary glands: echo-planar versus PROPELLER Diffusion-weighted MR imaging for assessment of ADCs. Radiology 253:144-152.

Figures

Figure 1. Flowchart of data processing for evaluating inter-observer agreement between two neuroradiologists and between their deep learning models.

Figure 2. DSC of prediction for the four models: Models A and B, trained on the GTs of observers A and B from DWI alone (1.8 × 10⁻³ mm²/s) and from DWI with the ADC threshold (0.6 × 10⁻³ mm²/s).

Figure 3. DSC of inter-observer agreement between the GTs of the two neuroradiologists in the training dataset and between the predictions of the two Unet models in the test dataset, for DWI alone (1.8 × 10⁻³ mm²/s) and DWI with the ADC threshold (0.6 × 10⁻³ mm²/s).

Figure 4. Bland-Altman plots of interobserver agreement between the two neuroradiologists on (A) DWI alone and (B) DWI with the ADC threshold (0.6 × 10⁻³ mm²/s).

Figure 5. Bland-Altman plots of interobserver agreement between the predictions of the two Unet models on (A) DWI alone and (B) DWI with the ADC threshold (0.6 × 10⁻³ mm²/s).

Proc. Intl. Soc. Mag. Reson. Med. 30 (2022)
4353
DOI: https://doi.org/10.58530/2022/4353