3815

Quantitative Assessment of Segmented Masks: A Deep Learning Regression and Classification Study

Ponnam Mahendhar GOUD¹, Ashish Saxena¹, Chitresh Bhushan², Sandeep Kaushik³, Soumya Ghose², and Dattesh Shanbhag¹
¹GE HealthCare, Bengaluru, India, ²GE HealthCare, Niskayuna, NY, United States, ³GE HealthCare, Munich, Germany

Synopsis

Keywords: Other AI/ML, Spinal Cord

Motivation: The ability to track real-world performance of AI-based spine segmentation models without access to ground-truth data.

Goal(s): Develop AI models which allow prediction of spine vertebrae segmentation model performance in real time.

Approach: We developed regression and classification deep learning (DL) models that estimate the quality of segmentation results from a parent segmentation DL model in terms of the Dice overlap metric.

Results: The regression model achieved a Dice prediction error of 4.3%, while the categorical classification model achieved sensitivity between 63% and 87% across evaluation categories. Combining the regression and classification models improved performance evaluation, with sensitivity between 71% and 91%.

Impact: We developed DL models that automatically evaluate the accuracy of spine-vertebrae segmentation models during deployment in clinical practice, without access to ground truth, both quantitatively (Dice) and qualitatively (Perfect, Good, Medium, Poor). This enables automatic logging of model effectiveness on real-world data.

Introduction

Accurate segmentation of the spine is a critical step in multiple medical imaging tasks such as MRI scan planning, anomaly detection, and surgical planning [1]. Deep learning-based approaches have accelerated the development of vertebrae segmentation for these tasks [2]. However, evaluation of these deep learning models once deployed in clinical practice is mostly based on user feedback. It would be valuable to automate this feedback loop by monitoring performance in real time and alerting the user to any anomalies observed. This is also critical from the regulatory standpoint, which encourages real-world performance reporting of model effectiveness in clinical practice [3]. In this work, we aim to evaluate the quality of spine segmentation masks from deep learning models without access to ground-truth markings. We demonstrate this by training a neural network to generate Dice overlap-based quality metrics from DL-predicted vertebrae masks.

Methods

Data

Data for our study came from 310 patients across multiple sites, field strengths (1.5T and 3T), and patient conditions (degenerative spine, scoliosis, and metal implants). We utilized 3-plane SSFSE localizer data from coronal and sagittal orientations across cervical, cervico-thoracic, thoraco-lumbar, and lumbar stations. Ground-truth masks were generated on sagittal T1 data and transferred using rigid registration to all three orientation localizer images (Figure 1(a-b)). In real-world scenarios, spine segmentation masks may have missed or mislabeled vertebrae. To emulate this in our dataset, we intentionally removed a few vertebrae from a subset of the segmentation masks (Figure 1(c-d)).

Model details

Our deep learning methodology comprises two approaches, regression and classification, both using the VGG (Visual Geometry Group) architecture as the backbone of our neural network [4]. Other hyper-parameters, chosen based on preliminary experiments and grid search, were a batch size of 8 and the Adam optimizer (LR = 0.00001). In the regression model (Figure 2a), the network was trained to predict the Dice score of a given 3D mask with respect to its original GT mask, using the mean absolute error (MAE) as the loss function. In the classification model (Figure 2b), we categorized the masks into 4 classes, namely Perfect (Dice score > 0.95), Good (Dice score of 0.8-0.95), Medium (Dice score of 0.5-0.8), and Poor (Dice score < 0.5) (Figure 3). Here, we used a categorical cross-entropy loss function to train the model.
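The class definitions and the regression loss above can be written out as a short sketch. The abstract does not say which class a Dice score exactly on a boundary (0.5, 0.8, 0.95) belongs to, so the inclusive and exclusive edges below are an assumption, as are the function names.

```python
import numpy as np

def dice_to_class(dice):
    """Map a Dice score to one of the four quality classes.

    Boundary handling is an assumption: 0.95 counts as Good, 0.8 as Good,
    and 0.5 as Medium.
    """
    if dice > 0.95:
        return "Perfect"
    if dice >= 0.8:
        return "Good"
    if dice >= 0.5:
        return "Medium"
    return "Poor"

def mean_absolute_error(predicted, actual):
    """MAE between predicted and actual Dice scores (the regression loss)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

# Illustrative values only, not data from the study.
actual_dice = [0.97, 0.85, 0.62, 0.30]
predicted_dice = [0.96, 0.80, 0.70, 0.33]

labels = [dice_to_class(d) for d in actual_dice]
mae = mean_absolute_error(predicted_dice, actual_dice)
print(labels, round(mae, 4))
```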

Results and Discussion

For the regression model, the R² value (0.95) between predicted and actual Dice scores shows an excellent linear correlation (Figure 4a). The average error between actual and predicted Dice scores was 4.28%. Within the Dice score sub-classes, this error was highest in the Good (Dice 0.8 - 0.95) and Medium (Dice 0.5 - 0.8) classes compared to the two extreme classes (Figure 4b). This indicates that the model regresses mid-range Dice scores less accurately. The same pattern appears in the classification experiment, where the model shows lower sensitivities of 66% and 63% for Good and Medium class samples, respectively (Figure 4c). The classification model has an overall accuracy of 75%. We observed that most misclassifications occurred for images with Dice scores close to the boundaries separating the four classes, as is evident from the confusion matrix (Figure 5a).

Combining the classification and regression models could provide a more robust solution. When presented with a test sample, the classification model furnishes a Dice class, while the regression model yields a Dice score. Given the predicted Dice score of a sample, we test a class reassignment condition in which the Dice class range is expanded by the mean absolute error of the regression model, and thereby determine a new class for the test sample. Figure 5b shows the confusion matrix for this joint classification and regression analysis. The joint model results in a substantial increase in classification sensitivity for each of the four classes (Figure 5c). The overall accuracy of the joint model was 82.1%, which is 9.5% higher in relative terms than the classification model alone.
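One possible reading of the reassignment condition is sketched below. The abstract does not give the exact rule, so this is an assumption: the classifier's class is kept when the regressed Dice score falls inside that class's range widened by the regression MAE on each side, and otherwise the sample is reassigned to the class implied by the regressed score. The MAE value of 0.0428 assumes the reported 4.28% error is expressed in Dice units; the function names and boundary handling are hypothetical.

```python
# Dice ranges for each class, taken from the class definitions in Methods.
CLASS_RANGES = {
    "Poor": (0.0, 0.5),
    "Medium": (0.5, 0.8),
    "Good": (0.8, 0.95),
    "Perfect": (0.95, 1.0),
}

def class_from_dice(dice):
    """Class implied directly by a Dice score (boundary handling is an assumption)."""
    if dice > 0.95:
        return "Perfect"
    if dice >= 0.8:
        return "Good"
    if dice >= 0.5:
        return "Medium"
    return "Poor"

def joint_class(classifier_class, regressed_dice, mae=0.0428):
    """Combine classifier and regressor outputs for one sample.

    Keep the classifier's class if the regressed Dice lies within that class's
    range expanded by the regression MAE; otherwise trust the regressor.
    """
    low, high = CLASS_RANGES[classifier_class]
    if low - mae <= regressed_dice <= high + mae:
        return classifier_class
    return class_from_dice(regressed_dice)

# Illustrative cases only, not data from the study.
print(joint_class("Good", 0.78))     # within expanded Good range, class kept
print(joint_class("Good", 0.60))     # far from Good range, reassigned
print(joint_class("Perfect", 0.92))  # within expanded Perfect range, class kept
```

Under this reading, the two models only disagree in a way that changes the outcome when the regressed score is further from the classifier's class than the regressor's own typical error.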

Conclusion

This study demonstrates a novel approach to evaluating the quality of spine segmentation masks generated during spine MR scan planning. We trained a regression model and a classification model and combined their predictions to assign each mask to a Perfect, Good, Medium, or Poor quality class. Combining predictions from both models improves the accuracy of the class assignment for a given test sample. This will allow reliable prediction of vertebrae segmentation model performance in real-world scenarios.

Acknowledgements

No acknowledgement found.

References

1. Chang, Heyou, et al. "Multi-vertebrae segmentation from arbitrary spine MR images under global view." Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23. Springer International Publishing, 2020.
2. Das, Pabitra, et al. "Deep neural network for automated simultaneous intervertebral disc (IVDs) identification and segmentation of multi-modal MR images." Computer Methods and Programs in Biomedicine 205 (2021): 106074.
3. Choi, Kukjin, et al. "Deep learning for anomaly detection in time-series data: review, analysis, and guidelines." IEEE Access 9 (2021): 120043-120065.
4. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

Figures

Figure 1: (a) Cervical and (b) Lumbar images in sagittal view with vertebra and disc masks. (c-d) Corrupted lumbar mask images.

Figure 2: (a) Regression and (b) classification model architecture

Figure 3: Sample images from each of the four classes defined with Dice score range: Perfect (0.95 - 1), Good (0.8 - 0.95), Medium (0.5 - 0.8), and Poor (0 - 0.5)

Figure 4: (a) Actual versus predicted Dice score from the regression model with linear fitting. (b) Respective error box plot in different Dice ranges.

Figure 5: (a) Classification model confusion matrix. (b) Joint regression and classification model confusion matrix. (c) Classification sensitivity for each class from the classification model and from the joint classification and regression model.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

DOI: https://doi.org/10.58530/2024/3815