Ponnam Mahendhar GOUD1, Ashish Saxena1, Chitresh Bhushan2, Sandeep Kaushik3, Soumya Ghose2, and Dattesh Shanbhag1
1GE HealthCare, Bengaluru, India, 2GE HealthCare, Niskayuna, NY, United States, 3GE HealthCare, Munich, Germany
Synopsis
Keywords: Other AI/ML, Spinal Cord
Motivation: Ability to track real-world performance of AI based spine segmentation models without access to ground-truth data.
Goal(s): Develop AI models which allow prediction of spine vertebrae segmentation model performance in real time.
Approach: We developed regression and classification deep learning (DL) models that estimate the quality of segmentation results, in terms of the Dice overlap metric, from the output of a parent segmentation DL model.
Results: For the regression model, a Dice prediction error of 4.3% was obtained, while for the categorical classification model, sensitivity between 63% and 87% was observed across evaluation categories. Combining the regression and classification models improved performance evaluation, with sensitivity between 71% and 91%.
Impact: We developed DL models to automatically evaluate the accuracy of spine-vertebrae segmentation models during their deployment in clinical practice, without access to ground truth, both quantitatively (Dice) and qualitatively (perfect, good, medium, poor). This enables automatic logging of model effectiveness on real-world data.
Introduction
Accurate segmentation of the spine is a critical step in multiple medical imaging tasks, such as MRI scan planning, anomaly detection, and surgical planning [1]. Deep learning-based approaches have accelerated the development of vertebrae segmentation for these tasks [2]. However, evaluation of these deep learning models once deployed in clinical practice is mostly based on user feedback. It would be valuable to automate this feedback loop by monitoring performance in real time and alerting the user to any anomalies observed. This is also critical from a regulatory standpoint, which encourages real-world performance reporting of model effectiveness in clinical practice [3]. In this work, we aim to evaluate the quality of spine segmentation masks from deep learning models without access to ground-truth markings. We demonstrate this by training a neural network to generate Dice overlap-based quality metrics from DL-predicted vertebrae masks.
Methods
Data
Data for our study came from 310 patients across multiple sites, field strengths (1.5T and 3T), and patient conditions (degenerative spine, scoliosis, and metal implants). We utilized 3-plane SSFSE localizer data from coronal and sagittal orientations across cervical, cervico-thoracic, thoraco-lumbar, and lumbar stations. Ground-truth masks were generated on sagittal T1 data and transferred using rigid registration to all three orientation localizer images (Figure 1(a-b)). In a real-world scenario, spine segmentation masks may have missed or mislabeled vertebrae. To emulate this in our dataset, we intentionally removed a few vertebrae from a subset of the segmentation masks (Figure 1(c-d)).
Model details
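As a rough illustration of the mask degradation described in the Data section, the sketch below removes selected vertebrae from a labelled mask and computes the Dice overlap of the degraded mask against the original. It is not the authors' pipeline: the function names `dice_score` and `drop_vertebrae`, the assumption that each vertebra carries its own integer label, and the choice to compute Dice on the binarized foreground are all hypothetical choices made for the example.

```python
import numpy as np


def dice_score(pred, truth):
    """Binary Dice overlap; any nonzero voxel counts as foreground (assumption)."""
    pred = np.asarray(pred) > 0
    truth = np.asarray(truth) > 0
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


def drop_vertebrae(label_mask, labels_to_drop):
    """Return a copy of a labelled mask with the given vertebra labels set to 0."""
    degraded = np.array(label_mask, copy=True)
    degraded[np.isin(degraded, labels_to_drop)] = 0
    return degraded


# Toy volume with three equally sized "vertebrae" labelled 1, 2 and 3.
truth = np.zeros((4, 4, 6), dtype=np.int32)
truth[:, :, 0:2] = 1
truth[:, :, 2:4] = 2
truth[:, :, 4:6] = 3

# Emulate a segmentation that missed the middle vertebra.
degraded = drop_vertebrae(truth, [2])

# Dice of the degraded mask against the original mask.
target = dice_score(degraded, truth)
print(f"Dice after removing one of three vertebrae: {target:.2f}")
```

In a setup like this, the Dice value computed for each degraded mask could serve as the quantity a quality-prediction model is asked to estimate.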
Our deep learning methodology comprises two approaches, regression and classification, both using the VGG (Visual Geometry Group) architecture as the backbone of our neural network [4]. Other hyper-parameters, chosen based on preliminary experiments and grid search, were a batch size of 8 and the Adam optimizer (LR = 0.00001). In the regression model (Figure 2a), the network was trained to predict the Dice score of a given 3D mask with respect to its original GT mask, using the mean absolute error (MAE) as the loss function. In the classification model (Figure 2b), we categorized the masks into four classes: Perfect (Dice score > 0.95), Good (Dice score of 0.8-0.95), Medium (Dice score of 0.5-0.8), and Poor (Dice score < 0.5) (Figure 3). Here, we used a categorical cross-entropy loss function to train the model.
Results and Discussion
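For reference when reading the results, the four Dice classes from the model details and the per-class sensitivity and overall accuracy reported in this section can be written down as below. This is a minimal sketch, not the authors' code: the handling of scores that fall exactly on a class boundary (for example 0.8) is an assumption, since the abstract does not specify it, and the confusion matrix is a made-up toy example, not data from the study.

```python
CLASS_NAMES = ["Perfect", "Good", "Medium", "Poor"]


def dice_to_class(dice):
    """Map a Dice score in [0, 1] to one of the four quality classes.

    Thresholds follow the abstract; assigning exact boundary values
    (0.95, 0.8, 0.5) to the lower-quality side of 0.95 and the
    higher-quality side of 0.8 and 0.5 is an assumption.
    """
    if dice > 0.95:
        return "Perfect"
    if dice >= 0.8:
        return "Good"
    if dice >= 0.5:
        return "Medium"
    return "Poor"


def per_class_sensitivity(confusion):
    """Sensitivity per class, with confusion[i][j] = count of true class i
    predicted as class j."""
    result = []
    for i, row in enumerate(confusion):
        total = sum(row)
        result.append(row[i] / total if total else float("nan"))
    return result


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Toy confusion matrix (illustrative only), rows and columns in CLASS_NAMES order.
toy_confusion = [
    [8, 2, 0, 0],
    [1, 6, 3, 0],
    [0, 2, 7, 1],
    [0, 0, 1, 9],
]

for name, sens in zip(CLASS_NAMES, per_class_sensitivity(toy_confusion)):
    print(f"{name}: sensitivity {sens:.2f}")
print(f"Overall accuracy: {overall_accuracy(toy_confusion):.2f}")
```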
For the regression model, the R² value (0.95) between predicted and actual Dice scores indicates a strong linear correlation (Figure 4a). The average error between actual and predicted Dice scores was 4.28%. Within sub-classes of Dice score, this error was highest in the Good (Dice 0.8-0.95) and Medium (Dice 0.5-0.8) classes compared to the two extreme classes (Figure 4b). This suggests that the model regresses mid-range Dice scores less reliably. The same pattern appears in the classification experiment, where the model shows a lower sensitivity of 66% and 63% for Good and Medium class samples, respectively (Figure 4c). The classification model has an overall accuracy of 75%. We observed that most misclassifications occurred for images whose Dice score lies close to a boundary separating two of the four classes, as is evident from the confusion matrix (Figure 5a).
Combining the classification and regression models could provide a more robust solution. When presented with a test sample, the classification model furnishes a Dice class, while the regression model yields a Dice score. For the predicted Dice score of a given sample, we test a Dice class reassignment condition in which the Dice class range is expanded by the mean absolute error of the regression model, and from this we determine a new class for the test sample. Figure 5b shows the confusion matrix for this joint classification and regression analysis. The joint model yields a substantial increase in classification sensitivity for each of the four classes (Figure 5c). Overall accuracy of the joint model was 82.1%, roughly 9.5% higher in relative terms than the classification model alone.
Conclusion
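The Dice class reassignment described in the Results and Discussion is stated only briefly, so the sketch below shows just one possible reading of it, not the authors' implementation. The assumed rule is: keep the classifier's class if the regressed Dice score falls inside that class's range widened by the regression MAE on both sides; otherwise fall back to the class implied by the regressed Dice score. The names `joint_class`, `CLASS_RANGES` and `MAE_TOLERANCE`, the symmetric widening, and the boundary handling in `dice_to_class` are all assumptions.

```python
# Mean Dice prediction error reported for the regression model (4.3%),
# used here as the tolerance for widening class ranges (assumed usage).
MAE_TOLERANCE = 0.043

# Class ranges from the abstract as (lower bound, upper bound).
CLASS_RANGES = {
    "Poor": (0.0, 0.5),
    "Medium": (0.5, 0.8),
    "Good": (0.8, 0.95),
    "Perfect": (0.95, 1.0),
}


def dice_to_class(dice):
    """Map a Dice score to a class; boundary handling is an assumption."""
    if dice > 0.95:
        return "Perfect"
    if dice >= 0.8:
        return "Good"
    if dice >= 0.5:
        return "Medium"
    return "Poor"


def joint_class(predicted_class, predicted_dice, tolerance=MAE_TOLERANCE):
    """Combine the classifier's class with the regressor's Dice score.

    If the regressed Dice lies within the classifier's class range widened
    by `tolerance`, the two models are treated as agreeing and the
    classifier's class is kept; otherwise the class is reassigned from the
    regressed Dice score.
    """
    lower, upper = CLASS_RANGES[predicted_class]
    if lower - tolerance <= predicted_dice <= upper + tolerance:
        return predicted_class
    return dice_to_class(predicted_dice)


# Hypothetical model outputs for illustration.
print(joint_class("Good", 0.97))    # within widened Good range: kept
print(joint_class("Good", 0.60))    # far outside Good range: reassigned
print(joint_class("Medium", 0.82))  # within widened Medium range: kept
```

Under this reading, the regression output only overrides the classifier when the two disagree by more than the regression model's typical error, which targets the near-boundary samples where most misclassifications were observed.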
This study demonstrates a novel approach to evaluating the quality of spine segmentation masks generated during spine MR scan planning. We trained a regression model and a classification model and combined their predictions to assign each mask a quality class of perfect, good, medium, or poor. Combining predictions from both models improves class assignment for a given test sample. This will allow reliable prediction of vertebrae segmentation model performance in real-world scenarios.
Acknowledgements
No acknowledgement found.
References
1. Chang, Heyou, et al. "Multi-vertebrae segmentation from arbitrary spine MR images under global view." Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23. Springer International Publishing, 2020.
2. Das, Pabitra, et al. "Deep neural network for automated simultaneous intervertebral disc (IVDs) identification and segmentation of multi-modal MR images." Computer Methods and Programs in Biomedicine 205 (2021): 106074.
3. Choi, Kukjin, et al. "Deep learning for anomaly detection in time-series data: review, analysis, and guidelines." IEEE Access 9 (2021): 120043-120065.
4. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).