1147

A Report on the International Workshop on Osteoarthritis Imaging Segmentation Challenge: A Multi-Institute Evaluation on a Standard Dataset

Arjun D. Desai^1,2, Francesco Caliva³, Claudia Iriondo³, Naji Khosravan⁴, Aliasghar Mortazi⁴, Sachin Jambawalikar⁵, Drew Torigian⁶, Jutta Ellerman⁷, Mehmet Akçakaya⁸, Ulas Bagci⁴, Radhika Tibrewala³, Io Flament³, Matt O'Brien³, Sharmila Majumdar³, Mathias Perslev⁹, Akshay Pai⁹, Christian Igel⁹, Erik B. Dam⁹, Sibaji Gaj¹⁰, Mingrui Yang¹⁰, Kunio Nakamura¹⁰, Xiaojuan Li¹⁰, Cem M. Deniz¹¹, Vladimir Juras¹², Ravinder Regatte¹¹, Garry E. Gold², Brian A. Hargreaves², Valentina Pedoia³, and Akshay S. Chaudhari²
¹Electrical Engineering, Stanford University, Stanford, CA, United States, ²Radiology, Stanford University, Stanford, CA, United States, ³Radiology, University of California San Francisco, San Francisco, CA, United States, ⁴University of Central Florida, Orlando, FL, United States, ⁵Radiology, Columbia University Medical Center, New York, NY, United States, ⁶Radiology, University of Pennsylvania, Philadelphia, PA, United States, ⁷Radiology, University of Minnesota, Minneapolis, MN, United States, ⁸Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, United States, ⁹Computer Science, University of Copenhagen, Copenhagen, Sweden, ¹⁰Biomedical Engineering, Cleveland Clinic, Cleveland, OH, United States, ¹¹Radiology, New York University Langone Health, New York, NY, United States, ¹²Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria

Synopsis

Cartilage thickness can be predictive of joint health. However, manual cartilage segmentation is tedious and prone to inter-reader variations. Automated segmentation using deep-learning is promising; yet, heterogeneity in network design and lack of dataset standardization has made it challenging to evaluate the efficacy of different methods. To address this issue, we organized a standardized, multi-institutional challenge for knee cartilage and meniscus segmentation. Results show that CNNs achieve similar performance independent of network architecture and training design and, given the high segmentation accuracy achieved by all models, only a weak correlation between segmentation accuracy metrics and cartilage thickness was observed.

Introduction

Cartilage degradation is one of the hallmarks of osteoarthritis (OA). Changes in cartilage thickness have shown to be predictive of radiographic joint-space loss and joint arthroplasty¹. While manual segmentation is the gold-standard for evaluating cartilage morphology, it is time-consuming and has high inter-reader variability². Recent advances in deep-learning and convolutional neural networks (CNNs) have simplified the development of automatic segmentation tools^3–5. However, challenges in characterizing network training and the heterogeneity of datasets used for network evaluation have limited pervasive utilization of these techniques. To address these limitations, a segmentation challenge was organized at the International Workshop on Osteoarthritis Imaging (IWOAI) in June 2019. Here, we present a summary of the challenge submissions and discuss efficacy of diverse, multi-institutional deep-learning approaches for segmenting knee cartilage and meniscus.

Methods

As part of the challenge, six teams trained CNNs to segment femoral cartilage, tibial cartilage, patellar cartilage, and menisci in 3D sagittal double-echo steady-state (DESS) scans from the Osteoarthritis Initiative⁶. The dataset consisted of 88 subjects with Kellgren-Lawrence grades (KLG) between 1 and 4 scanned at two timepoints, who were randomly split into cohorts of 60 for training, with baseline KLG 1/2/3/4 distribution (1,22,36,1), 14 for validation (1,4,8,1), and 14 for testing (0,5,8,1). Challenge participants were blinded to the all subject-identifying information.

Approaches among all teams varied considerably in network architectures, training protocols, and data augmentation methods. Deniz, et al. trained a multi-class 3D U-Net with dilated convolutions using a joint weighted cross-entropy and soft-Dice loss^7,8. Gaj, et al. modified the DeeplabV3+ architecture to use dense convolutional blocks and trained with a soft-Dice loss⁹. Iriondo, et al. designed a multi-stage segmentation network built with a cascaded ensemble of 3D and 2D V-Nets and used intensity and geometric transforms for data augmentation. Perslev, et al. sampled 2D slices from multiple planes in the volume to train a 2D U-Net with batch normalization and nearest-neighbor upsampling¹⁰. Khosravan, et al. implemented a generative adversarial framework where discriminators trained to differentiate between real and generated 2D slices and 2D volumetric projections of segmentations supervised the segmentation network¹¹. Following the challenge, a sixth submission (Desai, et al.) utilized a simplified 2D, multi-class U-Net optimized on a soft-Dice objective⁵.

Dice overlap (Dice), volumetric overlap error (VOE), coefficient of variation (CV), and average symmetric surface distance (ASSD) assessed pixel-wise segmentation accuracy compared to expert-annotated ground truth. Differences in cartilage thickness were used to evaluate the efficacy of automatic versus ground-truth segmentations. Dice overlap between network segmentations (termed Dice correlation) was used to evaluate the segmentation similarity between different networks. Correlation between pixel-wise segmentation metrics (Dice, VOE, CV, and ASSD) and cartilage thickness error was measured using Pearson correlation coefficients. Statistical comparisons were performed using Kruskal-Wallis tests, and corresponding Dunn post-hoc tests with Bonferroni correction (α=0.05).

Results

Comparisons of segmentation accuracy among all networks showed similar segmentation performance (Fig. 1). No significant differences were observed in Dice, CV, VOE, ASSD for femoral cartilage (p=1.0), tibial cartilage (p=1.0), patellar cartilage (p=1.0), and menisci (p=1.0) among the four top-performing networks (Gaj, Iriondo, Perslev, Desai, respectively). Dice correlation between individual networks was >0.9 for femoral cartilage, >0.88 for tibial cartilage, >0.86 for menisci, and >0.85 for patellar cartilage (Fig. 2). Networks also displayed similar segmentation accuracy trends across DESS slices in the medial-lateral direction (Fig. 3). There was no systematic underestimation or overestimation of cartilage thickness (Fig. 4). Bland-Altman limits of agreement 95% confidence intervals were 0.36mm for femoral cartilage, 0.32mm for tibial cartilage, and 0.42mm for patellar cartilage. Comparisons of thickness errors showed no significant difference among a majority of the networks (p=0.99).

Correlation between pixel-wise segmentation accuracy metrics and cartilage thickness ranged from very-weak to moderate (highest Pearson-R=0.41). Highest correlations were observed with femoral cartilage thickness (Pearson-R>0.25), while very-weak correlation was observed with tibial cartilage (Pearson-R<0.2). CV had the highest correlation with femoral and patellar cartilage thickness (Pearson-R=0.41,0.32).

Discussion

Despite the vast variety of network approaches, most methods achieved similar segmentation and thickness accuracy across all tissues. High Dice correlation between all networks suggested strong concordance in volumetric similarity of the segmentations. Near-identical slice-wise Dice accuracies and failure regions indicated that all networks systematically performed worse in the intercondylar notch and the medial compartment, which is more commonly affected in subjects with OA. The similarity in performance and limitations may suggest that independent networks, regardless of their design and training framework, learn to represent and segment the knee in similar ways.

While networks performed similarly, there was higher variability in their thickness estimates. Median cartilage thickness errors were <0.2mm, roughly half the resolution of the DESS voxels. The correlation between standard segmentation metrics and cartilage thickness was weak and tissue-dependent, suggesting that using traditional evaluation metrics on high-performing models may not be predictive of differences in thickness accuracy outcomes.

Conclusion

Through the IWOAI segmentation challenge, we established a standardized and easy-to-use dataset to train and evaluate knee segmentation algorithms. Moreover, using deep-learning-based segmentation algorithms from multiple institutions, we showed that networks with varying training paradigms achieve similar performance and that amongst models achieving high segmentation performance, current segmentation accuracy metrics are weakly correlated with cartilage thickness endpoints.

Acknowledgements

Research support provided by NIH AR0063643, NIH EB002524, NIH AR062068, NIH EB017739, NIH EB015891, NIH AR074453, NIH AR070902, NIH AR073552, NSF 1656518, GE Healthcare, and Philips.

References

1. Eckstein F, Kwoh CK, Boudreau RM, et al. Quantitative MRI measures of cartilage predict knee replacement: A case-control study from the Osteoarthritis Initiative. Ann Rheum Dis. 2013;72(5):707-714. doi:10.1136/annrheumdis-2011-201164

2. Eckstein F, Kwoh CK, Link TM. Imaging research results from the Osteoarthritis Initiative (OAI): A review and lessons learned 10 years after start of enrolment. Ann Rheum Dis. 2014;73(7):1289-1300. doi:10.1136/annrheumdis-2014-205310

3. Norman B, Pedoia V, Majumdar S. Use of 2D U-net convolutional neural networks for automated cartilage and meniscus segmentation of knee MR imaging data to determine relaxometry and morphometry. Radiology. 2018;288(1):177-185. doi:10.1148/radiol.2018172322

4. Liu F. SUSAN: segment unannotated image structure using adversarial network. Magn Reson Med. 2019;81(5):3330-3345. doi:10.1002/mrm.27627

5. Desai AD, Gold GE, Hargreaves BA, Chaudhari AS. Technical Considerations for Semantic Segmentation in MRI using Convolutional Neural Networks. 2019. http://arxiv.org/abs/1902.01977.

6. Peterfy CGMD, Ph D, D ESP, D MNP. Review The osteoarthritis initiative : report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthr Cartil. 2008;16(12):1433-1441. doi:10.1016/j.joca.2008.06.016

7. Deniz CM, Xiang S, Hallyburton RS, et al. Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks. Sci Rep. 2018;8(1). doi:10.1038/s41598-018-34817-6

8. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. :1-8.

9. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol 11211 LNCS. Springer Verlag; 2018:833-851. doi:10.1007/978-3-030-01234-2_49

10. Perslev M, Dam EB, Pai A, Igel C. One Network to Segment Them All: A General, Lightweight System for Accurate 3D Medical Image Segmentation. In: ; 2019:30-38. doi:10.1007/978-3-030-32245-8_4

11. Khosravan N, Mortazi A, Wallace M, Bagci U. PAN: Projective Adversarial Network for Medical Image Segmentation. 2019. http://arxiv.org/abs/1906.04378. Accessed September 4, 2019.

Figures

Fig. 1: Performance summary of networks submitted for segmentation challenge for all tissues as measured by average symmetric surface distance (ASSD, in mm), coefficient of variation (CV), Dice overlap (Dice), volumetric overlap error (VOE), and thickness error (in mm). Network performances are indicated by violin plots, which overlay distributions over box plots. Longer plots indicate larger variance in network performance among scans. Thickness metrics were not calculated for meniscus. All networks on average achieved similar performance across the different metrics.

Fig. 2: Dice correlations were used to evaluate inter-network variations among segmentations from different networks for femoral cartilage (A), tibial cartilage (B), patellar cartilage (C), and meniscus (D). Strong inter-network correlation was observed for femoral cartilage, tibial cartilage, and menisci and moderately strong correlation was displayed for patellar cartilage.

Fig. 3: Depth-wise region of interest distribution. Segmentation accuracy using Dice as a function of slice location from the medial (M) to lateral (L) end. The field of view is normalized (0-100%) based on the first and last slice with a ground truth segmentation in each scan. All networks have similar trends in performance across different regions of the knee. All networks share failure points at the intercondylar notch (~40% FOV) and have considerably lower performance in the medial condyle.

Fig. 4: Bland-Altman plots were used to compare thickness (per scan, Kellgren-Lawrence Grade: 1-4) between automatic and ground-truth segmentations for femoral (A), tibial (B), and patellar (C) cartilage. Negligible bias (gray dotted line) was observed for all three tissues among all networks; however, the limits of agreement (2*standard error of mean) confidence interval (between gray dashed lines) were broad, indicating high variance in thickness difference (<0-underestimation, >0-overestimation). Note: Different scales for axes.

Fig. 5: Correlation between pixel-wise segmentation metrics and cartilage thickness error as measured by the Pearson’s correlation coefficient (ρ). Minimal correlation was observed for all tissues across networks, all of which achieved high segmentation performance. This may suggest that given high performance among models as measured by pixel-wise segmentation metrics, there is a negligible difference in diagnostic metrics.

Proc. Intl. Soc. Mag. Reson. Med. 28 (2020)

1147