1406

Meniscal Tear Detection with Machine Learning: Initial Experience
Eric M Bultman1, Akshay S Chaudhari1, Arjun D Desai1, and Garry E Gold1

1Radiology, Stanford University, Stanford, CA, United States

Synopsis

Despite rapid recent advances in convolutional neural networks for image classification, the generalizability of these networks to medical image data has not been thoroughly investigated. In this work, we use two networks designed to classify ImageNet natural-image data, Inception-v3 and ResNet-50, and investigate their performance in classifying meniscal tears on MR examinations of the knee. Using limited segmentation and manual tear identification, slice-wise sensitivities of 0.68 and 0.58 are achieved for the respective networks. Applying the “two-slice-touch” rule significantly increases sensitivity, with a concomitant decrease in specificity. Our results support the feasibility of using CNNs for meniscal tear identification.

Introduction

Recent advances in convolutional neural networks (CNNs) have achieved super-human accuracy in classifying ImageNet natural-image data [1]. However, it is unclear whether this high classification accuracy translates to medical images. In this work, we investigate the ability of two CNNs that achieve state-of-the-art accuracy on the ImageNet natural-image classification task, Inception-v3 [2] and ResNet-50 [3], to identify meniscal tears on MR examinations of the knee.

Methods

Clinical reports from an institutional database of knee MR exams were reviewed, and 60 exams with meniscal tears were identified. Twenty-five normal exams and 15 exams depicting non-meniscal pathology (ligamentous injuries, fractures) were also identified, and these 100 patients/exams were used to characterize network performance. Fifteen exams (7 with tears) were designated as test data. The remaining exams were randomly divided into training and validation sets on a slice-wise, category-stratified basis in an 85/15 ratio.
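The slice-wise, category-stratified 85/15 split described above can be sketched with scikit-learn; the slice identifiers and label distribution below are hypothetical placeholders, not the study's actual data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical slice-level labels: 1 = tear slice, 0 = no tear.
slice_ids = list(range(1000))
labels = [1 if i % 4 == 0 else 0 for i in slice_ids]  # illustrative class imbalance

# 85/15 split, stratified by category so both sets preserve the tear ratio.
train_ids, val_ids, y_train, y_val = train_test_split(
    slice_ids, labels, test_size=0.15, stratify=labels, random_state=0
)

print(len(train_ids), len(val_ids))  # 850 150
```

Stratification matters here because tear slices are the minority class; an unstratified split could leave the validation set with too few positive examples to estimate sensitivity.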

Sagittal T2-FatSat and coronal PD-FatSat images were used for training, as the increased contrast from the fluid-sensitivity of these sequences enhances detection of meniscal pathology [4]. Images were acquired per clinical protocol at 512×512 resolution with 2.5 mm slice thickness. Using Matlab (MathWorks), images extending between the anterior/posterior or medial/lateral meniscal boundaries were segmented into 340×100 windows containing the menisci and tibiofemoral joints. Segmented image slices depicting meniscal tears were identified by a fourth-year radiology resident with reference to clinical reports. All slices extending between tear boundaries were labeled as tears, even if intermediate slices did not unequivocally depict tearing. No medial/lateral meniscal discrimination was made.
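The windowing step above (Matlab in the original) can be sketched in Python with NumPy; the window-center coordinates here are hypothetical, whereas in the study they would be derived from the meniscal boundaries on each slice:

```python
import numpy as np

def crop_meniscal_window(image, center_row, center_col, height=100, width=340):
    """Crop a height x width window (340x100 as in the text) around the
    tibiofemoral joint from a full-resolution slice."""
    r0 = max(center_row - height // 2, 0)
    c0 = max(center_col - width // 2, 0)
    return image[r0:r0 + height, c0:c0 + width]

slice_512 = np.zeros((512, 512), dtype=np.uint16)  # 16-bit MR slice
window = crop_meniscal_window(slice_512, 300, 256)
print(window.shape)  # (100, 340)
```

Using `uint16` input mirrors the 16-bit pipeline mentioned in the Methods; standard Keras loaders assume 8-bit images, which is why the authors note modifying the libraries.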

Training was performed using the Keras/TensorFlow libraries [5,6], modified to support 16-bit input. Network top layers were modified for a binary classification task (Figure 1), with 20% dropout to reduce over-fitting. Weighted cross-entropy loss was used to account for class imbalance, and network-specific Keras preprocessing functions were applied. Images were resized using bicubic interpolation to 299×299 for Inception-v3 and 224×224 for ResNet-50. Augmentation consisting of ±3° rotations and horizontal image flips was used during training. Networks were initialized either with ImageNet weights or without pre-trained weights (Xavier initialization) to investigate whether the classification power of ImageNet weights translates to MR images. Training used an SGD optimizer with learning rate 5×10⁻³, batch size 32, and an early stopping criterion of maximum validation accuracy with a patience of 5 epochs, on an NVIDIA GeForce GPU (NVIDIA Corporation).
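The training configuration described above can be sketched in Keras/TensorFlow. The top-layer structure follows Figure 1 and the stated hyperparameters; details not given in the text (e.g. `restore_best_weights`, the softmax two-unit head) are assumptions:

```python
import tensorflow as tf

# Base network without its ImageNet classifier; weights=None corresponds to
# the no-pretrained-weights arm (pass weights="imagenet" for the other arm).
base = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_shape=(299, 299, 3)
)

# Modified top layers: global average pooling, 20% dropout, 2-class head.
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
x = tf.keras.layers.Dropout(0.2)(x)
out = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(base.input, out)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=5e-3),
    loss="categorical_crossentropy",  # class imbalance handled via class_weight in fit()
    metrics=["accuracy"],
)

# Early stopping on maximum validation accuracy, patience of 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=5, restore_best_weights=True
)

# Augmentation as described: small rotations and horizontal flips.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=3, horizontal_flip=True
)
```

The same head would be attached to `tf.keras.applications.ResNet50` with a 224×224 input; the weighted cross-entropy loss in the text maps onto the `class_weight` argument of `model.fit`.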

Results

Results of training with ImageNet and with Xavier weight initialization are depicted in Figure 2, and model predictions on the 15 test cases are presented as confusion matrices in Figure 3. With ImageNet weight initialization, the slice-wise sensitivity of Inception-v3 for tear detection is somewhat greater than that of ResNet-50, while specificity is similar between the networks. Without pre-trained weights, neither network attains good sensitivity.

Trained networks initialized with ImageNet weights were further analyzed using the “two-slice-touch” rule [7]. Results are presented in Figure 4 and demonstrate increased tear-detection sensitivity but decreased specificity for both networks.
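The rule itself is simple to state in code: given ordered slice-wise predictions for one exam, the exam is called positive only if at least two contiguous slices are flagged. A minimal sketch:

```python
def two_slice_touch(slice_predictions):
    """Apply the 'two-slice-touch' rule: an exam is positive for a meniscal
    tear only if the model flags at least two contiguous slices.
    `slice_predictions` is an ordered list of 0/1 slice-wise calls."""
    return any(a == 1 and b == 1
               for a, b in zip(slice_predictions, slice_predictions[1:]))

print(two_slice_touch([0, 1, 0, 1, 0]))  # False: no contiguous pair
print(two_slice_touch([0, 1, 1, 0, 0]))  # True: slices 2-3 touch
```

This suppresses isolated single-slice false positives, which is consistent with the increased exam-level sensitivity but reduced specificity reported in Figure 4.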

Discussion / Conclusions

This preliminary study demonstrates the feasibility of using 2D CNNs to identify meniscal tears. Although both networks attained reasonable sensitivity when initialized with ImageNet weights, networks initialized without pre-trained weights failed to yield useful predictive models. This likely reflects the small amount of training data (4123 images) relative to the large number of free parameters in each model (ResNet-50: 25.6 million; Inception-v3: 23.9 million). High validation accuracy but poor test sensitivity for Inception-v3 without pre-trained weights indicates the network may be overfitting the training data. Interestingly, ResNet-50 initialized without pre-trained weights did not achieve high validation accuracy, suggesting its training parameters may have been suboptimal. Successful training with ImageNet weight initialization implies that a subset of these weights, presumably corresponding to basic image features, may generalize to medical image data.

Upon application of the “two-slice-touch” rule, both networks show greater sensitivity for tear detection than suggested by the slice-wise confusion matrices, though with concomitantly decreased specificity. The sensitivity of Inception-v3 was greater than that of ResNet-50 under both analyses. This may reflect superior feature extraction by Inception’s multi-scale convolutional modules and larger input size. Alternatively, there may be an element of feature loss in ResNet-50 due to its smaller input size.

This study is limited by its use of 2D CNNs, which treat slices as independent samples despite the fact that most meniscal tears extend over multiple contiguous slices. Recent work [8] has demonstrated the feasibility of meniscal tear identification using a 3D CNN for classification. However, most clinical knee exams utilize 2D acquisitions, and development of 2D tear-classification methods is of potentially greater clinical significance.

Future work will investigate whether network performance improves with meniscal segmentation prior to training, and whether supplementation with data of different contrast weighting can improve classification accuracy.

Acknowledgements

The authors acknowledge the support of the Moskowitz Scholar Fund.

References

1. Dodge S and Karam L. A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions. arXiv: 1705.02498 [cs], May 2017.

2. Szegedy C, Vanhoucke V, Ioffe S et al. Rethinking the Inception Architecture for Computer Vision. arXiv: 1512.00567 [cs], Dec 2015.

3. He K, Zhang X, Ren S et al. Deep Residual Learning for Image Recognition. arXiv: 1512.03385 [cs], Dec 2015.

4. Nguyen J, De Smet A, Graf B et al. MR Imaging–based Diagnosis and Classification of Meniscal Tears. RadioGraphics 2014;34(4):981-99.

5. Chollet F. Keras: Deep learning library for Theano and TensorFlow. URL: https://keras.io.

6. Abadi M, Barham P, Chen J et al. Tensorflow: a system for large-scale machine learning. OSDI 2016, 16:265-83.

7. De Smet A, Tuite M. Use of the “Two-Slice-Touch” Rule for the MRI Diagnosis of Meniscal Tears. AJR 2006;187:911-14.

8. Pedoia V, Norman B, Mehany S et al. 3D convolutional neural networks for detection and severity staging of meniscus and PFJ cartilage morphological degenerative changes in osteoarthritis and anterior cruciate ligament subjects. JMRI October 2018, In Press, doi:10.1002/jmri.26246.

Figures

Figure 1: Schematic representation of network top layers modified for a binary classification task. 20% dropout was applied to outputs of both networks after the average pooling step to reduce over-fitting.

Figure 2: Results from training the Inception-v3 and ResNet-50 networks with a stopping criterion based on validation accuracy. Similar validation accuracy was achieved in all cases except ResNet-50 initialized without pre-trained weights, for which training was stopped after 8 epochs by the early stopping criterion.

Figure 3: Classification results depicted as confusion matrices for Inception-v3 and ResNet-50 initialized with ImageNet weights or without pre-trained weights. Slice-wise sensitivity and specificity for tear detection are noted below the matrices.

Figure 4: Under the “two-slice-touch” rule, a meniscal tear is only considered present if identified on at least two contiguous slices. Upon application of this rule, Inception-v3 correctly classifies all test cases with meniscal tears and most cases without tears. Sensitivity of ResNet-50 is slightly lower, with 6/7 cases correctly classified. Interestingly, the negative exams misclassified as positive by the networks were different.

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)