1334

Evaluation of input data and UNet based convolutional network architectures for automated muscle annotation in 2D and 3D
Martijn Froeling 1, Lara Schlaffke2, Marlena Rohm2, Ivana Isgum3, Hermien E Kan4, and Jelmer M Wolterink3

1Department of Radiology, University Medical Center Utrecht, Utrecht, Netherlands, 2Department of Neurology BG, University Hospital Bergmannsheil, Ruhr-University Bochum, Bochum, Germany, 3Image Sciences Institute, University Medical Center Utrecht, Utrecht, Netherlands, 4Department of Radiology, C.J. Gorter Center for High Field MRI, Leiden University Medical Center, Leiden, Netherlands

Synopsis

Manual annotation of muscle is still one of the most time-consuming steps in skeletal muscle MRI research. In this study we investigated three aspects of automated muscle annotation using deep convolutional networks. First, we directly compared five different network architectures. Second, we compared the effect of providing various input data, all based on Dixon imaging. Third, we investigated the effect of the amount of training data provided to the network. In summary, we found that UNet-like convolutional networks allow for accurate and precise annotation of calf muscles in 2D and 3D, and that the input data provided is the strongest predictor of success.

Introduction

Manual annotation of muscle is still one of the most time-consuming steps in skeletal muscle MRI research. Consequently, many solutions have been proposed (1, 2) for automatic annotation of muscle volumes, including deep convolutional networks (3, 4). However, there are many candidate network architectures, and it is unknown which performs best. In addition, MRI can provide various contrasts that can serve as input to these networks. Since manual annotation is time-consuming, the amount of annotated training data should be kept minimal. In this study we investigated these three aspects of automated muscle annotation using deep convolutional networks. First, we directly compared five different network architectures. Second, we compared the effect of providing various input data, all based on Dixon imaging. Third, we investigated the effect of the amount of training data provided to the network.

Methods

Five different 2D and 3D segmentation networks were designed: UNet, UResNet, ResNet, UDenseNet, and DenseNet. The networks have varying numbers of convolution layers (depth) and trainable parameters, and are based on popular architectures in the literature (5–7) (see Fig. 1). To improve the flow of gradients through backpropagation, ResNet and DenseNet have skip connections. All networks were implemented in Mathematica 11 using a toolkit based on MXNet (github.com/mfroeling/UNET). As a loss function, we used a linear combination of cross-entropy, soft Dice, and Brier loss layers (8–11).
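The combined loss above can be sketched as follows. This is an illustrative NumPy version, not the authors' Mathematica/MXNet implementation; the equal default weights and the exact per-term formulations are assumptions for this sketch.

```python
import numpy as np

def combined_loss(probs, target, w_ce=1.0, w_dice=1.0, w_brier=1.0, eps=1e-7):
    """Linear combination of cross-entropy, soft Dice, and Brier losses.

    probs:  (N, C) softmax outputs per voxel
    target: (N, C) one-hot reference labels
    Weights w_ce/w_dice/w_brier are illustrative, not the authors' values.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # Cross-entropy: -sum over classes of t * log(p), averaged over voxels
    ce = -np.mean(np.sum(target * np.log(probs), axis=1))
    # Soft Dice loss: 1 minus the mean per-class soft Dice overlap
    inter = np.sum(probs * target, axis=0)
    denom = np.sum(probs, axis=0) + np.sum(target, axis=0)
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))
    # Brier score: mean squared difference between probabilities and labels
    brier = np.mean(np.sum((probs - target) ** 2, axis=1))
    return w_ce * ce + w_dice * dice + w_brier * brier
```

A near-perfect prediction drives all three terms toward zero, while a uniform (uninformative) prediction is penalized by each term.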

MRI Dixon data of the calf was acquired bilaterally in 59 subjects using a 4-point Dixon scan (TE 2.6/3.36/4.12/4.88 ms, TR 210 ms, 1.5x1.5x3 mm³). Seven muscles were manually annotated in all legs by an expert and checked by two other experts. Both the left and right leg were cropped to a FOV of 32x112x112 voxels, after which the left leg was mirrored to match the right. In total, 118 annotated calves were therefore available for training, validation, and testing (see Fig. 2). In addition to varying the network architecture, we varied the input information to the network: either the in-phase image, the out-phase image, the water image, or a three-channel concatenation of these three images.
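The crop-and-mirror preprocessing described above could look like the sketch below. The `crop_center` helper, the symmetric crop, and the choice of the last array axis as the left-right axis are assumptions for illustration; the authors' exact cropping convention is not specified.

```python
import numpy as np

def crop_center(volume, shape=(32, 112, 112)):
    """Crop a 3D volume symmetrically around its center to `shape`."""
    starts = [(d - s) // 2 for d, s in zip(volume.shape, shape)]
    slices = tuple(slice(st, st + s) for st, s in zip(starts, shape))
    return volume[slices]

def prepare_legs(right_leg, left_leg):
    """Crop both legs to the 32x112x112 FOV and mirror the left leg.

    Mirroring along the (assumed) left-right axis makes the left leg
    anatomically comparable to the right, so both can share one network.
    """
    right = crop_center(right_leg)
    left = np.flip(crop_center(left_leg), axis=-1)
    return right, left
```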

To evaluate the effect of the amount of training data, networks were trained using either only the middle slice of the left leg (59 slices, SET1) or three equidistant slices of both legs (118x3 = 354 slices, SET2). The data was split into training (80%), validation (15%), and testing (5%) sets. All training was performed with a batch size of 24 for 250 epochs on an NVIDIA Titan Xp graphics card. The number of feature maps in the first layer was always 48, by setting n = 48 or (n,r) = (12,4) (see Fig. 1B). Results were evaluated using the Dice score and the mean surface distance (MSD, in voxels).
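The two evaluation metrics named above can be sketched as follows. Extracting the segmentation surface via binary erosion and taking the symmetric mean of nearest-surface distances is one common convention for MSD; the authors' exact definition may differ.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice_score(pred, ref):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum())

def surface_voxels(mask):
    """Surface = mask voxels removed by one step of binary erosion."""
    return np.argwhere(mask & ~binary_erosion(mask))

def mean_surface_distance(pred, ref):
    """Symmetric mean surface distance between two binary masks, in voxels."""
    sp = surface_voxels(pred.astype(bool))
    sr = surface_voxels(ref.astype(bool))
    d = cdist(sp, sr)  # pairwise Euclidean distances in voxel units
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

Identical masks give a Dice of 1.0 and an MSD of 0 voxels; any boundary disagreement increases the MSD.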

Results

Examples of the loss and error rate as a function of epochs are shown in Fig. 3. Overall, the network architecture and the amount of data had little influence on the training results (see Fig. 4). Including more data in a network with more convolution layers generally allowed for slightly faster convergence, but always resulted in similar Dice and MSD values. However, we found that the kind of input data used by the network had a large effect on the training results. In-phase images performed considerably worse than out-phase or water images. When all three images were combined, the results were worse than training on the water or out-phase images alone.

We found that the best-performing 2D network was a ResNet, and consequently also trained a 3D variant of this model with identical hyperparameter settings. For this model, segmentation using the out-phase data as input resulted in a mean Dice of 0.901 (range 0.815 to 0.949) and a mean MSD of 0.408 (range 0.200 to 0.677) voxels on the test set (5% of scans) (see Fig. 5).

Discussion and conclusion

Our results showed that the kind of data provided to a UNet-like convolutional network for calf segmentation is a more important factor than the exact layout of individual building blocks. The in-phase data has poor contrast between water and fat and does not clearly show tendons and aponeuroses; water and out-phase images are therefore preferred. Nevertheless, this may be due to a bias in the manual segmentations, which are typically performed on out-phase images.

In summary, we found that UNet-like convolutional networks allow for accurate and precise annotation of calf muscles in 2D and 3D, and that the input data provided is the strongest predictor of success.

Acknowledgements

No acknowledgement found.

References

1. Fouré A, Ogier AC, Le Troter A, et al.: Diffusion Properties and 3D Architecture of Human Lower Leg Muscles Assessed with Ultra-High-Field-Strength Diffusion-Tensor MR Imaging and Tractography: Reproducibility and Sensitivity to Sex Difference and Intramuscular Variability. Radiology 2018; 287:592–607.

2. Fischer M, Schwartz M, Yang B, Schick F: Random Forest based Calf Muscle Segmentation from MR data incorporating Prior Information. In Proc 26th Annu Meet ISMRM, Paris, Fr; 2018:2840.

3. Snezhko E, Azzabou N, Baudin P-Y, Carlier PG: Convolutional neural network segmentation of skeletal muscle NMR images.

4. Konda A, Crump K, Podlisny D, et al.: Fully automatic segmentation of all lower body muscles from high resolution MRI using a two-step DCNN model. In Proc 26th Annu Meet ISMRM, Paris, Fr; 2018:1398.

5. Huang G, Liu Z, van der Maaten L, Weinberger KQ: Densely Connected Convolutional Networks. 2016.

6. He K, Zhang X, Ren S, Sun J: Deep Residual Learning for Image Recognition. 2015.

7. Ronneberger O, Fischer P, Brox T: U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015.

8. Milletari F, Navab N, Ahmadi S-A: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016.

9. Mannor S, Peleg D, Rubinstein R: The cross entropy method for classification. In Proc 22nd Int Conf Mach Learn - ICML ’05. New York, New York, USA: ACM Press; 2005:561–568.

10. Young RMB: Decomposition of the Brier score for weighted forecast-verification pairs. Q J R Meteorol Soc 2010; 136:

11. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ: Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. 2017.

Figures

Figure 1: A) Schematic representation of the networks. 1) UNet: in each layer the data is down-sampled and the number of features is doubled. 2) UResNet: similar to 1), but the number of trainable parameters is halved by decreasing the number of features of the first convolution layer. 3) ResNet: similar to 2), with the addition of a skip layer. 4) UDenseNet: similar to 1), but in each network layer the number of repetitions of the convolution layers is increased. 5) DenseNet: similar to 4), with the addition of skip connections connecting all convolution layers. B) Number of convolution layers and trainable parameters.

Figure 2: Examples of the input Dixon data with the manual annotations. A) The in-phase data, which has bright fat and muscle signal. B) The out-phase data, where the fat is darker than the muscle signal and the tendons are more visible. C) The water-only signal, which is similar to the out-phase data but has no signal from fat.

Figure 3: The loss as a function of the number of epochs for the UNet, ResNet, and DenseNet using the water images and all data as input. On the left, training using a single slice of only the left leg is shown; on the right, training using three slices of both the left and right leg is shown. Using more data makes the network converge quicker. The number of trainable parameters and the number of convolution layers do not have a large impact on the training results.

Figure 4: The mean, min, and max Dice and mean surface distance (MSD) scores over all 7 segmentations for the different networks and input data. The best scores (highest Dice, lowest MSD) per image type are indicated in bold. The best overall performance is indicated in green. The best mean scores per image type over all networks are indicated in orange. Overall, the out-phase and water inputs perform best, and every network type performs best on one of the criteria (mean, min, or max) at least once.

Figure 5: 3D training using the ResNet architecture with 48 features in the first layer. A) Loss and error rate as a function of epochs. B) Axial, sagittal, and coronal cross-sections of the manual and automatic muscle segmentation. C) 3D rendering of the manual and automatic muscle annotation for the right and left leg of two subjects.

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)