1787

Evaluating VoxelMorph, a deep learning-based non-linear diffeomorphic registration algorithm, against native ANTs SyN
Victoria Madge1,2, Philip Novosad1,2, Daniel A. Di Giovanni1,3, and D. Louis Collins1,2,3
1McConnell Brain Imaging Centre, Montreal Neurological Institute and Hospital, Montreal, QC, Canada, 2Department of Biomedical Engineering, McGill University, Montreal, QC, Canada, 3Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada

Synopsis

VoxelMorph is a deep-learning based non-linear diffeomorphic registration algorithm which claims to perform comparably to the state-of-the-art. However, the previous evaluation did not compare against manual gold-standard anatomical segmentations, used only the Dice metric for comparison, and compared against a modified version of a state-of-the-art algorithm, ANTs SyN. Here, VoxelMorph is evaluated against an unmodified version of ANTs SyN using multiple metrics based on manual labels. Results show VoxelMorph is less robust than ANTs SyN and underperforms in the presence of simulated deformations, and in registration of BrainWeb20 images to the VoxelMorph atlas.

Introduction

Image registration is the process of aligning two or more images from different time points, subjects, or modalities, into the same space1 and is a fundamental step in many medical image processing applications, including diagnostics, treatment, and monitoring of various brain pathologies, as well as image-guided surgical planning and intra-operative brain shift correction2. Non-linear diffeomorphic registration is capable of large and small deformations of brain images while preserving topology3, but tends to suffer from long computation times.

Recently, deep learning has shown success in a wide variety of medical image processing tasks, including registration4. VoxelMorph is a deep learning-based non-linear diffeomorphic registration algorithm promising fast, topology-preserving registration of brain images comparable to state-of-the-art algorithms, such as the symmetric image normalization method by Advanced Normalization Tools (ANTs SyN)5. However, the previous evaluation performed between VoxelMorph and ANTs SyN could be improved. First, the training and testing of the VoxelMorph model was performed using fully automatic FreeSurfer-based segmentations of MRI volumes5, and not manual gold standard segmentations6. Thus, errors in automatic segmentation could confound registration quality metrics. Second, the smoothness parameters of the ANTs SyN algorithm were altered to be more similar to those of VoxelMorph5, thus restraining ANTs SyN's capacity to achieve a successful registration. Finally, the previous comparison was based only on Dice scores, which alone is not a complete evaluation of goodness-of-fit6.

Here, VoxelMorph is compared to native ANTs SyN in two experiments; first, using simulated deformations of the VoxelMorph atlas, and second, using a digital phantom database with manual gold standard labels – BrainWeb207. The methods are compared using Dice and Hausdorff metrics as well as the standard deviation of intensity between difference images, yielding a more comprehensive evaluation.

Methods

Data for the first experiment was generated from the VoxelMorph atlas8 by iteratively applying random Gaussian deformations at ten plausible average displacement severities ranging from 0 to 5 mm, following data augmentation methods9, yielding 70 deformed MR images.

BrainWeb20, which comprises twenty T1-weighted digital MR phantoms created from twenty normal adults7 was used for the second experiment. For each of the 20 subjects, the ten manually corrected fuzzy mask brain structure volumes were used to simulate T1w MRI volumes. The MRI volumes underwent skull stripping, affine registration to the VoxelMorph atlas, and intensity normalization (as done for VoxelMorph5).

Both experiments underwent atlas-based registration, using the VoxelMorph atlas as the target volume for (1) the VoxelMorph model and (2) ANTs SyN with its default smoothing parameters. For both methods, the deformed output MR images and their warps were saved and applied to the corresponding image labels.

For the first experiment, warped labels were compared to the original VoxelMorph atlas labels to test each methods performance in the presence of deformation. For the second experiment, evaluation was performed by comparing the warped BrainWeb20 labels of the registered subjects to each other in the target space. For both databases, registration quality was measured by the intensity difference between the output and the target image. Comparison metrics for both databases include: Hausdorff distance, Cohen’s Kappa, and Dice score, calculated using the EvaluateSegmentation tool10.

Results and Discussion

VoxelMorph takes approximately 30 seconds to register each image, while ANTs SyN takes approximately 43 minutes.

Figure 1 depicts the deformed input volumes, and the recovered outputs from VoxelMorph and ANTs SyN, along with intensity difference images for Experiment 1. ANTs SyN demonstrated superior performance at all deformation magnitudes (p < 0.05) except 5 mm, indicating more accurate registrations.

Figure 2 depicts boxplots of measured whole brain Dice scores and Hausdorff distances for Experiment 1. ANTs SyN yields better Dice scores for all simulated deformations and better Hausdorff distances for all but the 5 mm deformation, where there is no difference. Two 2-way repeated measures ANOVA performed on Dice score and Hausdorff distance, respectively, reveal significant differences between ANTs SyN and VoxelMorph (p < 0.01).

Figure 3 depicts the preprocessed BrainWeb20 inputs, and the outputs and difference images for VoxelMorph and ANTs SyN for four subjects in Experiment 2. Average standard deviation intensity differences confirm better performance for VoxelMorph (p < 0.05), as seen visually in the figure.

Figures 4 and 5 depict the Dice scores and Hausdorff distances, respectively, of whole brain, grey matter, white matter, and cerebral spinal fluid (CSF) BrainWeb20 labels. Dice scores are consistent across both methods in the inter-subject comparison (Figure 4), where ANTs SyN significantly outperforms VoxelMorph (p < 0.001). Inconsistent Hausdorff distances in white matter labels (Figure 5) suggest reduced robustness from VoxelMorph, whereas ANTs SyN shows more consistency and better performance for all tissues (p < 0.001). Poorer Hausdorff distance comparisons in Subjects 43 (for CSF) and 49 (all tissues), in both methods (Figure 5), are due to anatomical variability and poor registration, respectively.

Conclusion

This evaluation of VoxelMorph and ANTs SyN shows statistically better performance for ANTs SyN using Dice and Hausdorff distance metrics, and mixed results using intensity differences. Further evaluation, using more direct metrics of registration error with additional MRI data are needed.

Acknowledgements

The authors would like to acknowledge funding from the Canadian Institutes of Health Research (CIHR), the Natural Sciences and Engineering Research Council (NSERC), and Healthy Brains for Healthy Lives (HBHL). The authors would also like to thank the Department of Biomedical Engineering, and the Faculty of Medicine at McGill University for their financial aid.

References

1. B. Zitová and J. Flusser, “Image registration methods: a survey,” Image Vis. Comput., vol. 21, no. 11, pp. 977–1000, Oct. 2003.

2. J. B. A. Maintz and M. A. Viergever, “A survey of medical image registration,” Med. Image Anal., vol. 2, no. 1, pp. 1–36, 1998.

3. M. Wang and P. Li, “A Review of Deformation Models in Medical Image Registration,” J. Med. Biol. Eng., vol. 39, no. 1, pp. 1–17, 2019.

4. G. Haskins, U. Kruger, and P. Yan, “Deep Learning in Medical Image Registration: A Survey.”

5. A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, “Unsupervised learning for fast probabilistic diffeomorphic registration,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11070 LNCS, pp. 729–738, 2018.

6. A. Fenster and B. Chiu, “Evaluation of segmentation algorithms for medical imaging,” Annu. Int. Conf. IEEE Eng. Med. Biol. - Proc., vol. 7 VOLS, pp. 7186–7189, 2005.

7. B. Aubert-Broche, M. Griffin, G. B. Pike, A. C. Evans, and D. L. Collins, “Twenty new digital brain phantoms for creation of validation image data bases,” IEEE Trans. Med. Imaging, vol. 25, no. 11, pp. 1410–1416, 2006.

8. R. Sridharan et al., “Quantification and analysis of large multimodal clinical image studies: Application to stroke,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8159 LNCS, pp. 18–30, 2013.

9. P. Novosad, V. Fonov, D. Louis Collins, T. Alzheimer’s Disease, and N. Initiative, “Accurate and robust segmentation of neuroanatomy in T1-weighted MRI by combining spatial priors with deep convolutional neural networks,” 2019.

10. A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool,” BMC Med. Imaging, vol. 15, no. 1, 2015.

Figures

Figure 1. Deformed inputs (row 1), and outputs from ANTs SyN (row 2), and VoxelMorph (row 4) for Experiment 1. Difference images for each method were calculated by subtracting the original VoxelMorph atlas from each output (rows 3 and 5). One can see that there are greater intensity differences with VoxelMorph.

Figure 2. Boxplots of Dice scores (a) and Hausdorff distances (b) for ANTs SyN (green) and VoxelMorph (blue) for Experiment 1. Each boxplot summarizes seven iterations of registered deformed images at the same average displacement severity level (0 ≤ α ≤ 5 mm). One can see that ANTs SyN outperforms VoxelMorph for almost all simulated displacement values.

Figure 3. Preprocessed BrainWeb20 inputs (column 1), outputs from ANTs SyN (column 2), and VoxelMorph (column 4), their difference images (column 3 and 5, respectively), and VoxelMorph atlas (target) for Experiment 2. Average standard deviations of intensity for each difference image reveal slightly better performance from VoxelMorph (p < 0.05).

Figure 4. Dice score colourmaps of BrainWeb20 images registered using VoxelMorph (top) and ANTs SyN (bottom) for Experiment 2. Median Dice scores: whole brain, 0.96 (VoxelMorph) and 0.98 (ANTs SyN); grey matter, 0.68 (VoxelMorph) and 0.76 (ANTs SyN); white matter 0.68 (VoxelMorph) and 0.77 (ANTs SyN); CSF 0.46 (VoxelMorph) and 0.64 (ANTs SyN). A two-way repeated measures ANOVA shows significantly better Dice scores for ANTs SyN (p < 0.001).

Figure 5. Hausdorff distance colourmaps of BrainWeb20 images registered using VoxelMorph (top) and ANTs SyN (bottom) for Experiment 2. Median Hausdorff distances: whole brain, 10.32 mm (VoxelMorph) and 7.28 mm (ANTs SyN); grey matter, 11.18 mm (VoxelMorph) and 10.49 mm (ANTs SyN); white matter 14.46 mm (VoxelMorph) and 11.90 mm (ANTs SyN); CSF 13.08 (VoxelMorph) and 12.73 (ANTs SyN). A two-way repeated measures ANOVA shows significantly better Hausdorff distances for ANTs SyN (p < 0.001).

Proc. Intl. Soc. Mag. Reson. Med. 28 (2020)
1787