
Intrinsic reproducibility issues in deep learning-based MR reconstruction
Chungseok Oh1, Woojin Jung2, and Jongho Lee1
1Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea; 2AIRS Medical Inc., Seoul, Republic of Korea

Synopsis

Deep learning involves a large number of parameter and configuration settings, which makes it prone to reproducibility issues that undermine the reliability and validity of the outcomes. In this study, we list common sources that may induce reproducibility issues in deep learning-based MR reconstruction and investigate the effect size of each source on network performance. Based on the results, we recommend sharing a trained network.

Introduction

Recently, a number of areas including parallel imaging1 and quantitative susceptibility mapping2 have utilized deep learning-based MR reconstruction to improve performance. Despite the gains, however, questions have arisen regarding the generalization of network performance to inputs acquired with different scan parameters3 as well as the reliability of the reconstruction under small perturbations of the input images.4 Another avenue for exploring network characteristics is reproducibility. In deep reinforcement learning, for example, studies have shown that networks suffer from low reproducibility due to sources such as initialization seeds,5,6 limiting validation of the results. More generally, numerical operations in neural networks have been suggested to have reproducibility issues.7
In this study, we tested potential sources of "intrinsic" reproducibility issues, which we define as issues originating from the hardware and software used for deep learning, in deep learning-based MR reconstruction tasks. Each source was tested for both training and inference of a network. Our results suggest that sharing a trained network is recommended to mitigate these intrinsic reproducibility issues.

Methods

From an extensive literature search and investigation, we identified and tested six sources that may cause intrinsic reproducibility issues: 1) GPU model, 2) CUDA convolution benchmarking option, 3) CUDA convolution determinism option, 4) data type, 5) mini-batch order, and 6) network weight initialization. The first three sources are related to both training and inference, whereas the last three are related only to training because they are not used or varied in inference. These intrinsic reproducibility issues were tested in two MR reconstruction tasks: quantitative susceptibility mapping (QSM) using QSMnet,2 and parallel imaging using a variational network.1 The effect size of each source on network performance was investigated in QSM, whereas the key performance differences were confirmed in parallel imaging.
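Sources 2 through 6 are configurable in software. A minimal sketch of how they map onto PyTorch settings, following the PyTorch reproducibility notes,7,8 is given below; the function name and default values are illustrative and not the configuration script used in this study.

```python
import random
import numpy as np
import torch

def configure_reproducibility(seed: int = 0,
                              benchmark: bool = True,
                              deterministic: bool = False,
                              dtype: torch.dtype = torch.float32):
    # Sources 5-6: mini-batch order and weight initialization are driven by these seeds.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds CPU and current-GPU RNGs
    torch.cuda.manual_seed_all(seed)   # seeds all visible GPUs

    # Source 2: CUDA convolution benchmarking (the autotuner picks the fastest kernel,
    # which may vary between runs and between GPU models).
    torch.backends.cudnn.benchmark = benchmark

    # Source 3: CUDA convolution determinism (restricts cuDNN to deterministic kernels).
    torch.backends.cudnn.deterministic = deterministic

    # Source 4: data type of network weights and activations.
    torch.set_default_dtype(dtype)

    # Source 1 (the GPU model) is a hardware choice and cannot be fixed in software.
```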

[Quantitative susceptibility mapping]
For QSM, the same network architecture and datasets as in QSMnet were utilized. To test whether training and inference of the network are reproducible, the network was repeatedly trained and tested on the same computational infrastructure with a single GPU (NVIDIA Titan Xp), using the same PyTorch8 implementation, CUDA convolution benchmarking on, CUDA convolution determinism off, the same mini-batch order, and the same network weight initialization values. Reproducibility in training was verified by checking the training loss over 1000 steps, and reproducibility in inference was verified by comparing the reconstructed test images. After confirming that reproducibility was guaranteed under these baseline options, the effect size of each source (e.g., GPU model) was investigated by changing the corresponding option (Fig. 1; bold type indicates the baseline options). Network performance was evaluated by measuring NRMSE on the test dataset, and a pairwise t-test was conducted for statistical analysis.
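As an illustration of the evaluation step, the sketch below computes NRMSE and a paired t-test over per-image NRMSE values; the variable names and placeholder numbers are assumptions, not the published evaluation code or data.

```python
import numpy as np
from scipy import stats

def nrmse(recon: np.ndarray, reference: np.ndarray) -> float:
    # Normalized root-mean-square error between a reconstruction and its reference.
    return np.linalg.norm(recon - reference) / np.linalg.norm(reference)

# Per-image NRMSE values from two configurations (e.g., baseline vs. a changed
# CUDA convolution option); the numbers below are placeholders, not study results.
nrmse_baseline = np.array([0.55, 0.54, 0.56])
nrmse_modified = np.array([0.56, 0.55, 0.57])

# Paired t-test over matched test images.
t_stat, p_value = stats.ttest_rel(nrmse_baseline, nrmse_modified)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3e}")
```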

[Parallel imaging]
For parallel imaging, the dataset1 and code available online were utilized (https://github.com/rixez/pytorch_mri_variationalnetwork). The network was trained five times without applying the baseline options in Figure 1, and performance was evaluated using NRMSE. For inference, a single trained network was tested five times, and NRMSE on the test dataset was reported. A pairwise t-test was conducted for statistical analysis.
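A hedged sketch of how repeated-inference reproducibility can be checked is shown below: the same trained network is run on the same input several times, and the outputs are compared bitwise and by maximum relative difference. The model and input are placeholders, not the variational network pipeline itself.

```python
import torch

@torch.no_grad()
def check_inference_reproducibility(model: torch.nn.Module,
                                    example_input: torch.Tensor,
                                    n_runs: int = 5) -> None:
    model.eval()
    outputs = [model(example_input) for _ in range(n_runs)]
    reference = outputs[0]
    for i, out in enumerate(outputs[1:], start=2):
        bitwise_equal = torch.equal(out, reference)          # exact match?
        rel_diff = (out - reference).abs().max() / reference.abs().max()
        print(f"run {i}: bitwise equal = {bitwise_equal}, "
              f"max relative difference = {rel_diff.item():.3e}")
```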

Results

When all six sources were set to the baseline options, intrinsic reproducibility was confirmed for both training and inference of QSMnet.
Figure 2 shows the effects of each source on performance in the training of QSMnet. For all sources, the performances of the network differed, suggesting that reproducibility was not guaranteed. Moreover, in some cases the performance differences reached statistical significance (e.g., CUDA convolution (Fig. 2b-c), mini-batch order (Fig. 2e), and network weight initialization (Fig. 2f)), demonstrating that intrinsic reproducibility is an important issue. For the data type, float64 and float16 could not be processed because of a GPU memory limit and insufficient precision, respectively. For reproducibility in the inference of QSMnet (Fig. 3), changing the GPU model or the CUDA convolution determinism option produced no difference, whereas changing the CUDA convolution benchmarking option produced an extremely small performance difference (1.6e-8%).
In the variational network (Fig. 4), repeated training runs showed statistically significant differences in performance, whereas repeated inference showed only a negligible difference (3.4e-7%), confirming our findings from QSMnet.

Discussion and Conclusion

In this work, we demonstrate that intrinsic reproducibility issues, originating from the hardware and software used for deep learning, exist in deep learning-based MR reconstruction. Six sources were tested to show the reproducibility issues in training and inference, but additional sources such as the operating system or the deep learning software version may also affect reproducibility. As demonstrated with QSMnet, the reproducibility issues in training can lead to statistically significant performance differences, whereas the effect size was negligible in inference. A potential explanation for the higher susceptibility of the training process is that it involves more random operations, such as mini-batch ordering and network weight initialization. Additionally, floating-point errors may be amplified during training. Therefore, sharing a trained network may address the intrinsic reproducibility issue.
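As a usage note on this recommendation, the sketch below shows one common way to share a trained network in PyTorch, by distributing its state_dict so that all users start from identical parameters; the tiny architecture and file name are placeholders, not QSMnet or the variational network.

```python
import torch
import torch.nn as nn

# Placeholder architecture standing in for QSMnet or the variational network.
class TinyReconNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

# Training side: save only the learned parameters.
trained = TinyReconNet()
torch.save(trained.state_dict(), "shared_weights.pth")

# Receiving side: rebuild the same architecture and load the shared weights, so
# every user starts from identical parameters regardless of training-time
# non-determinism; inference differences were negligible in this study.
model = TinyReconNet()
model.load_state_dict(torch.load("shared_weights.pth", map_location="cpu"))
model.eval()
```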

Acknowledgements

This work was supported by the Creative-Pioneering Researchers Program through Seoul National University (SNU) and by the Korea government (MSIT) (No. NRF-2017M3C7A1047864).

References

1. Hammernik, K. et al. Learning a variational network for reconstruction of accelerated MRI data. Magn. Reson. Med. 79, 3055–3071 (2018).

2. Yoon, J. et al. Quantitative susceptibility mapping using deep neural network: QSMnet. Neuroimage 179, 199–206 (2018).

3. Jung, W., Bollmann, S. & Lee, J. Overview of quantitative susceptibility mapping using deep learning: Current status, challenges and opportunities. NMR Biomed. 1–14 (2020) doi:10.1002/nbm.4292.

4. Antun, V., Renna, F., Poon, C., Adcock, B. & Hansen, A. C. On instabilities of deep learning in image reconstruction and the potential costs of AI. Proc. Natl. Acad. Sci. U. S. A. 117, 30088–30095 (2020).

5. Nagarajan, P. et al. The Impact of Nondeterminism on Reproducibility in Deep Reinforcement Learning. 2nd Reprod. Mach. Learn. Work. ICML 2018 9116, 64–73 (2018).

6. Nagarajan, P., Warnell, G. & Stone, P. Deterministic Implementations for Reproducibility in Deep Reinforcement Learning. (2018).

7. PyTorch reproducibility. https://pytorch.org/docs/stable/notes/randomness.html.

8. Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019).

Figures

Figure 1. List of sources inducing intrinsic reproducibility issues in deep learning-based MR reconstruction. Six sources are relevant to the training of a network, whereas three are relevant to its inference. For each source, the options tested to investigate the effect size are listed in the second column, and the baseline options are marked in bold type. We confirmed that training and inference with the baseline options were reproducible in QSMnet.

Figure 2. The effects of the six sources on network performance in the training of QSMnet. The mean ± standard deviation of NRMSE over the test dataset is presented. (a) The performances differed across GPU models. (b-c) When the CUDA convolution options were changed, the performances differed between trials (p = 5.6e-8 for (b) and 1.2e-2 for (c)). (d) For the data type, float64 and float16 could not be processed. (e-f) The performances differed for each seed of the mini-batch order and the weight initialization (p = 1.7e-8 for (e) and 4.8e-4 for (f)).

Figure 3. The effects of the three sources on network performance in the inference of QSMnet. The mean ± standard deviation of NRMSE over the test dataset is presented. (a) When the network was tested on different GPU models, the reconstructed images were exactly the same. (b) When the CUDA convolution benchmarking option was changed, the performance differed between options (on vs. off), but the effect size was no larger than 1.6e-8%. (c) When the CUDA convolution determinism option was changed, the reconstructed images were exactly the same.

Figure 4. The effects of the reproducibility issues on network performance in the training and inference of the variational network. The mean ± standard deviation of NRMSE over the test dataset is presented. (a) When the network was trained five times, the performances differed between trials (p = 1.9e-8). (b) When inference was run five times with the same trained network, the performance differences were negligible (no larger than 3.4e-7%).

Proc. Intl. Soc. Mag. Reson. Med. 30 (2022)
1165
DOI: https://doi.org/10.58530/2022/1165