
Estimating Uncertainty of Deep Learning for Tomographic Image Reconstruction Through Local Lipschitz
Danyal Bhutto1,2, Bo Zhu2, Jeremiah Zhe Liu3,4, Neha Koonjoo2,5, Bruce R Rosen2,5, and Matthew S Rosen2,5,6
1Biomedical Engineering, Boston University, Boston, MA, United States, 2Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, United States, 3Google, Mountain View, CA, United States, 4Biostatistics, Harvard University, Cambridge, MA, United States, 5Harvard Medical School, Boston, MA, United States, 6Physics, Harvard University, Boston, MA, United States

Synopsis

Keywords: Machine Learning/Artificial Intelligence, Image Reconstruction

As deep learning approaches for image reconstruction become increasingly used in the radiological space, strategies to estimate reconstruction uncertainty become critically important to ensure images remain diagnostic. We estimate reconstruction uncertainty through calculation of the Local Lipschitz value, demonstrate a monotonic relationship between the Local Lipschitz and the Mean Absolute Error, and show how a threshold can determine whether the deep learning technique was accurate or whether an alternative technique should be employed. We also show how our technique can be used to identify out-of-distribution test images, outperforming the baseline metrics of deep ensembles and Monte Carlo dropout.

Introduction

All imaging modalities from ultrasound to magnetic resonance imaging (MRI) employ tomographic image reconstruction techniques, i.e., the conversion of sensor-domain data to the image domain. Deep learning (DL) for tomographic image reconstruction has shown great promise in solving inverse problems of this type, particularly in the medical field. Because the goal is a diagnostic task, the reconstruction must occur with high confidence given the health risk to patients. It is therefore critical to estimate the uncertainty of DL tomographic reconstruction techniques in practice, so that we know whether a reconstruction was performed appropriately. We propose calculating the Local Lipschitz value to estimate model uncertainty. We demonstrate the monotonic relationship between the Local Lipschitz and the Mean Absolute Error (MAE) and use it to determine, through selective prediction, a threshold $$$\Upsilon$$$ below which the DL technique performed appropriately. We also demonstrate that by perturbing an input with noise, similar to the Local Lipschitz calculation, we can identify out-of-distribution (OOD) test images and outperform the baseline methods of deep ensembles and Monte Carlo dropout.

Materials and Methods

Using the AUTOMAP neural network1, we trained on and tested with 2D T1-weighted brain MR images acquired at 3T from the MGH-USC Human Connectome Project (HCP) public dataset2, which served as the in-distribution (ID) dataset. We used the NYU fastMRI knee image dataset3 as the OOD dataset. All test sets consist of 5,000 images.
To measure the uncertainty of DL for tomographic image reconstruction tasks, we propose estimating the Local Lipschitz constant. A function $$$f$$$ is $$$L$$$-Lipschitz continuous if there exists a nonnegative constant $$$L \ge 0$$$ such that for all inputs $$$x, y \in \mathbb{R}^{n}$$$:
$$\| f(x) - f(y) \| \le L \| x - y \|$$
To calculate the Local Lipschitz, we need to determine the upper bound caused by average-case perturbations. Let $$$\Phi$$$ denote AUTOMAP, $$$L_{\Phi}$$$ its Lipschitz constant, $$$x$$$ the input image, and $$$e$$$ the error. We can write the Local Lipschitz calculation from the above equation as follows.
$$\| \Phi(x+e) - \Phi(x) \| \le L_{\Phi} \| x+e-x \|$$
$$\| \Phi(x+e) - \Phi(x) \| \le L_{\Phi} \| e \|$$
Given the small magnitude of $$$\|e\|$$$, we can rearrange the above inequality and bound $$$L_{\Phi}$$$ from below by the empirically calculated ratio, where the perturbed image is $$$x' = x + e$$$.
$$L_{\Phi} \ge \frac{\| \Phi(x') - \Phi(x) \|}{ \| x' - x \|}$$
Now we can calculate $$$L_{\Phi}$$$ for any image by perturbing it with Gaussian noise $$$e$$$ and comparing the variation in the output space to the variation in the input space. Because it compares the difference in output to the difference in input, the $$$L_{\Phi}$$$ value reflects how strongly the network weights amplify input perturbations in the reconstructed output image. Fig 1a demonstrates the strong monotonic relationship between the Local Lipschitz and MAE, and the table in Fig 1b shows the Spearman correlation values at each noise level. We observe that as MAE increases, $$$L_{\Phi}$$$ increases. We also compare our method for detecting OOD images to the baseline methods of deep ensembles and Monte Carlo dropout. For the deep ensemble, we trained four AUTOMAP models; for Monte Carlo dropout, we trained an AUTOMAP model with a dropout layer to output 50 images.
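For illustration, a minimal sketch of this empirical Local Lipschitz estimate is given below in Python/NumPy. It is an assumed implementation, not the authors' code: "model" stands for a callable wrapping the trained AUTOMAP forward pass (sensor-domain array in, reconstructed image out), and "noise_frac" sets the Gaussian perturbation scale relative to the input magnitude.

    import numpy as np

    def local_lipschitz(model, x, noise_frac=0.05, n_trials=10, rng=None):
        # Empirical Local Lipschitz estimate: perturb x with Gaussian noise e and
        # return the largest observed ratio ||Phi(x+e) - Phi(x)|| / ||e||.
        rng = np.random.default_rng() if rng is None else rng
        fx = model(x)  # reconstruction of the clean input
        ratios = []
        for _ in range(n_trials):
            e = rng.normal(0.0, noise_frac * np.abs(x).mean(), size=x.shape)
            ratios.append(np.linalg.norm(model(x + e) - fx) / np.linalg.norm(e))
        return max(ratios)

A single noise draw already yields a valid ratio; averaging or taking the maximum over several draws simply gives a more stable estimate of the average-case behavior.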

Results and Discussion

Fig 2a shows an image reconstruction pipeline in which a threshold $$$\Upsilon$$$ determines whether the DL model or an alternative technique should be used. If the Local Lipschitz is below $$$\Upsilon$$$, the DL reconstruction achieved an appropriate accuracy; otherwise, an alternative technique should be employed because the uncertainty is too large. Figs 2b and 2c show how selective prediction is performed to determine $$$\Upsilon$$$ at different noise levels. The $$$L_{\Phi}$$$ values were sorted in descending order together with their corresponding MAE. The mean MAE of the retained images decreases as more images with the highest $$$L_{\Phi}$$$ are referred. A radiologist can determine an acceptable accuracy and a threshold based on the percentage referred. We also demonstrate how our method can be used to detect OOD images. Fig 3 displays the receiver operating characteristic (ROC) curves and area under the curve (AUC) values of five different methods for detecting OOD images against the ID validation dataset. Using the $$$L_{\Phi}$$$ values as the signal, a single AUTOMAP model achieves an AUC of 86.84% and performs comparably to the baseline methods, with 87.25% for the deep ensemble and 86.97% for Monte Carlo dropout. The baseline methods output multiple images, so we can calculate the variance across those outputs and use it as the signal. For the single AUTOMAP model, we can generate four outputs by adding noise drawn from the same distribution to the input and calculate the variance of the outputs. Using this variance as the signal, the AUTOMAP model outperforms all other methods with an AUC of 99.97%. Thus, our perturbation method, which uses noise similarly to the Local Lipschitz calculation, can detect OOD images in a clinical setting better than the baseline methods while requiring only a single model.
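The two procedures described above can be sketched as follows. These are illustrative assumptions rather than the authors' implementation: "model" is again an assumed callable for the trained reconstruction network, the OOD score is the pixel-wise variance across a few noise-perturbed reconstructions, and the referral curve reproduces the selective-prediction sorting used to choose $$$\Upsilon$$$.

    import numpy as np

    def ood_variance_score(model, x, noise_frac=0.05, n_outputs=4, rng=None):
        # Single-model OOD signal: reconstruct several noise-perturbed copies of
        # the same input and measure how much the outputs disagree.
        rng = np.random.default_rng() if rng is None else rng
        outputs = np.stack([
            model(x + rng.normal(0.0, noise_frac * np.abs(x).mean(), size=x.shape))
            for _ in range(n_outputs)
        ])
        return outputs.var(axis=0).mean()  # pixel-wise variance, averaged to a scalar

    def referral_curve(lipschitz_values, mae_values):
        # Selective prediction: refer the images with the largest L_Phi first and
        # report the mean MAE of the images retained at each referral fraction.
        order = np.argsort(lipschitz_values)[::-1]  # descending L_Phi
        mae_sorted = np.asarray(mae_values)[order]
        n = len(mae_sorted)
        frac_referred = np.arange(n) / n
        mean_mae_retained = np.array([mae_sorted[k:].mean() for k in range(n)])
        return frac_referred, mean_mae_retained

A larger variance score indicates an input the network reconstructs inconsistently under small perturbations, which is used here as the OOD signal; the referral curve lets a radiologist pick $$$\Upsilon$$$ as the $$$L_{\Phi}$$$ value at the referral fraction that achieves an acceptable mean MAE.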

Conclusion

DL for tomographic image reconstruction has shown great promise in solving inverse problems, particularly in the medical field. We provide a simple and scalable technique for estimating uncertainty by calculating the Local Lipschitz value, demonstrate its relationship to MAE, determine a threshold for deciding whether to use the DL model, and use it to detect OOD test images.

Acknowledgements

We acknowledge support for this work from the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1840990 and the National Science Foundation Research Traineeship Program (NRT): Understanding the Brain (UtB): Neurophotonics under Grant No. DGE-1633516.

References

1. Zhu B, et al. Image reconstruction by domain-transform manifold learning. Nature 2018; 555(7697): 487–492.

2. Fan Q, Witzel T, Nummenmaa A, Dijk KRAV, Horn JDV, Drews MK, et al. MGH–USC Human Connectome Project datasets with ultra-high b-value diffusion MRI. Neuroimage 2016; 124: 1108–1114.

3. Zbontar J, Knoll F, Sriram A, Murrell T, Huang Z, Muckley MJ, et al. fastMRI: An Open Dataset and Benchmarks for Accelerated MRI. arXiv 2018.

Figures

Fig 1: Empirical evidence of the monotonic relationship between $$$L_{\Phi}$$$ and MAE (mean absolute error) for the ID dataset at noise levels ranging from 5% to 100%. A) For all noise levels, each plot shows that as MAE increases, $$$L_{\Phi}$$$ increases as well. B) Table of Spearman correlation values at each noise level. The values indicate a strong correlation between the $$$L_{\Phi}$$$ of each image and its MAE value, establishing a strong monotonic relationship between the Local Lipschitz and accuracy.

Fig 2: Reconstruction benchmark: A) The image reconstruction pipeline, in which a threshold $$$\Upsilon$$$ determines whether the DL model or an alternative technique should be used. B) and C) show how selective prediction is performed to determine $$$\Upsilon$$$ at different noise levels. The $$$L_{\Phi}$$$ values were sorted in descending order together with their corresponding MAE. The mean MAE of the retained images decreases as more images with the highest $$$L_{\Phi}$$$ are referred. A radiologist can determine an acceptable accuracy and a threshold based on the percentage referred.


Fig 3: Receiver operating characteristic (ROC) curves and area under the curve (AUC) values of five methods for detecting OOD images. Using the $$$L_{\Phi}$$$ values, AUTOMAP achieves an AUC of 86.84% and performs comparably to the baseline methods. The baseline methods output multiple images, so we can calculate the variance. For AUTOMAP, we can generate four outputs by adding noise of the same distribution to the input and calculate the variance. AUTOMAP with variance outperforms all other methods, with an AUC of 99.97%, in detecting OOD images.

Proc. Intl. Soc. Mag. Reson. Med. 31 (2023)
3875
DOI: https://doi.org/10.58530/2023/3875