Deep learning has shown great success in MR image segmentation, enhancement, and reconstruction. However, most, if not all, methods rely on pairs of input and ground-truth images to train a network for a given task. In practice, corresponding ground-truth MR images are often hard to obtain due to limitations in data acquisition. In this study, we aim to use the structure of a convolutional neural network (CNN) itself as the constraint in an optimization task, without any ground-truth images, and to evaluate its performance in MR image denoising and super-resolution applications.
In image restoration problems such as denoising and super-resolution, the goal is to recover the original image $$$x$$$ from a corrupted observation $$$x_0$$$, which can be formulated as the optimization task:
$$\min_x E(x,x_0)+R(x)$$
in which $$$E(x, x_0)$$$ is a data consistency term and $$$R(x)$$$ is a prior knowledge or constraint term. Since such tasks are often ill-posed, with no unique solution $$$x$$$, the constraint term $$$R(x)$$$ is needed to obtain a reasonable solution. Recently, deep learning methods have used paired $$$x$$$ and $$$x_0$$$ to train a network that restores $$$x$$$ from $$$x_0$$$ at test time; $$$R(x)$$$ is then implicitly embedded in the trained network. However, these methods rely heavily on the availability and abundance of ground-truth images $$$x$$$. In contrast, the popular compressed sensing approach uses hand-crafted image sparsity in different domains as the constraint $$$R(x)$$$ and has been successful in MR image reconstruction. Combining the two ideas, the deep image prior replaces the hand-crafted $$$R(x)$$$ with the CNN structure itself as the constraint and optimizes the network parameters without paired $$$x$$$ and $$$x_0$$$, under the hypothesis that the output of a CNN given a fixed input naturally satisfies certain requirements such as smoothness and good visual quality. The optimization can therefore be expressed as:
$$\min_{\theta}E(f_{\theta}(z),x_0)$$
in which $$$f_{\theta}(z)$$$ represents the CNN output with parameters $$$\theta$$$ and input $$$z$$$. As in GAN models, $$$z$$$ is usually sampled from uniform random noise, and $$$f_{\theta}(z)$$$ has a network structure similar to that of the generator. To obtain the CNN parameters, $$$\theta$$$ is randomly initialized and then optimized by a training process with $$$E(f_{\theta}(z),x_0)$$$ as the loss function. One notable difference from a conventional CNN is that $$$\theta$$$ depends on $$$x_0$$$ and is thus part of the data rather than the model. In the denoising application, $$$E(f_{\theta}(z),x_0)$$$ is the $$$l_1$$$ or $$$l_2$$$ difference between $$$f_{\theta}(z)$$$ and $$$x_0$$$; if trained long enough, this loss approaches 0 and $$$f_{\theta}(z)$$$ reproduces the corrupted image, noise included. However, because the CNN structure has been found to have lower "impedance" to the true image than to noise, the true image emerges before the noise during training, so that with early stopping $$$f_{\theta}(z)$$$ can approximate $$$x$$$ very well. In the super-resolution application, $$$E(f_{\theta}(z),x_0)$$$ is the difference between a down-sampled $$$f_{\theta}(z)$$$ and the low-resolution $$$x_0$$$. Among all solutions that achieve $$$E(x,x_0)=0$$$, the hypothesis is that $$$f_{\theta}(z)$$$, as the output of a CNN, will likely have the best visual quality.
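To make the procedure concrete, the following is a minimal PyTorch sketch of the denoising case. The tiny encoder-decoder, its channel widths, the learning rate, and the iteration count are illustrative assumptions, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # conv -> batch norm -> leaky ReLU, a common building block in
    # deep-image-prior generators (widths here are assumptions)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class TinyHourglass(nn.Module):
    # a much-reduced stand-in for the generator-like network f_theta
    def __init__(self, c_z=32):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(c_z, 64),
            nn.AvgPool2d(2),                       # encode: downsample
            conv_block(64, 64),
            nn.Upsample(scale_factor=2, mode='bilinear',
                        align_corners=False),      # decode: upsample
            conv_block(64, 64),
            nn.Conv2d(64, 1, kernel_size=1),       # single-channel MR image
        )

    def forward(self, z):
        return self.net(z)

x0 = torch.rand(1, 1, 256, 256)   # noisy observation (placeholder data)
z = torch.rand(1, 32, 256, 256)   # fixed input from uniform random noise
f = TinyHourglass()
opt = torch.optim.Adam(f.parameters(), lr=1e-2)

for step in range(1800):          # stop early: the true image is fitted
    opt.zero_grad()               # before the noise
    loss = F.mse_loss(f(z), x0)   # E(f_theta(z), x0), l2 version
    loss.backward()
    opt.step()

x_hat = f(z).detach()             # denoised estimate of x
```

For super-resolution, only the loss would change: the network output is down-sampled before comparison with the low-resolution observation, e.g. `F.mse_loss(F.interpolate(f(z), scale_factor=0.25, mode='bilinear', align_corners=False), x0)`, where $$$z$$$, and hence $$$f_{\theta}(z)$$$, is kept at the target high resolution.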