1947

Cascaded U-net with Deformable Convolution for Dynamic Magnetic Resonance Imaging

Zhehong Zhang¹, Yuze Li², and Huijun Chen²
¹Department of Engineering Physics, Tsinghua University, Beijing, China, ²Department of Biomedical Engineering, School of Medicine, Tsinghua University, Beijing, China

Synopsis

A cascade of several U-nets operating in both the k-space and image domains is a deep learning model that has been used for magnetic resonance imaging (MRI) reconstruction. Here, we present a new method that incorporates deformable 2D convolution kernels into this model. The proposed method leverages the motion information in dynamic MRI, because deformable convolution kernels naturally adapt to image structures. We demonstrate the improved performance of the proposed method on a CINE dataset.

Introduction

MRI is intrinsically slow due to physical and physiological limitations. Deep learning-based image reconstruction methods such as convolutional neural networks have been demonstrated for dynamic MRI, providing new opportunities for fast and high-quality reconstruction. In comparison to flat convolutional neural networks, a cascade of several U-nets is a more flexible model that works across different scales and shows outstanding performance, especially when operating in both the image and k-space domains¹.

However, such reconstruction of dynamic MRI is inherently limited in its ability to model geometric transformations because of the fixed geometric structure of the convolution kernels, which often leads to over-smoothing artifacts. Deformable convolution, a convolution method used in computer vision tasks such as object detection and video deblurring, augments the spatial sampling locations of the kernel with additional learned offsets²⁻⁴. Deformable convolution kernels naturally adapt to image structures and could effectively reduce the blurring.

In this study, we aimed to demonstrate that deep learning-based dynamic image reconstruction can benefit from the incorporation of deformable convolution. We used a cascaded U-net and replaced the standard convolution layer that extracts features from the input with a deformable convolution layer. The proposed method was compared against the zero-filling method, compressed sensing, the standard cascaded U-net, and the fully sampled reference.

Theory and Methods

Deformable Convolution: In normal convolution with fixed uniform convolution kernels, each location $$$p_0$$$ on the output feature map $$$y$$$ is the sum of values sampled from the input feature map $$$x$$$ over a regular grid $$$R$$$ and weighted by $$$w$$$, so that $$$y(p_0)=\sum\limits_{p_n \in R}w(p_n)\cdot x(p_0+p_n)$$$. In deformable convolution, the regular grid is augmented with offsets $$$\Delta p_n$$$ and the feature map is computed as $$$y(p_0)=\sum\limits_{p_n \in R}w(p_n)\cdot x(p_0+p_n+\Delta p_n)$$$. The offsets are derived from additional convolution layers applied to the preceding feature maps, as illustrated in Figure 1. Because the offsets are usually fractional, bilinear interpolation is used to sample the pixels. We replaced the normal convolution with deformable convolution at the first convolution layer of every U-net block that operates in the image domain.
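The two sums above can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation used in this work: it handles a single-channel feature map with a 3 × 3 kernel, treats samples outside the image as zero, and takes the offsets as a given array instead of predicting them with a convolution layer. The names `bilinear_sample` and `deformable_conv2d` and the `(H, W, 9, 2)` offset layout are hypothetical.

```python
import numpy as np

def bilinear_sample(x, r, c):
    """Sample the 2D array x at a fractional (row, col) position.

    Assumption: positions outside the image contribute zero.
    """
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                # Bilinear weight falls off linearly with distance to the neighbor.
                weight = (1.0 - abs(r - rr)) * (1.0 - abs(c - cc))
                value += weight * x[rr, cc]
    return value

def deformable_conv2d(x, w, offsets):
    """Compute y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n) for a 3x3 kernel.

    x: (H, W) input feature map.
    w: (3, 3) kernel weights.
    offsets: (H, W, 9, 2) array of (d_row, d_col) per output pixel and kernel tap.
    """
    h, wd = x.shape
    grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]  # regular grid R
    y = np.zeros((h, wd))
    for r in range(h):
        for c in range(wd):
            for n, (i, j) in enumerate(grid):
                dr, dc = offsets[r, c, n]
                y[r, c] += w[i + 1, j + 1] * bilinear_sample(x, r + i + dr, c + j + dc)
    return y
```

With all offsets set to zero, the function reduces to the normal convolution of the first formula (written, as in the text, without kernel flipping), which gives a simple sanity check for the sketch.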

Network Architecture: Our models comprise cascaded U-nets in which each U-net block operates in either the k-space or the image domain, with the blocks following one another through the whole network. Figure 2 shows a cascaded U-net termed IK W-net, composed of two blocks operating first in the image domain and then in the k-space domain. Every U-net block takes undersampled k-space data as input. If the block operates in the image domain, an inverse Fourier transform is performed to turn the k-space input into low-resolution images. Each block has 22 convolution layers, three max-pooling layers, three up-sampling layers, and one residual connection. The convolution kernel size is 3 × 3. Following Souza et al.¹, the following types of cascaded U-net were tested: a) W-net IK, b) WW-net IKIK, c) WWW-net IKIKIK. We enforced data consistency in k-space at the end of each U-net block, with the updated k-space as output; data consistency was implemented in the noiseless setting. The loss function used to train the model was the mean squared error.
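The data consistency step at the end of a block can be sketched as follows. In the noiseless setting, a common formulation (assumed here, since the abstract does not spell it out) keeps the acquired k-space samples unchanged and uses the network prediction only at unsampled locations. The function names and the `image_net` callable are hypothetical placeholders for an image-domain U-net block.

```python
import numpy as np

def data_consistency(k_pred, k_measured, mask):
    """Noiseless data consistency: measured samples replace the prediction.

    mask is a boolean array that is True where k-space was acquired.
    """
    return np.where(mask, k_measured, k_pred)

def image_block_with_dc(k_in, k_measured, mask, image_net):
    """One image-domain block: iFFT -> network -> FFT -> data consistency.

    image_net is a stand-in for the U-net; any callable that maps an
    image to an image of the same shape works for this sketch.
    """
    image = np.fft.ifft2(k_in, norm="ortho")
    refined = image_net(image)
    k_pred = np.fft.fft2(refined, norm="ortho")
    return data_consistency(k_pred, k_measured, mask)
```

A k-space block would follow the same pattern without the two Fourier transforms, so that the output of every block is again a full k-space array that the next block can consume.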

Training: The network was trained on 2D images obtained from retrospectively undersampled CINE data of 5 subjects. The public data set was downloaded from https://www.kaggle.com/c/second-annual-data-science-bowl/data. The retrospectively undersampled data were generated by variable-density sampling along the phase-encoding dimension with acceleration factors $$$R=2$$$, $$$R=4$$$, and $$$R=8$$$.
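One way to generate such a variable-density mask is sketched below. The abstract does not state the sampling density, so every detail here is an assumption made for illustration: a Gaussian density centered on the middle phase-encoding line, a small fully sampled center, and exactly `n_pe // accel` acquired lines. The function name and its parameters are hypothetical.

```python
import numpy as np

def variable_density_mask(n_pe, n_fe, accel, center_lines=4, sigma_frac=0.25, seed=0):
    """Boolean mask of shape (n_pe, n_fe) that selects whole phase-encoding lines.

    Assumes centered k-space (DC at index n_pe // 2). The Gaussian density
    and the fully sampled center are illustrative choices, not the paper's.
    """
    rng = np.random.default_rng(seed)
    n_keep = n_pe // accel
    center = n_pe // 2
    chosen = np.zeros(n_pe, dtype=bool)
    half = center_lines // 2
    chosen[center - half:center + half] = True  # always keep the center lines

    # Sampling probability decays with distance from the k-space center.
    idx = np.arange(n_pe)
    prob = np.exp(-0.5 * ((idx - center) / (sigma_frac * n_pe)) ** 2)
    prob[chosen] = 0.0
    prob /= prob.sum()

    n_extra = max(n_keep - int(chosen.sum()), 0)
    extra = rng.choice(n_pe, size=n_extra, replace=False, p=prob)
    chosen[extra] = True

    # Every readout (frequency-encoding) sample of a chosen line is acquired.
    return np.repeat(chosen[:, None], n_fe, axis=1)
```

Multiplying fully sampled, centered k-space by this mask gives the retrospectively undersampled input; for dynamic data, a different `seed` per frame would vary the pattern over time.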

Results and Discussion

Table 1 shows the peak signal-to-noise ratio (PSNR) of the cascaded IKIKIK U-net with deformable convolution at different acceleration factors, compared with the standard cascaded U-net, compressed sensing with temporal total variation, and zero-filling. For acceleration factors R = 2, 4, and 8, the PSNR of images reconstructed by the proposed method was 39.60, 34.08, and 30.86, respectively, higher than that of the other methods under the same conditions. On average, the cascaded IKIKIK U-net with deformable convolution achieved consistently better results.
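For reference, PSNR can be computed as in the short sketch below. The abstract does not say which peak value was used, so taking the maximum magnitude of the fully sampled reference is an assumption, and the function name is hypothetical.

```python
import numpy as np

def psnr(reference, reconstruction):
    """PSNR in dB: 10 * log10(peak^2 / MSE).

    Assumption: the peak is the maximum magnitude of the reference image.
    """
    reference = np.asarray(reference)
    reconstruction = np.asarray(reconstruction)
    mse = np.mean(np.abs(reconstruction - reference) ** 2)
    peak = np.max(np.abs(reference))
    return 10.0 * np.log10(peak ** 2 / mse)
```

The value depends on the chosen peak and on whether it is computed per frame or per series, so PSNR figures are only comparable when those conventions match.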

Table 2 shows the influence of network depth on the performance of deformable convolution. Deriving the offset fields multiple times, once in each image-domain block, improved the output. The cascaded U-net with deformable convolution achieved better results as the depth increased. Judging from the average PSNR, increasing the depth did not produce better images with the standard cascaded U-net, but did produce images of higher quality with the proposed method.

Figure 3 compares a single frame reconstructed by multiple types of cascaded network. The acceleration factor here was 8. In comparison to the standard cascaded U-net, the deformable convolution led to better deblurring, especially at the ventricular wall, where motion is most pronounced.

Conclusion

We proposed to leverage the motion information in dynamic MRI by incorporating deformable convolution into a cascaded U-net. The deformable convolution layers learn the offset field used to sample the convolution kernels. Deformable convolution helps improve reconstruction performance by alleviating over-smoothing artifacts. As more U-net blocks are added to the network, the proposed method obtains better results through increasing exploitation of the offset field. The next step is to extend the deformable convolution to a 3D (2D+t) offset field with a varying number of sampling points in each frame, so that more information can be sampled from reliable frames.

Acknowledgements

No acknowledgement found.

References

1. Souza, Roberto, et al. “Dual-Domain Cascade of U-Nets for Multi-Channel Magnetic Resonance Image Reconstruction.” Magnetic Resonance Imaging, vol. 71, 2020, pp. 140–153.

2. Dai, Jifeng, Qi, Haozhi, Xiong, Yuwen, Li, Yi, Zhang, Guodong, Hu, Han, and Wei, Yichen. "Deformable Convolutional Networks." 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764–773.

3. Xu, Xiangyu, Li, Muchen, and Sun, Wenxiu. "Learning Deformable Kernels for Image and Video Denoising." 2019.

4. Chan, Kelvin C. K, Wang, Xintao, Yu, Ke, Dong, Chao, and Loy, Chen Change. "Understanding Deformable Alignment in Video Super-Resolution." 2020.

Figures

Figure 1: Illustration of the generation of offset fields and of the resulting deformed sampling. The offsets are obtained by applying a convolution layer over the same input feature map. The channel dimension of the offset fields is twice that of the input feature maps, because the offsets are 2D.

Figure 2: An example of the concatenation of U-net blocks that operate in either the image domain or the k-space domain, with a data consistency layer in between. The deformable convolution layer, which extracts the motion features and generates offsets from a series of dynamic MR images, is at the beginning of the U-net block in the image domain.

Table 1: Peak signal-to-noise ratio (PSNR) of different methods with various acceleration factors. #1-5 are the five testing subsets. The proposed method is IKIKIK WWW-net which comprises 6 U-net blocks. Deformable convolution is able to improve the performance of reconstruction with different acceleration factors.

Table 2: The comparison of PSNR with different depths of cascaded U-net tested on five subsets. The proposed method achieves better results with increasing depth of the network. Multiple exploitations of offset fields improve image quality.

Figure 3: The reconstruction of one frame in a typical subset. The acceleration factor is R = 8. In comparison to the standard cascaded U-net, compressed sensing, and the zero-filling method, the deformable convolution alleviates the over-smoothing artifact, as shown at the ventricular wall.

Proc. Intl. Soc. Mag. Reson. Med. 29 (2021)
