1669

Improved cardiac cine MRI using a deep learning-based ESPIRiT reconstruction with video SWIN transformers

Terrence Jao¹, Christopher Sandino², and Shreyas Vasanawala¹
¹Radiology, Stanford University, Stanford, CA, United States, ²Electrical Engineering, Stanford University, Stanford, CA, United States

Synopsis

Keywords: Myocardium, Cardiovascular, SWIN, ESPIRiT, Resnet, CINE

Motivation: Cardiac CINE MRI is used clinically for characterizing heart morphology and function, but requires multiple breath-holds to minimize motion artifacts from respiration.

Goal(s): Deep learning based ESPIRiT (DL-ESPIRiT) is able to reconstruct dynamic MRI data with high reconstruction accuracy. However, the method still has difficulty resolving fine anatomic structures. We aim to improve reconstruction accuracy using newer transformer architecture.

Approach: We replace the ResNet deep learning backbone by a modified Swin Image Restoration network (SwinIR) with video Swin transformers, called DL-ESPIRiT VSwinIR

Results: DL-ESPIRiT VSwinIR has substantially improved reconstruction accuracy and can accelerate acquisitions by up to 20x.

Impact: Fast CINE acquisitions using our DL-ESPIRiT VSwinIR may enable multiple CINE slices to be acquired in a shortened breath-hold, which will allow heart morphology and function assessment in patients with difficulty breath-holding or following instructions.

Introduction

Cardiac CINE MRI is used clinically for characterizing heart morphology and function, but requires multiple breath-holds to minimize motion artifacts from respiration. Accelerated acquisitions can reduce the number and duration of required breath-holds at the cost of aliasing and other reconstruction artifacts. Deep learning based ESPIRiT (DL-ESPIRiT) enables reconstruction of dynamic MRI data with up to 12x undersampling with higher reconstruction accuracy than modern compressed sensing algorithms.¹ However, DL-ESPIRiT still has difficulty resolving fast, fine anatomic structures such as the papillary muscles. We previously added a self-attention squeeze-excitation block into the DL-ESPIRiT that saw modest performance gains.² Inspired by the strong performance of transformers³, in this work, we replace the ResNet backbone of DL-ESPIRiT with a modified Swin Image Restoration network⁴ (SwinIR) that incorporates video Swin transformers^5,6 to capture both spatial and temporal features, which we call DL-ESPIRiT VSwinIR.

Methods: Network Architecture

The DL-ESPIRiT VSwinIR network takes input from a zero filled reconstruction of a 2D cardiac CINE slice and ESPIRiT maps computed from time averaged k-space data and alternates between the VSwin-IR neural network and data consistency steps in an unrolled fashion (Fig.1). The VSwin-IR network is adapted from the Swin Image Restoration⁴ (SwinIR) network, which is composed of shallow feature extraction (SFE), deep feature extraction (DFE), and image reconstruction (IR) modules (Fig.2a).

The SFE module,$$$H_{SFE}$$$, is a convolution that maps an intermediate image after data consistency, $$$x^k\in\mathbb{R}^{T \times H \times W \times C_{in}}$$$ to shallow features $$$F_{SF}\in\mathbb{R}^{T \times H \times W \times C}$$$, where $$$T$$$, $$$H$$$, $$$W$$$, $$$C_{in}$$$, and $$$C$$$ are image frames, height, width, input channel number, and feature channel number.
The DFE module, $$$H_{DFE}$$$, is composed of R residual Swin transformer blocks followed by a convolution layer. Specifically, $$ F_i = H_{RSTB_i}(F_{i-1}), i=1,2,...,R $$ $$F_{DF} = H_{conv}(F_R)$$ Where $$$F_i$$$ denotes the intermediate feature after the i^th RSTB block, $$$H_{RSTB_i}$$$, and $$$H_{conv}$$$ is the convolutional layer.
The image reconstruction module, $$$H_{IR}$$$, is a convolution that maps the aggregated shallow and deep features back to the image space such that $$ x^{k+1}=H_{IR}(F_{SF}+ F_{DF})$$

The Residual Swin transformer block is composed of Q video Swin transformer layers followed by a convolution before a residual connection (Fig.2b), such that $$ F_{i,j} = H_{VSTL_{i,j}}(F_{i,j-1}), j=1,2,...,Q $$ $$F_i = H_{conv_i}(F_{i,Q})+F_{i,0} $$ Where $$$F_{i,j}$$$ denotes the intermediate feature after the j^th Swin transformer layer, $$$H_{VSTL_{i,j}}$$$, of the of the i^th RSTB block and $$$H_{conv_i}$$$ is the convolutional layer.

The video Swin transformer is composed of a multi-head self-attention (MSA) followed by a multi-layer perceptron (MLP) module (Fig2c). MSA is performed within each non-overlapping 3D windows of size $$$P \times M \times M $$$ and is calculated as $$Attention(Q,K,V) = SoftMax(QK^T/sqrt(d)+B)V$$ Where $$$Q,K,V \in\mathbb{R}^{PM^2 \times d}$$$ are the query, key, and value matrices, d is the query/key dimension, and $$$B\in\mathbb{R}^{P^2 \times M^2 \times M^2} $$$ is the relative position bias. MLP is composed of two fully connected layers with a GELU non-linearity in between. A layer-norm and residual connection is employed for both the MSA and MLP modules. To enable connections across local windows between layers, the windows are shifted by $$$(P/2, M/2, M/2)$$$ between subsequent Swin transformer layers.

Methods: Dataset and Training

Fully sampled 2D cardiac cine datasets were acquired from 22 healthy volunteers on 1.5T and 3.0T GE MRI scanners with a 32 channel cardiac coil using balanced SSFP with TR/TE/FA, slice thickness, and spatial resolution of 3.7ms/1.6ms/40-60⁰, 8mm, and 1.8-2.0$$$\times$$$1.6-1.8mm². Data were acquired in the short axis, 2-chamber, 3-chamber, and 4/5-chamber views. All data were compressed from 32 to 8 virtual coils. The 22 volunteers were divided into three cohorts for training, validation, and testing (10, 2, 10 split) corresponding with 5220, 660, and 2640 images respectively. Variable density undersampling masks were applied to simulate 10-15x acceleration across k-t space. Training was performed using a Nvidia RTX 3090 24GB video card for a total of 101 hours.

Results

Figures 3a and 3b show representative reconstructions of DL-ESPIRiT, and DL-ESPIRiT VSwinIR on retrospectively undersampled cardiac CINE with varying acceleration factors in the short axis and a 5-chamber view respectively. VSwinIR has substantially less aliasing artifacts at higher acceleration factors up to R=20 and better reconstructs fine anatomic details such as the papillary muscles. Figure 4 demonstrates DL-ESPIRiT VSwinIR has improved reconstruction compared to DL-ESPIRiT across all acceleration factors in terms of SSIM.

Conclusion

We used a video Swin transformer network with DL-ESPIRiT and significantly improved the reconstruction quality of cardiac CINE images. This network has better characterization of fine anatomic detail, less aliasing artifacts, and decreased reconstruction error, which can allow us to push the acceleration by up to 20x.

Acknowledgements

GE healthcare

References

Sandino CM, Lai P, Vasanawala SS, Cheng JY. Accelerating cardiac cine MRI using a deep learning-based ESPIRiT reconstruction. Magn Reson Med. 2021;85(1):152-167. doi:10.1002/mrm.28420
Jao, Terrence, Sandino CM, Vasanawala SS. Improving cardiac cine MRI using a deep learning-based ESPIRiT reconstruction with self attention. In: Proc. ISMRM 31st Scientific Session. Toronto, Canada; :p4974.
Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. ArXiv170603762 Cs. August 2023. http://arxiv.org/abs/1706.03762.
Liang J, Cao J, Sun G, Zhang K, Van Gool L, Timofte R. SwinIR: Image Restoration Using Swin Transformer. ArXiv210810257 Cs Eess. August 2021. http://arxiv.org/abs/2108.10257.
Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ArXiv210314030 Cs. August 2021. http://arxiv.org/abs/2103.14030.
Liu Z, Ning J, Cao Y, et al. Video Swin Transformer. ArXiv210613230 Cs. June 2021. http://arxiv.org/abs/2106.13230.

Figures

Figure 1: Diagram of the DL-ESPIRiT VSwinIR network architecture. ESPIRiT maps are computed from a fully sampled calibration region of time averaged k-space and are used to calculate an initial zero-filled reconstruction. The unrolled network applies data consistency (DC) updates followed by the Video SwinIR module for k = 5 iterations.

Figure 2: Video SwinIR architecture. A) VSwinIR is composed of 3 modules, shallow feature extraction (SFE), deep feature extraction (DFE), and image recovery (IR). DFE is composed of R=1 Residual Swin Transformer Blocks (RSTB) followed by a 3x3x3 convolution layer. B) Each RSTB block is composed of Q=6 video Swin transformer layers (VSTL) followed by a convolution layer before a residual connection with the input. C) The VSTL is composed of a shifted window multi-head self-attention and a multi-layer perceptron module with residual connections after layer normalization.

Figure 3: Cardiac CINE reconstructions with DL-ESPIRiT and DL-ESPIRiT VSwinIR with a range of acceleration factors (R) of 10, 16, 20, and 24 with corresponding ground truth in the a) short axis and b) 5 chamber view. The red arrow indicates the papillary muscles. The green arrow indicates the right ventricular trabeculations. The yellow arrow indicates aliasing artifacts.

Figure 4: Box plots of SSIM (left) and RMSE (right) of DL-ESPIRiT (blue) and DL-ESPIRiT VSwinIR (orange) reconstructions at different acceleration factors. The DL-ESPIRiT VSwinIR demonstrates improved reconstruction quality in terms of SSIM for all acceleration factors. At lower acceleration factors of 10 through 14, DL-ESPIRiT VSwinIR had slightly greater RMSE compared with DL-ESPIRiT.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

1669

DOI: https://doi.org/10.58530/2024/1669