0654

Imaging transformer for MRI denoising with SNR unit training: enabling generalization across field-strengths, imaging contrasts, and anatomy
Hui Xue1, Sarah Hooper1, Azaan Rehman1, Iain Pierce2, Thomas Treibel2, Rhodri Davies2, W Patricia Bandettini1, Rajiv Ramasawmy1, Ahsan Javed1, Yang Yang3, James Moon2, Adrienne Campbell-Washburn1, and Peter Kellman1
1National Heart, Lung, and Blood Institute, Bethesda, MD, United States, 2Barts Heart Centre at St. Bartholomew's Hospital, London, United Kingdom, 3University of California, San Francisco, San Francisco, CA, United States

Synopsis

Keywords: Other AI/ML, Machine Learning/Artificial Intelligence, imaging transformer, generalization

Motivation: MR denoising with CNN models often requires abundant high-quality training data. In many applications, such as higher acceleration and low field, high-quality data is not available. This study overcomes this limitation by developing an SNR unit based training scheme and a novel imaging transformer (imformer) architecture.

Goal(s): To develop and validate a novel imformer model for MR denoising, enabling generalization across field-strengths, imaging contrasts, and anatomy.

Approach: SNR unit training scheme and imaging transformer architecture

Results: Imformer models outperformed CNNs and a conventional transformer. The SNR unit training enables strong generalization.

Impact: Recovers high-fidelity MR signal from very low SNR inputs; enables 0.55T MRI model training.

Introduction

The ability to recover signal from noise is key to achieving fast acquisition, accurate quantification, and good image quality. Past work has shown that CNNs [1] can be trained with large datasets of high-SNR images. For applications where high-quality data is difficult to produce at scale (e.g. aggressive acceleration, high resolution, or low field), such training can be infeasible. We overcome this limitation by a) developing a training scheme that uses complex MR images in SNR units (i.e., the images have a fixed noise level) and augmentation with realistic noise based on coil g-factor to allow strong generalization, and b) developing a novel imaging transformer (imformer) that outperforms CNNs.

Methods

The proposed training method is shown in Fig. 1. High-SNR complex images (retro-gated cine acquired solely at 3T) were degraded with realistic MR noise, augmented with g-factor maps (R=2 to 6), partial Fourier and k-space filters. Unlike past approaches, which typically normalize the signal level (e.g., to the range [0, 1]), we normalize the noise level. SNR unit reconstruction [2] was applied throughout training to keep the noise standard deviation at 1.0. The corrupted images and g-factor maps were input into the model, which was optimized to recover high-quality images from the degraded inputs.
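The noise degradation step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `add_snr_unit_noise`, the parameter `sigma`, and the choice to rescale by `sigma` are assumptions made for illustration, and the partial Fourier and k-space filter steps are omitted. The sketch also assumes the noise already present in the high-SNR image is negligible compared with the added noise.

```python
import numpy as np

def add_snr_unit_noise(clean, gmap, sigma, rng):
    """Add g-factor-weighted complex noise and return the result in SNR units.

    clean : complex array, high-SNR image (its own noise is assumed negligible)
    gmap  : real array, g-factor map for the simulated acceleration
    sigma : hypothetical added noise level, in the units of `clean`
    rng   : numpy.random.Generator
    """
    # White complex Gaussian noise with unit std in the real and imaginary parts.
    white = rng.standard_normal(clean.shape) + 1j * rng.standard_normal(clean.shape)
    # Parallel imaging amplifies noise spatially by the g-factor.
    noisy = clean + sigma * gmap * white
    # Divide by sigma so the added noise has std 1.0 wherever g = 1;
    # the signal level changes instead of the noise level.
    return noisy / sigma

# Illustrative check on a signal-free image with a flat g-factor map.
rng = np.random.default_rng(0)
clean = np.zeros((256, 256), dtype=np.complex128)
flat_gmap = np.ones((256, 256))
noisy = add_snr_unit_noise(clean, flat_gmap, sigma=5.0, rng=rng)
noise_std = noisy.real.std()
```

In this sketch a larger `sigma` lowers the signal amplitude of the output rather than raising its noise floor, which is the sense in which the noise level, not the signal level, is normalized.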

We propose a novel imaging transformer (Fig. 2) that enables the model to process 2D, 2D+T, and 3D images. The key innovations are three configurable imaging attention modules: spatial local (L), spatial global (G) and temporal (T) attention. A block allows any combination of attention modules. By stacking many blocks together, imaging transformer models are built, enabling every output pixel to be computed by nonlinearly combining all input pixels with data-specific attention coefficients. Two imaging transformers were implemented, inspired by two popular CNNs: Unet [3] and high-res net (“HRnet”) [4] (Fig. 2b and c, respectively).
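To make the idea of data-specific attention coefficients concrete, the following is a simplified sketch of attention along the temporal axis, written in NumPy. It is an assumption-laden illustration rather than the imformer module itself: the function `temporal_attention`, the per-pixel linear projections `wq`, `wk`, `wv`, and the choice to treat each whole frame as one token are hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, wq, wk, wv):
    """Scaled dot-product attention across frames.

    x          : real array of shape [T, H, W, C]
    wq, wk, wv : [C, C] projection matrices (hypothetical, per-pixel linear maps)
    Returns the attended tensor [T, H, W, C] and the [T, T] attention matrix.
    """
    T, H, W, C = x.shape
    # Project channels, then flatten each frame into a single token.
    q = (x @ wq).reshape(T, -1)
    k = (x @ wk).reshape(T, -1)
    v = (x @ wv).reshape(T, -1)
    # Frame-to-frame similarity gives the data-specific coefficients.
    scores = q @ k.T / np.sqrt(q.shape[1])
    attn = softmax(scores, axis=-1)
    # Every output frame is a weighted combination of all input frames.
    out = attn @ v
    return out.reshape(T, H, W, C), attn

# Illustrative call on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 3))
wq, wk, wv = (rng.standard_normal((3, 3)) for _ in range(3))
out, attn = temporal_attention(x, wq, wk, wv)
```

Spatial local and global attention would follow the same pattern with tokens formed from image patches instead of frames; how those patches are grouped is not reproduced here.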

Model comparison. Five models were compared: three “standard” models (Unet, HRnet, and the swin3D transformer [5]) and two imformers (Unet-imformer and HRnet-imformer). All were trained on 309,238 3T retro-cine complex 2D+T series (MAGNETOM Prisma, Siemens Healthcare) with SNR unit reconstruction and noise augmentation. For ablation, four models were additionally trained without g-factor or MR noise. Training ran for 50 epochs with the Sophia optimizer [6]. Models were tested on a held-out set of 560 samples containing 2D, 2D+T, and 3D images at different SNR levels. MSE, L1, PSNR (max intensity 2048.0) and SSIM were computed.
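As a sketch of how three of these metrics relate, the snippet below computes MSE, L1 and PSNR with the stated maximum intensity of 2048.0. Evaluating on magnitude images and the function name `denoising_metrics` are assumptions; SSIM is left out because it needs a windowed implementation from an image-processing library.

```python
import numpy as np

def denoising_metrics(pred, target, max_intensity=2048.0):
    """Return MSE, L1 and PSNR (dB) between two images.

    Complex inputs are converted to magnitude first (an assumption here).
    """
    pred = np.abs(np.asarray(pred))
    target = np.abs(np.asarray(target))
    err = pred - target
    mse = float(np.mean(err ** 2))
    l1 = float(np.mean(np.abs(err)))
    # PSNR = 10 * log10(MAX^2 / MSE); infinite for a perfect match.
    psnr = float("inf") if mse == 0.0 else float(10.0 * np.log10(max_intensity ** 2 / mse))
    return {"mse": mse, "l1": l1, "psnr": psnr}

# Illustrative call: a constant error of 1 intensity unit everywhere.
target = np.zeros((16, 16))
pred = np.ones((16, 16))
metrics = denoising_metrics(pred, target)
```

Because PSNR depends on the chosen maximum intensity, values are only comparable between models when the same maximum is used for all of them.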

Cross field strength generalization, 0.55T MRI. Eight healthy volunteers were scanned on a 0.55T scanner (MAGNETOM FreeMax, Siemens Healthcare) with R=2 and R=3 retro-gated cine in CH4, CH2, CH3 and SAX stacks. HRnet-imformer was applied to the R=3 cine, and ROIs were drawn in the LV and myocardium. The local point spread function (LPSF) [7] was measured at 452 points on the endo- and epicardial boundaries to quantify resolution loss after the model. EF, ESV, EDV and MASS were measured on model outputs with a validated cine AI model [8] and compared to R=2 without the model.

Cross imaging contrast. While training used only retro-gated cine 3T data, the model was applied to 1.5T CMR perfusion and LGE. Cross anatomy. The model was applied to three datasets (neuro 3T, spine and knee 0.55T [9]) to demonstrate cross-anatomy performance.

Results

Table 1 lists the model comparison results. HRnet-imformer gave the best performance, although Unet-imformer was closely comparable. Adding spatial attention and noise augmentation improved performance over using only temporal attention or no noise augmentation. Imaging transformers outperformed CNNs. Among the convolutional models, the 3D models performed better than the 2D models.

Figure 3 gives examples of 0.55T cines. Mean SNR gain for R=3 was 119-224% (90 percentiles) in the blood pool and 142-277% in the myocardium. The model was found to be locally linear (local linearity ratio [7] was 0.99±0.09), and the LPSF was 1.04±0.11 for the readout and phase directions and 1.28±0.14 for the temporal direction. Due to inferior image quality, the cine AI model failed in 4 subjects on the raw R=3 images but succeeded after denoising with all imformer models. No significant differences in cardiac function measures were found between R=3 model outputs and R=2 images (paired t-test, P>0.15).
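The paired comparison between the two acquisitions can be sketched as follows. This is a generic illustration of a paired t statistic and Bland-Altman limits of agreement, not the analysis code used in the study; the function name `paired_comparison` is hypothetical and the input values are made-up numbers, not study data. The p-value lookup against a t distribution is omitted to keep the sketch dependency-free.

```python
import numpy as np

def paired_comparison(a, b):
    """Paired t statistic and Bland-Altman summary for two matched measurements."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = a - b                      # per-subject differences
    n = d.size
    bias = float(d.mean())         # Bland-Altman bias
    sd = float(d.std(ddof=1))      # sample std of the differences
    # Paired t statistic with n - 1 degrees of freedom (undefined if sd is 0).
    t_stat = bias / (sd / np.sqrt(n))
    # 95% limits of agreement under a normality assumption.
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    return {"bias": bias, "sd": sd, "t": float(t_stat), "dof": n - 1, "loa": loa}

# Made-up EF-like values (%) for four hypothetical subjects.
denoised_r3 = [60.0, 55.0, 65.0, 58.0]
reference_r2 = [59.0, 56.0, 64.0, 57.0]
summary = paired_comparison(denoised_r3, reference_r2)
```

A bias near zero with narrow limits of agreement is what "did not bias cardiac function measures" would look like in such a summary.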

Figure 4 shows model generalization results on different imaging sequences (1.5T perfusion and LGE in 4a) and other anatomies (4b). While all training was performed on heart data from 3T scanners, the model generalized well.

Conclusion

We propose a denoising training scheme with SNR unit reconstruction and realistic noise augmentation, and propose novel imaging attention modules and imformers. We show their superior performance over CNNs and conventional transformers for MRI denoising tasks. Together, these contributions result in strong generalization across field-strengths, imaging sequences, contrasts, and anatomies.

Acknowledgements

No acknowledgement found.

References

1. Desai AD, Ozturkler BM, Sandino CM, et al. Noise2Recon: Enabling SNR-robust MRI reconstruction with semi-supervised and self-supervised learning. Magn Reson Med. 2023;90(5):2052-2070. doi:10.1002/mrm.29759

2. Kellman P, McVeigh ER. Image reconstruction in SNR units: A general method for SNR measurement. Magn Reson Med. 2005;54(6):1439-1447. doi:10.1002/mrm.20713

3. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Lect Notes Comput Sci. 2015;9351:234-241. doi:10.1007/978-3-319-24574-4_28

4. Wang J, Sun K, Cheng T, et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2021;43(10):3349-3364. doi:10.1109/TPAMI.2020.2983686

5. Hatamizadeh A, Nath V, Tang Y, Yang D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. :1-13.

6. Liu H, Li Z, Hall D, Liang P, Ma T. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. 2023:1-32.

7. Wech T, Stäb D, Budich JC, Köstler H. Resolution evaluation of MR images reconstructed by iterative thresholding algorithms for compressed sensing. 2012;39(7):4328-4338.

8. Davies RH, Augusto JB, Bhuva A, et al. Precision measurement of cardiac structure and function in cardiovascular magnetic resonance using machine learning. J Cardiovasc Magn Reson. 2022;24(1):1-11. doi:10.1186/s12968-022-00846-4

9. Campbell-Washburn AE, Ramasawmy R, Restivo MC. Opportunities in Interventional and Diagnostic Imaging by Using High-Performance Low-Field-Strength MRI. Radiology. 2019;293(3):384-393.

10. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020. http://arxiv.org/abs/2010.11929.

Figures

Figure 1. SNR unit training scheme. Training starts by computing the g-factor maps at different accelerations (g-factor map augmentation). The white, complex noise was transferred with the SNR unit scaling maintained. Different models can be plugged into this training scheme. A long-term skip connection was added to facilitate residual learning. Training was performed with two patch sizes (32x32 and 64x64, interleaved between steps). The loss was the sum of MSE, L1, perpendicular loss and PSNR loss. 90% of the ~310K samples were used for training and 10% for validation.

Figure 2. Imaging transformer model architecture. (a) Three imaging attention modules are key to this design. Every module can process 2D, 2D+T and 3D data. These three modules can be concatenated as a block, e.g. TLG means the concatenation of Temporal, Local and Global attention. (b) Unet-imformer and (c) HRnet-imformer are instantiations of imaging transformers. For comparison with the imformer models, corresponding CNN networks were implemented by replacing every imaging attention module with a convolution layer, together with nonlinear activation and normalization.

Figure 3. 0.55T retro-gated cine imaging. Imaging parameters: FOV 360x270 mm², FA 78°, BW 528 Hz, SSFP, acquired temporal resolution 33 ms, breath-hold 12 and 9 HBs for R=2 and 3. (a) R=3 before and after the model. (b) Bland-Altman plots for EF, ESV, EDV and MASS on R=2 and R=3 model outputs, showing that the model did not bias cardiac function measures. CV is the coefficient of variation. RPC is the reproducibility coefficient. Cine AI reported CVs for EF and MASS of 4.2% and 3.6% [10] in a scan-rescan test. The imformer-induced measurement variation (2.6% and 1.3%) is less than the reported variation of cine AI.

Figure 4. With all training data acquired at 3T for the heart, the SNR unit training scheme enables strong generalization over anatomies, contrasts and sequences. (a) Cardiac perfusion, acquired matrix 256x144, SSFP readout, TR 70 ms, R=4, 1.5T Siemens Aera. (b) Free-breathing LGE, slice thickness 8 mm, acquired matrix 256x256, R=2, 16 averages with 50% kept after MOCO. (c) T1 MPRAGE 3D neuro scan, matrix size 256x282x224, R=4, 3T scanner. (d) TSE sagittal knee scan, acquired matrix 384x384, R=1, 0.55T scanner. (e) T2 TSE sagittal spine scan, acquired matrix 384x384, R=2, 0.55T scanner.

Table 1. Model comparison results. Here Block-0 and Block-1 are the block structures for the two resolution levels in the HRnet and Unet setups. HRnet-imformer gave slightly better results than Unet-imformer. Imaging transformers outperformed the CNNs, with higher PSNR and SSIM and lower MSE and L1. The ablation tests showed benefits from g-factor map and noise augmentation for both CNNs and imformers. HRnet-imformer outperformed the Swin 3D model, which is a conventional linear attention transformer with more parameters. The 3D convolution models performed much better than the 2D convolution models.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)
DOI: https://doi.org/10.58530/2024/0654