
A Novel Cross-Subject Transformer Denoising Method
Shoujin Huang1, Sixing Liu1, Lifeng Mei1, Chenhui Tang1, Ed X Wu2,3, and Mengye Lyu1
1College of Health Science and Environmental Engineering, ShenZhen Technology University, Shenzhen, China, 2Laboratory of Biomedical Imaging and Signal Processing, The University of Hong Kong, Hong Kong, China, 3Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China

Synopsis

Keywords: Data Processing, Modelling, Deep learning, Denoise

In this work, we propose a new denoising method named Cross-Subject Transformer Denoising (CSTD), which transfers the texture of a reference image retrieved from a large database to the noisy image via soft attention mechanisms. Experiments on the fastMRI dataset at various noise levels show that our method is likely superior to many competing denoising algorithms, including the current state-of-the-art NAFNet. Moreover, our method exhibits excellent generalizability when directly applied to in-vivo low-field data without retraining. Due to its flexibility, the method is expected to have a wide range of applications.

Introduction

Denoising algorithms can be divided into two categories: internal denoising and external denoising1. The latter utilizes external clean images as auxiliary information and is also known as reference-based denoising.

While internal denoising provides satisfactory solutions for many applications, external denoising can be advantageous when noise is high and clean references are at hand. In fact, external denoising is particularly suitable for MRI: clean reference images can be collected from the same subjects2. Alternatively, we note that it is possible to utilize existing MRI databases, since anatomical structures are similar across subjects to a certain degree.

However, most existing external denoising methods such as TID3 suffer from the following limitations: (1) patches are matched in image space, which is neither as efficient nor as robust as matching in high-level feature space; (2) domain knowledge cannot be learned in advance, so these methods benefit little from big data.

Here, we propose a new attention-based external denoising method named Cross-Subject Transformer Denoising (CSTD). For a given noisy image, this method first retrieves clean reference images from a large database, then extracts, correlates, and fuses their deep texture features with soft attention, and lastly decodes the fused features to a denoised image. This method effectively utilizes big data, leading to excellent performance in MRI denoising tasks.

Methods

The proposed Cross-Subject Transformer Denoising (CSTD) method is schematically illustrated in Figure 1. It mainly includes a reference image retrieval module and a denoising network module.

1. Reference image retrieval module
To accelerate the search, every reference image is vectorized into a 2048-dimensional feature vector using ResNet50, forming a prebuilt reference database. For an input noisy image $$$X$$$, we likewise vectorize it and search the database for the best reference image in terms of least L2 distance using the FAISS4 library. Thanks to vectorization, this step is highly efficient, taking less than a second even when the database contains more than a million images.
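For concreteness, a minimal sketch of this retrieval step is given below, assuming a torchvision ResNet50 backbone truncated before its classification head and an exact FAISS L2 index; the helper names (extract_feature, build_index, retrieve) are illustrative, not our released code.

```python
import faiss
import numpy as np
import torch
import torchvision.models as models

# ResNet50 truncated before the classification head -> 2048-dim global feature.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_net = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def extract_feature(img: torch.Tensor) -> np.ndarray:
    """img: (1, 3, H, W) tensor (grayscale slices replicated to 3 channels)."""
    return feature_net(img).flatten().numpy().astype("float32")  # (2048,)

def build_index(ref_vectors: np.ndarray) -> faiss.IndexFlatL2:
    """ref_vectors: (N, 2048) float32 array of clean reference features."""
    index = faiss.IndexFlatL2(ref_vectors.shape[1])  # exact L2 search
    index.add(ref_vectors)
    return index

def retrieve(index: faiss.IndexFlatL2, noisy_img: torch.Tensor, k: int = 1):
    query = extract_feature(noisy_img)[None, :]  # (1, 2048) query vector
    _, ids = index.search(query, k)              # k nearest references by L2 distance
    return ids[0]                                # indices into the reference database
```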

2. Denoising network module
The proposed CSTD network has three key parts: a texture extractor, a texture transformer, and an encoder-decoder. The first two were inspired by TTSR5,6, which was designed for super-resolution tasks.

The texture extractor converts raw images into deep texture features at three scales (1, 1/2, and 1/4). Currently, we use VGG19. To facilitate the subsequent feature correlation, Gaussian noise $$$\Delta$$$ is added to the reference $$$Ref$$$, such that the resulting $$$Ref^\Delta$$$ has a noise level comparable to that of the input noisy image $$$X$$$:
$$Ref^\Delta = Ref + \Delta$$
Then $$$X$$$, $$$Ref^\Delta$$$, and $$$Ref$$$ are fed to the texture extractor $$$TE$$$ to obtain the multiscale query (Q), key (K), and value (V) features, respectively:
$$Q = TE(X),\quad K = TE(Ref^\Delta),\quad V = TE(Ref)$$
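Below is a minimal sketch of this step, assuming VGG19 features are taken at the end of the first three convolutional blocks to produce the three scales; the exact layer cut points and the noise level $$$\sigma$$$ are assumptions for illustration.

```python
import torch
import torchvision.models as models

class TextureExtractor(torch.nn.Module):
    """VGG19 features at three scales (1, 1/2, 1/4). Layer cut points are assumed."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features
        self.scale1 = vgg[:4]    # up to relu1_2: full resolution
        self.scale2 = vgg[4:9]   # up to relu2_2: 1/2 resolution
        self.scale4 = vgg[9:18]  # up to relu3_4: 1/4 resolution

    def forward(self, x):
        f1 = self.scale1(x)
        f2 = self.scale2(f1)
        f4 = self.scale4(f2)
        return f1, f2, f4

def make_qkv(te: TextureExtractor, x_noisy, ref, sigma):
    """Match the reference's noise level to the input, then extract Q, K, V."""
    ref_noisy = ref + sigma * torch.randn_like(ref)  # Ref^Delta = Ref + Delta
    Q = te(x_noisy)    # queries from the noisy input
    K = te(ref_noisy)  # keys from the noise-matched reference
    V = te(ref)        # values from the clean reference
    return Q, K, V
```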

The texture transformer estimates the feature correlation between Q and K by inner product and locates the most correlated features in V. Note that, for each location in Q, the correlation is estimated globally over all locations in K. Thus, the transferred multiscale texture features $$$T^n$$$ ($$$n$$$ indexing the three scales of 1, 1/2, and 1/4) are formed from the most correlated features in V, and the soft attention map $$$S$$$ records the highest correlation value at each location.
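A simplified per-location sketch of this attention step at one scale is shown below; for brevity it omits the patch (unfold/fold) handling used in TTSR-style implementations.

```python
import torch
import torch.nn.functional as F

def texture_transfer(Q, K, V):
    """Per-location attention at one scale. Q, K, V: (B, C, H, W).
    Returns transferred features T and the soft attention map S."""
    B, C, H, W = Q.shape
    q = F.normalize(Q.flatten(2), dim=1)   # (B, C, H*W), unit-norm per location
    k = F.normalize(K.flatten(2), dim=1)
    rel = torch.bmm(q.transpose(1, 2), k)  # (B, H*W, H*W) inner-product relevance
    S, idx = rel.max(dim=2)                # best match in K for each query location
    v = V.flatten(2)                       # (B, C, H*W)
    # Hard attention: gather the most correlated V feature at each location.
    T = torch.gather(v, 2, idx.unsqueeze(1).expand(-1, C, -1))
    return T.view(B, C, H, W), S.view(B, 1, H, W)
```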

The encoder-decoder is based on a UNet7 with 3 down/upsampling stages. We fuse the multiscale transferred features $$$T^n$$$ into the UNet decoder in a novel way: at each upsampling stage, the UNet-derived features $$$F^n$$$ are concatenated with $$$T^n$$$, convolved once, and multiplied by the soft attention map $$$S$$$. The fused features then pass through a squeeze-and-excite (SE)8 attention layer before the next upsampling stage. In summary, the fusion operation at each stage is
$$F^{n}_{fused} = SE(Conv(Concat(F^n,T^n))*S)+F^n$$
The final denoised image is obtained through a convolution layer after the last upsampling stage.
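The following sketch illustrates one decoder fusion stage implementing the above equation; the channel counts and the SE reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excite channel attention (reduction ratio assumed = 16)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool -> (B, C)
        return x * w[:, :, None, None]   # excite: per-channel rescaling

class FusionStage(nn.Module):
    """F_fused = SE(Conv(Concat(F, T)) * S) + F at one decoder stage."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.se = SEBlock(channels)

    def forward(self, F_dec, T, S):
        fused = self.conv(torch.cat([F_dec, T], dim=1)) * S  # weight by soft attention
        return self.se(fused) + F_dec                        # residual connection
```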

3. Network training and evaluation
The proposed CSTD network was trained and evaluated on the fastMRI9 brain dataset (see Figure 2 caption for details). For comparison, DnCNN10 and NAFNet11 were trained similarly without reference images. The traditional denoising methods BM3D12 and TID3 were also evaluated.
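As a reference for the training setup reported in the Figure 2 caption (L1 loss, Adam optimizer, learning rate 0.001, 100 epochs), a hypothetical training loop might look as follows; the model and data loader interfaces are placeholders rather than our actual code.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 100, lr: float = 1e-3, device: str = "cuda"):
    """Sketch of the reported settings: L1 loss, Adam, lr=0.001, 100 epochs.
    Each batch is assumed to yield (noisy input, retrieved reference, clean target)."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for epoch in range(epochs):
        for noisy, ref, clean in loader:
            noisy, ref, clean = noisy.to(device), ref.to(device), clean.to(device)
            denoised = model(noisy, ref)   # reference-conditioned denoising
            loss = l1(denoised, clean)
            opt.zero_grad()
            loss.backward()
            opt.step()
```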

4. Robustness test
(1) We investigated the robustness of CSTD by using less similar reference images.
(2) To further demonstrate its generalization ability, we applied the CSTD model to denoise real 0.3T brain data without retraining (see Figure 5 caption for details).

Results

Our method achieved the highest PSNR and SSIM values among all evaluated methods (Figure 2). These metrics agree with visual inspection of image quality (Figure 3). Our method can be affected by suboptimal references, yet even in extreme cases it performed reasonably well (Figure 4). In particular, it performed robustly on the real low-field data, and the NEX=1 images after CSTD denoising were visually comparable to or better than the NEX=6 images (Figure 5).

Discussion and Conclusion

Preliminary results indicate that our method is highly robust and performs favorably even in comparison to the current state-of-the-art NAFNet. Note that we used a simple implementation of the encoder-decoder and transformer, leaving ample room to borrow more advanced structures such as those in NAFNet and Restormer13. The results may also improve if multiple reference images are used. This method is widely applicable: it does not rely on additional acquisitions from the same subject, nor does it require image co-registration. Moreover, in the era of big data, it could continuously benefit from the growth of MRI databases.

Acknowledgements

This study is supported in part by the Natural Science Foundation of Top Talent of Shenzhen Technology University (Grant No. 20200208 to Lyu, Mengye) and the National Natural Science Foundation of China (Grant No. 62101348 to Lyu, Mengye).

References

[1]. Buades, A., Coll, B., & Morel, J.-M. (2005). A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4, 490-530.

[2]. Hu, J., Liu, Y., Yi, Z., Zhao, Y., Chen, F., & Wu, E. X. (2021). Adaptive Multi-contrast MR Image Denoising based on a Residual U-Net using Noise Level Map. In ISMRM Virtual Conference & Exhibition, 2021. International Society for Magnetic Resonance in Medicine (ISMRM).

[3]. Luo, E., Chan, S. H., & Nguyen, T. Q. (2014, May). Image denoising by targeted external databases. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2450-2454). IEEE.

[4]. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535-547.

[5]. Lyu, M., Deng, G., Zheng, Y., Liu, Y., & Wu, E. X. (2021). MR image super-resolution using attention mechanism: transfer textures from external database. In ISMRM Virtual Conference & Exhibition, 2021. International Society for Magnetic Resonance in Medicine (ISMRM).

[6]. Yang, F., Yang, H., Fu, J., Lu, H., & Guo, B. (2020). Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5791-5800).

[7]. Zhang, K., Li, Y., Zuo, W., Zhang, L., Van Gool, L., & Timofte, R. (2021). Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]. Vosco, N., Shenkler, A., & Grobman, M. (2021). Tiled Squeeze-and-excite: Channel attention with local spatial context. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 345-353).

[9]. Zbontar, J., Knoll, F., Sriram, A., Murrell, T., Huang, Z., Muckley, M. J., ... & Lui, Y. W. (2018). fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839.

[10]. Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017). Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7), 3142-3155.

[11]. Chen, L., Chu, X., Zhang, X., & Sun, J. (2022). Simple baselines for image restoration. arXiv preprint arXiv:2204.04676.

[12]. Danielyan, A., Katkovnik, V., & Egiazarian, K. (2011). BM3D frames and variational image deblurring. IEEE Transactions on Image Processing, 21(4), 1715-1728.

[13]. Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., & Yang, M. H. (2022). Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5728-5739).

Figures

Figure 1: First, CSTD searches a large database of clean images for a similar image to serve as the reference. Then, the noisy image and the reference are input to the texture extractor and texture transformer to produce the transferred texture features T and the soft attention map S. At each upsampling stage of the UNet, the texture features are fused under soft attention to generate the denoised image.

Figure 2: Statistical analysis of quantitative metrics on the fastMRI T2 dataset at different noise levels. The small variance of our method also indicates its good stability. Training details: from the 3493 T2-weighted volumes in the fastMRI brain dataset, we used 800 volumes for training, 200 for testing, and 2493 for the reference database. From each volume, we used 12 slices. In total, 9600 images were used for training, 2400 for testing, and 29916 for building the clean reference database. The model was trained for 100 epochs with L1 loss, the Adam optimizer, and a learning rate of 0.001.

Figure 3: Representative slices denoised by different methods under Gaussian noise. The PSNR/SSIM values are labeled on top. Our proposed CSTD method achieved substantial noise reduction while preserving fine details that agree well with the ground truth.

Figure 4: Results of the proposed CSTD method with suboptimal references under Gaussian noise. The PSNR/SSIM values are labeled on top. The denoising quality was affected when suboptimal references were used, yet even in extreme cases the method still performed reasonably well.

Figure 5: Results on real MRI data acquired on a 0.3T scanner (matrix size=256x195, TR/TE=5500/128 ms, echo train length=13). CSTD* denotes the weighted average of the original noisy image and the denoised image, balancing SNR and spatial resolution. Our method substantially reduced the noise in the NEX=1 images without obvious smoothing, leading to image quality similar to, if not better than, the NEX=6 averaged images. It is worth noting that the model used here was trained on fastMRI without any further fine-tuning, and the references were also drawn from fastMRI.

Proc. Intl. Soc. Mag. Reson. Med. 31 (2023)
0077
DOI: https://doi.org/10.58530/2023/0077