
Self-supervised contrastive learning for motion artifact detection in whole-body MRI: Quality assessment across multiple cohorts
Thomas Küstner1, Jan Borst1, Dominik Nickel2, Fabian Bamberg3, Marcel Früh1, and Sergios Gatidis1,4
1Medical Image and Data Analysis (MIDAS.lab), Department of Diagnostic and Interventional Radiology, University Hospital of Tuebingen, Tuebingen, Germany, 2Siemens Healthineers, Erlangen, Germany, 3Department of Diagnostic and Interventional Radiology, University of Freiburg, Freiburg, Germany, 4Max Planck Institute for Intelligent Systems, Tuebingen, Germany

Synopsis

Keywords: Machine Learning/Artificial Intelligence, Motion Correction, Motion Detection, Self-Supervised Learning

Motion remains one of the major extrinsic sources of imaging artifacts in MRI and can strongly deteriorate image quality. Impairment by motion artifacts can reduce the reliability and precision of the diagnosis, and a motion-free reacquisition can become time- and cost-intensive. Furthermore, in large-scale epidemiological cohorts, manual quality screening becomes impracticable, so an automated quality assessment is of interest. Reliable motion detection across varying domains (imaging sequences, scanners, sites) is, however, challenging. In this work, we propose an attention-based transformer that can detect motion artifacts in various MR imaging scenarios.

Introduction

MRI offers a broad variety of imaging applications with targeted and flexible sequence and reconstruction parametrizations. Varying acquisition conditions and long examination times make MRI susceptible to imaging artifacts, amongst which motion is one of the major extrinsic factors deteriorating image quality.

Images recorded for clinical diagnosis are often inspected by a human specialist to determine the achieved image quality, which can be a time-demanding and cost-intensive process. If insufficient quality is noticed too late, an additional examination may even be required, decreasing patient comfort and throughput. Moreover, in the context of large epidemiological cohort studies such as the UK Biobank (UKB)1 or the German National Cohort (NAKO)2, the amount and complexity of data exceed what is practicable for a manual image quality analysis. Thus, in order to guarantee high data quality, arising artifacts need to be detected as early as possible so that appropriate countermeasures can be taken, e.g. prospective3,4 or retrospective correction techniques5,6. Retrospective deep-learning correction methods7-14 have also been proposed for motion-affected images; however, these methods are not always available and/or applicable. Instead of manual screening by experts or retrospective correction, automatic processing for early prospective quality assurance can be a better-suited alternative.

Ideally, a database of pairwise motion-free and motion-affected images, or large cohorts with manual annotations of motion artifact presence, would enable the use of supervised training procedures. Creating such cohorts is, however, very demanding, and the number of labeled samples remains limited. Most previous works operated in this supervised setting15-20 or simulated motion artifacts21-24.

We propose a self-supervised model capable of detecting MR motion artifacts in various imaging domains. The proposed architecture combines a Region Vision Transformer (Region-ViT) and a Swin UNet and is trained on unlabeled data in a self-supervised contrastive fashion. The learned feature embedding is then fine-tuned on a few labeled cases. The proposed framework is tested on abdominal data from UKB and NAKO and on two in-house acquired whole-body databases.

Methods

The proposed framework is depicted in Fig.1 and detects motion artifacts on a pixel level in various imaging domains (scanners, imaging sites, imaging sequences, contrasts).

Data: Data was considered from four databases:

NAKO/UKB: Axial abdominal data was acquired with a 3D Dixon sequence in NAKO (3T Siemens Skyra, resolution=1.4x1.4x3mm³, TE/TR=1.23,2.46/4.36ms, α=9°) and UKB (1.5T Siemens Aera, resolution=2.2x2.2x4mm³, TE/TR=2.39,4.77/6.69ms, α=10°). 5000 randomly selected subjects were taken from each database.

MRMotion: Axial T1w and T2w fast-spin-echo data was recorded in the head, abdomen and pelvis on a 3T PET/MR (Siemens Biograph mMR) in 20 healthy subjects20. Each subject was scanned twice, once in a motion-free and once in a motion-affected setting (head movement and free breathing).

BLADE: Axial abdominal and pelvic data was acquired with T2w BLADE (resolution=0.9x0.9x4mm³, TE/TR=104/7560ms, α=130°) on a 1.5T MRI (Siemens Aera) in 420 patients.

Network: The ViT-UNet architecture consists of a Region-ViT25 and a Swin UNet26. It takes a 2D image slice as input and encodes local (patch) and regional (4x4 neighbourhood) tokens via convolution operations. These embeddings are passed through an R2L-attention block consisting of local and regional multi-head self-attention modules followed by a feed-forward network. In the encoder, downsampling is realized by 3x3 convolutions with stride 2x2, layer normalization and a Gaussian Error Linear Unit (GELU). In the decoder, upsampling consists of 2x2 transposed convolutions (stride 2x2), layer normalization and GELU.
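The following is a minimal PyTorch sketch of the convolutional down- and upsampling stages described above (3x3 strided convolution and 2x2 transposed convolution, each followed by layer normalization and GELU). Module names, channel counts and the NCHW tensor layout are assumptions for illustration; the Region-ViT/Swin R2L-attention blocks between these stages are omitted.

```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.Module):
    """Layer normalization over the channel dimension of NCHW feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # NCHW -> NHWC, normalize over channels, back to NCHW
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)


class EncoderDown(nn.Module):
    """Encoder downsampling: 3x3 convolution with stride 2, layer norm, GELU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            LayerNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


class DecoderUp(nn.Module):
    """Decoder upsampling: 2x2 transposed convolution with stride 2, layer norm, GELU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            LayerNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


# A 224x224 input patch is halved in resolution by each encoder stage
# and restored by the corresponding decoder stage.
x = torch.randn(1, 64, 224, 224)
down, up = EncoderDown(64, 128), DecoderUp(128, 64)
print(down(x).shape)      # torch.Size([1, 128, 112, 112])
print(up(down(x)).shape)  # torch.Size([1, 64, 224, 224])
```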

Training: 2D slices were patched to a common size of 224x224 by a sliding window (90% overlap) and normalized to unit variance. Random cropping, flipping and affine motion simulation (translation, rotation, shearing) provide the positive and negative pairs for contrastive learning. The network was trained in a self-supervised manner for 60 epochs with the AdamW optimizer (weight decay 1E-4), a one-cycle learning rate scheduler (1E-6 to 1E-4), batch size 64, an L1-similarity loss between input and reconstructed output, and a contrastive loss on the bottleneck features. For self-supervised training, 4980 subjects each from NAKO and UKB and 415 subjects from BLADE were used. Afterwards, the encoder was frozen and the feature embeddings were used for supervised fine-tuning on 15 subjects (NAKO, UKB) and 15 subjects (MRMotion) with manually annotated motion masks and a binary cross-entropy loss.
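The abstract names the two self-supervised objectives (an L1-similarity loss between input and reconstruction and a contrastive loss on the bottleneck features) but not the exact contrastive formulation. The sketch below therefore assumes an InfoNCE/NT-Xent-style loss in which the cropped/flipped view of a slice serves as the positive and the motion-simulated view as the negative; `model`, `ssl_step`, the returned `(reconstruction, bottleneck)` tuple, the temperature and the equal loss weighting are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z_anchor, z_positive, z_negative, temperature: float = 0.1):
    """Assumed InfoNCE-style loss on bottleneck embeddings: the anchor should lie
    closer to its crop/flip view (positive) than to its motion-simulated view
    (negative)."""
    z_a = F.normalize(z_anchor.flatten(1), dim=1)
    z_p = F.normalize(z_positive.flatten(1), dim=1)
    z_n = F.normalize(z_negative.flatten(1), dim=1)
    sim_pos = (z_a * z_p).sum(dim=1, keepdim=True) / temperature  # (B, 1)
    sim_neg = (z_a * z_n).sum(dim=1, keepdim=True) / temperature  # (B, 1)
    logits = torch.cat([sim_pos, sim_neg], dim=1)                 # (B, 2)
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # positive pair sits at index 0


def ssl_step(model, x, x_pos, x_neg, lambda_contrastive: float = 1.0):
    """One self-supervised step: L1 similarity between input and reconstruction
    plus the contrastive term on the bottleneck features. `model` is assumed to
    return (reconstruction, bottleneck_features)."""
    recon, z = model(x)
    _, z_pos = model(x_pos)
    _, z_neg = model(x_neg)
    loss_l1 = F.l1_loss(recon, x)
    loss_con = contrastive_loss(z, z_pos, z_neg)
    return loss_l1 + lambda_contrastive * loss_con


# Optimizer and schedule as stated in the text: AdamW (weight decay 1e-4) with a
# one-cycle learning rate schedule ramping from 1e-6 up to 1e-4, e.g.
# torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4) and
# torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-4, div_factor=100, ...).
```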

Experiments: Performance was compared against a vanilla UNet and a Swin UNet, trained fully supervised on 200 annotated cases, as well as a self-supervised Swin UNet. All networks were tested on 20 subjects with manual annotations (drawn uniformly over the databases) in terms of accuracy, sensitivity, specificity, precision, negative predictive value and Dice.
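For reference, the reported pixel-wise metrics can be computed from binary prediction and annotation masks as in the following NumPy sketch (a hypothetical helper, not the authors' evaluation code).

```python
import numpy as np


def pixelwise_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """Pixel-wise agreement between a predicted motion mask and a manual annotation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # motion pixels correctly detected
    tn = np.sum(~pred & ~gt)    # motion-free pixels correctly left out
    fp = np.sum(pred & ~gt)     # false alarms
    fn = np.sum(~pred & gt)     # missed motion pixels
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "precision":   tp / (tp + fp + eps),
        "npv":         tn / (tn + fn + eps),
        "dice":        2 * tp / (2 * tp + fp + fn + eps),
    }
```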

Results and Discussion

Figs. 2 and 3 depict exemplary slices of NAKO and UKB subjects for the different architectures. Fig. 4 shows the performance of the proposed architecture across the different databases, and the quantitative metrics are summarized in Fig. 5. Overall, the proposed approach can reliably detect motion, and self-supervised contrastive learning outperforms supervised training. While some pixel-wise discrepancies exist between the network predictions and the manual annotations, these may also be attributed to ambiguities in labelling.
We acknowledge several limitations. The databases still need to be harmonized with respect to sample size and composition, further investigations are needed to verify the reliability of the proposed approach in more diverse imaging scenarios, and additional means of assessing model accuracy and quantifying the presence of motion are required.

Conclusion

The proposed framework was trained in a self-supervised manner on unlabeled data and fine-tuned on a few labeled cases. It can reliably detect motion artifacts across multiple cohorts without the need for retraining, making it suitable for automatic quality control.

Acknowledgements

This project was conducted with data from the German National Cohort (GNC) (www.nako.de). The GNC is funded by the Federal Ministry of Education and Research (BMBF) (project funding reference no. 01ER1301A/B/C and 01ER1511D), federal states, and the Helmholtz Association, with additional financial support from the participating universities and institutes of the Leibniz Association. We thank all participants who took part in the GNC study and the staff in this research program.

This work was carried out under UK Biobank Application 40040. We thank all participants who took part in the UKBB study and the staff in this research program.

The work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2064/1 – Project number 390727645.

References

1. Ollier et al. UK Biobank: from concept to reality. Pharmacogenomics 2005;6(6):639-646.
2. Bamberg et al. Whole-Body MR Imaging in the German National Cohort: Rationale, Design, and Technical Background. Radiology 2015;277(1):206-220.
3. Maclaren et al. Prospective motion correction in brain imaging: a review. Magn Reson Med 2013;69(3):621-636.
4. Zaitsev et al. Motion artifacts in MRI: A complex problem with many partial solutions. J Magn Reson Imaging 2015;42(4):887-901.
5. Atkinson et al. An autofocus algorithm for the automatic correction of motion artifacts in MR images. Information Processing in Medical Imaging; 1997; Berlin, Heidelberg. Springer Berlin Heidelberg. p 341-354.
6. Odille et al. Generalized reconstruction by inversion of coupled systems (GRICS) applied to free-breathing MRI. Magn Reson Med 2008;60(1):146-157.
7. Haskell et al. Network Accelerated Motion Estimation and Reduction (NAMER): Convolutional neural network guided retrospective motion correction using a separable motion model. Magnetic Resonance in Medicine 2019:mrm.27771-mrm.27771.
8. Johnson et al. Conditional generative adversarial network for 3D rigid‐body motion correction in MRI. Magnetic Resonance in Medicine 2019:mrm.27772-mrm.27772.
9. Küstner et al. Retrospective correction of motion-affected MR images using deep learning frameworks. Magnetic Resonance in Medicine 2019;82(4):1527-1540.
10. Lee et al. MC2-Net: motion correction network for multi-contrast brain MRI. Magnetic Resonance in Medicine 2021;86(2):1077-1092.
11. Levac et al. Accelerated Motion Correction for MRI using Score-Based Generative Models. arXiv 2022(2211.00199).
12. Sommer et al. Correction of Motion Artifacts Using a Multiscale Fully Convolutional Neural Network. AJNR American journal of neuroradiology 2020;41(3):416-423.
13. Usman et al. Motion Corrected Multishot MRI Reconstruction Using Generative Networks with Sensitivity Encoding. 2019.
14. Tamada et al. Motion Artifact Reduction Using a Convolutional Neural Network for Dynamic Contrast Enhanced MR Imaging of the Liver. Magn Reson Med Sci 2020;19(1):64-76.
15. Fantini et al. Automatic detection of motion artifacts on MRI using Deep CNN. 2018. p 1-4.
16. Lorch et al. Automated Detection of Motion Artefacts in MR Imaging Using Decision Forests. Journal of Medical Engineering 2017;2017:1-9.
17. Romero Castro et al. Detection of MRI artifacts produced by intrinsic heart motion using a saliency model. In: Brieva J, García JD, Lepore N, Romero E, editors; 2017. SPIE. p 51.
18. Meding et al. Automatic detection of motion artifacts in MR images using CNNs. 2017. p 811-815.
19. Iglesias et al. Retrospective Head Motion Estimation in Structural Brain MRI with 3D CNNs. Springer, Cham; 2017. p 314-322.
20. Küstner et al. Automated reference-free detection of motion artifacts in magnetic resonance images. Magnetic Resonance Materials in Physics, Biology and Medicine 2018;31(2):243-256.
21. Oksuz et al. Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning. Medical Image Analysis 2019;55:136-147.
22. Jang et al. Automated MRI k-space Motion Artifact Detection in Segmented Multi-Slice Sequences. 2022. p 5015.
23. Haskell et al. TArgeted Motion Estimation and Reduction (TAMER): Data Consistency Based Motion Mitigation for MRI Using a Reduced Model Joint Optimization. IEEE Trans Med Imaging 2018;37(5):1253-1265.
24. Oh et al. Unpaired MR Motion Artifact Deep Learning Using Outlier-Rejecting Bootstrap Aggregation. IEEE Trans Med Imaging 2021;40(11):3125-3139.
25. Chen et al. RegionViT: Regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689 2021.
26. Cao et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537 2021.

Figures

Fig. 1: Proposed self-supervised motion artifact detection network: Combination of Region-ViT and Swin UNet. A contrastive self-supervised learning is performed on positive (affine motion, cropping, flipping) and negative pairs using contrastive and L1-similarity loss. The learned feature embeddings of the bottleneck are subsequently fine-tuned in a supervised manner (frozen encoder) on a few manually labeled cases using binary cross-entropy loss.

Fig. 2: Three exemplary subjects of the German National Cohort (NAKO) exhibiting respiratory motion artifacts (top row), peristaltic motion artifacts (middle row) and no motion artifacts (bottom row). Experienced readers manually labeled the ground-truth motion masks. Motion artifact probabilities are overlaid on the image. The proposed self-supervised ViT-UNet shows a more consistent motion estimation than the purely supervised models.

Fig. 3: Three exemplary subjects of the UK Biobank (UKB) exhibiting respiratory motion artifacts (top and middle row) and no motion artifacts (bottom row). Experienced readers manually labeled the ground-truth motion masks. Motion artifact probabilities are overlaid on the image. The proposed self-supervised ViT-UNet shows a more consistent motion estimation than the purely supervised models.

Fig. 4: Motion artifact estimations of the proposed contrastive self-supervised ViT-UNet in various imaging cohorts: T2w fast-spin-echo imaging in the abdominal and pelvic region (MRMotion), T1w fast-spin-echo imaging in the brain (MRMotion), and T2w BLADE imaging in the abdomen (BLADE). The motion-affected image and the motion artifact predictions are shown, respectively for two representative cases of each cohort.

Fig. 5: Quantitative analysis in terms of accuracy, sensitivity, specificity, precision, negative predictive value (NPV) and Dice for the proposed self-supervised ViT-UNet, the supervised UNet and the supervised Swin UNet in 20 manually annotated cases uniformly drawn from NAKO, UK Biobank, MRMotion and BLADE. The proposed self-supervised ViT-UNet showed statistically significantly better performance (*: p < 0.0001).

Proc. Intl. Soc. Mag. Reson. Med. 31 (2023), Abstract 1818. DOI: https://doi.org/10.58530/2023/1818