1139

Neighborhood-attention models for incorporating spatial information in deep learning parameter estimation applied to IVIM

Misha Pieter Thijs Kaandorp^1,2,3, Frank Zijlstra^1,2, Davood Karimi³, Ali Gholipour³, and Peter Thomas While^1,2
¹Department of Radiology and Nuclear Medicine, St. Olav’s University Hospital, Trondheim, Norway, ²Department of Circulation and Medical Imaging, NTNU – Norwegian University of Science and Technology, Trondheim, Norway, ³Department of Radiology, Boston Children’s Hospital, Harvard Medical School, Boston, MA, United States

Synopsis

Keywords: Analysis/Processing, Signal Representations, AI, transformers, synthetic data, parameter estimation, IVIM

Motivation: Conventional model-fitting approaches neglect spatial information. Recent work showed promise in using convolutional neural networks (CNNs) trained on spatially-correlated synthetic data. However, the convergence rate remained suboptimal, and the spatial extent was limited.

Goal(s): To improve estimator performance by utilizing transformer networks and training on larger receptive-fields.

Approach: Transformers with self-attention and neighborhood-attention with increased receptive-field were trained on spatially-correlated synthetic data (IVIM), and evaluated quantitatively using novel fractal-noise maps and in-vivo scans.

Results: Transformers excelled in integrating spatial information over CNNs. The application of larger receptive-fields with neighborhood-attention effectively leveraged correlated signal information from nearby voxels, leading to improved estimator performance.

Impact: The improved parameter estimation from neighborhood-attention models trained on synthetic data brings challenging ill-posed signal analysis problems, like IVIM, closer to clinical implementation. Additionally, the novel fractal-noise maps provide spatially-correlated ground truths, permitting new approaches to quantitative medical image analysis.

Introduction

The tissue microenvironment typically exhibits local homogeneity, where properties such as diffusion and perfusion do not change randomly between adjacent voxels¹. Leveraging potential correlations between relevant signals in neighboring voxels should therefore enhance model-parameter fitting. However, estimators like least squares (LSQ) and voxelwise neural networks (NNs)^2,3 do not incorporate such spatial information⁴. Our recent work considered convolutional NNs (CNNs)⁵ with receptive-field 3x3 for incorporating spatial information by training on synthetic data encompassing all plausible combinations of direct correlations. This approach was applied to the intravoxel incoherent motion (IVIM)⁶ model for diffusion-weighted imaging (DWI), and yielded promising results with apparently reduced noise sensitivity, akin to signal averaging. However, the convergence rate was suboptimal, and the spatial extent was limited.

In this work, we address these limitations by replacing CNNs with transformer NNs (attention models)⁷. Recently, transformers have garnered substantial interest in computer vision⁸. Conventional transformers rely on self-attention⁷ (SA), yet their memory usage scales quadratically with the number of pixels, making them less feasible for computer vision tasks. Neighborhood-attention⁹ (NA) has been proposed as an alternative, which localizes SA to the nearest neighbors, effectively reducing memory consumption. Here we explore whether training with larger receptive-fields can further improve performance, by investigating SA and NA.

Methods

In this work, we considered two types of transformers. The first utilized SA (3 SA-blocks, 128 units) and the second NA (128 units). We trained networks with increasing receptive-field, where SA-X and NA-X denote a receptive-field of X×X voxels. Expanding the receptive-field beyond 7×7 caused excessive memory usage in SA-networks. Therefore, for receptive-fields of 7×7 and larger, NA was employed. Each NA-X network had a varying number of NA-blocks of kernel size 3 (e.g. NA-7 had 3 NA-blocks). We trained a CNN and voxelwise network for comparison. Both CNN and voxelwise networks were multi-layer perceptrons, where the CNN had a 2D-convolution in its first layer with kernel size 3. The network inputs were patches of DWI signals, outputs were IVIM parameters (D: diffusion, D*: pseudo-diffusion, f: perfusion fraction) and S₀.

In our first experiment, SA-3, CNN and voxel-wise networks were trained supervised on DWI signals simulated by uniformly sampling 3×3 patches of parameters: 0≤S₀≤1, 0×10^-3≤D≤3×10^-3 mm²/s, 0≤f≤50%, and 3×10^-3≤D*≤100×10^-3 mm²/s, considering 16 b values. Each patch was generated such that a random number of neighbors correlated to its center (neighbors-random, Fig. 1A), meaning that those neighbors shared identical parameters. Rician noise was added to the signals such that S₀=1 equated to SNR=200. Convergence rate and final loss were evaluated on the test set consisting of 100,000 patch-wise sets of DWI signals. As proxies for the optimal expected performance of a CNN, an additional set of CNNs was trained specifically for each number of correlated neighbors (neighbors-N, Fig. 1C).

In a second experiment, we trained transformers on larger patch sizes up to 17×17. We conducted a quantitative assessment by testing the networks on synthetic data (128×128 pixels; 16 b values) corresponding to 40 sets of fractal-noise¹⁰ parameter maps (Fig. 2). These maps provided spatially-correlated ground truths, considering IVIM parameter ranges relevant to the brain. Evaluation involved assessing parameter maps, median absolute percentage error (MDAPE), median percentage bias (MDPE), mean absolute percentage error (MAPE), and comparison to LSQ. Qualitative assessment was conducted on an IVIM scan from a volunteer.

Results

Transformers outperformed CNNs for incorporating spatial information, exhibiting superior convergence speed and reduced loss (Fig. 1B), and matching or surpassing the CNNs trained for each specific number of correlated neighbors (Fig. 1C). Fig. 3 further shows that training with larger receptive-fields resulted in improved accuracy and reduced inherent supervised bias4 in low SNR regions, while retaining edge-like structures, particularly for D*. Utilizing NA allows training with larger receptive-fields than SA, leading to improved performance. Fig. 4 qualitatively shows similar findings in vivo.

Discussion and conclusion

In this work, we showed that transformers outperform CNNs for incorporating spatial information in IVIM parameter estimation when trained supervised on spatially-correlated synthetic data. Quantitative evaluation on novel fractal-noise maps revealed improved performance when trained with larger receptive-fields. SA was limited to smaller receptive-fields due to excessive memory demands, whereas NA could handle larger receptive-fields and therefore may be preferable.

Our proposed method brings challenging ill-posed signal analysis problems, like IVIM, closer to clinical application. Furthermore, the novel fractal-noise maps offer spatially-correlated ground truths, which may permit avenues for research within quantitative medical image analysis. Future research should explore whether including additional prior assumptions about the test data into the network's training process, such as representative parameter distributions for tissue types of interest, may further enhance performance.

Acknowledgements

This work was supported by the Research Council of Norway (FRIPRO Researcher Project 302624).

References

1. Novikov DS, Fieremans E, Jespersen SN, Kiselev VG. Quantifying brain microstructure with diffusion MRI: Theory and parameter estimation. NMR Biomed. 2019;32(4):1-53. doi:10.1002/nbm.3998

2. Kaandorp MPT, Barbieri S, Klaassen R, et al. Improved unsupervised physics-informed deep learning for intravoxel incoherent motion modeling and evaluation in pancreatic cancer patients. Magn Reson Med. 2021;86(4):2250-2265. doi:https://dx.doi.org/10.1002/mrm.28852

3. Barbieri S, Gurney-Champion OJ, Klaassen R, Thoeny HC. Deep learning how to fit an intravoxel incoherent motion model to diffusion-weighted MRI. Magn Reson Med. 2020;83(1):312-321. doi:10.1002/mrm.27910

4. Kaandorp MPT, Zijlstra F, Federau C, While PT. Deep learning intravoxel incoherent motion modeling: Exploring the impact of training features and learning strategies. Magn Reson Med. 2023;(March):312-328. doi:10.1002/mrm.29628

5. Kaandorp M, Zijlstra F, While P. Synthetic data shows the potential of unsupervised and supervised learning for incorporating spatial information in IVIM fitting. In: Proceedings of Int. Soc. of Magnetic Resonance Imaging. ; 2023. doi:10.1002/mrm.27910

6. Le Bihan D, Breton E, Lallemand D, Grenier P, Cabanis E, Laval-Jeantet M. MR imaging of intravoxel incoherent motions: application to diffusion and perfusion in neurologic disorders. Radiology. 1986;(161(2), 401-407).

7. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;2017-Decem(Nips):5999-6009.

8. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M. Transformers in Vision: A Survey. ACM Comput Surv. 2022;54(10):1-30. doi:10.1145/3505244

9. Hassani A, Walton S, Li J, Li S, Shi H. Neighborhood attention transformer. Proc IEEE/CVF Conf Comput Vis Pattern Recognit. Published online 2023:6185-6194. http://arxiv.org/abs/2209.15001

10. Perlin K. Image Synthesizer. Comput Graph. 1985;19(3):287-296. doi:10.1145/325165.325247

Figures

Fig. 1. (A) Example neighbors-random patches. (B) Test loss curves for the voxelwise, CNN and transformer (SA-3) networks trained on neighbors-random over 10,000 epochs. The transformer network outperformed the CNN in terms of convergence speed and accuracy. (C) Performance of the networks trained on neighbors-random and evaluated on neighbors-N, compared with CNNs trained for each specific number of correlated neighbors (yellow).

Fig. 2. Example of parameter map generation using fractal noise (shown for D*). One fractal-noise map provides masks representing the spatial extent of each tissue type. One additional fractal-noise map per tissue type is scaled to appropriate parameter values. The masks are applied to the corresponding scaled fractal-noise maps and combined, wherefrom the D* map is generated. While in vivo assessment only offers qualitative value, the fractal-noise parameter maps provide ground truths that share spatial information, enabling a quantitative evaluation of our proposed methods.

Fig 3. Example IVIM fractal-noise parameter maps, RMSE maps (calculated between the predicted DWI signal and the ground truth DWI signal), and parameter error maps, estimated for the transformer networks (SA-X or NA-X, where X denotes the receptive-field) trained on neighbors-random. Note that transformer network ‘SA-1’ is essentially a voxelwise equivalent. Corresponding maps of the ground truth (top left) and least squares are also shown.

Fig 4. Boxplots displaying the median absolute percentage error (MDAPE), median percentage error (MDPE), and mean absolute percentage error (MAPE) calculated on the fractal-noise test set for the transformer networks (SA-X or NA-X, where X denotes the receptive-field) trained on neighbors-random. These error metrics were evaluated over the last 20 epochs of training to illustrate variability. The blue values printed above the plots are the median values for SA-1, in cases where these exceeded the ranges displayed.

Fig 5. IVIM parameter maps and RMSE maps (calculated between the predicted DWI signal and the measured DWI signal) for a representative slice from the volunteer scan (acquired in accordance to institutional guidelines), estimated by the transformer networks (SA-X or NA-X, where X denotes the receptive-field) trained on neighbors-random. Also shown is the b=0 image (top left), along with corresponding maps for the least squares.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

1139

DOI: https://doi.org/10.58530/2024/1139