Shangxuan Li1, Yanshu Fang2, Guangyi Wang3, Lijuan Zhang4, and Wu Zhou1
1School of Medical Information Engineering, Guangzhou University of Chinese Medicine, Guangzhou, China, 2First Clinical Medical College, Guangzhou University of Chinese Medicine, Guangzhou, China, 3Department of Radiology, Guangdong Provincial People’s Hospital, Guangzhou, China, 4Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Synopsis
Keywords: Cancer, Machine Learning/Artificial Intelligence
Multimodal medical imaging plays an important role in the diagnosis and characterization of lesions. The transformer emphasizes global relationship modeling in data and has achieved promising performance in lesion characterization. We propose a multi-modal fusion network with a jointly conditional transformer to realize adaptive fusion of multimodality information and mono-modality feature learning constrained by the conditions of the other modalities. Experimental results on a clinical hepatocellular carcinoma (HCC) dataset show that the proposed method is superior to previously reported multimodal fusion methods for HCC grading.
Introduction
Hepatocellular carcinoma (HCC) is the third most common cause of cancer death in the world. Preoperative knowledge of the pathological grade of HCC is of great significance for patient management and prognosis prediction. Contrast-enhanced MR has been demonstrated to be a promising tool for diagnosing and characterizing HCC, especially the combination of multimodal medical images for HCC grading1. Although information between or within modalities has been considered in previous research2-4, the effective combination of this information has not been fully explored. In addition, the information from other modalities should be consistent with the learning objective of each mono-modality; it can therefore serve as a condition that constrains the other modalities to focus on the corresponding positions, thereby capturing the connections between modalities. By controlling the inputs of self-attention, the transformer can readily realize both intra-modality specificity learning and inter-modality correlation learning5.
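As a brief illustration (not part of the original abstract), the same multi-head attention layer switches between intra-modality and inter-modality modeling simply by changing which modality supplies its query, key, and value; the PyTorch snippet below is a minimal sketch with hypothetical feature tensors and dimensions.

```python
import torch
import torch.nn as nn

# Hypothetical token sequences for two MR phases: (batch, tokens, embed_dim).
x_a = torch.randn(2, 49, 256)  # e.g. arterial-phase deep features
x_b = torch.randn(2, 49, 256)  # e.g. portal-venous-phase deep features

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

# Intra-modality specificity: query, key, and value all come from one modality.
intra, _ = attn(x_a, x_a, x_a)

# Inter-modality correlation: query from one modality, key/value from another.
inter, _ = attn(x_a, x_b, x_b)
```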
Therefore, we propose a multi-modal fusion network with a jointly conditional transformer (M-JCT network) for grading HCC.
Materials and Methods
This retrospective study was approved by the local institutional review board, and the requirement for informed patient consent was waived. From October 2012 to December 2018, a total of 112 patients with 117 histologically confirmed HCCs were retrospectively included in the study. Gd-DTPA-enhanced MR images were acquired for each patient with a 3.0T MR scanner. The pathological diagnosis of HCC was based on surgical specimens, comprising 54 low-grade and 63 high-grade lesions. Figure 1 shows the proposed M-JCT network, which extracts deep features from the different modalities with a parallel architecture. Each branch learns cross-modal information through the multi-head cross attention (MHCA) module (Figure 2(a)) and uses it as a constraint to guide the multi-head joint conditional attention (MHJCA) module (Figure 2(b)) in learning intra-modality information. These two kinds of information are then effectively integrated through the AFA module (Figure 2(c)). Finally, the deep features of the three modalities are concatenated, and cross-entropy loss is adopted for optimization.
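The abstract does not give the internal formulation of the MHCA, MHJCA, and AFA modules, so the following PyTorch sketch is only one possible reading of a single M-JCT branch: the query/key/value assignments, the gated fusion, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BranchSketch(nn.Module):
    """One per-modality branch: MHCA -> MHJCA -> AFA-style gated fusion (all assumed)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.mhca = nn.MultiheadAttention(dim, heads, batch_first=True)   # cross-modal attention
        self.mhjca = nn.MultiheadAttention(dim, heads, batch_first=True)  # conditional intra-modal attention
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())  # assumed adaptive fusion

    def forward(self, x_own, x_others):
        # MHCA: query from this modality, key/value from the other modalities.
        cross, _ = self.mhca(x_own, x_others, x_others)
        # MHJCA: the cross-modal output conditions intra-modal attention (used here as the
        # query), constraining which positions of this modality are emphasized.
        intra, _ = self.mhjca(cross, x_own, x_own)
        # AFA: adaptively weight and merge the two kinds of information, then pool over tokens.
        g = self.gate(torch.cat([cross, intra], dim=-1))
        return (g * intra + (1 - g) * cross).mean(dim=1)

class MJCTSketch(nn.Module):
    """Three parallel branches; concatenated features are classified with cross-entropy."""
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.branches = nn.ModuleList([BranchSketch(dim) for _ in range(3)])
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, feats):  # feats: list of three (batch, tokens, dim) tensors
        outs = []
        for i, branch in enumerate(self.branches):
            others = torch.cat([feats[j] for j in range(3) if j != i], dim=1)
            outs.append(branch(feats[i], others))
        return self.classifier(torch.cat(outs, dim=-1))

model = MJCTSketch()
logits = model([torch.randn(2, 49, 256) for _ in range(3)])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))  # low-grade vs high-grade
```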
Training and testing were repeated five times to reduce measurement error, and the accuracy, sensitivity, specificity, and area under the curve (AUC) were averaged over the runs.
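For reference only, a minimal scikit-learn/numpy sketch of computing these metrics per run and averaging over the five repetitions (the labels and scores below are random placeholders, not study data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC for one train/test run."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on the high-grade (positive) class
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_true, y_score),
    }

# Placeholder predictions standing in for the five repeated train/test runs.
rng = np.random.default_rng(0)
runs = [evaluate(rng.integers(0, 2, 30), rng.random(30)) for _ in range(5)]
averaged = {name: float(np.mean([r[name] for r in runs])) for name in runs[0]}
```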
Results
Table 1 shows the performance comparison of different fusion methods for HCC grading on contrast-enhanced MR. The performance of the proposed M-JCT network is significantly better than that of the other fusion methods. In addition, compared with simple concatenation6 or deeply supervised fusion7 of deep features from different phases, analyzing the relationships within3,4 or between2-4 modalities can better guide feature fusion. According to the ablation results in Table 2 and the visualized spatial feature distributions in Figure 3, the proposed method uses the multi-head attention structure, with its stronger long-range modeling ability, to transfer inter-modality information and to learn intra-modality features, and it fuses these features adaptively and effectively. In addition, the joint conditional attention module can use information from the other modalities to constrain intra-modality feature learning and promote the learning of important intra-modality features.
Discussion
Our research shows that, compared with matrix decomposition-based methods2,3, using the attention mechanism4 or our transformer, with its more powerful long-distance modeling and flexible multimodal input mode, can effectively learn the characteristics within and between modalities. Evidently, the proposed M-JCT network is superior to the previously reported multimodality fusion methods for lesion characterization. Our results also show that effective fusion of intra- and inter-modality features yields better performance, indicating that the two kinds of features need to be treated separately and given due importance. Furthermore, inter-modality information is used as a conditional constraint to guide intra-modality feature learning and further strengthen the discriminability of intra-modality features.
Conclusion
In this work, we proposed a multi-modal fusion network with a jointly conditional transformer (M-JCT network) for grading hepatocellular carcinoma, which improves the performance of lesion characterization and outperforms previously reported multimodal fusion methods. The method and findings of this study may be beneficial for multimodal fusion in lesion characterization.
Acknowledgements
This research was supported by a grant from the National Natural Science Foundation of China (NSFC: 81771920).
References
1. Bi WL, Hosny A, Schabath MB, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin. 2019;69(2):127-157.
2. Hussein S, Kandel P, Corral JE, et al. Deep multi-modal classification of intraductal papillary mucinous neoplasms (IPMN) with canonical correlation analysis. In: 2018 IEEE 15th International Symposium on Biomedical Imaging. 2018:800-804.
3. Dou T, Zhang L, Zheng H, et al. Local and non-local deep feature fusion for malignancy characterization of hepatocellular carcinoma. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham; 2018.
4. Li S, Xie Y, Wang G, et al. Attention guided discriminative feature learning and adaptive fusion for grading hepatocellular carcinoma with contrast-enhanced MR. Computerized Medical Imaging and Graphics. 2022;97:102050.
5. Xu P, Zhu X, Clifton DA. Multimodal learning with transformers: a survey. arXiv preprint arXiv:2206.06488; 2022.
6. Dou T, Zhang L, Zhou W. 3D deep feature fusion in contrast-enhanced MR for malignancy characterization of hepatocellular carcinoma. In: 2018 IEEE 15th International Symposium on Biomedical Imaging. 2018:29-33.
7. Zhou W, Wang G, Xie G, et al. Grading of hepatocellular carcinoma based on diffusion-weighted images with multiple b-values using convolutional neural networks. Med Phys. 2019;46(9):3951-3960.