Qing Li1, Yan Li2, Longyu Sun1, Mengting Sun1, Meng Liu1, Xumei Hu1, Xinyu Zhang1, Xueqin Xia3, Shuo Wang4, Yinghua Chu5, and Chengyan Wang1
1Human Phenome Institute, Fudan University, Shanghai, China, 2Department of Radiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China, 3Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China, 4Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai, China, 5Siemens Healthineers Ltd, Shanghai, China
Synopsis
Keywords: Visualization, Fairness; Bias
Motivation: A systematic analysis of segmentation fairness can help improve artificial intelligence (AI) models, but such an analysis has not been done before.
Goal(s): This study aims to quantify the relation between segmentation effectiveness and age, gender, and anatomical region.
Approach: The nnU-Net model was used for organ segmentation. The Dice coefficient (DICE) was computed to evaluate how segmentation effectiveness relates to age and gender, and a heat map was used to visualize the spatial error distribution across anatomical regions.
Results: The results demonstrate variations in nnU-Net's effectiveness across subgroups, highlighting the significance of attention mechanisms for segmentation model enhancement.
Impact: This study comprehensively evaluated the fairness and effectiveness of nnU-Net across multiple organs within the body, analyzing the relationship between segmentation errors and age, gender, and anatomical region.
Introduction
While nnU-Net is currently one of the most widely used models in medical image segmentation [1], there is a lack of comprehensive reports on its fairness [2, 3] and effectiveness in systematic evaluations across multiple organs within the body. This study aimed to assess the relation among DICE, age, and gender. Additionally, heat maps of the spatial error for each organ were used to evaluate the relation between segmentation effectiveness and anatomical regions.
Methods
Image acquisition
A total of 800 healthy volunteers, aged 20 to 60, were included in the study from April 2020 to August 2023. The study received approval from our local institutional review board. MRI data were obtained using a 3.0T scanner (MAGNETOM Vida, Siemens Healthineers, Germany) with the parameters: TR/TE = 4000/82 ms, FOV = 256 × 256 mm, voxel size = 1.0 × 1.0 mm, slice thickness = 2.0 mm, number of slices = 30, flip angle = 90°, and number of shots = 8, with each shot including 32 k-space lines.
Segmentation model
Manual segmentation (labeling) of 24 organs and vessels was performed by two radiologists, each with over 5 years of experience, using ITK-SNAP (version 3.8.0). The nnU-Net used in this study was pretrained on a training set of 300 samples (148 males and 152 females). To ensure fairness in the model training stage, the number of samples in each age group (20 to 60, in steps of 10) was kept as balanced as possible within the male and female subgroups. The remaining 500 samples were used as the test set for the subsequent analyses.
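The balancing of the training set over age groups and gender could be sketched as follows. This is only an illustration under stated assumptions, not the selection procedure actually used in the study: the record fields (`id`, `sex`, `age`), the helper names `age_bin` and `balanced_training_split`, and the per-cell quota are all hypothetical.

```python
import random
from collections import defaultdict


def age_bin(age, start=20, stop=60, step=10):
    """Map an age to its decade bin, e.g. 37 -> (30, 40).

    The upper bound (age 60) is placed in the last bin, (50, 60).
    """
    if not start <= age <= stop:
        raise ValueError(f"age {age} outside [{start}, {stop}]")
    lower = min(start + ((age - start) // step) * step, stop - step)
    return (lower, lower + step)


def balanced_training_split(subjects, per_cell, seed=0):
    """Draw up to `per_cell` subjects from every (sex, age bin) cell.

    `subjects` is a list of dicts with hypothetical keys 'id', 'sex', 'age'.
    Returns (train, test); the test set is everything not drawn for training.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for subject in subjects:
        cells[(subject["sex"], age_bin(subject["age"]))].append(subject)

    train = []
    for key in sorted(cells):          # fixed order keeps the draw reproducible
        members = cells[key][:]
        rng.shuffle(members)
        train.extend(members[:per_cell])

    train_ids = {s["id"] for s in train}
    test = [s for s in subjects if s["id"] not in train_ids]
    return train, test
```

With four age bins and two genders there are eight cells, so a training set of about 300 samples corresponds to roughly 37 or 38 subjects per cell.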
Spatial error distribution for segmentation
The flow chart of this study is shown in Figure 1. The predictions and labels for each organ, stored in NII format, are converted to STL meshes and then to point clouds. To analyze the spatial error within a specific group, the error computed for each point cloud is normalized ("Average for distance" in Figure 1): the point clouds are registered, and K-means [4] is used to compute a set of center points and the average error at each center. The normalized error from predictions to labels is then visualized as a heat map. In addition, the usability of the conversion process is validated: the source NII data for each sample are converted to STL twice, and the distance between the two converted STL meshes is normalized and visualized in the same way.
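The error computation and group averaging described above can be sketched in plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the error is taken to be the distance from each predicted surface point to its nearest label point, the point clouds are assumed to be already registered, and the center points are assumed to be given (for example, from K-means). Only the nearest-center assignment step is shown here, not the full K-means iteration, and all function names are hypothetical.

```python
import math


def surface_errors(pred_cloud, label_cloud):
    """Distance from each predicted point to its nearest label point."""
    return [min(math.dist(p, q) for q in label_cloud) for p in pred_cloud]


def average_error_at_centers(clouds, errors, centers):
    """Average per-point errors over a group at shared center points.

    clouds  : list of registered point clouds, one per subject
    errors  : list of per-point error lists matching `clouds`
    centers : center points shared by the group (assumed given)
    Each point contributes its error to the nearest center.
    """
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for cloud, errs in zip(clouds, errors):
        for point, err in zip(cloud, errs):
            k = min(range(len(centers)),
                    key=lambda i: math.dist(point, centers[i]))
            sums[k] += err
            counts[k] += 1
    # Centers that received no points are marked as NaN rather than zero.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Tiny usage example with two identical "subjects".
label = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
errs = surface_errors(pred, label)
group_error = average_error_at_centers([pred, pred], [errs, errs], label)
```

The resulting per-center averages are the values that would be color-mapped onto the organ surface to form the heat map. The brute-force nearest-neighbor search is quadratic in the number of points; a spatial index would be preferable for full-resolution meshes.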
Segmentation error with aging and gender
The predictions and labels for each organ of each sample are acquired and the DICE is computed. For each of the two gender subgroups, the Pearson correlation [5] and P-values between age and the DICE of each organ are calculated. In addition, the differences in DICE, and in the correlation between DICE and age, among subgroups are obtained for analysis.
Results
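For reference, the DICE and Pearson computations described in the Methods can be sketched as below. This is an illustrative, dependency-free sketch rather than the code used in the study: masks are assumed to be flattened binary sequences, the function names are hypothetical, and the P-value is omitted because it requires the t-distribution (in practice a routine such as `scipy.stats.pearsonr` returns both the coefficient and the P-value).

```python
import math


def dice(pred, label):
    """Dice coefficient 2|P ∩ L| / (|P| + |L|) for flat binary masks."""
    if len(pred) != len(label):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, l in zip(pred, label) if p and l)
    total = sum(1 for p in pred if p) + sum(1 for l in label if l)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Made-up example: DICE of one organ falling linearly with age.
ages = [20, 30, 40, 50]
dice_scores = [0.95, 0.93, 0.91, 0.89]
r = pearson_r(ages, dice_scores)
```

Running `pearson_r` separately on the male and female test subjects for each organ gives the per-subgroup correlations that are compared in the analysis.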
Figure 2 shows that the data conversion method maintains high consistency with the source data, which supports the subsequent analysis. Figure 3 displays the spatial distribution of error for 9 organs, indicating that the maximum error tends to occur at the edges. In Figure 4(a), a significant correlation between DICE and age is observed for organs such as the pancreas and Sa_ES (right ventricle). LSA also shows a significant difference in correlation between genders. In Figure 4(b), the correlation between DICE and age differs significantly across organs. For example, nnU-Net performs better for younger individuals in pancreas segmentation, whereas the trend is reversed for the AAO (aorta ascendens). However, segmentation performance remains consistent for most organs in the two subgroups. Figure 5(a) shows that, in males, the maximum DICE is 0.99 (kidney cortex) and the minimum is 0.87 (LCCA, left common carotid artery); in females, the maximum DICE is 0.99 (liver) and the minimum is 0.86 (LCCA). The order and values of DICE for each organ are similar in the two subgroups. Figure 5(b) shows that the greatest difference in segmentation between males and females occurs in the pancreas for ages 45 to 55. However, the effectiveness of nnU-Net remains consistent for most organs across gender groups.
Discussion and Conclusion
This study analyzes segmentation effectiveness from two aspects: the regularity of nnU-Net's segmentation error in its spatial distribution, and the relation of that error to age and gender as a measure of fairness. These findings provide valuable insights for researchers seeking to further enhance the segmentation model.
Acknowledgements
This study was supported in part by the National Natural Science Foundation of China (No. 62001120, 62331021), The Royal Society (IEC\NSFC\211235) and the Shanghai Sailing Program (No. 20YF1402400, 22YF1409300).
Correspondence should be sent to Prof. Chengyan Wang (Email: wangcy@fudan.edu.cn).
References
[1] Isensee F, Jaeger P F, Kohl S A A, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation[J]. Nature methods, 2021, 18(2): 203-211.
[2] Mehrabi N, Morstatter F, Saxena N, et al. A survey on bias and fairness in machine learning[J]. ACM computing surveys (CSUR), 2021, 54(6): 1-35.
[3] Á. A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern and D. H. Chau, "FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning," 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), Vancouver, BC, Canada, 2019, pp. 46-56, doi: 10.1109/VAST47406.2019.8986948.
[4] Shi B Q, Liang J, Liu Q. Adaptive simplification of point cloud using k-means clustering[J]. Computer-Aided Design, 2011, 43(8): 910-922.
[5] Sedgwick P. Pearson's correlation coefficient[J]. BMJ, 2012, 345.