2129

Evaluation of the fairness and effectiveness of nnU-Net on multi-organ segmentation

Qing Li¹, Yan Li², Longyu Sun¹, Mengting Sun¹, Meng Liu¹, Xumei Hu¹, Xinyu Zhang¹, Xueqin Xia³, Shuo Wang⁴, Yinghua Chu⁵, and Chengyan Wang¹
¹Human Phenome Institute, Fudan University, Shanghai, China, ²Department of Radiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China, ³Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China, ⁴Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai, China, ⁵Siemens Healthineers Ltd, Shanghai, China

Synopsis

Keywords: Visualization, Fairness, Bias

Motivation: A systematic analysis of how segmentation effectiveness varies across subgroups (fairness) can help improve artificial intelligence (AI) models, but such an analysis has not been reported before.

Goal(s): This study aims to quantify the relation between segmentation effectiveness and age, gender, and anatomical region.

Approach: The nnU-Net model was used for organ segmentation. The Dice coefficient (DICE) was computed to evaluate the relation of effectiveness with age and gender, and heat maps were used to visualize the spatial distribution of errors across anatomical regions.

Results: The results demonstrate variations in nnU-Net's effectiveness across subgroups, highlighting the significance of attention mechanisms for improving segmentation models.

Impact: This study comprehensively evaluated the fairness and effectiveness of nnU-Net across multiple organs within the body, analyzing the relationship between segmentation errors and age, gender, and anatomical region.

Introduction

While nnU-Net is currently one of the most widely used models in medical image segmentation [1], comprehensive reports on its fairness [2, 3] and effectiveness, based on systematic evaluation across multiple organs within the body, are lacking. This study aimed to assess the relation between DICE, age, and gender. In addition, the spatial distribution of segmentation error for each organ was analyzed through heat maps to evaluate the relation between segmentation effectiveness and anatomical region.

Methods

Image acquisition
A total of 800 healthy volunteers, aged 20 to 60 years, were included in the study from April 2020 to August 2023. The study received approval from our local institutional review board. MRI data were obtained using a 3.0T scanner (MAGNETOM Vida, Siemens Healthineers, Germany) with the following parameters: TR/TE = 4000/82 ms, FOV = 256 × 256 mm², in-plane voxel size = 1.0 × 1.0 mm², slice thickness = 2.0 mm, number of slices = 30, flip angle = 90°, and number of shots = 8, with each shot including 32 k-space lines.
Segmentation model
Manual segmentation (labels) of 24 organs and vessels was performed by two radiologists with over 5 years of experience using ITK-SNAP (version 3.8.0). The nnU-Net used in this study was trained on a set of 300 samples (148 males and 152 females). To ensure fairness in the model training stage, the number of samples in each age group (20 to 60 years, in steps of 10) was kept as balanced as possible within the male and female subgroups. The remaining 500 samples were used as the test set for the subsequent analyses.
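The balancing of age groups within each gender can be illustrated with a minimal sketch. It is not the authors' selection procedure: the function names `age_bin` and `balanced_split`, the dictionary fields, and the round-robin draw are all assumptions, and the cohort below is synthetic.

```python
import random
from collections import defaultdict

def age_bin(age, start=20, step=10, stop=60):
    """Index of the age group; an age equal to `stop` falls in the last group."""
    return max((min(age, stop - 1) - start) // step, 0)

def balanced_split(subjects, n_train, seed=0):
    """Draw n_train subjects round-robin from (sex, age group) strata.

    Hypothetical helper: each subject is a dict with 'id', 'sex' and 'age'.
    Subjects that are not drawn form the test set.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[(s["sex"], age_bin(s["age"]))].append(s)
    for members in strata.values():
        rng.shuffle(members)

    train = []
    keys = sorted(strata)
    while len(train) < n_train:
        progressed = False
        for key in keys:
            if strata[key] and len(train) < n_train:
                train.append(strata[key].pop())
                progressed = True
        if not progressed:  # every stratum is exhausted
            break
    test = [s for members in strata.values() for s in members]
    return train, test

# Synthetic cohort of 800 subjects aged 20 to 60 (illustrative only)
cohort = [
    {"id": i, "sex": "M" if i % 2 == 0 else "F", "age": 20 + (i // 2) % 41}
    for i in range(800)
]
train, test = balanced_split(cohort, n_train=300)
```

Drawing one subject per stratum in turn keeps the stratum counts within one of each other as long as no stratum runs out, which is one simple reading of "as balanced as possible".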
Spatial error distribution for segmentation
The flow chart of this study is shown in Figure 1. The predictions and labels for each organ, stored in NII format, were converted to STL meshes and then to point clouds. To analyze the spatial characteristics of the error within a given group, the error computed for each point cloud was normalized ("Average for distance" in Figure 1): the point clouds were registered, and K-means [4] was used to compute a set of center points and the average error at each of them. The normalized error from predictions to labels was then visualized as a heat map. The usability of the conversion process was also validated: the source NII data for each sample were converted to STL twice, and the distance between the two converted STL meshes was normalized and visualized in the same way.
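The averaging step can be sketched as follows, under stated assumptions. The point clouds are assumed to be already registered, the per-point error is taken to be the nearest-neighbour distance from a predicted point to the label point cloud, and a plain Lloyd's K-means stands in for the clustering; `nearest_distance` and `kmeans_mean_error` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def nearest_distance(pred_pts, label_pts):
    """Distance from each predicted surface point to its closest label point."""
    diff = pred_pts[:, None, :] - label_pts[None, :, :]   # shape (N, M, 3)
    return np.linalg.norm(diff, axis=2).min(axis=1)       # shape (N,)

def kmeans_mean_error(points, errors, k, n_iter=20, seed=0):
    """Lloyd's K-means on pooled points.

    Returns the k centers and the mean error of the points assigned to each
    center (NaN for a center that ends up with no points).
    """
    points = np.asarray(points, dtype=float)
    errors = np.asarray(errors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(c):
        dist = np.linalg.norm(points[:, None, :] - c[None, :, :], axis=2)
        return dist.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)

    labels = assign(centers)
    mean_err = np.array([
        errors[labels == j].mean() if np.any(labels == j) else np.nan
        for j in range(k)
    ])
    return centers, mean_err

# Synthetic stand-in for registered point clouds from several subjects
rng = np.random.default_rng(0)
pooled_pts, pooled_err = [], []
for _ in range(5):
    label_pts = rng.normal(size=(200, 3))
    label_pts /= np.linalg.norm(label_pts, axis=1, keepdims=True)  # unit sphere
    pred_pts = label_pts * (1.0 + 0.05 * rng.random((200, 1)))     # slightly inflated
    pooled_pts.append(pred_pts)
    pooled_err.append(nearest_distance(pred_pts, label_pts))

centers, mean_err = kmeans_mean_error(
    np.vstack(pooled_pts), np.concatenate(pooled_err), k=16
)
```

Each center then carries one averaged error value, which can be mapped to a color scale to produce a group-level heat map. The brute-force distance computation is only practical for small point clouds; a spatial index would be the usual choice for dense meshes.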
Segmentation error with aging and gender
The Dice coefficient (DICE) was computed between the prediction and the label for each organ of each sample. Within each of the two gender subgroups, the Pearson correlation [5] and P-values were calculated between age and the DICE of each organ. In addition, the differences in DICE and in the correlation between DICE and age among subgroups were analyzed.
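A minimal sketch of these two computations, assuming binary masks per organ and using `scipy.stats.pearsonr` for the correlation and its P-value. The function names `dice` and `age_dice_correlation`, the handling of two empty masks, and the example values are assumptions made for illustration, not results from this study.

```python
import numpy as np
from scipy.stats import pearsonr

def dice(pred, label):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    denom = pred.sum() + label.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(pred, label).sum() / denom)

def age_dice_correlation(ages, dices, sexes):
    """Pearson r and P-value between age and DICE within each gender subgroup."""
    ages, dices, sexes = (np.asarray(x) for x in (ages, dices, sexes))
    result = {}
    for sex in np.unique(sexes):
        mask = sexes == sex
        r, p = pearsonr(ages[mask], dices[mask])
        result[str(sex)] = (float(r), float(p))
    return result

# Toy masks: 8 predicted voxels, all inside a 12-voxel label
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:3] = True
label = np.zeros((4, 4, 4), dtype=bool)
label[1:3, 1:3, 1:4] = True
d = dice(pred, label)

# Made-up per-subject values for one organ (illustrative only)
ages = [20, 30, 40, 50, 20, 30, 40, 50]
dices = [0.90, 0.92, 0.93, 0.96, 0.95, 0.93, 0.92, 0.90]
sexes = ["M"] * 4 + ["F"] * 4
by_sex = age_dice_correlation(ages, dices, sexes)
```

Running the correlation once per organ and per subgroup yields the table of coefficients that a summary such as Figure 4(b) would be built from.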

Results

Figure 2 shows the effectiveness of the data conversion method, which maintains high consistency with the source data and thereby supports the subsequent analysis. Figure 3 displays the spatial distribution of error for 9 organs, indicating that the maximum error tends to occur at the edges. In Figure 4(a), a significant correlation between DICE and age is observed for organs such as the pancreas and Sa_ES (right ventricle). The LSA also shows a significant difference in this correlation between genders. In Figure 4(b), the correlation between DICE and age differs significantly across organs. For example, nnU-Net performs better for younger individuals in pancreas segmentation, but the trend is reversed in the AAO (ascending aorta). However, the segmentation performance remains consistent for most organs in the two subgroups. Figure 5(a) presents a maximum DICE of 0.99 for the kidney cortex and a minimum DICE of 0.87 for the LCCA (left common carotid artery) in males; in females, the maximum DICE is 0.99 for the liver and the minimum is 0.86 for the LCCA. The order and values of DICE for each organ are similar in the two subgroups. Figure 5(b) shows that the greatest difference in segmentation performance between males and females occurs in the pancreas for ages 45 to 55. However, the effectiveness of nnU-Net remains consistent for most organs across gender groups.

Discussion and Conclusion

This study analyzes segmentation effectiveness from 2 aspects: the spatial distribution of the segmentation error of nnU-Net, which shows regular patterns, and the relation of that error to age and gender as a measure of fairness. These findings may provide valuable insights for researchers seeking to further improve segmentation models.

Acknowledgements

This study was supported in part by the National Natural Science Foundation of China (No. 62001120, 62331021), The Royal Society (IEC\NSFC\211235) and the Shanghai Sailing Program (No. 20YF1402400, 22YF1409300). Correspondence should be addressed to Prof. Chengyan Wang (Email: wangcy@fudan.edu.cn).

References

[1] Isensee F, Jaeger P F, Kohl S A A, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation[J]. Nature Methods, 2021, 18(2): 203-211.

[2] Mehrabi N, Morstatter F, Saxena N, et al. A survey on bias and fairness in machine learning[J]. ACM Computing Surveys (CSUR), 2021, 54(6): 1-35.

[3] Á. A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern and D. H. Chau, "FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning," 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), Vancouver, BC, Canada, 2019, pp. 46-56, doi: 10.1109/VAST47406.2019.8986948.

[4] Shi B Q, Liang J, Liu Q. Adaptive simplification of point cloud using k-means clustering[J]. Computer-Aided Design, 2011, 43(8): 910-922.

[5] Sedgwick P. Pearson’s correlation coefficient[J]. BMJ, 2012, 345.

Figures

Figure 1 Flowchart of this study.

Figure 2 Results of the validation of data conversion. BCT = brachiocephalic arterial trunk; LSA = left subclavian artery. The error across anatomical regions is shown on a color gradient, with pure red indicating the maximum error introduced by the conversion process.

Figure 3 Spatial distribution of segmentation errors for 9 representative organs. SV = splenic vein; AO = aorta; the BCT and LSA are parts of the AO. The error across anatomical regions is shown on a color gradient, with pure red indicating the maximum error produced by the segmentation.

Figure 4 Correlation between the DICE coefficient of segmentation and age. (a) Results for 4 representative organs and vessels: Sa_ES (right ventricle), IVC (inferior vena cava), pancreas, and LSA; (b) summary of all organs in the gender subgroups, where the value for an organ in a subgroup is represented by the 95% quantile of all values for that organ in that subgroup.

Figure 5 Differences in DICE and correlation between the 2 subgroups. (a) The order of DICE for each organ in the different subgroups, where the value for an organ in a subgroup is represented by the 95% quantile of all values for that organ in that subgroup. (b) The difference in DICE for each organ among age and gender subgroups, where age ranges from 20 to 60 in steps of 5, each value in the heat map represents the comparison between males and females in a given age step, and blue indicates a higher value in males.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

DOI: https://doi.org/10.58530/2024/2129