
Data Selection for Deep Learning via diversity visualization and scoring
Deepa Anand1, Dattesh Dayanand Shanbhag1, and Rakesh Mullick1
1Advanced Technology Group, GE Healthcare, Bangalore, India

Synopsis

Data diversity is a key ingredient for robust deep learning models, especially in the medical domain. We present a diversity visualization and quantification scheme that supports decisions on whether new data is different enough from already existing data to be worth adding. Our experiments validate the usefulness of the proposed diversity metric: using it in the data selection process improved model accuracy by 3%-10% across different sites.

Introduction

With AI-based algorithms in vogue, making a judicious choice of data for a given AI task is very important. Adding new data similar to the existing training data distribution can cause over-fitting, hamper generalizability, and affect robustness. In addition, monetary and logistical constraints may also influence which data are generated or acquired. In this work, we describe a methodology for visualizing and quantifying data diversity. We also demonstrate how this tool guides data selection and thereby impacts the performance of MR knee label classification.

Methods

Subject Data: Knee MRI data for the study came from four sites (Sites 1-4, Fig. 1). Data from both volunteers and patients were included. Data from each site were segregated into train (60%), validation (20%), and test (20%) cohorts. All studies were approved by the respective IRBs.

MRI Scanner and Acquisition: Localizer data were acquired on multiple GE 1.5T and 3T MRI scanners, with different knee coil configurations, contrasts (GRE, SSFSE), image resolutions, and matrix sizes across subjects. A total of 15,100 localizer images were included in the study.

Deep Learning based classification model: A DL-CNN classification model as described in [1] was used to assign a given knee MRI tri-planar localizer to one of five labels: Label 1: Relevant axial femur, Label 2: Relevant axial tibia, Label 3: Irrelevant, Label 4: Relevant coronal, and Label 5: Relevant sagittal (Fig. 1A). Site 1 was considered the reference site, on which a DL model M was initially trained.

Feature Generation and Visualization: Given a pool of data, we derive latent representations as the features (total = 16384) extracted from the penultimate layer of model M (Fig. 2). The reference-site features are used to construct a UMAP transform [2] that reduces the features to a lower-dimensional space (N = 2), normalized to lie between 0 and 1 per feature. For data from a new site, the DL features derived from model M undergo the same dimensionality reduction, but using the existing UMAP transform. A 2D scatter plot of the UMAP-reduced features for the reference and new-site data allows visualization of data diversity. This differs from standard tSNE [3] visualization: UMAP learns a mapping from the higher-dimensional to the lower-dimensional space on the training data and can then apply the same transform to any new test data, which is not possible with the current implementation of tSNE.
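As an illustration, the following is a minimal sketch of this step using TensorFlow/Keras and umap-learn; the model handle, image arrays, layer indexing, and min-max normalization are assumptions for the sketch, not the authors' exact code.

```python
import numpy as np
import tensorflow as tf
import umap                      # umap-learn package [2]
import matplotlib.pyplot as plt

def penultimate_features(model, images):
    """Extract flattened penultimate-layer activations (length 16384 in this work)."""
    feat_model = tf.keras.Model(inputs=model.input, outputs=model.layers[-2].output)
    return feat_model.predict(images).reshape(len(images), -1)

# `model`, `ref_images`, `new_images` are assumed to exist (trained model M,
# reference-site images, new-site images).
ref_feats = penultimate_features(model, ref_images)

# Fit UMAP on the reference-site features only, then reuse the same transform.
reducer = umap.UMAP(n_components=2).fit(ref_feats)

ref_2d = reducer.embedding_
lo, hi = ref_2d.min(axis=0), ref_2d.max(axis=0)
ref_2d = (ref_2d - lo) / (hi - lo)                      # normalize each feature to [0, 1]
new_2d = (reducer.transform(penultimate_features(model, new_images)) - lo) / (hi - lo)

# 2D scatter plot for visual assessment of diversity (as in Fig. 3).
plt.scatter(ref_2d[:, 0], ref_2d[:, 1], s=4, label="Reference (Site 1)")
plt.scatter(new_2d[:, 0], new_2d[:, 1], s=4, label="New site")
plt.legend()
plt.show()
```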

Diversity quantification: For the reference site, class labels are known. Consequently, for each label, the UMAP feature cluster centroid and the mean distance of its points from that centroid (Dref) were obtained. Similarly, for a new site, the mean distance of the UMAP features of the new-site data pool from the nearest reference cluster centroid was computed (Dnew). The diversity score for the new site is then Dnew / Dref.
Since this process is stochastic in nature, multiple runs were performed (two runs for visualization and three runs for the diversity score). All methods were implemented using TensorFlow 2.3, the umap-learn package [2], and Python 3.6.
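A minimal sketch of the diversity score, assuming 2D normalized UMAP features for the reference site (with known labels) and for the new site; variable names are illustrative, and averaging Dref over labels is an assumption about the aggregation step.

```python
import numpy as np

def diversity_score(ref_2d, ref_labels, new_2d):
    """Diversity score = Dnew / Dref, computed on 2D UMAP features."""
    classes = np.unique(ref_labels)

    # Per-label centroids of the reference-site UMAP features.
    centroids = np.stack([ref_2d[ref_labels == c].mean(axis=0) for c in classes])

    # Dref: mean distance of reference points from their own label centroid.
    d_ref = np.mean([np.linalg.norm(ref_2d[ref_labels == c] - centroids[i], axis=1).mean()
                     for i, c in enumerate(classes)])

    # Dnew: mean distance of new-site points from the *nearest* reference centroid.
    dists = np.linalg.norm(new_2d[:, None, :] - centroids[None, :, :], axis=2)
    d_new = dists.min(axis=1).mean()

    return d_new / d_ref
```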

Impact of data diversity on DL task: Sites 2-4 were considered new sites, and a decision was needed on which site(s) to include in the training cohort. We trained models using data from Site 1 and, in turn, added data from each new site (2 to 4) to the Site 1 cohort to generate new DL models. These models were then evaluated on test samples from the individual sites as well as on the test data pooled across Sites 1-4, and performance was assessed vis-à-vis data diversity.
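The experiment could be organized as in the sketch below; load_site, combine, and build_model are hypothetical helpers (not from [1]), and the training settings are placeholders.

```python
# Retrain with each new site added to the Site 1 cohort, then evaluate on
# every site's test split; assumes the model is compiled with an accuracy metric.
results = {}
for new_site in ["Site2", "Site3", "Site4"]:
    x_train, y_train = combine(load_site("Site1", "train"), load_site(new_site, "train"))
    model = build_model()                                # DL-CNN classifier as in [1]
    model.fit(x_train, y_train,
              validation_data=combine(load_site("Site1", "val"), load_site(new_site, "val")),
              epochs=50)
    # Per-site test accuracy for the model trained on Site 1 + new_site.
    results[new_site] = {s: model.evaluate(*load_site(s, "test"), verbose=0)[1]
                         for s in ["Site1", "Site2", "Site3", "Site4"]}
```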

Results and Discussion

From Fig. 3 and Fig. 4, it is apparent that the proposed framework is a good indicator of data diversity. Site 3 and Site 4 show pronounced scattering and higher diversity scores. Site 2 is clustered tightly around Site 1, indicating that it is not very different from the Site 1 data. Based on this analysis, we suggest that Site 4 and Site 3 would be the best candidates for inclusion in new training (in that order).
Site 3 data results in the best classification accuracy on the all-site pooled test data (89%) as well as on the test data from Sites 1, 3, and 4 (Fig. 5A). Moreover, the Site 3 based model boosts accuracy on Site 4 (the most diverse) data compared to the Site 2 based model (Site 3 model = 89%, Site 2 model = 72%). While Site 4 inclusion does improve overall accuracy (85%), the impact is not as pronounced as with Site 3. This effect is primarily attributed to Site 4's skewed label balance. From Fig. 5B, it is evident that Site 4 inclusion yields higher accuracy for the labels whose data balance is comparable to that of Site 2 or Site 3 (see Fig. 1C). For Labels 2 and 3, which are highly imbalanced in Site 4 (~7% and 0.4%), the corresponding accuracies of the Site 4 based model are also lower (78% and 61%). This suggests that our hypothesis of using data diversity visualization to guide the addition of data from new sites is reasonable, provided the datasets are also more or less balanced.

Conclusion

Overall, the experiments validate our intuition that the proposed diversity framework is a sound basis for judicious data selection in deep learning-based model development.

Acknowledgements

No acknowledgement found.

References

1. Shanbhag DD, et al. A generalized deep learning framework for multi-landmark intelligent slice placement using standard tri-planar 2D localizers. In: Proceedings of ISMRM 2019, Montreal, Canada. p. 670.

2. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426, 2018. https://arxiv.org/abs/1802.03426

3. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579-2605.

Figures

Figure 1. Data labels and distribution. A: Five labels used for knee localizer data classification. B: Total localizer image data distribution per site; Site 4 has the least amount of data. C: Label distribution per site; Site 1 is considered the reference and normalized to 100% per label. Site 4 has a highly skewed Label 2 and Label 3 distribution.

Figure 2: Data diversity visualization framework. From the existing DL model M, we derive latent representations as the features (total = 16384). The reference-site features are used to construct a UMAP transform, which compresses the feature space (N = 2) and normalizes it to lie between 0 and 1. For new-site data, the DL features from model M undergo the same dimensionality reduction using the existing UMAP transform. The features can then be visualized as a scatter plot.

Figure 3. The features from the UMAP transform are visualized as 2D scatter plots for the reference site (Site 1) and each new site (Sites 2-4). Notice that Site 2 features (A) overlap well with reference Site 1, with the overlap reducing for Site 3 (B) and Site 4 (C). Site 4 has the least overlap, i.e., the highest diversity from reference Site 1. Since the process is stochastic in nature, two runs were done; both runs are consistent in manifesting the diversity of each site.

Figure 4. The diversity score is computed using the method proposed in the Methods section. Notice that the stochastic nature of the process manifests as slightly varying numbers for Dref and Dnew. However, the computed diversity scores are consistent across runs, distinct across sites (no overlap), and match the visual assessment (Site 4 is the most diverse dataset).

Figure 5: Accuracy of knee label classification. The highest accuracy in each bin is highlighted in blue. A: Accuracy comparison across the different site models shows that the model trained with Site 3 as additional data outperforms all others (except for Site 2 on Site 2 test data). Note the boost in accuracy on Site 4 data with the Site 3 model vs. the Site 2 model. B: Label-wise accuracy evaluation of the models based on Site 3 and Site 4 data. The accuracy of the Site 4 model matches and sometimes outperforms the Site 3 model, except for labels with high data imbalance (see Fig. 1C, Table 2).

Proc. Intl. Soc. Mag. Reson. Med. 30 (2022), Abstract 3168. DOI: https://doi.org/10.58530/2022/3168