3099

Optimal selection of candidate datasets for deep learning model retraining using radiomics: application to pancreas segmentation in MRI

Alexandre Triay Bagur¹, Vishal Jain¹, and Paul Aljabar¹
¹Perspectum Ltd, Oxford, United Kingdom

Synopsis

Keywords: Analysis/Processing, Segmentation, Radiomics

Motivation: Machine learning (ML) models need to be periodically evaluated to combat ‘data drift’, where the target population changes over time

Goal(s): The goal is to present a framework for optimally selecting new datasets and updating ML models

Approach: We retrained a pancreas segmentation model for MRI scans. Using radiomics features, we selected 50 new cases to annotate: the unlabelled cases whose predicted segmentations differed most from the segmentations in the previous training set

Results: The system identified a diversity of failure cases, which flagged challenges in real-world data. The mean performance of the model improved after retraining with the additional cases

Impact: The proposed system yields a helpful guide for researchers and technicians for retraining machine learning models, particularly deep learning models for organ segmentation in MRI. Selecting an optimal new set of data to annotate produces time and cost savings

Introduction

The performance of machine learning models needs to be periodically evaluated, not least because of the ‘data drift’ phenomenon, where the input data distribution to a model may change over time¹, for example when data are acquired on a new MRI scanner unseen by the model. Updating deep learning (DL) models, continuously or periodically, may be important to keep them relevant to target (diseased) populations.

Many methods have been reported for training deep learning models. However, the process of creating new versions of models through retraining has been less explored and tends to be inefficient and static, particularly the selection and curation of new datasets that will maintain optimal model performance. It is also important to reduce operator cost in annotating new cases while maximising the ability of the model to adapt to the new data.

In this work, we propose an efficient semi-automated system for DL segmentation model retraining based on radiomics scores and model uncertainty. We show an application of the system to pancreas segmentation in MRI images.

Methods

The aim of the system is to prioritise cases for inclusion into the training data for the new iteration of the DL model. The system, illustrated schematically in Figure 1, works in the following way:

1. Start with the $n$-th iteration trained model, $m_n$, and its training set, $d_n$.

2. Define a list of task related features. Compute their mean and covariance over the manual labels on $d_n$.

3. For all the candidate images in the new data to be labelled:

   a. Use the model $m_n$ to predict labels.
   b. Extract features from the predicted labels.
   c. Measure the discrepancy of these features from the features derived from $d_n$, using a score such as the Mahalanobis distance (MD).

4. Order the candidate datasets for model retraining by ranking the distance scores from highest to lowest, optionally filtering further by tags, for example to ensure a balanced distribution of scanner manufacturers or to include subjects with particular demographics or disease status. Taking the required number of cases for retraining from the top of the ordered list yields the set $d_{n+1}$.

5. (Optional) Manual review of the ordered cases to further exclude datasets based on other criteria (e.g. to exclude images with specific artefacts or list other reasons for failure that inform data collection).

6. Annotate/label cases in the set $d_{n+1}$.

7. Retrain to obtain model $m_{n+1}$ from $d_n \cup d_{n+1}$.
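The scoring and ranking in steps 2 to 4 can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes the features are already available as numeric arrays (one row per case), uses the standard definition $\mathrm{MD}(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}$ with $\mu$ and $\Sigma$ the mean and covariance over $d_n$, and the function names `mahalanobis_scores` and `rank_candidates` are hypothetical.

```python
import numpy as np


def mahalanobis_scores(train_features, candidate_features):
    """Mahalanobis distance of each candidate row to the training distribution.

    train_features: array-like, shape (n_train, n_features), from manual labels on d_n.
    candidate_features: array-like, shape (n_candidates, n_features), from predicted labels.
    """
    train = np.asarray(train_features, dtype=float)
    cand = np.asarray(candidate_features, dtype=float)

    mean = train.mean(axis=0)
    cov = np.atleast_2d(np.cov(train, rowvar=False))
    # Assumption: a pseudo-inverse is used so that a singular covariance
    # (e.g. collinear features) does not raise an error.
    cov_inv = np.linalg.pinv(cov)

    diff = cand - mean
    # Quadratic form diff_i^T * cov_inv * diff_i for every candidate i.
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))


def rank_candidates(case_ids, scores):
    """Return case identifiers ordered from highest to lowest score."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [case_ids[i] for i in order]
```

Features on very different scales (for example volume and centroid coordinates) are handled by the covariance term, which is one reason a distance of this kind may be preferred over a plain Euclidean distance.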

A previously trained pancreas segmentation model² was used in this work. Its training data comprised liver to kidney abdominal T1w Dixon scans.

Radiomics features (volume, surface area, semi-major axis and centroid of the mask relative to the image) were calculated using pyradiomics³ for the previous training set (N=279) and for the new data (N=1,817). The new data were ranked by MD to the training data and filtered by manufacturer. The 20 highest-ranked Siemens cases were selected, together with the 15 highest-ranked from Philips and the 15 highest-ranked from GE Healthcare, yielding 50 selected new cases in $d_{n+1}$ (34 at 3T, 16 at 1.5T). The datasets were annotated using MONAI Label⁴, which allows ranking cases based on model uncertainty within an active learning framework. The pancreas segmentation model was then retrained using the previous training set and the 50 additional annotations, and evaluated on 24 test cases.
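The manufacturer filter can be illustrated with a short sketch. It assumes each candidate is described by a hypothetical `(case_id, manufacturer, md_score)` tuple and that the per-manufacturer quotas are those reported above; the function name `select_by_quota` is an assumption for illustration, not part of any library.

```python
from collections import defaultdict

# Quotas as reported in the text: 20 + 15 + 15 = 50 cases.
QUOTAS = {"Siemens": 20, "Philips": 15, "GE Healthcare": 15}


def select_by_quota(cases, quotas):
    """Pick the highest-scoring cases per manufacturer, up to each quota.

    cases: iterable of (case_id, manufacturer, md_score) tuples.
    quotas: dict mapping manufacturer name to the number of cases to keep.
    Manufacturers without a quota are skipped.
    """
    ranked = sorted(cases, key=lambda case: case[2], reverse=True)
    taken = defaultdict(int)
    selected = []
    for case_id, manufacturer, _score in ranked:
        if taken[manufacturer] < quotas.get(manufacturer, 0):
            selected.append(case_id)
            taken[manufacturer] += 1
    return selected
```

The returned list preserves the global MD ordering, so a subsequent manual review can still start from the most discrepant cases.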

Results and Discussion

Example low and high MD cases are shown in Figure 2. The high MD case showed a wrap-around artefact and a poor model segmentation that targeted the subcutaneous fat. Failures of the initial segmentation model were manually recorded and are summarised in Figure 3.

The performance of the initial model on the new test set (N=24), quantified by Dice Similarity Coefficient (DSC), was 0.712 ± 0.251 (mean ± standard deviation). The retrained model obtained a DSC of 0.810 ± 0.094. Future work will compare the performance improvement obtained with our system against that obtained by selecting cases at random, and quantify the time saved.
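For reference, the DSC between a predicted mask $A$ and a reference mask $B$ is $2|A \cap B| / (|A| + |B|)$. A minimal sketch of the per-case computation and the summary statistics is given below; it assumes binary masks stored as arrays, the function names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np


def dice_coefficient(pred_mask, ref_mask):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def summarise(dsc_values, ddof=1):
    """Mean and standard deviation of per-case DSC values."""
    values = np.asarray(dsc_values, dtype=float)
    return values.mean(), values.std(ddof=ddof)
```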

Figure 4 compares pancreas segmentations prior to and after retraining on a newly selected test case. This subject had ingested pineapple juice in preparation for a magnetic resonance cholangiopancreatography (MRCP) scan. Importantly, the properties of MRI data depend not only on the acquisition protocol but also on any procedures performed in preparation for the imaging session.

Conclusions

Characterising failures in $d_{n+1}$ informs future data collection, for example by focusing on those cases that need to be processed by the model but where the quality may be improved through changes in the acquisition protocol.

Radiomics features serve as a surrogate quantitative method to identify cases with poor segmentations, which may be useful for retraining and improving a DL model.

References

1. Stacke K, Eilertsen G, Unger J, Lundstrom C. Measuring Domain Shift for Deep Learning in Histopathology. IEEE J Biomed Health Inform. 2021;25(2):325-336. doi:10.1109/JBHI.2020.3032060.

2. Owler J, Triay Bagur A, Marriage S, et al. Pancreas Volumetry in UK Biobank: Comparison of Models and Inference at Scale. In: Papież BW, Yaqub M, Jiao J, Namburete AIL, Noble JA, eds. Medical Image Understanding and Analysis. Vol 12722. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2021:265-279. doi:10.1007/978-3-030-80432-9_21.

3. van Griethuysen JJM, Fedorov A, Parmar C, et al. Computational Radiomics System to Decode the Radiographic Phenotype. Cancer Research. 2017;77(21):e104-e107. doi:10.1158/0008-5472.CAN-17-0339.

4. Diaz-Pinto A, Alle S, Ihsani A, et al. MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images. March 2022. http://arxiv.org/abs/2203.12362. Accessed May 17, 2022.

Figures

Figure 1: Diagram of the proposed system, which leads to selection and curation of a new subset of data to annotate, $d_{n+1}$. The system starts with a trained model, $m_n$, and its training set. The new data may then be annotated and used (jointly with the previous training set) to train the next iteration of the model, $m_{n+1}$

Figure 2: Examples of cases with low (left, MD=0.495) and high (right, MD=35) Mahalanobis distance (MD). The case with high MD has a wrap-around artefact and the previous model segmentation was poor (bottom left in axial view)

Figure 3: Categorised reasons for failure of the initial segmentation model in the 50 additional training datasets

Figure 4: Example of a case with high intensity in the stomach due to pineapple juice ingestion in preparation for an MRCP scan. This confounded the initial pancreas model (green outline). The retrained model (red outline) segmented the pancreas correctly (visualised with 3D Slicer). Importantly, image features of MRI datasets need to be understood in the context of the entire acquisition protocol, because features in a single image series may depend on subject preparation prior to the entire imaging session

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

DOI: https://doi.org/10.58530/2024/3099