Alexandre Triay Bagur1, Vishal Jain1, and Paul Aljabar1
1Perspectum Ltd, Oxford, United Kingdom
Synopsis
Keywords: Analysis/Processing, Segmentation, Radiomics
Motivation: Machine learning (ML) models need to be periodically evaluated to combat ‘data drift’, where the target population changes over time.
Goal(s): The goal is to present a framework for optimally selecting new datasets and updating ML models.
Approach: We retrained a pancreas segmentation model for MRI scans. Using radiomics features, we selected 50 new cases to annotate: the unlabelled cases whose predicted segmentations differed most from those in the previous training set.
Results: The system identified a diversity of failure cases, which flagged challenges in real-world data. The mean performance of the model improved after retraining with the additional cases.
Impact: The proposed system offers a practical guide for researchers and technicians retraining machine learning models, particularly deep learning models for organ segmentation in MRI. Selecting an optimal new set of data to annotate saves time and cost.
Introduction
The performance of machine learning models needs to be periodically evaluated, not least because of the ‘data drift’ phenomenon, where the input data distribution to a model may change over time [1], for example when data are acquired on a new MRI scanner unseen by the model. Updating deep learning (DL) models, continuously or periodically, may be important to keep them relevant to target (diseased) populations.
Many methods have been reported for training deep learning models. However, the process of creating new model versions through retraining has been less explored and tends to be inefficient and static, particularly the selection and curation of new datasets that will maintain optimal model performance. It is also important to reduce the operator cost of annotating new cases while maximising the ability of the model to adapt to the new data.
In this work, we propose an efficient semi-automated system for DL segmentation model retraining based on radiomics scores and model uncertainty. We show an application of the system to pancreas segmentation in MRI images.
Methods
The aim of the system is to prioritise cases for inclusion into the training data for the new iteration of the DL model. The system, illustrated schematically in Figure 1, works in the following way:
1. Start with the $n$-th iteration trained model, $m_n$, and its training set, $d_n$.
2. Define a list of task-related features. Compute their mean and covariance over the manual labels on $d_n$.
3. For all the candidate images in the new data to be labelled:
- Use the model $m_n$ to predict labels.
- Extract features from the predicted labels.
- Measure the discrepancy of the features from the features derived from $d_n$, using a score such as the Mahalanobis distance (MD).
4. Order the candidate datasets for model retraining by ranking the distance scores from highest to lowest, optionally filtering further by tags, for example to ensure a balanced distribution of scanner manufacturers or to include subjects with particular demographics or disease status. Taking the required number of cases for retraining from the top of the ordered list yields the set $d_{n+1}$.
5. (Optional) Manually review the ordered cases to further exclude datasets based on other criteria (e.g. to exclude images with specific artefacts) or to record other reasons for failure that inform data collection.
6. Annotate/label cases in the set $d_{n+1}$.
7. Retrain to obtain model $m_{n+1}$ from $d_n \cup d_{n+1}$.
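Steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, features are assumed to arrive as rows of a numeric array, and a pseudo-inverse is used for the covariance as a guard against singular matrices. The score is the Mahalanobis distance $D(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}$, with $\mu$ and $\Sigma$ estimated from the manual labels on $d_n$.

```python
import numpy as np


def fit_reference(train_features):
    """Step 2: mean and inverse covariance of features from manual labels on d_n."""
    x = np.asarray(train_features, dtype=float)
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    # Assumption of this sketch: a pseudo-inverse tolerates a singular covariance.
    return mean, np.linalg.pinv(cov)


def mahalanobis_scores(candidate_features, mean, inv_cov):
    """Step 3: distance of each candidate's predicted-label features from d_n."""
    diff = np.asarray(candidate_features, dtype=float) - mean
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.clip(squared, 0.0, None))


def rank_candidates(candidate_features, mean, inv_cov):
    """Step 4: candidate indices ordered from highest to lowest distance."""
    scores = mahalanobis_scores(candidate_features, mean, inv_cov)
    order = np.argsort(-scores, kind="stable")
    return order, scores
```

In use, `fit_reference` would be called once on the features of $d_n$, and `rank_candidates` on the features extracted from the predictions of $m_n$ on the unlabelled pool.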
A previously trained pancreas segmentation model [2] was used in this work. Its training data comprised liver-to-kidney abdominal T1w Dixon scans.
Radiomics features (volume, surface area, semi-major axis and centroid of the mask relative to the image) were calculated using pyradiomics [3] for the previous training set (N=279) and for the new data (N=1,817). The new data were ranked by MD to the training data and filtered by manufacturer. The 20 highest-ranked Siemens cases, 15 Philips cases and 15 GE Healthcare cases were selected, yielding 50 new cases in $d_{n+1}$ (34 at 3T, 16 at 1.5T). The selected datasets were annotated using MONAI Label [4], which allows ranking cases based on model uncertainty within an active learning framework. The pancreas segmentation model was then retrained using the previous training set and the 50 additional annotations, and evaluated on 24 test cases.
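The manufacturer filtering described above amounts to taking the highest-ranked cases subject to a per-manufacturer quota. The sketch below shows one way this could be done; the function name and the `(case_id, manufacturer, md_score)` tuple layout are assumptions made for illustration, and only the quotas (20 Siemens, 15 Philips, 15 GE Healthcare) come from the text.

```python
from collections import defaultdict


def select_with_quotas(cases, quotas):
    """Pick the highest-scoring cases while respecting per-manufacturer quotas.

    cases: iterable of (case_id, manufacturer, md_score) tuples (hypothetical layout).
    quotas: dict mapping manufacturer name to the number of cases to select.
    """
    # Rank all candidates from highest to lowest distance score.
    ranked = sorted(cases, key=lambda case: case[2], reverse=True)
    taken = defaultdict(int)
    selected = []
    for case_id, manufacturer, _score in ranked:
        # Manufacturers absent from the quota table are skipped entirely.
        if taken[manufacturer] < quotas.get(manufacturer, 0):
            selected.append(case_id)
            taken[manufacturer] += 1
    return selected


# Quotas reported in this work.
QUOTAS = {"Siemens": 20, "Philips": 15, "GE Healthcare": 15}
```

Calling `select_with_quotas(cases, QUOTAS)` on a full candidate pool would return at most 50 case identifiers, in descending order of distance score.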
Results and Discussion
Example low and high MD cases are shown in Figure 2. The high MD case showed wrap-around artefacts and a poor segmentation in which the model targeted the subcutaneous fat. Failures of the initial segmentation model were manually recorded and are summarised in Figure 3.
The performance of the initial model on the new test set (N=24), quantified by Dice Similarity Coefficient (DSC), was 0.712 ± 0.251 (mean ± standard deviation). The retrained model obtained a DSC of 0.810 ± 0.094. Future work will compare the performance improvement obtained with our system against that obtained by selecting cases at random, and quantify the time saved.
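For reference, the DSC between a predicted mask $P$ and a manual mask $T$ is $2|P \cap T| / (|P| + |T|)$. The following is a minimal sketch of how the per-case score and its summary statistics could be computed. The function names are hypothetical, and two choices are assumptions of this sketch rather than statements about this work: empty-versus-empty masks score 1, and the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


def summarise(scores):
    """Mean and sample standard deviation of per-case DSC values."""
    s = np.asarray(scores, dtype=float)
    return float(s.mean()), float(s.std(ddof=1))
```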
Figure 4 compares pancreas segmentations before and after retraining on a newly selected test case. This subject had ingested pineapple juice in preparation for a magnetic resonance cholangiopancreatography (MRCP) scan. Importantly, the properties of MRI data depend not only on the acquisition protocol but also on any procedures performed in preparation for the imaging session.
Conclusions
Characterising failures in $d_{n+1}$ informs future data collection, for example by focusing on those cases that need to be processed by the model but where the quality may be improved through changes in the acquisition protocol.
Radiomics features serve as a surrogate quantitative method to identify cases with poor segmentations, which may be useful for retraining and improving a DL model.
References
1. Stacke K, Eilertsen G, Unger J, Lundstrom C. Measuring Domain Shift for Deep Learning in Histopathology. IEEE J Biomed Health Inform. 2021;25(2):325-336. doi:10.1109/JBHI.2020.3032060.
2. Owler J, Triay Bagur A, Marriage S, et al. Pancreas Volumetry in UK Biobank: Comparison of Models and Inference at Scale. In: Papież BW, Yaqub M, Jiao J, Namburete AIL, Noble JA, eds. Medical Image Understanding and Analysis. Vol 12722. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2021:265-279. doi:10.1007/978-3-030-80432-9_21.
3. van Griethuysen JJM, Fedorov A, Parmar C, et al. Computational Radiomics System to Decode the Radiographic Phenotype. Cancer Research. 2017;77(21):e104-e107. doi:10.1158/0008-5472.CAN-17-0339.
4. Diaz-Pinto A, Alle S, Ihsani A, et al. MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images. March 2022. http://arxiv.org/abs/2203.12362. Accessed May 17, 2022.