2130

Clinical Validation of the InnerEye Hippocampal Segmentation Tool
Anna Schroder1, Hamza A. Salhab2,3, James Moggridge2,3, Caroline Micallef2, Jiaming Wu1,4, Sjoerd Vos5, Melissa Bristow6, Fernando Pérez-García6, Javier Alvarez-Valle6, Tarek A. Yousry2,3, John S. Thornton2,3, Frederik Barkhof1,3,4,7, Daniel C. Alexander1, and Matthew Grech-Sollars1,2
1Centre for Medical Image Computing, Department of Computer Science, University College London, London, United Kingdom, 2Lysholm Department of Neuroradiology, National Hospital for Neurology and Neurosurgery, University College London Hospitals NHS Foundation Trust, London, United Kingdom, 3Department of Brain Repair and Rehabilitation, UCL Institute of Neurology, University College London, London, United Kingdom, 4Department of Medical Physics & Biomedical Engineering, University College London, London, United Kingdom, 5Centre for Microscopy, Characterisation & Analysis, University of Western Australia, Perth, Australia, 6Health Futures, Microsoft Research Cambridge, Cambridge, United Kingdom, 7Department of Radiology and Nuclear Medicine, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, Netherlands

Synopsis

Keywords: Segmentation, Machine Learning/Artificial Intelligence

Motivation: Accurate segmentation of the hippocampus provides an important biomarker in neurodegenerative diseases, e.g., Alzheimer’s disease. However, currently available tools are not robust to disease-related atrophy.

Goal(s): We aim to demonstrate the accuracy of our InnerEye hippocampal segmentation tool on clinical data.

Approach: We fine-tuned our existing model on manually segmented data and externally validated the model on a clinical dataset of patients referred to a dementia clinic. We compare our model to three commonly used segmentation tools.

Results: Our model provides significant improvements over currently available tools when tested on an external, clinical dataset.

Impact: The hippocampal segmentation model presented in this work provides significant improvements over currently available tools in an external, clinical dataset. Segmentation performance was increased, while run-times were decreased. These results support the tool as a viable alternative in clinical settings.

Introduction

The hippocampus, a small, sub-cortical brain region, provides a critical biomarker for neurodegenerative diseases, e.g., Alzheimer’s disease1. Automated segmentation tools often fail to accurately capture heterogeneity amongst subjects, particularly in those with substantial disease-related atrophy2. Deep learning segmentation tools rely on large training datasets with ground truth segmentations. As these are expensive to obtain, tools often rely on automatically3 or semi-automatically labelled2 data which are subject to inaccuracies. Using these segmentations as ground truth for training can lead to substantial errors in segmentation models.

In this work, we use manual segmentations to improve our original hippocampal segmentation model2, which was built on the InnerEye toolbox4 with semi-automatically segmented data5. We assess the accuracy of this model on an external, clinical dataset and compare to three freely available segmentation tools: FreeSurfer6, FastSurfer3 and HIPPOSEG7.

Methods

Data

In previous work2, we trained our original InnerEye model on 1155 T1 MP-RAGE MRI scans with corresponding semi-automated segmentations from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu) (see Table 1). In this work, we used three datasets with manual segmentations: two ADNI datasets and a dataset of patients referred to the dementia clinic at the National Hospital for Neurology and Neurosurgery (NHNN).

ADNI: We downloaded two sets of ADNI T1 MP-RAGE MR scans. The first (ADNI A, N=120) was used to fine-tune the original InnerEye model to ensure the model produced segmentations closer to the manual ground truth. The second (ADNI B, N=30) was used to run internal validation on the fine-tuned model.

Manual segmentations for ADNI A were downloaded from the Harmonised Protocol (HarP) dataset8,9, in which hippocampi were segmented by five tracers, followed by an independent quality check. This dataset was split into training and validation sets for fine-tuning. MR scans from ADNI B were manually segmented by a local clinician and confirmed by a consultant neuroradiologist. Information on data splits and clinical diagnoses is provided in Table 1.

NHNN: Our clinical NHNN dataset was used to test the fine-tuned model. We pre-processed the 20 T1 MP-RAGE MR images using FSL10 to correct the bias field and apply automated cropping, improving consistency with the ADNI MR images. Scans were manually segmented following the same process as for ADNI B.
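The pre-processing described above could be sketched as follows. The abstract does not state which FSL tools, options, or step order were used, so the choice of `robustfov` for cropping, `fast -B` for bias-field correction, cropping before correction, and all file names here are assumptions made for illustration.

```python
from pathlib import Path
import shutil
import subprocess


def fsl_preprocess_commands(t1_path, out_dir):
    """Build FSL commands to crop a T1 image and then correct its bias field.

    Assumptions (not stated in the abstract): cropping uses `robustfov`,
    bias correction uses `fast -B`, and cropping runs first.
    """
    t1_path, out_dir = Path(t1_path), Path(out_dir)
    stem = t1_path.name.split(".")[0]      # "sub01.nii.gz" -> "sub01"
    cropped = out_dir / f"{stem}_crop.nii.gz"
    fast_base = out_dir / f"{stem}_fast"   # basename passed to FAST via -o
    commands = [
        # Reduce the field of view to remove neck and lower head.
        ["robustfov", "-i", str(t1_path), "-r", str(cropped)],
        # -t 1: T1-weighted input; -B: write the bias-corrected image.
        ["fast", "-t", "1", "-B", "-o", str(fast_base), str(cropped)],
    ]
    # With FSL's default NIFTI_GZ output type, FAST writes <base>_restore.nii.gz.
    restored = out_dir / f"{stem}_fast_restore.nii.gz"
    return commands, restored


def run_if_fsl_available(commands):
    """Run the commands only when every FSL binary is on PATH.

    The output directory must already exist. Returns True if the commands ran.
    """
    if not all(shutil.which(cmd[0]) for cmd in commands):
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True
```

Building the command lists separately from running them makes the pipeline easy to inspect or log before any image is touched.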


InnerEye and Fine-Tuning

Our original InnerEye model for hippocampal segmentation2 was trained using the InnerEye toolbox, a deep learning toolbox for 3D medical images4. The pipeline consists of an ensemble of five 3D U-Net models. Here, we fine-tuned each ensemble model using the ADNI A dataset. The learning rate was selected to maximise Dice score on the ADNI A validation dataset.
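As a minimal sketch of the learning-rate selection step, assuming that each candidate learning rate yields one Dice score per validation subject and that "maximise Dice score" means the highest mean validation Dice: the function name, the candidate rates, and the scores below are hypothetical placeholders, not values from this study.

```python
from statistics import mean


def select_learning_rate(val_dice_by_lr):
    """Return the learning rate with the highest mean validation Dice."""
    if not val_dice_by_lr:
        raise ValueError("no candidate learning rates supplied")
    return max(val_dice_by_lr, key=lambda lr: mean(val_dice_by_lr[lr]))


# Hypothetical per-subject validation Dice scores for three candidate rates.
candidates = {
    1e-3: [0.80, 0.82, 0.79],
    1e-4: [0.86, 0.88, 0.87],
    1e-5: [0.84, 0.85, 0.83],
}
best_lr = select_learning_rate(candidates)
```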

Comparison Tools

Segmentations from both the original2 and fine-tuned InnerEye models were compared to three commonly used tools: FreeSurfer6, FastSurfer3, and HIPPOSEG7, which is currently used clinically at the NHNN. The statistical significance of group differences was assessed using the Wilcoxon signed-rank test.
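For reference, a paired comparison of per-subject Dice scores with the Wilcoxon signed-rank test can be sketched in plain Python. This is an illustrative implementation of the exact two-sided test; the abstract does not state which software or variant (exact or normal approximation) was used, so that choice is an assumption.

```python
def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied |differences| receive average
    ranks. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Average ranks of |d|, stored doubled so that they stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks (i+1)..(j+1)
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)

    # counts[s]: number of sign assignments whose positive doubled ranks sum to s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    # Two-sided p-value: probability of a rank sum at least as far from the
    # null mean (total2 / 2) as the observed one.
    observed_dev = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in enumerate(counts)
                  if abs(2 * s - total2) >= observed_dev)
    return w_plus2 / 2, extreme / 2 ** n
```

Here `x` and `y` would hold the Dice scores of two tools on the same subjects, in the same order. The test is paired because every tool segments the same scans.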

Results

When comparing the fine-tuned model to the original InnerEye model, we observe a significant (p<0.01) improvement in Dice score on the internal dataset, ADNI B (Figure 1).

The fine-tuned InnerEye model provides the best mean Dice score on the external clinical dataset from NHNN (Figure 2). Its Dice scores are significantly better (p<0.01) than those of the other segmentation tools. FreeSurfer and FastSurfer tend to over-segment the hippocampus, resulting in low precision, while HIPPOSEG under-segments the hippocampus, resulting in low recall.
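The link between over-/under-segmentation and precision/recall can be illustrated with a short sketch using the standard definitions of Dice, precision and recall on binary masks. The toy one-dimensional masks are made up for illustration and are not data from this study.

```python
def overlap_metrics(pred, truth):
    """Dice, precision and recall for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)    # false negatives
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


# Toy example: four foreground voxels in the manual mask.
truth = [0, 1, 1, 1, 1, 0, 0, 0]
over = [1, 1, 1, 1, 1, 1, 0, 0]   # over-segmentation: extra voxels labelled
under = [0, 1, 1, 0, 0, 0, 0, 0]  # under-segmentation: voxels missed

over_scores = overlap_metrics(over, truth)    # full recall, reduced precision
under_scores = overlap_metrics(under, truth)  # full precision, reduced recall
```

Over-segmentation adds false positives, which lowers precision but leaves recall intact, while under-segmentation adds false negatives, which lowers recall but leaves precision intact. Both reduce Dice.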

Figure 3 provides a qualitative comparison of model performance in the external dataset. Figure 3a illustrates “typical” performance for each model, where the Dice score lies within the middle 50% of that model’s Dice scores. We observe the best segmentation from fine-tuned InnerEye, while FreeSurfer and FastSurfer over-segment and HIPPOSEG under-segments. Figure 3b shows a rare case where fine-tuned InnerEye is marginally outperformed by another model (HIPPOSEG).
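One way to identify such "typical" cases is to keep subjects whose Dice score falls between the 25th and 75th percentiles for a given model. The sketch below assumes that this interquartile reading is what is meant; the function name and the subject IDs in the usage note are hypothetical.

```python
from statistics import quantiles


def typical_cases(dice_by_subject):
    """Return subject IDs whose Dice lies between the 25th and 75th percentiles."""
    scores = list(dice_by_subject.values())
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return sorted(s for s, d in dice_by_subject.items() if q1 <= d <= q3)
```

Applying this per model and intersecting the resulting sets, for example `set(typical_cases(model_a)) & set(typical_cases(model_b))`, would give subjects that are typical for every model at once.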

Table 2 provides mean run-times for each of the segmentation tools on a quad-core Intel Core i5 CPU with 8 GB RAM, chosen to be comparable to clinically used hardware. InnerEye provides the fastest run-time, with inference taking 3.9 minutes per scan.

Discussions and Conclusions

We have demonstrated improvements to the performance of our InnerEye Hippocampal Segmentation tool fine-tuned on manual segmentations, which outperforms all other comparison models on an external, clinical dataset (NHNN). Our model significantly outperforms HIPPOSEG, which we currently use clinically. This supports InnerEye as a viable alternative to HIPPOSEG in clinical settings.

Future work will focus on training the model with augmentations, which will aim to improve model robustness and remove the need to pre-process the images prior to training and inference. The model will be tested on other datasets to ensure generalisability to other neurological conditions, including epilepsy.

Acknowledgements

The InnerEye software is open source and can be found at https://github.com/microsoft/InnerEye-DeepLearning. AS, MGS, FB, JW, JST and TAY are supported by the NIHR Biomedical Research Centre at UCLH. AS is supported by Engineering and Physical Sciences Research Council (EPSRC), Impact Acceleration Account (IAA) 2022-25. Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; Euro Immun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Neuro Rx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

References

1. Jack CR, Petersen RC, Xu Y, O’Brien PC, Smith GE, Ivnik RJ, et al. Rates of hippocampal atrophy correlate with change in clinical status in aging and AD. Neurology. 2000 Aug 22;55(4):484–90.

2. Schroder A, Moggridge J, Wu J, Salhab HA, Vos S, Bristow M, et al. InnerEye as a Tool for Accurate Hippocampal Segmentation. In: Proceedings of the Annual Meeting of ISMRM. Toronto; 2023.

3. Henschel L, Conjeti S, Estrada S, Diers K, Fischl B, Reuter M. FastSurfer - A fast and accurate deep learning based neuroimaging pipeline. NeuroImage. 2020 Oct 1;219:117012.

4. Oktay O, Nanavati J, Schwaighofer A, Carter D, Bristow M, Tanno R, et al. Evaluation of Deep Learning to Augment Image-Guided Radiotherapy for Head and Neck and Prostate Cancers. JAMA Netw Open. 2020 Nov 30;3(11):e2027426.

5. Hsu YY, Schuff N, Du AT, Mark K, Zhu X, Hardin D, et al. Comparison of automated and manual MRI volumetry of hippocampus in normal aging and dementia. J Magn Reson Imaging. 2002 Sep;16(3):305–10.

6. Fischl B, Salat DH, Busa E, Albert M, Dieterich M, Haselgrove C, et al. Whole Brain Segmentation: Automated Labeling of Neuroanatomical Structures in the Human Brain. Neuron. 2002 Jan 31;33(3):341–55.

7. Winston GP, Cardoso MJ, Williams EJ, Burdett JL, Bartlett PA, Espak M, et al. Automated hippocampal segmentation in patients with epilepsy: Available free online. Epilepsia. 2013;54(12):2166–73.

8. Boccardi M, Bocchetta M, Apostolova LG, Barnes J, Bartzokis G, Corbetta G, et al. Delphi definition of the EADC‐ADNI Harmonized Protocol for hippocampal segmentation on magnetic resonance. Alzheimers Dement. 2015 Feb;11(2):126–38.

9. Boccardi M, Bocchetta M, Morency FC, Collins DL, Nishikawa M, Ganzola R, et al. Training labels for hippocampal segmentation based on the EADC‐ADNI harmonized hippocampal protocol. Alzheimers Dement. 2015 Feb;11(2):175–83.

10. Jenkinson M, Beckmann CF, Behrens TEJ, Woolrich MW, Smith SM. FSL. NeuroImage. 2012 Aug;62(2):782–90.

Figures

Table 1: Data splits for the ADNI datasets. The final three columns provide the percentage of subjects in each data split with each clinical diagnosis: dementia, mild cognitive impairment (MCI) and cognitively normal (CN). We show data splits for the pre-training dataset used to train our original model2, and the two ADNI datasets used in this work: ADNI A and ADNI B.


Figure 1: Internal validation of InnerEye against manual segmentations in the ADNI B dataset.


Figure 2: External clinical validation of InnerEye against manual segmentations in the NHNN dataset.


Figure 3: Qualitative analysis of model performance in two scenarios from the external clinical dataset: a) where we observe “typical” performance from each of the models (the Dice score lies within the middle 50% of each model’s Dice scores); and b) where fine-tuned InnerEye is marginally outperformed by another model (HIPPOSEG). Blue segmentations represent the manual segmentations, and model segmentations are represented by colours consistent with Figures 1 and 2: fine-tuned InnerEye in orange, FreeSurfer in green, FastSurfer in red, and HIPPOSEG in purple.


Table 2: Average run-time for each model. Run-times were averaged across all subjects in the external validation dataset. FreeSurfer and FastSurfer perform full brain segmentation, while InnerEye and HIPPOSEG are hippocampal specific.


Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)
DOI: https://doi.org/10.58530/2024/2130