Michael Amann1,2, Pavel Falkovskiy3,4,5, Alain Thoeni1, Tobias Kober3,4,5, Alexis Roche3,4,5, Bénédicte Maréchal3,4,5, Philippe Cattin6, Tobias Heye2, Oliver Bieri2, Till Sprenger7, Christoph Stippich2, Gunnar Krueger4,5,8, Ernst-Wilhelm Radue1, and Jens Wuerfel1
1Medical Image Analysis Center (MIAC), Basel, Switzerland, 2Department of Radiology, University Hospital of Basel, Basel, Switzerland, 3Advanced Clinical Imaging Technology, Siemens Healthcare AG, Lausanne, Switzerland, 4Department of Radiology, University Hospital (CHUV), Lausanne, Switzerland, 5École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 6Department of Biomedical Engineering, University of Basel, Basel, Switzerland, 7Department of Neurology, DKD Helios Klinik, Wiesbaden, Germany, 8Siemens Medical Solutions USA, Boston, MA, United States
Synopsis
Performance of FreeSurfer
and FSL was compared on T1-weighted 3D MRI data of 22 controls as a function of
scan session, scanner type and segmentation pipeline. Intra-class correlation
coefficients and percentage volume differences were calculated for the segmentation
results of both pipelines. Strong agreement was found for whole brain, white
matter and cortex. For each pipeline, the impact of experimental factors was
assessed by linear mixed-effects analysis. We found a significant scanner effect on
the results of both segmentation pipelines. For subcortical structures, segmentation
reliability was higher in FSL than in FreeSurfer, whereas for cortex and WM,
FreeSurfer was more stable.
Introduction
In many neurological
disorders, atrophy of the brain and/or its substructures has been identified as
an important marker of pathological processes. As these changes are often
subtle, it is important to investigate the effects of hardware setup and data
post-processing on the volume measurements. In this study, we compared the
performance of the freely available software packages FreeSurfer (FS) [1] and
FSL [2] on quantitative volumetric measurements, using T1-weighted
3D MRI data derived from different scanners as a function of scan session and
segmentation pipelines.
Materials and Methods
All MRI scans were performed
at a single site on four different clinical scanners (all Siemens Healthcare,
Germany):
a.) 1.5T MAGNETOM Avanto (12-channel head coil),
b.) 1.5T MAGNETOM Espree (12-channel head coil),
c.) 3T MAGNETOM Prisma (64-channel head-neck coil),
d.) 3T MAGNETOM Skyra (20-channel head-neck coil).
On all scanners, the
scanning protocols encompassed three-dimensional T1w MPRAGE sequences. At
3T, the spatial resolution was 1 mm³ isotropic (TR/TI/BW/α/TA = 2.3s/0.9s/240Hz/px/9°/5:12min);
at 1.5T, the resolution was 1.25x1.25x1.20 mm³ (TR/TI/BW/α/TA = 2.4s/1.0s/180Hz/px/8°/4:42min).
Sequence parameters were based on the ADNI protocol [3] and were adjusted to yield a
similar signal-to-noise ratio at both field strengths.
On scanners (a) and (c), four
MPRAGE scans were performed:
R0: baseline scan
R1: back-to-back scan (best-case scenario, used to calculate scan-rescan reliability)
R2: scan after repositioning and new shim
R3: scan performed two to four weeks after baseline
On scanners (b) and (d), only
R0 and R3 were performed.
Twenty-two healthy subjects
underwent this study protocol (13 women, median age 25.0y, range 20.6y-39.4y). Before
each scan session, each subject's hydration status and arterial pressure were
controlled. Brain segmentation was performed with two software pipelines: FreeSurfer
v5.3.0 and FSL v5.0. In the latter, SIENAX [4] was used for whole brain, grey
matter (GM) and white matter (WM) segmentation, and FIRST [5] was used for subcortical GM segmentation. Differences in anatomical structure definitions between the two
pipelines were taken into account in the statistical comparisons.
IBM SPSS Statistics v22 and R
v3.2.2 were used for statistical assessment. To test for the comparability of the two
segmentation methods, we calculated intra-class
correlation coefficients (ICC; two-way mixed-model) for each scan scenario as well as percentage
differences in volume (Equation 1):
$$\Delta V = 100\,\% \cdot \frac{V(\mathrm{FSL}) - V(\mathrm{FS})}{0.5 \cdot \left(V(\mathrm{FSL}) + V(\mathrm{FS})\right)} \qquad (1)$$
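As an illustration of Equation 1, the following minimal Python sketch computes the symmetric percentage volume difference for a single structure. The function name and the example volumes are hypothetical and are not taken from the study data.

```python
def percent_volume_difference(v_fsl: float, v_fs: float) -> float:
    """Symmetric percentage difference between an FSL and a FreeSurfer volume.

    Implements Equation 1: the difference is normalised by the mean of the
    two volumes, so neither pipeline is treated as the reference.
    """
    mean_volume = 0.5 * (v_fsl + v_fs)
    if mean_volume == 0:
        raise ValueError("mean volume is zero; percentage difference is undefined")
    return 100.0 * (v_fsl - v_fs) / mean_volume


# Hypothetical volumes in millilitres, for illustration only (not study data).
delta_v = percent_volume_difference(v_fsl=1020.0, v_fs=980.0)
print(f"{delta_v:.2f} %")  # prints "4.00 %"
```

A positive value indicates that FSL yields the larger volume, a negative value that FreeSurfer does.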
For each
software package, the impact of experimental factors on the segmentation results was assessed
by linear mixed-effects analysis [6]. In the respective statistical model,
scanner type, subject age and gender were included as fixed effects, whereas “subject”
was modelled as a random effect with both a random intercept and a random slope
depending on the scanning session. The significance of fixed effects was
assessed by Bonferroni-corrected t-tests against the null hypothesis (with Satterthwaite-approximated
degrees of freedom). The reliability of segmentation was quantified
by test-retest ICCs according to Equation 2:
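$$\mathrm{ICC} = \frac{\mathrm{variance}(\mathrm{subject})}{\mathrm{variance}(\mathrm{subject}) + \mathrm{variance}(\mathrm{scan/rescan}) + \mathrm{variance}(\mathrm{R2}) + \mathrm{variance}(\mathrm{R3})} \qquad (2)$$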
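Equation 2 can be illustrated with a minimal Python sketch that combines variance components, such as those estimated by a mixed-effects model, into a test-retest ICC. The function name and the numerical variance components are assumptions chosen for illustration; they are not results of this study.

```python
def icc_from_variance_components(
    var_subject: float,
    var_scan_rescan: float,
    var_r2: float,
    var_r3: float,
) -> float:
    """Test-retest ICC as the share of between-subject variance (Equation 2)."""
    components = (var_subject, var_scan_rescan, var_r2, var_r3)
    if any(v < 0 for v in components):
        raise ValueError("variance components must be non-negative")
    total = sum(components)
    if total == 0:
        raise ValueError("total variance is zero; ICC is undefined")
    return var_subject / total


# Hypothetical variance components (arbitrary units), for illustration only.
icc = icc_from_variance_components(
    var_subject=90.0, var_scan_rescan=6.0, var_r2=2.0, var_r3=2.0
)
print(f"{icc:.3f}")  # prints "0.900"
```

The ICC approaches 1 when between-subject variance dominates the scan-rescan, repositioning (R2) and follow-up (R3) components.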
Results
Table 1 summarizes the ICCs (adjusted
for absolute volume) between FS and FSL for different brain
structures. Strong agreement between segmentation results (ICC>0.9) was
found on all scanner types for whole brain (including brainstem), WM and cortex.
For subcortical GM, strong agreement was found on all scanners except the Prisma.
For smaller subcortical structures, agreement was substantially lower. The volume
differences between FS and FSL were small in larger brain structures;
however, for most scanner-structure combinations they reached statistical significance (Table 2).
In general, scanner type had a relevant effect on the segmentation results of both
segmentation pipelines (1-4% for the larger structures, up to 9% for smaller
structures such as the caudate). Interestingly, the impact of experimental
factors differed between the pipelines: in WM, cortex and subcortical
structures, scanner effects were about a factor of two higher in FS
than in FSL. Additionally, for the subcortical structures, the segmentation reliability
reflected by test-retest ICC was higher in FSL than in FS, whereas for cortex
and WM, FS was more stable (Table 3). Compared to scan-rescan variability, the
effects of R2 and R3 (repositioning, re-shim, physiological variances) were
minor in all software/scanner combinations (Figure 1).
Discussion
The segmentation results of FS
and FSL are similar for larger brain compartments including the whole brain, WM
and cortex. However, systematic and significant differences can be observed. In
subcortical structures, the performance of the two pipelines differs
considerably. In general, FSL segmentation seems to be more stable against hardware
effects; however, these effects are still significant. In conclusion, we
demonstrated that experimental factors have a significant impact on the
segmentation results regardless of the applied post-processing software. In
multi-site and multi-scanner studies, these effects have to be considered and
compared both with scan-rescan variability and with the expected size of the
effects under investigation.
Acknowledgements
We thank the local
MR technologists’ team for their support with the MR scans.
References
1. Fischl B et al, Neuron 33:341-355, 2002.
2. Smith SM et al, NeuroImage 23(S1):208-219, 2004.
3. Jack CR et al, JMRI 27:685-691, 2008.
4. Smith SM et al, NeuroImage 17:479-489, 2002.
5. Patenaude B et al, NeuroImage 56:907-922, 2011.
6. Winter B. arXiv:1308.5499, 2013.