Sreenath P Kyathanahally1, Victor Mocioiu2, Nuno Miguel Pedrosa de Barros3, Johannes Slotboom3, Alan J Wright4, Margarida Julià-Sapé 2, Carles Arús2, and Roland Kreis1
1Depts. Radiology and Clinical Research, University of Bern, Bern, Switzerland, 2Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Universitat Autònoma de Barcelona, Barcelona, Spain, 3DRNN, Institute of Diagnostic and Interventional Neuroradiology/SCAN, University Hospital Bern, Bern, Switzerland, 4CRUK Cambridge Institute, University of Cambridge, Cambridge, United Kingdom
Synopsis
Despite many potential applications of MR spectroscopy in the clinic, its use remains limited, and the need for human experts to identify bad-quality spectra may contribute to this. Previous studies have shown that machine learning methods can be developed to accept or reject a spectrum automatically. In this study, we extend this approach by testing different machine learning methods on 1916 spectra from the eTUMOUR and INTERPRET databases. The RUSBoost classifier, which handles unbalanced data, improved specificity and accuracy compared to other classifiers, in particular in combination with an extended feature set and multi-class labels.

Introduction:
A major hurdle for the successful application and robust use of MRS in the clinic is the need for local technical expertise to prevent inappropriate interpretation of bad MR spectra1. A previous study2 suggested an automatic procedure based on support-vector-machine (SVM) classification to rate MRS tumor data as acceptable or unacceptable, eliminating the need for a human specialist and the associated subjectivity. We set out to verify its performance on a much larger database and, upon finding poor performance, aimed to pinpoint the reasons for this and to develop a more robust tool (overview in Table 1).
Methods:
Spectra were downloaded from the eTUMOUR3 and INTERPRET4 databases (Table 2). They were acquired on 1.5T MR scanners (GE, Siemens, Philips) using standard PRESS and STEAM sequences at either short (20-32 ms) or long (135-144 ms) echo times. Quality was originally judged by three spectroscopists; a spectrum was classified as acceptable if at least two of them agreed.
All spectra were preprocessed using the DMS software5. Briefly, this included phase correction using the water reference, residual water removal, spectral alignment, SNR calculation, and normalization to unit length. The training and test sets for developing the new classifiers comprised 1916 and 241 1H single-voxel spectra, respectively; the latter were additionally judged by four local spectroscopists (twice each).
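The DMS pipeline itself is not reproduced here, but the following MATLAB sketch illustrates the listed preprocessing steps under simplifying assumptions; the variable names (fid, fid_w) and index ranges are hypothetical placeholders, not DMS internals:

```matlab
% Illustrative preprocessing sketch (not the DMS implementation).
% fid: complex metabolite FID; fid_w: complex water-reference FID.

% Zero-order phase correction from the water reference: the phase of the
% first FID point approximates the receiver phase.
phi0 = angle(fid_w(1));
fid  = fid .* exp(-1i * phi0);

% Fourier transform to the frequency domain.
spec = fftshift(fft(fid));

% Crude residual-water removal: zero a band around the water resonance
% (DMS uses a proper filter; this index range is a placeholder).
spec(490:535) = 0;

% SNR estimate: largest peak over the noise level from a signal-free region
% (assumes the first 100 points contain no signal).
snr = max(real(spec)) / std(real(spec(1:100)));

% Spectral alignment (frequency shift to a reference peak) omitted for brevity.

% Normalization to unit length.
spec = spec / norm(spec);
```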
The method of Wright et al.2 consists of extracting 5 independent components (ICs) and training an SVM classifier. All classifiers, including RUSBoost (random undersampling and boosting)6, and all feature extraction methods were implemented in MATLAB.
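As a sketch of how such a pipeline can be set up with the MATLAB Statistics and Machine Learning Toolbox (all settings below are assumptions, not the parameters of Refs. 2 or 6; the ICA step of Ref. 2 is approximated here by MATLAB's built-in reconstruction ICA):

```matlab
% Illustrative classifier setup; X is an nSpectra-by-nPoints matrix of
% preprocessed spectra, y the acceptable/unacceptable labels.

% Feature extraction: 5 independent components (rica approximates ICA).
icaModel = rica(X, 5);
features = transform(icaModel, X);   % one 5-element feature vector per spectrum

% SVM classifier in the spirit of Wright et al. (kernel is an assumption).
svmModel = fitcsvm(features, y, 'KernelFunction', 'rbf', 'Standardize', true);

% RUSBoost ensemble: undersamples the majority class in each boosting round.
weakTree = templateTree('MaxNumSplits', 5);
rusModel = fitcensemble(features, y, 'Method', 'RUSBoost', ...
    'NumLearningCycles', 200, 'Learners', weakTree, 'LearnRate', 0.1);

yPred = predict(rusModel, features);   % in practice: use a held-out test set
```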
Results and Discussion:
Wright’s method2 performed well when applied to the same dataset as in Ref. 2, but when applied to the whole dataset (1916 spectra), the highest specificity (i.e., the proportion of bad spectra that are correctly identified) was around 45%. We tested potential reasons for this failure as well as potential modifications and new tools (see Table 1 for an overview and Table 3 for the results). Neither splitting the data by echo time (TE) or by database nor increasing the number of ICs improved the classification performance considerably. Moving to a larger feature set, using sequential forward feature selection and principal component analysis (PCA), in combination with replacing the SVM by linear discriminant analysis (LDA)7, clearly improved specificity, but still not sufficiently.
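A sketch of this feature-selection step, with an LDA misclassification count as wrapper criterion (the criterion, fold count, and number of retained components are illustrative assumptions):

```matlab
% Sequential forward feature selection wrapped around LDA.
% X: nSpectra-by-nFeatures candidate feature matrix; y: binary labels.
critFun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= predict(fitcdiscr(Xtr, ytr), Xte));   % misclassification count
selected = sequentialfs(critFun, X, y, 'cv', 5);     % 5-fold cross-validation

% Alternative: PCA for dimensionality reduction before LDA.
[~, score] = pca(X);
Xpca = score(:, 1:10);          % number of retained PCs is an assumption
ldaModel = fitcdiscr(Xpca, y);  % linear discriminant classifier
```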
To test whether the huge class imbalance in our dataset (87% acceptable) was the reason for the poor performance, we used RUSBoost6, a classifier known to handle class imbalance; its specificity was highest with 9 ICs. Since ICs alone may be too simplistic, we added other features (see Table 1) and used MATLAB's TreeBagger (bootstrap aggregation of an ensemble of decision trees) to estimate feature importance, arriving at 24 features, among which SNR and the skewness of the spectrum ranked highly. With these features, the specificity improved to 83%. A further extension was a three-class grading ("poor" if just one spectroscopist accepted the spectrum), which increased the specificity substantially, to 92%.
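A sketch of the feature-importance ranking with MATLAB's TreeBagger (the number of trees is an assumption):

```matlab
% Rank candidate features by out-of-bag permuted predictor importance.
% Xall: nSpectra-by-nCandidateFeatures matrix; y: labels.
bag = TreeBagger(500, Xall, y, 'Method', 'classification', ...
                 'OOBPredictorImportance', 'on');

imp = bag.OOBPermutedPredictorDeltaError;   % one importance value per feature
[~, order] = sort(imp, 'descend');

Xsel = Xall(:, order(1:24));   % keep the 24 most important features,
                               % e.g. SNR and spectral skewness
```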
To judge potential effects of inconsistent human labeling, 241 spectra were relabeled locally, and intra- and inter-rater reliability, as well as consistency with the earlier raters, were tested. Intra-rater variance was low (<1 point mean difference on a 10-point quality score), as was inter-rater variability. The average agreement between the new consensus ratings and the individual spectroscopists was 92% (binary labels). The local spectroscopists agreed with the earlier consortium experts in 84% of cases; however, they were systematically more critical in accepting spectra, as reflected in the lower sensitivity (Table 4).
Finally, all locally reclassified cases and the RUSBoost binary and multi-class classifiers were combined in Table 5, which shows that the RUSBoost classifiers performed well, though not quite reaching the human raters.
Conclusions:
Classification appears to be easier when the classes are nearly balanced, especially when the target class is the rarer one. Classification accuracy is commonly used as a first measure of classifier performance, but when the classes are imbalanced (here, 87% in one class), accuracy is misleading, since a high value may merely reflect the underlying class distribution: a trivial classifier that accepts every spectrum would already score 87%. Sensitivity and specificity give more insight into classifier performance. Here, we trained a RUSBoost classifier, which combats imbalanced training data, and it showed improved specificity on an independent test set compared to previously published methods. In addition, our results suggest that multi-class labels (3 classes) may be beneficial for classification performance. The final classifiers performed comparably to the panel of human expert spectroscopists in rejecting spectra.
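To make the accuracy/specificity distinction concrete, a minimal numerical sketch with hypothetical counts matching the 87%/13% class split:

```matlab
% A trivial classifier that accepts every spectrum (hypothetical counts).
nGood = 870;  nBad = 130;
TP = nGood; FN = 0;    % all acceptable spectra accepted
FP = nBad;  TN = 0;    % all bad spectra also accepted

accuracy    = (TP + TN) / (TP + TN + FP + FN);  % 0.87 -- looks deceptively good
sensitivity = TP / (TP + FN);                   % 1.00
specificity = TN / (TN + FP);                   % 0.00 -- no bad spectrum caught
```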
Acknowledgements
We thank the eTUMOUR and INTERPRET consortia for providing data. This research was carried out in the framework of the European Marie Curie Initial Training Network 'TRANSACT' (PITN-GA-2012-316679, 2013-2017) and was also supported by the Swiss National Science Foundation.

References
1. R Kreis. Issues of spectral quality in clinical 1H-magnetic resonance spectroscopy and a gallery of artifacts. NMR Biomed. 2004;17(6):361-381.
2. AJ Wright et al. Automated quality control protocol for MR spectra of brain tumors. Magn Reson Med. 2008;59(6):1274-1281.
3. http://solaria.uab.es:9091/eTumour/
4. http://gabrmn.uab.es/interpretdb/publico/
5. http://gabrmn.uab.es/dms
6. C Seiffert et al. RUSBoost: Improving classification performance when training data is skewed. 19th International Conference on Pattern Recognition. 2008:1-4.
7. S Ortega-Martorell et al. SpectraClassifier 1.0: a user-friendly, automated MRS-based classifier development system. BMC Bioinform. 2010;11:106.