Investigating machine learning approaches for quality control of brain tumor spectra
Sreenath P Kyathanahally1, Victor Mocioiu2, Nuno Miguel Pedrosa de Barros3, Johannes Slotboom3, Alan J Wright4, Margarida Julià-Sapé 2, Carles Arús2, and Roland Kreis1

1Depts. Radiology and Clinical Research, University of Bern, Bern, Switzerland, 2Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Universitat Autònoma de Barcelona, Barcelona, Spain, 3DRNN, Institute of Diagnostic and Interventional Neuroradiology/SCAN, University Hospital Bern, Bern, Switzerland, 4CRUK Cambridge Institute, University of Cambridge, Cambridge, United Kingdom

Synopsis

Despite many potential applications of MR spectroscopy in the clinic, its use remains limited – and the need for human experts to identify bad-quality spectra may contribute to this. Previous studies have shown that machine learning methods can be developed to accept or reject a spectrum automatically. In this study, we extend this work by testing different machine learning methods on 1916 spectra from the eTUMOUR and INTERPRET databases. The RUSBoost classifier, which handles unbalanced data, improved specificity and accuracy compared to other classifiers, in particular in combination with an extended feature set and multi-class labels.

Introduction:

A major hurdle for successful application and robust use of MRS in the clinic is the need for local technical expertise to prevent inappropriate interpretation of bad MR spectra1. A previous study2 suggested an automatic procedure based on support-vector-machine (SVM) classification to rate MRS tumor data as acceptable or unacceptable, eliminating the need for a human specialist and the associated subjectivity. We tried to verify its performance on a much larger database and, upon finding poor performance, aimed to pinpoint the reasons for this and to develop a more robust tool, as listed in Table 1.

Methods:

Spectra were downloaded from the eTUMOUR3 and INTERPRET4 databases (Table 2). They were acquired on 1.5T MR scanners (GE, Siemens, Philips) using standard PRESS and STEAM sequences at either short (20-32 ms) or long echo times (135-144 ms). Spectral quality was originally judged by three spectroscopists; a spectrum was classified as acceptable if at least two of them accepted it.

All spectra were preprocessed using DMS software5. Briefly, preprocessing included phase correction using the water reference, residual-water removal, spectral alignment, SNR calculation and unit-length normalization. The training and test sets for developing new classifiers comprised 1916 and 241 1H single-voxel spectra, respectively. The latter were also judged by 4 local spectroscopists (twice each).
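As an illustration of the last two preprocessing steps, the following MATLAB sketch computes a simple peak-SNR estimate and applies unit-length normalization. It is not the DMS implementation; the function name, index ranges and SNR definition are assumptions for illustration only.

```matlab
% Illustrative sketch only - not the DMS implementation.
function [spec, snr] = normalizeAndSnr(spec, signalRange, noiseRange)
% spec        - complex spectrum (1 x nPoints), already phased and aligned
% signalRange - indices covering the metabolite region (assumption)
% noiseRange  - indices of a signal-free region used as noise estimate (assumption)

snr  = max(real(spec(signalRange))) / std(real(spec(noiseRange)));  % simple peak-SNR estimate
spec = spec / norm(spec);                                           % unit-length (L2) normalization
end
```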

The method of Wright et al.2 consists of extracting 5 independent components (ICs) and training an SVM classifier. All classifiers, including RUSBoost (random undersampling and boosting)6, and all feature-extraction methods were implemented in MATLAB.
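The MATLAB sketch below illustrates such an IC-plus-SVM baseline under stated assumptions: the third-party FastICA toolbox supplies fastica, X is an nSpectra x nPoints matrix of preprocessed spectra, y holds the accept/reject labels, and the SVM settings are placeholders rather than those of Ref. 2.

```matlab
% Sketch of an IC-based SVM quality classifier (assumptions as stated above).
nICs = 5;
[icasig, A] = fastica(X, 'numOfIC', nICs);       % X ~ A*icasig; A holds per-spectrum IC loadings
features    = A;                                 % nSpectra x nICs feature matrix
svmModel    = fitcsvm(features, y, 'KernelFunction', 'linear', 'Standardize', true);
cvModel     = crossval(svmModel, 'KFold', 30);   % 30-fold cross-validation as in Table 3
fprintf('Cross-validated error: %.3f\n', kfoldLoss(cvModel));
```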

Results and Discussion:

Wright’s method2 performed well when applied to the same dataset as in Ref. 2, but when applied to the whole dataset (1916 spectra), the highest specificity (i.e. the proportion of bad spectra that are correctly identified) was around 45%. We investigated potential reasons for this failure and tested modifications and new tools (see Table 1 for an overview and Table 3 for the results). Neither separating the data by echo time (TE) nor by database improved the classification performance considerably; nor did increasing the number of ICs. Moving to a larger feature set, reduced by sequential forward feature selection or principal component analysis (PCA), in combination with replacing the SVM by LDA (linear discriminant analysis)7, clearly improved specificity, but still not sufficiently.
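A minimal sketch of these two feature-reduction variants is given below, assuming F is an nSpectra x nFeatures matrix of candidate features and y a vector of 0/1 quality labels; the fold count and the number of retained components are illustrative choices, not the settings used in the study.

```matlab
% Sequential forward feature selection wrapped around an LDA classifier.
critfun  = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= predict(fitcdiscr(Xtr, ytr), Xte));   % misclassification count as criterion
selected = sequentialfs(critfun, F, y, 'cv', 10);     % forward selection, 10-fold CV (assumed)
ldaModel = fitcdiscr(F(:, selected), y);              % LDA on the selected features

% PCA alternative: keep the first k principal-component scores as features.
[~, scores] = pca(zscore(F));
k      = 10;                                          % illustrative number of components
ldaPCA = fitcdiscr(scores(:, 1:k), y);
```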

To test whether the huge class imbalance in our dataset (87% acceptable) was the reason for the poor performance, we used RUSBoost6, a classifier known for handling class imbalance. Its specificity was highest for 9 ICs. Since ICs alone may be too simplistic, we added other features (see Table 1) and used MATLAB's TreeBagger (bootstrap aggregation of decision-tree ensembles) to estimate feature importance, arriving at 24 features, among which SNR and the skewness of the spectrum ranked highly. With these features, the specificity improved to 83%. Another extension was to move to a three-class grading (“poor” if just one spectroscopist accepted the spectrum). The specificity then increased substantially to 92%.
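A sketch of how these two MATLAB tools can be called is shown below; F24 stands for the final 24-feature matrix, y for the (binary or three-class) labels, and the hyperparameters are placeholders, since the exact settings are not reported here.

```matlab
% RUSBoost ensemble on the extended feature set (hyperparameters are placeholders).
rus   = fitensemble(F24, y, 'RUSBoost', 200, 'Tree', 'LearnRate', 0.1);
cvRus = crossval(rus, 'KFold', 30);                       % 30-fold CV as in Table 3
fprintf('RUSBoost cross-validated error: %.3f\n', kfoldLoss(cvRus));

% Feature-importance ranking with TreeBagger (out-of-bag permuted variable importance).
tb = TreeBagger(500, F24, y, 'Method', 'classification', 'OOBVarImp', 'on');
[~, rank] = sort(tb.OOBPermutedVarDeltaError, 'descend'); % highest-ranked features first
```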

To judge potential effects of inconsistent human labeling, 241 spectra were relabeled locally, and intra- and inter-rater reliability, as well as consistency with the earlier raters, were assessed. Intra-rater variability was low (< 1 point mean difference on a 10-point quality score), as was inter-rater variability. The average agreement between the new consensus ratings and the individual spectroscopists was 92% (binary labels). The local spectroscopists agreed with the earlier consortium experts in 84% of cases; however, they were systematically more critical in accepting spectra, as reflected in the lower sensitivity (Table 4).
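Binary agreement between two raters can be quantified as sketched below; r1 and r2 are assumed to be 0/1 label vectors from two raters, and Cohen's kappa is included as a standard chance-corrected measure beyond the raw percentages reported above.

```matlab
% Raw percent agreement and Cohen's kappa for two binary raters (illustrative).
agreement = mean(r1 == r2);                         % raw percent agreement
C  = confusionmat(r1, r2);                          % 2x2 cross-tabulation of labels
po = sum(diag(C)) / sum(C(:));                      % observed agreement
pe = sum(sum(C,1) .* sum(C,2)') / sum(C(:))^2;      % agreement expected by chance
kappa = (po - pe) / (1 - pe);                       % chance-corrected agreement
```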

Finally, the locally reclassified cases were used to compare the individual human raters with the binary and multi-class RUSBoost classifiers. Table 5 shows that the RUSBoost classifiers performed well, though not quite reaching the human raters.

Conclusions:

Classification appears to be easier when the classes are nearly balanced, especially when the target class is the rarer one. Classification accuracy is commonly used as a first measure of classifier performance, but when the classes are imbalanced (here, 87% in one class), accuracy is misleading: a high value may merely reflect the underlying class distribution. It is better to examine sensitivity and specificity, which give more insight into classifier performance. Here, we trained a RUSBoost classifier, which compensates for imbalanced training data, and it showed improved specificity on an independent test set compared to the previously published method. In addition, our results suggest that multi-class labels (3 classes) may be beneficial for classification performance. The final classifiers rejected spectra with a performance comparable to that of the panel of human expert spectroscopists.
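A small worked example makes this point (the numbers only mirror the 87% imbalance and are not taken from the study): a classifier that accepts every spectrum reaches 87% accuracy yet has 0% specificity.

```matlab
% Why accuracy misleads under class imbalance (illustrative numbers).
yTrue = [true(87,1); false(13,1)];            % 87 acceptable, 13 unacceptable spectra
yPred = true(100,1);                          % "accept everything" classifier
C = confusionmat(yTrue, yPred);               % rows: true class, columns: predicted class
accuracy    = sum(diag(C)) / sum(C(:));       % 0.87 - looks good, but is uninformative
sensitivity = C(2,2) / sum(C(2,:));           % good spectra correctly accepted: 1.00
specificity = C(1,1) / sum(C(1,:));           % bad spectra correctly rejected: 0.00
```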

Acknowledgements

We thank the eTUMOUR and INTERPRET consortium for providing data. This research was carried out in the framework of the European Marie-Curie Initial Training Network, ‘TRANSACT’, PITN-GA-2012-316679, 2013-2017 and also supported by the Swiss National Science Foundation

References

1. R Kreis. Issues of spectral quality in clinical 1H-magnetic resonance spectroscopy and a gallery of artifacts. NMR Biomed. 2004; 17(6):361-381.

2. AJ Wright et al. Automated quality control protocol for MR spectra of brain tumors. Magn Reson Med. 2008; 59(6):1274-1281.

3. http://solaria.uab.es:9091/eTumour/

4. http://gabrmn.uab.es/interpretdb/publico/

5. http://gabrmn.uab.es/dms.

6. C Seiffert et al. RUSBoost: Improving classification performance when training data is skewed. 19th International Conference on Pattern Recognition. 2008;1–4.

7. S Ortega-Martorell et al. SpectraClassifier 1.0: a user-friendly, automated MRS-based classifier development system. BMC Bioinform. 2010; 11:106.

Figures

Table 1: Potential reasons for failure of previous classification tool and adaptations tested to improve classification.

Table 2: Overview of the data downloaded from eTUMOUR and INTERPRET databases

Table 3: Performance of classifiers on training set with 30-fold cross validation. Accuracy, sensitivity, specificity reflect the proportion of overall correctly identified spectra, the proportion of good spectra that are correctly identified as good, and the proportion of bad spectra correctly identified as bad, respectively

Table 4: Agreement of new quality labeling by local spectroscopists with previous consensus labels. For definition of accuracy, sensitivity, specificity, see Table 3

Table 5: Performance of human experts and the optimized classifier (30-fold cross validation) with respect to new consensus labels. For definition of accuracy, sensitivity, specificity, see Table 3
