0394

MRI-Based Response Prediction to Immunotherapy of Late-Stage Melanoma Patients Using Deep Learning
Annika Liebgott1,2, Louisa Fay1, Viet Chau Vu2, Bin Yang1, and Sergios Gatidis2
1Institute of Signal Processing and System Theory, University of Stuttgart, Stuttgart, Germany, 2Department of Radiology, University Hospital of Tuebingen, Tuebingen, Germany

Synopsis

Immunotherapy is a promising approach for treating advanced stages of malignant melanoma. However, the treatment can cause serious side effects and not every patient responds to it, which means crucial time may be lost to an ineffective treatment. Assessing the likely therapy response is hence an important research issue. This study investigates the potential of medical imaging and machine learning to solve this task. To this end, we trained and compared different deep learning models on multi-modal PET/MR images to differentiate non-responsive from responsive patients.

Introduction

Malignant melanoma has shown increasing worldwide incidence over the last decades1. Although the prognosis is very good when the disease is caught early, it is a very aggressive type of cancer that spreads quickly once it has advanced beyond the skin barrier, leading to low survival rates. In recent years, therapy with immune checkpoint inhibitors has led to significantly improved patient outcomes. The treatment has shown the potential to slow down, stop or completely reverse the disease's progress2. While the positive effects are promising, there are also issues which often keep immunotherapy from being the first choice of treatment. For instance, the stimulation of the immune system can cause severe side effects. The main concern, however, is that only some patients respond to the treatment while the disease continues to progress in others, in the worst case wasting crucial time on an ineffective therapy.
Hence, a major issue in different clinical research disciplines is finding out what differentiates responsive from non-responsive patients, as well as trying to predict the individual therapy response potential. Our research focuses on using PET/MR imaging combined with machine learning (ML) to predict therapy response. In this study, we implemented a deep learning (DL) system that has been trained to distinguish responsive from non-responsive patients based on multi-modal images of segmented organs closely related to the immune system.
In the past years, several related studies have been published proposing the use of ML approaches. To the best of our knowledge, none of these studies used an approach similar to ours. They either used other imaging modalities3,4, combined imaging with other prior knowledge (e.g. RNA sequencing3), or did not use radiological images but other clinical examinations (e.g. H&E stain5, genetic analyses6,7 or blood tests8).

Methods

Our data set consists of PET/MR images (Figure 1) from 24 patients acquired at three time points over the course of treatment. As our cohort of patients is relatively small and we did not want individual physiological traits to influence our results, we only used the liver, spleen and spine (Figure 2). Segmentation of the organs was performed by trained physicians on the MR images; the resulting VOIs were then transferred to the corresponding PET images and ADC maps.
The general structure of our DL system is depicted in Figure 3. All network architectures we used are constructed in an encoder-decoder structure. Figure 4 shows the investigated models.
In some experiments, we employed transfer learning10 (TL), a strategy to boost the performance of a model (especially for small data sets) by re-using a pre-trained model to initialize the training process. The hypothesis is that a model trained to classify images will learn general features significant for arbitrary image classification tasks, meaning a new model will only need to learn the relationship between those features and the desired outputs. We hence re-used the encoder of a pre-trained model for medical image segmentation (dataset: Medical Segmentation Decathlon11) and only adapted the decoder layers to our task.
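The idea of re-using an encoder and adapting only the decoder can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the "pre-trained" encoder is a fixed random projection (`W_enc`), the decoder is a plain logistic regression, and the data are synthetic. It is not the network or training procedure used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained encoder: weights that stay fixed.
W_enc = rng.normal(scale=0.25, size=(16, 8))

def encode(x):
    """Frozen feature extractor (ReLU of a fixed linear map)."""
    return np.maximum(x @ W_enc, 0.0)

def train_decoder(x, y, lr=0.1, steps=200):
    """Fit only the decoder (logistic regression) on frozen encoder features."""
    feats = encode(x)
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(feats @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y  # gradient of the cross-entropy loss w.r.t. the logits
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic toy data (not patient data): 40 samples with 16 features each.
x = rng.normal(size=(40, 16))
y = (x[:, 0] > 0).astype(float)

W_before = W_enc.copy()
w_dec, b_dec = train_decoder(x, y)
```

Only `w_dec` and `b_dec` are updated during training; `W_enc` is identical before and after, which is the defining property of this form of transfer learning.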
Our experiments were conducted considering two questions:
  1. How useful are the chosen organs for our task?
  2. Do we need all three examinations?

Results

The performance of the best models in terms of F1 score, sorted by organ and number of examinations used, is presented in Figure 5. Table a) shows the best results without TL, Table b) those obtained when utilizing a pre-trained model.
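For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from predicted and true labels. The labels are made up for illustration, and treating "responder" as the positive class is an assumption, since the abstract does not state which class was used as positive.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when the positive class never occurs.
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels (1 = responder, 0 = non-responder); not study data.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)  # TP = 2, FP = 1, FN = 1 -> 4/6
```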

Discussion

In early experiments, we found that using all organs combined for training did not work well, hence we further investigated the organs individually. While liver and spleen led to F1 scores of ~0.8, our best result for the spine was only 0.67. This indicates that the information contained in this organ is less useful to our task and possibly caused the poor performance when using all organs combined. Further experiments combining liver and spleen will be conducted in the near future.
In general, F1 scores were higher if all three examinations were used, which was expected because the model can then relate the response label to the image differences between acquisitions. However, our best overall model, with an F1 score of 0.82, resulted from using only the first examination of the spleen and employing TL. This indicates that the spleen may contain valuable information about the responsiveness of a patient even before the start of treatment, which needs to be further explored. Overall, TL proved to be mainly useful for models trained on the first examination only but yielded no benefit when using all three examinations.
Although our results look promising, classifier performance needs to be increased significantly. Based on our experiments, we are confident that using a larger training base could achieve this goal. Nevertheless, these findings are only to be viewed as proof of concept and need to be validated on a larger, more diverse data set to be able to draw more general conclusions.

Conclusion

The results presented in this proof-of-concept study indicate that predicting therapy response based on radiological imaging using DL should be feasible. Further investigation could help to find a non-invasive method to predict patients' individual therapy response potential at an early stage.

Acknowledgements

This research was conducted with the support of Vector Stiftung.

References

1. Cancer Research UK, https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/melanoma-skin-cancer

2. I. Lugowska, P. Teterycz and P. Rutkowski: Immunotherapy of melanoma. Contemporary oncology (Poznan, Poland), 2018, 22(1A), 61–67. doi:10.5114

3. R. Sun, E. J. Limkin, M. Vakalopoulou, L. Dercle, S. Champiat, S. R. Han et al.: A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study, The Lancet Oncology, Volume 19, Issue 9, 2018, pp 1180-1191.

4. S. Trebeschi, S. G. Drago, N. J. Birkbak, I. Kurilova, A.M. Calin, A. Delli Pizzi et al.: Predicting Response to Cancer Immunotherapy using Non-invasive Radiomic Biomarkers, Annals of Oncology, mdz108, March 2019.

5. Z. Dawood, N. Coudray, R. H. Kim, S. Nomikou, U. Moran, J. S. Weber et al.: Prediction of response and toxicity to immune checkpoint inhibitor therapies (ICI) in melanoma using deep neural networks machine learning, Journal of Clinical Oncology, 2018, pp 9529-9529

6. S. Gandhi, S. Pabla, M. Nesline, M. Pandey, M. S. Ernstoff, G. K. Dy et al.: Algorithmic prediction of response to checkpoint inhibitors: Hyperprogressors versus responders, Journal of Clinical Oncology, 2017, pp 11565-11565.

7. C. Morrison, S. Pabla, J. M. Conroy et al.: Predicting response to checkpoint inhibitors in melanoma beyond PD-L1 and mutational burden, J. ImmunoTherapy of Cancer, 2018, pp 6 – 32.

8. C. Krieg, M. Nowicka, S. Guglietta, S. Schindler, F. J. Hartmann, L. M. Weber et al.: Biomarker prediction to anti-PD-1 immunotherapy by using high dimensional single cell analysis, The Journal of Immunology May 1, 2018, 200 (1 Supplement) 174.26.

9. E. Castro, J. S. Cardoso and J. C. Pereira: “Elastic deformations for data augmentation in breast cancer mass detection,” in 2018 IEEE EMBS International Conference on Biomedical Health Informatics (BHI), 2018, pp. 230–234.

10. M. Raghu, C. Zhang, J. M. Kleinberg and S. Bengio, “Transfusion: Understanding transfer learning with applications to medical imaging,” CoRR, vol. abs/1902.07208, 2019. [Online]. Available: http://arxiv.org/abs/1902.07208

11. A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. H. Menze, O. Ronneberger, R. M. Summers, P. Bilic, P. F. Christ, R. K. G. Do, M. Gollub, J. Golia-Pernicka, S. Heckers, W. R. Jarnagin, M. McHugo, S. Napel, E. Vorontsov, L. Maier-Hein and M. Cardoso, “A large annotated medical image dataset for the development and evaluation of segmentation algorithms,” CoRR, vol. abs/1902.09063, 2019. [Online]. Available: http://arxiv.org/abs/1902.09063

Figures

Figure 1: Exemplary abdominal slices of one examination: fat-weighted (a) and water-weighted (b) Dixon images, ADC map (c) and PET image (d) were acquired. For each patient, examinations were conducted prior to, two weeks after and two months after starting immunotherapy.

Figure 2: Example of the segmentations we used in this study to train our model. The organs have been chosen due to their close relationship to the immune system, making it more likely to see an immune response in them, as well as the comparably small variability between patients. The latter is intended to decrease the risk of physiological traits of individual patients leading to random correlations between patient physiology and their therapy response without an underlying causality regarding the immune system.

Figure 3: Pipeline of our deep learning framework. Preprocessing of the data consists of organ segmentation, followed by data normalization and creation of TFRecords for efficient data processing, which are then split into training and test data and resized such that all images have the same size. The modular design allows arbitrary deep learning models to be plugged in to investigate different network architectures. Dashed lines mark optional modules (data augmentation, transfer learning). As data augmentation, we implemented random rotation, random shift and elastic deformation9.
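Two of the preprocessing and augmentation steps named in the caption, normalization and random shift, can be sketched in NumPy as follows. The abstract does not specify how either step was implemented, so the choices here (z-scoring the voxels inside the organ mask, integer shifts with zero padding) and the function names are assumptions for illustration only.

```python
import numpy as np

def normalize(volume, mask):
    """Z-score the voxels inside the organ mask; set everything outside to 0."""
    voxels = volume[mask]
    std = voxels.std()
    out = np.zeros(volume.shape, dtype=np.float64)
    out[mask] = (voxels - voxels.mean()) / (std if std > 0 else 1.0)
    return out

def random_shift(volume, max_shift, rng):
    """Shift each axis by a random integer offset, filling vacated voxels with 0."""
    out = volume.copy()
    for axis in range(out.ndim):
        s = int(rng.integers(-max_shift, max_shift + 1))
        if s == 0:
            continue
        out = np.roll(out, s, axis=axis)
        # np.roll wraps around, so zero the slices that re-entered on the far side.
        index = [slice(None)] * out.ndim
        index[axis] = slice(0, s) if s > 0 else slice(s, None)
        out[tuple(index)] = 0.0
    return out

# Synthetic 3D volume with a cubic "organ" mask (not patient data).
rng = np.random.default_rng(0)
vol = rng.normal(loc=5.0, size=(8, 8, 8))
mask = np.zeros(vol.shape, dtype=bool)
mask[2:6, 2:6, 2:6] = True

norm = normalize(vol, mask)
shifted = random_shift(norm, max_shift=2, rng=rng)
```

With a shift of at most two voxels, the masked cube in this toy example stays fully inside the volume, so the augmentation moves the organ without cropping it.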

Figure 4: Encoder and decoder architectures investigated in this study. The encoders are a simple 3D convolutional network (ConvNet), a convolutional network arranged in residual blocks (ResNet) or a convolutional network structured as an inception cell (InceptionNet). The decoders are a simple classification decoder (SCD), consisting of one flattening and dense layer followed by a dense classification layer, and a time-distributed classification decoder (TDCD) which features additional LSTM layers to account for the different time instances of the examinations.

Figure 5: Results of the best performing models without and with TL. Best models per organ without TL (a) are marked green. In the case of liver and spleen, all three examinations are needed to achieve the highest result. For the spine, using only the first two works better, although the F1 score is still low. The best results achieved with TL (b) are mostly lower than those obtained by training the same models from scratch. Models which benefited from TL are highlighted in blue. In particular, models using the liver or spleen of only the first examination achieved an increased F1 score.

Proc. Intl. Soc. Mag. Reson. Med. 29 (2021)