3609

Deep Learning to Produce Realistic MR Images through Fréchet Inception Distance Monitoring
Sunghun Seo1, Seung Hong Choi2, and Sung-Hong Park1
1Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, 2Department of Radiology, Seoul National University College of Medicine, Seoul, Republic of Korea

Synopsis

It is known that optimizing a deep learning model for the best validation loss achieves the best quantitative results in image reconstruction, but the resulting images are often blurry. In this study we propose an alternative optimization approach in which a convolutional neural network (CNN) is trained beyond the best validation loss to produce realistic MR images by monitoring the Fréchet Inception Distance. The new approach generated sharper and more realistic images than the conventional optimization, providing a new insight into optimization for MR image reconstruction.

INTRODUCTION

The development of deep learning has had a great impact on medical imaging fields such as classification, segmentation, and reconstruction. In particular, the use of convolutional neural networks (CNNs) has enabled active research in imaging acceleration and image quality enhancement. A trained model is typically chosen by monitoring its performance on the validation sets during training in terms of its loss function, such as the L1 or L2 loss. This is done to 1) choose a model that will generalize well to new data and 2) avoid overfitting. The chosen model successfully minimizes error during the test phase, but details in the output images are often blurred. In the medical imaging field it is critical for images to be as detailed and realistic as possible for potential clinical use, requiring a process beyond simple loss optimization. We therefore propose a way to produce sharper and more realistic images by training beyond the loss-function optimum, which we term “over-iteration”, selecting the model by monitoring the Fréchet Inception Distance (FID)1.

METHODS

Spin-echo T1-weighted images and fast spin-echo T2-weighted images of 21 normal subjects were acquired on a Siemens Trio 3.0T scanner (Siemens Medical Solutions, Erlangen, Germany). Imaging parameters were: TR/TE = 500/9.8 ms for T1 and 3700/103 ms for T2, FOV = 220×165 mm2, matrix = 320×240, and number of slices = 12-15. Thirteen data sets were used for training, two for validation, and six for testing.

Deep learning models were developed to enhance the quality of undersampled data. Fully sampled T1-weighted images and undersampled T2-weighted images (R=4) were used to train the network to produce fully sampled T2-weighted images2. The T2-weighted images were undersampled by a factor of 4 through uniform undersampling with densely sampled center lines, whose number was set to 70% of the total sampled lines.
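To make the sampling scheme concrete, below is a minimal NumPy sketch of such a mask for the 240 phase-encode lines used here, assuming 70% of the 60 sampled lines (R=4) form a contiguous center block and the remaining lines are spread uniformly over outer k-space; the exact line placement in the original work may differ.

```python
# Sketch of a uniform undersampling mask with a densely sampled center
# (assumptions: 240 phase-encode lines, R=4, 70% of sampled lines at center).
import numpy as np

def undersampling_mask(n_pe=240, r=4, center_fraction=0.7):
    n_sampled = n_pe // r                                # lines kept (60 for R=4)
    n_center = int(round(center_fraction * n_sampled))   # dense center lines (42)
    n_outer = n_sampled - n_center                       # uniformly spaced lines (18)

    mask = np.zeros(n_pe, dtype=bool)
    c0 = n_pe // 2 - n_center // 2
    mask[c0:c0 + n_center] = True                        # densely sampled center block
    outer = np.flatnonzero(~mask)                        # remaining k-space positions
    picks = np.linspace(0, len(outer) - 1, n_outer).round().astype(int)
    mask[outer[picks]] = True                            # spread outer lines uniformly
    return mask

mask = undersampling_mask()
assert mask.sum() == 60  # effective acceleration R = 240/60 = 4
```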

Performance was evaluated and compared between two models: Model-1, a single generative model chosen at the best validation L1 loss, and Model-2, the proposed single generative model over-iterated to the best validation FID. The images of the proposed model were also evaluated against a previous generative adversarial network (GAN)-based model2, which showed an effective way to produce realistic medical images using adversarial networks, to check our model’s tendency toward producing realistic images.
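A minimal sketch of the two selection rules is given below; the helper functions (train_one_epoch, validate_l1, validate_fid) are hypothetical placeholders for the actual training and validation code, not the authors’ implementation.

```python
# Sketch: checkpoint the same network under two selection criteria.
import copy

def train_with_fid_monitoring(model, n_epochs=3000):
    best_l1, best_fid = float("inf"), float("inf")
    model_1, model_2 = None, None
    for epoch in range(n_epochs):
        train_one_epoch(model)            # hypothetical training step
        l1 = validate_l1(model)           # conventional criterion (validation L1)
        fid = validate_fid(model)         # proposed criterion (validation FID)
        if l1 < best_l1:                  # Model-1: best validation L1 loss
            best_l1, model_1 = l1, copy.deepcopy(model)
        if fid < best_fid:                # Model-2: over-iterated, best validation FID
            best_fid, model_2 = fid, copy.deepcopy(model)
    return model_1, model_2
```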

The performance of the models was assessed using the normalized mean squared error (NMSE) and structural similarity (SSIM) indices. The FID score was used to quantify how realistic the output images were compared to the ground-truth images1. The FID was developed and is widely used to compare GAN performance3 in terms of realistic resemblance to ground-truth images, and it is known to correlate with human judgment4; a lower score means closer resemblance to the ground truth.
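For illustration, a short sketch of these metrics follows; this is our own implementation, not the authors’ code, and the FID here is computed from precomputed Inception-v3 feature statistics (mean and covariance) of the real and generated image sets.

```python
# Sketch of NMSE, SSIM, and FID (Frechet distance between feature Gaussians).
import numpy as np
from scipy import linalg
from skimage.metrics import structural_similarity

def nmse(ref, est):
    # normalized mean squared error: ||ref - est||^2 / ||ref||^2
    return np.sum((ref - est) ** 2) / np.sum(ref ** 2)

def ssim(ref, est):
    return structural_similarity(ref, est, data_range=ref.max() - ref.min())

def fid(mu1, sigma1, mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2}),
    # with (mu, sigma) the Inception feature statistics of each image set
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary residue
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```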

A U-net with a residual learning scheme5 was used, in which the generative model was designed to learn the difference between input and label images. The adversarial network structure for the reference GAN model was adapted from previous work6: four convolutional blocks consisting of a 3×3 kernel (stride 2), rectified linear unit (ReLU) activation, and batch normalization (except for the first block). The Adam optimizer with learning rate 0.0002 and momentum 0.5 was used to optimize the L1 loss of the models. For the reference GAN model, the Adam optimizer with learning rate 0.0001 and momentum 0.5 was used for the adversarial network. The loss weights for the GAN model were λ=10 for the L1 loss and λ=1 for the adversarial loss (L2 loss); the adversarial weight λ=0 was used for Model-1 and Model-2. All of the models were trained for 3000 epochs. No additional loss function was used to optimize the models.
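As an illustration of the adversarial network described above, here is a minimal PyTorch sketch; only the block structure and optimizer settings follow the text, while the channel widths and the final patch-output layer are our assumptions.

```python
# Sketch of the four-block adversarial network: 3x3 conv (stride 2), ReLU,
# batch normalization in all blocks except the first.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, use_bn=True):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)]
    if use_bn:                              # no batch norm in the first block
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

discriminator = nn.Sequential(
    conv_block(1, 64, use_bn=False),        # first block: conv + ReLU only
    conv_block(64, 128),
    conv_block(128, 256),
    conv_block(256, 512),
    nn.Conv2d(512, 1, kernel_size=3, padding=1),  # assumed patch real/fake output
)

# Adam with momentum (beta1) = 0.5; lr = 0.0001 for the adversarial network.
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))
```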

RESULTS

The learning curves shown in Figure 1 illustrate the training process of the models. As indicated by the red arrows, the validation loss for the single generative model reached its minimum (Model-1) at a very early stage of training (Fig 1A). An increase in validation loss while the training loss decreases is typically considered overfitting, and it is conventional to stop training at this point. However, the FID was still decreasing at this point, as indicated by the black arrows, and converged at a later stage (Model-2) of training (Fig 1B). The converged FID score was comparable to that of the reference GAN-based model (Table 1).

Figure 2 shows the output T2-weighted images from the different models. The single generative model of Model-2 produced sharper and more realistic images than Model-1 and showed results comparable to the reference GAN model. NMSE and SSIM were better for Model-1 (Table 1), a well-known phenomenon2, but the FID score of Model-2 was clearly better than that of Model-1 and comparable to that of the reference when the model was over-iterated.

DISCUSSION and CONCLUSION

Through FID monitoring, we could quantitatively observe the training behavior after the minimum validation loss had been reached. This over-iteration process in model optimization produced sharper and more detailed information for the training sets, benefiting the test results in terms of realistic resemblance. While this behavior might be dismissed as overfitting in many research fields, applying the over-iteration process to medical images with a typical single generative model has shown potential for producing realistic images, and its simplicity can be easily adapted for further investigation.

Acknowledgements

No acknowledgement found.

References

1. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems; 2017. p 6626-6637.

2. Kim KH, Do WJ, Park SH. Improving resolution of MR images with an adversarial network incorporating images with different contrast. Med Phys 2018;45(7):3120-3131.

3. Lucic M, Kurach K, Michalski M, Gelly S, Bousquet O. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337; 2017.

4. Barratt S, Sharma R. A note on the Inception score. arXiv preprint arXiv:1801.01973; 2018.

5. Lee D, Yoo J, Tak S, Ye JC. Deep Residual Learning for Accelerated MRI Using Magnitude and Phase Networks. IEEE Trans Biomed Eng 2018;65(9):1985-1995.

6. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017. p 5967-5976.

Figures

FIGURE 1. Validation FID and L1 loss vs. epochs for Model-1 and Model-2. The red arrow indicates the point at which Model-1 was chosen, where the validation loss (orange) was minimum; from this point on, the validation loss increases. The behavior of the FID score on the validation sets was monitored throughout the training process. The FID score is still high and decreasing at the point where the best validation L1 loss is reached (red arrow). The single generative model was over-iterated, and Model-2 was chosen when the FID score was minimum (black arrow).

FIGURE 2. Visual comparison of reconstructed T2-weighted images between methods. The first column shows the undersampled images input to the CNN models. Subsequent columns show images processed by: Model-1) the single generative model chosen at the best validation L1 loss; Model-2) the proposed single generative model over-iterated beyond the best validation L1-loss epoch until the best validation FID score; GAN-based) the reference model chosen at the best validation FID score. The last column shows the ground-truth (GT) T2-weighted image. The second row shows zoomed-in views of the red-box region.

Table 1. Quantitative comparison between models. NMSE: normalized mean squared error. SSIM: structural similarity index. FID: Fréchet Inception Distance. Undersampled: images undersampled by a factor of 4. Model-1: single generative model chosen at the best validation L1 loss. Model-2: proposed single generative model over-iterated beyond the best validation L1 loss by monitoring the best validation FID. GAN-based: reference model chosen at the best validation FID. Data are given as mean ± standard deviation across the tested in vivo subjects (N=6).
