
Volumetric vocal tract segmentation using a deep transfer learning 3D U-Net model
Subin Erattakulangara1, Sarah Gerard1, Karthika Kelat1, Katie Burnham2, Rachel Balbi2, David Meyer2, and Sajan Goud Lingala1,3
1Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, IA, United States, 2Janette Ogg Voice Research Center, Shenandoah University, Winchester, VA, United States, 3Department of Radiology, University of Iowa, Iowa City, IA, United States

Synopsis

Keywords: Machine Learning/Artificial Intelligence, Head & Neck/ENT

Quantitative study of upper-airway anatomy and function begins with delineating the 3D airway volume, making upper-airway segmentation a key prerequisite step. In this work, we propose a transfer learning-based 3D U-Net model with a ResNet encoder for vocal tract segmentation with small training datasets. We demonstrate its utility on sustained volumetric vocal tract MR scans from the recently released French speaker database.

Purpose

The human vocal tract is an exquisite instrument capable of producing fluent speech. Recently, accelerated volumetric vocal tract MRI schemes have demonstrated fast imaging of the entire vocal tract during sustained production of speech tokens (e.g., sustained utterances of English and French phonemes in <10 seconds) [1]. Quantitatively assessing vocal tract posture modulation (e.g., via vocal tract area functions) has advanced our linguistic understanding of speaker-to-speaker variability and the underlying phonetics of language and song [2]. However, these approaches rely on labor- and time-intensive manual segmentation. Recently, deep learning-based U-Net models have been applied to segment the upper airway in static 3D upper-airway MRI volumes [3], but they require large numbers of annotated training samples (on the order of 206 volumes). There is thus a need for an efficient volumetric segmentation algorithm that can be trained on a small annotated dataset and still produce accurate results. In this work, we propose a transfer learning-based 3D U-Net model with a ResNet encoder for vocal tract segmentation with small dataset training. We demonstrate its utility on sustained volumetric vocal tract MR scans from the recently released French speaker database [1].

Methods

Our model learns the mapping between an input static 3D MRI volume and its volumetric vocal tract segmentation. Figure 1(a) shows the 3D U-Net architecture. The model employs a ResNet-18 [4] encoder and a U-Net [5] decoder, and is trained by minimizing a Dice loss between the predicted segmentation and the corresponding ground-truth segmentation in the training set. For transfer learning, the network was first pre-trained on the open-source AAPM 2017 Thoracic Auto-segmentation Challenge dataset [6]. This dataset contains multi-organ segmentations, of which we used only the lung labels; the rationale is that lung segmentation provides auxiliary training for the related task of upper-airway segmentation. We used 25 lung computed tomography (CT) volumes to pre-train the network. Subsequently, the first 5 layers were frozen and the remaining layers were fine-tuned on the French speaker vocal tract MRI dataset. The training pipeline is illustrated in Figure 1(b): low-level features are learned from the CT dataset, while medium- and high-level features are learned from the French speaker dataset [1], which consists of 10 speakers with over 750 postures. In this work, we used 26 postures randomly selected across all 10 speakers, split into 17 training, 1 validation, and 8 testing volumes. Samples from the French speaker dataset are shown in Figure 2.
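As a concrete illustration of the fine-tuning step, the following is a minimal Keras/TensorFlow sketch rather than the authors' exact implementation: the tiny_unet3d builder is a toy stand-in for the ResNet-18-encoder 3D U-Net, and the input shape, learning rate, and weights filename are assumptions.

```python
import tensorflow as tf
from tensorflow import keras

def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss (1 - Dice similarity) between predicted and true masks."""
    y_true = tf.cast(y_true, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

def tiny_unet3d(input_shape=(64, 64, 64, 1)):
    """Toy encoder-decoder stand-in for the ResNet-18-encoder 3D U-Net."""
    inp = keras.Input(shape=input_shape)
    e1 = keras.layers.Conv3D(16, 3, padding="same", activation="relu")(inp)
    p1 = keras.layers.MaxPool3D(2)(e1)
    e2 = keras.layers.Conv3D(32, 3, padding="same", activation="relu")(p1)
    u1 = keras.layers.UpSampling3D(2)(e2)
    d1 = keras.layers.Concatenate()([u1, e1])           # skip connection
    d1 = keras.layers.Conv3D(16, 3, padding="same", activation="relu")(d1)
    out = keras.layers.Conv3D(1, 1, activation="sigmoid")(d1)
    return keras.Model(inp, out)

model = tiny_unet3d()

# Stage 1 (not shown): pre-train on the 25 AAPM lung-CT volumes and save weights.
# model.load_weights("lung_ct_pretrained.h5")   # assumed filename

# Stage 2: freeze the first 5 layers so low-level CT features are retained,
# then fine-tune the remaining layers on the vocal tract MRI volumes.
for layer in model.layers[:5]:
    layer.trainable = False

model.compile(optimizer=keras.optimizers.Adam(1e-4), loss=dice_loss)
# model.fit(mri_volumes, mri_masks, epochs=500, batch_size=1)
```

In the actual model, the encoder path follows the ResNet-18 block structure [4] and the decoder follows the 3D U-Net upsampling path [5]; only the freeze-then-fine-tune pattern and the Dice loss are illustrated here.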
Pre-processing was applied to both the CT and MRI datasets: voxel intensities were scaled to the range -1 to 1, and each upper-airway MRI volume was cropped to the region of interest extending from the lips to near the glottis. The network was implemented in Keras and TensorFlow [7] and trained on an NVIDIA A30 GPU for 500 epochs, which took approximately 1.5 hours. Output segmentations were post-processed in the 3D Slicer software, where non-anatomical segmentations in the pharyngeal and infraglottic regions were removed manually with the eraser tool. Performance was evaluated using the Dice coefficient. Instead of using a single manual segmentation as ground truth, we used the STAPLE method [8] to generate an optimal reference from manual segmentations by three observers (two graduate students with expertise in image processing and upper-airway anatomy, and one vocologist), which reduces inter-observer variability in the reference. An example of the three manual segmentations and the corresponding STAPLE segmentation is given in Figure 3.
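The sketch below illustrates the intensity scaling, STAPLE reference generation, and Dice evaluation; it assumes SimpleITK's STAPLEImageFilter and uses placeholder arrays in place of the actual observer masks and network predictions, so it is an illustration under those assumptions rather than the authors' pipeline.

```python
import numpy as np
import SimpleITK as sitk

def scale_to_unit_range(volume):
    """Linearly scale voxel intensities to the range [-1, 1]."""
    v = volume.astype(np.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)        # first map to [0, 1]
    return 2.0 * v - 1.0

def staple_consensus(mask_arrays):
    """Fuse binary masks from multiple observers with STAPLE [8]."""
    imgs = [sitk.GetImageFromArray(m.astype(np.uint8)) for m in mask_arrays]
    stapler = sitk.STAPLEImageFilter()
    stapler.SetForegroundValue(1.0)
    prob = stapler.Execute(imgs)                           # per-voxel probability map
    return sitk.GetArrayFromImage(prob) > 0.5              # threshold to a binary reference

def dice_coefficient(a, b):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Placeholder example: fuse three observers' masks and score a prediction.
observers = [np.random.rand(64, 64, 64) > 0.5 for _ in range(3)]  # stand-in observer masks
reference = staple_consensus(observers)
prediction = observers[0]                                          # stand-in network output
print("Dice vs. STAPLE reference:", dice_coefficient(prediction, reference))
```

Thresholding the STAPLE probability map at 0.5 yields the binary reference segmentation against which the network output is scored.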

Results

Figure 4 shows the Dice coefficient comparison for 8 test volumes from 3 different subjects; 75% of the segmentations achieved Dice scores above 0.8. Once trained, the network required an average of 10 seconds per volume for inference, plus roughly 4 minutes for post-processing in Slicer. To illustrate the segmentation quality, a GIF animation of the volumetric segmentations for 2 test subjects, along with the corresponding reference segmentations, is provided in Figure 5. The 3D U-Net segmentations were accurate near the roof of the tongue, the base of the velum, and around the epiglottis. However, the model missed thin airway openings behind the oropharynx (Figure 4, column 5) and openings between the lips (Figure 4, column 7). Refining the network to address these limitations will form the scope of our future work.

Conclusion

We have demonstrated vocal tract segmentation from static 3D upper-airway MRI with small dataset training (17 volumes). The network learns the mapping between an upper-airway MRI volume and its airway segmentation by leveraging information from an open-source CT dataset. Once trained, the model performs airway segmentation in approximately 4 minutes and 10 seconds per volume, including post-processing, on an NVIDIA A30 GPU. We also demonstrate that transfer learning has the potential to be an efficient tool for training upper-airway segmentation models with small datasets.

Acknowledgements

This work was conducted on an MRI instrument funded by NIH grant 1S10OD025025-01.

References

[1] K. Isaieva, Y. Laprie, J. Leclère, I. K. Douros, J. Felblinger, and P.-A. Vuissoz, “Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers.,” Sci Data, vol. 8, no. 1, p. 258, 2021, doi: 10.1038/s41597-021-01041-3.

[2] K. Tom, I. R. Titze, E. A. Hoffman, and B. H. Story, “Volumetric EBCT imaging of the vocal tract applied to male falsetto singing,” in Medical Imaging 1996: Physiology and Function from Multidimensional Images, Apr. 1996, vol. 2709, pp. 132–142. doi: 10.1117/12.237856.

[3] V. L. Bommineni et al., “Automatic Segmentation and Quantification of Upper Airway Anatomic Risk Factors for Obstructive Sleep Apnea on Unprocessed Magnetic Resonance Images,” Acad Radiol, 2022, doi: 10.1016/j.acra.2022.04.023.

[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Dec. 2015, [Online]. Available: http://arxiv.org/abs/1512.03385

[5] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” Jun. 2016, [Online]. Available: http://arxiv.org/abs/1606.06650

[6] J. Yang et al., “Autosegmentation for thoracic radiation treatment planning: A grand challenge at AAPM 2017,” Med Phys, vol. 45, no. 10, pp. 4568–4581, Oct. 2018, doi: 10.1002/mp.13141.

[7] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” Mar. 2016, [Online]. Available: http://arxiv.org/abs/1603.04467

[8] S. K. Warfield, K. H. Zou, and W. M. Wells, “Simultaneous Truth and Performance Level Estimation (STAPLE): An Algorithm for the Validation of Image Segmentation,” IEEE Trans Med Imaging, vol. 23, no. 7, pp. 903–921, Jul. 2004, doi: 10.1109/TMI.2004.828354.

Figures

Figure 1: Network architecture. Part (a) shows the 3D U-Net with the ResNet-18 encoder; the network consists of 4 encoder blocks and 5 decoder blocks, with data flow between units shown by arrows. Part (b) illustrates the proposed training pipeline; the yellow box marks the transfer learning model used in the U-Net architecture. Pre- and post-processing steps are used to improve the segmentation quality of the network.

Figure 2: Samples of training and testing instances from the French speaker database before post-processing. The first row shows mid-sagittal sections of the individual volumes and the second row shows the corresponding segmentations. Three samples from three different subjects illustrate the variety of vocal tract postures; 7 subjects were used for training and 3 subjects for testing. The CT volumes and segmentations shown in the first column were used to pre-train the network.

Figure 3: Comparison between manual segmentations from 3 different observers and the corresponding STAPLE segmentation.

Figure 4: Segmentation overlays with Dice coefficient values. The vertical axis indicates the type of data and segmentation: the first row shows the input volumes, the second row shows the reference STAPLE segmentations, and the third row shows the segmentations from the proposed transfer learning 3D U-Net model. Slices shown are mid-sagittal sections through the 3D segmentations.

Figure 5: GIF animation of a sagittal slice sweep. The vertical and horizontal axes show the subjects and the type of segmentation, respectively; network segmentations and references can be compared in columns two and three.

Proc. Intl. Soc. Mag. Reson. Med. 31 (2023), abstract 4523. DOI: https://doi.org/10.58530/2023/4523