5167

Real-time MRI and Audio Synchronization for Vocal Tract Analysis in Linguistics

Haidee Joy Paterson¹, Ben Lang²,³, Zainab Hermes⁴, Samantha Wray³,⁵, Osama Abdullah¹, Alec Marantz³, and Hadi Zaatiti⁶
¹Core Technology Platform, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates, ²University of California San Diego, San Diego, CA, United States, ³New York University, New York, NY, United States, ⁴The University of Chicago, Chicago, IL, United States, ⁵Dartmouth College, Hanover, NH, United States, ⁶New York University Abu Dhabi, Abu Dhabi, United Arab Emirates

Synopsis

Motivation: This study seeks to overcome the challenges associated with characterizing vocal tract articulation during speech.

Goal(s): The primary objective of this study is to outline the technical configuration required to optimize real-time MRI with high temporal resolution while synchronizing it with audio recordings.

Approach: We employed a single slice FLASH sequence, an MRI-compatible optical microphone, and a signal generator. This setup enables precise synchronization of MRI image acquisition with audio recordings during speech production.

Results: Our findings showcase the practicality of our setup in studying Arabic speech articulation, both in letter pronunciation and the articulation of whole words, encompassing various dialects.

Impact: This research highlights the technical intricacies involved in integrating real-time MRI of the vocal tract with synchronized speech production, introducing an application previously unexplored in linguistic research to address challenging linguistic problems.

Background

Fully characterizing vocal tract articulation during speech, including the production of phonetic elements such as pharyngeal sounds, presents unique challenges [1], [2]. One primary challenge is achieving a frame rate high enough to capture the rapid and intricate movements within the vocal tract during speech. Real-time MRI holds considerable promise in this regard, allowing us to visualize the dynamic processes within the vocal tract. However, the dynamic nature of speech production and the need to precisely synchronize MRI scanner acquisition with voice recordings introduce technical intricacies. Accurate synchronization ensures that dynamic MRI images align with the corresponding voice recordings, so that speech sounds can be correlated with vocal tract movements. This work describes a technical configuration for high-frame-rate real-time MRI with audio synchronization that uses commercially available sequences and equipment commonly available in a physics lab, such as a signal generator.

Materials and Methods

All studies were conducted on a 3T Siemens Prisma scanner running X-numaris X31 software and equipped with a 64-channel head coil. To optimize real-time MRI acquisition, we compared the true-FISP sequence with traditional FLASH sequences in terms of image quality, artifacts, and temporal speed. Our results (Figure 1) supported the use of the FLASH sequence for a single midline sagittal image at 10 frames per second (fps) for 10 seconds with the following parameters: repetition time (TR) of 104.6 ms, echo time (TE) of 1.33 ms, flip angle of 10 degrees, slice thickness of 10 mm, spatial resolution of 0.9 × 0.9 × 10 mm, FOV of 230 mm, GRAPPA acceleration factor of 3, smoothing filter turned on, and with interpolation. Audio recordings within the MRI scanner were made using an Optoacoustics optical microphone, as described in Figure 2. To synchronize MRI image acquisition with speech production, we used an Agilent wave generator to trigger both the gradient echo sequence (with the external trigger option selected on the MRI console) and the optical microphone, which was positioned approximately 1-2 cm away from the subject's mouth. Custom Python code was used to temporally align the onset of the first MRI image with the onset of speech production and to save a video file of the MRI in sync with the spoken audio. A MATLAB toolkit [3] was then used to automatically detect the contours of the vocal tract in each MRI frame.
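The timing arithmetic behind this alignment can be illustrated with a minimal Python sketch. It is not the custom code used in the study. It assumes that one frame is acquired per TR and that audio and imaging both start at the shared trigger (t = 0). The function names and the 44.1 kHz audio sample rate are hypothetical choices for illustration. Under these assumptions, a TR of 104.6 ms corresponds to about 9.56 frames per second, so a 10-second run contains 95 complete frames.

```python
# Values taken from the acquisition described above.
TR_S = 0.1046            # repetition time per frame (104.6 ms)
SCAN_DURATION_S = 10.0   # length of one real-time run

# Assumption: the audio sample rate is not stated in the abstract.
AUDIO_RATE_HZ = 44100


def frame_onsets(tr_s, duration_s):
    """Onset time (s) of each complete frame, relative to the shared trigger."""
    n_frames = int(duration_s / tr_s)
    return [i * tr_s for i in range(n_frames)]


def audio_slice_for_frame(frame_index, tr_s, sample_rate_hz):
    """Half-open range of audio sample indices recorded during one frame."""
    start = round(frame_index * tr_s * sample_rate_hz)
    stop = round((frame_index + 1) * tr_s * sample_rate_hz)
    return start, stop


def frame_at_time(t_s, tr_s):
    """Index of the frame being acquired at audio time t_s (s)."""
    return int(t_s // tr_s)


onsets = frame_onsets(TR_S, SCAN_DURATION_S)
first_frame_samples = audio_slice_for_frame(0, TR_S, AUDIO_RATE_HZ)
frame_for_half_second = frame_at_time(0.5, TR_S)
```

With such a mapping, each MRI frame can be paired with the stretch of audio recorded while it was acquired, which is what a synchronized video export requires.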

Results and Conclusion

Figure 3 presents an example of synchronized audio and MRI acquisitions, showing a subject uttering three different Arabic syllables. The first row displays the audio recording, the second row shows the time-frequency analysis (i.e., spectrogram), and the third row shows two MRI timepoints during syllable pronunciation. This capability enables linguists to identify key tongue positions in various tasks. Figure 4 depicts two pairs of similar Arabic letters (ta and tta, and ka and kaf). Note the differences in tongue position and shape, made discernible by the synchronization of real-time MRI and audio recording. Figure 5 shows that speakers of different Arabic dialects pronouncing identical words (/ʕiʒʒa/ and /ħiʒʒa/) implement similar overall constriction in the pharynx and larynx for /ħ/ and /ʕ/ regardless of origin dialect. Importantly, constriction for the pharyngeal consonant does not appear to be isolated to the tongue root: retraction and constriction are also present in the larynx during the utterance of whole words in various Arabic dialects. In summary, this study outlines the technical intricacies involved in integrating real-time MRI of the vocal tract with synchronized speech production, introducing an application previously unexplored in linguistic research. We showcase the practicality of our synchronized real-time MRI and audio recording in addressing otherwise challenging linguistic problems.
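The time-frequency analysis mentioned above can be sketched as a short-time Fourier transform. The abstract does not state how the spectrograms were computed, so the following is only an illustrative pure-Python version with a Hann window and a direct DFT. The function name, frame length, hop size, and test tone are assumptions, and a practical analysis would normally use an FFT library instead.

```python
import cmath
import math


def spectrogram(samples, frame_len, hop):
    """Magnitude spectrogram: one list of frame_len // 2 + 1 values per frame."""
    # Periodic Hann window to reduce spectral leakage between bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed chunk at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz.
RATE_HZ = 8000
tone = [math.sin(2 * math.pi * 1000 * n / RATE_HZ) for n in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)

# Bin k corresponds to k * RATE_HZ / frame_len Hz, so 1 kHz falls in bin 8.
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
```

Each row of the result is one time slice. Plotting the rows against time alongside the MRI frame onsets gives a display comparable to the second row of Figure 3.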

Acknowledgements

The experiments described herein were conducted using the facilities of the NYUAD Brain Imaging Core Technology Platform.

References

[1] A. Toutios and S. Narayanan, “Advances in real-time magnetic resonance imaging of the vocal tract for speech science and technology research,” Physiol. Behav., vol. 176, no. 3, pp. 139–148, 2017.

[2] A. Niebergall et al., “Real-time MRI of speaking at a resolution of 33 ms: Undersampled radial FLASH with nonlinear inverse reconstruction,” Magn. Reson. Med., vol. 69, no. 2, pp. 477–485, 2013.

[3] M. Belyk, C. Carignan, and C. McGettigan, “An open-source toolbox for measuring vocal tract shape from real-time magnetic resonance images,” Behav. Res. Methods, 2023.

Figures

Figure 1: Real-time MRI Acquisition Optimization. The initial comparison involved TrueFISP and FLASH sequences, with a subsequent shift to optimizing the standard FLASH sequence.

Figure 2: The signal generator produces an external trigger using a sine wave. This trigger, in TTL format, is simultaneously routed to both the MRI scanner and the microphone control console. This synchronization ensures the simultaneous initiation of image acquisition and audio recording. The in-bore fiber-optic microphone is utilized for capturing audio recordings.

Figure 3: Data acquisition of audio and MRI scans of the pronunciation of Arabic consonants. First row (left to right): sound waves of the pronunciation of the Arabic syllables /aħa/, /oħo/, and /iħi/, with a focus on the long duration of the consonant /ħ/. Second row: spectrograms obtained from the first-row sound waves. Third row: dynamic T1-weighted MRI scans of the organs of the vocal system while a volunteer pronounces the syllables; left: initial position of the vocal system organs, right: steady-state position during the consonant pronunciation.

Figure 4: Tongue location for various Arabic letters.

Figure 5: Comparison of different implementations of two words, /ʕiʒʒa/ and /ħiʒʒa/, in MSA and dialect groups, indicating pharyngeal and laryngeal constrictions.

Proc. Intl. Soc. Mag. Reson. Med. 32 (2024)

DOI: https://doi.org/10.58530/2024/5167