The tongue is arguably the most important articulator enabling human speech production. Current tagged MRI methods for studying internal tongue motion evolved from CINE cardiac techniques that rely on multiple repetitions with perfect synchronization. However, speech production, unlike cardiac motion, exhibits great token, type, and individual variability due to its voluntary, highly context-sensitive, and information-encoding nature. In this work, we demonstrate tagged real-time MRI (RT-MRI) of speech production without requiring any repetitions or synchronization for data re-binning. We demonstrate capture of several important tongue deformation patterns and their relative timing.
Introduction
The tongue is a complex biomechanical system composed of numerous intrinsic and extrinsic muscles [1], forming remarkably complex shapes during speech. Tagged CINE-MRI has been employed to analyze the motion of the internal tongue [2]–[4]. Current tagged MRI methods rely on repetition with perfect synchronization and have enabled nuanced analysis of cardiac motion [5], [6] during sinus rhythm, which is highly repeatable and synchronized to the ECG. Speech production differs from cardiac motion in important ways; notably, there is substantial token and type variability due to its voluntary, highly context-sensitive, and information-encoding nature.
In this work, we demonstrate a tagging method for RT-MRI of speech production that requires neither synchronization nor repetition. We show that the proposed method can capture several unique motion patterns and their relative timing by measuring internal tongue deformation.
Method
Sequence: We used 1-3-3-1 SPAtial Modulation of Magnetization (SPAMM) tagging and rapid spiral GRE acquisition [7]. Figure 1 illustrates the pulse sequence with precise timing, as implemented within a real-time imaging platform (HeartVista, Inc., Los Altos, CA, USA). Tagging is applied as a brief interruption to a continuous real-time spiral acquisition. Tagging can be initiated manually by the operator, cued to the speech stimulus, or applied automatically at a fixed frequency. We used a standard 2D 1-3-3-1 SPAMM sequence with 1 cm tag spacing in both in-plane directions, applied within 5.66 ms. The imaging parameters were: FOV 20 cm, slice thickness 7 mm, readout duration 2.49 ms, TE/TR 0.71/5.58 ms, and 13 interleaves with bit-reversed view ordering. Tag persistence in the tongue depends on the longitudinal relaxation time (T1) of the tongue muscle and on the imaging flip angle [8]. Tag persistence was simulated and experimentally measured.
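The dependence of tag persistence on T1 and imaging flip angle can be illustrated with a minimal Bloch-recursion sketch. This is not the simulation used in this work; it assumes a spoiled GRE readout in steady state before tagging, an ideal tag that fully nulls the longitudinal magnetization, and a tongue T1 of 950 ms (an illustrative value, not a measured one). The function names are hypothetical, and the output is tag contrast in units of M0 rather than CNR, since the noise level is not modeled.

```python
import numpy as np

def tag_contrast(alpha_deg, t1_ms=950.0, tr_ms=5.58, n_tr=150):
    """Signal difference (units of M0) between untagged and tagged tissue
    at each of n_tr successive excitations after the tagging pulse."""
    alpha = np.deg2rad(alpha_deg)
    e1 = np.exp(-tr_ms / t1_ms)
    # Assumption: untagged tissue sits at the spoiled-GRE steady state.
    mz_untagged = (1.0 - e1) / (1.0 - e1 * np.cos(alpha))
    # Assumption: an ideal tag nulls Mz completely.
    mz_tagged = 0.0
    contrast = np.empty(n_tr)
    for n in range(n_tr):
        # Transverse signal difference produced by this excitation.
        contrast[n] = np.sin(alpha) * (mz_untagged - mz_tagged)
        # Excitation leaves Mz*cos(alpha); T1 recovery follows over one TR.
        mz_untagged = mz_untagged * np.cos(alpha) * e1 + (1.0 - e1)
        mz_tagged = mz_tagged * np.cos(alpha) * e1 + (1.0 - e1)
    return contrast

def tag_decay_time_ms(alpha_deg, t1_ms=950.0, tr_ms=5.58):
    """Effective 1/e decay time of the tag contrast under repeated excitation."""
    alpha = np.deg2rad(alpha_deg)
    return 1.0 / (1.0 / t1_ms - np.log(np.cos(alpha)) / tr_ms)

if __name__ == "__main__":
    for fa in (3.0, 5.0, 10.0):
        c = tag_contrast(fa)
        print(f"FA={fa:4.1f} deg  initial contrast={c[0]:.4f}  "
              f"decay time={tag_decay_time_ms(fa):.0f} ms")
```

Under these assumptions the contrast decays as (cos(alpha) * exp(-TR/T1))^n, so a larger flip angle gives more initial signal but erases the tags faster, which is the trade-off that motivates the flip angle selection.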
Reconstruction: Gridding reconstruction with view-sharing was performed on-the-fly during data acquisition. A sliding window of 5 TRs resulted in a temporal resolution of 36 frames/s. The end-to-end reconstruction latency was approximately 30 ms. This setup enables the operator to observe the deformation of the tag lines in real time, to verify that the subject completed the designed articulation task, and to check that the tagging was triggered at the intended time.
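The frame rate follows directly from the TR and the sliding-window step. The sketch below reproduces that arithmetic and lists which interleaves a view-shared frame would draw on. It assumes that a new frame is reconstructed every 5 TRs and that each frame combines the 13 most recent interleaves; the function names are hypothetical and the code is not the platform's reconstruction.

```python
TR_MS = 5.58          # repetition time from the Method section
N_INTERLEAVES = 13    # spiral interleaves per fully sampled image
STEP_TRS = 5          # assumption: one new frame every 5 TRs

def frame_rate_hz(tr_ms=TR_MS, step_trs=STEP_TRS):
    """Reconstructed frames per second for a given sliding-window step."""
    return 1000.0 / (tr_ms * step_trs)

def temporal_footprint_ms(tr_ms=TR_MS, n_interleaves=N_INTERLEAVES):
    """Time span of the data contributing to one view-shared frame."""
    return tr_ms * n_interleaves

def sliding_window_frames(n_acquired, n_interleaves=N_INTERLEAVES,
                          step_trs=STEP_TRS):
    """Return (first, last_exclusive) interleaf index ranges, one per frame."""
    return [(start, start + n_interleaves)
            for start in range(0, n_acquired - n_interleaves + 1, step_trs)]

if __name__ == "__main__":
    print(f"frame rate: {frame_rate_hz():.1f} frames/s")
    print(f"temporal footprint: {temporal_footprint_ms():.2f} ms")
    print(sliding_window_frames(23))
```

With these numbers the frame spacing is 27.9 ms, which rounds to the stated 36 frames/s, while each frame spans about 72.5 ms of acquired data under the view-sharing assumption.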
Speech Experiments: We scanned 2 volunteers (27/M and 27/F), both native American English speakers, on a Signa Excite HD 1.5T scanner with a custom eight-channel upper-airway coil [9]. The American English diphthongs /aɪ/, /ɔɪ/ and /aʊ/ were studied because they involve substantial tongue movement when gliding from the initial to the final vowel position, and the duration of these movements (~180 to 300 ms [10]) is fully covered by the current imaging window. Images were evaluated qualitatively by visual assessment.
Results
Parameter selection: Figure 2 shows the trade-off between CNR-based tag persistence and image SNR when choosing the excitation flip angle. The dashed lines in Figure 2(a) indicate the CNR-optimal flip angle, which delivers the longest time above threshold. The Ernst angle for imaging the tongue is 6.2°, as shown in Figure 2(b). Figure 3 shows in-vivo tag persistence measurements in the human tongue. The measured signal agreed with the simulation for all imaging flip angles. With flip angles of 3° and 5°, the CNR stayed at or above the threshold level for more than 650 ms, with the latter providing 35% higher image SNR. Imaging with a very small flip angle was sensitive to B1 inhomogeneity, as the signal dropped dramatically when the flip angle was unintentionally reduced. Based on these considerations, we used a flip angle of 5° with an imaging window of 650-800 ms and an ending CNR of 5-6.
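The Ernst angle quoted above can be related to TR and T1 through the standard expression cos(theta_E) = exp(-TR/T1). The sketch below evaluates it for the TR used here; the T1 of 950 ms is an assumption chosen for illustration, not a value reported in this work, and the function names are hypothetical.

```python
import math

def ernst_angle_deg(tr_ms, t1_ms):
    """Flip angle that maximizes spoiled-GRE steady-state signal."""
    return math.degrees(math.acos(math.exp(-tr_ms / t1_ms)))

def t1_from_ernst_angle_ms(tr_ms, ernst_deg):
    """Invert the Ernst relation: the T1 implied by a given Ernst angle."""
    return -tr_ms / math.log(math.cos(math.radians(ernst_deg)))

if __name__ == "__main__":
    tr_ms = 5.58          # TR from the Method section
    assumed_t1_ms = 950.0 # assumption: illustrative tongue-muscle T1
    print(f"Ernst angle: {ernst_angle_deg(tr_ms, assumed_t1_ms):.2f} deg")
    print(f"T1 implied by 6.2 deg: {t1_from_ernst_angle_ms(tr_ms, 6.2):.0f} ms")
```

Read in reverse, an Ernst angle of 6.2° at TR = 5.58 ms corresponds to a T1 of roughly 950 ms, so the chosen 5° flip angle sits slightly below the SNR optimum in exchange for longer tag persistence.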
Visualization of tongue deformation: Figure 4 reveals internal tongue movement during three American English diphthong articulation examples. Several deformation patterns were observed in the images. For example, we associate tongue tip curving/stretching with the bent grid lines (green). Shear is identified by square grids deforming into parallelograms (cyan). Compression is recognized through deformation into bi-concave rectangles (tongue body magenta, tongue root yellow). These deformations occurred over the course of the diphthong articulation. Figure 5 shows a representative animated GIF of the diphthong articulation.