Maojing Fu1, Jonghye Woo2, Marissa Barlaz3, Ryan Shosted3, Zhi-Pei Liang1, and Bradley Sutton4
1Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States, 2CAMIS (Center for Advanced Medical Imaging Sciences), Massachusetts General Hospital, Boston, MA, United States, 3Linguistics, University of Illinois at Urbana-Champaign, Urbana, IL, United States, 4Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
Synopsis
Dynamic speech MRI holds great promise for visualizing articulatory motion in the vocal tract. Recent work has greatly accelerated imaging speed, creating a need for mechanisms that aid interpretation of the resulting dynamic images, which contain large amounts of movement information. This work integrates a spatiotemporal atlas into a partially separable (PS) model-based imaging framework and uses the atlas as prior information to improve reconstruction quality. The method not only captures high-quality dynamics at 102 frames per second, but also enables quantitative characterization of articulatory variability through the residual component of the atlas-based sparsity constraint.
INTRODUCTION
Dynamic MRI has been recognized as a promising method for visualizing articulatory motion of speech in scientific research and clinical applications [1]. However, characterizing the gestural and acoustic properties of the vocal tract remains challenging because it requires: 1) reconstructing high-quality spatiotemporal images by incorporating strong prior knowledge; and 2) quantitatively interpreting the reconstructed images, which contain great motion variability. This work presents a novel method that meets both requirements simultaneously by integrating a spatiotemporal atlas [2] into a Partial Separability (PS) [3] model-based imaging framework. Through an atlas-driven group-sparsity constraint, the method achieves high-quality articulatory dynamics at an imaging speed of 102 frames per second (fps) and a spatial resolution of 2.2×2.2 mm². Moreover, it enables quantitative characterization of motion variability, relative to the general motion pattern across all subjects, through the spatial residual components.
METHODS
A key feature of the proposed method is the use of a spatiotemporal atlas, $$$I_a(\mathbf{r}, t)$$$, to capture the general pattern of articulatory motion across all subjects. As illustrated in Figure 1, a generic atlas previously created for statistical modeling of the vocal tract [2] was utilized for dynamic speech imaging. Specifically, this atlas was constructed in a common space by spatially transforming the initial reconstructions from multiple subjects using group-wise diffeomorphic registrations [4]. In our application, the generic atlas was further converted into a subject-specific atlas through Lipschitz-norm-based temporal warping [2] and diffeomorphic-mapping-based spatial warping [5]. The resulting atlas not only carries an objective description of the "expected" motion pattern for a specific subject, but also remains spatiotemporally registered with the target subject's articulatory motion.
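To make the temporal-alignment step concrete, the sketch below illustrates one simplified way to retime a generic atlas onto a subject's initial reconstruction. It substitutes plain dynamic time warping on per-frame intensity signatures for the Lipschitz-norm-based temporal warping of [2] and omits the diffeomorphic spatial warping [5] entirely; the function names and synthetic data are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def dtw_align(subject_sig, atlas_sig):
        """Match each subject frame to an atlas frame by dynamic time
        warping on 1-D per-frame signatures (e.g., mean intensity in a
        vocal-tract ROI). NOTE: a simplified stand-in for the paper's
        Lipschitz-norm-based temporal warping."""
        T_s, T_a = len(subject_sig), len(atlas_sig)
        cost = np.abs(subject_sig[:, None] - atlas_sig[None, :])
        D = np.full((T_s + 1, T_a + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T_s + 1):
            for j in range(1, T_a + 1):
                D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                                   D[i, j - 1],
                                                   D[i - 1, j - 1])
        # backtrack the optimal warping path
        i, j, path = T_s, T_a, []
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                       key=lambda s: D[s])
        match = np.zeros(T_s, dtype=int)
        for si, ai in path:      # earliest atlas match per subject frame wins
            match[si] = ai
        return match             # subject frame index -> atlas frame index

    # toy usage: same gesture, nonlinearly retimed
    t_s, t_a = np.linspace(0, 1, 120), np.linspace(0, 1, 100)
    subject_sig = np.sin(2 * np.pi * t_s)
    atlas_sig = np.sin(2 * np.pi * t_a ** 1.3)
    idx = dtw_align(subject_sig, atlas_sig)
    # subject_specific_atlas = atlas_frames[..., idx]  # atlas_frames: (Nx, Ny, T_a)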
The desired spatiotemporal image $$$I(\mathbf{r}, t)$$$ is expressed in terms of partially separable functions [3]:
$$I(\mathbf{r},t)=\sum_{l_{1}=1}^{L_{1}}\psi_{l_{1}}(\mathbf{r}_{1})\phi_{l_{1}}(t)+\sum_{l_{2}=1}^{L_{2}}\psi_{l_{2}}(\mathbf{r}_{2})\phi_{l_{2}}(t),$$
where $$$\mathbf{r}_{1}$$$ and $$$\mathbf{r}_{2}$$$ denote the spatial locations within and outside the articulatory movement region (determined from the atlas), $$$L_{1}$$$ and $$$L_{2}$$$ the model orders associated with $$$\mathbf{r}_{1}$$$ and $$$\mathbf{r}_{2}$$$, $$$\{\psi_{l_{1}}(\mathbf{r}_{1})\}_{l_{1}=1}^{L_{1}}$$$ and $$$\{\psi_{l_{2}}(\mathbf{r}_{2})\}_{l_{2}=1}^{L_{2}}$$$ the corresponding spatial coefficients, and $$$\{\phi_{l_{1}}(t)\}_{l_{1}=1}^{L_{1}}$$$ and $$$\{\phi_{l_{2}}(t)\}_{l_{2}=1}^{L_{2}}$$$ the associated temporal basis functions. $$$I(\mathbf{r}, t)$$$ is reconstructed from highly undersampled data $$$d(\mathbf{k},t)$$$ by jointly enforcing the regional low-rank constraint and a group-sparsity constraint driven by the spatiotemporal atlas $$$I_a(\mathbf{r},t)$$$ [4]. The reconstruction problem is formulated as:
$$\hat{\mathbf{\mathrm{I}}}=\mathrm{argmin}_{\mathbf{\mathrm{I}}}\,||\mathbf{\mathrm{d}}-\mathbf{\mathrm{E}}(\mathbf{\mathrm{I}})||_{2}^{2}+\lambda||\mathbf{\mathrm{I}}-\mathbf{\mathrm{I}}_a||_{1,2},$$
where $$$\mathbf{\mathrm{E}}(\cdot)$$$ is an encoding operator encompassing sparse sampling, regional low-rank modeling and parallel imaging, and $$$\lambda$$$ is a regularization parameter. An algorithm based on half-quadratic regularization was applied to solve this optimization problem [1].
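As a concrete illustration of the formulation above, the following minimal sketch solves the atlas-regularized problem with a proximal-gradient iteration in place of the authors' half-quadratic algorithm [1]. It assumes single-coil Cartesian sampling (so $$$\mathbf{\mathrm{E}}$$$ reduces to a binary sampling mask applied to a unitary 2D FFT), omits the regional low-rank modeling, and takes each voxel's temporal profile as a group of the $$$\ell_{1,2}$$$ penalty; all names and default values are illustrative.

    import numpy as np

    def reconstruct(d, mask, I_a, lam=0.05, step=0.5, n_iter=100):
        """Proximal-gradient sketch of
           min_I ||d - mask*F(I)||_2^2 + lam*||I - I_a||_{1,2}.
        I, I_a: (Nx, Ny, T) complex images; d, mask: (Nx, Ny, T) k-t space."""
        I = I_a.copy()
        for _ in range(n_iter):
            # gradient (up to a factor of 2): F^H mask (mask F I - d)
            resid_k = mask * np.fft.fft2(I, axes=(0, 1), norm="ortho") - d
            grad = np.fft.ifft2(mask * resid_k, axes=(0, 1), norm="ortho")
            Z = I - step * grad
            # prox of lam*||.||_{1,2}: group soft-threshold the atlas
            # residual, one group per voxel's temporal profile
            R = Z - I_a
            norms = np.linalg.norm(R, axis=-1, keepdims=True)
            R *= np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
            I = I_a + R
        return I

Initializing at the atlas and shrinking the residual toward it mirrors the role of $$$I_a$$$ as prior information: where the data are uninformative, the reconstruction falls back to the atlas, and only data-supported deviations survive as residual components.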
To demonstrate the effectiveness of the proposed method, experiments were performed following a PS model-based acquisition strategy [1] to sparsely sample $$$(\mathbf{k},t)$$$-space. A FLASH sequence (TR = 9.78 ms) integrating a cone-trajectory navigator acquisition (TE = 0.85 ms) and a Cartesian imaging acquisition (TE = 2.30 ms) was implemented on a Siemens Trio 3T scanner to acquire data over a 260×260 mm² FOV with a 12-channel head receiver coil. During data acquisition, three volunteer subjects produced repetitive /loo/-/lee/-/la/-/za/-/na/-/za/ sounds at their natural speaking rates, under a protocol approved by the local IRB. Initial reconstructions from all subjects were input into the atlas formation procedure to spatiotemporally align articulatory motion. The final reconstructions had a 128×128 matrix size, a spatial resolution of 2.2×2.2 mm², and a nominal frame rate of 102 fps covering the upper vocal tract.
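In PS model-based acquisitions of this kind [1,3], the rapidly repeated navigator readouts are commonly used to estimate the temporal basis functions $$$\phi_{l}(t)$$$ via an SVD of their Casorati matrix, after which the sparsely sampled imaging data determine the spatial coefficients. The sketch below shows that basis-estimation step under those assumptions, with synthetic data and illustrative names.

    import numpy as np

    def estimate_temporal_basis(nav, L):
        """Estimate an L-dimensional temporal subspace from navigator data.
        nav: (n_ksamples, n_frames) complex Casorati matrix whose rows index
        repeated k-space samples and whose columns index time."""
        _, _, Vh = np.linalg.svd(nav, full_matrices=False)
        return Vh[:L, :]        # rows approximate the basis functions phi_l(t)

    # toy usage: synthetic rank-3 dynamics plus noise
    rng = np.random.default_rng(0)
    nav = (rng.standard_normal((64, 3)) @ rng.standard_normal((3, 500))
           + 0.01 * rng.standard_normal((64, 500))).astype(complex)
    phi = estimate_temporal_basis(nav, L=3)   # shape (3, 500)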
RESULTS
Figure 2a compares the reconstruction, the subject-specific atlas and the associated residual component. As shown, the atlas captures the general articulatory motion, while the detailed structural differences at the tongue tip and velum (indicated with arrows) are picked up as residual components by the proposed sparsity constraint. Figures 2b and 2c compare the associated temporal profiles: the reconstruction captures richer spatiotemporal dynamics and sharper temporal transitions than the atlas. Figure 3 shows the envelopes of tongue contours within $$$\mathbf{r}_{1}$$$ (indicated in green) for the reconstruction and the atlas over all /a/ sounds in the production of the /loo/-/lee/-/la/-/za/-/na/-/za/ phrases. The atlas tongue envelopes (pink) are approximately spatiotemporally aligned with the reconstruction envelopes (aqua), but show reduced motion variance due to the "averaging" effect of atlas creation [2].
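One simple way to turn the residual component into a quantitative variability measure, offered here as our own illustration rather than the authors' metric, is to summarize the reconstruction-minus-atlas residual per voxel over time:

    import numpy as np

    def variability_map(I, I_a, region_mask):
        """RMS over time of the residual I - I_a within the articulatory
        region r1; larger values mark where the subject deviates most
        from the general motion pattern. I, I_a: (Nx, Ny, T);
        region_mask: (Nx, Ny) binary mask of r1."""
        R = I - I_a                                   # residual component
        rms = np.sqrt(np.mean(np.abs(R) ** 2, axis=-1))
        return rms * region_mask                      # (Nx, Ny) map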
CONCLUSION
This work presents a novel dynamic speech MRI method that captures articulatory movements with improved spatiotemporal dynamics and enhanced interpretability of motion. This is achieved by utilizing a spatiotemporal atlas to simultaneously provide strong prior information and promote spatiotemporal sparsity in regional low-rank model-based reconstructions. The proposed method not only allows high-quality reconstruction of articulatory motion at a frame rate of 102 fps and a spatial resolution of 2.2×2.2 mm², but also provides a platform to quantitatively characterize the target subject's articulatory motion with respect to the general motion pattern across all subjects.
Acknowledgements
This work was supported by grants NIH 1R03DC009676-01A1 and NIH/NIDCD R00DC012575, and by a dissertation travel grant from the University of Illinois at Urbana-Champaign.
References
[1] Fu, M., Zhao, B., Carignan, C., Shosted, R. K., Perry, J. L., Kuehn, D. P. and Sutton, B. P. "High-resolution dynamic speech imaging with joint low-rank and sparsity constraints", Magn Reson Med, 1820-1832, 2015.
[2] Woo, J., Lee, J., Murano, E., Xing, F., Meena, A., Stone, M. and Prince, J. "A high-resolution atlas and statistical model of the vocal tract from structural MRI", Comput Methods Biomech Biomed Eng Imaging Vis, 1-14, 2014.
[3] Liang, Z.-P. "Spatiotemporal imaging with partially separable functions", Proceedings of the Annual Conference of IEEE Engineering in Medicine and Biology Society, 181-182, 2007.
[4] Woo, J., Stone, M. and Prince, J. "Multimodal registration via mutual information incorporating geometric and spatial context", IEEE Trans Image Process, 757-769, 2015.
[5] Beg, M. F., Miller, M. I., Trouvé, A. and Younes, L. "Computing large deformation metric mappings via geodesic flows of diffeomorphisms", Int J Comput Vis, 139-157, 2005.