Maojing Fu1, Jonghye Woo2, Marissa Barlaz3, Ryan Shosted3, Zhi-Pei Liang1, and Bradley Sutton4
1Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States, 2CAMIS (Center for Advanced Medical Imaging Sciences), Massachusetts General Hospital, Boston, MA, United States, 3Linguistics, University of Illinois at Urbana-Champaign, Urbana, IL, United States, 4Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
Synopsis
Dynamic speech MRI holds great promise for visualizing articulatory motion in the vocal tract. Recent work has greatly accelerated imaging speed, creating a need for mechanisms that aid interpretation of the resulting dynamic images, which contain large amounts of movement information. This work integrates a spatiotemporal atlas into a partially separable (PS) model-based imaging framework and uses the atlas as prior information to improve reconstruction quality. The method not only captures high-quality dynamics at 102 frames per second, but also enables quantitative characterization of articulatory variability through the residual component of the atlas-based sparsity constraint.
INTRODUCTION
Dynamic MRI has been recognized as a promising method for visualizing articulatory motion of speech in scientific research and clinical applications [1]. However, characterizing the gestural and acoustic properties of the vocal tract remains challenging because it requires: 1) reconstructing high-quality spatiotemporal images by incorporating strong prior knowledge; and 2) quantitatively interpreting the reconstructed images, which contain great motion variability. This work presents a novel method that meets both requirements simultaneously by integrating a spatiotemporal atlas [2] into a Partial Separability (PS) [3] model-based imaging framework. Through an atlas-driven group-sparsity constraint, the method achieves high-quality articulatory dynamics at an imaging speed of 102 frames per second (fps) and a spatial resolution of 2.2×2.2 mm². Moreover, it enables quantitative characterization of motion variability, relative to the general motion pattern across all subjects, through the spatial residual components.
METHODS
A key feature of the proposed method is the use of a spatiotemporal atlas, $$$I_a(\mathbf{r}, t)$$$, to capture the general pattern of articulatory motion across all subjects. As illustrated in Figure 1, a generic atlas previously created for statistical modeling of the vocal tract [2] was utilized for dynamic speech imaging. Specifically, this atlas was constructed in a common space by spatially transforming the initial reconstructions from multiple subjects using group-wise diffeomorphic registrations [4]. In our application, the generic atlas was further converted into a subject-specific atlas through Lipschitz-norm-based temporal warping [2] and diffeomorphic-mapping-based spatial warping [5]. The resulting atlas not only carries an objective description of the "expected" motion pattern for a specific subject, but also remains spatiotemporally registered with the target subject's articulatory motion.
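To make the temporal-alignment step concrete, the sketch below illustrates one simplified way to retime a generic atlas onto a subject's initial reconstruction. It substitutes plain dynamic time warping on per-frame intensity signatures for the Lipschitz-norm-based temporal warping of [2] and omits the diffeomorphic spatial warping [5] entirely; the function names and synthetic data are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def dtw_align(subject_sig, atlas_sig):
        """Match each subject frame to an atlas frame by dynamic time
        warping on 1-D per-frame signatures (e.g., mean intensity in a
        vocal-tract ROI). NOTE: a simplified stand-in for the paper's
        Lipschitz-norm-based temporal warping."""
        T_s, T_a = len(subject_sig), len(atlas_sig)
        cost = np.abs(subject_sig[:, None] - atlas_sig[None, :])
        D = np.full((T_s + 1, T_a + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T_s + 1):
            for j in range(1, T_a + 1):
                D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                                   D[i, j - 1],
                                                   D[i - 1, j - 1])
        # backtrack the optimal warping path
        i, j, path = T_s, T_a, []
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                       key=lambda s: D[s])
        match = np.zeros(T_s, dtype=int)
        for si, ai in path:      # earliest atlas match per subject frame wins
            match[si] = ai
        return match             # subject frame index -> atlas frame index

    # toy usage: same gesture, nonlinearly retimed
    t_s, t_a = np.linspace(0, 1, 120), np.linspace(0, 1, 100)
    subject_sig = np.sin(2 * np.pi * t_s)
    atlas_sig = np.sin(2 * np.pi * t_a ** 1.3)
    idx = dtw_align(subject_sig, atlas_sig)
    # subject_specific_atlas = atlas_frames[..., idx]  # atlas_frames: (Nx, Ny, T_a)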
The desired spatiotemporal image $$$I(\mathbf{r}, t)$$$ is expressed in terms of partially separable functions [3]:
$$I(\mathbf{r},t)=\sum_{l_{1}=1}^{L_{1}}\psi_{l_{1}}(\mathbf{r}_{1})\phi_{l_{1}}(t)+\sum_{l_{2}=1}^{L_{2}}\psi_{l_{2}}(\mathbf{r}_{2})\phi_{l_{2}}(t),$$
where $$$\mathbf{r}_{1}$$$ and $$$\mathbf{r}_{2}$$$ denote the spatial locations within and outside the articulatory movement region (determined from the atlas), $$$L_{1}$$$ and $$$L_{2}$$$ the model orders associated with $$$\mathbf{r}_{1}$$$ and $$$\mathbf{r}_{2}$$$, $$$\{\psi_{l_{1}}(\mathbf{r}_{1})\}_{l_{1}=1}^{L_{1}}$$$ and $$$\{\psi_{l_{2}}(\mathbf{r}_{2})\}_{l_{2}=1}^{L_{2}}$$$ the corresponding spatial coefficients, and $$$\{\phi_{l_{1}}(t)\}_{l_{1}=1}^{L_{1}}$$$ and $$$\{\phi_{l_{2}}(t)\}_{l_{2}=1}^{L_{2}}$$$ the associated temporal basis functions. $$$I(\mathbf{r}, t)$$$ is reconstructed from highly undersampled data $$$d(\mathbf{k},t)$$$ by jointly enforcing the regional low-rank constraint and a group-sparsity constraint driven by the spatiotemporal atlas $$$I_a(\mathbf{r},t)$$$ [4]. The reconstruction problem is formulated as:
$$\hat{\mathbf{\mathrm{I}}}=\mathrm{argmin}_{\mathbf{\mathrm{I}}}\,||\mathbf{\mathrm{d}}-\mathbf{\mathrm{E}}(\mathbf{\mathrm{I}})||_{2}^{2}+\lambda||\mathbf{\mathrm{I}}-\mathbf{\mathrm{I}}_a||_{1,2},$$
where $$$\mathbf{\mathrm{E}}(\cdot)$$$ is an encoding operator encompassing sparse sampling, regional low-rank modeling and parallel imaging, and $$$\lambda$$$ is a regularization parameter. An algorithm based on half-quadratic regularization was applied to solve this optimization problem [1].
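As a concrete illustration of the formulation above, the following minimal sketch solves the atlas-regularized problem with a proximal-gradient iteration in place of the authors' half-quadratic algorithm [1]. It assumes single-coil Cartesian sampling (so $$$\mathbf{\mathrm{E}}$$$ reduces to a binary sampling mask applied to a unitary 2D FFT), omits the regional low-rank modeling, and takes each voxel's temporal profile as a group of the $$$\ell_{1,2}$$$ penalty; all names and default values are illustrative.

    import numpy as np

    def reconstruct(d, mask, I_a, lam=0.05, step=0.5, n_iter=100):
        """Proximal-gradient sketch of
           min_I ||d - mask*F(I)||_2^2 + lam*||I - I_a||_{1,2}.
        I, I_a: (Nx, Ny, T) complex images; d, mask: (Nx, Ny, T) k-t space."""
        I = I_a.copy()
        for _ in range(n_iter):
            # gradient (up to a factor of 2): F^H mask (mask F I - d)
            resid_k = mask * np.fft.fft2(I, axes=(0, 1), norm="ortho") - d
            grad = np.fft.ifft2(mask * resid_k, axes=(0, 1), norm="ortho")
            Z = I - step * grad
            # prox of lam*||.||_{1,2}: group soft-threshold the atlas
            # residual, one group per voxel's temporal profile
            R = Z - I_a
            norms = np.linalg.norm(R, axis=-1, keepdims=True)
            R *= np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
            I = I_a + R
        return I

Initializing at the atlas and shrinking the residual toward it mirrors the role of $$$I_a$$$ as prior information: where the data are uninformative, the reconstruction falls back to the atlas, and only data-supported deviations survive as residual components.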
To demonstrate the effectiveness of the proposed method, experiments were performed following a PS model-based acquisition strategy [1] to sparsely sample $$$(\mathbf{k},t)$$$-space. A FLASH sequence (TR = 9.78 ms) integrating a cone-trajectory navigator acquisition (TE = 0.85 ms) and a Cartesian imaging acquisition (TE = 2.30 ms) was implemented on a Siemens Trio 3T scanner to acquire data over a 260×260 mm² FOV with a 12-channel head receiver coil. During data acquisition, three volunteer subjects produced repetitive /loo/-/lee/-/la/-/za/-/na/-/za/ sounds at their natural speaking rates, under a protocol approved by the local IRB. Initial reconstructions from all subjects were input into the atlas formation procedure to spatiotemporally align articulatory motion. The final reconstructions had a 128×128 matrix size, a spatial resolution of 2.2×2.2 mm², and a nominal frame rate of 102 fps covering the upper vocal tract.
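In PS model-based acquisitions of this kind [1,3], the rapidly repeated navigator readouts are commonly used to estimate the temporal basis functions $$$\phi_{l}(t)$$$ via an SVD of their Casorati matrix, after which the sparsely sampled imaging data determine the spatial coefficients. The sketch below shows that basis-estimation step under those assumptions, with synthetic data and illustrative names.

    import numpy as np

    def estimate_temporal_basis(nav, L):
        """Estimate an L-dimensional temporal subspace from navigator data.
        nav: (n_ksamples, n_frames) complex Casorati matrix whose rows index
        repeated k-space samples and whose columns index time."""
        _, _, Vh = np.linalg.svd(nav, full_matrices=False)
        return Vh[:L, :]        # rows approximate the basis functions phi_l(t)

    # toy usage: synthetic rank-3 dynamics plus noise
    rng = np.random.default_rng(0)
    nav = (rng.standard_normal((64, 3)) @ rng.standard_normal((3, 500))
           + 0.01 * rng.standard_normal((64, 500))).astype(complex)
    phi = estimate_temporal_basis(nav, L=3)   # shape (3, 500)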
RESULTS
Figure 2a compares the reconstruction, the subject-specific atlas and the associated residual component. As shown, the atlas captures the general articulatory motion, while the detailed structural differences at the tongue tip and velum (indicated with arrows) are picked up as residual components by the proposed sparsity constraint. Figures 2b and 2c compare the associated temporal profiles: the reconstruction captures richer spatiotemporal dynamics and sharper temporal transitions than the atlas. Figure 3 shows the envelopes of tongue contours within $$$\mathbf{r}_{1}$$$ (indicated in green) for the reconstruction and the atlas over all /a/ sounds in the production of the /loo/-/lee/-/la/-/za/-/na/-/za/ phrases. The atlas tongue envelopes (pink) are approximately spatiotemporally aligned with the reconstruction envelopes (aqua), but show reduced motion variance due to the "averaging" effect of atlas creation [2].
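One simple way to turn the residual component into a quantitative variability measure, offered here as our own illustration rather than the authors' metric, is to summarize the reconstruction-minus-atlas residual per voxel over time:

    import numpy as np

    def variability_map(I, I_a, region_mask):
        """RMS over time of the residual I - I_a within the articulatory
        region r1; larger values mark where the subject deviates most
        from the general motion pattern. I, I_a: (Nx, Ny, T);
        region_mask: (Nx, Ny) binary mask of r1."""
        R = I - I_a                                   # residual component
        rms = np.sqrt(np.mean(np.abs(R) ** 2, axis=-1))
        return rms * region_mask                      # (Nx, Ny) map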
CONCLUSION
This work presents a novel dynamic speech MRI method that captures articulatory movements with improved spatiotemporal dynamics and enhanced interpretability of motion. This is achieved by utilizing a spatiotemporal atlas to simultaneously provide strong prior information and promote spatiotemporal sparsity in regional low-rank model-based reconstructions. The proposed method not only allows high-quality reconstruction of articulatory motion at a frame rate of 102 fps and a spatial resolution of 2.2×2.2 mm², but also provides a platform to quantitatively characterize the target subject's articulatory motion with respect to the general motion pattern across all subjects.
Acknowledgements
This work was supported by grants NIH 1R03DC009676-01A1 and NIH/NIDCD R00DC012575, and by a dissertation travel grant from the University of Illinois at Urbana-Champaign.
References
[1] Fu, M., Zhao, B., Carignan, C., Shosted, R. K., Perry, J. L., Kuehn, D. P. and Sutton, B. P. "High-resolution dynamic speech imaging with joint low-rank and sparsity constraints", Magn Reson Med, 1820-1832, 2015.
[2] Woo, J., Lee, J., Murano, E., Xing, F., Meena, A., Stone, M. and Prince, J. "A high-resolution atlas and statistical model of the vocal tract from structural MRI", Comput Methods Biomech Biomed Eng Imaging Vis, 1-14, 2014.
[3] Liang, Z.-P. "Spatiotemporal imaging with partially separable functions", Proceedings of the Annual Conference of IEEE Engineering in Medicine and Biology Society, 181-182, 2007.
[4] Woo, J., Stone, M. and Prince, J. "Multimodal registration via mutual information incorporating geometric and spatial context", IEEE Trans Image Process, 757-769, 2015.
[5] Beg, M. F., Miller, M. I., Trouvé, A. and Younes, L. "Computing large deformation metric mappings via geodesic flows of diffeomorphisms", Int J Comput Vis, 139-157, 2005.