Automatic vocal tract landmark localization from midsagittal MRI data
arXiv - CS - Sound Pub Date : 2019-07-18 , DOI: arxiv-1907.07951
Mohammad Eslami, Christiane Neuschaefer-Rube, Antoine Serrurier

The various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variations is crucial for understanding speech production, diagnosing speech disorders and planning therapy. Identifying key anatomical landmarks of these structures on medical images is a prerequisite for any quantitative analysis, and the rising amount of data generated in the field calls for an automatic solution. The challenge lies in the high inter- and intra-speaker variability, the mutual interaction between the articulators and the moderate quality of the images. This study addresses this issue for the first time and tackles it by means of deep learning. It proposes a dedicated network architecture named Flat-net, whose performance is evaluated and compared with eleven state-of-the-art methods from the literature. The dataset contains midsagittal anatomical Magnetic Resonance Images for 9 speakers sustaining 62 articulations, with 21 annotated anatomical landmarks per image. Results show that the Flat-net approach outperforms the former methods, leading to an overall Root Mean Square Error of 3.6 pixels/0.36 cm obtained in a leave-one-out procedure over the speakers. The implementation code is also shared publicly on GitHub.
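
The evaluation protocol described in the abstract (a leave-one-out procedure over the speakers, scored by Root Mean Square Error on landmark positions) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the exact error formula, so the sketch assumes the RMSE is the square root of the mean squared Euclidean distance between predicted and reference landmarks. The speaker IDs, function names and example coordinates are hypothetical. The pixel size of 0.1 cm is inferred from the reported 3.6 pixels/0.36 cm.

```python
import math

# Inferred from the abstract: 3.6 pixels correspond to 0.36 cm.
PIXEL_SIZE_CM = 0.1


def landmark_rmse(predicted, reference):
    """RMSE in pixels over (x, y) landmark pairs.

    Assumed definition: square root of the mean squared Euclidean
    distance between predicted and reference landmarks.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("landmark lists must be non-empty and equal in length")
    squared_distances = [
        (px - rx) ** 2 + (py - ry) ** 2
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))


def leave_one_speaker_out(speakers):
    """Yield (training_speakers, held_out_speaker), one fold per speaker."""
    for held_out in speakers:
        yield [s for s in speakers if s != held_out], held_out


# Hypothetical IDs for the 9 speakers mentioned in the abstract.
speakers = [f"S{i}" for i in range(1, 10)]
folds = list(leave_one_speaker_out(speakers))

# Toy example with two landmarks (made-up coordinates, in pixels).
predicted = [(3.0, 4.0), (0.0, 0.0)]
reference = [(0.0, 0.0), (0.0, 0.0)]
rmse_px = landmark_rmse(predicted, reference)
rmse_cm = rmse_px * PIXEL_SIZE_CM
```

Holding out one speaker at a time means every reported error comes from a speaker the model never saw during training, which is what makes the score a measure of generalization across speakers.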

Updated: 2020-02-04