Methods

Volume 179, 1 July 2020, Pages 26-36

Encoder-decoder CNN models for automatic tracking of tongue contours in real-time ultrasound data

https://doi.org/10.1016/j.ymeth.2020.05.011

Highlights

  • A combination of standard and dilated convolutions provides sharper segmentation results.

  • The tongue contour can be automatically tracked in real time from ultrasound data using deep learning methods.

  • BowNet models can delineate a region of interest from input data using the capability of dilated convolution.

  • A large number of learnable parameters in a CNN model is not always an indication of better generalization.

  • Ground truth labels of ultrasound data should be in binary instead of gray-scale format.

Abstract

One application of medical ultrasound imaging is to visualize and characterize human tongue shape and motion in real time to study healthy or impaired speech production. Due to the low contrast and noisy nature of ultrasound images, users need knowledge of tongue structure and of ultrasound data interpretation to recognize tongue locations and gestures easily. Moreover, quantitative analysis of tongue motion requires the tongue contour to be extracted, tracked, and visualized rather than the whole tongue region. Manual tongue contour extraction is a cumbersome, subjective, and error-prone task. Furthermore, it is not a feasible solution for real-time applications, where the tongue contour moves rapidly with nuanced gestures. This paper presents two new deep neural networks (named BowNet models) that benefit from the global prediction ability of encoding-decoding fully convolutional neural networks and the full-resolution extraction capability of dilated convolutions. Both qualitative and quantitative studies on datasets from two ultrasound machines demonstrated the outstanding performance of the proposed deep learning models in terms of speed and robustness. Experimental results also revealed a significant improvement in the accuracy of prediction maps due to the better exploration and exploitation ability of the proposed network models.

Introduction

Studying and exploiting the dynamic nature of speech data from ultrasound tongue image sequences can provide valuable information, and it is of great interest in many recent studies [1]. Ultrasound imaging has been utilized for tongue motion analysis in many applications, such as treatment of speech sound disorders [2], comparing healthy and impaired speech production [1], second language training and rehabilitation [3], silent speech interfaces [4], research on food swallowing [5], and 3D tongue modeling [6], to name a few. In ultrasound data, the tongue contour region appears as a thick, bright white curve. However, due to the lack of a hard structure as a reference, it is not easy to perceive the tongue position [7] in real time, and the interpretation of ultrasound data is challenging for non-expert users. Tongue contours can be tracked automatically to alleviate this difficulty for real-time applications. Therefore, it is crucial to have a fully automatic technique for real-time ultrasound tongue contour tracking [8].

Various methods have been proposed for the problem of automatic tongue tracking in recent years, such as active contour models or snakes [9], [10], [11], [12], [13], graph-based techniques [14], and machine learning-based methods [15], [16], [17], [18], [19], [20]. Many recent tongue contour tracking methods are reviewed in a study by Laporte et al. [1]; most of them require monitoring and manual manipulation while the tongue tracking process is in progress. Initialization is necessary for almost all conventional methods: users must manually label at least one frame, with the restriction of drawing close to the tongue region [1]. For instance, to use standard software in the literature such as AutoTrace [21], EdgeTrak [10], or TongueTrack [18], users must annotate several points on at least one frame.

Convolutional Neural Networks (CNNs) have been the method of choice for many computer vision applications in recent years [22]. They have shown outstanding performance in many image classification tasks as well as in object detection, recognition, and tracking [23]. A dense image classification task can be considered a segmentation problem when the goal is to categorize every single pixel with a discrete or continuous label [24]. By adapting and modifying several well-known deep classification network models [25], [26], [27], the Fully Convolutional Network (FCN) was successfully exploited for the image segmentation problem in [28]. Instead of a fully connected classifier in the last layers, FCN uses convolutional layers throughout, providing a prediction map of arbitrary size with fewer parameters. Since then, the performance of FCN models has been improved by proposing different operations in the decoder block, such as the deconvolution operator [29], concatenating encoder features with decoder features [30], [31], indexed un-pooling [32], and adding post-processing stages such as CRFs [33].
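
As a rough illustration of the fully convolutional idea described above, the sketch below replaces a dense classifier with a 1 × 1 convolution so the network emits a spatial prediction map whose size follows the input. It uses PyTorch, which is an assumption (the paper does not name its framework), and TinyFCN is a hypothetical name, not a model from the paper.

```python
# Minimal sketch of a fully convolutional head (assumed PyTorch, illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Small encoder: each pooling halves the spatial resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # A 1x1 convolution replaces the fully connected classifier.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.classifier(self.encoder(x))          # coarse per-pixel scores
        # Upsample back to the input resolution to obtain a dense prediction map.
        return F.interpolate(h, size=x.shape[-2:], mode="bilinear", align_corners=False)

# Any input size works; the output map matches it spatially.
scores = TinyFCN()(torch.randn(1, 1, 128, 128))       # -> shape (1, 2, 128, 128)
```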

Many of these innovations significantly improved the accuracy of segmentation results, usually at the expense of higher computational cost due to the large number of network parameters. Furthermore, the consecutive pooling layers used in almost all of those methods to enlarge the receptive field and provide localization invariance [34] cause a considerable reduction of feature resolution in the decoding stage [35]. To alleviate this issue, dilated (sometimes called atrous) convolution has been employed recently [24], [34], [35], [36], [33]. Dilated convolutions [36], [37], [38] help the network predict instances without losing receptive field, without the need for a fully convolutional final layer, and with fewer learnable parameters than previous FCN methods. To extract tongue contours automatically, without any manipulation, over a large number of image frames, modified versions of UNet [30] have recently been used successfully for ultrasound tongue extraction [8], [39]. The drawback of those methods was that prediction accuracy was not satisfactory in real time despite a considerable number of network parameters. Furthermore, the speed of those models was still on the edge of real-time performance. Therefore, there is a trade-off between speed [8], [40], generalization [41], and accuracy [30], and a gap remained in the field for a general model that preserves all of these aspects.
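
The snippet below is a minimal illustration (again assuming PyTorch; it is not code from the paper) of why dilated convolution helps: a 3 × 3 kernel with dilation 2 covers a 5 × 5 neighborhood with the same nine weights per channel pair, and the feature-map resolution is preserved because no pooling is involved.

```python
# Standard vs. dilated 3x3 convolution: same resolution, same parameter count,
# but the dilated kernel sees a wider context (illustrative sketch).
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

standard = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
dilated  = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 16, 64, 64]) -- resolution preserved
print(dilated(x).shape)   # torch.Size([1, 16, 64, 64]) -- same size, wider receptive field
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))  # identical parameter counts
```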

In this work, we develop two new deep convolutional neural models (named BowNet and wBowNet) for the problem of ultrasound tongue extraction. For the first time, dilated convolution is combined with a standard deep, dense classification method in two new network architectures to extract both local and global context from each frame at the same time in an end-to-end fashion. BowNet is a parallel combination of a dense classification architecture inspired by VGG16 [25] and UNet [30] with a segmentation model inspired by the DeepLab network, where dilated convolution layers are used without pooling layers [35]. wBowNet is the counterpart of BowNet, where the classification and segmentation networks are interconnected to support an even higher resolution in the prediction outcomes. We evaluated our proposed architectures on two different datasets of ultrasound tongue images, and the experimental results demonstrate that our fully automatic models achieve accurate predictions with fewer learnable parameters and real-time performance. It is noteworthy that Google recently proposed an advanced version of the BowNet model for semantic segmentation applications [42].
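
For readers unfamiliar with the two-branch idea, the following sketch shows one way such a parallel combination could be wired up. It is a hypothetical schematic only, assuming PyTorch; the layer counts, channel widths, and concatenation-based fusion are illustrative and do not reproduce the actual BowNet or wBowNet architectures.

```python
# Schematic two-branch segmentation network: an encoder-decoder branch in
# parallel with a dilated-convolution branch, fused into one prediction map.
# Illustrative only; NOT the authors' BowNet/wBowNet design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchSegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Branch 1: small encoder-decoder (global context via pooling).
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(inplace=True),
        )
        # Branch 2: dilated convolutions, no pooling (full-resolution features).
        self.dilated = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # Fuse both branches and predict a per-pixel map.
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        a = self.dec(self.enc(x))                      # encoder-decoder features
        b = self.dilated(x)                            # dilated, full-resolution features
        a = F.interpolate(a, size=b.shape[-2:], mode="bilinear", align_corners=False)
        return self.head(torch.cat([a, b], dim=1))     # fused prediction map

out = TwoBranchSegNet()(torch.randn(1, 1, 128, 128))   # -> shape (1, 2, 128, 128)
```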

Section snippets

Network architectures

The consecutive combination of convolutional and pooling layers in the encoder of a CNN model results in a significantly lower spatial resolution of the output feature maps, typically by a factor of 32 for current deep learning architectures, which is not the desired resolution for semantic segmentation purposes [35]. Several ideas have been proposed to reconstruct an input-size segmented output from the coarse feature map of the encoding part. Interpolation (up-sampling) could be the
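
The resolution loss described above can be made concrete with a short sketch (assumed PyTorch, not code from the paper): five consecutive stride-2 poolings shrink a feature map by a factor of 2^5 = 32, and bilinear interpolation is the simplest way to bring a coarse map back to the input size, at the cost of lost detail.

```python
# Illustration of the factor-32 resolution loss and interpolation-based recovery.
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 256, 256)
for _ in range(5):                     # five consecutive 2x2 poolings
    x = F.max_pool2d(x, kernel_size=2)
print(x.shape)                         # torch.Size([1, 8, 8, 8]) -> 256 / 32

restored = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
print(restored.shape)                  # torch.Size([1, 8, 256, 256]), but fine detail is gone
```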

System setup

In order to train our proposed network models, we conducted an extensive random-search hyperparameter tuning [46] to find the optimum values of parameters such as filter size (doubled for each consecutive layer, starting from 16, 32, or 64), kernel size (3 × 3 and 5 × 5), dilation factor (doubled for each consecutive layer, starting from 1 or 2), the number of global iterations (iteration and epoch size), batch size (10, 20, and 50, depending on GPU memory), augmentation parameters (online and offline),
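
A minimal sketch of such a random search is given below, sampling only from the candidate values listed above; the helper names and the train_and_evaluate callback are hypothetical placeholders rather than the authors' actual tuning code.

```python
# Random-search hyperparameter tuning over the candidate values listed in the text
# (illustrative sketch; helper names are hypothetical).
import random

SEARCH_SPACE = {
    "base_filters": [16, 32, 64],        # doubled in each consecutive layer
    "kernel_size": [(3, 3), (5, 5)],
    "base_dilation": [1, 2],             # doubled in each consecutive layer
    "batch_size": [10, 20, 50],          # bounded by GPU memory
    "augmentation": ["online", "offline"],
}

def sample_config(space):
    """Draw one random configuration from the search space."""
    return {name: random.choice(values) for name, values in space.items()}

def random_search(n_trials, train_and_evaluate):
    """Try n_trials random configurations and keep the best-scoring one."""
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(SEARCH_SPACE)
        score = train_and_evaluate(config)   # user-supplied: trains a model, returns a validation score
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```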

Conclusion and discussion

In this paper, we have proposed and presented two new deep convolutional neural networks, called BowNet and wBowNet, for tongue contour extraction, benefiting from dense image classification and dilated convolution for globally and locally accurate segmentation results. Extensive experimental studies on several types of datasets, using online and offline augmentation, with comparison results demonstrated the outstanding performance of the two proposed deep learning techniques. From the

CRediT authorship contribution statement

M. Hamed Mozaffari: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Validation, Writing - review & editing. Won-Sook Lee: Supervision.

References (55)

  • C. Laporte et al., Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech, Med. Image Anal. (2018)
  • B. Denby et al., Silent speech interfaces, Speech Commun. (2010)
  • A. Eshky et al., UltraSuite: A repository of ultrasound and acoustic data from child speech therapy sessions, Interspeech (2018)
  • B. Gick, B.M. Bernhardt, P. Bacsfalvi, I. Wilson, Ultrasound imaging applications in second language acquisition, in:...
  • M. Ohkubo et al., Tongue shape dynamics in swallowing using sagittal ultrasound, Dysphagia (2018)
  • S. Chen, Y. Zheng, C. Wu, G. Sheng, P. Roussel, B. Denby, Direct, near real time animation of a 3D tongue model using...
  • M. Stone, A guide to analysing tongue motion from ultrasound images, Clin. Linguist. Phon. (2005)
  • M.H. Mozaffari, S. Guan, S. Wen, N. Wang, W.-S. Lee, Guided learning of pronunciation by visualizing tongue...
  • K. Xu et al., Development of a 3D tongue motion visualization platform based on ultrasound image sequences, arXiv preprint arXiv:1605.06106 (2016)
  • M. Li et al., Automatic contour tracking in ultrasound images, Clin. Linguist. Phon. (2005)
  • S. Ghrenassia, L. Ménard, C. Laporte, Interactive segmentation of tongue contours in ultrasound video sequences using...
  • C. Laporte, L. Ménard, Robust tongue tracking in ultrasound images: a multi-hypothesis approach, in: Sixt. Annu. Conf....
  • K. Xu et al., Robust contour tracking in ultrasound tongue image sequences, Clin. Linguist. Phon. (2016)
  • L. Tang, G. Hamarneh, Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal...
  • J. Berry et al., Dynamics of tongue gestures extracted automatically from ultrasound, IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) Proc. (2011)
  • I. Fasel, J. Berry, Deep belief networks for real-time extraction of tongue contours from ultrasound during speech,...
  • A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, M.L. Stone, Tongue contour extraction from ultrasound images, Proc. 18th...
  • L. Tang, T. Bressmann, G. Hamarneh, Tongue contour tracking in dynamic ultrasound via higher-order...
  • D. Fabre et al., Tongue tracking in ultrasound images using eigentongue decomposition and artificial neural networks, Proc. Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH) (2015)
  • A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, B. Denby, Tongue contour extraction from ultrasound images...
  • G.V. Hahn-Powell et al., AutoTrace: An automatic system for tracing tongue contours, J. Acoust. Soc. Am. (2014)
  • S.K. Zhou et al., Deep learning for medical image analysis (2017)
  • G. Lin, A. Milan, C. Shen, I. Reid, RefineNet: Multi-path refinement networks for high-resolution semantic...
  • F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, Proc. ICLR (2016)...
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1 (2015)...
  • C. Szegedy et al., Going deeper with convolutions, Popul. Health Manag. (2014)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Commun. ACM (2017)