Methods

Volume 179, 1 July 2020, Pages 26-36

Encoder-decoder CNN models for automatic tracking of tongue contours in real-time ultrasound data

https://doi.org/10.1016/j.ymeth.2020.05.011

Highlights

  • A combination of standard and dilated convolutions provides sharper segmentation results.

  • The tongue contour can be automatically tracked in real time from ultrasound data using deep learning methods.

  • BowNet models can delineate a region of interest from input data using the capability of dilated convolution.

  • A large number of learnable parameters in a CNN model is not always an indication of better generalization.

  • Ground truth labels of ultrasound data should be in binary instead of gray-scale format.

Abstract

One application of medical ultrasound imaging is to visualize and characterize human tongue shape and motion in real time to study healthy or impaired speech production. Due to the low contrast and noisy nature of ultrasound images, users need knowledge of tongue structure and of ultrasound data interpretation to recognize tongue locations and gestures easily. Moreover, quantitative analysis of tongue motion requires the tongue contour to be extracted, tracked, and visualized rather than the whole tongue region. Manual tongue contour extraction is a cumbersome, subjective, and error-prone task. Furthermore, it is not a feasible solution for real-time applications, where the tongue contour moves rapidly with nuanced gestures. This paper presents two new deep neural networks (named BowNet models) that benefit from the global prediction ability of encoding-decoding fully convolutional neural networks and the full-resolution extraction capability of dilated convolutions. Both qualitative and quantitative studies on datasets from two ultrasound machines demonstrated the outstanding performance of the proposed deep learning models in terms of speed and robustness. Experimental results also revealed a significant improvement in the accuracy of prediction maps due to the better exploration and exploitation ability of the proposed network models.

Introduction

Studying and exploiting the dynamic nature of speech data from ultrasound tongue image sequences can provide valuable information, and it is of great interest in many recent studies [1]. Ultrasound imaging has been utilized for tongue motion analysis in many applications, such as treatment of speech sound disorders [2], comparing healthy and impaired speech production [1], second language training and rehabilitation [3], silent speech interfaces [4], research on food swallowing [5], and 3D tongue modeling [6], to name a few. In ultrasound data, the tongue contour region appears as a thick, bright white curve. However, due to the lack of a hard structure as a reference, it is not easy to perceive the tongue position [7] in real time, and the interpretation of ultrasound data is challenging for non-expert users. Tongue contours can be tracked automatically to alleviate this difficulty for real-time applications. Therefore, it is crucial to have a fully automatic technique for real-time ultrasound tongue contour tracking [8].

Various methods have been proposed for the problem of automatic tongue tracking in recent years, such as active contour models or snakes [9], [10], [11], [12], [13], graph-based techniques [14], and machine learning-based methods [15], [16], [17], [18], [19], [20]. Many recent tongue contour tracking methods are reviewed in a study by Laporte et al. [1]; most of them require monitoring and manual manipulation while the tongue tracking process is in progress. Initialization is necessary for almost all conventional methods: users must manually label at least one frame, with the restriction of drawing close to the tongue region [1]. For instance, to use standard software in the literature such as AutoTrace [21], EdgeTrak [10], or TongueTrack [18], users must annotate several points on at least one frame.

Convolutional Neural Networks (CNNs) have been the method of choice for many computer vision applications in recent years [22]. They have shown outstanding performance in many image classification tasks as well as in object detection, recognition, and tracking [23]. A dense image classification task can be considered a segmentation problem when the goal is to categorize every single pixel with a discrete or continuous label [24]. By adapting and modifying several well-known deep classification network models [25], [26], [27], the Fully Convolutional Network (FCN) was successfully exploited for the image segmentation problem in [28]. Instead of a fully connected classifier in the last layers, FCN uses convolutional layers throughout, providing a prediction map of arbitrary size with fewer parameters. Since then, the performance of FCN models has been improved by proposing different operations in the decoder block, such as the deconvolution operator [29], concatenating encoder features with decoder features [30], [31], indexed un-pooling [32], and adding post-processing stages such as CRFs [33].
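
As a rough illustration of the fully convolutional idea described above, the sketch below replaces a dense classifier with a 1 × 1 convolution so the network emits a spatial prediction map whose size follows the input. It uses PyTorch, which is an assumption (the paper does not name its framework), and TinyFCN is a hypothetical name, not a model from the paper.

```python
# Minimal sketch of a fully convolutional head (assumed PyTorch, illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Small encoder: each pooling halves the spatial resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # A 1x1 convolution replaces the fully connected classifier.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.classifier(self.encoder(x))          # coarse per-pixel scores
        # Upsample back to the input resolution to obtain a dense prediction map.
        return F.interpolate(h, size=x.shape[-2:], mode="bilinear", align_corners=False)

# Any input size works; the output map matches it spatially.
scores = TinyFCN()(torch.randn(1, 1, 128, 128))       # -> shape (1, 2, 128, 128)
```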

Many of these innovations significantly improved the accuracy of segmentation results, usually at the expense of higher computational cost due to the large number of network parameters. Furthermore, the consecutive pooling layers used in almost all of those methods to enlarge the receptive field and provide localization invariance [34] cause a considerable reduction of feature resolution in the decoding stage [35]. To alleviate this issue, dilated (sometimes called atrous) convolution has been employed recently [24], [34], [35], [36], [33]. Dilated convolutions [36], [37], [38] help the network predict instances without losing receptive field, without the need for a fully convolutional final layer, and with fewer learnable parameters than previous FCN methods. To extract tongue contours automatically, without any manipulation, over a large number of image frames, modified versions of UNet [30] have recently been used successfully for ultrasound tongue extraction [8], [39]. The drawback of those methods was that prediction accuracy was not satisfactory in real time despite a considerable number of network parameters. Furthermore, the speed of those models was still on the edge of real-time performance. Therefore, there is a trade-off between speed [8], [40], generalization [41], and accuracy [30], and a gap remained in the field for a general model that preserves all of these aspects.
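
The snippet below is a minimal illustration (again assuming PyTorch; it is not code from the paper) of why dilated convolution helps: a 3 × 3 kernel with dilation 2 covers a 5 × 5 neighborhood with the same nine weights per channel pair, and the feature-map resolution is preserved because no pooling is involved.

```python
# Standard vs. dilated 3x3 convolution: same resolution, same parameter count,
# but the dilated kernel sees a wider context (illustrative sketch).
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

standard = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
dilated  = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 16, 64, 64]) -- resolution preserved
print(dilated(x).shape)   # torch.Size([1, 16, 64, 64]) -- same size, wider receptive field
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))  # identical parameter counts
```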

In this work, we develop two new deep convolutional neural models (named BowNet and wBowNet) for the problem of ultrasound tongue extraction. For the first time, dilated convolution is combined with a standard deep, dense classification method in two new network architectures to extract both local and global context from each frame at the same time in an end-to-end fashion. BowNet is a parallel combination of a dense classification architecture inspired by VGG16 [25] and UNet [30] with a segmentation model inspired by the DeepLab network, where dilated convolution layers are used without pooling layers [35]. wBowNet is the counterpart of BowNet, where the classification and segmentation networks are interconnected to support an even higher resolution in the prediction outcomes. We evaluated our proposed architectures on two different datasets of ultrasound tongue images, and the experimental results demonstrate that our fully automatic models achieve accurate predictions with fewer learnable parameters and real-time performance. It is noteworthy that Google recently proposed an advanced version of the BowNet model for semantic segmentation applications [42].
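
For readers unfamiliar with the two-branch idea, the following sketch shows one way such a parallel combination could be wired up. It is a hypothetical schematic only, assuming PyTorch; the layer counts, channel widths, and concatenation-based fusion are illustrative and do not reproduce the actual BowNet or wBowNet architectures.

```python
# Schematic two-branch segmentation network: an encoder-decoder branch in
# parallel with a dilated-convolution branch, fused into one prediction map.
# Illustrative only; NOT the authors' BowNet/wBowNet design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchSegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Branch 1: small encoder-decoder (global context via pooling).
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(inplace=True),
        )
        # Branch 2: dilated convolutions, no pooling (full-resolution features).
        self.dilated = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # Fuse both branches and predict a per-pixel map.
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        a = self.dec(self.enc(x))                      # encoder-decoder features
        b = self.dilated(x)                            # dilated, full-resolution features
        a = F.interpolate(a, size=b.shape[-2:], mode="bilinear", align_corners=False)
        return self.head(torch.cat([a, b], dim=1))     # fused prediction map

out = TwoBranchSegNet()(torch.randn(1, 1, 128, 128))   # -> shape (1, 2, 128, 128)
```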

Section snippets

Network architectures

The consecutive combination of convolutional and pooling layers in the encoder of a CNN model results in a significantly lower spatial resolution of the output feature maps, typically by a factor of 32 for current deep learning architectures, which is not the desired resolution for semantic segmentation purposes [35]. Several ideas have been proposed to reconstruct an input-size segmented output from the coarse feature map of the encoding part. Interpolation (up-sampling) could be the
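
The resolution loss described above can be made concrete with a short sketch (assumed PyTorch, not code from the paper): five consecutive stride-2 poolings shrink a feature map by a factor of 2^5 = 32, and bilinear interpolation is the simplest way to bring a coarse map back to the input size, at the cost of lost detail.

```python
# Illustration of the factor-32 resolution loss and interpolation-based recovery.
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 256, 256)
for _ in range(5):                     # five consecutive 2x2 poolings
    x = F.max_pool2d(x, kernel_size=2)
print(x.shape)                         # torch.Size([1, 8, 8, 8]) -> 256 / 32

restored = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
print(restored.shape)                  # torch.Size([1, 8, 256, 256]), but fine detail is gone
```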

System setup

In order to train our proposed network models, we conducted an extensive random-search hyperparameter tuning [46] to find the optimum values of parameters such as filter size (doubled for each consecutive layer, starting from 16, 32, or 64), kernel size (3 × 3 and 5 × 5), dilation factor (doubled for each consecutive layer, starting from 1 or 2), the number of global iterations (iteration and epoch size), batch size (10, 20, and 50, depending on GPU memory), augmentation parameters (online and offline),
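
A minimal sketch of such a random search is given below, sampling only from the candidate values listed above; the helper names and the train_and_evaluate callback are hypothetical placeholders rather than the authors' actual tuning code.

```python
# Random-search hyperparameter tuning over the candidate values listed in the text
# (illustrative sketch; helper names are hypothetical).
import random

SEARCH_SPACE = {
    "base_filters": [16, 32, 64],        # doubled in each consecutive layer
    "kernel_size": [(3, 3), (5, 5)],
    "base_dilation": [1, 2],             # doubled in each consecutive layer
    "batch_size": [10, 20, 50],          # bounded by GPU memory
    "augmentation": ["online", "offline"],
}

def sample_config(space):
    """Draw one random configuration from the search space."""
    return {name: random.choice(values) for name, values in space.items()}

def random_search(n_trials, train_and_evaluate):
    """Try n_trials random configurations and keep the best-scoring one."""
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(SEARCH_SPACE)
        score = train_and_evaluate(config)   # user-supplied: trains a model, returns a validation score
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```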

Conclusion and discussion

In this paper, we have proposed and presented two new deep convolutional neural networks, called BowNet and wBowNet, for tongue contour extraction, benefiting from dense image classification and dilated convolution for globally and locally accurate segmentation results. Extensive experimental studies on several types of datasets, using online and offline augmentation, with comparison results demonstrated the outstanding performance of the two proposed deep learning techniques. From the

CRediT authorship contribution statement

M. Hamed Mozaffari: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Validation, Writing - review & editing. Won-Sook Lee: Supervision.

References (55)

  • C. Laporte et al., Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech, Med. Image Anal. (2018)
  • B. Denby et al., Silent speech interfaces, Speech Commun. (2010)
  • A. Eshky et al., UltraSuite: A repository of ultrasound and acoustic data from child speech therapy sessions, Interspeech (2018)
  • B. Gick, B.M. Bernhardt, P. Bacsfalvi, I. Wilson, Ultrasound imaging applications in second language acquisition, in:...
  • M. Ohkubo et al., Tongue shape dynamics in swallowing using sagittal ultrasound, Dysphagia (2018)
  • S. Chen, Y. Zheng, C. Wu, G. Sheng, P. Roussel, B. Denby, Direct, near real time animation of a 3D tongue model using...
  • M. Stone, A guide to analysing tongue motion from ultrasound images, Clin. Linguist. Phon. (2005)
  • M.H. Mozaffari, S. Guan, S. Wen, N. Wang, W.-S. Lee, Guided learning of pronunciation by visualizing tongue...
  • K. Xu et al., Development of a 3D tongue motion visualization platform based on ultrasound image sequences, arXiv preprint arXiv:1605.06106 (2016)
  • M. Li et al., Automatic contour tracking in ultrasound images, Clin. Linguist. Phon. (2005)
  • S. Ghrenassia, L. Ménard, C. Laporte, Interactive segmentation of tongue contours in ultrasound video sequences using...
  • C. Laporte, L. Ménard, Robust tongue tracking in ultrasound images: a multi-hypothesis approach, in: Sixt. Annu. Conf....
  • K. Xu et al., Robust contour tracking in ultrasound tongue image sequences, Clin. Linguist. Phon. (2016)
  • L. Tang, G. Hamarneh, Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal...
  • J. Berry et al., Dynamics of tongue gestures extracted automatically from ultrasound, IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) Proc. (2011)
  • I. Fasel, J. Berry, Deep belief networks for real-time extraction of tongue contours from ultrasound during speech,...
  • A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, M.L. Stone, Tongue contour extraction from ultrasound images, Proc. 18th...
  • L. Tang, T. Bressmann, G. Hamarneh, Tongue contour tracking in dynamic ultrasound via higher-order...
  • D. Fabre et al., Tongue tracking in ultrasound images using eigentongue decomposition and artificial neural networks, Proc. Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH) (2015)
  • A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, B. Denby, Tongue contour extraction from ultrasound images...
  • G.V. Hahn-Powell et al., AutoTrace: An automatic system for tracing tongue contours, J. Acoust. Soc. Am. (2014)
  • S.K. Zhou et al., Deep learning for medical image analysis (2017)
  • G. Lin, A. Milan, C. Shen, I. Reid, RefineNet: Multi-path refinement networks for high-resolution semantic...
  • F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, Proc. ICLR (2016)...
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1 (2015)...
  • C. Szegedy et al., Going deeper with convolutions, Popul. Health Manag. (2014)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Commun. ACM (2017)