-
An intention multiple-representation model with expanded information Comput. Speech Lang (IF 2.116) Pub Date : 2021-01-17 Jingxiang Hu; Junjie Peng; Wenqiang Zhang; Lizhe Qi; Miao Hu; Huanxiang Zhang
Short text is the main carrier for people to express their ideas and opinions, and it is both very important and a big challenge to understand the meaning of a short text or recognize the semantic patterns of different short texts. Most existing methods use word embedding and short text interaction to learn the semantic patterns of short text pairs. However, some of these methods are complicated and cannot
-
Novel textual entailment technique for the Arabic language using genetic algorithm Comput. Speech Lang (IF 2.116) Pub Date : 2021-01-12 Bushra Alhijawi; Arafat Awajan
This paper presents a textual entailment (TE) model that considers entailment as an optimization problem. The proposed TE model employs a genetic algorithm to derive an optimal similarity function and correlated entailment judgment threshold. The similarity function is formulated through a linear combination of text similarity measures and weights. Two text similarity measures are considered: cosine
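The decision rule described above, a linear combination of text similarity measures compared against a threshold, can be sketched in a few lines. The two measures, the weights (0.6/0.4), and the threshold (0.5) below are illustrative placeholders; in the paper these quantities are exactly what the genetic algorithm optimizes.

```python
import math

def cosine_similarity(text_a, text_b):
    """Cosine similarity over simple bag-of-words term counts."""
    def counts(text):
        freq = {}
        for token in text.lower().split():
            freq[token] = freq.get(token, 0) + 1
        return freq
    a, b = counts(text_a), counts(text_b)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def word_overlap(text, hypothesis):
    """Fraction of hypothesis words covered by the text."""
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    return len(t & h) / len(h) if h else 0.0

def entails(text, hypothesis, weights=(0.6, 0.4), threshold=0.5):
    """Judge entailment with a weighted linear combination of similarity
    measures; weights and threshold are illustrative stand-ins for the
    values the genetic algorithm would learn."""
    score = (weights[0] * cosine_similarity(text, hypothesis)
             + weights[1] * word_overlap(text, hypothesis))
    return score >= threshold
```

With learned weights, the same decision function applies unchanged; only the parameters differ per language or domain.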
-
Verbal fluency in normal aging and cognitive decline: Results of a longitudinal study Comput. Speech Lang (IF 2.116) Pub Date : 2021-01-12 Claudia Frankenberg; Jochen Weiner; Maren Knebel; Ayimunishagu Abulimiti; Pablo Toro; Christina J. Herold; Tanja Schultz; Johannes Schröder
Verbal fluency – i.e. the ability to name as many words of a given category as possible in a defined time interval – is an integral part of neuropsychological test batteries for the diagnosis of dementia. Verbal fluency can be easily administered and thus may also be implemented in computerized dementia screening tests. In the present study we sought to investigate the capability of phonemic verbal
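Scoring such a trial is mechanically simple, which is what makes computerized administration attractive. The sketch below assumes a trial is represented as timestamped word responses and a set of accepted category words; both representations are illustrative, not the study's actual data format.

```python
def fluency_score(responses, lexicon, time_limit=60.0):
    """Score a verbal fluency trial: count distinct in-category words
    produced within the time limit, ignoring repetitions and intrusions.
    `responses` is a list of (word, onset_seconds) pairs; `lexicon` is
    the set of words accepted for the target category."""
    seen = set()
    for word, onset in responses:
        w = word.lower()
        if onset <= time_limit and w in lexicon and w not in seen:
            seen.add(w)
    return len(seen)
```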
-
Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification Comput. Speech Lang (IF 2.116) Pub Date : 2021-01-15 Jianfeng Deng; Lianglun Cheng; Zhuowei Wang
Neural networks have been widely used in the field of text classification, and have achieved good results on various Chinese datasets. However, long text classification is challenging because text data contain a lot of redundant information, some of which may involve other topics. To solve the above problems, this paper proposes
-
Generating unambiguous and diverse referring expressions Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-31 Nikolaos Panagiaris; Emma Hart; Dimitra Gkatzia
Neural Referring Expression Generation (REG) models have shown promising results in generating expressions which uniquely describe visual objects. However, current REG models still lack the ability to produce diverse and unambiguous referring expressions (REs). To address the lack of diversity, we propose generating a set of diverse REs, rather than one-shot REs. To reduce the ambiguity of referring
-
Identification of related languages from spoken data: Moving from off-line to on-line scenario Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-15 Petr Cerva; Lukas Mateju; Jindrich Zdansky; Radek Safarik; Jan Nouza
The accelerating flow of information we encounter around the world today makes many companies deploy speech recognition systems that, to an ever-growing extent, process data on-line rather than off-line. These systems, e.g., for real-time 24/7 broadcast transcription, often work with input-stream data containing utterances in more than one language. This multilingual data can correctly be transcribed
-
Exploring neural models for predicting dementia from language Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-21 Weirui Kong; Hyeju Jang; Giuseppe Carenini; Thalia S. Field
Early prediction of neurodegenerative disorders such as Alzheimer’s disease (AD) and related dementias may facilitate earlier access to medical and social supports. Further, detection of individuals with preclinical disease may help to enrich clinical trial populations for studies examining disease-modifying interventions. Changes in speech and language patterns may occur in the early stages of neurodegenerative
-
An unsupervised approach to detect review spam using duplicates of images, videos and Chinese texts Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-24 Jiandun Li; Pengpeng Zhang; Liu Yang
Intuitively, image- or video-based recommendations seem to be more reliable than those containing plain text, and these types of recommendations have recently become widely encouraged and commonly seen across opinion sharing platforms. Considering their potential for manipulation, graphics (e.g., images and videos) are more vulnerable to spam than scripts. However, most state-of-the-art solutions for
-
Hierarchical state recurrent neural network for social emotion ranking Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-15 Deyu Zhou; Meng Zhang; Yang Yang; Yulan He
Text generation with auxiliary attributes, such as topics or sentiments, has made remarkable progress. However, high-quality labeled data is difficult to obtain for large-scale corpora. Therefore, this paper focuses on social emotion ranking, aiming to identify social emotions with different intensities evoked by online documents, which could be potentially beneficial for further controlled text
-
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-15 Yusuke Yasuda; Xin Wang; Junichi Yamagishi
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully
-
Turn-taking in Conversational Systems and Human-Robot Interaction: A Review Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-16 Gabriel Skantze
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. Conversational systems (including voice assistants and
-
To what extent does content selection affect surface realization in the context of headline generation? Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-15 Cristina Barros; Marta Vicente; Elena Lloret
Headline generation is a task where the most important information of a news article is condensed and embodied into a single short sentence. This task is normally addressed by summarization techniques, ideally combining extractive and abstractive methods together with sentence compression or fusion techniques. Although Natural Language Generation (NLG) techniques have not been directly exploited for
-
Learning to generate structured queries from natural language with indirect supervision Comput. Speech Lang (IF 2.116) Pub Date : 2020-12-15 Ziwei Bai; Bo yu; Bowen Wu; Zhuoran Wang; Baoxun Wang
Generating structured query language (SQL) from natural language is an emerging research topic. This paper presents a new learning paradigm based on indirect supervision from the answers to natural language questions, instead of from SQL queries. This paradigm facilitates the acquisition of training data due to the abundant resources of question-answer pairs for various domains on the Internet, and expels the
-
Human evaluation of automatically generated text: Current trends and best practice guidelines Comput. Speech Lang (IF 2.116) Pub Date : 2020-11-22 Chris van der Lee; Albert Gatt; Emiel van Miltenburg; Emiel Krahmer
Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated, with a particularly high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how (mostly intrinsic) human evaluation is currently conducted and presents a set of best practices, grounded in the literature. These best practices are also linked
-
BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling Comput. Speech Lang (IF 2.116) Pub Date : 2020-11-26 Jing Su; Qingyun Dai; Frank Guerin; Mian Zhou
Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual
-
A Spanish multispeaker database of esophageal speech Comput. Speech Lang (IF 2.116) Pub Date : 2020-11-05 Luis Serrano García; Sneha Raman; Inma Hernáez Rioja; Eva Navas Cordón; Jon Sanchez; Ibon Saratxaga
A laryngectomee is a person whose larynx has been removed by surgery, usually due to laryngeal cancer. After surgery, most laryngectomees are able to speak again, using techniques that are learned with the help of a speech therapist. This is termed alaryngeal speech, and esophageal speech (ES) is one of several alaryngeal speech production modes. A considerable amount of research has been dedicated
-
A Bayesian end-to-end model with estimated uncertainties for simple question answering over knowledge bases Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-27 Linhai Zhang; Chao Lin; Deyu Zhou; Yulan He; Meng Zhang
Existing methods for question answering over knowledge bases (KBQA) ignore the model's prediction uncertainties. We argue that estimating such uncertainties is crucial for the reliability and interpretability of KBQA systems. Therefore, we propose a novel end-to-end KBQA model based on a Bayesian Neural Network (BNN) to estimate the uncertainties arising from both the model and the data. To our
-
QBSUM: A large-scale query-based document summarization dataset from real-world applications Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-28 Mingjun Zhao; Shengli Yan; Bang Liu; Xinwang Zhong; Qian Hao; Haolan Chen; Di Niu; Bowei Long; Weidong Guo
Query-based document summarization aims to extract or generate a summary of a document which directly answers or is relevant to the search query. It is an important technique that can be beneficial to a variety of applications such as search engines, document-level machine reading comprehension, and chatbots. Currently, datasets designed for query-based summarization are few in number, and existing
-
A label-oriented loss function for learning sentence representations Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-26 Yihong Liu; Wei Guan; Dongxu Lu; Xianchun Zou
Neural network methods which leverage word-embedding obtained from unsupervised learning models have been widely adopted in many natural language processing (NLP) tasks, including sentiment analysis and sentence classification. Existing sentence representation generation approaches which serve for classification tasks generally rely on complex deep neural networks but relatively simple loss functions
-
MuST-C: A multilingual corpus for end-to-end speech translation Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-07 Roldano Cattoni; Mattia Antonino Di Gangi; Luisa Bentivogli; Matteo Negri; Marco Turchi
End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront the scarcity of publicly available corpora for training data-hungry neural networks. Indeed, while traditional cascade solutions
-
An analysis of observation length requirements for machine understanding of human behaviors from spoken language Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-16 Sandeep Nallan Chakravarthula; Brian R.W. Baucom; Shrikanth Narayanan; Panayiotis Georgiou
The task of quantifying human behavior by observing interaction cues is an important and useful one across a range of domains in psychological research and practice. Machine learning-based approaches typically perform this task by first estimating behavior based on cues within an observation window, such as a fixed number of words, and then aggregating the behavior over all the windows in that interaction
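The window-then-aggregate pipeline the abstract describes can be sketched generically: estimate a behavior score for each fixed-size word window, then pool the per-window scores over the interaction. The `estimate` callable below is a stand-in for a learned behavior model, and averaging is just one possible aggregation.

```python
def aggregate_behavior(words, window_size, estimate, stride=None):
    """Estimate a behavior score on each fixed-size word window, then
    aggregate over the whole interaction by averaging. `estimate` maps
    a list of words to a score (a placeholder for a trained model)."""
    stride = stride or window_size  # non-overlapping windows by default
    scores = []
    for start in range(0, max(len(words) - window_size + 1, 1), stride):
        window = words[start:start + window_size]
        scores.append(estimate(window))
    return sum(scores) / len(scores)
```

The observation-length question the paper studies then amounts to how `window_size` (and the total number of windows) affects the reliability of the aggregated score.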
-
Neural candidate-aware language models for speech recognition Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-24 Tomohiro Tanaka; Ryo Masumura; Takanobu Oba
This paper presents novel neural network based language models that can correct automatic speech recognition (ASR) errors by using speech recognizer outputs as a context. Our proposed models, called neural candidate-aware language models (NCALMs), estimate the generative probability of a target sentence while considering ASR outputs including hypotheses and their posterior probabilities. Recently,
-
On the use of blind channel response estimation and a residual neural network to detect physical access attacks to speaker verification systems Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-20 Anderson R. Avila; Jahangir Alam; Fabiano O. Costa Prado; Douglas O’Shaughnessy; Tiago H. Falk
Spoofing attacks have been acknowledged as a serious threat to automatic speaker verification (ASV) systems. In this paper, we are specifically concerned with replay attack scenarios. As a countermeasure to the problem, we propose a front-end based on blind estimation of the channel response magnitude and, as a back-end, a residual neural network. Our hypothesis is that the magnitude response of
-
Transfer fine-tuning of BERT with phrasal paraphrases Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-20 Yuki Arase; Junichi Tsujii
Sentence pair modelling is defined as the task of identifying the semantic interaction between a sentence pair, i.e., paraphrase and textual entailment identification and semantic similarity measurement. It constitutes a set of crucial tasks for research in the area of natural language understanding. Sentence representation learning is a fundamental technology for sentence pair modelling, where the
-
Controlling contents in data-to-document generation with human-designed topic labels Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-29 Kasumi Aoki; Akira Miyazawa; Tatsuya Ishigaki; Tatsuya Aoki; Hiroshi Noji; Keiichi Goshima; Hiroya Takamura; Yusuke Miyao; Ichiro Kobayashi
We propose a data-to-document generator that can easily control the contents of output texts based on a neural language model. A conventional data-to-text model is useful when a reader seeks a global summary of the data, because it only has to describe an important part that has been extracted beforehand. However, since users differ in what they are interested in, it is necessary to develop
-
Advances in subword-based HMM-DNN speech recognition across languages Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-28 Peter Smit; Sami Virpioja; Mikko Kurimo
We describe a novel way to implement subword language models in speech recognition systems based on weighted finite state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes in a way that no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short
-
Replay attack detection using variable-frequency resolution phase and magnitude features Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-28 Meng Liu; Longbiao Wang; Jianwu Dang; Kong Aik Lee; Seiichi Nakagawa
Replay attacks pose the most severe threat to automatic speaker verification systems among various spoofing attacks. In this paper, we propose a novel feature extraction method that leverages both the phase-based and magnitude-based features. The proposed method fully utilizes the subband information and the complementary information from the phase and magnitude spectra. First, we conduct a discriminative
-
Acoustic and articulatory analysis and synthesis of shouted vowels Comput. Speech Lang (IF 2.116) Pub Date : 2020-10-09 Yawen Xue; Michael Marxen; Masato Akagi; Peter Birkholz
Acoustic and articulatory differences between spoken and shouted vowels were analyzed for two male and two female subjects by means of acoustic recordings and midsagittal magnetic resonance images of the vocal tract. In accordance with previous acoustic findings, the fundamental frequencies, intensities, and formant frequencies were all generally higher for shouted than for spoken vowels. The harmonics-to-noise
-
A deep neural network based correction scheme for improved air-tissue boundary prediction in real-time magnetic resonance imaging video Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-28 Renuka Mannem; Prasanta Kumar Ghosh
The real-time Magnetic Resonance Imaging (rtMRI) video captures the vocal tract movements in the mid-sagittal plane during speech. Air tissue boundaries (ATBs) are contours that trace the transition between the high-intensity tissue corresponding to the speech articulators and the low-intensity airway cavity in the rtMRI video. The ATB segmentation in an rtMRI video is a common preprocessing step which
-
Gated dynamic convolutions with deep layer fusion for abstractive document summarization Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-25 Hongseok Kwon; Byung-Hyun Go; Juhong Park; Wonkee Lee; Yewon Jeong; Jong-Hyeok Lee
We present a novel abstractive document summarization based on the recently proposed dynamic convolutional encoder-decoder architectures. We address several aspects of summarization that are not well modeled by the basic architecture, by integrating multiple layers of the encoder, controlling information flow in the hierarchy, and exploiting external knowledge. First, we propose a simple and efficient
-
An online multi-source summarization algorithm for text readability in topic-based search Comput. Speech Lang (IF 2.116) Pub Date : 2020-08-29 Arturo Curiel; Claudio Gutiérrez-Soto; José-Rafael Rojano-Cáceres
Web search users are likely to face problems related to the availability of large amounts of data. As the quantity of online content grows, the risk of missing relevant information during search can only increase. Moreover, external variables such as the users’ reading proficiency level can further complicate the task. This article proposes an online multi-document summarization algorithm for text
-
Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian Comput. Speech Lang (IF 2.116) Pub Date : 2020-09-01 Matti Varjokallio; Sami Virpioja; Mikko Kurimo
We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach
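The class-based factorization that makes million-word vocabularies tractable is compact: P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), so only class-level bigram statistics are stored. The sketch below shows this standard factorization with given class assignments; deriving morphologically motivated classes, as the paper does, is the part omitted here.

```python
from collections import Counter

def class_bigram_lm(corpus, word2class):
    """Train a minimal class-based bigram LM. P(w_i | w_{i-1}) is
    factored as P(c_i | c_{i-1}) * P(w_i | c_i); class assignments are
    supplied, whereas the paper learns them from morphology."""
    class_bigrams, class_counts, word_counts = Counter(), Counter(), Counter()
    for sentence in corpus:
        classes = [word2class[w] for w in sentence]
        for w, c in zip(sentence, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        class_bigrams.update(zip(classes, classes[1:]))

    def prob(prev_word, word):
        c_prev, c = word2class[prev_word], word2class[word]
        # class transition probability (unsmoothed, fine for a toy corpus)
        p_class = class_bigrams[(c_prev, c)] / sum(
            v for (a, _), v in class_bigrams.items() if a == c_prev)
        # class membership probability
        p_word = word_counts[word] / class_counts[c]
        return p_class * p_word

    return prob
```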
-
Low resource end-to-end spoken language understanding with capsule networks Comput. Speech Lang (IF 2.116) Pub Date : 2020-08-20 Jakob Poncelet; Vincent Renkens; Hugo Van hamme
Designing a Spoken Language Understanding (SLU) system for command-and-control applications is challenging. Both Automatic Speech Recognition and Natural Language Understanding are language and application dependent to a great extent. Even with a lot of design effort, users often still have to know what to say to the system for it to do what they want. We propose to use an end-to-end SLU system that
-
Detection of replay spoof speech using teager energy feature cues Comput. Speech Lang (IF 2.116) Pub Date : 2020-08-14 Madhu R. Kamble; Hemant A. Patil
The vulnerability of Automatic Speaker Verification (ASV) systems to spoofing or presentation attacks is still an open security issue. In this context, replay spoofing attacks pose a great threat to an ASV system since they can be easily performed (using a playback device, and without needing any technical skill). In this paper, we analyze replay speech signals in terms of reverberation that may occur
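The Teager energy operator at the heart of these feature cues is a three-sample formula, Ψ(x[n]) = x[n]² − x[n−1]·x[n+1], which tracks the instantaneous energy of an oscillation; this is a minimal sketch of the operator itself, not the paper's full feature pipeline.

```python
def teager_energy(signal):
    """Discrete Teager energy operator:
    Psi(x[n]) = x[n]^2 - x[n-1] * x[n+1].
    For a pure sinusoid this is constant; deviations reflect amplitude
    or frequency modulation, e.g. the extra reverberation that a
    replayed recording picks up."""
    return [signal[n] ** 2 - signal[n - 1] * signal[n + 1]
            for n in range(1, len(signal) - 1)]
```

In practice the operator is applied per subband and the resulting energy profiles are summarized into frame-level features.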
-
HSCJN: A holistic semantic constraint joint network for diverse response generation Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-29 Yiru Wang; Pengda Si; Zeyang Lei; Guangxu Xun; Yujiu Yang
The sequence-to-sequence (Seq2Seq) model generates target words iteratively given the previously observed words during the decoding process, which results in the loss of the holistic semantics in the target response and the complete semantic relationship between responses and dialogue histories. In this paper, we propose a generic diversity-promoting joint network, called Holistic Semantic Constraint
-
Assessing the effect of visual servoing on the performance of linear microphone arrays in moving human-robot interaction scenarios Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-30 Alejandro Díaz; Rodrigo Mahu; Jose Novoa; Jorge Wuth; Jayanta Datta; Nestor Becerra Yoma
Social robotics is becoming a reality and voice-based human-robot interaction is essential for a successful human-robot collaborative symbiosis. The main objective of this paper is to assess the effect of visual servoing in the performance of a linear microphone array regarding distant ASR in a mobile, dynamic and non-stationary robotic testbed that can be representative of real HRI scenarios. Visual
-
Speaker clustering quality estimation with logistic regression Comput. Speech Lang (IF 2.116) Pub Date : 2020-08-07 Yishai Cohen; Itshak Lapidot
This paper focuses on estimating the quality of a clustering process. The task is to cluster short speech segments that belong to different speakers. A variety of statistical parameters are estimated from the output of the clustering process. These parameters are used to train a logistic regression to serve as a clustering quality estimation system. In this paper, mean-shift clustering with either
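The estimator itself is plain logistic regression over clustering statistics. The sketch below fits one by batch gradient descent in pure Python; the choice of input statistics (e.g. number of clusters, average cluster size) and the good/bad labels are illustrative, not the paper's exact feature set.

```python
import math

def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent. Each feature
    vector would hold statistics computed from one clustering run, and
    the label a binary quality judgment of that run."""
    n_dims = len(features[0])
    w, b = [0.0] * n_dims, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_dims, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(features) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(features)

    def predict(x):
        """Return the estimated probability that the clustering is good."""
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    return predict
```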
-
Detection of speech playback attacks using robust harmonic trajectories Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-16 Wei Shang; Maryhelen Stevenson
In this paper, a new feature set is proposed for use in a playback attack detector (PAD) aimed at safeguarding a passphrase- and speaker-verified protected system that can be remotely accessed from an arbitrary location using an arbitrary telecommunication channel. The new feature set, termed VoicedTracks, is a time-frequency map of the most robust harmonic trajectories in an utterance and serves as
-
Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-31 Wasan AlKhwiter; Nora Al-Twairesh
Over the past few years, Twitter has experienced massive growth and the volume of its online content has increased rapidly. This content has been a rich source for several studies that focused on natural language processing (NLP) research. However, Twitter data pose numerous challenges and obstacles to NLP tasks. For the English language, Twitter has an NLP tool that provides tweet-specific NLP tasks
-
A neural network approach for speech activity detection for Apollo corpus Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-30 Vishala Pannala; B. Yegnanarayana
This paper describes a new method for speech activity detection (SAD) based on the recently proposed single frequency filtering (SFF) analysis of speech signals and a neural network model. The SFF analysis gives instantaneous spectrum of the speech signal at each sampling instant. The frequency resolution of the spectrum is decided by the number of frequencies used in the SFF analysis, which in turn
-
LIS-Net: An end-to-end light interior search network for speech command recognition Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-17 Nguyen Tuan Anh; Yongjian Hu; Qianhua He; Tran Thi Ngoc Linh; Hoang Thi Kim Dung; Chen Guang
-
A Korean named entity recognition method using Bi-LSTM-CRF and masked self-attention Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-18 Guozhe Jin; Zhezhou Yu
Named entity recognition (NER) is a fundamental task in natural language processing. Existing Korean NER methods use Korean morphemes, syllable sequences, and parts of speech as features, and use a sequence labeling model to tackle this problem. In Korean, on the one hand, a morpheme itself carries strong indicative information about named entities (especially for times and persons). On the other hand, the
-
Voice spoofing detection corpus for single and multi-order audio replays Comput. Speech Lang (IF 2.116) Pub Date : 2020-07-16 Roland Baumann; Khalid Mahmood Malik; Ali Javed; Andersen Ball; Brandon Kujawa; Hafiz Malik
The evolution of modern voice-controlled devices (VCDs) has revolutionized the Internet of Things (IoT) and resulted in the increased realization of smart homes, personalization, and home automation through voice commands. These VCDs can be exploited in IoT driven environments to generate various spoofing attacks, including the chaining of replay attacks (i.e. multi-order replay attacks). Existing
-
Towards a speech therapy support system based on phonological processes early detection Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-24 Maria Helena Franciscatto; Marcos Didonet Del Fabro; João Carlos Damasceno Lima; Celio Trois; Augusto Moro; Vinícius Maran; Marcia Keske-Soares
Phonological disorders are characterized by substitutions, insertions and/or deletions of sounds during the process of language acquisition, which are known as Phonological Processes (PPs). In the speech therapy domain, early identification of PPs allows the diagnosis and treatment of various pathologies and may improve clinical tasks; however, there are few proposals that focus on the identification
-
Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-29 Laura Calzà; Gloria Gagliardi; Rema Rossini Favretti; Fabio Tamburini
As of 2018, almost 50 million people worldwide are living with dementia, and the number will double every 20 years. The effectiveness of existing pharmacologic treatments for the disease is limited to symptom control, and none of them is able to prevent, reverse or halt the neurodegenerative process that leads to dementia; therefore, a prompt detection of the "disease signature" is a key problem
-
TOP-Rank: A TopicalPositionRank for Extraction and Classification of Keyphrases in Text Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-17 Mubashar Nazar Awan; Mirza Omer Beg
Keyphrase extraction is the task of extracting the most important phrases from a document. Automatic keyphrase extraction attempts to itemize a document content as metainformation and facilitate efficient information retrieval. In this paper we propose TOP-Rank, an approach for keyphrase extraction and keyphrase classification. For keyphrase extraction, we build an approach based on the position of
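The position intuition behind such approaches is that words appearing early in a document are more likely to be keyphrase words. A minimal sketch, without the graph-propagation step that PositionRank-style methods add on top, weights each occurrence of a candidate by the inverse of its position:

```python
def position_scores(tokens, candidates):
    """Score candidate keyphrases by summing inverse positions of their
    occurrences, so early mentions count more. This is only the
    position-bias component of position-based ranking, not the full
    graph-based algorithm."""
    scores = {}
    for cand in candidates:
        total = 0.0
        for i, tok in enumerate(tokens):
            if tok == cand:
                total += 1.0 / (i + 1)  # weight 1 for the first token
        scores[cand] = total
    return scores
```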
-
Variational model for low-resource natural language generation in spoken dialogue systems Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-07 Van-Khanh Tran; Le-Minh Nguyen
Natural Language Generation (NLG) plays a critical role in Spoken Dialogue Systems (SDSs), aiming to convert a meaning representation into natural language utterances. Recent deep learning-based generators have shown promising results given sufficient annotated data. Nevertheless, how to build a generator that can effectively utilize as much knowledge as possible from a low-resource setting
-
Analysis of gender and identity issues in depression detection on de-identified speech Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-07 Paula Lopez-Otero; Laura Docio-Fernandez
Research in the area of automatic monitoring of emotional state from speech permits envisaging future novel applications for the remote monitoring of some common mental disorders, such as depression. However, these tools raise some privacy concerns since speech is sent via telephone or the Internet, and it is further stored or processed in remote servers. Speaker de-identification can be used to protect
-
Automatic assessment of intelligibility in speakers with dysarthria from coded telephone speech using glottal features Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-02 N.P. Narendra; Paavo Alku
In clinical practice, assessment of intelligibility in speakers with dysarthria is performed by speech-language pathologists through auditory perceptual tests which demand patients’ presence at hospital and involve time-consuming examinations. Frequent clinical monitoring can be costly and logistically inconvenient both for patients and medical experts. Here, we aim to automate the procedure of assessment
-
Emotion recognition in low-resource settings: An evaluation of automatic feature selection methods Comput. Speech Lang (IF 2.116) Pub Date : 2020-06-01 Fasih Haider; Senja Pollak; Pierre Albert; Saturnino Luz
Research in automatic affect recognition has seldom addressed the issue of computational resource utilization. With the advent of ambient intelligence technology which employs a variety of low-power, resource-constrained devices, this issue is increasingly gaining interest. This is especially the case in the context of health and elderly care technologies, where interventions may rely on monitoring
-
ASVspoof 2019: a large-scale public database of synthetized, converted and replayed speech Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-20 Xin Wang; Junichi Yamagishi; Massimiliano Todisco; Héctor Delgado; Andreas Nautsch; Nicholas Evans; Md Sahidullah; Ville Vestman; Tomi Kinnunen; Kong Aik Lee; Lauri Juvela; Paavo Alku; Yu-Huai Peng; Hsin-Te Hwang; Yu Tsao; Hsin-Min Wang; Sébastien Le Maguer; Markus Becker; Zhen-Hua Ling
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentation attack detection” systems. In addition to impersonation
-
Replay anti-spoofing countermeasure based on data augmentation with post selection Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-16 Yuanjun Zhao; Roberto Togneri; Victor Sreeram
Automatic Speaker Verification (ASV) systems have been widely applied for speaker authentication for biometric security especially in e-business scenarios. However, vulnerabilities of the ASV technology have been discovered and have generated much interest to design anti-spoofing countermeasures. Serious threats can be posed by replay attacks, which are difficult to detect and easy to mount with accessible
-
Replay spoofing countermeasure using autoencoder and siamese networks on ASVspoof 2019 challenge Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-15 Mohammad Adiban; Hossein Sameti; Saeedreza Shehnepoor
Automatic Speaker Verification (ASV) is the authentication of individuals by analyzing their speech signals. Different synthetic approaches allow spoofing to deceive ASV systems (ASVs), whether using techniques to imitate a voice or to reconstruct its features. Attackers deceive ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. The last technique is considered
-
Novel textual features for language modeling of intra-sentential code-switching data Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-08 Sreeram Ganji; Kunal Dhawan; Rohit Sinha
Code-switching refers to the frequent use of non-native language words/phrases by speakers while conversing in their native languages. Traditionally, for training a language model (LM) for code-switching data, one is required to tediously collect a large amount of text corpus in the respective code-switching domain. Alternately, we recently proposed a more viable approach that adapts an existing
-
Investigating topics, audio representations and attention for multimodal scene-aware dialog Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-06 Shachi H Kumar; Eda Okur; Saurav Sahay; Jonathan Huang; Lama Nachman
With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa and Google Home have become a ubiquitous part of every home. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. This will
-
Hybrid-task learning for robust automatic speech recognition Comput. Speech Lang (IF 2.116) Pub Date : 2020-05-03 Gueorgui Pironkov; Sean UN Wood; Stéphane Dupont
In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is most often required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, using both real and simulated data is important in order
-
tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification Comput. Speech Lang (IF 2.116) Pub Date : 2020-04-29 Blaž Škrlj; Matej Martinc; Jan Kralj; Nada Lavrač; Senja Pollak
The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: prediction
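The core idea of taxonomy-based features can be illustrated with a minimal sketch. Note the tiny `TAXONOMY` dictionary below is a hypothetical stand-in for WordNet hypernym paths, which tax2vec actually uses; the feature-selection step here is plain frequency ranking, one of several heuristics such a method might employ.

```python
from collections import Counter

# Hypothetical toy taxonomy: word -> list of ancestor concepts.
# tax2vec derives these ancestors from WordNet hypernym paths.
TAXONOMY = {
    "cat": ["feline", "mammal", "animal"],
    "dog": ["canine", "mammal", "animal"],
    "car": ["vehicle", "artifact"],
    "bus": ["vehicle", "artifact"],
}

def taxonomy_features(docs, top_k=3):
    """Count ancestor concepts over each document's words and keep the
    top_k most frequent concepts across the corpus as feature columns."""
    per_doc = [Counter(c for w in doc.split() for c in TAXONOMY.get(w, []))
               for doc in docs]
    corpus_counts = Counter()
    for counts in per_doc:
        corpus_counts.update(counts)
    selected = [c for c, _ in corpus_counts.most_common(top_k)]
    # One row per document: raw counts of each selected concept.
    return selected, [[counts[c] for c in selected] for counts in per_doc]

docs = ["cat dog", "car bus", "dog car"]
feats, X = taxonomy_features(docs)
```

Because the selected columns are named concepts ("mammal", "vehicle", ...), the resulting features stay interpretable, which is the point the paper's title emphasizes.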
-
Determination of glottal closure instants from clean and telephone quality speech signals using single frequency filtering Comput. Speech Lang (IF 2.116) Pub Date : 2020-04-29 Sudarsana Reddy Kadiri; B. Yegnanarayana
A new approach for determining the glottal activity from speech signals is presented in this paper. The approach is based on the use of single frequency filtering (SFF), proposed recently for voice activity detection. The variance (across frequency) of the spectral envelopes at each sampling instant is derived using SFF of the speech signal. The variance plot shows discontinuities corresponding to
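The variance-across-frequency idea can be sketched as follows. This is a simplified variant, not the paper's exact formulation: here each frequency is demodulated to DC and smoothed with a one-pole lowpass (pole at z = r), whereas SFF proper shifts each frequency to fs/2 and places the pole near z = -1. The sample rate, frequency set, and pole radius below are illustrative choices.

```python
import numpy as np

def sff_envelopes(x, fs, freqs, r=0.99):
    """Demodulate x to each frequency in `freqs` and smooth with a
    one-pole filter; return the magnitude envelope per frequency."""
    n = np.arange(len(x))
    envs = []
    for f in freqs:
        shifted = x * np.exp(-2j * np.pi * f * n / fs)  # bring f down to DC
        y = np.empty(len(x), dtype=complex)
        acc = 0j
        for i, s in enumerate(shifted):                 # y[i] = r*y[i-1] + s[i]
            acc = r * acc + s
            y[i] = acc
        envs.append(np.abs(y))
    return np.array(envs)                               # shape (len(freqs), len(x))

def sff_variance(x, fs, freqs):
    """Variance of the envelopes across frequency at each sampling
    instant; discontinuities in this contour mark glottal activity."""
    return np.var(sff_envelopes(x, fs, freqs), axis=0)
```

For a pure 200 Hz tone, the envelope of the 200 Hz channel dominates the others, so the variance contour responds strongly wherever the tone's energy concentrates at one frequency.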
-
Multilingual and unsupervised subword modeling for zero-resource languages Comput. Speech Lang (IF 2.116) Pub Date : 2020-04-17 Enno Hermann; Herman Kamper; Sharon Goldwater
Subword modeling for zero-resource languages aims to learn low-level representations of speech audio without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this
-
Leveraging Linguistic Context in Dyadic Interactions to Improve Automatic Speech Recognition for Children Comput. Speech Lang (IF 2.116) Pub Date : 2020-04-16 Manoj Kumar; So Hyun Kim; Catherine Lord; Thomas D Lyon; Shrikanth Narayanan
Automatic speech recognition for child speech has been long considered a more challenging problem than for adult speech. Various contributing factors have been identified such as larger acoustic speech variability including mispronunciations due to continuing biological changes in growth, developing vocabulary and linguistic skills, and scarcity of training corpora. A further challenge arises when
Contents have been reproduced by permission of the publishers.