Novel textual features for language modeling of intra-sentential code-switching data
Computer Speech & Language (IF 3.1), Pub Date: 2020-05-08, DOI: 10.1016/j.csl.2020.101099
Sreeram Ganji, Kunal Dhawan, Rohit Sinha

Code-switching refers to the frequent use of non-native language words or phrases by speakers while conversing in their native language. Traditionally, training a language model (LM) for code-switching data requires tediously collecting a large text corpus in the respective code-switching domain. Alternatively, we recently proposed a more viable approach that adapts an existing native LM to handle code-switching data. In this work, we present our efforts toward language modeling of code-switching data following both the traditional and the proposed approaches. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus, (ii) an improved parts-of-speech (POS) labeling scheme for accurate tagging of non-native words embedded in code-switching data, and (iii) the proposal of a novel textual feature, referred to as the code-switching location (CSL) feature, that allows LMs to predict code-switching instances. The proposed features are evaluated on two code-switching datasets: Hindi-English and Mandarin-English. In the experiments, the improved POS features yield a substantial reduction in perplexity. The proposed CSL features are also observed to provide an independent and additive improvement in perplexity over the POS features.
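To make the two central notions of the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not define how the CSL feature is encoded, so the binary "a switch follows this token" flag below is an assumption made purely for illustration, and the helper names `mark_switch_points` and `perplexity` are hypothetical. The perplexity function uses the standard definition, the exponential of the mean negative log-probability per token.

```python
import math


def mark_switch_points(lang_tags):
    """Flag each token with 1 if the NEXT token is in a different language.

    Assumption: this binary encoding is illustrative only; the paper's
    actual CSL feature may be defined differently.
    """
    flags = []
    for i, tag in enumerate(lang_tags):
        # The last token has no successor, so it can never precede a switch.
        next_tag = lang_tags[i + 1] if i + 1 < len(lang_tags) else tag
        flags.append(1 if next_tag != tag else 0)
    return flags


def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token language tags for a Hindi-English sentence.
tags = ["hi", "hi", "en", "en", "hi"]
flags = mark_switch_points(tags)

# Hypothetical per-token probabilities assigned by some language model.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

In a setup like this, the flags would be supplied to the LM as an extra input or prediction target alongside the words and POS tags, and a lower perplexity on held-out code-switching text would indicate a better model.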




Updated: 2020-05-08