Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text



arXiv - CS - Computation and Language. Pub Date: 2020-11-23, DOI: arxiv-2011.11263
Ramchandra Joshi, Raviraj Joshi

Natural language processing (NLP) techniques have become mainstream over the past decade. Most of these advances concern the processing of a single language. More recently, with the extensive growth of social media platforms, the focus has shifted to code-mixed text, that is, text written in more than one language. People naturally tend to combine a local language with a global language such as English. Current NLP techniques are not sufficient to process such text. As a first step, the language of each word in the text is identified. In this work, we focus on language identification in Hindi-English code-mixed sentences. The task is formulated as token classification: in the supervised setting, each word in the sentence has an associated language label. We evaluate different combinations of deep learning models and input representations for this task. Mainly, character, sub-word, and word embeddings are considered in combination with CNN- and LSTM-based models. We show that the sub-word representation combined with the LSTM model gives the best results. In general, sub-word representations perform significantly better than the other input representations. We report a best accuracy of 94.52% using a single-layer LSTM model on the standard SAIL ICON 2017 test set.
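The token-classification framing and the three input granularities described in the abstract can be illustrated with a small sketch. It is not the authors' pipeline. The example sentence, the tag names `HI` and `EN`, and all function names are hypothetical. Character n-grams stand in for a learned sub-word vocabulary, since the abstract does not say which sub-word scheme was used. The CNN and LSTM models are omitted; only the input views and the token-level accuracy metric are shown.

```python
from typing import Dict, List, Tuple

# Hypothetical code-mixed sentence with one language label per word.
# Neither the sentence nor the tag names come from the SAIL ICON 2017 data.
SENTENCE: List[Tuple[str, str]] = [
    ("yeh", "HI"), ("movie", "EN"), ("bahut", "HI"), ("achhi", "HI"), ("hai", "HI"),
]


def char_view(word: str) -> List[str]:
    """Character-level input: one symbol per character."""
    return list(word)


def subword_view(word: str, n: int = 3) -> List[str]:
    """Sub-word input, approximated here by character n-grams.

    Assumption: n-grams with boundary markers stand in for whatever
    learned sub-word vocabulary a real system would use.
    """
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_view(word: str) -> List[str]:
    """Word-level input: the lowercased word as a single unit."""
    return [word.lower()]


def build_inputs(
    sentence: List[Tuple[str, str]],
) -> Tuple[Dict[str, List[List[str]]], List[str]]:
    """Return the three per-word input views and the aligned label sequence."""
    words = [w for w, _ in sentence]
    labels = [lab for _, lab in sentence]
    views = {
        "char": [char_view(w) for w in words],
        "subword": [subword_view(w) for w in words],
        "word": [word_view(w) for w in words],
    }
    return views, labels


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of words whose predicted language label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Calling `build_inputs(SENTENCE)` yields one list of units per word for each view, aligned with one label per word. A sequence model would consume one of these views and emit one label per word, which `token_accuracy` then scores against the gold labels.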


Updated: 2020-11-25
