Improving Code-mixed POS Tagging Using Code-mixed Embeddings
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2). Pub Date: 2020-03-29. DOI: 10.1145/3380967
S. Nagesh Bhattu, Satya Krishna Nunna, D. V. L. N. Somayajulu, Binay Pradhan
Social media data has become an invaluable component of business analytics, yet the many nuances of social media text make the job of conventional text-analysis tools difficult. Code-mixing is a phenomenon prevalent among social media users, wherein the words used are borrowed from multiple languages, though written in the commonly understood Roman script. Existing supervised learning methods for tasks such as Part-of-Speech (POS) tagging of code-mixed social media (CMSM) text typically depend on a large amount of training data. Preparing such training data is resource-intensive, requiring expertise in multiple languages. Although preparing a small dataset is feasible, out-of-vocabulary (OOV) words pose a major difficulty when learning models from CMSM text, because the number of ways of writing non-native words in Roman script is huge. POS tagging of code-mixed text is non-trivial, as the tagger must handle the syntactic rules of multiple languages. The central research question addressed in this article is whether abundantly available unlabeled data can help resolve the difficulties that code-mixed text poses for POS tagging. We develop an approach for scraping code-mixed text and building word embeddings from it, illustrated for Bengali-English, Hindi-English, and Telugu-English code-mixing scenarios. We use a hierarchical deep recurrent neural network with a linear-chain CRF layer on top to improve POS-tagging performance on CMSM text by capturing contextual word features and character-sequence-based information. We also prepared a labeled resource for POS tagging of CMSM text by correcting 19% of the labels in an existing resource. A detailed analysis of the performance of our approach at varying levels of code-mixing is provided.
The results indicate that the F1-score of our approach with custom embeddings exceeds the CRF-based baseline by 5.81%, 5.69%, and 6.3% for Bengali, Hindi, and Telugu, respectively.
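The linear-chain CRF layer on top of the recurrent encoder chooses the globally best tag sequence rather than tagging each token independently, which is done with Viterbi decoding. A minimal, self-contained sketch of that decoding step is below; the emission scores, transition scores, and two-tag set are hypothetical illustrations, not values from the paper (in the actual model, emissions would come from the hierarchical RNN):

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag path for a linear-chain CRF.

    emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, cur_tag): score}.
    Returns (best tag sequence, its total score).
    """
    # DP table: best path score ending in each tag at the current position.
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in tags:
            # Pick the best previous tag to transition from.
            prev, score = max(
                ((p, best[p] + transitions[(p, cur)]) for p in tags),
                key=lambda x: x[1],
            )
            new_best[cur] = score + em[cur]
            ptr[cur] = prev
        best = new_best
        backpointers.append(ptr)
    # Trace the highest-scoring path backwards.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path)), best[last]


# Toy example with two hypothetical tags for a three-token sentence.
tags = ["N", "V"]
transitions = {("N", "N"): 0, ("N", "V"): 1, ("V", "N"): 1, ("V", "V"): -1}
emissions = [{"N": 2, "V": 0}, {"N": 0, "V": 2}, {"N": 2, "V": 0}]
print(viterbi(emissions, transitions, tags))  # → (['N', 'V', 'N'], 8)
```

The same dynamic program scales to any tag inventory; in a neural CRF tagger the transition table is learned jointly with the encoder, so decoding can penalize tag bigrams that are implausible in either of the mixed languages.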
