SwitchNet: Learning to switch for word-level language identification in code-mixed social media text
Natural Language Engineering (IF 2.5), Pub Date: 2021-06-03, DOI: 10.1017/s1351324921000115
Neelakshi Sarma 1, Ranbir Sanasam Singh 2, Diganta Goswami 2

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification report two important observations. First, the local context is an important indicator of the language of a word when that word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that uses a dynamic switching mechanism to classify effectively both words that are borrowed or embedded from other languages and words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained from the contextual information or towards that obtained from the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extensible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, and an ensemble of the two classifiers.
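
The abstract does not give the form of the switching mechanism, so the following is only a minimal sketch of one way such a bias could be expressed: a scalar gate in [0, 1] that blends the contextual prediction with the word-in-isolation prediction. The function names `switch_predict` and `argmax_label`, the soft-gate formulation, and the placeholder labels are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def switch_predict(p_context: Sequence[float],
                   p_word: Sequence[float],
                   gate: float) -> List[float]:
    """Blend two per-language probability distributions.

    Assumed formulation (not from the paper): a convex combination in which
    gate = 1.0 trusts the contextual classifier fully and gate = 0.0 trusts
    the word-in-isolation classifier fully.
    """
    if len(p_context) != len(p_word):
        raise ValueError("both distributions must cover the same labels")
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * c + (1.0 - gate) * w for c, w in zip(p_context, p_word)]


def argmax_label(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest blended probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


if __name__ == "__main__":
    labels = ["lang_a", "lang_b"]   # hypothetical placeholder labels
    p_context = [0.9, 0.1]          # made-up contextual prediction
    p_word = [0.2, 0.8]             # made-up word-in-isolation prediction
    # A low gate biases the decision towards the word-in-isolation view,
    # as one might want for a borrowed or embedded word.
    print(argmax_label(switch_predict(p_context, p_word, 0.1), labels))
```

In a learned model the gate would itself be predicted per input rather than fixed by hand; here it is a plain argument so the effect of the bias is easy to see.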




Updated: 2021-06-03