An auxiliary Part-of-Speech tagger for blog and microblog cyber-slang

Statistical Analysis and Data Mining (IF 1.3), Pub Date: 2022-09-06, DOI: 10.1002/sam.11596
Silvia Golia, Paola Zola

The growing reach of Web 2.0 brings increasing use of slang, abbreviations, and emphasized words, which limits the performance of traditional natural language processing models. State-of-the-art Part-of-Speech (POS) taggers are often unable to assign a meaningful POS tag to every word in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns a POS tag to a given token based on information derived from the sequence of preceding and following POS tags. Its main advantage is that it needs no information about the tokens themselves, since it relies only on sequences of existing POS tags. The tagger is called auxiliary because it requires an initial POS tagging step, which might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. It relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, a general linguistics corpus; on the modern ARK dataset, composed of Twitter messages; and on a corpus of manually labeled Web 2.0 data.
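
To make the idea of tagging from context alone more concrete, here is a minimal Python sketch. It is not the authors' Bayesian network: it is a much simpler stand-in that counts how often each tag appears between a given previous tag and next tag, then fills an untagged position with the most frequent one. The function names, the `UNK` placeholder, the `<s>`/`</s>` boundary markers, and the toy tag sequences are all assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter, defaultdict

def train_context_model(tagged_sentences):
    """Count how often each tag occurs between a (previous, next) tag pair."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        # Hypothetical boundary markers so first and last tags have a context.
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i + 1])][padded[i]] += 1
    return counts

def fill_unknown_tags(tags, counts, unknown="UNK"):
    """Replace each unknown tag with the most frequent tag seen in its context.

    Only the neighbouring tags are used, never the tokens themselves.
    Positions whose context was never observed are left unchanged.
    """
    padded = ["<s>"] + list(tags) + ["</s>"]
    result = list(tags)
    for i in range(1, len(padded) - 1):
        if padded[i] == unknown:
            seen = counts.get((padded[i - 1], padded[i + 1]))
            if seen:
                result[i - 1] = seen.most_common(1)[0][0]
    return result

# Toy tag sequences, invented for this example (not data from the paper).
training = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
]
model = train_context_model(training)
print(fill_unknown_tags(["DET", "UNK", "VERB"], model))
```

In this sketch the initial tagger is assumed to have already produced the surrounding tags and marked the slang token as `UNK`. A Bayesian network, as described in the abstract, would model the dependencies among neighbouring tags probabilistically instead of using raw counts over a single tag on each side.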


Updated: 2022-09-06
