Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features,arXiv - CS - Computation and Language



arXiv - CS - Computation and Language. Pub Date: 2020-03-24, DOI: arxiv-2003.11545
Nicole Mariah Sharon Belvisi, Naveed Muhammad, Fernando Alonso-Fernandez

In recent years, messages and text posted on the Internet have been used in criminal investigations. Unfortunately, the authorship of many of them remains unknown. In some channels, establishing authorship may be even harder, since the length of digital texts is limited to a certain number of characters. In this work, we aim to identify the authors of tweets, which are limited to 280 characters. We evaluate popular features traditionally employed in authorship attribution, which capture properties of the writing style at different levels. For our experiments, we use a self-captured database of 40 users, with 120 to 200 tweets per user. Results on this small set are promising, with the different features providing classification accuracies between 92% and 98.5%. These results are competitive with existing studies that employ short texts such as tweets or SMS.
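The abstract does not list the exact features or the classifier, so the following is only a minimal sketch of the general idea: character n-gram profiles plus a handful of simple stylometric measurements, with attribution by cosine similarity to per-author profiles. Every function name, the choice of n = 3, the specific style features, and the nearest-profile rule are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
import math


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def stylometric_features(text: str) -> dict:
    """A few simple style measurements (illustrative, not the paper's feature set)."""
    words = text.split()
    n_chars = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "length": len(text),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "uppercase_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "punctuation_ratio": sum(
            not c.isalnum() and not c.isspace() for c in text
        ) / n_chars,
    }


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(tweets_by_author: dict, n: int = 3) -> dict:
    """Merge the n-gram counts of each author's tweets into one profile."""
    profiles = {}
    for author, tweets in tweets_by_author.items():
        profile = Counter()
        for tweet in tweets:
            profile.update(char_ngrams(tweet, n))
        profiles[author] = profile
    return profiles


def attribute(tweet: str, profiles: dict, n: int = 3) -> str:
    """Return the author whose profile is most similar to the tweet (assumed rule)."""
    query = char_ngrams(tweet, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))
```

In use, `build_profiles` would be given a mapping from each user to that user's tweets, and `attribute` would then be called on an unseen tweet. The dictionary returned by `stylometric_features` could be appended to the n-gram counts as extra inputs to any standard classifier; which combination works best is an empirical question that this sketch does not answer.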


Updated: 2020-10-19
