Authorship Identification on Limited Samplings, Computers & Security



Authorship Identification on Limited Samplings
Computers & Security (IF 4.8), Pub Date: 2020-10-01, DOI: 10.1016/j.cose.2020.101943
Tudor Boran, Muhamet Martinaj, Md Shafaeat Hossain

Abstract: The internet has changed the way that many people access written works. Books and articles of various lengths and in several formats can be bought and accessed online, both legally and illegally. Even shorter texts originate in forums, SMS, blogs, emails, and social media. Automating the process of determining the authorship of posted texts would help combat plagiarism and online piracy of copyrighted text. In addition, authorship identification could help detect fraudulent email messages from dangerous sources and combat cyberattacks by identifying authentic sources. We experiment with several machine learning algorithms on a limited set of public domain literature to identify the most efficient method of authorship identification using the fewest samples. Data sets of different sizes are created by 5 predefined rounds of random sampling of 1500-word blocks from a total of 28 text books in a corpus of 7 authors. Traditional methods of authorship identification, such as Naive Bayes, Artificial Neural Network, and Support Vector Machine, are implemented in addition to a modern Deep Learning Neural Network for classification. Thirteen stylometric features are extracted, spanning character-based, word-based, and syntactic features. Our model consistently showed that Support Vector Machine outperforms the other classification methods.
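The sampling and feature-extraction steps described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_blocks` and `stylometric_features` are hypothetical, the four features shown are common stylometric measures chosen for illustration (the abstract does not list the thirteen features used), and sampling with replacement, so that blocks may overlap, is an assumption.

```python
import random
import re
import string

BLOCK_SIZE = 1500  # words per block, as stated in the abstract


def sample_blocks(text, n_blocks, block_size=BLOCK_SIZE, seed=0):
    """Draw n_blocks random contiguous blocks of block_size words.

    Assumption: start positions are drawn independently, so blocks
    may overlap. The abstract does not say how overlap is handled.
    """
    words = text.split()
    if len(words) < block_size:
        raise ValueError("text is shorter than one block")
    rng = random.Random(seed)  # fixed seed makes a round reproducible
    last_start = len(words) - block_size
    starts = [rng.randint(0, last_start) for _ in range(n_blocks)]
    return [" ".join(words[s:s + block_size]) for s in starts]


def stylometric_features(block):
    """Return a few illustrative features (not the paper's thirteen)."""
    # Strip surrounding punctuation so word lengths count letters only.
    tokens = [w.strip(string.punctuation) for w in block.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        raise ValueError("block contains no words")
    # Crude sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", block) if s.strip()]
    n_punct = sum(c in string.punctuation for c in block)
    return {
        # word-based features
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len({t.lower() for t in tokens}) / len(tokens),
        # character-based feature
        "punctuation_ratio": n_punct / len(block),
        # rough syntactic proxy
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
    }
```

Each block's feature dictionary could then be turned into a numeric vector and passed to whichever classifier is being compared, with the author as the label.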


Updated: 2020-10-01
