Authorship Attribution in Bangla literature using Character-level CNN

arXiv - CS - Computation and Language. Pub Date: 2020-01-11, DOI: arxiv-2001.05316
Aisha Khatun, Anisur Rahman, Md. Saiful Islam, Marium-E-Jannat

Characters are the smallest units of text from which stylometric signals can be extracted to determine the author of a text. In this paper, we investigate the effectiveness of character-level signals in authorship attribution of Bangla literature and show that the results are promising but improvable. The time and memory efficiency of the proposed model is much higher than that of its word-level counterparts, but its accuracy is 2-5% lower than that of the best-performing word-level models. We compare it with various word-based models and show that the proposed model performs increasingly better with larger datasets. We also analyze the effect of pre-training character embeddings for the diverse Bangla character set on authorship attribution; pre-training improves performance by up to 10%. We use 2 datasets covering 6 to 14 authors, balance them before training, and compare the results.
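The abstract does not describe the preprocessing itself, so the following is only a minimal sketch of two steps it mentions: feeding text to a model at the character level and balancing the dataset across authors before training. All names here (`build_char_vocab`, `encode`, `balance_by_author`, the `PAD`/`UNK` indices, and downsampling as the balancing strategy) are assumptions made for illustration, not the authors' method.

```python
import random
from collections import defaultdict

# Reserved indices for padding and unseen characters (assumed for this sketch).
PAD, UNK = 0, 1


def build_char_vocab(texts):
    """Map each distinct character (Unicode code point) to an index >= 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Truncate or pad text to max_len and return its character indices."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def balance_by_author(samples, seed=0):
    """Downsample so every author has as many samples as the rarest author.

    samples is a list of (text, author) pairs. Downsampling is one possible
    balancing strategy, chosen here only for simplicity.
    """
    by_author = defaultdict(list)
    for text, author in samples:
        by_author[author].append(text)
    n = min(len(texts) for texts in by_author.values())
    rng = random.Random(seed)
    balanced = []
    for author in sorted(by_author):
        for text in rng.sample(by_author[author], n):
            balanced.append((text, author))
    return balanced


if __name__ == "__main__":
    # Toy data with hypothetical author labels.
    data = [("আমি ভাত খাই", "A"), ("সে বই পড়ে", "A"), ("তুমি কোথায়", "B")]
    balanced = balance_by_author(data)
    vocab = build_char_vocab(text for text, _ in balanced)
    print([encode(text, vocab, max_len=16) for text, _ in balanced])
```

The fixed-length index sequences produced by `encode` are the kind of input a character embedding layer followed by convolutions would typically consume. Note that Python iterates over Unicode code points, so Bangla vowel signs and other combining marks become separate vocabulary entries in this sketch.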

Updated: 2020-11-06
