Authorship Attribution in Bangla literature using Character-level CNN
arXiv - CS - Computation and Language. Pub Date: 2020-01-11. DOI: arxiv-2001.05316
Aisha Khatun, Anisur Rahman, Md. Saiful Islam, Marium-E-Jannat
Characters are the smallest unit of text from which stylometric signals can be
extracted to determine the author of a text. In this paper, we investigate the
effectiveness of character-level signals in Authorship Attribution of Bangla
Literature and show that the results are promising but improvable. The proposed
model is much more time- and memory-efficient than its word-level counterparts,
but its accuracy is 2-5% lower than that of the best-performing word-level
models. We compare it against various word-based models and show that the
proposed model performs increasingly better with larger datasets. We also
analyze the effect of pre-training character embeddings of the diverse Bangla
character set on authorship attribution. Pre-training improves performance by
up to 10%. We use 2 datasets covering 6 to 14 authors, balance them before
training, and compare the results.
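
The abstract does not describe the preprocessing in detail, so the following is only a minimal sketch of how text might be prepared for a character-level model and how a dataset might be balanced across authors. All names (`build_char_vocab`, `encode`, `balance`, `PAD`, `UNK`) are hypothetical, "character" is assumed to mean a Unicode code point (so Bangla vowel signs count as separate characters), and balancing is assumed to mean downsampling every author to the size of the smallest class. None of these choices is confirmed by the paper.

```python
import random
from collections import defaultdict

PAD, UNK = 0, 1  # reserved indices; an assumption of this sketch


def build_char_vocab(texts):
    """Map every distinct code point to an integer index after PAD and UNK."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Turn text into a fixed-length list of indices (truncate, then pad)."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def balance(samples, seed=0):
    """Downsample each author to the size of the smallest class.

    `samples` is a list of (text, author) pairs.
    """
    by_author = defaultdict(list)
    for text, author in samples:
        by_author[author].append(text)
    n = min(len(texts) for texts in by_author.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for author, texts in sorted(by_author.items()):
        balanced.extend((t, author) for t in rng.sample(texts, n))
    return balanced
```

The index sequences returned by `encode` are the kind of input an embedding layer followed by 1-D convolutions would consume. The network itself is omitted here because the abstract gives no architectural details.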
Updated: 2020-11-06