Author identification of short texts using dependency treebanks without vocabulary, Digital Scholarship in the Humanities



Digital Scholarship in the Humanities (IF 0.7), Pub Date: 2019-10-24, DOI: 10.1093/llc/fqz070
Robert Gorman


How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid the confounding effects of shared topic and similar factors. Therefore, this study attempts to identify authorship using only morpho-syntactic data, without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and of its dependency parent. To avoid the effects of combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with two standard algorithms: logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for the possible confounding effects of, e.g., different genres and annotators, three corpora were tested: a mixed corpus of several genres of both prose and verse; a corpus of prose including oratory, history, and essay; and a corpus restricted to narrative history. Results are surprisingly good compared with those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
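The feature construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scheme, so everything below is an assumption made for illustration: the toy sentence, the label and tag strings, the helper names (token_attributes, combination_features, chunk, vectorize), the choice to combine at most two attributes, and the use of relative frequencies. It is not the author's implementation. The resulting vectors would be the kind of input a logistic regression or support vector classifier could take.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy tokens: (id, head_id, dependency label, morphology tag).
# A head_id of 0 marks the root, which has no parent. No word forms are stored.
SENTENCE = [
    (1, 2, "SBJ", "n-s---mn-"),
    (2, 0, "PRED", "v3spia---"),
    (3, 2, "OBJ", "n-s---fa-"),
]


def token_attributes(sentence):
    """Yield, for each token, its own and its parent's label and morphology."""
    by_id = {tok[0]: tok for tok in sentence}
    for _tid, head, label, morph in sentence:
        parent = by_id.get(head)
        yield {
            "self_label": label,
            "self_morph": morph,
            "parent_label": parent[2] if parent else "ROOT",
            "parent_morph": parent[3] if parent else "ROOT",
        }


def combination_features(attrs, max_size=2):
    """Return every combination of up to max_size attribute=value pairs."""
    items = sorted(attrs.items())
    feats = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            feats.append("|".join(f"{k}={v}" for k, v in combo))
    return feats


def chunk(tokens, size):
    """Split a token list into consecutive 'texts' of `size` tokens,
    dropping any incomplete remainder."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def vectorize(chunks, top_k):
    """Keep the top_k most frequent features over all chunks and return
    the vocabulary plus one relative-frequency vector per chunk."""
    per_chunk = [Counter(f for feats in c for f in feats) for c in chunks]
    total = Counter()
    for counts in per_chunk:
        total.update(counts)
    vocab = [f for f, _ in total.most_common(top_k)]
    vectors = [
        [counts[f] / len(c) for f in vocab]
        for counts, c in zip(per_chunk, chunks)
    ]
    return vocab, vectors


token_feats = [combination_features(a) for a in token_attributes(SENTENCE)]
# One-token 'texts' only because the toy sentence is tiny; the study
# reports results for inputs as small as fifty words.
texts = chunk(token_feats, 1)
vocab, vectors = vectorize(texts, top_k=5)
```

Restricting the vocabulary to the top_k most frequent combinations is what keeps the feature space manageable here: with four attributes and combinations of up to two, each token already yields ten features, and the count grows quickly as more attributes are combined.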


Updated: 2019-10-24
