DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution
arXiv - CS - Computation and Language, Pub Date: 2021-06-10, DOI: arxiv-2106.05677
Benjamin Murauer, Günther Specht

Cross-language authorship attribution relies either on translation, which enables the use of single-language features, or on language-independent feature extraction methods. Until recently, the lack of datasets for this problem hindered the development of the latter, and single-language solutions were applied to machine-translated corpora. In this paper, we present a novel language-independent feature for authorship analysis based on dependency graphs and universal part-of-speech tags, called DT-grams (dependency tree grams), which are constructed by selecting specific sub-parts of the dependency graph of sentences. We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors, showing that, on average, they achieve a macro-averaged F1 score 0.081 higher than previous methods across five different language pairs. Additionally, by reporting results for a diverse set of features, we provide a baseline for the previously undocumented task of untranslated cross-language authorship attribution.
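
To make the idea of selecting sub-parts of a dependency graph concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which sub-parts DT-grams use, so the two shapes below (an ancestor chain and a head with two adjacent children) are illustrative assumptions, as are the names `Token`, `ancestor_grams`, and `sibling_grams`. The example sentence is annotated by hand rather than produced by a parser.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    upos: str   # universal part-of-speech tag
    head: int   # index of the syntactic head, 0 for the root


def ancestor_grams(sentence, n):
    """Hypothetical shape: chains of n UPOS tags following head links upward."""
    by_idx = {tok.idx: tok for tok in sentence}
    grams = []
    for tok in sentence:
        chain, current = [tok.upos], tok
        while len(chain) < n and current.head != 0:
            current = by_idx[current.head]
            chain.append(current.upos)
        if len(chain) == n:  # skip tokens too close to the root
            grams.append(tuple(chain))
    return grams


def sibling_grams(sentence):
    """Hypothetical shape: head tag plus the tags of two adjacent children."""
    by_idx = {tok.idx: tok for tok in sentence}
    children = {}
    for tok in sentence:
        children.setdefault(tok.head, []).append(tok)
    grams = []
    for head_idx, kids in children.items():
        if head_idx == 0:  # the root has no head token
            continue
        kids.sort(key=lambda t: t.idx)
        for left, right in zip(kids, kids[1:]):
            grams.append((by_idx[head_idx].upos, left.upos, right.upos))
    return grams


# "The cat chased a mouse", annotated by hand (an assumption, not parser output)
SENTENCE = [
    Token(1, "DET", 2),
    Token(2, "NOUN", 3),
    Token(3, "VERB", 0),
    Token(4, "DET", 5),
    Token(5, "NOUN", 3),
]

# Count the tree-shaped tag patterns; counts like these could feed a classifier.
features = Counter(ancestor_grams(SENTENCE, 2)) + Counter(sibling_grams(SENTENCE))
print(features)
```

Because the patterns contain only universal part-of-speech tags and tree structure, and no words, the same feature space can in principle be shared by texts written in different languages.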

Updated: 2021-06-11