Crosslingual Topic Modeling with WikiPDA
arXiv - CS - Computation and Language. Pub Date: 2020-09-23, DOI: arxiv-2009.11207
Tiziano Piccardi, Robert West

We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning.
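The key idea above, representing articles as bags of links resolved to Wikidata concepts so that articles from any language edition share one vocabulary, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the article names, QIDs, and link lists are hypothetical, and the densification and topic-model training steps are omitted.

```python
from collections import Counter

# Hypothetical link data: each article is listed with the articles it links to,
# with every link resolved to a language-independent Wikidata concept ID (QID).
# The QIDs below are placeholders, not real Wikidata entries.
articles = {
    ("en", "Coffee"): ["Q171171", "Q25403900", "Q11002"],
    ("it", "Caffè"):  ["Q171171", "Q11002", "Q171171"],
}

def bag_of_links(links):
    """Represent an article as a bag (multiset) of Wikidata concept IDs."""
    return Counter(links)

bags = {key: bag_of_links(links) for key, links in articles.items()}

# Because the "words" are QIDs rather than language-specific tokens, articles
# from different Wikipedia editions live in one shared vocabulary, and a single
# monolingual-style topic model (e.g., LDA) can be trained over all of them.
shared_vocab = sorted(set().union(*bags.values()))

def to_vector(bag, vocab=shared_vocab):
    """Turn a bag of links into a count vector over the shared vocabulary."""
    return [bag.get(qid, 0) for qid in vocab]

en_vec = to_vector(bags[("en", "Coffee")])
it_vec = to_vector(bags[("it", "Caffè")])
```

In the paper's pipeline, these (typically sparse) bag-of-links vectors would first be densified via matrix completion before fitting a standard LDA model; this sketch only shows the shared, language-independent representation that makes that possible.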

Updated: 2020-09-24