Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,arXiv - CS - Information Retrieval

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies
arXiv - CS - Information Retrieval, Pub Date: 2020-12-15, DOI: arxiv-2101.03026
Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

As digital articles are published in a growing number of languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the number of scenarios in which the technique can be applied and makes it difficult to scale to situations where a huge collection of multi-lingual documents is required during the training phase. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multi-lingual concepts from independently trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
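The abstract does not spell out how the concept hierarchies are built or compared, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes that each independently trained, single-language topic model yields a per-document list of topic weights, that every topic has already been annotated with a set of language-independent concept identifiers (the `labels` dictionaries and the `c:` identifiers are hypothetical), and that documents are compared by averaging the Jaccard overlap of their concept sets level by level (a swapped-in similarity measure chosen for simplicity).

```python
from typing import Dict, List, Set

# Hypothetical type: maps a topic index of one language-specific model to
# the set of cross-lingual concept identifiers assigned to that topic.
TopicLabels = Dict[int, Set[str]]


def concept_hierarchy(topic_weights: List[float], labels: TopicLabels,
                      levels: int = 3) -> List[Set[str]]:
    """Rank a document's topics by weight, split them into `levels` tiers,
    and return the set of cross-lingual concepts found at each tier."""
    ranked = sorted(range(len(topic_weights)),
                    key=lambda t: topic_weights[t], reverse=True)
    size = -(-len(ranked) // levels)  # ceiling division: topics per tier
    hierarchy: List[Set[str]] = []
    for i in range(levels):
        concepts: Set[str] = set()
        for topic in ranked[i * size:(i + 1) * size]:
            concepts |= labels.get(topic, set())
        hierarchy.append(concepts)
    return hierarchy


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap; two empty sets count as no evidence of similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hierarchy_similarity(h1: List[Set[str]], h2: List[Set[str]]) -> float:
    """Average the per-level Jaccard overlap (an assumed comparison rule)."""
    if len(h1) != len(h2):
        raise ValueError("hierarchies must have the same number of levels")
    return sum(jaccard(a, b) for a, b in zip(h1, h2)) / len(h1)


# Toy annotations for two independently trained models (illustrative only).
en_labels: TopicLabels = {0: {"c:law", "c:trade"}, 1: {"c:fishery"}, 2: {"c:energy"}}
es_labels: TopicLabels = {0: {"c:fishery"}, 1: {"c:trade", "c:law"}, 2: {"c:energy"}}

doc_en = concept_hierarchy([0.70, 0.20, 0.10], en_labels)
doc_es_close = concept_hierarchy([0.15, 0.75, 0.10], es_labels)
doc_es_far = concept_hierarchy([0.80, 0.15, 0.05], es_labels)

print(hierarchy_similarity(doc_en, doc_es_close))
print(hierarchy_similarity(doc_en, doc_es_far))
```

In this toy setup the two models never share a vocabulary or a training corpus. The documents become comparable only because their topics point to the same concept identifiers, which is the property the abstract attributes to the cross-lingual labels.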

Updated: 2021-01-11
