Parallel sentence extraction to improve cross-language information retrieval from Wikipedia
Journal of Information Science (IF 1.8), Pub Date: 2021-02-10, DOI: 10.1177/0165551521992754
Juryong Cheon 1 , Youngjoong Ko 2

Translation language resources, such as bilingual word lists and parallel corpora, are important factors in the effectiveness of cross-language information retrieval (CLIR) systems. Developing an effective CLIR system is particularly difficult when large, domain-appropriate parallel corpora are not available, and creating a large parallel corpus is costly and requires considerable effort. We therefore demonstrate the construction of parallel corpora from Wikipedia and their use to improve query translation for a CLIR system. We first constructed a bilingual dictionary, termed WikiDic. We then evaluated individual language resources and combinations of them in terms of their ability to extract parallel sentences; combining our proposed WikiDic with the translation probability obtained from the Web's bilingual example sentence pairs and WikiDic was found to be best suited to parallel sentence extraction. Finally, to evaluate the parallel corpus generated from this best combination of language resources, we compared its performance in query translation for CLIR with that of a manually created English–Korean parallel corpus. The corpus generated by our proposed method outperformed the manually created corpus, demonstrating the effectiveness of the proposed method for automatic parallel corpus extraction. The method can inform the construction of other parallel corpora from readily available language resources, and the parallel sentence extraction method should naturally improve as Wikipedia continues to be used and its content develops.
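To make the idea of dictionary-based parallel sentence extraction concrete, the following is a minimal sketch and not the authors' method: the abstract does not specify the scoring function, so a simple dictionary-coverage score is assumed here. The names `TOY_DICT`, `coverage_score`, `extract_parallel`, the `threshold` value, and the example sentences are all hypothetical, and the sentences are assumed to be already tokenized.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy English -> Korean dictionary. These entries are
# illustrative only and are NOT taken from WikiDic.
TOY_DICT: Dict[str, Set[str]] = {
    "seoul": {"서울"},
    "capital": {"수도"},
    "korea": {"한국"},
}


def coverage_score(
    src_tokens: List[str],
    tgt_tokens: List[str],
    dictionary: Dict[str, Set[str]],
) -> float:
    """Fraction of source tokens with a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if dictionary.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def extract_parallel(
    src_sents: List[List[str]],
    tgt_sents: List[List[str]],
    dictionary: Dict[str, Set[str]],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[int, int, float]]:
    """Pair each source sentence with its best-scoring target sentence.

    Returns (source index, target index, score) for pairs whose score
    reaches the threshold.
    """
    pairs = []
    for i, src in enumerate(src_sents):
        scored = [
            (coverage_score(src, tgt, dictionary), j)
            for j, tgt in enumerate(tgt_sents)
        ]
        if not scored:
            continue
        best_score, best_j = max(scored)
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Pre-tokenized toy sentences from two hypothetical aligned articles.
EN = [
    ["seoul", "is", "the", "capital", "of", "korea"],
    ["the", "weather", "is", "nice"],
]
KO = [
    ["날씨", "가", "좋다"],
    ["서울", "은", "한국", "의", "수도", "이다"],
]

PAIRS = extract_parallel(EN, KO, TOY_DICT)
print(PAIRS)  # [(0, 1, 0.5)]
```

In this sketch only the first English sentence is paired, because three of its six tokens have a dictionary translation in the second Korean sentence. A scorer closer to the abstract's description would presumably also weight matches by translation probabilities estimated from bilingual example sentence pairs, which this toy version omits.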




Updated: 2021-02-11