Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching
arXiv - CS - Information Retrieval. Pub Date: 2021-01-16, DOI: arXiv-2101.06423
Liang Pang, Yanyan Lan, Xueqi Cheng

Semantic text matching models are widely used in community question answering, information retrieval, and dialogue. However, these models do not handle the long-form text matching problem well. Long-form text typically contains a great deal of noise, and it is difficult for existing semantic matching models to capture the key matching signals amid this noisy information. These models are also computationally expensive, because they use all of the textual data indiscriminately during matching. To tackle both the effectiveness and the efficiency problems, we propose a novel hierarchical noise-filtering model, namely Match-Ignition. The basic idea is to plug the well-known PageRank algorithm into the Transformer to identify and filter out both sentence-level and word-level noise during matching. Noisy sentences are usually easy to detect, since the sentence is the basic unit of a long-form text, so we filter them directly by applying PageRank to a sentence-similarity graph. Words, in contrast, rely on their context to express concrete meanings, so we propose to jointly learn the filtering and matching processes to reflect the contextual dependencies between words. Specifically, a word graph is first built from the attention scores in each self-attention block of the Transformer, and keywords are then selected by applying PageRank to this graph. In this way, noisy words are filtered out layer by layer during matching. Experimental results show that Match-Ignition outperforms both traditional short-text matching models and recent long-form text matching models. We also conduct a detailed analysis showing that Match-Ignition efficiently captures the important sentences and words that are helpful for long-form text matching.
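The sentence-level filtering step described in the abstract can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the `embed` function, the cosine-similarity graph construction, the damping factor, and the `keep_ratio` cutoff are all assumptions made here for the sake of a runnable example.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over a weighted adjacency matrix."""
    n = adj.shape[0]
    # Column-normalize so each node distributes its weight to its neighbors.
    col_sums = adj.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    trans = adj / col_sums
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1.0 - damping) / n + damping * (trans @ rank)
    return rank

def filter_sentences(sentences, embed, keep_ratio=0.5):
    """Score sentences by PageRank on a cosine-similarity graph
    and keep only the top fraction, in their original order."""
    vecs = np.stack([embed(s) for s in sentences]).astype(float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T          # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)   # no self-loops in the sentence graph
    scores = pagerank(sim)
    k = max(1, int(len(sentences) * keep_ratio))
    keep = sorted(np.argsort(scores)[::-1][:k])
    return [sentences[i] for i in keep]
```

Sentences that are similar to many other sentences accumulate a high PageRank score and are kept, while off-topic (noisy) sentences fall below the cutoff. The word-level stage works analogously but, per the abstract, builds its graph from the Transformer's self-attention scores and is learned jointly with matching, which this standalone sketch does not cover.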

Updated: 2021-01-19