Deep truth discovery for pattern-based fact extraction, Information Sciences



Deep truth discovery for pattern-based fact extraction
Information Sciences (IF 8.1), Pub Date: 2021-08-24, DOI: 10.1016/j.ins.2021.08.084
Chen Ye¹,², Hongzhi Wang², Wenbo Lu², Jing Gao³, Guojun Dai¹


Fact extraction, which aims to extract (entity, attribute, value) tuples from massive text corpora, is crucial in text data mining. Recent approaches extract facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated using content-based criteria such as frequency. However, these approaches overlook pattern reliability, which reflects how likely the extracted facts are to be correct. As a result, a pattern of good content quality (e.g., high frequency) may still extract incorrect facts. In this study, we consider both pattern reliability and fact trustworthiness in addressing the pattern-based fact extraction problem. To learn the complex relationship between pattern reliability and fact trustworthiness, we propose a novel deep learning model that combines CNN and LSTM architectures. For fact embedding, we adopt a CNN to extract a fixed-size representation of each component of the fact, i.e., entity, attribute, and value. For pattern embedding, we represent the pattern as a semantic composition of the representations of the facts it extracts. To de-emphasize noisy facts, we consider fact trustworthiness and frequency during pattern embedding, where the features of the tuple trustworthiness information are extracted by a long short-term memory (LSTM) model. To learn the pattern-fact relational dependency, we train the model with both pattern and tuple labels. Extensive experiments on three real-world datasets demonstrate that the proposed model significantly improves the quality of the patterns and the extracted facts in pattern-based information extraction.
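The idea of composing a pattern embedding from its extracted facts while de-emphasizing noisy ones can be illustrated with a small sketch. This is not the authors' model: the abstract does not give the composition formula, so the weighting used here (trustworthiness times log-scaled frequency, normalized to sum to one) is an assumption chosen only for illustration. The fact vectors and trust scores, which the paper obtains from a CNN and an LSTM respectively, are simply passed in as plain numbers, and the function name `pattern_embedding` is hypothetical.

```python
import math
from typing import List, Sequence


def pattern_embedding(
    fact_vectors: Sequence[Sequence[float]],
    trust: Sequence[float],
    freq: Sequence[int],
) -> List[float]:
    """Weighted average of fact vectors for one pattern.

    ASSUMPTION (not from the paper): each fact's weight is
    trust * log(1 + frequency), normalized over the pattern's facts.
    In the paper, fact vectors come from a CNN and trustworthiness
    features from an LSTM; here both are given as inputs.
    """
    if not fact_vectors:
        raise ValueError("a pattern needs at least one extracted fact")
    if not (len(fact_vectors) == len(trust) == len(freq)):
        raise ValueError("fact_vectors, trust and freq must align")

    # Unnormalized weights: low-trust or rare facts contribute less.
    raw = [t * math.log1p(f) for t, f in zip(trust, freq)]
    total = sum(raw)
    if total == 0:
        # Fall back to a uniform average when every weight is zero.
        weights = [1.0 / len(raw)] * len(raw)
    else:
        weights = [r / total for r in raw]

    dim = len(fact_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, fact_vectors))
        for i in range(dim)
    ]


# Hypothetical example: two facts extracted by the same pattern,
# the second one judged far less trustworthy than the first.
facts = [[1.0, 0.0], [0.0, 1.0]]
embedding = pattern_embedding(facts, trust=[0.9, 0.1], freq=[5, 5])
```

With equal frequencies, the resulting vector lies closer to the first fact's representation, which is the de-emphasis effect the abstract describes. A learned model would replace the fixed weighting with trained parameters.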


Updated: 2021-09-09
