Cluster-based mention typing for named entity disambiguation
Natural Language Engineering (IF 2.5), Pub Date: 2020-08-20, DOI: 10.1017/s1351324920000443
Arda Çelebi, Arzucan Özgür

An entity mention in text such as “Washington” may correspond to many different named entities, such as the city “Washington D.C.” or the newspaper “Washington Post.” The goal of named entity disambiguation (NED) is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g., location or person) of a mentioned entity can be correctly predicted from the context, it may increase the chance of selecting the right candidate by assigning low probability to the unlikely ones. This paper proposes cluster-based mention typing for NED. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster ids are assigned as types. The hyperlinked mentions and their contexts in Wikipedia are used to obtain these cluster-based types. Then, mention typing models are trained on these mentions, which have been labeled with their cluster-based types through distant supervision. At the NED phase, the cluster-based types of a given mention are first predicted, and these types are then used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain different clusterings (and thus typing models) based on each level. As each clustering partitions the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves results that are better than or comparable to the state of the art, based on randomization tests, on four de facto standard test sets.
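
The pipeline described in the abstract (cluster entities by context, use cluster ids as types, predict a mention's type, then use the prediction when ranking candidates) can be illustrated with a small sketch. This is not the authors' implementation: the 2-d entity vectors, the candidate priors, the plain k-means routine, the distance-based stand-in for a trained typing model, and the prior-times-type-probability score are all assumptions chosen for illustration. The paper trains typing models via distant supervision and feeds the predicted types as features to a ranking model, across several contextual levels, none of which is reproduced here.

```python
import math

# Toy 2-d "context embeddings" for entities (hypothetical values, not from the paper).
ENTITY_VECS = {
    "Washington_D.C.": (0.9, 0.1),
    "Seattle": (0.8, 0.2),
    "Washington_Post": (0.1, 0.9),
    "New_York_Times": (0.2, 0.8),
}


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def kmeans(vecs, seeds, iters=10):
    """Plain k-means seeded with the named entities.

    Returns (centroids, {entity: cluster_id}); the cluster id plays the
    role of the entity's "type".
    """
    centroids = [vecs[s] for s in seeds]
    assign = {}
    for _ in range(iters):
        assign = {
            e: min(range(len(centroids)), key=lambda c: dist2(v, centroids[c]))
            for e, v in vecs.items()
        }
        for c in range(len(centroids)):
            members = [vecs[e] for e, a in assign.items() if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assign


def predict_type_probs(mention_vec, centroids):
    """Stand-in typing model (assumption): normalized exp(-distance^2) to each centroid."""
    scores = [math.exp(-dist2(mention_vec, c)) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


def rank(candidates, priors, type_probs, entity_type):
    """Pick the candidate maximizing prior * P(candidate's cluster type | context).

    A simplification (assumption): the paper uses predicted types as
    features in a learned ranking model instead of this fixed product.
    """
    return max(candidates, key=lambda e: priors[e] * type_probs[entity_type[e]])


# Step 1: cluster entities; cluster ids become types.
centroids, entity_type = kmeans(
    ENTITY_VECS, seeds=["Washington_D.C.", "Washington_Post"]
)

# Step 2: predict types for the mention "Washington" in a newspaper-like context.
mention_vec = (0.15, 0.85)  # hypothetical context vector
type_probs = predict_type_probs(mention_vec, centroids)

# Step 3: rank candidates; the prior alone would favor the city.
priors = {"Washington_D.C.": 0.6, "Washington_Post": 0.4}  # hypothetical priors
best = rank(list(priors), priors, type_probs, entity_type)
print(best)
```

In this toy setup the prior alone favors the city, but the predicted cluster type shifts the decision toward the newspaper, which is the effect the abstract attributes to mention typing.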

Updated: 2020-08-20