Hierarchical LSTM with char-subword-word tree-structure representation for Chinese named entity recognition
Science China Information Sciences (IF 8.8), Pub Date: 2020-09-16, DOI: 10.1007/s11432-020-2982-y
Chen Gong, Zhenghua Li, Qingrong Xia, Wenliang Chen, Min Zhang

Chinese named entity recognition (CNER) aims to identify entity names, such as person and organization names, in raw Chinese text, making it possible to quickly extract the entity information of interest from large-scale text. Recent studies attempt to improve performance by integrating lexicon words into char-based CNER models. These studies, however, usually focus on leveraging context-free lexicon words without considering the contextual information of words and subwords in the sentence. To address this issue, in addition to using lexicon words, we propose constructing a hierarchical tree-structure representation for each sentence, composed of characters, subwords, and context-aware words predicted by a segmenter. Based on this tree-structure representation, we propose a hierarchical long short-term memory (HiLSTM) framework, consisting of a hierarchical encoding layer, a fusion layer, and a CRF layer, to capture linguistic knowledge at different levels. On the one hand, the interactions within each level help capture contextual information. On the other hand, propagation from the lower levels to the upper levels provides additional semantic knowledge for CNER. Experimental results on three widely used CNER datasets show that the proposed HiLSTM model achieves significant improvements over several strong benchmark methods.
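The abstract does not spell out how the char-subword-word tree is built, so the following is only an illustrative sketch of what such a three-level structure could look like. It assumes the sentence has already been segmented into words, and it treats subwords as lexicon matches found inside each word by forward maximum matching. The names `Node`, `split_subwords`, and `build_tree`, the matching strategy, and the toy lexicon are all hypothetical and are not taken from the paper; the LSTM encoding, fusion, and CRF layers are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    level: str   # "word", "subword", or "char"
    text: str
    start: int   # character offset in the sentence (inclusive)
    end: int     # character offset in the sentence (exclusive)
    children: List["Node"] = field(default_factory=list)


def split_subwords(word: str, lexicon: Set[str]) -> List[str]:
    """Split a word into subwords by forward maximum matching.

    Assumption (not from the paper): a subword is a lexicon entry strictly
    shorter than the word; characters with no match become 1-char subwords.
    """
    if len(word) == 1:
        return [word]
    pieces, i = [], 0
    max_len = len(word) - 1  # never return the whole word as its own subword
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in lexicon:
                pieces.append(piece)
                i += size
                break
    return pieces


def build_tree(words: List[str], lexicon: Set[str]) -> List[Node]:
    """Build one word -> subword -> char tree per segmented word."""
    trees, offset = [], 0
    for word in words:
        word_node = Node("word", word, offset, offset + len(word))
        pos = offset
        for piece in split_subwords(word, lexicon):
            sub = Node("subword", piece, pos, pos + len(piece))
            sub.children = [
                Node("char", ch, pos + k, pos + k + 1)
                for k, ch in enumerate(piece)
            ]
            word_node.children.append(sub)
            pos += len(piece)
        trees.append(word_node)
        offset += len(word)
    return trees


# Toy input: the segmenter output and lexicon are made up for illustration.
trees = build_tree(["南京市", "长江大桥"], {"南京", "长江", "大桥"})
for word in trees:
    print(word.text, [(s.text, [c.text for c in s.children]) for s in word.children])
```

In a model along the lines the abstract describes, each level of such a tree would be encoded in its own sequence (characters, subwords, words), with lower-level representations passed upward to their parent nodes; the character offsets stored on each node are one simple way to align the levels.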




Updated: 2020-09-25