Statistical principle-based approach for gene and protein related object recognition.
Journal of Cheminformatics ( IF 7.1 ) Pub Date : 2018-12-17 , DOI: 10.1186/s13321-018-0314-7
Po-Ting Lai , Ming-Siang Huang , Ting-Hao Yang , Wen-Lian Hsu , Richard Tzong-Han Tsai

The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.
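The stacking idea described in the abstract, where the predictions of a first-stage tagger become input features of a CRF, can be illustrated with a minimal sketch. The abstract does not list the actual feature set, so everything below is an assumption made for illustration: the function names, the orthographic features, the BIO-style tag strings, and the `spba_tag` feature key are all hypothetical and are not taken from the SPBA-CRF system.

```python
from typing import Dict, List


def token_features(tokens: List[str], spba_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i, including the first-stage tagger's label."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:],
        # Hypothetical stacked feature: the label predicted by the upstream tagger.
        "spba_tag": spba_tags[i],
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
        feats["prev_spba_tag"] = spba_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
        feats["next_spba_tag"] = spba_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], spba_tags: List[str]) -> List[Dict[str, object]]:
    """Return one feature dict per token; both inputs must be aligned token by token."""
    if len(tokens) != len(spba_tags):
        raise ValueError("tokens and spba_tags must have the same length")
    return [token_features(tokens, spba_tags, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Made-up example sentence and made-up first-stage predictions.
    toks = ["The", "BRCA1", "protein"]
    tags = ["O", "B-GENE", "O"]
    for feat in sentence_features(toks, tags):
        print(feat)
```

A list of per-token dictionaries like this is the input shape accepted by CRF toolkits that work with feature dictionaries (sklearn-crfsuite is one example). In such a setup the CRF is free to learn how much to trust the upstream label relative to the other features.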


