Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach
Journal of Cheminformatics (IF 8.6), Pub Date: 2022-08-13, DOI: 10.1186/s13321-022-00633-4
O A Tarasova, A V Rudik, N Yu Biziukova, D A Filimonov, V V Poroikov

Chemical named entity recognition (CNER) algorithms retrieve chemical compound identifiers from texts and allow them to be associated with physical–chemical properties and biological activities. Scientific texts are low-formalized sources of information. Most CNER methods are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either a vector or a sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, so the datasets used for training are highly imbalanced. We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences of one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, five-fold cross-validation gave a sensitivity (recall) of 0.95, a precision of 0.74, a specificity of 0.88, and a balanced accuracy of 0.92. We applied the developed algorithm to extract the CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the corresponding texts showed that our method successfully identified the CNEs of potential SARS-CoV-2 Mpro inhibitors. The results show that the proposed method can filter out words that are not related to CNEs; therefore, it can be applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.
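
As a rough illustration of the ideas in the abstract, the sketch below builds multi-n-grams (all character sequences of length one to n) for a fragment of text and scores the fragment with a generic multinomial naïve Bayes log-odds using add-one smoothing. This is not the authors' implementation: the function names, the toy training fragments, the smoothing scheme, and the choice n = 3 are all assumptions made for illustration, and the specially developed filters mentioned in the abstract are not modelled. The last helper only checks the arithmetic of the reported balanced accuracy, taken as the mean of sensitivity and specificity.

```python
import math
from collections import Counter


def multi_n_grams(fragment, n):
    """Return every character substring of length 1..n of a fragment of text."""
    return [fragment[i:i + k]
            for k in range(1, n + 1)
            for i in range(len(fragment) - k + 1)]


def train(examples, n=3):
    """Count multi-n-grams per class from (fragment, is_cne) pairs (toy model)."""
    gram_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for fragment, is_cne in examples:
        class_counts[is_cne] += 1
        gram_counts[is_cne].update(multi_n_grams(fragment, n))
    return gram_counts, class_counts


def log_odds(fragment, model, n=3):
    """Naive Bayes log-odds that a fragment is a CNE, with add-one smoothing."""
    gram_counts, class_counts = model
    vocab = len(set(gram_counts[True]) | set(gram_counts[False]))
    totals = {c: sum(gram_counts[c].values()) for c in (True, False)}
    # Start from the class prior, then add one independent term per n-gram.
    score = math.log(class_counts[True] / class_counts[False])
    for gram in multi_n_grams(fragment, n):
        p_cne = (gram_counts[True][gram] + 1) / (totals[True] + vocab)
        p_other = (gram_counts[False][gram] + 1) / (totals[False] + vocab)
        score += math.log(p_cne / p_other)
    return score


def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Hypothetical toy data, not taken from the CHEMDNER corpus.
toy_examples = [("ethanol", True), ("benzene", True),
                ("protein", False), ("method", False)]
toy_model = train(toy_examples)
print(multi_n_grams("abc", 2))
print(log_odds("methanol", toy_model))
print(balanced_accuracy(0.95, 0.88))
```

Because the features are character sequences rather than whole words, a fragment that never appeared in training can still be scored from the n-grams it shares with known names, which is one plausible reading of why the approach may recognise novel CNEs. With the reported sensitivity of 0.95 and specificity of 0.88, the mean is about 0.915, consistent with the stated balanced accuracy of 0.92.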

Updated: 2022-08-13