
Automatic Recognition of Chemical Entity Mentions in Texts of Scientific Publications

  • AUTOMATED TEXT PROCESSING
Automatic Documentation and Mathematical Linguistics

Abstract

The vast body of experimental data on biological and chemical objects and their interactions drives rapid growth in the number of scientific publications that analyze them. Such data may include characteristics of low-molecular-weight compounds, results of the evaluation of their biological activity, their interactions with human and animal proteins, methods of synthesis of organic compounds, and their classification. Over the past decades, methods have been developed for the automated extraction of data from the texts of scientific publications, including methods for retrieving the names of organic compounds. These data can be used for the automatic identification of the names of organic compounds, including all possible synonyms. Since the topics of scientific publications are diverse, the extracted data can be applied to obtain information about (1) the classification of organic compounds; (2) methods of synthesis of a given organic compound; (3) the physicochemical properties of this compound; (4) its interactions with high-molecular-weight compounds (including animal and human proteins and mRNA); and (5) the therapeutic properties of organic compounds, the active substance of a drug, and data on clinical trials. This review considers methods for searching for and extracting data on the names of low-molecular-weight compounds and on their interactions with animal and human proteins (biological objects), as well as data on experimentally confirmed biological activity and the effects of organic compounds (including drugs) on pathological processes. We discuss the methods developed and the results of their application published over the past 10 years.

Funding

The study was financed by the Russian Science Foundation (project no. 19-15-00396).

Author information

Corresponding authors

Correspondence to N. Yu. Biziukova, O. A. Tarasova, A. V. Rudik, D. A. Filimonov or V. V. Poroikov.

Ethics declarations

The authors declare that they have no conflicts of interest.

About this article

Cite this article

Biziukova, N.Y., Tarasova, O.A., Rudik, A.V. et al. Automatic Recognition of Chemical Entity Mentions in Texts of Scientific Publications. Autom. Doc. Math. Linguist. 54, 306–315 (2020). https://doi.org/10.3103/S0005105520060023

  • DOI: https://doi.org/10.3103/S0005105520060023
