An integrated pipeline model for biomedical entity alignment

  • Research Article
  • Frontiers of Computer Science

Abstract

Biomedical entity alignment, which comprises two sub-tasks, entity identification and entity-concept mapping, is of great research value in biomedical text mining, as these techniques are widely used for named entity standardization, information retrieval, knowledge acquisition, and ontology construction.

Previous work invested heavily in feature engineering to build feature-based models for entity identification and alignment. However, models that depend on subjective feature selection may suffer from error propagation and cannot exploit hidden information. With the rapid development of health-related research, researchers need an effective method to explore the large amount of available biomedical literature.

Therefore, we propose a two-stage entity alignment process, the biomedical entity exploring model, which identifies biomedical entities and aligns them to the knowledge base interactively. The model aims to automatically obtain semantic information for extracting biomedical entities and to mine semantic relations through the standard biomedical knowledge base. Experiments show that the proposed method achieves better performance on entity alignment, improving the F1 score by about 4.5% in entity identification and by about 2.5% in entity-concept mapping.
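To make the two sub-tasks and the reported metric concrete, the following is a minimal sketch of a two-stage alignment pipeline. It is not the model proposed in the article: a greedy dictionary lookup stands in for entity identification, an exact-match lookup stands in for entity-concept mapping, and the knowledge base `TOY_KB`, its concept identifiers, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base: surface form -> made-up concept ID.
# These IDs are illustrative only and do not come from any real resource.
TOY_KB: Dict[str, str] = {
    "breast cancer": "CONCEPT:0001",
    "tamoxifen": "CONCEPT:0002",
    "cancer": "CONCEPT:0003",
}


def identify_entities(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Stage 1 (stand-in): greedy longest-match lookup over lowercased tokens.

    Returns (start_token, end_token, mention) triples.
    """
    tokens = text.lower().split()
    max_len = max(len(key.split()) for key in lexicon)
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "breast cancer" beats "cancer".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                spans.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no mention starts at this token
    return spans


def map_to_concepts(spans: List[Tuple[int, int, str]], kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stage 2 (stand-in): map each identified mention to its concept ID."""
    return [(mention, kb[mention]) for _, _, mention in spans]


def align(text: str, kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run both stages in sequence."""
    return map_to_concepts(identify_entities(text, kb), kb)


def f1_score(predicted: Set, gold: Set) -> float:
    """F1 is the harmonic mean of precision and recall over predicted items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `align("Tamoxifen is used to treat breast cancer", TOY_KB)` returns `[("tamoxifen", "CONCEPT:0002"), ("breast cancer", "CONCEPT:0001")]`. Because the second stage consumes the output of the first, any mention missed or mis-segmented in stage 1 cannot be recovered in stage 2, which is the kind of error propagation the abstract refers to.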


Acknowledgements

This research was supported by the National Key Research and Development Program of China (2018YFB1003404), the National Natural Science Foundation of China (Grant Nos. 61672142, 61402213), the Fundamental Research Funds for the Central Universities (N150408001-3, N150404013), and the Natural Science Foundation of Liaoning Province (20170540471).

Author information

Corresponding author

Correspondence to Yu Hu.

Additional information

Yu Hu received his Bachelor's degree in Engineering and his Master's degree in Computer Science from Northeastern University (NEU), China. He is currently a PhD candidate in Computer Software and Theory at NEU. His research interests include machine learning and its applications in bioinformatics.

Tiezheng Nie received his PhD degree from Northeastern University, China. He is currently an assistant professor at the Department of Computer Science and Engineering. His research interests include databases, data integration, and data quality.

Derong Shen received her PhD degree in Computer Science from Northeastern University (NEU), China. She is currently a professor at the Department of Computer Science and Engineering, NEU. Her research interests include distributed computing, data integration, and knowledge bases.

Yue Kou received her PhD degree in Computer Science from Northeastern University (NEU), China. She is currently an assistant professor at the Department of Computer Science and Engineering, NEU. Her research interests include database theory, machine learning, and software engineering.

Ge Yu received his PhD degree in Information Engineering from Kyushu University, Japan. He is an experienced researcher. He now serves as head of the Department of Computer Science and Engineering, Northeastern University, China. His research interests include database theory.

Cite this article

Hu, Y., Nie, T., Shen, D. et al. An integrated pipeline model for biomedical entity alignment. Front. Comput. Sci. 15, 153321 (2021). https://doi.org/10.1007/s11704-020-8426-4

  • DOI: https://doi.org/10.1007/s11704-020-8426-4
