Abstract
Purpose
Scientific regulatory reporting demands exceptionally high standards of accuracy and data integrity, so any tool that aims to streamline the process must gather data at least as efficiently as a team of scientists, without added cost in time or resources. For this reason, an artificial intelligence (AI)-based tool with parallel search, document creation, and data integrity review capabilities is being investigated as a potential solution. This paper describes a proof-of-concept project to develop an AI-based tool that rapidly assembles an end-of-phase 2 (EOP2) briefing document for a potential medicine. We call the tool the Intelligent Machine for Document Preparation (IMDP).
Methods
A training corpus of approximately 65,000 PDF documents derived from electronic lab notebooks and technical reports related to five molecules (including Merestinib) was ingested, and prior EOP2 documents from the remaining four molecules were used to generate training questions and answers. An annotation-light natural language processing algorithm then analyzed a set of structured and unstructured data regarding Merestinib. A simple user interface was created that allows scientists to query the system in natural language, and a table builder, an image/plot finder, and a free-text addition feature were added to allow advanced search without dependence on keywords.
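The data arrangement described above, with training questions and answers drawn from four molecules and the fifth (Merestinib) kept apart, can be illustrated with a minimal sketch. This is not the authors' implementation, which ran on a proprietary platform. The record layout, the `split_by_molecule` helper, and the placeholder molecule names are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: each question/answer pair is tagged with the
# molecule whose prior EOP2 document it was derived from.
QAPair = Dict[str, str]


def split_by_molecule(
    qa_pairs: List[QAPair], held_out: str
) -> Tuple[List[QAPair], List[QAPair]]:
    """Separate pairs for the held-out molecule from all other pairs."""
    train = [qa for qa in qa_pairs if qa["molecule"] != held_out]
    held = [qa for qa in qa_pairs if qa["molecule"] == held_out]
    return train, held


# Placeholder data: the four non-Merestinib molecules are not named in the
# source, so generic labels are used here.
example_pairs = [
    {"molecule": "molecule_A", "question": "q1", "answer": "a1"},
    {"molecule": "molecule_B", "question": "q2", "answer": "a2"},
    {"molecule": "Merestinib", "question": "q3", "answer": "a3"},
]

train_pairs, held_out_pairs = split_by_molecule(example_pairs, "Merestinib")
```

Splitting by molecule rather than by document keeps every record for the target molecule out of the training questions, which is the separation the Methods paragraph describes.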
Results
Three significant innovations were designed in to improve overall performance relative to our benchmark solution without sacrificing usability. First, the AI-based IMDP was built to improve accuracy and accelerate document creation with a remarkably small amount of training. Second, image search capability was added to enrich the knowledge base. Third, the IMDP was integrated into the existing process rather than added as an extra step in the workflow. Accuracy and total document creation time were then compared with those of the existing (benchmark) tool. Our experiments show that the AI-based technology reached 89% accuracy, surpassing the internal benchmark of 54%, and retrieved the right information 3.6 times faster.
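For readers who want to see how headline figures of this kind are typically derived, the following is a minimal sketch of computing accuracy and a retrieval speedup from per-query records. The `QueryResult` type, the helper functions, and the sample timings are hypothetical; the abstract does not give the authors' evaluation protocol, so this is only one plausible way to define the two metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    correct: bool   # whether the tool returned the right information
    seconds: float  # time taken to retrieve it


def accuracy(results: List[QueryResult]) -> float:
    """Fraction of queries answered correctly."""
    return sum(r.correct for r in results) / len(results)


def mean_time(results: List[QueryResult]) -> float:
    """Average retrieval time per query, in seconds."""
    return sum(r.seconds for r in results) / len(results)


def speedup(baseline: List[QueryResult], candidate: List[QueryResult]) -> float:
    """How many times faster the candidate is than the baseline."""
    return mean_time(baseline) / mean_time(candidate)


# Illustrative numbers only, not data from the study.
benchmark_runs = [QueryResult(True, 36.0), QueryResult(False, 36.0)]
imdp_runs = [QueryResult(True, 10.0), QueryResult(True, 10.0)]

benchmark_accuracy = accuracy(benchmark_runs)
imdp_speedup = speedup(benchmark_runs, imdp_runs)

# Arithmetic on the figures reported in the abstract (89% vs. 54%).
gain_in_points = round((0.89 - 0.54) * 100)  # 35 percentage points
relative_gain = round(0.89 / 0.54, 2)        # 1.65 times the benchmark accuracy
```

Reporting the absolute gain (35 percentage points) alongside the relative gain (about 1.65 times) avoids ambiguity about which comparison is meant.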
Conclusions
The main contribution of this study is to show the value of artificial intelligence-based tools in accelerating all major stages of regulatory report creation while allowing a team of scientists to seamlessly collaborate.
Acknowledgments
The authors wish to acknowledge Rocketspace Inc. as a key collaborator before and during the project execution, as well as Justin Burt, Himanshu Gupta, and Harshad Kulkarni.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Availability of Data and Material
This work is a natural language processing application on pharmaceutical CMC data. The raw data described in the paper comprised approximately 65,000 PDF documents, and it is not practical to share that many files.
Code Availability
We used a proprietary vendor platform, augmented by custom code written on that platform, to generate the solution. The code is therefore not available for review.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Viswanath, S., Fennell, J.W., Balar, K. et al. An Industrial Approach to Using Artificial Intelligence and Natural Language Processing for Accelerated Document Preparation in Drug Development. J Pharm Innov 16, 302–316 (2021). https://doi.org/10.1007/s12247-020-09449-x
DOI: https://doi.org/10.1007/s12247-020-09449-x