An Industrial Approach to Using Artificial Intelligence and Natural Language Processing for Accelerated Document Preparation in Drug Development

  • Original Article
Journal of Pharmaceutical Innovation

Abstract

Purpose

Scientific regulatory reporting demands exceptionally high standards of accuracy and data integrity, so any tool that aims to streamline the process must gather data at least as efficiently as a team of scientists, at no greater cost in time or resources. For this reason, an artificial intelligence-based tool with parallel search, document creation, and data integrity review capabilities is being investigated as a potential solution. This paper describes a proof-of-concept project to develop an AI-based tool that rapidly assembles an end-of-phase 2 (EOP2) briefing document for a potential medicine. We have called the tool the Intelligent Machine for Document Preparation (IMDP).

Methods

A training corpus of approximately 65,000 PDF documents derived from electronic lab notebooks and technical reports related to five molecules (including Merestinib) was ingested, and prior EOP2 documents from the remaining four molecules were used to generate training questions and answers. An annotation-light natural language processing algorithm then analyzed a set of structured and unstructured data regarding Merestinib. A simple user interface was created that allows scientists to query the system in natural language; a table builder, an image/plot finder, and a free-text addition feature were added to allow advanced search without dependence on keywords.

Results

Three significant innovations were designed in to improve overall performance relative to our benchmark solution without sacrificing usability. First, the AI-based IMDP was built to improve accuracy and accelerate document creation with a remarkably small amount of training. Second, image search capability was added to enrich the knowledge base. Third, the IMDP was integrated with the existing process rather than adding a step to the workflow. Finally, accuracy and total document creation time were compared with those of the existing (benchmark) tool. Our experiments show that the AI-based technology reached 89% accuracy, surpassing the internal benchmark of 54%, and retrieved the right information 3.6 times faster.
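The reported figures correspond to a gap of 35 percentage points in accuracy (89% versus 54%). As a rough illustration of how such a comparison could be computed, the sketch below assumes accuracy is the fraction of queries answered correctly and speed-up is the ratio of mean retrieval times; the abstract does not state the exact definitions used in the study. The `QueryResult` type, the function names, and the sample records are hypothetical and are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One evaluated query (hypothetical record layout)."""
    correct: bool    # whether the tool returned the right information
    seconds: float   # time taken to retrieve the answer


def accuracy(results):
    """Fraction of queries answered correctly (assumed definition)."""
    return sum(r.correct for r in results) / len(results)


def mean_seconds(results):
    """Average retrieval time per query."""
    return sum(r.seconds for r in results) / len(results)


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return mean_seconds(baseline) / mean_seconds(candidate)


# Made-up illustrative records, NOT results from the study.
benchmark = [QueryResult(True, 40.0), QueryResult(False, 50.0)]
imdp = [QueryResult(True, 10.0), QueryResult(True, 20.0)]

print(f"benchmark accuracy: {accuracy(benchmark):.0%}")
print(f"IMDP accuracy:      {accuracy(imdp):.0%}")
print(f"speed-up:           {speedup(benchmark, imdp):.1f}x")
```

Evaluating both tools on the same set of queries keeps the accuracy and timing figures directly comparable.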

Conclusions

The main contribution of this study is to show the value of artificial intelligence-based tools in accelerating all major stages of regulatory report creation while allowing a team of scientists to seamlessly collaborate.

Acknowledgments

The authors wish to acknowledge Rocketspace Inc. as a key collaborator before and during the project execution, as well as Justin Burt, Himanshu Gupta, and Harshad Kulkarni.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Shekhar Viswanath.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Availability of Data and Material

This study applies natural language processing to pharmaceutical CMC data. The raw data described in the paper comprise approximately 65,000 PDF files, which are not practical to share.

Code Availability

The solution was built on a proprietary vendor platform augmented with custom code. The code is therefore not available for review.

Cite this article

Viswanath, S., Fennell, J.W., Balar, K. et al. An Industrial Approach to Using Artificial Intelligence and Natural Language Processing for Accelerated Document Preparation in Drug Development. J Pharm Innov 16, 302–316 (2021). https://doi.org/10.1007/s12247-020-09449-x
