An approach for detecting the commonality and specialty between scientific publications and patents

Xu, Shuo; Li, Ling; An, Xin; Hao, Liyuan; Yang, Guancan

doi:10.1007/s11192-021-04085-9

Published: 05 July 2021

Scientometrics, Volume 126, pages 7445–7475 (2021)

Shuo Xu¹,
Ling Li¹,
Xin An ORCID: orcid.org/0000-0002-0291-2711²,
Liyuan Hao¹ &
…
Guancan Yang³

Abstract

Scientific publications and patents are usually viewed as proxies of scientific research and technical development, respectively. Considerable effort has been spent on establishing topic linkages between science and technology with lexical- or topic-based approaches. However, because scholarly articles and patents are heterogeneous in purpose, statement, and quality, the performance of these approaches is not satisfactory. To understand the difficulties of topic linkage and improve its performance, a framework is proposed to detect the commonality and specialty between scientific publications and patents from two perspectives: linguistic characteristics and thematic structures. Extensive experimental results on the DrugBank dataset reveal five commonalities and five significant differences in terms of linguistic characteristics. For example, nouns are the most frequently used words in both resources, and scientific publications contain more word tokens than patent documents, but patents usually have longer sentences and use more clauses. Meanwhile, common and special thematic structures are also uncovered between scientific publications and patents. The themes about general description in the pharmaceutical field are shared by the two heterogeneous resources. Scientific publications tend to explain the disease mechanism and the medication content, while patents lean towards the preparation and practical application of drugs.
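To make two of the surface indicators mentioned in the abstract concrete (word-token count and sentence length), here is a minimal Python sketch. It is not the authors' implementation: the function name `surface_stats`, the regex-based sentence splitter and tokenizer, and the two toy texts are all assumptions made for illustration, and the full article may define these indicators differently.

```python
import re

# Hypothetical tokenizer pattern: alphanumeric runs, optionally joined by
# hyphens or apostrophes (an assumption, not the paper's tokenizer).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def surface_stats(text: str) -> dict:
    """Return token count, sentence count, and mean sentence length."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens_per_sentence = [len(TOKEN_RE.findall(s)) for s in sentences]
    n_tokens = sum(tokens_per_sentence)
    mean_len = n_tokens / len(sentences) if sentences else 0.0
    return {
        "tokens": n_tokens,
        "sentences": len(sentences),
        "mean_sentence_length": mean_len,
    }


# Toy texts invented for this example; they are not drawn from DrugBank.
paper_like = "We study a drug. It binds the receptor. Results are shown."
patent_like = (
    "A composition comprising a compound, wherein the compound "
    "is administered orally."
)

print(surface_stats(paper_like))
print(surface_stats(patent_like))
```

In this toy pair both texts have the same number of tokens, but the patent-style text packs them into a single sentence, so its mean sentence length is higher. A real comparison would need a proper tokenizer and sentence splitter and a significance test over many documents.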

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Numbers 72074014 and 72004012). Our gratitude also goes to the anonymous reviewers and the editor for their valuable comments.

Author information

Authors and Affiliations

College of Economics and Management, Beijing University of Technology, Beijing, 100124, People’s Republic of China
Shuo Xu, Ling Li & Liyuan Hao
School of Economics and Management, Beijing Forestry University, Beijing, 100083, People’s Republic of China
Xin An
School of Information Resource Management, Renmin University of China, Beijing, 100872, People’s Republic of China
Guancan Yang

Corresponding author

Correspondence to Xin An.

Appendix

See Tables 11 and 12.

Table 11 Journals and abstract restrictions corresponding to the top 20% papers that appear most frequently

Table 12 Two-tailed independent sample t-test results of 11 language complexity indicators

About this article

Cite this article

Xu, S., Li, L., An, X. et al. An approach for detecting the commonality and specialty between scientific publications and patents. Scientometrics 126, 7445–7475 (2021). https://doi.org/10.1007/s11192-021-04085-9

Received: 15 August 2020
Accepted: 21 June 2021
Published: 05 July 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s11192-021-04085-9
