
A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset

Published in Scientometrics.

Abstract

Patents are an important source of information for measuring the technological advancement of a specific knowledge domain. To facilitate the search for information in patent datasets, classification systems separate documents into groups according to their area of knowledge and assign names that describe their content. The growing number of patented inventions creates the need to subdivide these groups. Since the groups belong to a restricted knowledge domain, naming the generated subcategories can be extremely laborious. This work compares the performance of abstractive and extractive summarization techniques in the task of generating sentences directly associated with the content of patents. The abstractive summarization model was composed of a Seq2Seq architecture with an LSTM network. The training was conducted on a dataset of patent titles and abstracts, and the validation process used the ROUGE set of metrics. The sentences produced by the generated model were compared with the sentence resulting from an extractive summarization algorithm applied to the task of naming patent groups. The main idea was to help the specialist name new patent groups created by clustering systems. The naming experiments were performed on a dataset of patent abstracts. Comparative experiments were conducted on four subgroups of the United States Patent and Trademark Office, which uses the Cooperative Patent Classification system.
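The ROUGE metrics mentioned above score a generated sentence by its n-gram overlap with a reference (here, for example, a human-written patent title). As a minimal sketch, the recall variant of ROUGE-N can be computed in pure Python as below; the function name, whitespace tokenization, and lowercasing are simplifying assumptions, not the paper's exact setup (the notes indicate the sumeval package was used):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams divided by the total
    n-grams in the reference summary."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For example, `rouge_n_recall("device for measuring pressure", "device for measuring fluid pressure")` gives 0.8, since four of the five reference unigrams appear in the candidate.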

[Figures 1–8 omitted.]

Notes

  1. Available at: https://github.com/chakki-works/sumeval.

  2. Available at: https://github.com/natsheh/sensim.

  3. Available at: https://nlp.stanford.edu/projects/glove/.
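Footnote 3 points to the pre-trained GloVe word vectors. For reference, the GloVe text format stores one token per line followed by its vector components, which makes parsing straightforward; the sketch below loads such a file and compares two words by cosine similarity. The three-word, 2-dimensional sample is illustrative only (real GloVe files hold hundreds of thousands of 50- to 300-dimensional vectors):

```python
import io

def load_glove(file_obj):
    """Parse the GloVe text format: one word per line, then its vector."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Tiny in-memory stand-in for a GloVe file (hypothetical values).
sample = io.StringIO("patent 0.1 0.2\ninvention 0.2 0.4\ndog -0.5 0.1\n")
vecs = load_glove(sample)
```

In a real setting, `load_glove` would be called with an open handle to a file such as `glove.6B.100d.txt`, and word-level similarities like `cosine(vecs["patent"], vecs["invention"])` can feed sentence-similarity measures.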


Acknowledgements

The authors thank the Pontifical Catholic University of Minas Gerais (PUC Minas), the Federal Center for Technological Education of Minas Gerais (CEFET-MG), the National Council for Scientific and Technological Development (CNPq, Grant 429144/2016-4) and the Foundation for Research Support of the State of Minas Gerais (FAPEMIG, Grant APQ 01454-17) for their financial support.

Corresponding author

Correspondence to Magali R. G. Meireles.


About this article


Cite this article

Souza, C.M., Meireles, M.R.G. & Almeida, P.E.M. A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 126, 135–156 (2021). https://doi.org/10.1007/s11192-020-03732-x
