A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset

Souza, Cinthia M.; Meireles, Magali R. G.; Almeida, Paulo E. M.

doi:10.1007/s11192-020-03732-x

Published: 17 October 2020

Scientometrics, Volume 126, pages 135–156 (2021)

Abstract

Patents are an important source of information for measuring the technological advancement of a specific knowledge domain. To facilitate the search for information in patent datasets, classification systems separate documents into groups according to the area of knowledge and assign names that describe their content. The increase in the number of patented inventions leads to the need to subdivide these groups. Since these groups belong to a restricted knowledge domain, naming the generated subcategories can be extremely laborious. This work aims to compare the performance of abstractive and extractive summarization techniques in the task of generating sentences directly associated with the content of patents. The abstractive summarization model was composed of a Seq2Seq architecture and an LSTM network. The training was conducted with a dataset of patent titles and abstracts. The validation process was performed using the ROUGE set of metrics. The results obtained by the generated model were compared with the sentence resulting from an extractive summarization algorithm applied to the task of naming patent groups. The main idea was to help specialists name new patent groups created by clustering systems. The naming experiments were performed on a dataset of abstracts of patent documents. Comparative experiments were conducted using four subgroups of the United States Patent and Trademark Office, which uses the Cooperative Patent Classification system.
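To make the ROUGE-based validation mentioned above concrete, the following is a minimal sketch of ROUGE-N precision, recall, and F1 computed from n-gram overlap between a generated label and a reference title. It is not the authors' implementation: the function names, the lowercase whitespace tokenization, and the two example strings are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N between a candidate and a reference sentence.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; real evaluations often add stemming or stopword removal.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: a generated subgroup label scored against a patent title.
generated = "wireless data transmission system"
title = "method for wireless data transmission"
print(rouge_n(generated, title, n=1))  # unigram overlap (ROUGE-1)
print(rouge_n(generated, title, n=2))  # bigram overlap (ROUGE-2)
```

Recall rewards a label that covers the reference wording, while precision penalizes extra words, so reporting both (or their F1) gives a more balanced picture than either alone.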

Notes

Available at: https://github.com/chakki-works/sumeval.
Available at: https://github.com/natsheh/sensim.
Available at: https://nlp.stanford.edu/projects/glove/.

Acknowledgements

The authors would like to thank the Pontifical Catholic University of Minas Gerais (PUC Minas), the Federal Center for Technological Education of Minas Gerais (CEFET-MG), the National Council for Scientific and Technological Development (CNPq, Grant 429144/2016-4) and the Foundation for Research Support of the State of Minas Gerais (FAPEMIG, Grant APQ 01454-17) for their financial support.

Author information

Authors and Affiliations

Pontifical Catholic University of Minas Gerais, Belo Horizonte, MG, Brazil
Cinthia M. Souza & Magali R. G. Meireles
Federal Center for Technological Education of Minas Gerais, Belo Horizonte, MG, Brazil
Paulo E. M. Almeida

Corresponding author

Correspondence to Magali R. G. Meireles.

About this article

Cite this article

Souza, C.M., Meireles, M.R.G. & Almeida, P.E.M. A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 126, 135–156 (2021). https://doi.org/10.1007/s11192-020-03732-x

Received: 23 December 2019
Published: 17 October 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11192-020-03732-x
