TripleRank: An unsupervised keyphrase extraction algorithm

https://doi.org/10.1016/j.knosys.2021.106846

Abstract

Automatic keyphrase extraction algorithms aim to identify the words and phrases that carry the core information of documents. As online scholarly resources have become widespread in recent years, better keyphrase extraction techniques are required to improve search efficiency. We present two features, keyphrase semantic diversity and keyphrase coverage, to overcome limitations of existing unsupervised keyphrase extraction methods. Keyphrase semantic diversity is the degree of semantic variety in the extraction result; it is introduced to avoid extracting several synonymous phrases that all contain the same high-score candidate. Keyphrase coverage refers to how well a candidate represents the other words in a document. We propose an unsupervised keyphrase extraction method called TripleRank, which evaluates three features: word position (a particularly informative feature for academic documents) and the two novel features introduced above. The architecture of TripleRank comprises three sub-models that score the three features and a summing model that combines them. Although TripleRank involves multiple models, it has no typical iteration process; hence, its computational cost is relatively low. TripleRank outperformed four state-of-the-art baseline models on four academic datasets, which confirmed the influence of keyphrase semantic diversity and keyphrase coverage and demonstrated the efficiency of our method.

Introduction

Cloud applications have made academic materials far more accessible over the past decade. The largest data provider, Google Scholar, listed tens of millions of papers in 2019. This surge of online scholarly information makes it challenging for academic users to find relevant documents; hence, data providers investigate intelligent processing methods for academic papers. Keyphrases and concise summaries of documents are considered efficient search tags because they condense the content of documents or texts. Precise search tags can improve the efficiency of document retrieval, especially for academic users. In addition, automatic keyphrase extraction saves users time: they can search for the documents they need without reading the content thoroughly. However, not all documents are annotated with proper keyphrases, and skimming and scanning unannotated documents is inefficient. Therefore, information extraction, including keyphrase extraction, has become a topic of considerable research interest. Many outstanding keyphrase extraction algorithms have been developed in recent years, and they are generally classified into supervised and unsupervised approaches.

For supervised approaches, the problem is regarded as a classification task [1]; the keyphrase extraction algorithm (KEA) is an early example. In such approaches, a model is trained to decide whether a word is a keyphrase. A word is more likely to be classified as a keyphrase if it appears frequently in the target document but relatively infrequently in general [2]. Different algorithms have been proposed to train this classifier [3], such as boosting [4], maximum entropy [5], and support vector machines (SVMs) [6]. Supervised keyphrase extraction methods are more accurate than unsupervised ones; nevertheless, they require substantial human labelling effort and time to prepare large training datasets.
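As a point of reference for this classification framing, the following is a minimal sketch (not KEA's actual design): each candidate becomes a two-dimensional feature vector capturing the intuition above, namely frequent in the target document but rare in general, and a generic classifier is trained on labelled candidates. The feature set, the helper names, and the choice of scikit-learn's LinearSVC are illustrative assumptions.

    import math
    from sklearn.svm import LinearSVC

    def candidate_features(candidate, doc_tokens, doc_freq, n_docs):
        # Two KEA-style signals: term frequency in the target document and a
        # corpus-level rarity term (inverse document frequency).
        tf = doc_tokens.count(candidate) / max(len(doc_tokens), 1)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(candidate, 0)))
        return [tf, idf]

    def train_keyphrase_classifier(X, y):
        # X: feature vectors for labelled candidates, y: 1 = keyphrase, 0 = not.
        clf = LinearSVC()  # SVMs are one of the classifiers cited above
        clf.fit(X, y)
        return clf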

For unsupervised approaches, the task is regarded as a ranking problem. Unsupervised approaches are widely researched because of their low computational cost, although they are less precise than supervised approaches. They include graph-based ranking, topic-based methods, and simultaneous learning [3]; graph-based ranking has received the most attention. Inspired by PageRank [7], a webpage ranking algorithm that models user interest through the link structure of the web, Mihalcea and Tarau [8] proposed an extraction algorithm called TextRank. The core idea of TextRank is to treat the words of a document as nodes (instead of web pages), build a graph from the adjacency relationships between words, and apply the PageRank algorithm to that graph. However, TextRank neglects the influence of document topics and semantics, which reduces its overall precision. To address this, Liu et al. built topical PageRank (TPR) [9], which uses the latent Dirichlet allocation (LDA) model [10] to generate topic distributions and extract keyphrases more accurately, improving precision to a large extent. However, TPR is a complex model with numerous parameters, which makes it difficult to apply. In addition, Boudin [11] proposed a model that encodes topical information with a multipartite graph: it represents documents as tightly connected sets of topic-related candidates, which further improved precision. Exploiting the specific structure of scholarly documents, Florescu and Caragea [12] proposed PositionRank, an efficient keyphrase extraction algorithm for scholarly documents. PositionRank uses the position information of words and achieved improvements of up to 29.09% over PageRank-based models; it provides satisfactory results and is computationally inexpensive because position information is readily available. Moreover, simultaneous learning methods, which are related to graph-based methods, perform keyphrase extraction and text summarisation at the same time. Zha [13] proposed a graph-based simultaneous learning method, and Wan et al. [14] extended it using TextRank. Simultaneous learning methods share a drawback with TextRank: they do not consider topical information.
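To make the graph-based ranking idea concrete, the sketch below builds a TextRank-style word graph and runs PageRank on it, assuming the networkx library. The window size, the damping factor, and the absence of part-of-speech filtering are simplifying assumptions, not the settings of the original TextRank paper.

    import networkx as nx

    def textrank_scores(tokens, window=2):
        # Nodes are unique words; edges connect words that co-occur within `window`.
        graph = nx.Graph()
        graph.add_nodes_from(set(tokens))
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if word != tokens[j]:
                    graph.add_edge(word, tokens[j])
        # PageRank on the word graph gives each word an importance score.
        return nx.pagerank(graph, alpha=0.85)

    tokens = "graph based ranking scores words by their links to other words".split()
    print(sorted(textrank_scores(tokens).items(), key=lambda kv: -kv[1])[:3])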

Numerous keyphrase extraction algorithms have obtained excellent results using different tools, including position information, graphs, and topic models. However, we identified a problem that appears frequently in most models. Many models evaluate the terms and phrases of scholarly documents with a score determined by probability and weight, which is accurate in most cases. The score of a multi-word phrase, however, is usually dominated by the weights and probabilities of the words it contains, and this is where potential inaccuracy lies. A single word with a high score or weight is likely to appear in several candidate phrases, so a model can output multiple phrases involving the same high-score word as keyphrases while ignoring their similar semantics.
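This failure mode can be reproduced in a few lines with invented word weights (not taken from any experiment in this paper): when a phrase score is simply the sum of its words' scores, several phrases sharing one high-weight word crowd the top of the ranking.

    # Invented word weights; a phrase's score is the sum of its words' scores.
    word_score = {"plo": 0.90, "leader": 0.20, "delegation": 0.15,
                  "official": 0.12, "peace": 0.35, "talks": 0.30}
    candidates = [["plo", "leader"], ["plo", "delegation"],
                  ["plo", "official"], ["peace", "talks"]]

    ranked = sorted(candidates, key=lambda p: -sum(word_score[w] for w in p))
    for phrase in ranked:
        print(" ".join(phrase), round(sum(word_score[w] for w in phrase), 2))
    # The three "plo ..." phrases outrank "peace talks" despite similar semantics.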

For example, in one of our experiments on the DUC dataset, shown in Fig. 1, the red words in the document are the author-assigned keyphrases, the red and bold words in the extraction results are correctly extracted keyphrases, and the extraction result of PositionRank is given. PositionRank [12] is an excellent method that substantially improved extraction precision and correctly extracted four of the six keyphrases. Nevertheless, we use this imperfect example to explain our idea: "PLO" appears in four of the six phrases returned by PositionRank because of its high weight. Such behaviour not only generates incorrect keyphrases but can also degrade the output as a whole, because many latent keyphrases that strongly represent the document are left unselected.

To address this problem, we developed two innovative features – keyphrase coverage and keyphrase semantic diversity – and proposed a model named TripleRank. The results of our method for the document in Fig. 1 are given in the "Supporting Evidence" section.

Keyphrase coverage refers to the extent to which a single word covers or represents the other words of a document. Keyphrases are the terms and phrases that should represent an entire document and summarise its content; equivalently, keyphrases summarise, or cover, more of the semantic content than other words in the document. Keyphrase semantic diversity is the span of the semantic distribution of the extracted keyphrases: extraction results that are concentrated in fewer semantic domains have lower keyphrase semantic diversity. We maximise keyphrase semantic diversity and keyphrase coverage to avoid the problem described above.
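The exact formulations used in TripleRank are given later; as a rough sketch of how the two features could be quantified, coverage can be approximated by a candidate's average cosine similarity to the other word vectors of the document (given any pretrained embeddings), and semantic diversity by the spread, here the entropy, of the selected candidates' averaged topic distribution. Both functions below are illustrative proxies, not the paper's definitions.

    import numpy as np

    def coverage_score(candidate_vec, other_vecs):
        # Average cosine similarity of the candidate to every other word vector:
        # a rough proxy for how much of the document's content it "covers".
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        return float(np.mean([cos(candidate_vec, v) for v in other_vecs]))

    def semantic_diversity(topic_dists):
        # Spread of the selected candidates' topic distributions, measured here
        # as the entropy of their averaged distribution (one possible proxy).
        mean_topics = np.mean(np.asarray(topic_dists), axis=0)
        mean_topics = mean_topics / mean_topics.sum()
        return float(-(mean_topics * np.log(mean_topics + 1e-12)).sum())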

In this paper, we propose TripleRank, which uses three features: keyphrase coverage, keyphrase semantic diversity, and position information. Each feature is assessed by a score, and the three scores are unified into a final score. To calculate the feature scores, we constructed three sub-models, drawing on the merits and demerits of existing work. We consider that keyphrases covering more of the detail in a document are more likely to represent its other words, a notion closely related to the similarity between word vectors; thus, the sub-model for keyphrase coverage is based on similarity calculation. For the sub-model evaluating keyphrase semantic diversity, we predict the potential topic distribution of every candidate. Position information is used as the third feature, mainly because of the specific properties of scholarly documents. There is no typical iterative process in our method; it simply unites the scores of the three features evaluated by the corresponding sub-models to obtain the final score.


Our major contributions can be summarised as follows:

  • 1. We addressed a common problem in which several extracted phrases contain the same high-score word.

  • 2. We proposed and analysed the rationale and significance of two concepts introduced to solve this problem: keyphrase coverage and keyphrase semantic diversity.

  • 3. We proposed TripleRank, a method based on keyphrase coverage, keyphrase semantic diversity, and position information that achieves higher precision than state-of-the-art models.

  • 4. We proposed a computationally efficient model that is partly pre-trained and has no typical iteration process in its main body.


Related work

In 1998, Brin and Page [3] proposed PageRank, a webpage ranking algorithm that scores webpages based on the importance of the pages linking to them and ranks them accordingly. Inspired by PageRank, graph-based ranking algorithms have been developed and improved in recent years. Such algorithms evaluate the importance of a candidate keyphrase by the number of other words in the document to which it is related. TextRank, SingleRank, and PositionRank are state-of-the-art approaches for graph-based keyphrase extraction.

Proposed model

The architecture of TripleRank consists of three sub-models, with scoring features including keyphrase coverage, keyphrase semantic diversity, and position information.

As Fig. 2 shows, three features are scored separately and subsequently combined by mathematical methods (including the weighted sum and the weighted harmonic sum) to output a comprehensive ranking score.
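As an illustration of this combination step, the sketch below implements a weighted sum and a weighted harmonic sum over the three feature scores; the weights, the epsilon guard, and the example values are placeholders, since this excerpt does not specify how TripleRank weights or mixes the two aggregations.

    def weighted_sum(scores, weights):
        # Plain weighted sum of the three feature scores.
        return sum(w * s for w, s in zip(weights, scores))

    def weighted_harmonic_sum(scores, weights, eps=1e-12):
        # Weighted harmonic mean: penalises a candidate that is weak on any
        # single feature more strongly than the arithmetic sum does.
        return sum(weights) / sum(w / (s + eps) for w, s in zip(weights, scores))

    # Illustrative candidate with (coverage, semantic diversity, position) scores.
    scores = (0.62, 0.48, 0.80)
    weights = (1.0, 1.0, 1.0)  # placeholder weights, not the paper's values
    print(weighted_sum(scores, weights), weighted_harmonic_sum(scores, weights))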

The keyphrase coverage score evaluates the breadth of content in documents that the extracted keyphrases can express. We found

Experiment datasets

Our experiments used four datasets: Knowledge Discovery and Data Mining (KDD), World Wide Web Conference (WWW), Inspec, and Document Understanding Conference (DUC) [15]. KDD and WWW [29], compiled by Gollapalli and Caragea from the CiteSeerX digital library, contain research papers from the two ACM conferences. The Inspec dataset [30] is drawn from the Inspec database, which has indexed science and technology literature since 1898. The statistics of the datasets are shown in Table 2, together with an experimental observation.

Evaluation metrics

Conclusion

In this paper, we developed two innovative features – keyphrase semantic diversity and keyphrase coverage – to solve the problem of synonymous phrase occurrence that exists in state-of-the-art approaches. Using these two features together with position information, we proposed TripleRank, a keyphrase extraction method. The experimental results show that TripleRank outperforms the baseline methods in precision, and the improvements are mainly driven by keyphrase semantic diversity and keyphrase coverage.

CRediT authorship contribution statement

Tuohang Li: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft, Writing - reviewing & editing, Validation. Liang Hu: Resources, Project administration. Hongtu Li: Methodology, Validation. Chengyu Sun: Software, Investigation. Shuai Li: Data curation. Ling Chi: Conceptualization, Methodology, Software, Supervision, Resources, Writing - reviewing & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is funded by the National Key R&D Plan of China under Grant No. 2017YFA0604500, the National Sci-Tech Support Plan of China under Grant No. 2014BAH02F00, the National Natural Science Foundation of China under Grant No. 61701190, the Youth Science Foundation of Jilin Province of China under Grant Nos. 20160520011JH and 20180520021JH, the Youth Sci-Tech Innovation Leader and Team Project of Jilin Province of China under Grant No. 20170519017JH, and the Key Technology Innovation

References (32)

  • Papagiannopoulou, E., et al., Local word vectors guiding keyphrase extraction, Inf. Process. Manage. (2018).
  • Turney, P.D., Learning to Extract Keyphrases from Text (2002).
  • Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G., Domain-specific keyphrase extraction, in:...
  • Hasan, K.S., Ng, V., Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual...
  • Hulth, A., Karlgren, J., Jonsson, A., Bostrom, H., Asker, L., Automatic keyword extraction using domain knowledge, in:...
  • Yih, W.-T., Goodman, J., Carvalho, V.R., Finding advertising keywords on web pages, in: Proceedings of the 15th...
  • Jiang, X., Hu, Y., Li, H., A ranking approach to keyphrase extraction, in: Proceedings of the 32nd International ACM SIGIR...
  • Page, L., et al., The PageRank Citation Ranking: Bringing Order to the Web, Technical Report (1998).
  • Mihalcea, R., Tarau, P., TextRank: Bringing order into texts, in: Proceedings of the Conference on Empirical Methods in...
  • Liu, Z., Huang, W., Zheng, Y., Sun, M., Automatic keyphrase extraction via topic decomposition, in: Proceedings of the...
  • Blei, D., et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003).
  • Boudin, F., Unsupervised keyphrase extraction with multipartite graphs, in: Proceedings of the 55th Annual Meeting of...
  • Florescu, C., Caragea, C., PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in:...
  • Zha, H., Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering,...
  • Wan, X., Yang, J., Xia, J., Towards an iterative reinforcement approach for simultaneous document summarization and...
  • Wan, X., Xiao, J., Single document keyphrase extraction using neighborhood knowledge, in: Proceedings of the 23rd AAAI...