TripleRank: An unsupervised keyphrase extraction algorithm

https://doi.org/10.1016/j.knosys.2021.106846

Abstract

Automatic keyphrase extraction algorithms aim to identify the words and phrases that carry the core information of documents. As online scholarly resources have become widespread in recent years, better keyphrase extraction techniques are required to improve search efficiency. We present two features, keyphrase semantic diversity and keyphrase coverage, to overcome limitations of existing unsupervised keyphrase extraction methods. Keyphrase semantic diversity is the degree of semantic variety in the extraction result; it is introduced to avoid extracting several synonymous phrases that all contain the same high-score candidate. Keyphrase coverage refers to how well a candidate represents the other words in a document. We propose an unsupervised keyphrase extraction method called TripleRank, which evaluates three features: word position (a particularly informative feature for academic documents) and the two novel features introduced above. The architecture of TripleRank comprises three sub-models that score the three features and a summing model that combines them. Although TripleRank involves multiple models, it has no typical iteration process; hence, its computational cost is relatively low. TripleRank outperformed four state-of-the-art baseline models on four academic datasets, which confirmed the influence of keyphrase semantic diversity and keyphrase coverage and demonstrated the efficiency of our method.

Introduction

Cloud applications have made academic materials far more accessible over the past decade. The largest data provider, Google Scholar, listed tens of millions of papers in 2019. This surge of online scholarly information makes it challenging for academic users to find relevant documents; hence, data providers investigate intelligent processing methods for academic papers. Keyphrases and concise summaries of documents are considered efficient search tags because they condense the content of documents or texts. Precise search tags can improve the efficiency of document retrieval, especially for academic users. In addition, automatic keyphrase extraction saves users time: they can search for the documents they need without reading the content thoroughly. However, not all documents are annotated with proper keyphrases, and skimming and scanning unannotated documents is inefficient. Therefore, information extraction, including keyphrase extraction, has become a topic of considerable research interest. Many outstanding keyphrase extraction algorithms have been developed in recent years, and they are generally classified into supervised and unsupervised approaches.

For supervised approaches, the problem is regarded as a classification task [1]; the keyphrase extraction algorithm (KEA) is an early example. In such approaches, a model is trained to decide whether a word is a keyphrase. A word is more likely to be classified as a keyphrase if it appears frequently in the target document but relatively infrequently in general [2]. Different algorithms have been proposed to train this classifier [3], such as boosting [4], maximum entropy [5], and support vector machines (SVMs) [6]. Supervised keyphrase extraction methods are more accurate than unsupervised ones; nevertheless, they require substantial human labelling effort and time to prepare large training datasets.
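As a point of reference for this classification framing, the following is a minimal sketch (not KEA's actual design): each candidate becomes a two-dimensional feature vector capturing the intuition above, namely frequent in the target document but rare in general, and a generic classifier is trained on labelled candidates. The feature set, the helper names, and the choice of scikit-learn's LinearSVC are illustrative assumptions.

    import math
    from sklearn.svm import LinearSVC

    def candidate_features(candidate, doc_tokens, doc_freq, n_docs):
        # Two KEA-style signals: term frequency in the target document and a
        # corpus-level rarity term (inverse document frequency).
        tf = doc_tokens.count(candidate) / max(len(doc_tokens), 1)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(candidate, 0)))
        return [tf, idf]

    def train_keyphrase_classifier(X, y):
        # X: feature vectors for labelled candidates, y: 1 = keyphrase, 0 = not.
        clf = LinearSVC()  # SVMs are one of the classifiers cited above
        clf.fit(X, y)
        return clf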

For unsupervised approaches, the task is regarded as a ranking problem. Unsupervised approaches are widely researched because of their low computational cost, although they are less precise than supervised approaches. They include graph-based ranking, topic-based methods, and simultaneous learning [3]; graph-based ranking has received the most attention. Inspired by PageRank [7], a webpage ranking algorithm that models user interest through the link structure of the web, Mihalcea and Tarau [8] proposed an extraction algorithm called TextRank. The core idea of TextRank is to treat the words of a document as nodes (instead of web pages), build a graph from the adjacency relationships between words, and apply the PageRank algorithm to that graph. However, TextRank neglects the influence of document topics and semantics, which reduces its overall precision. To address this, Liu et al. built topical PageRank (TPR) [9], which uses the latent Dirichlet allocation (LDA) model [10] to generate topic distributions and extract keyphrases more accurately, improving precision to a large extent. However, TPR is a complex model with numerous parameters, which makes it difficult to apply. In addition, Boudin [11] proposed a model that encodes topical information with a multipartite graph: it represents documents as tightly connected sets of topic-related candidates, which further improved precision. Exploiting the specific structure of scholarly documents, Florescu and Caragea [12] proposed PositionRank, an efficient keyphrase extraction algorithm for scholarly documents. PositionRank uses the position information of words and achieved improvements of up to 29.09% over PageRank-based models; it provides satisfactory results and is computationally inexpensive because position information is readily available. Moreover, simultaneous learning methods, which are related to graph-based methods, perform keyphrase extraction and text summarisation at the same time. Zha [13] proposed a graph-based simultaneous learning method, and Wan et al. [14] extended it using TextRank. Simultaneous learning methods share a drawback with TextRank: they do not consider topical information.
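To make the graph-based ranking idea concrete, the sketch below builds a TextRank-style word graph and runs PageRank on it, assuming the networkx library. The window size, the damping factor, and the absence of part-of-speech filtering are simplifying assumptions, not the settings of the original TextRank paper.

    import networkx as nx

    def textrank_scores(tokens, window=2):
        # Nodes are unique words; edges connect words that co-occur within `window`.
        graph = nx.Graph()
        graph.add_nodes_from(set(tokens))
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if word != tokens[j]:
                    graph.add_edge(word, tokens[j])
        # PageRank on the word graph gives each word an importance score.
        return nx.pagerank(graph, alpha=0.85)

    tokens = "graph based ranking scores words by their links to other words".split()
    print(sorted(textrank_scores(tokens).items(), key=lambda kv: -kv[1])[:3])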

Numerous keyphrase extraction algorithms have obtained excellent results using different tools, including position information, graphs, and topic models. However, we identified a problem that appears frequently in most models. Many models evaluate the terms and phrases of scholarly documents with a score determined by probability and weight, which is accurate in most cases. The score of a multi-word phrase, however, is usually dominated by the weights and probabilities of the words it contains, and this is where potential inaccuracy lies. A single word with a high score or weight is likely to appear in several candidate phrases, so a model can output multiple phrases involving the same high-score word as keyphrases while ignoring their similar semantics.
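This failure mode can be reproduced in a few lines with invented word weights (not taken from any experiment in this paper): when a phrase score is simply the sum of its words' scores, several phrases sharing one high-weight word crowd the top of the ranking.

    # Invented word weights; a phrase's score is the sum of its words' scores.
    word_score = {"plo": 0.90, "leader": 0.20, "delegation": 0.15,
                  "official": 0.12, "peace": 0.35, "talks": 0.30}
    candidates = [["plo", "leader"], ["plo", "delegation"],
                  ["plo", "official"], ["peace", "talks"]]

    ranked = sorted(candidates, key=lambda p: -sum(word_score[w] for w in p))
    for phrase in ranked:
        print(" ".join(phrase), round(sum(word_score[w] for w in phrase), 2))
    # The three "plo ..." phrases outrank "peace talks" despite similar semantics.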

For example, in one of our experiments on the DUC dataset, shown in Fig. 1, the red words in the document are the author-assigned keyphrases, the red and bold words in the extraction results are correctly extracted keyphrases, and the extraction result of PositionRank is given. PositionRank [12] is an excellent method that substantially improved extraction precision and correctly extracted four of the six keyphrases. Nevertheless, we use this imperfect example to explain our idea: "PLO" appears in four of the six phrases returned by PositionRank because of its high weight. Such behaviour not only generates incorrect keyphrases but can also degrade the output as a whole, because many latent keyphrases that strongly represent the document are left unselected.

To address this problem, we developed two innovative features – keyphrase coverage and keyphrase semantic diversity – and proposed a model named TripleRank. The results of our method for the document in Fig. 1 are given in the "Supporting Evidence" section.

Keyphrase coverage refers to the extent to which a single word covers or represents the other words of a document. Keyphrases are the terms and phrases that should represent an entire document and summarise its content; equivalently, keyphrases summarise, or cover, more of the semantic content than other words in the document. Keyphrase semantic diversity is the span of the semantic distribution of the extracted keyphrases: extraction results that are concentrated in fewer semantic domains have lower keyphrase semantic diversity. We maximise keyphrase semantic diversity and keyphrase coverage to avoid the problem described above.
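The exact formulations used in TripleRank are given later; as a rough sketch of how the two features could be quantified, coverage can be approximated by a candidate's average cosine similarity to the other word vectors of the document (given any pretrained embeddings), and semantic diversity by the spread, here the entropy, of the selected candidates' averaged topic distribution. Both functions below are illustrative proxies, not the paper's definitions.

    import numpy as np

    def coverage_score(candidate_vec, other_vecs):
        # Average cosine similarity of the candidate to every other word vector:
        # a rough proxy for how much of the document's content it "covers".
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        return float(np.mean([cos(candidate_vec, v) for v in other_vecs]))

    def semantic_diversity(topic_dists):
        # Spread of the selected candidates' topic distributions, measured here
        # as the entropy of their averaged distribution (one possible proxy).
        mean_topics = np.mean(np.asarray(topic_dists), axis=0)
        mean_topics = mean_topics / mean_topics.sum()
        return float(-(mean_topics * np.log(mean_topics + 1e-12)).sum())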

In this paper, we propose TripleRank, which uses three features: keyphrase coverage, keyphrase semantic diversity, and position information. Each feature is assessed by a score, and the three scores are unified into a final score. To calculate the feature scores, we constructed three sub-models, drawing on the merits and demerits of existing work. We consider that keyphrases covering more of the detail in a document are more likely to represent its other words, a notion closely related to the similarity between word vectors; thus, the sub-model for keyphrase coverage is based on similarity calculation. For the sub-model evaluating keyphrase semantic diversity, we predict the potential topic distribution of every candidate. Position information is used as the third feature, mainly because of the specific properties of scholarly documents. There is no typical iterative process in our method; it simply unites the scores of the three features evaluated by the corresponding sub-models to obtain the final score.


Our major contributions can be summarised as follows:

  • 1. We addressed a common problem in which several extracted phrases contain the same high-score word.

  • 2. We proposed and analysed the rationale and significance of two concepts introduced to solve this problem: keyphrase coverage and keyphrase semantic diversity.

  • 3. We proposed TripleRank, a method based on keyphrase coverage, keyphrase semantic diversity, and position information that achieves higher precision than state-of-the-art models.

  • 4. We proposed a computationally efficient model that is partly pre-trained and has no typical iteration process in its main body.


Related work

In 1998, Brin and Page [3] proposed PageRank, a webpage ranking algorithm that scores webpages based on the importance of the pages linking to them and ranks them accordingly. Inspired by PageRank, graph-based ranking algorithms have been developed and improved in recent years. Such algorithms evaluate the importance of a candidate keyphrase by the number of other words in the document to which it is related. TextRank, SingleRank, and PositionRank are state-of-the-art approaches for graph-based keyphrase extraction.

Proposed model

The architecture of TripleRank consists of three sub-models, with scoring features including keyphrase coverage, keyphrase semantic diversity, and position information.

As Fig. 2 shows, three features are scored separately and subsequently combined by mathematical methods (including the weighted sum and the weighted harmonic sum) to output a comprehensive ranking score.
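As an illustration of this combination step, the sketch below implements a weighted sum and a weighted harmonic sum over the three feature scores; the weights, the epsilon guard, and the example values are placeholders, since this excerpt does not specify how TripleRank weights or mixes the two aggregations.

    def weighted_sum(scores, weights):
        # Plain weighted sum of the three feature scores.
        return sum(w * s for w, s in zip(weights, scores))

    def weighted_harmonic_sum(scores, weights, eps=1e-12):
        # Weighted harmonic mean: penalises a candidate that is weak on any
        # single feature more strongly than the arithmetic sum does.
        return sum(weights) / sum(w / (s + eps) for w, s in zip(weights, scores))

    # Illustrative candidate with (coverage, semantic diversity, position) scores.
    scores = (0.62, 0.48, 0.80)
    weights = (1.0, 1.0, 1.0)  # placeholder weights, not the paper's values
    print(weighted_sum(scores, weights), weighted_harmonic_sum(scores, weights))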

The keyphrase coverage score evaluates the breadth of content in documents that the extracted keyphrases can express. We found

Experiment datasets

Our experiments used four datasets: Knowledge Discovery and Data Mining (KDD), World Wide Web Conference (WWW), Inspec, and Document Understanding Conference (DUC) [15]. KDD and WWW [29], compiled by Gollapalli and Caragea from the CiteSeerX digital library, contain research papers from the two ACM conferences. The Inspec dataset [30] is drawn from the Inspec database, which has indexed science and technology literature since 1898. The statistics of the datasets are shown in Table 2, together with an experimental observation.

Evaluation metrics

Conclusion

In this paper, we developed two innovative features – keyphrase semantic diversity and keyphrase coverage – to solve the problem of synonymous phrase occurrence that exists in state-of-the-art approaches. Using these two features together with position information, we proposed TripleRank, a keyphrase extraction method. The experimental results show that TripleRank outperforms the baseline methods in precision, and the improvements are mainly driven by keyphrase semantic diversity and keyphrase coverage.

CRediT authorship contribution statement

Tuohang Li: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft, Writing - reviewing & editing, Validation. Liang Hu: Resources, Project administration. Hongtu Li: Methodology, Validation. Chengyu Sun: Software, Investigation. Shuai Li: Data curation. Ling Chi: Conceptualization, Methodology, Software, Supervision, Resources, Writing - reviewing & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is funded by the National Key R&D Plan of China under Grant No. 2017YFA0604500, the National Sci-Tech Support Plan of China under Grant No. 2014BAH02F00, the National Natural Science Foundation of China under Grant No. 61701190, the Youth Science Foundation of Jilin Province of China under Grant Nos. 20160520011JH and 20180520021JH, the Youth Sci-Tech Innovation Leader and Team Project of Jilin Province of China under Grant No. 20170519017JH, and the Key Technology Innovation

References (32)

  • Papagiannopoulou, E., et al., Local word vectors guiding keyphrase extraction, Inf. Process. Manage. (2018).
  • Turney, P.D., Learning to Extract Keyphrases from Text (2002).
  • Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G., Domain-specific keyphrase extraction, in:...
  • Hasan, K.S., Ng, V., Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual...
  • Hulth, A., Karlgren, J., Jonsson, A., Bostrom, H., Asker, L., Automatic keyword extraction using domain knowledge, in:...
  • Yih, W.-T., Goodman, J., Carvalho, V.R., Finding advertising keywords on web pages, in: Proceedings of the 15th...
  • Jiang, X., Hu, Y., Li, H., A ranking approach to keyphrase extraction, in: Proceedings of the 32nd International ACM SIGIR...
  • Page, L., et al., The PageRank Citation Ranking: Bringing Order to the Web, Technical Report (1998).
  • Mihalcea, R., Tarau, P., TextRank: Bringing order into texts, in: Proceedings of the Conference on Empirical Methods in...
  • Liu, Z., Huang, W., Zheng, Y., Sun, M., Automatic keyphrase extraction via topic decomposition, in: Proceedings of the...
  • Blei, D., et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003).
  • Boudin, F., Unsupervised keyphrase extraction with multipartite graphs, in: Proceedings of the 55th Annual Meeting of...
  • Florescu, C., Caragea, C., PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in:...
  • Zha, H., Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering,...
  • Wan, X., Yang, J., Xia, J., Towards an iterative reinforcement approach for simultaneous document summarization and...
  • Wan, X., Xiao, J., Single document keyphrase extraction using neighborhood knowledge, in: Proceedings of the 23rd AAAI...