1 Introduction

Term mining aims to extract terms from unstructured documents. A term is usually composed of multiple words and describes a concept in a particular domain; such concepts form the foundation of the domain. Unsupervised term mining uses automated algorithms to extract terms from the literature without relying on external resources or human intervention, so it has a wide range of applications and can be used to process text data in any field. Existing research usually mines terms from the literature with dictionary-based methods or supervised machine learning methods. However, dictionary-based methods can only match terms in a specific field, and the terms in a vocabulary often differ from the expressions used in the literature. For example, the disease “ischemic stroke” is mentioned in more than five different ways in the literature. Because terms in the literature contain many variants, dictionary-based methods require constant vocabulary maintenance, which is costly and time-consuming.

Named entity recognition (NER) methods have achieved good results in term recognition. However, NER models require a large amount of labelled corpora for training and can only identify predefined types of terms, which makes them unsuitable for the open term mining problem addressed here. Corpus annotation in the biomedical field is particularly challenging because it requires annotators with strong domain knowledge, and the labelling process relies heavily on domain experts and annotation standards. Therefore, manually labelling a term corpus for named entity recognition is very time-consuming and costly. In addition, using deep learning models to identify terms increases computational costs. Motivated by these problems, this paper proposes an unsupervised term mining method that automatically mines terms from the literature without manual corpus annotation, so it can work with word embedding training algorithms to produce term-level embeddings rather than only word embeddings. We propose a multi-length term mining algorithm with which terms can be fully mined from text without requiring any external resources.

Existing term representation methods usually represent a term by averaging its word embeddings [1]. However, this approach does not capture the semantics of the term as a whole; it represents the term only through its component words. Since there are essential differences between terms and words, this poses a limitation for term-level research. To address this problem, we apply the proposed method to train term embeddings. Building on the existing word embedding algorithm GloVe [2], we train embeddings for the mined terms and find that the resulting term representations better capture the semantic relationships between terms. To evaluate the performance of the method, we created two datasets of different sizes. We compared term representations composed of the original word embeddings with term representations obtained by our method, and we observe that our method improves the representation of terms, better characterizes the relationships between terms, and yields more meaningful term similarities. The method can potentially be applied to any biomedical text mining system and can also be used to build a term correlation graph. Finally, we explore the factors and treatments associated with lung and breast cancer using the proposed method. The results show that our method can find key information about these diseases in the literature.

1.1 Contribution

The main contributions of this paper can be summarized below:

  1. We propose an unsupervised term mining algorithm. The proposed method can be applied to any biomedical corpus, mines terms without manual annotation, and can be used with word embedding training algorithms, which may provide a scheme for solving term representation problems.

  2. The proposed method improves the existing word embedding-based term representation, and the obtained term embeddings better characterize the relationships between terms.

  3. We create two term mining datasets to evaluate the performance of the method.

1.2 Organization

The rest of the paper is organized as follows. Related work is discussed in Sect. 2. Section 3 describes the proposed method and the corresponding algorithms. Section 4 presents the experimental results, including the dataset description, evaluation metric, analysis of term similarity, and analysis of term relationships. Finally, we discuss the conclusion and possible future work in Sect. 5.

2 Related work

Named entity recognition models are widely used for recognizing biomedical terms. Settles et al. [3] use conditional random fields (CRF) [4] to recognize gene and protein mentions in biomedical abstracts. Leaman et al. [5] propose the BANNER framework to recognize biomedical entities, aiming to improve the generalization ability for this task. Habibi et al. [6] use the LSTM-CRF model to recognize mentions of genes, chemicals, and diseases. Tang et al. [7] study three different types of word representation methods and analyze their performance for biomedical NER on the JNLPBA and BioCreAtIvE II BNER tasks. Yao et al. [8] propose a deep learning model consisting of multiple CNN layers that achieves improvements on the GENIA dataset. Wang et al. [9, 10] propose a multitask learning approach that recognizes biomedical entities by training collectively on data containing distinct types of entities. Yoon et al. [11] propose CollaboNet, which integrates multiple NER frameworks to address data scarcity and entity-type misclassification. Cho et al. [12] propose contextual LSTM networks with a CRF layer to capture contextual information.

Lafferty et al. [13] use conditional random fields (CRF) to build probabilistic models for sequence labelling problems. Nadeau et al. [14] survey feature engineering-based NER models and systems. Collobert et al. [15] use CNNs to solve many NLP tasks. Lample et al. [16] adopt LSTM-CRF for the sequence labelling problem. Chiu et al. [17] propose bidirectional LSTM-CNNs for sequence labelling. Ma et al. [18] use an LSTM-CNNs-CRF model to recognize entities. Akbik et al. [19] propose character-level language modelling to improve performance.

Pre-trained language models, such as ELMo [20] and BERT [21], have also been applied in the clinical NLP field. Beltagy et al. [22] train SciBERT to enhance downstream NLP tasks. Alsentzer et al. [23] train BERT on both clinical notes and discharge summaries. Lee et al. [24] propose BioBERT, a model that retrains BERT on PubMed and PMC corpora and improves the results of downstream BioNLP tasks. These studies focus on word-level representations without considering term representations. Context-dependent representations give the same word different representations in different sentences. However, this paper aims to obtain context-independent representations, so we do not adopt these methods.

3 Methods

3.1 Multi-length term mining algorithm

This section explains the algorithm. Through statistical analysis, we found that terms composed of 2, 3, 4, and 5 words are the most common, so we mainly mine terms of these lengths. Terms consisting of a single word can be represented directly by word vectors. The input of the algorithm is the raw text, and the output is the set of mined terms. No external resources are needed, so the algorithm can be applied to any corpus. We first perform word segmentation and part-of-speech tagging on the raw text and then mine the terms.

As shown in Algorithm 1, the first line of the algorithm initializes four dictionaries for storing terms of different lengths and reads in the corpus. The method then processes each sample, that is, each article. The fourth line performs word segmentation on these articles. Word segmentation divides each article into a sequence of words and identifies the punctuation marks, which prevents words and punctuation from being joined together and producing an irregular vocabulary with inaccurate words.

Algorithm 1 Multi-length term mining algorithm

The fifth line performs part-of-speech (POS) tagging of the words. We adopt an LSTM-CRF neural network for POS tagging; the details of this model are introduced in Sect. 3.2. The POS features help improve term mining in the algorithm. Lines 6 to 9 mine terms of different lengths; we mainly focus on terms composed of 2, 3, 4, and 5 words.
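For illustration, the segmentation and tagging steps can be sketched in Python. The snippet below uses NLTK's off-the-shelf tokenizer and POS tagger as a stand-in for the LSTM-CRF tagger of Sect. 3.2; it is only a minimal sketch, not the exact pipeline used in our experiments, and the sample abstract is invented.

```python
import nltk

# One-time model downloads (NLTK is used here only as a stand-in
# for the LSTM-CRF POS tagger described in Sect. 3.2).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

abstract = ("Non-small cell lung cancer is a common malignant tumour. "
            "Adjuvant chemotherapy may improve overall survival.")

words = nltk.word_tokenize(abstract)   # word segmentation (Algorithm 1, line 4)
tagged = nltk.pos_tag(words)           # POS tagging (Algorithm 1, line 5)
print(tagged[:4])
# e.g. [('Non-small', 'JJ'), ('cell', 'NN'), ('lung', 'NN'), ('cancer', 'NN')]
```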

The term extraction algorithm is described in Algorithm 2. The second line splits the input word sequence into phrases of a given length. The third line extracts the POS tags of each phrase from the recognized POS tag sequence. In line 4, the algorithm matches the POS pattern of the candidate phrase against our predefined rule. This rule targets medical terms in which the first two words are an adjective and a noun, respectively, and the last word is also a noun. If the phrase matches, it is treated as a potential term and added to the dictionary. The algorithm then counts how often the term appears: if the term is already in the dictionary, its frequency is increased by 1; otherwise, its frequency is set to 1. The purpose of counting term frequency is to extract meaningful terms. By setting thresholds, we can flexibly mine a chosen number of terms. Terms are often repeated in the literature, and if a phrase appears only once, we do not consider it a term. A higher threshold yields more confident terms, which are also the more common ones.

Algorithm 2 Term extraction algorithm

In the following, we briefly discuss the asymptotic complexity of our approach. The algorithm processes a large number of samples, where the number of samples depends on the size of the corpus, so we analyze the term mining process for a single sample. The algorithm first performs word segmentation, then part-of-speech tagging, and then calls Algorithm 2. Since the complexity of Algorithm 2 is \(\mathcal {O}(n)\), the complexity of Algorithm 1 mainly depends on the word segmentation and part-of-speech tagging steps, so our algorithm has approximately the same complexity as segmentation and part-of-speech tagging. The proposed method does not significantly increase computational costs.
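To make the two listings concrete, the following Python sketch reproduces their logic under simplifying assumptions: it works on already tagged (word, POS) sequences, implements only the single adjective/noun pattern stated above (the rule set used in practice is likely broader), and uses an illustrative frequency threshold. The function names are ours and not part of the original algorithms.

```python
from collections import defaultdict

ADJ = {"JJ", "JJR", "JJS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}

def matches_rule(tags):
    """POS rule stated in the text: the first two words are an adjective
    and a noun, respectively, and the last word is also a noun."""
    return tags[0] in ADJ and tags[1] in NOUN and tags[-1] in NOUN

def extract_terms(words, tags, length, term_dict):
    """Algorithm 2 (sketch): slide a window of the given length, keep phrases
    whose POS pattern matches the rule, and count their frequencies."""
    for i in range(len(words) - length + 1):
        if matches_rule(tags[i:i + length]):
            phrase = " ".join(words[i:i + length]).lower()
            term_dict[phrase] += 1

def mine_terms(tagged_articles, min_freq=2):
    """Algorithm 1 (sketch): one frequency dictionary per term length (2-5)."""
    dicts = {n: defaultdict(int) for n in (2, 3, 4, 5)}
    for tagged in tagged_articles:                # each article, already tagged
        words = [w for w, _ in tagged]
        tags = [t for _, t in tagged]
        for n in (2, 3, 4, 5):                    # Algorithm 1, lines 6-9
            extract_terms(words, tags, n, dicts[n])
    # keep only phrases seen at least min_freq times (threshold is illustrative)
    return {n: {t: f for t, f in d.items() if f >= min_freq}
            for n, d in dicts.items()}

example = [("adjuvant", "JJ"), ("chemotherapy", "NN"), ("improves", "VBZ"),
           ("overall", "JJ"), ("survival", "NN")]
print(mine_terms([example, example])[2])
# {'adjuvant chemotherapy': 2, 'overall survival': 2}
```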

3.2 LSTM-CRF

This subsection introduces the LSTM-CRF sequence labelling model used for POS tagging, as shown in Fig. 1. The model does not depend on feature engineering; instead, it takes words and characters as input, which increases its generality for processing any dataset. The textual input is then projected to word embeddings, which encode prior semantic knowledge into dense vector representations [25,26,27].

Fig. 1 LSTM-CRF network overview

3.2.1 Long short-term memory

The long short-term memory (LSTM) network is a kind of recurrent neural network (RNN). It uses the LSTM unit [28] to address the exploding and vanishing gradient problems encountered in traditional RNNs. The formulations of the LSTM unit are as follows.

$$\begin{aligned} i_t&=\sigma (W_i h_{t-1}+U_i x_t +b_i) \end{aligned}$$
(1)
$$\begin{aligned} f_t&=\sigma (W_f h_{t-1}+U_f x_t +b_f) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{c_t}&=\tanh (W_c h_{t-1}+U_c x_t +b_c) \end{aligned}$$
(3)
$$\begin{aligned} c_t&=f_t \odot c_{t-1}+i_t \odot \tilde{c_t} \end{aligned}$$
(4)
$$\begin{aligned} o_t&=\sigma (W_o h_{t-1}+U_o x_t +b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t&=o_t \odot \tanh (c_t) \end{aligned}$$
(6)

where \(\sigma (\cdot )\) is the sigmoid activation function, \(x_t\) is the input vector at time step t, and \(h_t\) is the hidden state containing the context information from the previous time steps. W, U, and b are weight and bias parameters. \(i_t\), \(f_t\), \(c_t\) and \(o_t\) are the input gate, forget gate, cell state, and output gate, respectively. However, the hidden state of a forward LSTM can only capture the context on the left side of the current step [13, 16, 18]. A bidirectional LSTM (Bi-LSTM) additionally processes the input sequence in reverse order and concatenates the hidden states at each time step, so it captures the context on both the left and right sides of the current step.
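As a worked example, Eqs. (1)–(6) can be computed directly for a single time step. The NumPy sketch below is illustrative only: the parameter shapes, random initialization, and toy dimensions are assumptions, not the configuration used in our model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Eqs. (1)-(6)."""
    Wi, Ui, bi, Wf, Uf, bf, Wc, Uc, bc, Wo, Uo, bo = params
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)        # input gate, Eq. (1)
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t + bf)        # forget gate, Eq. (2)
    c_tilde = np.tanh(Wc @ h_prev + Uc @ x_t + bc)    # candidate cell, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                # cell state, Eq. (4)
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t + bo)        # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                          # hidden state, Eq. (6)
    return h_t, c_t

# toy dimensions (illustrative): input size 4, hidden size 3
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = []
for _ in range(4):                                    # i, f, c, o blocks
    params += [rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)),
               np.zeros(d_h)]
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):                  # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
print(h)
```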

3.2.2 Conditional random field

The conditional random field (CRF) layer performs the label sequence prediction [16, 18]. This model can be more effective than classifying each word independently because the word label is determined not only by itself but also by the context.

$$\begin{aligned} p(y|z;W,b)&=\frac{\prod _{i=1}^{n}\exp (W_{y_{i-1}y_i}^Tz_i+b_{y_{i-1}y_i})}{\sum _{y'\in Y(z)}\prod _{i=1}^{n}\exp (W_{y'_{i-1}y'_i}^Tz_i+b_{y'_{i-1}y'_i})} \end{aligned}$$
(7)
$$\begin{aligned} L(W,b)&=\sum \limits _{i}\log p(y|z;W,b) \end{aligned}$$
(8)
$$\begin{aligned} y^*&=\arg \max \limits _{y\in Y(z)}p(y|z;W,b) \end{aligned}$$
(9)

where \(\{(z_i,y_i)\}, i=1,2,\ldots ,n\) denotes a sequence of words z with a label sequence y, and W and b are weight and bias parameters. \(p(y|z;W,b)\) is the probability of label sequence y over all possible sequences Y(z) given the input z. In the training stage, the objective is to maximize the log-likelihood \(L(W,b)\). In prediction, the decoder finds the optimal label sequence in Eq. 9 with the Viterbi algorithm [29].
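The decoder in Eq. (9) is typically implemented with the Viterbi algorithm. The NumPy sketch below uses the common emission-plus-transition parameterization of LSTM-CRF taggers, which slightly simplifies the pairwise weights of Eq. (7); the score matrices are random placeholders rather than outputs of a trained model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the label sequence maximizing the path score, as in Eq. (9).

    emissions:   (n, K) array, score of each of K labels at each position.
    transitions: (K, K) array, score of moving from label y' to label y.
    """
    n, K = emissions.shape
    score = emissions[0].copy()                 # best score ending in each label
    backptr = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        # candidate[j, k] = best path ending in label j, then moving to label k
        candidate = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]                # backtrack the optimal path
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]

# toy example: 4 positions, 3 POS labels
rng = np.random.default_rng(1)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```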

3.3 Term embedding

Word vector approaches project a large, sparse vocabulary space into a much lower-dimensional space and generate dense representations. They have contributed significantly to many NLP techniques and are widely used in downstream tasks such as sentiment analysis and document classification to achieve improved results. Word representations are directly computable and have actively promoted the development of deep learning for NLP. Existing word vector training methods embed individual words into fixed-length vectors but cannot be used directly to learn term vectors in sentences. For a multi-word term, the embeddings of its constituent words are not enough to represent the overall meaning of the term, so existing methods have limitations for the term representation problem.

Existing methods usually use average or max pooling of all word vectors to represent a multi-word term. The problem is that the resulting term representation is most similar to each of its constituent words: it reflects relationships between the words contained in the term rather than relationships between different terms. Such an approach is a word-level learning method and cannot be used at the term level. Based on the above problems, we use the multi-length term mining algorithm proposed in Sect. 3.1 together with a word vector learning algorithm to alleviate the term representation problem and learn term embeddings directly. We use the GloVe algorithm to train term embeddings. The major difference is that our vocabulary consists of the terms mined using the algorithms described above, which enables GloVe to learn vectors for terms rather than for individual words. For example, our method learns one single vector for the compound term “lung cancer”, whereas standard GloVe learns separate vectors for the words “lung” and “cancer”, which then have to be averaged to obtain a vector for “lung cancer”. The training objective of GloVe embeddings is shown in Eq. (10).

$$\begin{aligned} J=\sum _{i,j=1}^{V}f(X_{i,j})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log {X_{ij}})^2 \end{aligned}$$
(10)

where \(X_{ij}\) represents the co-occurrence frequency of word\(_i\) and word\(_j\). \(w_i\) and \(\tilde{w}_j\) denote the vector representations of word\(_i\) and context word\(_j\), respectively. \(b_i\) and \(\tilde{b}_j\) are the bias values for word\(_i\) and word\(_j\), respectively. \(f(X_{ij})\) is the weighting function defined in Eq. (11).

$$\begin{aligned} f(x)={\left\{ \begin{array}{ll}(x/x_{\max })^\alpha ,&{} {\text{ if } }\; x<x_{\max } \\ 1,&{} {\text{ otherwise }} \end{array}\right. } \end{aligned}$$
(11)

where \(x_{\max }\) and \(\alpha\) are hyperparameters. These two parameters are set to \(x_{\max }=100\) and \(\alpha =0.75\).
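As described at the beginning of this subsection, the only change we make to GloVe is the vocabulary: each mined multi-word term is rewritten as a single token before the co-occurrence statistics are collected, so the model learns one vector per term. The snippet below is a minimal sketch of this preprocessing step; the greedy longest-match replacement and underscore joining are our assumptions about one reasonable way to implement it.

```python
def merge_terms(tokens, term_set, max_len=5):
    """Greedily replace mined multi-word terms with single tokens
    (longest match first), so that a co-occurrence based trainer such
    as GloVe learns one vector per term."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):               # try 5-word terms first
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in term_set:
                out.append(phrase.replace(" ", "_"))  # e.g. "lung_cancer"
                i += n
                break
        else:
            out.append(tokens[i].lower())
            i += 1
    return out

mined = {"lung cancer", "non-small cell lung cancer"}
sent = "Non-small cell lung cancer is the most common type of lung cancer".split()
print(merge_terms(sent, mined))
# ['non-small_cell_lung_cancer', 'is', 'the', 'most', 'common',
#  'type', 'of', 'lung_cancer']
```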

4 Results

In this section, we conduct experiments on two datasets. We first introduce the datasets and evaluation method, then compare the mined term-based embeddings with the word embedding-based baseline, and finally analyze the experimental results.

4.1 Experimental settings

4.1.1 Dataset

PubMed-10K We randomly sampled 10k abstracts from PubMed. This dataset contains 91k sentences, from which we mine more than 42k potential terms. The term statistics can be found in the first row of Table 1.

PubMed-100K We randomly sampled 100k abstracts from PubMed. This dataset contains 0.94M sentences, which is larger than the first dataset and allows us to compare performance at different data scales. We mine 0.35M potential terms from this dataset. The term statistics can be found in the second row of Table 1.

Table 1 Number of terms mined on different datasets

4.2 Evaluation

We mainly analyze the mined terms and their semantic representation capacity through manual evaluation and visualization. We use cosine similarity to find the most similar terms for each term, thereby reflecting the quality of the learned term embeddings.
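Concretely, the nearest-term lookup can be sketched as follows, assuming the learned term embeddings are stored in a dictionary mapping term strings to NumPy vectors (an illustrative data layout, not necessarily the one used in our implementation).

```python
import numpy as np

def most_similar(query, term_vectors, k=5):
    """Return the k terms whose vectors have the highest cosine
    similarity to the query term's vector."""
    q = term_vectors[query]
    q = q / np.linalg.norm(q)
    scores = []
    for term, vec in term_vectors.items():
        if term == query:
            continue
        scores.append((float(q @ (vec / np.linalg.norm(vec))), term))
    return sorted(scores, reverse=True)[:k]

# toy vectors; real vectors come from the term-level GloVe training
rng = np.random.default_rng(2)
vocab = ["lung cancer", "breast cancer", "copd", "adjuvant chemotherapy"]
term_vectors = {t: rng.normal(size=50) for t in vocab}
print(most_similar("lung cancer", term_vectors, k=2))
```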

4.3 Analysis of term similarity

We compare our method with the word vector-based method, as shown in Figs. 2, 3, 4 and 5. For the baseline, we used the most common way to represent a term: the embeddings of the words contained in the term are averaged. In contrast, the proposed method learns the term embeddings directly.
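For completeness, the baseline representation can be sketched in a few lines, again with hypothetical variable names: each term vector is the mean of the vectors of its constituent words.

```python
import numpy as np

def average_term_vector(term, word_vectors):
    """Baseline: represent a multi-word term by the mean of its word vectors."""
    vectors = [word_vectors[w] for w in term.split() if w in word_vectors]
    return np.mean(vectors, axis=0)

rng = np.random.default_rng(3)
word_vectors = {w: rng.normal(size=50) for w in ["lung", "cancer"]}
print(average_term_vector("lung cancer", word_vectors).shape)   # (50,)
```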

Fig. 2 Visualization of term embeddings where each term consists of 2 words, and a and b are based on our method and word embeddings

Fig. 3 Visualization of term embeddings where each term consists of 5 words, and a and b are based on our method and word embeddings

Fig. 4 Visualization of term embeddings where each term consists of 3 words, and a and b are based on our method and word embeddings

Fig. 5 Visualization of term embeddings where each term consists of 4 words, and a and b are based on our method and word embeddings

As shown in Figs. 6 and 7, the first column is our method and the second column is the baseline method. We observe that the proposed method finds closely related terms, while the word vector-based method mainly finds insignificant words or phrases related to only one word of the term. This limitation arises because the word vector-based method can only find other words similar to a single word of the phrase; the computation preserves the semantics of each word instead of the entire phrase. Since each constituent word is used to represent the term, the most similar item is usually a word within the term itself. This kind of information is of little value, so we removed the words contained in the term from the results. Unlike word vector-based methods, our model can also find abbreviations of terms.

As shown in Fig. 6, we observe that “chronic obstructive pulmonary disease” is closely related to the abbreviation “copd”, and “liquid chromatography-tandem mass spectrometry” is closely related to the abbreviation “ls-ms”. As shown in Fig. 7, “toll-like receptor” is closely related to different types of “tlr”, and for “human immunodeficiency virus” we find the abbreviation “hiv”. Based on our method, abbreviations can be found for almost all the terms, while the baseline method does not find abbreviations. This shows that our model better expresses the true semantics of terms. We performed experiments on the two datasets separately and found that a larger dataset contains more term mentions; with many candidate terms, each term is more likely to find related terms. When a corpus contains fewer samples, it also contains fewer terms, so a term may not find similar terms, although some phrases related to the term can still be found. It can be seen that our algorithm achieves good results on corpora of different sizes.

Fig. 6 The most closely related terms based on the PubMed-10k dataset

Fig. 7 The most closely related terms based on the PubMed-100k dataset

4.4 Analysis of term relationship

In this subsection, we analyze the differences between terms of various lengths for learning term embeddings. The baseline is the most commonly used word vector-based term representation. We apply principal component analysis (PCA) [30] to reduce the dimension of the learned term embeddings, which enables a low-dimensional visualization for observing the semantic similarity between terms. As shown in Figs. 8 and 9, we found that the term embeddings learned by our method better reflect the relationships between terms. For example, disease-related terms lie relatively close to each other, whereas the word vector-based term representation does not reveal the semantic relationships between terms well. The baseline method mainly retains word-level similarity, so it cannot capture term similarity. We further find that the longer the term, the less accurate the term relationships produced by the word vector-based method, and the more obvious the need for term embeddings, so the proposed method learns more reasonable representations for longer terms.
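The visualization step itself is straightforward; the sketch below uses scikit-learn's PCA and matplotlib, with a toy term list and random vectors standing in for the learned embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_terms(term_vectors):
    """Project term embeddings to 2D with PCA and scatter-plot them."""
    terms = list(term_vectors)
    X = np.stack([term_vectors[t] for t in terms])
    xy = PCA(n_components=2).fit_transform(X)
    plt.scatter(xy[:, 0], xy[:, 1])
    for (x, y), t in zip(xy, terms):
        plt.annotate(t, (x, y), fontsize=8)
    plt.show()

# toy vectors; in the paper these come from the term-level GloVe model
rng = np.random.default_rng(4)
demo = {t: rng.normal(size=50) for t in
        ["lung cancer", "breast cancer", "chronic heart failure", "copd"]}
plot_terms(demo)
```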

Fig. 8 Visualization of cancer and a and b are based on our method and word embeddings

Fig. 9 Visualization of lung disease and a and b are based on our method and word embeddings

We analyze the performance of the proposed method on different types of terms. As shown in Fig. 8, for cancer terms our method achieves a more accurate semantic distribution, so that semantically related terms lie closer together. For example, we found that “non-small cell lung cancer” is closely related to “colorectal cancer”, and there are potential relations between “epithelial ovarian cancer” and “neck cancer”, and between “early breast cancer” and “head-and-neck cancer”. This characteristic helps transfer related treatment schemes from other terms. In contrast, the word vector-based method mixes the terms together.

Figure 10 shows the terms related to chronic diseases. Our method brings similar diseases together, and we find some potential links between chronic diseases, such as “chronic pancreatitis” and “chronic heart failure”, and “chronic rhinosinusitis” and “chronic lymphocytic leukemia”. The baseline method does not produce this effect, so an advantage of our method is that it can compare similar chronic diseases to find treatment options.

Fig. 10 Visualization of chronic diseases and a and b are based on our method and word embeddings

Terms related to drugs and treatments are shown in Fig. 11. Our method groups related drugs and treatment methods together, which helps medical researchers choose corresponding treatment plans and recommend additional ones. We observe that “neoadjuvant chemotherapy” and “cancer immunotherapy” are closely related. The baseline method shows no obvious semantic structure.

Fig. 11 Visualization of drug and therapy, and a and b are based on our method and word embeddings

Figure 9 shows the terms related to lung diseases. Our method reveals the relationship between lung diseases. The baseline method does not reveal this relationship. Therefore, our method can be further used to find drugs to treat lung diseases.

We visualize factors and treatments closely related to lung cancer, breast cancer, and coronavirus, as shown in Figs. 12, 13, and 14, respectively. We observe that lung cancer is closely related to “antiretroviral therapy, radiation therapy, gene therapy, adjuvant chemotherapy, prognostic factor, nuclear factor, targeted therapy, photodynamic therapy”. Breast cancer is closely related to “prognostic factor, antiretroviral therapy, adjuvant chemotherapy, radiation therapy, gene therapy, photodynamic therapy”, and “tumour necrosis factor, epidermal growth factor receptor, nuclear factor, targeted therapy, combination therapy” offer more inspiration for the treatment of breast cancer. The coronavirus is closely related to “vascular endothelial growth factor”, “cell therapy”, “replacement therapy”, “neoadjuvant chemotherapy”, “drug development”, “combination therapy”, “photodynamic therapy”, etc. These results show that our method can learn term embeddings from a large-scale corpus and generate inspiration for disease treatment.

Fig. 12 Visualization of factors for breast cancer (blue dot) (color figure online)

Fig. 13 Visualization of factors for lung cancer (blue dot) (color figure online)

Fig. 14 Visualization of factors for coronavirus (blue dot) (color figure online)

5 Conclusion

In this paper, an unsupervised term mining method has been proposed for mining terms from a biomedical corpus. We have combined the term mining method with existing word vector training methods to learn term embeddings that capture the semantic similarity between terms. The proposed method can identify term variations and improve term representations. It is to be noted that the proposed method can be applied across domains without the need for external resources. We also analyzed the distribution of diseases and treatments based on the learned term embeddings, which can be used to explore treatment schemes for some challenging diseases. Extensive experiments were conducted on the PubMed-10K and PubMed-100K datasets to determine the effectiveness of the proposed method (see Table 1). A comprehensive evaluation was carried out through visualization of term embeddings, using cosine similarity to determine the terms most similar to each term. The visualization demonstrates the performance of the proposed method and serves as a way to explore treatments for novel diseases. In future work, the proposed model may be applicable in several other domains [31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54].