Abstract

Mining and understanding patients' disease-development patterns is a major healthcare need. A large number of research studies have focused on medical resource allocation, survivability prediction, risk management of diagnosis, etc. In this article, we are specifically interested in discovering risk factors for patients with a high probability of developing cancers. We propose a systematic and data-driven algorithm built around the idea of association rule mining. More precisely, the rule-mining method is first applied to the target dataset to unpack the underlying relationships among cancer-risk factors by generating a set of candidate rules. This set is then represented as a rule graph, from which informative rules are identified and selected with the aim of enhancing the interpretability of the result. Compared to the hundreds of rules generated by the standard rule-mining approach, the proposed algorithm yields a concise rule subset without losing the information contained in the original rule set. The proposed algorithm is then evaluated using one of the largest cancer data resources. We found that our method outperforms existing approaches in identifying informative rules while requiring affordable computational time. Additionally, relevant information from the selected rules can be used to inform health providers and authorities for cancer-risk management.

1. Introduction

Recent years have witnessed a significantly increasing amount of electronic health records (EHR), in addition to other data collected for diagnosis and management purposes. The Surveillance, Epidemiology, and End Results (SEER) resource is one typical example. As a comprehensive and authoritative resource for cancer statistics, SEER is a publicly available dataset originating from the United States. This data repository aims to provide high-quality and comprehensive cancer information, in order to help institutions and laboratories worldwide perform their own research. As such, the SEER dataset has been used for a diverse range of research applications, resulting in more than 1500 copies released for public use annually. In addition, the SEER repository has been evolving and updated over time, through the addition of new patient samples, the inclusion of more medical features/variables, the involvement of new types of cancers, etc.

Not surprisingly, numerous machine learning methods have been applied to the SEER dataset for monitoring patient status and facilitating a better understanding of cancer treatment and survivability. Prior research efforts include Expert Systems [1], Fuzzy Systems [2, 3], Evolutionary Computation [4, 5], Support Vector Machines [6], and Neural Networks and/or Deep Learning [7]. Yet, open research questions remain. Expert/Fuzzy Systems, for instance, typically rely on human knowledge to determine (semi)static decision strategies. Intuitively, this a priori knowledge may vary from expert to expert, thereby resulting in significantly different outcomes. In addition, knowledge/expertise acquisition can be very time-consuming and labor-expensive, particularly when the scale/dimension of the given problem is large. On the other hand, Support Vector Machines and Neural Network based approaches are limited in their interpretability, and their results often remain questionable for end users.

Alternatively, we consider adopting the association rule-mining (ARM) algorithm in this study. As one of the most popular data-mining algorithms, ARM is characterized by its capability of being data-driven (less dependency on external knowledge, compared to Expert Systems) and its interpretability (high transparency compared to Neural Networks). As such, ARM has attracted much research attention, with wide application in many areas such as the analysis of smart-phone app usage [8], opinion leadership identification [9], and monitoring patients' disease-development behavior [10]. In addition, ARM-based applications in the medical domain can be found in preliminary research [11, 12].

Yet, one major problem with ARM is the huge number of generated rules; a typical ARM result can consist of hundreds or thousands of rules of different lengths. Moreover, many rules overlap and/or repeat each other with minor changes, which leads to the issue of rule redundancy. Obviously, a large number of rules is difficult to examine or interpret manually, not to mention the computational overhead, while applying a small set of rules may not be sufficient to capture the underlying pattern, due to the possible lack of information. Consequently, how to control the number of generated rules while accurately describing the given dataset becomes a critical question for any ARM-based application.

Traditionally, there are two strategies for optimizing the generated rules: (i) the application of a priori domain knowledge and (ii) rule summarization techniques. The former usually works with predetermined conditions to filter rules, which relies on external resources such as expert experience or domain knowledge. In this context, only certain items are permitted to be included in the generated rules, while others are cast as unnecessary items to be removed. Intuitively, this strategy has two major drawbacks: firstly, identifying important items is time-consuming, particularly when the number of available items is large; secondly, experts could impose their own bias when determining item importance, thereby resulting in questionable rules.

On the other hand, rule summarization is a data-driven and automated method, in which less domain knowledge is required. The basic concept of the summarization technique is to identify important rules automatically, from the entire rule set, while minimizing information loss. Existing work is reviewed in Section 2. Inspired by the general applicability of rule summarization, this paper explores the task of discovering patients' patterns using an association rule summarization method. To enhance the summarization capability, we further introduce a cluster-based strategy to identify important rules. More specifically, the proposed algorithm consists of three parts. To begin with, we establish a rule graph based on rule similarity. Rules are then grouped into different clusters using a community-detection method. Finally, significant rules are determined and selected within each individual cluster, which form the output of the proposed summarization. To the best of our knowledge, this is the first study to propose a cluster-based rule summarization algorithm to reveal the relationships among cancer-related risk factors.

The remainder of the paper is organized as follows. Section 2 provides a brief review of related work, such as data-mining-based medical applications; we also discuss traditional techniques for rule summarization and community-detection clustering approaches. Section 3 presents the proposed cluster-based summarization algorithm, where three major phases are discussed: similarity graph construction, community-detection-based clustering, and the summarization strategy applied within each individual cluster. The proposed framework is then evaluated in Section 4 using the SEER dataset to explore patient risk factors, followed by concluding remarks in Section 5.

2. Related Work

This section offers a brief discussion of the state-of-the-art research related to the analysis of patients' patterns. First, we investigate the application of data-mining algorithms in the medical domain. We then discuss the basic concepts behind association rule-mining and summarization methods. Finally, we focus on clustering approaches for community detection.

2.1. Data-Mining-Based Medical Application

Recent years have witnessed a vast number of applications of data-mining techniques in the medical domain [1, 3, 6, 7, 13]. In [1], an expert system was proposed by integrating geographic information and Online Analytical Processing (OLAP) technologies to facilitate environmental health decision support. More precisely, this expert system aimed to investigate potential relations between health problems and environmental risk factors, such as neighborhood, industrial pollutants, and drinking water quality. Another study [6] applied a number of supervised learning techniques to predict the survivability of lung cancer patients. Experimental results suggested that the Gradient Boosting Machine led to the best prediction performance, while the Support Vector Machine was the only model that generated a distinctive output. In addition, the work in [7] investigated the combination of Neural Networks with adversarial domain adaptation. Several scenarios were considered in the experiments for evaluation purposes, including standard supervised classification, unsupervised domain adaptation, and supervised domain adaptation. The results indicated that the hybrid model of Neural Networks and adversarial domain adaptation achieved satisfactory performance on pathology reports. More recently, a Bidirectional Long Short-Term Memory Neural Network (BLSTM-NN) was employed to build an interaction monitoring system in [13]. In that study, ten volunteers were involved and their activities were recorded using a set of Kinect sensors. The 3D skeletons of participants were then detected and tracked using the BLSTM-NN, which revealed the underlying activity patterns and interactions among patients. A more general survey was presented in [3], discussing various methodologies, such as Fuzzy Logic, Neural Networks, and Genetic Algorithms, and their applications in medicine.

The majority of existing systems, however, are generally characterized as expert-defined or black-box in style. For instance, Expert and Fuzzy Systems require domain knowledge to set up prediction strategies, which can be very labor-expensive. Neural Network based approaches, on the other hand, are usually limited by their interpretability, which remains questionable for end users. To sum up, despite the general interest in applying data-mining techniques in the medical domain, discovering patient risk factors is still a difficult task. As an alternative, this paper explores the potential of applying association rule-mining-based methods. In particular, rule-based approaches benefit from their transparency, interpretability, and efficient computation, which have the potential to overcome the aforementioned limitations of other approaches.

2.2. Rule Mining and Summarization

Association rule mining (ARM) is one of the most common data-mining algorithms for relationship analysis. Its goal is to extract rules of the form "IF-THEN," such that if a set of variable values is found, then another set of variables will generally have a specific value. A typical example of a patient rule is "AGE_DX(1), MAR_STAT(1) ⇒ SEQ_NUM(0), SRV_TIME(>60)", which indicates that if a patient is diagnosed at less than 53 years old (i.e., AGE_DX(1)) and is single (i.e., MAR_STAT(1)), then to some extent she/he will have only one primary tumor in the lifetime (i.e., SEQ_NUM(0)) and will survive for more than 60 months (i.e., SRV_TIME(>60)).

As such, the technique is very useful for associating an immediate subsequence (i.e., consequent) with a previous condition (i.e., antecedent) and for discovering patterns of interaction among different factors. The importance of a rule is usually estimated through critical indicators such as "support" and "confidence." Mathematically, given a rule $X \Rightarrow Y$, its support is the proportion of records which contain all items from $X \cup Y$, which can be computed as follows:

$$\mathrm{supp}(X \Rightarrow Y) = \frac{N(X \cup Y)}{N}, \tag{1}$$

where $N(X \cup Y)$ is the number of records containing $X \cup Y$ and $N$ is the total number of records. The confidence of the rule is accordingly computed as

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}. \tag{2}$$

Consequently, the “support” indicator is used to measure the extent to which the antecedent and consequent occurs simultaneously, while the “confidence” indicator estimates how often the consequent occurs given the antecedent.
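
To make these two indicators concrete, the following minimal Python sketch computes support and confidence over a toy set of transactions; the item names are hypothetical SEER-style labels introduced only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(X ∪ Y) / supp(X): how often the consequent occurs given the antecedent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# toy transactions with hypothetical discretized items
transactions = [
    {"AGE_DX(1)", "MAR_STAT(1)", "SEQ_NUM(0)"},
    {"AGE_DX(1)", "SEQ_NUM(0)"},
    {"AGE_DX(2)", "MAR_STAT(1)"},
]
print(support({"AGE_DX(1)", "SEQ_NUM(0)"}, transactions))      # 0.667
print(confidence({"AGE_DX(1)"}, {"SEQ_NUM(0)"}, transactions))  # 1.0
```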

Due to its high interpretability and efficiency, ARM has been applied in many settings, such as the analysis of smart-phone app usage [8], opinion leadership identification [9, 14], and monitoring patients' disease-development behavior [10], to name a few. In the pilot work of [8], the authors aimed to investigate how students use their smart-phone apps to support online learning. App data from 148 schools were collected, and the k-means algorithm was employed to separate students into five groups based on their app usage. By mining pattern rules from each cluster, the results suggested that students' online patterns showed a shifting ratio between educational and noneducational apps. In addition, the generated rules also revealed unique emphases on different types of apps that could impact student learning performance. The work in [14], on the other hand, investigated a niche subset of user-generated popular culture content on Douban, a well-known Chinese-language online social network. Built on a dataset comprising 714,946 comments and 228,806 distinct users, a parallel rule-mining algorithm was proposed. The experimental results demonstrated the flexibility and applicability of the rule-based method for extracting useful relationships from complex social media data. In addition, another work exploring patients' behavior in terms of disease complications and recurrences was reported in [10]. For this particular research, a database about colorectal cancer, with 1516 patients and 126 attributes, was considered. At its core, four heuristic operators and a complete methodology were proposed to implement the rule-mining process. From the experiments, the rule-based approach showed advantages over standard approaches, such as associative classification methods, in identifying risk factors.

The major problem with traditional ARM, however, is the huge number of generated rules, which makes examining them manually one by one inefficient. The large number of rules also reduces the interpretability of the result as a whole. To overcome this problem, one established approach is rule summarization, i.e., summarizing rules based on their significance without degrading the relationship information expressed by the entire rule set. There are a few implementations for summarizing important rules, including APRX-COLLECTION [15] and RPGlobal [16]. To begin with, APRX-COLLECTION introduces the concept of a false positive rate ($\epsilon$), which is used to control the level of wrong coverage. More precisely, APRX-COLLECTION first forms an aggregated rule set $\mathcal{R}'$ by enumerating all possible combinations of the original rules $\mathcal{R}$. As such, additional rules, even though they might not exist in $\mathcal{R}$, could be created. Next, rules from $\mathcal{R}'$ are selected according to two criteria: (i) the selected rules should cover most items from $\mathcal{R}$ and (ii) the number of additional items should be less than the level of $\epsilon$. On the other hand, RPGlobal adopts similar selection criteria by identifying rules that cover most items from $\mathcal{R}$. The major difference is that RPGlobal chooses rules from $\mathcal{R}$ directly, instead of generating $\mathcal{R}'$. In addition, to limit the increment of additional rules, RPGlobal further introduces a user-defined parameter to control the number of rules selected each time.

Overall, rule-mining summarization techniques have been proposed to overcome the problem associated with the huge number of generated rules. The summarized rules make the subsequent interpretation process more efficient and easier, by filtering least-important rules and significantly reducing the scale of rules. Inspired by this insight, an enhanced rule summarization method is proposed in this study, which is in light of clustering, in particular, the community-detection technique.

2.3. Community Detection

As a graph clustering technique, community detection has attracted a lot of attention in the past decade due to the increasing scale of social networks. With the rapid growth of Internet infrastructure, more and more people utilize online resources, such as Twitter and Facebook, in their daily life. The result is a huge social network, in which individual users play the role of nodes/vertices and their connections (e.g., friendships) become the edges of the network. As such, both industry stakeholders and academia are interested in analyzing these giant social networks to formulate better marketing and/or development strategies. In particular, identifying communities within complex networks is of great importance for many real scenarios. A typical example is forming an online community around a group of people who share the same interest.

A number of different methods have been proposed to implement community detection. A pioneering work [17] was based on the concept of edge betweenness centrality (EBC). For each individual edge within the network, its EBC was measured by the total number of shortest paths (for any two vertices) passing through this particular edge. As a result, an edge with higher EBC became a good indicator for separating communities, while an edge with lower EBC was more likely to exist within a small community. By removing edges with high EBC, the entire network could eventually be split into small groups/communities. Another work was reported in [18] with a similar measurement, the edge clustering coefficient (ECC). This measurement counts the number of triangles involving a given edge, compared to the total number of such possible triangles. In contrast to EBC, edges with low ECC were considered connections among communities. As such, disjoint subnetworks can be formed by eliminating those low-ECC edges.

In addition to edge-based measurements, the Walktrap algorithm from [19] considered the topological similarity between vertices. The main idea was to divide the network based on a distance between vertices such that the distance within the same community is small but becomes larger across different groups. This vertex distance was formally defined by (i) the walking probability from one vertex to another and (ii) the vertex degree. Another vertex-based algorithm was proposed in [20], termed Label Propagation. To begin with, every vertex was randomly initialized with a unique label. During each iteration, each vertex then adjusted its label based on its neighbors; that is, the new label was set to the majority label among neighboring vertices. Finally, communities were formed by grouping vertices with the same labels. The Infomap algorithm, on the other hand, was proposed using the concepts of random walks and information diffusion [21]. It started by performing a random walk within the network and calculated the information flows using the trajectory of the random walk. An information map was accordingly established, which differentiated communities with a diverse range of map importance. One advantage of the Infomap algorithm is its nearly linear computational time, leading to a very efficient process.
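
As a small illustration of the edge-based and label-based strategies discussed above, the following sketch runs the networkx implementations of Girvan-Newman (EBC-based) and Label Propagation on a toy benchmark graph; it is only a demonstration and not the implementation used in the cited works.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # small benchmark social graph

# Girvan-Newman: repeatedly remove the edge with the highest betweenness (EBC),
# splitting the network into progressively smaller communities.
gn_levels = community.girvan_newman(G)
first_split = next(gn_levels)            # first partition into two communities
print([sorted(c) for c in first_split])

# Label Propagation: every vertex adopts the majority label of its neighbors
# until labels stabilize; vertices sharing a label form a community.
lp_communities = list(community.label_propagation_communities(G))
print(len(lp_communities), "communities found by label propagation")
```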

2.4. Summary

In this section, we briefly discussed existing work applying data-mining techniques to medical problems. We also reviewed the basic concepts of rule mining, rule summarization, and community-detection approaches. Although preliminary work has been conducted to identify rules related to patients' risk management, traditional rule-mining algorithms suffer from a major problem associated with the huge number of generated rules. To cope with this issue, rule summarization techniques offer the advantage of selecting important rules while minimizing information loss. Taking all these aspects into account, we propose an enhanced summarization algorithm using community-detection approaches, which is detailed in the following section.

3. The Proposed Framework

In this section, we discuss a systematic and data-driven approach to discover risk-relevant factors. The main contribution of this study is the proposal of a novel cluster-based summarization algorithm. As illustrated in Figure 1, the proposed approach consists of three phases. To begin with, we apply the traditional rule-mining algorithm on the entire dataset to generate a comprehensive set of potential rules. Due to the large scale of this rule set, we then represent it as a rule-similarity graph; see Section 3.1. Secondly, the community-detection algorithm is employed to identify clusters from this rule graph; see Section 3.2. Finally, informative rules across clusters are summarized, as introduced in Section 3.3.

For convenience, Table 1 summarizes notations used throughout the paper.

3.1. Similarity Graph

The main purpose of this first phase is to generate a complete rule set that represents the entire set of transaction records and then to construct a rule-similarity graph. Towards this end, there are several steps to consider, including data discretization, rule mining, similarity measurement, and graph construction.

3.1.1. Data Discretization

To begin with, the rule-mining algorithm works well with discrete data rather than continuous data. However, in real-world scenarios, the majority of medical data is continuous and not directly operable by rule-mining approaches. To quantify the extracted features, a preprocessing step of data discretization is necessary. For simplicity, this study splits a continuous input into $g$ groups (where $g$ is a user-defined parameter). Samples belonging to the same group are assigned the same label, converting the continuous data into discrete data. Note that domain knowledge is required to decide the number of groups (i.e., $g$), and different business or operational requirements could result in a variety of discretization ranges. The advantage of the discretization is twofold: (i) continuous data is represented using discrete labels to facilitate the subsequent application of the rule-mining algorithm; (ii) the raw continuous dataset is represented in a smaller but meaningful format, which is easier to interpret and also saves computational cost.

3.1.2. Rule Mining

There exists a diverse range of implementations for rule mining, such as Apriori and FP-Growth. In particular, Apriori employs a “bottom-up” strategy to produce frequent-item sets, in which repeated scanning of the entire dataset is required. This typically leads to an expensive computational cost. Therefore, in this study, the FP-Growth algorithm is implemented, which adopts a “top-down” strategy to produce frequent-item sets. The main advantage is that it requires less scanning time to generate possible combinations of frequent sets.
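
As a concrete illustration, the sketch below mines frequent item sets and rules with an off-the-shelf FP-Growth implementation (here the mlxtend library, which is our own tooling choice and not necessarily the implementation used in this study); the transactions and item names are hypothetical.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# toy discretized transactions with hypothetical SEER-style items
transactions = [
    ["SEX(2)", "NUMPRIMS(1)", "SEQ_NUM(0)"],
    ["SEX(2)", "NUMPRIMS(1)", "RADIATN(0)"],
    ["SEX(2)", "SEQ_NUM(0)", "RADIATN(0)"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# frequent item sets via FP-Growth, then rules above a confidence threshold
frequent = fpgrowth(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```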

3.1.3. Similarity Measurement

Before we construct the rule-similarity graph, it is essential to define a similarity measurement for any pair of rules. Consider two rules of the typical form $r_1: X_1 \Rightarrow Y_1$ and $r_2: X_2 \Rightarrow Y_2$. The similarity between the two rules is accordingly defined in terms of the relative item coverage (RIC):

$$\mathrm{RIC}(r_1, r_2) = \frac{|(X_1 \cup Y_1) \cap (X_2 \cup Y_2)|}{|(X_1 \cup Y_1) \cup (X_2 \cup Y_2)|}. \tag{3}$$

As observed, the similarity is measured as the number of items common to both rules (including antecedents and consequents) divided by the number of all distinct items occurring in the two rules. We then introduce the process of rule-graph construction based on the similarity measurement in equation (3).
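
The following short Python sketch renders the RIC measure under this Jaccard reading over the full item sets of two rules; the example rules are hypothetical.

```python
def ric(rule_a, rule_b):
    """Relative item coverage (equation (3)): Jaccard similarity over the
    full item sets (antecedent ∪ consequent) of two rules."""
    items_a = set(rule_a[0]) | set(rule_a[1])
    items_b = set(rule_b[0]) | set(rule_b[1])
    return len(items_a & items_b) / len(items_a | items_b)

r1 = ({"AGE_DX(1)", "MAR_STAT(1)"}, {"SEQ_NUM(0)"})
r2 = ({"AGE_DX(1)"}, {"SEQ_NUM(0)", "NUMPRIMS(1)"})
print(ric(r1, r2))  # 2 common items out of 4 distinct items -> 0.5
```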

3.1.4. Graph Construction

Graphs are a very important data structure in computer science, and a large number of existing works have demonstrated the solid application and success of graph-based techniques [2, 4]. Inspired by this insight, we also consider representing the rules in a graph format. As such, the rule graph is represented as $G = (V, E)$ in our study, where each vertex $v_i \in V$ denotes a rule $r_i$, and the edge $e_{ij} \in E$ is the connection between the $i$-th and $j$-th vertices. Furthermore, $e_{ij}$ is associated with the similarity between the rules $r_i$ and $r_j$, i.e., $\mathrm{RIC}(r_i, r_j)$. Note that there is a diverse range of options for manipulating $E$. For instance, we can introduce a user-defined threshold $\theta$ and remove $e_{ij}$ if $\mathrm{RIC}(r_i, r_j) < \theta$; that is, two vertices are only connected if their similarity is larger than $\theta$. Alternatively, we can employ the concept of $k$-nearest neighbors, in which only the $k$ most similar vertices to a specific vertex are connected. Without loss of generality, we consider the full-connect strategy; that is, all vertices are connected to each other, and the edge weight equals the similarity $\mathrm{RIC}(r_i, r_j)$.
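
A minimal sketch of the full-connect construction, using networkx (our own tooling choice) and the same RIC measure as above; the rules are hypothetical examples.

```python
import networkx as nx

def build_rule_graph(rules):
    """Full-connect rule-similarity graph: one vertex per (antecedent, consequent)
    rule, edge weight = RIC similarity (the Jaccard measure from Section 3.1.3)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(rules)))
    for i, (xa, ya) in enumerate(rules):
        for j in range(i + 1, len(rules)):
            xb, yb = rules[j]
            a, b = set(xa) | set(ya), set(xb) | set(yb)
            G.add_edge(i, j, weight=len(a & b) / len(a | b))
    return G

rules = [
    ({"AGE_DX(1)", "MAR_STAT(1)"}, {"SEQ_NUM(0)"}),
    ({"AGE_DX(1)"}, {"SEQ_NUM(0)", "NUMPRIMS(1)"}),
    ({"SEX(2)"}, {"NUMPRIMS(1)"}),
]
G = build_rule_graph(rules)
print(G.number_of_nodes(), G.number_of_edges())  # 3 vertices, 3 edges (full-connect)
```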

3.2. Rule Clustering

This second phase applies the community-detection algorithm to identify clusters from the rule-similarity graph. The general idea is to cluster vertices in such a way that samples belonging to the same group are similar, while samples from different groups are dissimilar to each other. As mentioned in Section 2.3, there has been great interest in developing implementations for detecting communities within a graph. Table 2 summarizes the computational complexity of existing implementations for community detection.

As observed, different implementations may lead to different costs, as they focus on different optimization strategies for splitting vertices and/or edges. Considering our case of a rule-based graph, there will be over 10,000 rules, indicating that the final graph has over 10,000 vertices and, under the full-connect strategy, on the order of $10^7$ to $10^8$ edges. As such, we adopt the Infomap algorithm [21] in this study as our community-detection executor, due to its efficiency and affordable computation.
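
For illustration, the sketch below clusters a small weighted graph with networkx's greedy modularity communities, used here purely as an accessible stand-in for the Infomap implementation adopted in this study.

```python
import networkx as nx
from networkx.algorithms import community

# tiny stand-in for the RIC-weighted, full-connect rule graph built earlier
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.7),
                           (3, 4, 0.85), (2, 3, 0.1), (0, 4, 0.05)])

clusters = community.greedy_modularity_communities(G, weight="weight")
rule_clusters = [sorted(c) for c in clusters]  # lists of rule indices, one per cluster
print(rule_clusters)
```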

3.3. Cluster-Based Summarization

The final phase performs rule summarization by determining and selecting important rules within each individual cluster. More precisely, after forming clusters within the rule graph, rules are ranked according to their importance. The top-$m$ rules from each single cluster are then selected according to their importance, where $m$ is a user-determined parameter, to produce the final summary. To begin with, we propose three statistical features before aggregating them to measure the rule importance:

(i) Rule length: the first importance measurement is the rule length. This idea is inspired by the work of APRX-COLLECTION in [15], where a long rule has a better item coverage than shorter ones. Therefore, we use the rule length as a feature to estimate the importance. Given a rule $r: X \Rightarrow Y$, the length feature ($f_{\mathrm{len}}$) is computed as follows:

$$f_{\mathrm{len}}(r) = \frac{|X \cup Y|}{N_{\mathrm{item}}}, \tag{4}$$

where $N_{\mathrm{item}}$ represents the total number of distinct items from all rules.

(ii) Aggregated similarity: we want to select informative rules that represent the majority of rules within a cluster. As such, a rule should be selected if its form is similar or close to that of the remaining rules. Therefore, the second importance measurement is the aggregated similarity; the larger the aggregated similarity of a rule, the higher its rank. Given a cluster $C_k$, the aggregated similarity ($f_{\mathrm{sim}}$) for the $i$-th rule ($r_i \in C_k$) is given as follows:

$$f_{\mathrm{sim}}(r_i) = \frac{1}{|C_k|} \sum_{r_j \in C_k,\, j \neq i} \mathrm{RIC}(r_i, r_j), \tag{5}$$

where $\mathrm{RIC}$ is the similarity measurement from equation (3) and $|C_k|$ represents the total number of rules in the cluster.

(iii) Redundancy: another critical aspect is to consider the redundancy impact once a rule is selected. Note that each cluster is composed of similar rules; therefore, if two rules contain similar items, they are more likely to convey a similar meaning. The redundancy feature is employed to avoid selecting such semantically similar rules, so that the summary does not repeat essentially the same rule. In our study, the following estimation is proposed to measure the redundancy feature ($f_{\mathrm{red}}$):

$$f_{\mathrm{red}}(r_i) = \frac{1}{|S|} \sum_{r_j \in S} \mathrm{RIC}(r_i, r_j), \tag{6}$$

where $S$ is the set of already selected rules and $|S|$ represents the number of rules in $S$. Note that, at the beginning of the summarization, $S = \emptyset$ and $|S| = 0$; in this case, $f_{\mathrm{red}}(r_i) = 0$.

Apart from the aforementioned measurements, we further leverage the support degree as an indicator of rule importance. Consequently, the final score of a specific rule $r_i$ combines four statistical features, namely the support degree, rule length, aggregated similarity, and redundancy, and the following equation is proposed to formulate this calculation:

$$\mathrm{score}(r_i) = \mathrm{supp}(r_i) + \alpha \, f_{\mathrm{len}}(r_i) + \beta \, f_{\mathrm{sim}}(r_i) - \gamma \, f_{\mathrm{red}}(r_i), \tag{7}$$

where $\alpha$, $\beta$, and $\gamma$ are the penalty terms for balancing the four statistical features. Eventually, the cluster-based summarization algorithm is proposed for determining informative rules, which is shown in Algorithm 1.

Input: clusters ($C_1$, $C_2$, ..., $C_K$), the number of selected rules $m$, and the penalty terms $\alpha$, $\beta$, and $\gamma$;
Initialization:
  set the summary rule set to empty: $S \leftarrow \emptyset$
for $k = 1$ to $K$ do
  while (fewer than $m$ rules have been selected from $C_k$ and $C_k \neq \emptyset$) do
    Calculate the score of each rule $r_i \in C_k$ according to equation (7)
    Select the rule with the highest score and label it as $r^{*}$
    $S \leftarrow S \cup \{r^{*}\}$ and remove $r^{*}$ from $C_k$
  end
end
Output: Return the summarized rule set $S$.
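
The following Python sketch illustrates the greedy per-cluster selection described by Algorithm 1. The scoring follows the reconstruction of equation (7) above, the rules are hypothetical, and the function and parameter names are our own.

```python
def ric(a, b):
    """Relative item coverage (equation (3)), repeated here so the sketch runs standalone."""
    ia, ib = set(a[0]) | set(a[1]), set(b[0]) | set(b[1])
    return len(ia & ib) / len(ia | ib)

def summarize_cluster(cluster, m, alpha, beta, gamma, n_items):
    """Greedy top-m selection within one cluster. Each rule is a tuple
    (antecedent_set, consequent_set, support); the score combines support,
    rule length, aggregated similarity, and a redundancy penalty."""
    selected, candidates = [], list(cluster)
    while len(selected) < m and candidates:
        def score(rule):
            ante, cons, supp = rule
            items = (ante, cons)
            length = len(ante | cons) / n_items
            agg_sim = sum(ric(items, (o[0], o[1]))
                          for o in cluster if o is not rule) / len(cluster)
            red = (sum(ric(items, (s[0], s[1])) for s in selected) / len(selected)
                   if selected else 0.0)
            return supp + alpha * length + beta * agg_sim - gamma * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# usage: pick one summary rule from a toy cluster of three rules
cluster = [
    ({"AGE_DX(1)", "MAR_STAT(1)"}, {"SEQ_NUM(0)"}, 0.42),
    ({"AGE_DX(1)"}, {"SEQ_NUM(0)", "NUMPRIMS(1)"}, 0.55),
    ({"SEX(2)"}, {"NUMPRIMS(1)"}, 0.73),
]
print(summarize_cluster(cluster, m=1, alpha=0.5, beta=0.5, gamma=0.5, n_items=10))
```
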
3.4. Summary

The main contribution of this work is to formulate the problem of rule summarization as a graph clustering process. Next, we discuss the computational complexity of the proposed method. Given a dataset with $n$ samples, the FP-Growth algorithm is first employed to mine the potential rules. Next, the rule-similarity graph is constructed, which requires computing pairwise similarities at a complexity of $O(R^2)$ (where $R$ is the total number of generated rules). Note that, with the established graph, there are $R$ vertices and $R(R-1)/2$ edges, as we consider the full-connect strategy. As such, applying the near-linear community-detection algorithm to cluster this rule graph costs on the order of the number of edges, i.e., $O(R^2)$. Finally, the cluster-based summarization algorithm needs to go through all clusters to select the top $m$ rules. For the $k$-th cluster, the time complexity is $O(m \cdot |C_k|)$, where $|C_k|$ is the number of rules within the $k$-th cluster. In the worst case, $|C_k| = R$, leading to a worst-case complexity of $O(m \cdot R)$ for a single cluster. Overall, the complexity of the proposed algorithm is dominated by the graph construction and clustering steps, i.e., $O(R^2)$.

Notice that the complexity of the summarization step itself depends on the total number of generated rules (i.e., $R$) as well as on the number of available clusters and rules to be selected. In the worst case, its complexity could reach $O(R^2)$ if we select all rules; by contrast, it reduces to $O(R)$ if only one rule is chosen from each individual cluster.

4. Experimental Results and Analysis

This section discusses the experimental results obtained by applying the proposed algorithm to the SEER dataset. Details of the employed dataset are presented in the following section. The aims of the experiments are to (i) evaluate the influence of key parameters on the clustering performance and (ii) compare the proposed algorithm with existing rule summarization approaches.

4.1. Experiments Design

As mentioned before, the SEER dataset consists of samples covering a great number of cancer types. In this study, breast cancer is explicitly employed as the main resource. Relevant samples and variables from SEER that are associated with patient survivability and tumor status are selected based on a set of inclusion criteria (the selection criteria for variables can be found in our preliminary work [22]). As such, 85,189 patient samples (with 12 variables) are identified as the main experimental data; a detailed description of the chosen variables can be found in Table 3.

Note that, among these 12 variables, two are of the continuous type, i.e., AGE_DX and SRV_TIME; Figures 2 and 3 illustrate their distributions, respectively. For the continuous data, the data discretization process is used to split them into groups, with each group assigned one unique label. In this study, we set $g = 3$ and $g = 2$ for the variables AGE_DX and SRV_TIME, respectively. As a result, the discretized ranges for AGE_DX are [17, 53), [53, 67), and [67, 104], based on an equal-size separation. Meanwhile, for the survival-months variable, we take SRV_TIME = 60 as the splitting threshold, as the majority of studies have categorized patients' survivability using a threshold of five years [10, 22].
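
As a sketch of this preprocessing step, the snippet below applies the boundaries described above with pandas; the data values and the group labels are hypothetical stand-ins for the SEER fields.

```python
import pandas as pd

# toy records; AGE_DX in years, SRV_TIME in months (hypothetical values)
df = pd.DataFrame({"AGE_DX": [45, 58, 72], "SRV_TIME": [24, 80, 61]})

# equal-size split of AGE_DX into three groups: [17, 53), [53, 67), [67, 104]
# (the last interval is treated as right-open here for simplicity)
df["AGE_DX_GROUP"] = pd.cut(df["AGE_DX"], bins=[17, 53, 67, 104], right=False,
                            labels=["AGE_DX(1)", "AGE_DX(2)", "AGE_DX(3)"])

# five-year survivability threshold: more than 60 months vs. at most 60 months
df["SRV_TIME_GROUP"] = (df["SRV_TIME"] > 60).map(
    {True: "SRV_TIME(>60)", False: "SRV_TIME(<=60)"})
print(df)
```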

4.2. Results from Cluster-Based Summarization

In this section, we focus on the results of the proposed algorithm with respect to rule summarization of patients' behavior. Towards this end, we start by examining the complete rule set, then perform rule clustering, and, more importantly, investigate the summarization results across different clusters. To begin with, we utilize the FP-Growth algorithm to generate the complete rule set, which leads to a total of 11,887 rules. Table 4 shows the number of generated rules as a function of the support degree, where a higher support normally corresponds to a smaller number of rules. Note that the threshold for the confidence degree is fixed at 0.5 in all cases.

We then perform community detection to cluster the generated rule set. A particular problem with the application of community-detection clustering is determining the number of clusters ($K$). The selection of an appropriate value for the number of clusters has a crucial impact on the clustering performance: a value of $K$ that is too large makes it difficult to interpret the result, not to mention the computational cost, while a value of $K$ that is too small may fail to group similar samples and result in poor clustering performance. To identify an optimal value for the number of clusters, the measurement of intercluster similarity ($\mathrm{ICS}$) is introduced, which is estimated as follows:

$$\mathrm{ICS} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|C_k|^2} \sum_{r_i \in C_k} \sum_{r_j \in C_k} \mathrm{RIC}(r_i, r_j), \tag{8}$$

where $\mathrm{RIC}$ represents the relative item coverage (defined in equation (3)), $|C_k|$ represents the number of rules in the $k$-th cluster ($C_k$), and $K$ is the number of clusters.
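
A small sketch of how this measurement could be computed for a candidate clustering, under the reconstruction given above; each rule is an (antecedent, consequent) pair and the helper names are our own.

```python
def ric(a, b):
    """Relative item coverage (equation (3)), repeated so the sketch runs standalone."""
    ia, ib = set(a[0]) | set(a[1]), set(b[0]) | set(b[1])
    return len(ia & ib) / len(ia | ib)

def intercluster_similarity(clusters):
    """Average within-cluster pairwise RIC, averaged over all clusters
    (one plausible reading of the ICS measure defined above)."""
    total = 0.0
    for cluster in clusters:
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        total += sum(ric(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return total / len(clusters)
```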

Using this measurement, our aim is to select the value of $K$ that leads to the largest value of $\mathrm{ICS}$, so that similar rules are grouped together and the within-cluster similarity is maximized. Towards this end, Figure 4 plots the intercluster similarity for varying values of $K$. We first confirm that the computational time increases with the number of clusters; for instance, the minimal and maximal times (426.07 seconds and 1658.97 seconds, respectively) correspond to the smallest and largest tested values of $K$. On the other hand, we observe that the intercluster similarity ($\mathrm{ICS}$) becomes stable once $K$ reaches 5.

Given the more expensive computation associated with larger values of $K$, we take $K = 5$ as the optimal value in the following study. As such, we perform the community-detection algorithm to cluster the rules into five groups, and the statistical information about this particular clustering result is summarized in Table 5.

Next, the proposed summarization algorithm is performed on these five clusters to identify informative rules. The key parameters for the summarization are the number of selected rules per cluster ($m$) and the penalty terms $\alpha$, $\beta$, and $\gamma$. As a result, the summarized rules are listed in Table 6, which shows a diverse coverage of support degrees and numbers of items. For instance, all selected rules together cover nearly 98% of the patient samples (high support degree), while seven distinct items occurring in the results are identified as key items, including SEQ_NUM(0), HISTREC(9), RADIATN(0), SEX(2), NUMPRIMS(1), ORIGIN(0), and SRV_TIME(>60). We then compare our proposed approach with others.

4.3. Comparison Results

In this section, we compare the proposed algorithm with existing approaches in terms of mining risk factors associated with patients' disease development. To begin with, we extract the top 15 rules from the complete rule set (without any summarization technique) by simply ranking the rules based on their support degrees. These rules are cast as the baseline results, and Table 7 lists this rule set with high support degrees.

As observed, the majority of the rules from the baseline results overlap each other. For instance, there are only five distinct items observed across both the antecedents and consequents, namely "SEX(2)," "NUMPRIMS(1)," "ORIGIN(0)," "SEQ_NUM(0)," and "HISTREC(9)." That is, approximately 88.1% of the item occurrences are repeated in this baseline result. On the other hand, the rules in Table 7 have a relatively simple format (with no more than 3 items), which indicates that some complex or advanced rules could be missed. More importantly, although these top 15 rules are selected based on their support degree, their data coverage (94.6% of the entire data) is lower than that of our method. By contrast, the proposed method performs better than the baseline rules, identifying more key items (seven) and covering a larger number of patients (98.3%).

Next, traditional rule summarization techniques, namely the APRX-COLLECTION [15] and RPGlobal [16] methods, are considered; their results are shown in Tables 8 and 9, respectively. Again, these methods are applied to summarize the rules by selecting the top 15 ones.

The results from the APRX-COLLECTION algorithm clearly indicate its preference for selecting long rules, regardless of their support. As mentioned before, the principle of APRX-COLLECTION is to choose rules with a large item coverage. As a result, the top two rules from APRX-COLLECTION, for instance, contain 8 and 6 items, respectively. However, the major problem with APRX-COLLECTION is the data coverage, that is, the support degree of the selected rules. All selected rules have support below 0.2, which corresponds to a very small population of patients. In other words, the results from APRX-COLLECTION are insufficient to cover the majority of the patient data, and they could lead to a misleading summarization result.

A similar problem occurs with the RPGlobal algorithm. Again, long rules are preferred and rules with more items are selected. However, RPGlobal also considers removing redundancy by encouraging rules that cover other parts of the population. As a result, the data coverage of RPGlobal is slightly better than that of APRX-COLLECTION, with average support degrees of 0.1171 (RPGlobal) and 0.1154 (APRX-COLLECTION), respectively. Consequently, the major problem with traditional summarization technologies is that the selected rules are associated with a small data coverage, thereby reducing their generalization capability.

On the other hand, as mentioned before, our summarized rule set (illustrated in Table 6) shows a good balance between the support degree (data coverage) and the number of covered items. For instance, our algorithm yields a total of 7 distinct items, which is more than the baseline (five). Therefore, more detailed or complex rules can be selected using our method. In addition, compared to the traditional summarization methods, the proposed approach leads to a summarized rule set with approximately 98% support degree, which outperforms its peers whose support degrees are below 25%. In other words, our method is able to cover the majority of patient cases. Overall, the comparison results clearly show the ability of our method to represent an oversized rule set by identifying important rules (with high support degree in terms of data coverage and less redundancy in terms of item overlap).

Finally, we investigate the computational cost of the different approaches; the comparison results are shown in Figure 5. From the experimental results, we notice that the proposed algorithm requires an affordable time for the rule summarization. For example, the proposed algorithm needs 535.8 seconds, which is much better than RPGlobal (7422.18 seconds). Although the APRX-COLLECTION approach requires the least time (10.23 seconds), its summarization result is the worst among the three cases. As such, the satisfactory performance of our proposed algorithm compensates for its computational cost. More importantly, the reported computational time is accumulated over the five clusters in our approach. Note that the summarization within individual clusters can be performed in parallel, which could further reduce the total cost. We leave this for future work.

5. Conclusion

In this paper, we propose a novel rule summarization algorithm for identifying informative rules from a cancer-relevant data repository. Three phases are introduced: generating a comprehensive rule set and the corresponding rule-similarity graph, performing community detection to cluster the rules, and selecting important rules to produce a concise rule summary.

The proposed method is evaluated using the breast cancer dataset from the Surveillance, Epidemiology, and End Results (SEER) resource, which includes 85,189 patient samples and 12 variables. The data lead to a complete rule set with 11,887 rules. By applying the proposed method, we manage to identify informative rules with high support degree in terms of data coverage and less redundancy in terms of item overlap. Experimental results also demonstrate that the proposed method achieves competitive performance compared to existing approaches, in terms of satisfactory summarization results and affordable computational cost. Overall, the proposed method offers flexible and efficient applicability for processing a large amount of medical data, which in turn can be utilized to facilitate patients' risk management.

Table 10 summarizes the item sets and related medical information for the rules in Tables 6-9.

Data Availability

The Surveillance, Epidemiology, and End Results (SEER) data used in our manuscript to support the findings have been deposited in the publicly available, open-source data repository, which is accessible from https://seer.cancer.gov/.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant no. 61873004) and the Humanities and Social Sciences Foundation of Anhui Department of Education, China (Grant no. SK2017A0098).