Exploring cohesive subgraphs with vertex engagement and tie strength in bipartite graphs
Introduction
Bipartite graphs are widely used to represent networks with two different groups of entities, such as user-item networks [1], author-paper networks [2], and member-activity networks [3]. In bipartite graphs, cohesive subgraph mining has numerous applications, including fraudster detection [4], [5], [6], group recommendation [7], [8], and discovering inter-corporate relations [9], [10].
(α,β)-core and k-bitruss are two representative cohesive subgraph models in bipartite graphs, extended from the unipartite k-core [11] and k-truss [12] models. The (α,β)-core is the maximal subgraph of a bipartite graph G such that the vertices on the upper or lower layer have at least α or β neighbors, respectively. The (α,β)-core models vertex engagement as degree and treats every edge equally, but ties (edges) in real networks have different strengths. The k-bitruss is the maximal subgraph where each edge is contained in at least k butterflies (i.e., 2×2-bicliques), which can model tie strength [13], [14].
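The butterfly count of an edge, which the k-bitruss uses as tie strength, can be sketched as follows. This is a minimal illustration, not the counting algorithm used in the paper; the function name `butterfly_support` and the edge-list input format are assumptions made for the example. It relies on the observation that two upper vertices with k common lower neighbors contribute k - 1 butterflies to each edge between them and those neighbors.

```python
from collections import defaultdict
from itertools import combinations


def butterfly_support(edges):
    """Return {(u, v): number of butterflies containing edge (u, v)}.

    `edges` is an iterable of (upper_vertex, lower_vertex) pairs.
    """
    upper_nbrs = defaultdict(set)  # upper vertex -> its lower neighbours
    for u, v in edges:
        upper_nbrs[u].add(v)

    support = {(u, v): 0 for u in upper_nbrs for v in upper_nbrs[u]}
    for u, w in combinations(upper_nbrs, 2):
        common = upper_nbrs[u] & upper_nbrs[w]
        k = len(common)
        if k < 2:
            continue  # fewer than two shared neighbours: no butterfly
        # u, w and any two common neighbours form a butterfly, so every edge
        # between {u, w} and `common` lies in k - 1 butterflies of this pair.
        for v in common:
            support[(u, v)] += k - 1
            support[(w, v)] += k - 1
    return support
```

The pairwise loop is quadratic in the number of upper vertices, so it is only suitable for small examples.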
In the author-paper network shown in Fig. 1, the graph is the (α,β)-core (α=β=2) and the light blue region is the k-bitruss (k = 2). Without considering tie strength, the (α,β)-core blindly includes research groups of different levels of cohesiveness. We can see that and are not as closely connected as the rest of the authors. The k-bitruss model can exclude the relatively sparse subgraph containing and , but it also deletes edges and even though their incident vertices are present. This exposes the drawbacks of the k-bitruss model: (1) As the k-bitruss only keeps strong ties, the weak ties between important vertices are missed. In Fig. 1, it fails to recognize the contributions of authors in papers . (2) After removing weak ties, the tie strengths are modeled inaccurately. Edges and have more supporting butterflies ( form a butterfly) than , but their tie strengths are modeled as equal.
In this paper, we study the efficient and scalable computation of the τ-strengthened (α,β)-core, denoted (α,β)_τ-core, which is the first cohesive subgraph model on bipartite graphs to consider both tie strength and vertex engagement. Given a bipartite graph G, we model the tie strength of each edge as the number of butterflies containing it. With a strength level τ, we consider the edges with tie strength no less than τ to be strong ties. The engagement of a vertex is modeled as the number of strong ties to which it is incident. Given engagement constraints α, β and a strength level τ, the (α,β)_τ-core is the maximal subgraph of G such that each upper or lower vertex in the subgraph has at least α or β strong ties, respectively. The (α,β)_τ-core model is highly flexible and is able to capture unique structures. For instance, in Fig. 1, the subgraph induced by vertices is the -core, which cannot be found by the (α,β)-core or the k-bitruss for any α, β, or k. Also, as shown in Fig. 1, the (α,β)_τ-core can preserve the weak ties if the incident vertices are present (e.g., the red edges are preserved due to and ), which better resembles reality. The flexibility of the (α,β)_τ-core model is also evaluated in another experiment conducted on the dataset DBpedia-producer. Fig. 2 shows the subgraphs of different densities found by the (α,β)_τ-core and the (α,β)-core, where density is the ratio between the number of existing edges and the number of all possible edges [13]. 165 subgraphs with a density greater than are found by the (α,β)_τ-core, while only 9 such subgraphs are found by the (α,β)-core.
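The verbal definition above can be written compactly as follows. The symbols sup, eng, U(R), L(R) and E(R) are notation assumed for this sketch and may differ from the paper's own, and the sketch assumes that butterflies are counted inside the candidate subgraph R rather than in the whole graph G.

```latex
% Tie strength of edge e and engagement of vertex u inside a subgraph R
\[
\mathrm{sup}(e, R) = \bigl|\{\, B \subseteq R : B \text{ is a butterfly and } e \in B \,\}\bigr|,
\qquad
\mathrm{eng}(u, R) = \bigl|\{\, e \in E(R) : u \in e,\ \mathrm{sup}(e, R) \ge \tau \,\}\bigr|
\]
% R is the (\alpha,\beta)_\tau-core if it is the maximal subgraph satisfying
\[
\forall u \in U(R):\ \mathrm{eng}(u, R) \ge \alpha,
\qquad
\forall v \in L(R):\ \mathrm{eng}(v, R) \ge \beta
\]
```

For instance, with τ = 2 an edge lying in three butterflies counts as a strong tie, while an edge lying in a single butterfly is a weak tie that adds nothing to the engagement of its endpoints.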
Applications. The τ-strengthened (α,β)-core model has many applications. We list some of them below.
- Identify nested communities. On Internet forums like Reddit, Quora, and StackOverflow, users hold conversations on topics that interest them. The users and the topics form a bipartite network. In these networks, communities naturally exist and are nested. For instance, Reddit displays a list of top communities like “News”, “Gaming”, and “Sports” on the front page. The “Sports” community contains many sub-communities, including “Cricket”, “Bicycling” and “Golf”. The edges in sub-communities have higher tie strength because users and topics within them are more closely connected. By increasing the strength level τ, the (α,β)_τ-core captures subgraphs that form a hierarchy, which can model nested communities on bipartite networks.
- Group similar users and items. In online shopping platforms like Amazon, eBay, and Alibaba, users and items form a bipartite graph, where each edge indicates a purchasing record. Such a network consists of many closely connected communities, where the same group of users repeatedly buys some items. Examples of such communities include children-toy communities, student-stationery communities, and patient-medicine communities. Within one community, items are considered more similar, and users tend to be alike due to their everyday shopping habits. As the edges between these users and items have high tie strength (butterfly support), we can use the (α,β)_τ-core to find these communities and group similar users or items together.
Challenges. To obtain the (α,β)_τ-core from the input graph, we can first compute the supports of edges and the engagements of vertices and then iteratively delete the vertices not meeting the engagement constraints. When α, β, and τ are large, the (α,β)_τ-core is small, and computing it from the input graph is time-consuming. Thus, the online computation method cannot support a large number of (α,β)_τ-core queries.
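The online procedure described above can be sketched as a naive fixpoint loop. This is an illustrative sketch only, not the paper's Algorithm 1: the function name `strengthened_core` is hypothetical, supports are recomputed from scratch in every round instead of being updated incrementally, and it assumes that the subgraph is vertex-induced and that tie strength is measured inside the current subgraph.

```python
from collections import defaultdict


def strengthened_core(edges, alpha, beta, tau):
    """Return (upper, lower) vertex sets of the (alpha, beta)_tau-core."""
    edges = list(edges)
    upper = {u for u, _ in edges}
    lower = {v for _, v in edges}
    while True:
        # Adjacency of the subgraph induced by the surviving vertices.
        adj = defaultdict(set)
        for u, v in edges:
            if u in upper and v in lower:
                adj[u].add(v)
        # Engagement = number of incident edges with support >= tau.
        eng_u, eng_l = defaultdict(int), defaultdict(int)
        for u in adj:
            for v in adj[u]:
                # Butterflies through (u, v): pick another upper vertex w
                # adjacent to v and another lower neighbour shared by u and w.
                sup = sum(len(adj[u] & adj[w]) - 1
                          for w in adj if w != u and v in adj[w])
                if sup >= tau:
                    eng_u[u] += 1
                    eng_l[v] += 1
        weak_u = {u for u in upper if eng_u[u] < alpha}
        weak_l = {v for v in lower if eng_l[v] < beta}
        if not weak_u and not weak_l:
            return upper, lower
        # A weakly engaged vertex stays weak in any smaller subgraph,
        # so all of them can be removed in the same round.
        upper -= weak_u
        lower -= weak_l
```

Every round recounts all butterflies, which is exactly the cost that makes online computation unattractive when many queries must be answered.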
In this paper, we resort to index-based approaches. A straightforward solution is to compute all possible (α,β)_τ-cores and build a total index I_{α,β,τ} based on them. Instead of computing all (α,β)_τ-cores from the input graph, we take advantage of the nested property of the (α,β)_τ-core: if α' ≥ α, β' ≥ β, and τ' ≥ τ, then the (α',β')_τ'-core is a subgraph of the (α,β)_τ-core. Specifically, for all possible α and β, we first find the (α,β)-core and then compute the (α,β)_τ-core while gradually increasing the strength level τ. In this manner, we can compute all (α,β)_τ-cores and construct the index I_{α,β,τ}. Although I_{α,β,τ} supports optimal retrieval of the vertex set of any (α,β)_τ-core, it still suffers from long construction time on large graphs. To devise more practical index-based approaches, we face the following challenges.
1. When building the index I_{α,β,τ}, it is time-consuming to enumerate all butterflies containing the deleted edges. Also, the index construction algorithm is prone to visiting the same (α,β)_τ-core subgraph repeatedly, as it can correspond to different combinations of α, β, and τ. It is a challenge to speed up butterfly enumeration and avoid repeatedly visiting the same subgraphs while constructing the total index I_{α,β,τ}.
2. Due to the flexibility of the (α,β)_τ-core model, there are a large number of (α,β)_τ-cores corresponding to different combinations of α, β, and τ. The time cost of indexing all (α,β)_τ-cores becomes unaffordable on large graphs. It is also a challenge to balance building space-efficient indexes and supporting efficient and scalable query processing.
Our approaches. To address the first challenge, we extend the butterfly enumeration techniques in [15] and propose novel computation-sharing optimizations to speed up the index construction process of I_{α,β,τ}. Specifically, we build the Bloom-Edge-Index (hereafter denoted by BE-Index) proposed in [15] to quickly fetch the butterflies containing an edge. The BE-Index captures the relationships between edges and 2×k-bicliques (also called blooms). When deleting an edge, we can quickly locate the blooms containing this edge in the BE-Index and update the supports of the affected edges in these blooms accordingly. In addition, the computation-sharing optimization is based on the fact that the same (α,β)_τ-core subgraph corresponds to various parameter combinations. If we find that the vertices in a subgraph have already been recorded, we can skip the current parameter combination.
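The idea of updating supports through blooms when an edge is deleted can be illustrated with a toy edge-to-bloom index. This is a simplified stand-in, not the BE-Index of [15]: the class name `BloomIndex` is hypothetical, a bloom is taken here to be any pair of upper vertices together with their common lower neighbors, and no compression or ordering technique is applied.

```python
from collections import defaultdict
from itertools import combinations


class BloomIndex:
    """Toy index: bloom = two upper vertices + their common lower neighbours."""

    def __init__(self, edges):
        nbrs = defaultdict(set)
        for u, v in edges:
            nbrs[u].add(v)
        self.blooms = {}                     # (u, w) -> common lower vertices
        self.edge_blooms = defaultdict(set)  # edge -> keys of blooms holding it
        self.support = {(u, v): 0 for u in nbrs for v in nbrs[u]}
        for u, w in combinations(nbrs, 2):
            common = nbrs[u] & nbrs[w]
            if len(common) < 2:
                continue
            self.blooms[(u, w)] = set(common)
            for v in common:
                for x in (u, w):
                    self.edge_blooms[(x, v)].add((u, w))
                    self.support[(x, v)] += len(common) - 1

    def delete_edge(self, u, v):
        """Remove edge (u, v), touching only edges that share a bloom with it."""
        for key in self.edge_blooms.pop((u, v), set()):
            common = self.blooms[key]
            twin = key[0] if key[1] == u else key[1]
            k = len(common)
            common.discard(v)
            # (twin, v) loses all k - 1 butterflies it had in this bloom.
            self.support[(twin, v)] -= k - 1
            self.edge_blooms[(twin, v)].discard(key)
            # Every remaining edge of the bloom loses exactly one butterfly.
            for x in common:
                self.support[(key[0], x)] -= 1
                self.support[(key[1], x)] -= 1
        del self.support[(u, v)]
```

After a deletion, the supports held in the index should match those obtained by rebuilding the index on the remaining edges, while only the edges that share a bloom with the deleted edge are visited.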
To address the second challenge, we introduce space-efficient 2D-indexes including I_{α,β}, I_{β,τ}, and I_{α,τ}, and train a feed-forward neural network to predict the most promising index to handle an (α,β)_τ-core query. Instead of indexing all (α,β)_τ-cores, the 2D-indexes I_{α,β}, I_{β,τ}, and I_{α,τ} store the vertex sets of all (α,β)-cores, (1,β)_τ-cores, and (α,1)_τ-cores, respectively. These 2D-indexes are much smaller in size and require significantly less build time, and each of them can be used to handle (α,β)_τ-core queries. For example, to compute the (α,β)_τ-core using I_{α,β}, we fetch the vertices in the (α,β)-core and recover the edges of the (α,β)-core. Then, we iteratively remove the vertices not having enough engagement from the (α,β)-core until we find the (α,β)_τ-core. However, the query processing performance based on each 2D-index is highly sensitive to the parameters α, β, and τ. This is because the 2D-indexes only store the vertices in the (α,β)-core, (1,β)_τ-core, and (α,1)_τ-core, and the size difference between the (α,β)_τ-core and each of these subgraphs is uncertain. We also observe no simple rules to partition the parameter space so that queries from each partition can be efficiently handled by one type of index. This motivates us to resort to machine learning techniques and train a feed-forward neural network as the classifier to predict which index to use for each incoming (α,β)_τ-core query. Since we aim to minimize the query time rather than maximize accuracy, we propose a scoring function, time-sensitive-error, to tune the hyper-parameters of the classifier. The experimental results show that the resulting hybrid computation algorithm significantly outperforms the query processing algorithms based on I_{α,β}, I_{β,τ}, and I_{α,τ} alone, and it is less sensitive to varying parameters.
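As an illustration of what a 2D-index entry could contain, the sketch below computes the vertex set of a plain (α,β)-core by standard degree peeling. It assumes, for the sake of the example, that one of the 2D-indexes holds such vertex sets; the function name `alpha_beta_core` is hypothetical and the code is not taken from the paper. Because a vertex cannot have more strong ties than neighbors, the strengthened core for the same α and β is contained in this set, so a query can start peeling from it instead of from the whole graph.

```python
from collections import defaultdict


def alpha_beta_core(edges, alpha, beta):
    """Return (upper, lower) vertex sets of the (alpha, beta)-core."""
    up, low = defaultdict(set), defaultdict(set)
    for u, v in edges:
        up[u].add(v)
        low[v].add(u)
    up, low = dict(up), dict(low)  # plain dicts: no accidental re-creation

    # Worklist of vertices whose degree is already below the threshold.
    stack = [("U", u) for u in up if len(up[u]) < alpha]
    stack += [("L", v) for v in low if len(low[v]) < beta]
    scheduled = set(stack)
    while stack:
        side, x = stack.pop()
        if side == "U":
            mine, other, limit, tag = up, low, beta, "L"
        else:
            mine, other, limit, tag = low, up, alpha, "U"
        # Delete x and lower the degree of each of its remaining neighbours.
        for y in mine.pop(x):
            other[y].discard(x)
            if len(other[y]) < limit and (tag, y) not in scheduled:
                scheduled.add((tag, y))
                stack.append((tag, y))
    return set(up), set(low)
```

The gap between this vertex set and the final answer depends on τ, which is one reason the query cost of a single 2D-index varies so much across parameter choices.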
Contribution. Our major contributions are summarized here:
- We propose the τ-strengthened (α,β)-core, the first cohesive subgraph model on bipartite graphs that considers both tie strength and vertex engagement. The flexibility of our model allows it to capture unique and useful structures on bipartite graphs.
- We construct the index I_{α,β,τ} to support optimal retrieval of the vertex set of any (α,β)_τ-core. We also devise computation-sharing and BE-Index-based optimizations to reduce its construction time effectively.
- We build 2D-indexes that are more space-efficient and require significantly less build time. We propose a learning-based hybrid computation paradigm to predict which index to choose to minimize the response time for an incoming (α,β)_τ-core query.
- We validate the efficiency of the proposed algorithms and the effectiveness of our model through extensive experiments on real-world datasets. Results show that the 2D-indexes are scalable, and the hybrid computation algorithm with a well-trained neural network can outperform the algorithms based on each 2D-index alone.
Organization. The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 summarizes important notations and definitions and introduces the (α,β)-core and the τ-strengthened (α,β)-core. Section 4 presents the online computation algorithm. Sections 5 and 6 present the total index and the optimizations of the index construction process. Section 7 presents the learning-based hybrid computation paradigm. Section 8 shows the experimental results, and Section 9 concludes the paper.
Related work
In the literature, there are many recent studies on cohesive subgraph models on both unipartite graphs and bipartite graphs.
Unipartite graphs. k-core [11], [16], [17], [18] and k-truss [12], [19], [20] are two of the most well-known cohesive subgraph models on general, unipartite graphs. On a unipartite graph, k-core is the maximal subgraph such that each vertex in the subgraph has at least k neighbors. k-core models vertex engagement as degrees and assumes the importance of each tie to be
Problem definition
In this section, we formally define our cohesive subgraph model, the τ-strengthened (α,β)-core. We consider an unweighted, undirected bipartite graph G(V, E). V(G) = (U(G), L(G)) denotes the set of vertices in G, where U(G) and L(G) represent the upper and lower layer, respectively. E(G) denotes the set of edges in G. We use n = |V(G)| to denote the number of vertices and m = |E(G)| to denote the number of edges. The maximum degree in the upper and lower layer is denoted as and ,
The online computation algorithm
Given engagement constraints α, β and a strength level τ, the online algorithm to compute the (α,β)_τ-core is outlined in Algorithm 1. First, we compute the support of each edge e using the algorithm in [34] and count how many strong ties each vertex u has. Once the strong ties are identified, the upper vertices with fewer than α strong ties and the lower vertices with fewer than β strong ties are the weakly-engaged vertices. Then, Algorithm 2 is invoked to iteratively remove these weakly
The decomposition based total index
Given α, β, and τ, Algorithm 1 computes the (α,β)_τ-core from the input graph, which is slow and cannot handle a large number of queries. In this section, we present a decomposition algorithm that retrieves all (α,β)_τ-cores, and we build a total index I_{α,β,τ} based on the decomposition output to support efficient query processing.
Algorithm 3: Decomposition
The decomposition algorithm. The following lemma is immediate based on Definition 5, which depicts the nested relationships among (α,β)_τ-cores. Lemma 2
Optimizations of index construction
The above decomposition algorithm has the following issues: (1) The same subgraph can be computed repeatedly for different and values. For example, if -core is the same subgraph as -core, then we will compute it twice when =1 and =2. (2) When removing an edge e, we need to enumerate all the butterflies containing e. The basic implementation of butterfly enumeration is inefficient: it finds three connected vertices first and then checks whether a fourth vertex can form a butterfly with
A learning-based hybrid computation paradigm
Although the index I_{α,β,τ} supports the optimal retrieval of the vertices in the queried (α,β)_τ-core, it does not scale well to large graphs due to its long build time and large space complexity, even with the related optimizations. For instance, on the datasets Team, Wiki-en, Amazon, and DBLP, the index I_{α,β,τ} cannot be built within two hours, as evaluated in our experiments. In this section, we present 2D-indexes that selectively store the vertices of the (α,β)_τ-core for some combinations of α, β, and τ.
Experiments
In this section, we first validate the effectiveness of the τ-strengthened (α,β)-core model. Then, we evaluate the performance of the index construction algorithms as well as the query processing algorithms.
Conclusion
In this paper, we introduce a novel cohesive subgraph model, the τ-strengthened (α,β)-core, which is the first to consider both tie strength and vertex engagement on bipartite graphs. We propose a decomposition-based index I_{α,β,τ} that can retrieve the vertices of any (α,β)_τ-core in optimal time. We also apply computation-sharing and BE-Index-based optimizations to speed up the index construction process of I_{α,β,τ}. To balance space-efficient index construction and time-efficient query processing, we
CRediT authorship contribution statement
Yizhang He: Writing - original draft, Methodology, Software. Kai Wang: Conceptualization, Methodology, Investigation. Wenjie Zhang: Conceptualization, Methodology, Writing - original draft. Xuemin Lin: Supervision. Ying Zhang: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Xuemin Lin is supported by the National Key R&D Program of China under grant 2018AAA0102502 and ARC DP200101338. Wenjie Zhang is supported by ARC DP210101393 and ARC DP200101116. Ying Zhang is supported by FT170100128 and ARC DP180103096.
References (39)
- Interlocking directorates in Canada: evidence from replacement patterns, Social Networks (1982)
- Network structure and minimum degree, Social Networks (1983)
- et al., A fast order-based approach for core maintenance
- et al., Generalized two-mode cores, Social Networks (2015)
- et al., Unifying user-based and item-based collaborative filtering approaches by similarity fusion
- M. Ley, The DBLP computer science bibliography: Evolution, research issues, perspectives, in: Proc. Int. Symposium on...
- J.C. Brunson, Triadic analysis of affiliation networks, arXiv preprint...
- et al., Collusion detection in online rating systems
- et al., Copycatch: stopping group attacks by spotting lockstep behavior in social networks
- et al., Efficient (α,β)-core computation in bipartite graphs, VLDB J. (2020)
- Efficient fault-tolerant group recommendation using alpha-beta-core
- Fast group recommendations by applying user clustering
- Interlocking directorates and intercorporate coordination, Social Networks: Critical Concepts Sociol.
- Trusses: Cohesive subgraphs for social network analysis, National Security Agency Tech. Rep.
- Peeling bipartite networks for dense subgraph discovery
- Bitruss decomposition of bipartite graphs
- Efficient bitruss decomposition for large-scale bipartite graphs
- Efficient core decomposition in massive networks
- K-core decomposition of large networks on a single PC, Proc. VLDB Endowment
1. This manuscript is the authors’ original work and has not been published nor has it been submitted simultaneously elsewhere.
2. All authors have checked the manuscript and have agreed to the submission.