Compact structure for sparse undirected graphs based on a clique graph partition

doi:10.1016/j.ins.2020.09.010

Information Sciences

Volume 544, 12 January 2021, Pages 485-499

https://doi.org/10.1016/j.ins.2020.09.010

Highlights

• A new compression method for sparse and clustered undirected graphs is presented.
• The representation is based on maximal cliques and compact data structures.
• It supports efficient neighborhood queries for any vertex in the graph.
• The compression rates and execution times are competitive with the state of the art.
• All maximal cliques can be recovered 20 times faster than from the original graph.

Abstract

Compressing real-world graphs has many benefits such as improving or enabling the visualization in small memory devices, graph query processing, community search, and mining algorithms. This work proposes a novel compact representation for real sparse and clustered undirected graphs. The approach lists all the maximal cliques by using a fast algorithm and defines a clique graph based on its maximal cliques. Further, the method defines a fast and effective heuristic for finding a clique graph partition that avoids the construction of the clique graph. Finally, this partition is used to define a compact representation of the input graph. The experimental evaluation shows that this approach is competitive with the state-of-the-art methods in terms of compression efficiency and access times for neighbor queries, and that it recovers all the maximal cliques faster than using the original graph. Moreover, the approach makes it possible to query maximal cliques, which is useful for community detection.

Introduction

A wide variety of real systems are modeled by graphs, including communication, transit, web, social, and biological networks [1], [2]. The process of discovering relevant information from graphs is called graph mining [3]. This is usually a time-consuming task, especially given the current growth in data size [4]. The main challenges stem from the data volume itself, data complexity (i.e., many relationships among the data), and application needs [4]. Several schemes have been proposed for analyzing graphs that aim at understanding the properties and patterns found in them to serve different application purposes. Some known applications include disease analysis [5], community discovery [1], [2], [6], recommender systems [7], graph compression [8], [9], [10], measuring relevance of network actors [11], [12], and network visualization [13], [14]. Recent works on graph mining postulate that dense patterns are prominent and describe different dense substructures. Some examples include maximal cliques [15], [16], communities [17], and others [3], [9], [18], [19]. These substructures have been used for improving network analysis, graph compression [9], [20], and visualization [13].

Given the space required to store and analyze large graphs, the research community has proposed graph compression formats that support basic navigation queries directly over the compressed structure without requiring decompression. This approach enables the simulation of any graph algorithm in main memory, requiring less space than plain representations. Even though these compressed structures are usually slower than uncompressed representations, they are still attractive in devices with limited memory, such as tablets or cell phones. Moreover, these in-memory representations can provide faster access than plain representations that incur I/O costs [21], [22].

Although there are different types of real-world graphs of interest, this work aims at processing highly clustered and sparse graphs. Clustered graphs contain vertices grouped in highly connected subgraphs. These graphs have a high clustering coefficient and high transitivity [23]. In practice, many real-world graphs are sparse, for example graphs with low degeneracy [16].
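As a rough illustration of the properties named above, the sketch below computes transitivity (three times the number of triangles divided by the number of connected triples) and degeneracy (the largest minimum degree seen while repeatedly removing a minimum-degree vertex) for a tiny graph. The dict-of-sets layout and the function names are assumptions made for this example, not part of the proposed method.

```python
from itertools import combinations


def transitivity(adj):
    """Return 3 * triangles / connected triples for an undirected graph."""
    closed = 0   # each triangle is counted once per corner, i.e. 3 times
    triples = 0  # paths of length two, counted at their middle vertex
    for v, nbrs in adj.items():
        d = len(nbrs)
        triples += d * (d - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def degeneracy(adj):
    """Peel minimum-degree vertices; return the largest degree seen at removal."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    k = 0
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        k = max(k, len(remaining[v]))
        for u in remaining[v]:
            remaining[u].discard(v)
        del remaining[v]
    return k


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

The peeling loop here is quadratic in the number of vertices, which is fine for a toy graph but not how one would handle large inputs.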

This work proposes a compact data structure for clustered sparse undirected graphs that exploits the cliques to represent the edges implicitly. Further, it makes use of the vertex redundancy of the cliques by partitioning them into components that share many vertices. This structure enables neighbor queries, as well as queries for recovering all or subsets of the maximal cliques. Finding maximal cliques is an important step in the clique percolation method (CPM). CPM has been successfully used for community searches in biological networks [1], social group evolution [24], human disease pattern discovery [5], and computing and visualizing topological features using persistent homology in network analysis [14].
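To see how cliques can represent edges implicitly, consider the following minimal sketch: if every edge lies in at least one stored clique, the neighbors of a vertex are the union of the cliques that contain it, minus the vertex itself. The plain lists and the inverted index used here are assumptions for illustration only; they are not the compact encoding the paper proposes.

```python
from collections import defaultdict


def build_index(cliques):
    """Map each vertex to the ids of the cliques that contain it."""
    index = defaultdict(list)
    for cid, clique in enumerate(cliques):
        for v in clique:
            index[v].append(cid)
    return index


def neighbors(v, cliques, index):
    """Union of all cliques containing v, with v itself removed."""
    result = set()
    for cid in index.get(v, []):
        result.update(cliques[cid])
    result.discard(v)
    return result


# Hypothetical clique cover of a triangle 0-1-2 plus the edge 2-3.
cliques = [{0, 1, 2}, {2, 3}]
index = build_index(cliques)
```

No edge is stored explicitly: the edge (0, 1), for example, exists only because both endpoints appear in the same clique.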

The structure is built on a partition of the clique graph, where each node is a maximal clique in the original graph. The proposed method uses a fast algorithm for listing all maximal cliques and defines an effective heuristic for finding a clique graph partition avoiding the construction of the clique graph. From this, a compact representation of the partitioned clique graph is proposed.
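This excerpt does not say which clique-listing algorithm is used, so the sketch below substitutes the classic Bron–Kerbosch procedure with pivoting as a stand-in. It assumes the graph is given as a dict mapping each vertex to a set of neighbors; the function name is hypothetical.

```python
def maximal_cliques(adj):
    """List all maximal cliques with Bron-Kerbosch and a max-degree pivot."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend r, x: already explored.
        if not p and not x:
            found.append(frozenset(r))
            return
        # Pick the pivot with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return found


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

An isolated vertex would be reported as a clique of size one, so a caller interested only in cliques with at least two vertices would need to filter the result.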

The experimental evaluation shows that the compressed graph representation is competitive with the state-of-the-art methods in terms of compression efficiency for large real graphs, obtaining the smallest representation for clustered graphs. This high compression is achieved, in some cases, at the expense of slower access times when answering neighbor queries. As discussed, in a context of limited memory or steep memory hierarchies, using less space can be of special interest. This may allow the representation to fit into faster memory levels and, in the case of larger datasets, prevents it from being handled on slower ones, such as disks [21], [22]. In addition, to the best of our knowledge, the structure presented in this study is the first proposal that enables maximal clique queries besides neighbor queries. This is an important operation for applications that use clique communities. Furthermore, retrieving maximal cliques from the compressed representation is much faster than listing them from the original graph.

The implementation of the proposed method is available at http://www.inf.udec.cl/~chernand/sources/cliquecomp/cliquecomp.tgz.

Section snippets

Related work

Boldi and Vigna [25] proposed in 2004 one of the best-known techniques for web graph compression, which offered the best space/time trade-off for many years. They presented the WebGraph framework, which obtains very compact representations of web graphs by exploiting their regularities and statistical properties. More concretely, they exploit the locality of reference, since web pages generally include links to other web pages of the same domain. They also exploited the similarity of the

Proposed method

This section describes a new method for compressing real sparse undirected graphs using a compact data structure that takes advantage of the vertex redundancy of the graph represented by its maximal cliques. In this method, vertex redundancy refers to vertices that belong to multiple maximal cliques. Such vertices can be stored only once to reduce space.
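The notion of vertex redundancy can be made concrete with a small count. The sketch below tallies how many cliques each vertex belongs to and how many vertex occurrences would be saved if every vertex were stored only once. The function names and the example cliques are assumptions made for illustration.

```python
from collections import Counter


def clique_membership(cliques):
    """Count, for each vertex, the number of cliques that contain it."""
    return Counter(v for clique in cliques for v in clique)


def redundant_occurrences(cliques):
    """Vertex occurrences beyond the first, summed over all vertices."""
    counts = clique_membership(cliques)
    return sum(counts.values()) - len(counts)


# Hypothetical overlapping cliques: 9 vertex occurrences over 5 distinct vertices.
cliques = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
```

Here vertex 2 appears in all three cliques, so two of its three occurrences are redundant in this sense.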

The proposed compression method includes three steps. The first step (clique listing) lists all the maximal cliques of size at least two in the

Query algorithms

This section describes how the main queries are solved using the compact data structure. Algorithm 2 displays a sequential algorithm that retrieves the input graph G in a single pass. The time complexity of the sequential algorithm is $O\left(\sum_{p=1}^{M} |X_{p}|^{2} \cdot (1 + \mathit{bpu}_{p})\right)$.

The algorithm goes through each partition p of the compact representation, retrieves all of the edges in that partition and adds all those edges to build E. If a partition $X_{p}$ contains only one clique, then all the possible edges are
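The single pass over the partitions can be sketched as follows. This is not Algorithm 2 itself: the excerpt does not show how partitions are encoded, so the sketch assumes each partition is simply a list of cliques held as Python sets, and it recovers the edge set by emitting every vertex pair inside every clique.

```python
from itertools import combinations


def rebuild_edges(partitions):
    """Visit each partition once and collect the edges of all its cliques."""
    edges = set()
    for partition in partitions:
        for clique in partition:
            # Every pair of vertices inside a clique is an edge of the graph.
            for u, v in combinations(sorted(clique), 2):
                edges.add((u, v))
    return edges


# Hypothetical input: two partitions holding three cliques in total.
partitions = [[{0, 1, 2}, {1, 2, 3}], [{3, 4}]]
```

Edges shared by overlapping cliques, such as (1, 2) in the first partition, are generated more than once and deduplicated by the set.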

Experimental evaluation

This section describes several experiments to tune and compare our method with the state-of-the-art algorithms for compressing graphs, including version 3.6.1 of WebGraph (WG) [33], the graph compression by BFS from Apostolico and Drovandi (AD) [32], and the $k^{2}$-tree [20]. The results of the compression efficiency reported by Rossi and Zhou for GraphZIP [36] are also included, although they do not support query operations. All of the experiments ran on a machine with an Intel i7-7500U CPU @

Conclusions

This work introduces a new compact representation of real sparse and clustered undirected graphs based on clique graph partitioning. The method first lists all the maximal cliques of the input graph. Then, it defines a clique graph, whose vertices are the cliques in the original graph. Next, it finds a partition of the clique graph, which is finally encoded in a compressed form using compact data structures.
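For readers unfamiliar with clique graphs, the sketch below builds one under the classical definition, in which two cliques are adjacent when they share at least one vertex. That adjacency rule is an assumption here, since this excerpt does not spell out the exact definition the method uses, and the function name is hypothetical.

```python
from itertools import combinations


def clique_graph(cliques):
    """Build a graph whose nodes are clique ids; intersecting cliques are adjacent."""
    adjacency = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if cliques[i] & cliques[j]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Hypothetical maximal cliques: the first two share vertex 2, the third is disjoint.
cliques = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({4, 5})]
```

This naive construction compares every pair of cliques, so its cost grows quadratically with the number of cliques.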

Our method includes an effective heuristic to find a partition in the clique graph, by

CRediT authorship contribution statement

Felipe Glaria: Conceptualization, Writing - original draft, Software, Visualization. Cecilia Hernández: Conceptualization, Formal analysis, Writing - original draft, Software, Visualization, Writing - review & editing. Susana Ladra: Conceptualization, Formal analysis, Writing - original draft, Writing - review & editing. Gonzalo Navarro: Conceptualization, Formal analysis, Writing - original draft, Writing - review & editing. Lilian Salinas: Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie [grant agreement No 690941]; from the Ministerio de Economía y Competitividad (PGE and ERDF) [Grant No. TIN2016-77158-C4-3-R]; from Xunta de Galicia (co-funded with ERDF) [Grant Nos. ED431C 2017/58; ED431G 2019/01]; from the Center for Biotechnology and Bioengineering (CeBiB), Chile; and from the Millennium Institute for Foundational Research on Data

References (46)

Ehsan Pournoor et al.
Disease global behavior: A systematic study of the human interactome network reveals conserved topological features among categories of diseases
Inform. Med. Unlocked
(2019)
Bin Zhou
Applying the clique percolation method to analyzing cross-market branch banking network structure: the case of Illinois
Social Network Anal. Mining
(2016)
Zhiyuan Liu et al.
A divide and agglomerate algorithm for community detection in social networks
Inf. Sci.
(2019)
Nieves R Brisaboa et al.
Compact representation of web graphs with extended functionality
Inform. Syst.
(2014)
A. Broder et al.
Graph structure in the Web
Comput. Netw.
(2000)
Ronald C. Hamelink
A partial characterization of clique graphs
J. Comb. Theory
(1968)
Fred S. Roberts et al.
A characterization of clique graphs
J. Comb. Theory, Series B
(1971)
Francisco Claude et al.
The wavelet matrix: An efficient wavelet tree for large alphabets
Inform. Syst.
(2015)
Gergely Palla et al.
Uncovering the overlapping community structure of complex networks in nature and society
Nature
(2005)
Jianxin Li, Xinjue Wang, Ke Deng, Xiaochun Yang, Timos Sellis, and Jeffrey Xu Yu. Most influential community search...
Chuntao Jiang et al.
A survey of frequent subgraph mining algorithms
Knowl. Eng. Rev.
(2013)
Bin Shao et al.
Managing and mining large graphs: systems and implementations
Lidia Fotia. Recommending items in social networks using cliques-based trust. In WOA, pages 51–56,...
G. Buehrer et al.
A scalable pattern mining approach to Web graph compression with communities
Cecilia Hernández et al.
Compressed representations for web and social graphs
Knowl. Inf. Syst.
(2014)
Natalie Stanley et al.
Compressing networks with super nodes
Sci. Rep.
(2018)
Øivind Wang, Nicolai Bodd, Chen Xing, Bård Kvalheim, and Torbjørn Helvik. Enterprise graph search based on object and...
Zhipeng Huang et al.
Meta structure: Computing relevance in large heterogeneous information networks
Ryan A. Rossi et al.
The network data repository with interactive graph analytics and visualization
Bastian Rieck et al.
Clique community persistence: A topological visual analysis approach for complex networks
IEEE Trans. Visualization Computer Graphics
(2017)
Kazuhisa Makino et al.
New algorithms for enumerating all maximal cliques
David Eppstein, Maarten Löffler, and Darren Strash. Listing all maximal cliques in large sparse real-world graphs. ACM...
Charalampos Tsourakakis
The k-clique densest subgraph problem

