Granite: A distributed engine for scalable path queries over temporal property graphs

doi:10.1016/j.jpdc.2021.02.004

Journal of Parallel and Distributed Computing

Volume 151, May 2021, Pages 94-111

https://doi.org/10.1016/j.jpdc.2021.02.004

Highlights

• Proposes a linear path query model over temporal property graphs.
• Provides a distributed execution model which can scale to large graphs.
• Includes a novel cost model to choose the optimum execution plan.
• Evaluates our distributed engine for diverse query workloads over large property graphs on a commodity cluster.

Abstract

Property graphs are a common form of linked data, with path queries used to traverse and explore them for enterprise transactions and mining. Temporal property graphs are a recent variant where time is a first-class entity to be queried over, and their properties and structure vary over time. These are seen in social, telecom, transit and epidemic networks. However, current graph databases and query engines have limited support for temporal relations among graph entities, no support for time-varying entities and/or do not scale on distributed resources. We address this gap by extending a linear path query model over property graphs to include intuitive temporal predicates and aggregation operators over temporal graphs. We design a distributed execution model for these temporal path queries using the interval-centric computing model, and develop a novel cost model to select an efficient execution plan from several. We perform detailed experiments of our Granite distributed query engine using both static and dynamic temporal property graphs as large as 52M vertices, 218M edges and 325M properties, and a 1600-query workload, derived from the LDBC benchmark. We frequently offer sub-second query latencies on a commodity cluster, which is $149\times$–$1140\times$ faster compared to the industry-leading Neo4J shared-memory graph database and the JanusGraph/Spark distributed graph query engine. Granite also completes 100% of the queries for all graphs, compared to only 32–92% workload completion by the baseline systems. Further, our cost model selects a query plan that is within 10% of the optimal execution time in 90% of the cases. Despite the irregular nature of graph processing, we exhibit a weak-scaling efficiency of $\geq 60\%$ on 8 nodes and $\geq 40\%$ on 16 nodes, for most query workloads.

Introduction

Graphs are a natural model to represent and analyze linked data in various domains. Property graphs allow vertices and edges to have associated key–value pair properties, besides the graph structure. This forms a rich information schema and has been used to capture knowledge graphs (concepts, relations) [32], social networks (person, forum, message) [8], epidemic networks (subject, infected status, location) [29], and financial and retail transactions (person, product, purchase) [22].

Path queries are a common class of queries over property graphs [14], [44]. Here, the user defines a sequence of predicates over vertices and edges that should match along a path in the graph. E.g., in the property graph for a community of users in Fig. 1, the vertices are labeled with their IDs, their colors indicate their type – blue for Person and orange for a Post, and they have a set of properties listed as Name:Value. The edges are relationships, with types such as Follows, Likes and Created. We can define an example 3-hop path query “[EQ1] Find a person (vertex type) who lives in the country ‘UK’ (vertex property) and follows (edge type) a person who follows another person who is tagged with the label ‘Hiking’ (vertex property)”. This query would match Cleo $\to$ Alice $\to$ Bob, if we ignore the time intervals. Path queries are used to identify concept pathways in knowledge graphs, find friends in social networks, detect fake news, and suggest products in retail websites [14], [24], [44]. They also need to be performed rapidly, within $\approx 1$ s, as part of interactive requests from websites or exploratory queries by analysts.
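As a rough illustration of a path query expressed as a sequence of vertex and edge predicates, the sketch below evaluates a query shaped like EQ1 over a toy in-memory graph while ignoring time. The graph contents, property values and the `match_path` helper are assumptions made for this example; they are not taken from Fig. 1 or from the Granite implementation.

```python
from typing import Callable, Dict, List, Tuple

# Toy property graph; all property values here are hypothetical.
vertices: Dict[str, dict] = {
    "Cleo": {"type": "Person", "Country": "UK"},
    "Alice": {"type": "Person", "Country": "US"},
    "Bob": {"type": "Person", "Tag": "Hiking"},
    "PicPost": {"type": "Post", "Tag": "Vacation"},
}
edges: List[Tuple[str, str, str]] = [
    ("Cleo", "Follows", "Alice"),
    ("Alice", "Follows", "Bob"),
    ("Bob", "Likes", "PicPost"),
]

VertexPred = Callable[[dict], bool]


def match_path(vertex_preds: List[VertexPred], edge_types: List[str]) -> List[List[str]]:
    """Return every path whose vertices and edges satisfy the predicates in order."""
    assert len(vertex_preds) == len(edge_types) + 1
    # Seed the search with all vertices that satisfy the first vertex predicate.
    paths = [[v] for v, props in vertices.items() if vertex_preds[0](props)]
    for hop, edge_type in enumerate(edge_types):
        extended = []
        for path in paths:
            for src, etype, dst in edges:
                # Extend a partial path along an edge of the required type
                # whose destination satisfies the next vertex predicate.
                if src == path[-1] and etype == edge_type and vertex_preds[hop + 1](vertices[dst]):
                    extended.append(path + [dst])
        paths = extended
    return paths


def is_person(props: dict) -> bool:
    return props["type"] == "Person"


eq1 = match_path(
    [
        lambda p: is_person(p) and p.get("Country") == "UK",
        is_person,
        lambda p: is_person(p) and p.get("Tag") == "Hiking",
    ],
    ["Follows", "Follows"],
)
print(eq1)  # [['Cleo', 'Alice', 'Bob']]
```

This linear scan over all edges at every hop is only meant to show the query shape; an engine would use adjacency lists, indexes and distributed execution instead.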

While graph databases are designed for transactional read and write workloads, we consider graphs that are updated infrequently but queried often. For these workloads, graph query engines load and retain property graphs in-memory to service requests with low latency, without the need for locking or consistency protocols [7], [42]. They may also create indexes to accelerate these searches [31], [50]. Property graphs can be large, with $10^{5}$–$10^{8}$ vertices and edges, and tens of properties on each vertex or edge. This can exceed the memory on a single machine, often caused by the properties. This necessitates the use of distributed systems to scale to large graphs [27], [41].

Time is an increasingly common graph feature in a variety of domains [16], [18], [29], [53]. However, existing property graph data models fail to consider it as a first-class entity. Here, we distinguish between graphs with a time interval or a lifespan associated with their entities (properties, vertices, edges), and those where the entities themselves change over time and the history is available. We call the former static temporal graphs and the latter dynamic temporal graphs. Yet another class is streaming graphs, where the topology and properties change in real-time, and queries are performed on this evolving structure [13], [48]; that is outside the scope of this article.

E.g., in the temporal graph in Fig. 1, the lifespan, [start, end), is indicated on the vertices, edges and properties. The start time is inclusive while the end time is exclusive. Other than the properties of Cleo, the remaining entities of the graph form a static temporal graph as they are each valid only for a single time range. But the value of the Country property of Cleo changes over time, making it a dynamic temporal graph.
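The half-open lifespan convention and a time-varying property can be sketched as follows. The `Interval` class, the `value_at` helper and the concrete time-points and country values are assumptions chosen for illustration; the text only states that the Country property of Cleo changes over time.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """Half-open lifespan [start, end): start is inclusive, end is exclusive."""
    start: int
    end: int

    def contains(self, t: int) -> bool:
        return self.start <= t < self.end

    def overlaps(self, other: "Interval") -> bool:
        # Two half-open intervals overlap only if each starts before the other ends.
        return self.start < other.end and other.start < self.end


# A dynamic property is modeled here as a list of (lifespan, value) pairs.
History = List[Tuple[Interval, str]]


def value_at(history: History, t: int) -> Optional[str]:
    """Return the property value valid at time-point t, or None if there is none."""
    for span, value in history:
        if span.contains(t):
            return value
    return None


# Hypothetical history for the Country property of Cleo.
country_of_cleo: History = [(Interval(0, 40), "UK"), (Interval(40, 100), "US")]

print(value_at(country_of_cleo, 39))   # UK
print(value_at(country_of_cleo, 40))   # US
print(value_at(country_of_cleo, 100))  # None
print(Interval(0, 40).overlaps(Interval(40, 100)))  # False
```

Because the end time is exclusive, adjacent lifespans such as [0, 40) and [40, 100) share no time-point, so a property has at most one value at any instant.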

This gap is reflected not just in the data model but also in the queries supported. We make a distinction between time-independent (TI) and time-dependent (TD) queries, both being defined on a temporal graph [47]. TI queries are those which can be answered by examining the graph at a single point in time (a snapshot), e.g. EQ1 executed on the temporal graph. In contrast, TD queries capture temporal relations between the entities across consecutive time intervals, e.g., “[EQ2] Find people tagged with ‘Hiking’ who liked a post tagged as ‘Vacation’, before the post was liked by a person named ‘Don’”, and “[EQ3] Find people who started to follow another person, after the latter stops following ‘Don’”. Treating time as just another property fails to express temporal relations such as ensuring time-ordering among the entities on the path. While EQ2 and EQ3 should match the paths Bob $\to$ PicPost $\to$ Don and Alice $\to$ Bob $\to$ Don, respectively, such queries are hard, if not impossible, to express in current graph databases. This problem is exacerbated for path queries over dynamic temporal graphs. E.g., the query EQ1 over the dynamic temporal graph should not match Cleo $\to$ Alice $\to$ Bob since at the time Cleo was living in ‘UK’, she was not following Alice.
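The two temporal checks discussed above can be sketched with simple interval comparisons. This is one possible reading of "before" and of joint validity, not the predicate semantics defined by the paper; the helper names and every lifespan below are hypothetical.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    start: int  # inclusive
    end: int    # exclusive


def starts_before(a: Interval, b: Interval) -> bool:
    """One possible ordering predicate: a begins strictly earlier than b."""
    return a.start < b.start


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the common sub-interval of a and b, or None if they are disjoint."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Interval(start, end) if start < end else None


# EQ2-style check: Bob liked the post before Don did (hypothetical lifespans).
bob_likes_post = Interval(10, 50)
don_likes_post = Interval(30, 50)
eq2_ok = starts_before(bob_likes_post, don_likes_post)

# EQ1 on a dynamic graph: the property value and the edge must be valid together.
cleo_country_uk = Interval(0, 40)
cleo_follows_alice = Interval(60, 100)
eq1_valid = intersect(cleo_country_uk, cleo_follows_alice)

print(eq2_ok)     # True
print(eq1_valid)  # None, so Cleo -> Alice is rejected for EQ1
```

A query that treated time as an ordinary property would have to spell out these comparisons for every pair of entities on the path, which is the verbosity that dedicated temporal predicates aim to avoid.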

While platforms which process a snapshot at a time [30], [47] can be adapted to support TI queries over temporal graphs, TD queries cannot be expressed meaningfully. Even those that support TD algorithms enforce strict temporal ordering [15], requiring that the time intervals along the path should be increasing or decreasing, but not both; this limits query expressivity. These motivate the need to support intuitive temporal predicates to concisely express such temporal relations, and flexible platforms to execute them. Lastly, the scalability of existing graph systems is also limited, with few property graph query engines that operate on distributed memory systems with low latency [42], [52], let alone on temporal property graphs.

We make the following specific contributions in this article:

• We propose a temporal property graph model, and intuitive temporal predicates and aggregation operators for path queries on them (Section 3).
• We design a distributed execution model for these queries using the interval-centric computing model (Section 4).
• We develop a novel cost model that uses graph statistics to select the best from multiple execution plans (Section 5).
• We conduct a detailed evaluation of the performance and scalability of Granite for 8 temporal graphs and up to 1600 queries, derived from the LDBC benchmark. We compare this against three configurations of Neo4J, and JanusGraph which uses Apache Spark (Section 6).

We discuss related work in Section 2 and our conclusions in Section 7.

A prior version of this work appeared as a conference paper [36]. This article substantially extends this. Specifically, it introduces the temporal aggregation operator to the query model (Section 3.3) and implements it within the execution model; offers details, illustrations and complexity metrics for our query model, distributed execution model and query optimizations (Sections 3, 4 and 5); and provides a rigorous empirical evaluation, including two additional large dynamic temporal graphs, aggregation query workloads, weak scaling experiments, and results on the component times of query execution, besides more detailed analysis for the cost model benefits and baseline platform comparisons (Section 6).

Section snippets

Distributed and temporal graph processing

There are several distributed graph processing platforms for running graph algorithms on commodity clusters and clouds [20]. These typically offer programming abstractions like Google Pregel’s vertex-centric computing model [30] and its component-centric variants [17], [45] to design algorithms such as Breadth First Search, centrality scores and mining [10]. These execute using a Bulk Synchronous Parallel (BSP) model, and scale to large graphs and applications that explore the entire graph.
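To make the vertex-centric BSP style concrete, the following single-process sketch runs Breadth First Search in supersteps: each active vertex consumes its incoming messages, updates its state and sends messages to its neighbors, and a barrier separates supersteps. The function name and the message format are assumptions for this example and do not reflect the API of Pregel or of any specific platform.

```python
from typing import Dict, List


def bsp_bfs(adj: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Compute hop distances from source using superstep-style message passing."""
    dist: Dict[str, int] = {}
    inbox: Dict[str, List[int]] = {source: [0]}  # messages delivered this superstep
    while inbox:
        outbox: Dict[str, List[int]] = {}
        for vertex, messages in inbox.items():
            if vertex in dist:
                continue  # already visited: ignore messages and stay halted
            dist[vertex] = min(messages)
            for neighbor in adj.get(vertex, []):
                # Messages become visible to the neighbor only in the next superstep.
                outbox.setdefault(neighbor, []).append(dist[vertex] + 1)
        inbox = outbox  # barrier: all sends complete before the next superstep
    return dist


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bsp_bfs(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

In a distributed setting the vertices would be partitioned across workers and the inbox/outbox exchange would happen over the network, but the per-vertex logic keeps the same shape.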

Temporal concepts

The temporal property graph concepts used in this paper are drawn from our earlier work [15]. Time is a linearly ordered discrete domain $Ω$ whose range is the set of non-negative whole numbers. Each instant in this domain is called a time-point and an atomic increment in time is called a time-unit. A time interval is given by $τ = [t_{s}, t_{e})$ where $t_{s}, t_{e} \in Ω$, which indicates an interval starting from and including $t_{s}$ and extending to but excluding $t_{e}$. Interval relations [5] are Boolean comparators between

Relaxed interval centric computing

The high-level architecture of our distributed query engine, Granite, is shown in Fig. 2(a). Our query engine uses a distributed in-memory iterative execution model that extends and relaxes the Interval-centric Computing Model (ICM) [15]. ICM adds a temporal dimension to Pregel’s vertex-centric iterative computing model [30], and allows users to define their computation from the perspective of a single interval-vertex, i.e., the state and properties for a certain interval of a vertex’s

Query planning and optimization

A given path query can be executed using different distributed execution plans, each having a different execution time. The goal of the cost model is to quickly estimate the expected execution time of these plans and pick the optimal plan for execution. Rather than absolute accuracy of the query execution time, what matters is its ability to distinguish poor plans with high execution times from good plans with low execution times.
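The point that ranking matters more than absolute accuracy can be illustrated with a small sketch. The plan names and all cost figures below are invented for this example, and `pick_plan` is a hypothetical helper; the actual cost model is analytical and derived from graph statistics.

```python
from typing import Dict


def pick_plan(estimated_cost: Dict[str, float]) -> str:
    """Choose the plan with the lowest estimated cost."""
    return min(estimated_cost, key=estimated_cost.get)


# Hypothetical measured execution times in seconds for three candidate plans.
actual = {"plan_a": 0.8, "plan_b": 0.3, "plan_c": 2.5}

# Hypothetical estimates: far off in absolute terms, but the ordering is preserved.
estimated = {"plan_a": 4.0, "plan_b": 1.9, "plan_c": 11.0}

chosen = pick_plan(estimated)
# Ratio of the chosen plan's actual time to the best achievable actual time.
slowdown = actual[chosen] / min(actual.values())

print(chosen)    # plan_b
print(slowdown)  # 1.0
```

Even though every estimate is several times larger than the measured time, the chosen plan is still the fastest one, since only the relative order of the estimates drives the decision.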

We propose an analytical cost model that uses statistics about

Workload

We use the social network benchmark from the Linked Data Benchmark Council (LDBC) [2] for our evaluation of Granite. It is a community-standard workload with realistic transactional path queries over a social network property graph. There are two parts to this benchmark, a social network graph generator and a suite of benchmark queries.

Conclusions

In this article, we have motivated the need for querying over large temporal property graphs and the lack of such platforms. We have proposed an intuitive temporal path query model to express a wide variety of requirements over such graphs, and designed the Granite distributed engine to implement these at scale over the Graphite ICM platform. Our novel analytical cost model uses concise information about the graph to allow accurate selection of a distributed query execution plan from several

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

S. Ramesh was supported by the Maersk CDS M.Tech. Fellowship, India. Y. Simmhan was supported by the Swarna Jayanti Fellowship from DST, India under grant number SB/SJF/2019-20/02. The authors thank Ravishankar Joshi from BITS-Pilani, Goa for his assistance with the experiments.

Shriram Ramesh is a Graduate Student at the Department of Computational and Data Sciences at the Indian Institute of Science, Bangalore. He is supported by the Maersk CDS M.Tech. Fellowship. His research interests include graph processing, distributed systems and database systems. He was part of the team that won IEEE TCSC SCALE Challenge Award in 2019. He has a Bachelor’s Degree in Electrical and Electronics Engineering and two years of consulting experience in the domain of Business

References (54)

[1] SPARQL query language for RDF (2008)
[2] The LDBC Social Network Benchmark (Version 0.3.2), Technical Report (2019)
[3] Neo4J graph platform (2020)
[4] OrientDB graph database (2020)
[5] Allen J., Maintaining knowledge about temporal intervals, Commun. ACM (1983)
[6] Byun J. et al., ChronoGraph: Enabling temporal graph traversals for efficient information diffusion analysis over time, IEEE TKDE (2019)
[7] Castellana V.G. et al., In-memory graph databases for web-scale data, Computer (2015)
[8] M. Cha, H. Haddadi, F. Benevenuto, K.P. Gummadi, Measuring user influence in twitter: The million follower fallacy, in:...
[9] D. Chavarría-Miranda, V.G. Castellana, A. Morari, D. Haglin, J. Feo, Graql: A query language for high-performance...
[10] Chen H. et al., G-miner: An efficient task-oriented graph mining system
[11] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, E. Chen, Kineograph: taking the pulse...
[12] Dang H.-V. et al., A lightweight communication runtime for distributed graph analytics
[13] D. Ediger, J. Riedy, D.A. Bader, H. Meyerhenke, Tracking structure of streaming social networks, in: 2011 IEEE...
[14] W. Fan, Graph pattern matching revised for social network analysis, in: International Conference on Database Theory,...
[15] S. Gandhi, Y. Simmhan, An Interval-centric model for distributed computing over temporal graphs, in: IEEE International...
[16] George B. et al., Time-aggregated graphs for modeling spatio-temporal networks, J. Data Semant. XI (2008)
[17] J.E. Gonzalez, R.S. Xin, A. Dave, D. Crankshaw, M.J. Franklin, I. Stoica, Graphx: Graph processing in a distributed...
[18] Greene D. et al., Tracking the evolution of communities in dynamic social networks
[19] Gregor D. et al., The parallel BGL: A generic library for distributed graph computations
[20] Y. Guo, M. Biczak, A.L. Varbanescu, A. Iosup, C. Martella, T.L. Willke, How well do graph-processing platforms perform?...
[21] W. Han, et al. Chronos: a graph engine for temporal graph analysis, in: ACM EuroSys,...
[22] B. Haslhofer, R. Karl, E. Filtz, O Bitcoin where art thou? Insight into large-scale transaction graphs, in:...
[23] Huang J. et al., Scalable SPARQL querying of large RDF graphs, VLDB Endow. (2011)
[24] Huang Z. et al., A graph model for E-commerce recommender systems, J. Am. Soc. Inf. Sci. Technol. (2004)
[25] A.P. Iyer, Z. Liu, X. Jin, S. Venkataraman, V. Braverman, I. Stoica, ASAP: Fast, approximate graph pattern mining at...
[26] N. Jamadagni, Y. Simmhan, GoDB: From batch processing to distributed querying over property graphs, in: IEEE/ACM...
[27] Junghanns M. et al., Declarative and distributed graph analytics with GRADOOP, Proc. VLDB Endow. (2018)

Animesh Baranawal is a Graduate Research Student in the Department of Computational and Data Sciences at the Indian Institute of Science, Bangalore. His research interests include graph processing and distributed systems. He has a Bachelor’s Degree in Computer Science and Engineering and two years of industrial experience in Software Development.

Yogesh Simmhan is an Associate Professor at the Department of Computational and Data Sciences and a Swarna Jayanti Fellow at the Indian Institute of Science, Bangalore. His research explores abstractions, algorithms and applications on distributed systems. He has published over 100 peer-reviewed papers, and won the Best Paper Award at IEEE International Conference on Cloud Computing (CLOUD) 2019, the IEEE TCSC SCALE Challenge Award in 2019 and 2012, the Distinguished Paper Award at EuroPar 2018, and the IEEE/ACM Supercomputing HPC Storage Challenge Award in 2008. He is the recipient of the IEEE TCSC Award for Excellence in Scalable Computing (Mid Career Researcher) in 2020. He is an Associate Editor-in-Chief of the Journal of Parallel and Distributed Computing (JPDC), an Associate Editor of Future Generation Computer Systems (FGCS), and earlier served as an Associate Editor of IEEE Transactions on Cloud Computing and a member of the IEEE Future Directions Initiative on Big Data.

Yogesh has a Ph.D. in Computer Science from Indiana University, Bloomington, and was previously a Research Assistant Professor at the University of Southern California (USC), Los Angeles, and a Postdoc at Microsoft Research, San Francisco. He is a Senior Member of the IEEE and the ACM.
