A distributed engine for scalable path queries over temporal property graphs
Introduction
Graphs are a natural model to represent and analyze linked data in various domains. Property graphs allow vertices and edges to have associated key–value pair properties, besides the graph structure. This forms a rich information schema and has been used to capture knowledge graphs (concepts, relations) [32], social networks (person, forum, message) [8], epidemic networks (subject, infected status, location) [29], and financial and retail transactions (person, product, purchase) [22].
Path queries are a common class of queries over property graphs [14], [44]. Here, the user defines a sequence of predicates over vertices and edges that should match along a path in the graph. E.g., in the property graph for a community of users in Fig. 1, the vertices are labeled with their IDs, their colors indicate their type (blue for a Person and orange for a Post), and they have a set of properties listed as Name:Value. The edges are relationships, with types such as Follows, Likes and Created. We can define an example 3-hop path query “[EQ1] Find a person (vertex type) who lives in the country ‘UK’ (vertex property) and follows (edge type) a person who follows another person who is tagged with the label ‘Hiking’ (vertex property)”. This query would match Cleo→Alice→Bob, if we ignore the time intervals. Path queries are used to identify concept pathways in knowledge graphs, find friends in social networks, detect fake news, and suggest products in retail websites [14], [24], [44]. They also need to be performed rapidly, as part of interactive requests from websites or exploratory queries by analysts.
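To make the notion of a path query concrete, the following is a minimal sketch that evaluates EQ1 over a toy in-memory property graph while ignoring time. The graph is only loosely modelled on the description of Fig. 1: the dictionary layout, the function name `match_path`, and any property value not stated in the text (such as Alice's country) are assumptions made for illustration, not the engine's data structures.

```python
# Hypothetical toy property graph; values not given in the text are assumed.
VERTICES = {
    "Cleo":  {"type": "Person", "Country": "UK"},
    "Alice": {"type": "Person", "Country": "India"},  # assumed value
    "Bob":   {"type": "Person", "Tag": "Hiking"},
    "Don":   {"type": "Person"},
}
EDGES = [  # (source, edge type, sink)
    ("Cleo", "Follows", "Alice"),
    ("Alice", "Follows", "Bob"),
    ("Bob", "Follows", "Don"),
]


def match_path(vertex_preds, edge_types):
    """Return every path whose i-th vertex satisfies vertex_preds[i]
    and whose i-th edge has the type edge_types[i]."""
    # Seed with all vertices that satisfy the first vertex predicate.
    paths = [[v] for v, props in VERTICES.items() if vertex_preds[0](props)]
    # Extend each partial path by one edge and one vertex per hop.
    for hop, etype in enumerate(edge_types, start=1):
        extended = []
        for path in paths:
            for src, t, dst in EDGES:
                if src == path[-1] and t == etype and vertex_preds[hop](VERTICES[dst]):
                    extended.append(path + [dst])
        paths = extended
    return paths


def is_person(props):
    return props.get("type") == "Person"


eq1 = match_path(
    [
        lambda p: is_person(p) and p.get("Country") == "UK",
        is_person,
        lambda p: is_person(p) and p.get("Tag") == "Hiking",
    ],
    ["Follows", "Follows"],
)
print(eq1)  # [['Cleo', 'Alice', 'Bob']]
```

The query is simply an alternating sequence of vertex predicates and edge types, and the evaluation extends partial paths one hop at a time, which is the shape that a distributed, iterative execution can also follow.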
While graph databases are designed for transactional read and write workloads, we consider graphs that are updated infrequently but queried often. For these workloads, graph query engines load and retain property graphs in-memory to service requests with low latency, without the need for locking or consistency protocols [7], [42]. They may also create indexes to accelerate these searches [31], [50]. Property graphs can be large, both in the number of vertices and edges and in the number of properties on each vertex or edge. Such graphs can exceed the memory of a single machine, often because of the properties. This necessitates the use of distributed systems to scale to large graphs [27], [41].
Time is an increasingly common graph feature in a variety of domains [16], [18], [29], [53]. However, existing property graph data models fail to consider it as a first-class entity. Here, we distinguish between graphs with a time interval or a lifespan associated with their entities (properties, vertices, edges), and those where the entities themselves change over time and the history is available. We call the former static temporal graphs and the latter dynamic temporal graphs. Yet another class is streaming graphs, where the topology and properties change in real-time, and queries are performed on this evolving structure [13], [48]; that is outside the scope of this article.
E.g., in the temporal graph in Fig. 1, the lifespan, [start, end), is indicated on the vertices, edges and properties. The start time is inclusive while the end time is exclusive. Other than the properties of Cleo, the remaining entities of the graph form a static temporal graph as they are each valid only for a single time range. But the value of the Country property of Cleo changes over time, making it a dynamic temporal graph.
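The half-open lifespan convention described above can be sketched as follows. This is an illustrative helper, not code from the system: the class name `Interval`, its methods, and the integer time-points used in the example are all assumptions.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    """A lifespan [start, end): start is inclusive, end is exclusive."""
    start: int
    end: int

    def contains(self, t: int) -> bool:
        # The end time-point itself is not part of the interval.
        return self.start <= t < self.end

    def intersect(self, other: "Interval") -> Optional["Interval"]:
        # The common sub-interval, or None if the two do not overlap.
        s, e = max(self.start, other.start), min(self.end, other.end)
        return Interval(s, e) if s < e else None


# Hypothetical lifespans for two successive values of one property.
us = Interval(0, 40)
uk = Interval(40, 100)

print(us.contains(40), uk.contains(40))  # False True
print(us.intersect(uk))                  # None
```

Because the end is exclusive, two adjacent lifespans such as [0, 40) and [40, 100) never overlap, so a property that changes value at time-point 40 has exactly one value at every time-point.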
This gap is reflected not just in the data model but also in the queries supported. We make a distinction between time-independent (TI) and time-dependent (TD) queries, both being defined on a temporal graph [47]. TI queries are those which can be answered by examining the graph at a single point in time (a snapshot), e.g. EQ1 executed on the temporal graph. In contrast, TD queries capture temporal relations between the entities across consecutive time intervals, e.g., “[EQ2] Find people tagged with ‘Hiking’ who liked a post tagged as ‘Vacation’, before the post was liked by a person named ‘Don’”, and “[EQ3] Find people who started to follow another person, after the latter stops following ‘Don’”. Treating time as just another property fails to express temporal relations such as ensuring time-ordering among the entities on the path. While EQ2 and EQ3 should match the paths Bob→PicPost←Don and Alice→Bob→Don, respectively, such queries are hard, if not impossible, to express in current graph databases. This problem is exacerbated for path queries over dynamic temporal graphs. E.g., the query EQ1 over the dynamic temporal graph should not match Cleo→Alice→Bob since at the time Cleo was living in ‘UK’, she was not following Alice.
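The two temporal checks discussed above can be illustrated with a small sketch. All lifespans below are hypothetical, since the values in Fig. 1 are not reproduced in the text, and the helper names `overlap` and `starts_after` are illustrative stand-ins rather than the temporal predicates defined later in the article.

```python
# Lifespans are half-open (start, end) tuples; every value is hypothetical.
cleo_country = [("US", (0, 40)), ("UK", (40, 100))]  # property changes over time
cleo_follows_alice = (10, 30)
alice_follows_bob = (20, 90)
bob_follows_don = (5, 20)


def overlap(a, b):
    """Common sub-interval of a and b, or None if they are disjoint."""
    s, e = max(a[0], b[0]), min(a[1], b[1])
    return (s, e) if s < e else None


def starts_after(a, b):
    """True if interval a begins at or after the (exclusive) end of b."""
    return a[0] >= b[1]


# EQ1 on the dynamic temporal graph: Cleo must live in 'UK' *while* she
# follows Alice, so the two lifespans must share at least one time-point.
uk_span = next(span for value, span in cleo_country if value == "UK")
eq1_valid = overlap(uk_span, cleo_follows_alice)
print(eq1_valid)  # None, so Cleo does not match

# EQ3: Alice starts to follow Bob after Bob stops following Don.
eq3_valid = starts_after(alice_follows_bob, bob_follows_don)
print(eq3_valid)  # True
```

The first check is a constraint between a vertex property and an adjacent edge, and the second is an ordering constraint between two edges on the path; neither can be expressed by comparing a timestamp property against a constant.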
While platforms which process a snapshot at a time [30], [47] can be adapted to support TI queries over temporal graphs, TD queries cannot be expressed meaningfully. Even those that support TD algorithms enforce strict temporal ordering [15], requiring that the time intervals along the path should be increasing or decreasing, but not both; this limits query expressivity. These motivate the need to support intuitive temporal predicates to concisely express such temporal relations, and flexible platforms to execute them. Lastly, the scalability of existing graph systems is also limited, with few property graph query engines that operate on distributed memory systems with low latency [42], [52], let alone on temporal property graphs.
We make the following specific contributions in this article:
- We propose a temporal property graph model, and intuitive temporal predicates and aggregation operators for path queries on them (Section 3).
- We design a distributed execution model for these queries using the interval-centric computing model (Section 4).
- We develop a novel cost model that uses graph statistics to select the best from multiple execution plans (Section 5).
- We conduct a detailed evaluation of the performance and scalability of our engine on temporal graphs and query workloads derived from the LDBC benchmark. We compare this against three configurations of Neo4J, and against JanusGraph, which uses Apache Spark (Section 6).
We discuss related work in Section 2 and our conclusions in Section 7.
A prior version of this work appeared as a conference paper [36]. This article substantially extends it. Specifically, it introduces the temporal aggregation operator to the query model (Section 3.3) and implements it within the execution model; offers details, illustrations and complexity metrics for our query model, distributed execution model and query optimizations (Sections 3, 4 and 5); and provides a rigorous empirical evaluation, including two additional large dynamic temporal graphs, aggregation query workloads, weak scaling experiments, and results on the component times of query execution, besides a more detailed analysis of the cost model benefits and baseline platform comparisons (Section 6).
Section snippets
Distributed and temporal graph processing
There are several distributed graph processing platforms for running graph algorithms on commodity clusters and clouds [20]. These typically offer programming abstractions like Google Pregel’s vertex-centric computing model [30] and its component-centric variants [17], [45] to design algorithms such as Breadth First Search, centrality scores and mining [10]. These execute using a Bulk Synchronous Parallel (BSP) model, and scale to large graphs and applications that explore the entire graph.
Temporal concepts
The temporal property graph concepts used in this paper are drawn from our earlier work [15]. Time is a linearly ordered discrete domain whose range is the set of non-negative whole numbers. Each instant in this domain is called a time-point and an atomic increment in time is called a time-unit. A time interval is given by $\tau = [t_s, t_e)$ where $t_s < t_e$, which indicates an interval starting from and including $t_s$ and extending to but excluding $t_e$. Interval relations [5] are Boolean comparators between
Relaxed interval centric computing
The high-level architecture of our distributed query engine is shown in Fig. 2(a). Our query engine uses a distributed in-memory iterative execution model that extends and relaxes the Interval-centric Computing Model (ICM) [15]. ICM adds a temporal dimension to Pregel’s vertex-centric iterative computing model [30], and allows users to define their computation from the perspective of a single interval-vertex, i.e., the state and properties for a certain interval of a vertex’s
Query planning and optimization
A given path query can be executed using different distributed execution plans, each having a different execution time. The goal of the cost model is to quickly estimate the expected execution time of these plans and pick the optimal plan for execution. Rather than absolute accuracy of the query execution time, what matters is its ability to distinguish poor plans with high execution times from good plans with low execution times.
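The point that ranking matters more than absolute accuracy can be illustrated with a toy estimator. The sketch below is not the analytical cost model of this article: the function `estimate_cost`, the plan names, and all match counts and fan-outs are hypothetical, and the estimate simply counts the partial paths visited when a plan expands hop by hop.

```python
def estimate_cost(start_matches, fanouts):
    """Toy estimate: total partial paths visited, given the number of
    vertices matching the first predicate and the expected number of
    surviving extensions per partial path at each later hop."""
    frontier = start_matches
    total = start_matches
    for fanout in fanouts:
        frontier *= fanout
        total += frontier
    return total


# Two hypothetical plans for the same query, traversing from either end.
plans = {
    "left_to_right": estimate_cost(1000, [2.0, 0.1]),
    "right_to_left": estimate_cost(50, [3.0, 0.5]),
}
best = min(plans, key=plans.get)
print(best)  # right_to_left

# If every estimate is off by the same factor, the chosen plan is unchanged:
# only the relative order of the estimates affects the selection.
scaled = {name: 7.5 * cost for name, cost in plans.items()}
print(min(scaled, key=scaled.get) == best)  # True
```

Any estimator that preserves the relative order of the plans leads to the same choice, which is why a cheap model built from graph statistics can be sufficient even when its absolute predictions are imprecise.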
We propose an analytical cost model that uses statistics about
Workload
We use the social network benchmark from the Linked Data Benchmark Council (LDBC) [2] for our evaluation. It is a community-standard workload with realistic transactional path queries over a social network property graph. There are two parts to this benchmark, a social network graph generator and a suite of benchmark queries.
Conclusions
In this article, we have motivated the need for querying over large temporal property graphs and the lack of such platforms. We have proposed an intuitive temporal path query model to express a wide variety of requirements over such graphs, and designed the distributed engine to implement these at scale over the Graphite ICM platform. Our novel analytical cost model uses concise information about the graph to allow accurate selection of a distributed query execution plan from several
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
S. Ramesh was supported by the Maersk CDS M.Tech. Fellowship, India. Y. Simmhan was supported by the Swarna Jayanti Fellowship from DST, India under grant number SB/SJF/2019-20/02. The authors thank Ravishankar Joshi from BITS-Pilani, Goa for his assistance with the experiments.
References (54)
- SPARQL query language for RDF (2008)
- The LDBC Social Network Benchmark (Version 0.3.2), Technical Report (2019)
- Neo4J graph platform (2020)
- OrientDB graph database (2020)
- Maintaining knowledge about temporal intervals, Commun. ACM (1983)
- et al., ChronoGraph: Enabling temporal graph traversals for efficient information diffusion analysis over time, IEEE TKDE (2019)
- et al., In-memory graph databases for web-scale data, Computer (2015)
- M. Cha, H. Haddadi, F. Benevenuto, K.P. Gummadi, Measuring user influence in twitter: The million follower fallacy, in:...
- D. Chavarría-Miranda, V.G. Castellana, A. Morari, D. Haglin, J. Feo, Graql: A query language for high-performance...
- et al., G-miner: An efficient task-oriented graph mining system
- A lightweight communication runtime for distributed graph analytics
- Time-aggregated graphs for modeling spatio-temporal networks, J. Data Semant. XI
- Tracking the evolution of communities in dynamic social networks
- The parallel BGL: A generic library for distributed graph computations
- Scalable SPARQL querying of large RDF graphs, VLDB Endow.
- A graph model for E-commerce recommender systems, J. Am. Soc. Inf. Sci. Technol.
- Declarative and distributed graph analytics with GRADOOP, Proc. VLDB Endow.
Cited by (4)
- Pre-processing of RDF data for METIS partitioning, International Journal of Metadata, Semantics and Ontologies (2023)
- MAGMA: Proposing a Massive Historical Graph Management System, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2023)
- Higher-Order Relationship-Based Access Control: A Temporal Instantiation with IoT Applications, Proceedings of ACM Symposium on Access Control Models and Technologies, SACMAT (2022)
Shriram Ramesh is a Graduate Student at the Department of Computational and Data Sciences at the Indian Institute of Science, Bangalore. He is supported by the Maersk CDS M.Tech. Fellowship. His research interests include graph processing, distributed systems and database systems. He was part of the team that won IEEE TCSC SCALE Challenge Award in 2019. He has a Bachelor’s Degree in Electrical and Electronics Engineering and two years of consulting experience in the domain of Business Intelligence and Analytics.
Animesh Baranawal is a Graduate Research Student in the Department of Computational and Data Sciences at the Indian Institute of Science, Bangalore. His research interests include graph processing and distributed systems. He has a Bachelor’s Degree in Computer Science and Engineering and two years of industrial experience in Software Development.
Yogesh Simmhan is an Associate Professor at the Department of Computational and Data Sciences and a Swarna Jayanti Fellow at the Indian Institute of Science, Bangalore. His research explores abstractions, algorithms and applications on distributed systems. He has published over 100 peer-reviewed papers, and won the Best Paper Award at the IEEE International Conference on Cloud Computing (CLOUD) 2019, the IEEE TCSC SCALE Challenge Award in 2019 and 2012, the Distinguished Paper Award at EuroPar 2018, and the IEEE/ACM Supercomputing HPC Storage Challenge Award in 2008. He is the recipient of the IEEE TCSC Award for Excellence in Scalable Computing (Mid Career Researcher) in 2020. He is an Associate Editor-in-Chief of the Journal of Parallel and Distributed Computing (JPDC), an Associate Editor of Future Generation Computer Systems (FGCS), and earlier served as an Associate Editor of IEEE Transactions on Cloud Computing and a member of the IEEE Future Directions Initiative on Big Data.
Yogesh has a Ph.D. in Computer Science from Indiana University, Bloomington, and was previously a Research Assistant Professor at the University of Southern California (USC), Los Angeles, and a Postdoc at Microsoft Research, San Francisco. He is a Senior Member of the IEEE and the ACM.