Abstract
The tremendous increase in the use of mobile devices equipped with GPS and other location sensors has resulted in the generation of a huge amount of movement data. In recent years, mining this data to understand the collective mobility behavior of humans, animals and other objects has become popular. Numerous mobility patterns, and algorithms for mining them, have been proposed, each representing a specific movement behavior. The convoy pattern is one such pattern, which can be used to find groups of people moving together in public transport or to prevent traffic jams. A convoy is a set of at least m objects moving together for at least k consecutive time stamps, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns do not scale to real-life dataset sizes. Therefore, in this paper, we propose a generic distributed convoy pattern mining algorithm called DCM and show how such an algorithm can be implemented using the MapReduce framework. We present a cost model for DCM and a detailed theoretical analysis backed by experimental results. We show the effect of partition size on the performance of DCM. The results of our experiments on different datasets and hardware setups show that our distributed algorithm is scalable in terms of data size and number of nodes, and more efficient than any existing sequential or distributed convoy pattern mining algorithm, showing speed-ups of up to 16 times over SPARE, the state-of-the-art distributed co-movement pattern mining framework. DCM is thus able to process large datasets that SPARE is unable to handle.
References
Aung HH, Tan KL (2010) Discovery of evolving convoys. In: International conference on scientific and statistical database management. Springer, pp 196–213
Brinkhoff T (2000) Generating network-based moving objects. In: Scientific and statistical database management, 2000. Proceedings. 12th international conference on. IEEE, pp 253–255
Brinkhoff T (2002) A framework for generating network-based moving objects. GeoInformatica 6(2):153–180
Chen TS, Chang CY (2002) Skewed data partition and alignment techniques for compiling programs on distributed memory multicomputers. J Supercomput 21(2):191–211
Dai BR, Lin I, et al. (2012) Efficient map/reduce-based dbscan algorithm with optimized data partition. In: Cloud computing (CLOUD), 2012 IEEE 5th international conference on. IEEE, pp 59–66
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Douglas DH, Peucker TK (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica 10(2):112–122
Ester M, Kriegel HP, Sander J, Xu X, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol 96, pp 226–231
Fan Q, Zhang D, Wu H, Tan KL (2016) A general and parallel platform for mining co-movement patterns over large-scale trajectories. Proc VLDB Endowment 10(4):313–324
Gudmundsson J, van Kreveld M (2006) Computing longest duration flocks in trajectory data. In: Proceedings of the 14th annual ACM international symposium on advances in geographic information systems. ACM, pp 35–42
He Y, Tan H, Luo W, Feng S, Fan J (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
Hua KA, Lee C (1991) Handling data skew in multiprocessor database computers using partition tuning. In: VLDB. Citeseer, pp 525–535
Jeung H, Shen HT, Zhou X (2008) Convoy queries in spatio-temporal databases. In: 2008 IEEE 24th international conference on data engineering. IEEE, pp 1457–1459
Jeung H, Yiu ML, Zhou X, Jensen CS, Shen HT (2008) Discovery of convoys in trajectory databases. Proc VLDB Endowment 1(1):1068–1080
Kalnis P, Mamoulis N, Bakiras S (2005) On discovering moving clusters in spatio-temporal data. In: International symposium on spatial and temporal databases. Springer, pp 364–381
Kwon Y, Ren K, Balazinska M, Howe B, Rolia J (2013) Managing skew in hadoop. IEEE Data Eng Bull 36(1):24–33
Lacerda T, Fernandes S (2016) Scalable real-time flock detection. In: Global communications conference (GLOBECOM), 2016 IEEE. IEEE, pp 1–7
Naserian E, Wang X, Xu X, Dong Y (2016) Discovery of loose travelling companion patterns from human trajectories. In: High performance computing and communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016 IEEE 18th International Conference on. IEEE, pp 1238–1245
Orakzai F, Calders T, Pedersen TB (2016) Distributed convoy pattern mining. In: 17th IEEE international conference on mobile data management
Orakzai F, Devogele T, Calders T (2015) Towards distributed convoy pattern mining. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, GIS ’15. ACM, New York, pp 50:1–50:4. https://doi.org/10.1145/2820783.2820840
Patwary MMA, Palsetia D, Agrawal A, Liao WK, Manne F, Choudhary A (2012) A new scalable parallel dbscan algorithm using the disjoint-set data structure. In: High performance computing, networking, storage and analysis (SC), 2012 international conference for. IEEE, pp 1–11
Tang LA, Zheng Y, Yuan J, Han J, Leung A, Hung CC, Peng WC (2012) On discovery of traveling companions from streaming trajectories. In: 2012 IEEE 28th International conference on data engineering (ICDE). IEEE, pp 186–197
Vieira MR, Bakalov P, Tsotras VJ (2009) On-line discovery of flock patterns in spatio-temporal data. In: Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems. ACM, pp 286–295
Wang D, Joshi G, Wornell G (2014) Efficient task replication for fast response times in parallel computation. In: ACM SIGMETRICS performance evaluation review, vol 42. ACM, pp 599–600
Yoon H, Shahabi C (2009) Accurate discovery of valid convoys from moving object trajectories. In: ICDM workshops, pp 636–643
Yuan J, Zheng Y, Xie X, Sun G (2011) Driving with knowledge from the physical world. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 316–324
Yuan J, Zheng Y, Zhang C, Xie W, Xie X, Sun G, Huang Y (2010) T-drive: driving directions based on taxi trajectories. In: Proceedings of the 18th SIGSPATIAL International conference on advances in geographic information systems. ACM, pp 99–108
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
Appendix A: Scalability on the NUMA architecture
NUMA (Non-Uniform Memory Access) systems are low-cost multi-processor platforms that support large numbers of processors on a single board. Fast CPUs are generally constrained by memory bandwidth under memory-intensive workloads. Symmetric multiprocessing (SMP) systems use a shared bus to connect processors, so many processors have to compete for memory bandwidth. The NUMA architecture solves this problem by connecting several low-end processor nodes, each with its own cache and memory, through a high-speed interconnect. Each node has a memory controller that allows it to use the memory of all other nodes in addition to its own, thus abstracting the memory as a single image. When a processor requests data from a memory location that does not exist in its local memory, the data is transferred over the NUMA interconnect, which is slower than the connection between the processor and its local memory. Thus, memory access time is not uniform and varies depending on whether the access is local or remote.
In NUMA systems, the cache coherence problem occurs when two or more processors access the same shared data. If one processor modifies its copy of the data, the copies of this data in the caches of other processors become stale. ccNUMA (cache-coherent NUMA) machines ensure that a processor accessing a memory location receives the most up-to-date version of the data. Cache coherence can be ensured in either software or hardware; however, software approaches tend to be slower than hardware ones.
We analysed the performance of DCMMR on AMD Opteron based NUMA machines. Figure 31 shows the architecture of the AMD Opteron 6300 series processors. The processor has two NUMA nodes connected by a HyperTransport (HT) bus. Each node has 8 cores, arranged in 4 pairs such that each pair shares a Floating Point Unit (FPU) and a 2 MB L2 cache. The pairs are connected to each other by a crossbar switch, which connects to the HT bus through an HT interface. Each node has its own memory controller with 2 channels, each supporting up to 32 GB of memory.
Figure 32 shows the architecture of the AMD Opteron 6300 series quad-processor ccNUMA system which we used for one set of our experiments. The system consists of 4 AMD Opteron 6376 processors (Fig. 31) interconnected through HT buses. The system has 512 GB of memory (128 GB per processor, 64 GB per NUMA node). If a processor core is the first one to request a memory page, the page is mapped to the memory of the node to which the core belongs (the first-touch policy). A NUMA-aware OS tries to keep a thread running on the same core pair because the pair shares an L2 cache. Moving a thread to another core pair causes performance degradation because of cache invalidation, and moving it to another node hurts performance further because it then needs to fetch data from the remote node’s memory. Therefore, although running a process on multiple cores via context switching increases performance, the increase might not be linear, depending on which cores the process is moved to.
If an algorithm accesses all of its data from memory, ccNUMA increases the aggregate memory bandwidth by a factor roughly equal to the number of NUMA nodes. In our case, we would expect 8 times the memory bandwidth of an SMP machine, but this does not necessarily mean that the performance of an algorithm will scale linearly with the number of cores, because of the performance bottlenecks explained above. For optimal NUMA performance, a NUMA-aware OS should take the following steps:
- Schedule a process on cores as close as possible to the memory that contains its data.
- Maintain a scheduling queue per node.
- Allocate a process’s memory within the memory of a single node.
- Schedule all child processes on the same node during the lifetime of the parent process.
The two most common policies supported by the Linux kernel are NODE LOCAL and INTERLEAVE. In NODE LOCAL mode, an allocation occurs from the memory node local to where the code is currently executing, whereas in INTERLEAVE mode, allocations occur round-robin across the nodes. The INTERLEAVE policy is used to distribute memory accesses for data structures that may be accessed from multiple processors in the system, in order to have an even load on the interconnect and on the memory of each node.
The memory management policies of the OS work best for the general case, not for a specific application with a different memory access behaviour. When the memory load of a NUMA system increases, its memory management overhead increases, degrading overall performance. Therefore, the best approach is to have the application do the management itself. Hadoop runs in Java Virtual Machines (JVMs), which come with support for NUMA, but Hadoop itself is not NUMA-aware. Thus, an algorithm running on Hadoop on a NUMA system shows lower scalability in terms of number of cores when compared to its execution on a cluster of SMP machines with the same number of cores.
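Although Hadoop itself is not NUMA-aware, the JVMs it launches can be asked to use NUMA-aware heap allocation via the HotSpot flag -XX:+UseNUMA (effective in combination with the parallel collector). A hedged configuration sketch, using the standard Hadoop property for map-task JVM options (heap size shown is illustrative):

```xml
<!-- Sketch: passing the HotSpot NUMA-aware allocator flag to map-task JVMs.
     -XX:+UseNUMA takes effect with -XX:+UseParallelGC; whether it helps
     depends on the workload and task placement. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4g -XX:+UseParallelGC -XX:+UseNUMA</value>
</property>
```

This only makes each task’s heap node-local; it does not make Hadoop’s task scheduling itself NUMA-aware.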
Orakzai, F., Pedersen, T.B. & Calders, T. Distributed mining of convoys in large scale datasets. Geoinformatica 25, 353–396 (2021). https://doi.org/10.1007/s10707-020-00431-w