Distributed mining of convoys in large scale datasets


Abstract

The tremendous increase in the use of mobile devices equipped with GPS and other location sensors has resulted in the generation of a huge amount of movement data. In recent years, mining this data to understand the collective mobility behavior of humans, animals and other objects has become popular. Numerous mobility patterns and mining algorithms for them have been proposed, each representing a specific movement behavior. The convoy pattern is one such pattern; it can be used, for instance, to find groups of people moving together in public transport or to prevent traffic jams. A convoy is a set of at least m objects moving together for at least k consecutive timestamps, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns do not scale to real-life dataset sizes. Therefore, in this paper we propose a generic distributed convoy pattern mining algorithm called DCM and show how such an algorithm can be implemented using the MapReduce framework. We present a cost model for DCM and a detailed theoretical analysis backed by experimental results, and we show the effect of partition size on the performance of DCM. The results of our experiments on different datasets and hardware setups show that our distributed algorithm is scalable in terms of data size and number of nodes, and more efficient than any existing sequential or distributed convoy pattern mining algorithm, showing speed-ups of up to 16 times over SPARE, the state-of-the-art distributed co-movement pattern mining framework. DCM is thus able to process large datasets which SPARE is unable to handle.
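For concreteness, the convoy definition above can be illustrated with a small, non-distributed sketch in Python. This is not the DCM algorithm presented in the paper; it merely checks the definition, assuming the spatial clusters of objects at each timestamp (e.g., from a density-based clustering step) are already given, and it does not deduplicate or maximise the reported convoys.

    def mine_convoys(clusters_per_time, m, k):
        """Naive check of the convoy definition: at least m objects staying
        in a common cluster for at least k consecutive timestamps.
        clusters_per_time is a list, ordered by timestamp, of lists of
        object-id sets (the clusters found at that timestamp)."""
        results = []
        candidates = []  # (object_set, start_timestamp)
        for t, clusters in enumerate(clusters_per_time):
            next_candidates = []
            for objs, start in candidates:
                extended = False
                for c in clusters:
                    common = objs & c
                    if len(common) >= m:
                        next_candidates.append((common, start))
                        extended = True
                if not extended and t - start >= k:
                    results.append((objs, start, t - 1))
            for c in clusters:           # every large cluster starts a new candidate
                if len(c) >= m:
                    next_candidates.append((set(c), t))
            candidates = next_candidates
        for objs, start in candidates:   # flush candidates that last until the end
            if len(clusters_per_time) - start >= k:
                results.append((objs, start, len(clusters_per_time) - 1))
        return results

    # toy example: objects 1, 2, 3 stay together for 3 consecutive timestamps
    snapshots = [[{1, 2, 3}, {7, 8}], [{1, 2, 3, 9}], [{1, 2, 3}]]
    print(mine_convoys(snapshots, m=3, k=3))   # [({1, 2, 3}, 0, 2)]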




Notes

  1. https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

  2. https://technology.finra.org/code/using-spark-transformations-for-mpreduce-jobs.html

  3. https://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/

  4. http://chorochronos.datastories.org/

  5. http://research.microsoft.com/apps/pubs/?id=152883

  6. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/User_guide_T-drive.pdf

  7. http://beagle.ci.uchicago.edu/technical-specification

  8. http://queue.acm.org/detail.cfm?id=2513149

  9. https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt



Corresponding author

Correspondence to Faisal Orakzai.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Scalability on the NUMA architecture

NUMA (Non-Uniform Memory Access) systems are low-cost multi-processor platforms that support large numbers of processors on a single board. Faster CPUs are generally constrained by memory bandwidth under memory-intensive workloads. Symmetric multiprocessing (SMP) systems use a shared bus to connect processors, so many processors have to compete for memory bandwidth. The NUMA architecture solves this problem by connecting several low-end processor nodes, each having its own cache and memory, through a high-speed interconnect. Each node has a memory controller which allows it to use the memory of all other nodes in addition to its own, thus abstracting the memory as a single image. When a processor requests data from a memory location that does not reside in its local memory, the data is transferred over the NUMA interconnect, which is slower than the connection between the processor and its local memory. Thus, memory access time is not uniform and varies depending on whether the access is local or remote.
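On a Linux system this non-uniformity can be inspected directly: the kernel exposes, for every NUMA node, its CPUs and its distance to every other node (the same information that numactl --hardware prints). The following sketch simply reads that table; on the SLIT scale a node's distance to itself is normally reported as 10, and remote nodes report larger values.

    import glob
    import os

    # Print each NUMA node's CPU list and its distance vector to all nodes,
    # as exposed by the Linux sysfs interface.
    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        name = os.path.basename(node)
        with open(os.path.join(node, "cpulist")) as f:
            cpus = f.read().strip()
        with open(os.path.join(node, "distance")) as f:
            distances = f.read().split()
        print(f"{name}: cpus={cpus} distances={distances}")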

In NUMA systems, the cache coherence problem occurs when two or more processors access the same shared data. If one processor modifies its copy of the data, the copies of this data in the caches of other processors become stale. ccNUMA (cache-coherent NUMA) machines ensure that a processor accessing a memory location receives the most up-to-date version of the data. Cache coherence can be ensured either in software or in hardware; however, software approaches tend to be slower than hardware ones.

We analysed the performance of DCMMR on AMD Opteron based NUMA machines. Figure 31 shows the architecture of the AMD Opteron 6300 series processors. The processor has two NUMA nodes connected by a HyperTransport (HT) bus. Each node has 8 cores, arranged in 4 pairs such that each pair shares a Floating Point Unit (FPU) and a 2 MB L2 cache. The pairs are connected to each other by a crossbar switch, which connects to the HT bus through an HT interface. Each node has its own memory controller with 2 channels, and each channel supports up to 32 GB of memory.

Fig. 31: AMD Opteron 6300 series processor architecture

Figure 32 shows the architecture of the AMD Opteron 6300 series quad-processor ccNUMA system which we used for one set of our experiments. The system consists of 4 AMD Opteron 6376 processors (Fig. 31) interconnected through HT buses. The system has 512 GB of memory (128 GB per processor, 64 GB per NUMA node). If a processor core is the first one to request a memory page, the page is mapped to the memory of the node to which the core belongs (first-touch policy). A NUMA-aware OS tries to keep a thread running on the same core pair because the pair shares an L2 cache; moving the thread to another core pair causes a performance degradation due to cache invalidation, and the thread takes a further performance hit if it is moved to another node, because it then has to fetch its data from a remote node's memory. Therefore, although running a process on multiple cores via context switching increases performance, the increase may not be linear, depending on the location of the core to which the process is moved.

Fig. 32: AMD Opteron 6300 series multi-node architecture (Footnote 7)
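One way an application can cooperate with the first-touch policy described above is to pin itself to the cores of a single node before allocating its working set, so that its pages stay local and the scheduler cannot migrate it across nodes. Below is a minimal sketch assuming a Linux system; the CPU ids used for node 0 are an assumption and should be read from /sys/devices/system/node/node0/cpulist on the actual machine.

    import os

    # Pin this process to the cores of one NUMA node (assumed here to be
    # CPUs 0-7 for node 0), then allocate: under first-touch the pages end
    # up in that node's local memory.
    node0_cpus = {0, 1, 2, 3, 4, 5, 6, 7}
    os.sched_setaffinity(0, node0_cpus)            # 0 = the calling process
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))

    working_set = bytearray(256 * 1024 * 1024)     # 256 MB touched locally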

If an algorithm accesses all of its data from memory, ccNUMA increases the aggregate memory bandwidth by a factor roughly equal to the number of NUMA nodes. In our case we can therefore expect 8 times the memory bandwidth of an SMP machine, but this does not necessarily mean that the performance of an algorithm scales linearly with the number of cores, because of the performance bottlenecks explained above. For optimal NUMA performance, a NUMA-aware OS needs to take the following steps:

  • Processes should be scheduled on cores as close as possible to the memory that contains their data.

  • The OS should maintain a scheduling queue per node.

  • Memory for a process should be allocated on a single node.

  • All child processes should be scheduled on the same node for the lifetime of the parent process.

The two most common policies supported by the Linux kernel are NODE LOCAL and INTERLEAVE (Footnotes 8 and 9). In NODE LOCAL mode, an allocation is served from the memory of the node on which the code is currently executing, whereas in INTERLEAVE mode, allocations are served round-robin across the nodes. The INTERLEAVE policy is used for data structures that may be accessed from multiple processors in the system, in order to spread the load evenly over the interconnect and the memory of each node.
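As an illustration, and assuming the numactl utility is available, the two policies can be requested explicitly when launching a process; worker.py below is a hypothetical worker script, not part of the paper's code.

    import subprocess

    worker = ["python3", "worker.py"]   # hypothetical worker script

    # NODE LOCAL style: allocate from the node the code is executing on
    subprocess.run(["numactl", "--localalloc"] + worker, check=True)

    # INTERLEAVE: spread allocations round-robin over all NUMA nodes, which
    # evens out the load on the interconnect and on each node's memory
    subprocess.run(["numactl", "--interleave=all"] + worker, check=True)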

The memory management policies of the OS work best for the general case, not for a specific application with a different memory access behaviour. When the memory load of a NUMA system increases, its memory management overhead increases, resulting in degraded overall performance. The best approach is therefore to have the application do the management itself. Hadoop runs in Java Virtual Machines (JVMs), which come with support for NUMA, but Hadoop itself is not NUMA-aware. Thus, an algorithm running on Hadoop on a NUMA system shows lower scalability in terms of the number of cores than the same algorithm executed on a cluster of SMP machines with the same total number of cores.
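As a hedged example of what can be done at the JVM level, the HotSpot option -XX:+UseNUMA asks the garbage collector to prefer node-local allocation; it can be passed to the Spark driver and executor JVMs via spark-submit, although this does not make Hadoop or Spark themselves NUMA-aware. The driver script name dcm_job.py below is hypothetical.

    import subprocess

    # Pass the JVM NUMA flag to the Spark driver and executors.  This only
    # affects how each JVM allocates its heap; the framework's task
    # placement remains NUMA-oblivious.
    subprocess.run([
        "spark-submit",
        "--conf", "spark.executor.extraJavaOptions=-XX:+UseNUMA",
        "--conf", "spark.driver.extraJavaOptions=-XX:+UseNUMA",
        "dcm_job.py",
    ], check=True)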


Cite this article

Orakzai, F., Pedersen, T.B. & Calders, T. Distributed mining of convoys in large scale datasets. Geoinformatica 25, 353–396 (2021). https://doi.org/10.1007/s10707-020-00431-w

