Elsevier

Computers & Geosciences

Volume 144, November 2020, 104563

Case study
GeoDenStream: An improved DenStream clustering method for managing entity data within geographical data streams

https://doi.org/10.1016/j.cageo.2020.104563

Highlights

  • A clustering method for entity-based data streams with geographical information.

  • Information on entity-cluster relationships is preserved over space and time.

  • Memory use and handling of overlapping points and false noise are enhanced.

  • Clustering of synthetic and real stream data demonstrates improved performance.

Abstract

Data streams have become an integral part of the rapidly evolving modern information landscape in various application domains. Stream clustering, and in particular density-based clustering, has emerged as one of the most commonly used data stream analysis tasks. Several density-based stream clustering methods have been proposed; chief among them is DenStream. Existing DenStream clustering methods usually preserve only the key summary descriptors about each cluster, such as the center and radius. Such an approach is not suitable for streams that observe discrete entities, since the clustering process does not maintain the entity-level composition of each cluster over time. The primary challenge we explore in this paper is therefore how existing DenStream clustering methods can be enhanced to support entity-based stream mining in geographical space. In view of this consideration, this paper presents GeoDenStream, a spatiotemporal entity-based stream clustering method. Building on DenStream, GeoDenStream is particularly suitable for clustering discrete entities due to its ability to track the relationship between entities and clusters over time and its ability to recover data that has been incorrectly labeled as noise. Memory efficiency in GeoDenStream is achieved by using a combination of data pruning and indexing. The performance of GeoDenStream was evaluated with both synthetic and real-world stream data from a popular social media platform (Twitter). The results of these evaluations show that GeoDenStream is able to efficiently handle memory constraints, overlapping data points, and false noise.

Introduction

In recent years, data streams have become an integral part of the rapidly evolving modern information landscape. Various application domains, such as health (Althouse et al., 2015), transportation (Liu et al., 2011), finance (Liu et al., 2010), communication (Naaman et al., 2010), energy (Vikhorev et al., 2013), climate and weather (Freeman et al., 2017), and environmental monitoring (Funk et al., 2015), produce real-time data streams, and rely heavily on the availability of near-continuous data flows for higher-level reasoning and decision making (Valle et al., 2009). In many of these domains, data streams are closely associated with human activity in geographical spaces. Examples of such activity-driven streams range from a user's (entity) check-ins and check-outs at access-controlled facilities (Kromwijk et al., 2010), to users' GPS-enabled movement tracking streams (Moreira-Matias et al., 2016) and geotagged content sharing in social media (Stefanidis et al., 2013). The tight coupling between space, time, and activity in such streams can potentially provide a rich source of information about human behavior and activity patterns. This potential and the emerging need to analyze such streams, which has fostered growing interest within the data mining community (Atluri et al., 2018), serves as the primary motivation for the work presented here.

Generally, it is possible to conceptualize a data stream $S$ as consisting of a sequence of $n$ time-stamped records $(X_1,t_1),(X_2,t_2),\ldots,(X_n,t_n)$, where each record $X_i$ is comprised of a set of $d$ attributes $\{x_{i1},x_{i2},\ldots,x_{id}\}$, and $t_i$ is a time stamp indicating when the record was created or received (Aggarwal et al., 2003). While the record attribute vector can contain any type of attribute information, this paper explores the analysis of streams in which at least one of the record attributes contains geographical information (e.g., geographical coordinates or a toponym). In the remainder of this paper, we use the term geographical data stream to denote such a stream. Notably, geographical data streams are spatiotemporal in nature as they combine spatial and temporal information in a single stream record. Additionally, it is important to note that a data stream can, in general, be dedicated to capturing data about one of two types of constructs: entities and events (Krempl et al., 2014). Here, the term entity relates to a discrete thing that endures over time, e.g., a building, a vehicle, or a person, while the term event relates to an occurrence in space and time, e.g., the detection of smoke at a particular sensor location or the detection of congestion along a highway. In practice, a key difference between entity data streams and event data streams is that the former must include a unique entity identifier (e.g., a vehicle ID), while the latter may not. When dealing with entities, it is also important to recognize that entity stream data can be analyzed at different levels of granularity, from the discrete entity (e.g., a person moving in geographical space) level to groups of entities (e.g., a group of people moving together).
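The record structure described above can be illustrated with a minimal Python sketch. It is not taken from the GeoDenStream code base: the class name `StreamRecord`, its field names, the use of longitude/latitude as the geographical attribute, and the helper `group_by_entity` are all assumptions made for illustration. The sketch shows a time-stamped record whose entity identifier is optional (present in entity streams, possibly absent in event streams) and how entity records could be grouped into per-entity sequences.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class StreamRecord:
    """One time-stamped record (X_i, t_i); all field names are hypothetical."""
    timestamp: float                  # t_i: when the record was created or received
    lon: float                        # geographical attribute (assumed: coordinates)
    lat: float
    entity_id: Optional[str] = None   # present in entity streams, may be missing in event streams
    attributes: dict = field(default_factory=dict)  # remaining attributes of X_i


def group_by_entity(records: Iterable[StreamRecord]) -> Dict[str, List[StreamRecord]]:
    """Collect records that carry an entity identifier into time-ordered per-entity lists."""
    tracks = defaultdict(list)
    for record in records:
        if record.entity_id is not None:  # records without an identifier are skipped
            tracks[record.entity_id].append(record)
    return {eid: sorted(recs, key=lambda r: r.timestamp) for eid, recs in tracks.items()}


if __name__ == "__main__":
    stream = [
        StreamRecord(2.0, -71.06, 42.36, entity_id="user_1"),
        StreamRecord(1.0, -71.05, 42.35, entity_id="user_1"),
        StreamRecord(1.5, -71.00, 42.30),  # an event-style record without an entity identifier
    ]
    for entity, recs in group_by_entity(stream).items():
        print(entity, [r.timestamp for r in recs])
```

Grouping by identifier corresponds to analysis at the level of the discrete entity; analysis of groups of entities would require an additional aggregation step that is not shown here.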

Given geographical data streams, it is often of interest to analyze them in order to derive higher-level information that would support reasoning and decision making. Such analysis can include a wide range of operations, from basic data analytics to clustering, pattern and entity mining, event detection, and process modeling (Krempl et al., 2014). Among these, clustering has emerged as one of the most commonly used analysis operations (Von Luxburg, 2007; Xu and Tian, 2015). As a result, various stream data clustering algorithms have been proposed based on a range of data models and similarity (or distance) measures (Gaber et al., 2005), which can be broadly organized into five primary classes, namely Growing Neural Gas (GNG) methods, hierarchical methods, partitioning methods, density-based methods, and grid-based methods (Ghesmoune et al., 2016).

Selecting an algorithm from one of these classes is not always straightforward due to the underlying difficulty in defining a universal notion of a cluster that can be applied in any context. Furthermore, the algorithms in each class may rely on a different set of assumptions, criteria, and underlying models. Consequently, the selection of the clustering process often tends to be domain-specific and exploratory in nature (Estivill-Castro, 2002). When clustering geographical data (and data streams), density- or grid-based methods, such as DenStream (Cao et al., 2006), StreamOptics (Tasoulis et al., 2007), or FlockStream (Forestiero et al., 2009), are often selected (Xu and Tian, 2015). This selection can be attributed, at least in part, to two primary reasons. First, the concept of density naturally lends itself to spatial and spatiotemporal domains since, in these domains, the notion of a cluster is often associated with the “high concentration” of data points. Second, density-based clustering methods offer several distinct characteristics that are advantageous when dealing with activity-based data. Specifically, density-based methods (i) do not require a priori information about the number of clusters, (ii) can handle clusters with arbitrary shapes, and (iii) detect and handle outliers (Amini et al., 2014).

Another critical issue that should be addressed when selecting a clustering method is the way in which cluster information is maintained and reported. In some clustering methods, the focus of the process is to detect whether one or more clusters exist and, when clusters are detected, to report and preserve only key summary descriptors about each cluster. An example of this approach can be found in the algorithm presented by O’Callaghan et al. (2002), in which only the centers of clusters are maintained over time as stream data is processed. Similarly, in the framework presented by Aggarwal et al. (2003), only information about the center and radius of each micro cluster, along with unique cluster IDs, is maintained over time. A key advantage of maintaining only summary descriptors is that it enables managing the clustering process efficiently since each cluster, which can potentially include a large number of stream records, is described only by a limited set of data-driven parameters (e.g., center coordinates and a radius). Such an approach, however, is not suitable for streams that observe discrete entities over time, such as moving vehicles, travelling individuals, or the geotagged postings of a social media user, since the clustering process does not maintain the entity-level composition of each cluster over time. The challenge we address in this paper is therefore how to adapt the commonly used online-offline phase in density-based clustering to support entity stream mining.
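The trade-off between summary descriptors and entity-level composition can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the data structure used by DenStream or GeoDenStream: the class names `SummaryCluster` and `EntityTrackingCluster` are hypothetical, the radius is taken to be the root-mean-square distance of the points from the center, and temporal fading of old points is deliberately left out.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SummaryCluster:
    """Keeps only additive summaries, so memory use is constant per cluster (assumed design)."""
    cluster_id: int
    n: int = 0             # number of absorbed points
    sum_x: float = 0.0     # linear sums of the coordinates
    sum_y: float = 0.0
    sum_sq: float = 0.0    # sum of squared coordinates (x^2 + y^2)

    def add(self, x: float, y: float) -> None:
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_sq += x * x + y * y

    @property
    def center(self):
        return (self.sum_x / self.n, self.sum_y / self.n)

    @property
    def radius(self) -> float:
        # RMS distance from the center, derived from the summaries alone.
        cx, cy = self.center
        return math.sqrt(max(self.sum_sq / self.n - (cx * cx + cy * cy), 0.0))


@dataclass
class EntityTrackingCluster(SummaryCluster):
    """Additionally remembers which entities contributed points, at the cost of extra memory."""
    members: dict = field(default_factory=dict)  # entity_id -> time the entity was last seen

    def add_entity(self, entity_id: str, x: float, y: float, t: float) -> None:
        self.add(x, y)
        self.members[entity_id] = t


if __name__ == "__main__":
    cluster = EntityTrackingCluster(cluster_id=1)
    cluster.add_entity("vehicle_a", 0.0, 0.0, t=1.0)
    cluster.add_entity("vehicle_b", 2.0, 0.0, t=2.0)
    print(cluster.center, cluster.radius, sorted(cluster.members))
```

A `SummaryCluster` can report where a cluster is and how large it is, but not which vehicles or users formed it; the `members` mapping in the subclass is the kind of additional bookkeeping that entity stream mining requires.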

In view of these considerations, this paper proposes a method for enhancing existing density-based stream clustering methods in order to support entity stream mining in geographical space. For this purpose, we build on DenStream, a density-based clustering method presented by Cao et al. (2006). The selection of DenStream in this paper is based on three key considerations. The first relates to the conceptual framework behind it: DenStream is based on the conceptual framework for clustering evolving data streams proposed by Aggarwal et al. (2003), which involves the creation of micro and macro clusters in a two-step (online and offline) processing approach that is employed in many stream clustering methods. The second consideration relates to the historical development of density-based stream clustering methods: Since its introduction in 2006, DenStream has served as the foundation for the development of various other density-based algorithms. A third consideration relates to the availability of the DenStream algorithm. Consequently, we argue that because of these considerations, the enhancements of DenStream we propose could, in principle, be more easily adapted to enhance other density-based stream clustering methods to support traceable spatiotemporal clustering.

The remainder of this paper is organized as follows. Section 2 provides an overview of the DenStream algorithm and describes in more detail its key limitations in the context of entity stream mining in geographical space. Building on DenStream, its limitations, and the considerations noted above, Section 3 introduces GeoDenStream, a density-based stream clustering method that supports entity stream mining in geographical space. To showcase the utility of GeoDenStream, Section 4 summarizes the clustering analysis of two real-world Twitter datasets. Finally, a discussion and summary of these results are provided in Section 5.

Section snippets

Conceptual framework

To conceptualize DenStream in the context of entity stream data in geographical space, consider a data stream in which each record comprises a data “point”, i.e., a geographic location (for example, in the form of geographic coordinates), a timestamp, and a set of related attributes that describe an entity. The DenStream clustering method applies the core-micro-cluster approach to detect arbitrary-shaped clusters (Cao et al., 2006). In this approach, a core-micro-cluster is

GeoDenStream

Motivated by the limitations of DenStream with respect to entity stream clustering, GeoDenStream focuses mainly on generating clusters while maintaining point entity information across clustering iterations. While the original DenStream is unable to support entity-level geographical mining because of its primary focus on the number and center of clusters rather than keeping track of the relationships between entities and clusters, GeoDenStream is designed for geographical analysis based on the

Case studies

Two case studies were conducted in order to examine the utility of GeoDenStream for clustering spatiotemporal data from social media streams. In both case studies, Twitter, a popular social media platform, served as the stream data source. In particular, two Twitter data streams were collected: a first set of 673,740 tweets about the Boston Marathon bombing in 2013; and a second set of 963,990 tweets about the Zika virus epidemics in 2015. Both datasets were collected using a worldwide

Discussion and conclusion

In this paper, we have presented GeoDenStream, a novel method for clustering spatiotemporal data streams. Building on DenStream, this method is particularly suitable for analyzing entity-based geographical data streams such as social media data due to three unique characteristics: its ability to track and maintain information about the identity and composition of clusters over time and space, its ability to handle spatially overlapping data points, and its improved ability to handle noise.

Computer code and data availability

GeoDenStream is implemented in Java as an extension to the MOA framework source code (version 2018.07.0) and is hosted at https://github.com/manqili/GeoDenStream. The code is compiled as a JAR package that is executable in Windows, Linux, and Mac OS platforms. The synthetic datasets used for verification are available at https://github.com/manqili/GeoDenStream/tree/master/TestDatasets.

Declaration of competing interest

The authors declare that they have no conflict of interest.

Acknowledgments

The authors would like to thank the authors of the MOA project (Bifet et al., 2010) for providing the MOA project code framework that was used in this research.

References (48)

  • A. Bifet et al.

    MOA: massive online analysis

    J. Mach. Learn. Res.

    (2010)
  • F. Cao et al.

    Density-based clustering over an evolving data stream with noise

  • Y. Chen et al.

    HiSpatialCluster: a novel high-performance software tool for clustering massive spatial points

    Trans. GIS

    (2018)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

  • V. Estivill-Castro

    Why so many clustering algorithms: a position paper

    ACM SIGKDD Explor. Newsl.

    (2002)
  • A. Forestiero et al.

    FlockStream: a bio-inspired algorithm for clustering evolving data streams

  • E. Freeman et al.

    ICOADS Release 3.0: a major update to the historical marine climate record

    Int. J. Climatol.

    (2017)
  • C. Funk et al.

    The climate hazards infrared precipitation with stations—a new environmental record for monitoring extremes

    Scientific Data

    (2015)
  • M.M. Gaber et al.

    Mining data streams: a review

    ACM SIGMOD Rec.

    (2005)
  • M. Ghesmoune et al.

    State-of-the-art on clustering data streams

    Big Data Analytics

    (2016)
  • A. Guttman

    R-trees: a dynamic index structure for spatial searching

    Proc. SIGMOD International Conference on Management of Data (SIGMOD 84)

    (1984)
  • M. Hahsler et al.

    streamMOA: Interface for MOA Stream Clustering Algorithms

    (2015)
  • M. Hahsler et al.

    Introduction to stream: an extensible framework for data stream clustering research with R

    J. Stat. Software

    (2017)
    1

    Present address: 27 Wangfujing Street, Beijing 100710, China.