Background

With the rapid development of mobile Internet technology, human society has entered an era in which people and information are highly interconnected. Online social networks such as Weibo in China, LinkedIn, Twitter, and Facebook are highly interactive and attract strong group participation, so related information converges and merges on them, giving rise to hot topics on the Internet. These hot topics aggregate information about current events (e.g., natural disasters, sports news, celebrity news) and carry geotag and timestamp information. These metadata (geotags and timestamps) are embedded within the content of the message, so an event can be analyzed by applying a space–time classification. In addition, hot topics are the seeds of online public opinion, which has an important impact on the stability of the country and society. Therefore, studying the popularity of hot topics along the spatial and temporal dimensions can help us better understand the dissemination of public opinion.

This paper uses data from Sina Weibo (https://weibo.com/), one of the largest micro-blogging systems in China, to study two questions: (1) the classification of hot topics on Weibo, by clustering their popularity time series with a clustering algorithm; and (2) the spatial distribution characteristics of hot topics on Weibo, based on location label data.

The remainder of the paper is organized as follows. In the “Related work” section, we introduce the related work in the area. The “Data collection and description” section gives an overview of our dataset, and the “Methods/experimental” section describes how we process the data and cluster the popularity time series. In the “Results and discussion” section, we report on our temporal and spatial analyses of popularity dynamics. We conclude the paper with the “Conclusion” section, which discusses our results.

Related work

In this section we briefly review three bodies of related work. First, we summarize the problem of modeling and predicting popularity, showing how it has been tackled and in which contexts. Second, we review research on the spatial and temporal patterns of popularity dynamics. Finally, we turn our focus to the study of the popularity dynamics of topics and micro-blogs.

Modeling and predicting popularity dynamics

Online content popularity has an enormous impact on opinions, culture, policy, and profits, especially with the advent of Web 2.0 and social media. In the last decade, the quantitative understanding of the popularity dynamics of online content has attracted much attention from academia [1,2,3,4,5]. Popularity dynamics represent many real social phenomena, such as video views on YouTube [6,7,8,9,10,11,12,13], the reading volume of tweets and news on social media [14,15,16,17,18,19], and movie views on online systems [20,21,22].

Previous work on online content popularity dynamics has two main aspects: modeling popularity dynamics and predicting popularity. Scholars have achieved many results in both. For example, Borghol developed a framework for studying the popularity dynamics of user-generated videos, and proposed a model that captures the key properties of these dynamics [23]. Gleeson studied the popularity dynamics of memes and considered competition-induced criticality in the model [24]. Kim developed a model to simulate the origin of the criticality in the meme popularity distribution on complex networks [25]. Li modeled information popularity dynamics via a branching process on micro-blog networks [26]. Bao [27] and Shen [28] modeled and predicted popularity dynamics via an influence-based self-excited Hawkes process and reinforced Poisson processes, respectively.

Spatio-temporal patterns of popularity dynamics

In recent years, with the spatio-temporal data left by human beings on social media, it has become possible for scholars to study the popularity of micro-blogs and news in temporal patterns and spatial patterns. For example, Wu predicted the popularity of social media using multi-scale temporal decomposition and presented a novel approach to factorize the popularity into user-item context and time-sensitive context for exploring the mechanism of popularity dynamics [29]. Yang studied the patterns of temporal variation in online media [30]. Stilo explored temporal mining of micro-blog texts and its application to event discovery [31]. Brambilla analyzed the temporal features of social media response to live events [32]. Trattner studied the popularity of recipes on two large and well visited online recipe portals (Allrecipes.com, USA and Kochbar.de, Germany) [33].

Along the spatial dimension, Overgoor focused on a method for brand popularity prediction and used it to analyze social media posts generated by various brands during a period of time [34]. Wang presented a spatio-temporal mapping system that visualizes a summary of geo-tagged social media as tags in a cloud and associates it with a web page by detecting spatio-temporal events [35]. Cunha addressed the problem of identifying and displaying tweet profiles by analyzing multiple types of data: spatial, temporal, social and content [36].

Topics and micro-blogs popularity dynamics

With the development of data mining and tracking technology, social media research has sprung up [37,38,39,40]. As one kind of social media, micro-blogs are widely used for sensing the real world. The popularity of micro-blogs is an important measure for evaluating the influence of pieces of information. Here, we restrict attention mostly to related work on popularity characterization and modeling for user-generated micro-blogs (tweets) and topics. For example, Ma modeled the temporal dynamics of popularity with multiple tipping points [14]. Zhao proposed a self-exciting point process model for predicting the popularity of tweets [16]. Sanli proposed the adoption of the so-called local variation in order to uncover salient dynamical properties, and found that popular hashtags present regular and thus less bursty behavior, suggesting its potential use for predicting online popularity in social media [17]. Bandari constructed a multi-dimensional feature space to forecast the popularity of news in social media [18]. Leskovec developed a framework for tracking short, distinctive phrases that travel relatively intact through on-line text [19].

In the last decade, modeling and predicting the popularity dynamics of online topics has become an active area. Zhao proposed a short-term prediction model of topic popularity on micro-blogs [41]. Ardon studied more than 5.96 million topics that include both popular and less popular topics and performed a rigorous temporal and spatial analysis, investigating the time-evolving properties of the sub-graphs formed by the users discussing each topic [42]. Yan proposed the STH-Bass model, a Spatial and Temporal Heterogeneous Bass model derived from the field of economics, to predict the popularity of a single tweet [43]. Yamasaki proposed a TF-IDF-like algorithm to analyze which tags are potentially more important for earning popularity, and extended the idea to show how the important tags vary geo-spatially and how the importance ranking of the tags evolves over time [44].

Research summary and comparison

Here, we systematically compare previous research with our own in terms of models and algorithms, data sources and types, research methods and tools, and main findings. The comparison shows that the data sources and types are diverse: the data types include online videos, blogs, micro-blogs, articles, news, hashtags, and more. The models and algorithms used also differ widely, such as the rank-shift model [1], the SEISMIC algorithm [16], the K-SC clustering algorithm [30], and the popularity growth model [2]. Our research and previous research are compared in Table 1.

Table 1 Comparisons between previous research and our research

Data collection and description

The dataset of this paper was collected from the hot topic column of Sina Weibo (https://weibo.com/). On social media sites such as Weibo and Twitter, a topic is marked by a hashtag, i.e., a word or phrase preceded by a hash or pound sign (#) that is used to identify micro-blogs on a specific topic.

The data include 1259 hot topics between October 4, 2013 and November 4, 2013, as well as 138,609 micro-blogs related to these topics. Two types of data, topic dimension data and user dimension data, are recorded (see Table 2). The topic dimension includes “topic name”, “micro-blog content”, “release time”, “forwarding number”, “like number”, etc., and the user dimension includes “user name”, “user authentication”, “number of fans”, “location”, etc. The “micro-blog content” may be text, video or picture. Among the 1259 topics, some have missing data and some include only a few micro-blogs. After data preprocessing, we selected the topics with more than 200 related micro-blogs for research, a total of 1167 topics.

Table 2 Data structure and fields

Methods/experimental

Methods of data selection and processing

Firstly, we construct a discrete time series \( n_{i} (t) = \left( {n_{i} \left( {t_{1} } \right), \ldots ,n_{i} \left( {t_{j} } \right), \ldots ,n_{i} \left( {t_{L} } \right)} \right) \), where \( L \) is the length of the time series, by counting the number of micro-blogs that contain topic \( i \) in each time interval \( t_{j} \); time is measured in some unit, e.g., hours or days. We define \( n_{i} \left( {t_{j} } \right) \) as the popularity of topic \( i \) at time \( t_{j} \), so the shape of the time series \( n_{i} \left( t \right) \) shows how the popularity of topic \( i \) changes over time.
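As an illustration, the counting step can be sketched as follows. This is a minimal sketch that assumes an hourly time unit and that each micro-blog of a topic is available as a Python `datetime`; the function name `hourly_popularity` and the sample timestamps are hypothetical and not part of the dataset.

```python
from collections import Counter
from datetime import datetime


def hourly_popularity(timestamps, start, length):
    """Return [n(t_1), ..., n(t_L)]: micro-blogs per hour since `start`."""
    counts = Counter()
    for ts in timestamps:
        # Index of the hourly interval that contains this micro-blog.
        idx = int((ts - start).total_seconds() // 3600)
        if 0 <= idx < length:
            counts[idx] += 1
    return [counts.get(j, 0) for j in range(length)]


# Hypothetical release times of three micro-blogs in one topic.
start = datetime(2013, 10, 4)
posts = [
    datetime(2013, 10, 4, 0, 15),
    datetime(2013, 10, 4, 0, 50),
    datetime(2013, 10, 4, 2, 5),
]
series = hourly_popularity(posts, start, length=4)
print(series)  # [2, 0, 1, 0]
```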

In principle, the time series of each topic contains \( L = 720 \) elements (i.e., the number of hours in 1 month). However, the volume of a topic tends to be concentrated around a peak [30], and in many time intervals the popularity is zero, so the time series is sparse. Using such a long time series would make the computation more difficult. Therefore, we truncate the time series to 72 h to focus on the part around the peak, and shift it so that it peaks at 1/3 of its length (i.e., the 24th index). In our data samples, the ratio of the volume around the peak (72 h) to the total volume is more than 80% (see Table 3).
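The truncation and alignment can be sketched as below. This is a minimal sketch under stated assumptions: the peak is placed at zero-based position 24 of a 72-element window, and positions that fall outside the original series are filled with zeros. The function names and the toy series are hypothetical.

```python
import numpy as np


def truncate_around_peak(series, window=72, peak_pos=24):
    """Cut a `window`-long slice so that the global peak sits at `peak_pos`."""
    x = np.asarray(series, dtype=float)
    peak = int(np.argmax(x))
    out = np.zeros(window)
    lo = peak - peak_pos  # index in x that maps to out[0] (may be negative)
    src_lo, src_hi = max(lo, 0), min(lo + window, len(x))
    out[src_lo - lo:src_hi - lo] = x[src_lo:src_hi]  # zero padding elsewhere
    return out


def peak_window_ratio(series, window=72, peak_pos=24):
    """Share of the total volume that falls inside the truncated window."""
    total = float(np.sum(series))
    if total == 0:
        return 0.0
    return float(truncate_around_peak(series, window, peak_pos).sum()) / total


# Toy series: a burst around hour 100 plus one stray micro-blog at hour 500.
toy = np.zeros(720)
toy[100], toy[101], toy[500] = 10, 5, 1
aligned = truncate_around_peak(toy)
print(int(np.argmax(aligned)), peak_window_ratio(toy))
```

The ratio returned by `peak_window_ratio` corresponds to the "volume around the peak to total volume" statistic mentioned above.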

Table 3 Statistics of the clusters from Fig. 3

Figure 1 shows the popularity of three topics over time. Figure 1a shows the original popularity time series, which contain more than 700 elements. Figure 1b shows the time series after truncating them to 72 h and aligning them so that they all peak at the same time.

Fig. 1 Popularity of topics over time. This figure shows the temporal patterns of popularity dynamics of three topics. a The original popularity; b the popularity after processing

From Fig. 1a we find that the micro-blogs of a topic are concentrated in a few days around a peak, so using the full time series is not a good idea. For example, suppose we measure the similarity between two topics that are discussed intensively for several days and abandoned for the rest of the time. We are mainly interested in their differences during their active days. However, the differences in inactive periods may not be zero due to noise, and these small differences can dominate the overall similarity because they accumulate over a long period. Therefore, we truncate the time series to focus on the “interesting” part (Fig. 1b).

We calculated the popularity time series of all 1167 topics, truncated each to 72 h, and aligned them so that they all peak at the same time. Next, we group topics so that topics in the same group have a similar shape of the time series \( n_{i} \left( t \right) \). In this way, we can see which topics have a similar temporal pattern of popularity, and we can take the center of each cluster as the representative common pattern of the group.

Secondly, we need to define the spatial patterns of popularity (SPP). We construct a one-dimensional vector \( s_{i} \left( l \right) = \left( {s_{i} \left( {l_{1} } \right), \ldots ,s_{i} \left( {l_{j} } \right), \ldots ,s_{i} \left( {l_{M} } \right)} \right) \), where \( M \) is the total number of locations, by counting the number of micro-blogs in topic \( i \) that contain location \( l_{j} \) (\( 1 \le j \le M \)); locations are measured in some unit, e.g., cities or provinces. We define \( s_{i} \left( {l_{j} } \right) \) as the spatial popularity of topic \( i \) at location \( j \), so the vector \( s_{i} \left( l \right) \) records the spatial popularity distribution of topic \( i \). Figure 2 shows the spatial popularity distribution of a topic, which can be approximately described as a power-law distribution \( p\left( l \right)\sim l^{ - \beta } \) with the exponent \( \beta = 0.963 \).
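A minimal sketch of how such a spatial popularity vector could be built is given below. It assumes that each micro-blog carries one location label and that the list of \( M \) locations is fixed in advance; the function name and the location labels are hypothetical examples.

```python
from collections import Counter


def spatial_popularity(post_locations, locations):
    """Return [s(l_1), ..., s(l_M)]: micro-blogs of one topic per location."""
    counts = Counter(post_locations)
    return [counts.get(loc, 0) for loc in locations]


# Hypothetical location labels of four micro-blogs in one topic.
locations = ["Beijing", "Shanghai", "Guangdong"]
post_locations = ["Beijing", "Guangdong", "Beijing", "Beijing"]
s = spatial_popularity(post_locations, locations)
print(s)  # [3, 0, 1]
```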

Fig. 2 A map of topic spatial popularity. This figure shows the distribution of popularity in space

Thirdly, we formulate the probability that topic \( i \) belongs to a specific location \( j \) as Eq. (1):

$$ p\left( {location_{j} |topic_{i} } \right) = \frac{{{\text{the number of messages which contain location}}_{j} {\text{ in topic}}_{i} }}{{{\text{the total number of messages in topic}}_{i} }} $$
(1)

Thus, we can also construct a location probability vector \( P\left( {topic_{i} } \right) = \left[ {p\left( {location_{j} |topic_{i} } \right)} \right],\quad 1 \le j \le M \). In addition, the main location of \( topic_{i} \) is the location with the maximum probability, as shown in Eq. (2):

$$ mainLocation\left( {topic_{i} } \right) = \mathop {\arg \hbox{max} }\limits_{{location_{j} }} \left\{ {p\left( {location_{j} |topic_{i} } \right)} \right\} $$
(2)

In Eq. (2), a main location of the topic is calculated. Then, we can use Eq. (3) to determine whether the topic is a local topic or global topic.

$$ Location_{i} = \left\{ {\begin{array}{ll} {mainLocation\left( {topic_{i} } \right),} & {\text{if }p\left( {mainLocation\left( {topic_{i} } \right)|topic_{i} } \right) > \theta } \\ {\text{``globalTopic''},} & {\text{otherwise}} \\ \end{array} } \right. $$
(3)

In Eq. (3), if the probability of a topic’s main location exceeds the threshold \( \theta \), the topic would be regarded as a local topic.
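Equations (1)–(3) can be illustrated with the following minimal sketch. It assumes that every message of the topic carries exactly one location label, so that the denominator of Eq. (1) equals the sum of the spatial popularity vector; the function names, location labels and threshold values are hypothetical.

```python
def location_probabilities(s):
    """Eq. (1): p(location_j | topic_i) from the counts s_i(l_j)."""
    total = sum(s)
    if total == 0:
        return [0.0] * len(s)
    return [count / total for count in s]


def classify_topic(s, locations, theta):
    """Eqs. (2)-(3): main location if its probability exceeds theta."""
    p = location_probabilities(s)
    j = max(range(len(p)), key=p.__getitem__)  # arg max over locations
    return locations[j] if p[j] > theta else "globalTopic"


# Hypothetical counts for one topic over three locations.
locations = ["Beijing", "Shanghai", "Guangdong"]
counts = [3, 0, 1]
print(location_probabilities(counts))          # [0.75, 0.0, 0.25]
print(classify_topic(counts, locations, 0.5))  # Beijing
print(classify_topic(counts, locations, 0.8))  # globalTopic
```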

K-spectral centroid (K-SC) clustering algorithm

In this paper, we use the K-spectral centroid (K-SC) clustering algorithm proposed by Yang to process the time series of topic popularity [30]. The specific algorithm is as follows:

figure a
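For orientation, the distance measure at the core of K-SC can be sketched as follows. This is a minimal sketch, assuming the scale- and shift-invariant distance described in [30], \( d\left( {x,y} \right) = \min_{\alpha ,q} \left\| {x - \alpha y_{(q)} } \right\|/\left\| x \right\| \), where \( y_{(q)} \) is \( y \) shifted by \( q \) time units. The zero-padded shift, the search range `max_shift` and the function names are assumptions made for illustration; the sketch covers only the distance, not the full clustering loop.

```python
import numpy as np


def shift(y, q):
    """Shift y by q steps (positive = to the right), padding with zeros."""
    out = np.zeros_like(y)
    if q > 0:
        out[q:] = y[:-q]
    elif q < 0:
        out[:q] = y[-q:]
    else:
        out[:] = y
    return out


def ksc_distance(x, y, max_shift=5):
    """Scale- and shift-invariant distance between two time series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_x = np.linalg.norm(x)
    if norm_x == 0:
        return 0.0
    best = np.inf
    for q in range(-max_shift, max_shift + 1):
        yq = shift(y, q)
        denom = yq @ yq
        # Optimal scaling for a fixed shift: alpha = x.y_q / ||y_q||^2.
        alpha = (x @ yq) / denom if denom > 0 else 0.0
        best = min(best, np.linalg.norm(x - alpha * yq) / norm_x)
    return float(best)


# y is x delayed by one step and scaled by 3, so the distance is ~0.
x = [0, 1, 4, 1, 0, 0]
y = [0, 0, 3, 12, 3, 0]
print(ksc_distance(x, y))
```

Because the distance ignores overall volume and small time offsets, two topics with the same shape but different sizes or peak times are treated as similar.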

Results and discussion

Temporal patterns of popularity dynamics (TPPD)

We set the number of clusters to \( K = 6 \). Figure 3 shows the resulting clusters, and Tables 3 and 4 give further descriptive statistics for each of the six clusters.

Fig. 3 Clusters identified by the K-SC clustering algorithm, \( K = 6 \). This figure shows temporal patterns of popularity dynamics of micro-blogging topics. a Topics about life, b topics about fashion entertainment, c topics about leisure mood, d–f topics of social hot events

Table 4 Interpretation of statistics

In Fig. 3, cluster 1 contains topics about life and health. Cluster 2 contains topics about fashion and entertainment; such topics can attract a lot of attention in a very short time and lose it just as quickly. Cluster 3 contains topics about leisure and mood, such as #Missing is better than meeting#, #What I love is that you love me#, #Those people in those years#, and #Happy Time on Campus#. Clusters 4, 5, and 6 all contain topics about social hot events, including natural disasters, public health, official corruption, and social justice, such as #Voluntary extension of old-age contributions# and #Is it possible to cancel the golden week?#.

Figure 3 exhibits high variability in the cluster shapes and very spiky temporal behavior, where the peak lasts for less than 4 h. We found that hot topics at the top of the list lost attention after 2 days and were replaced by other topics. Cluster 1 and cluster 3, which each account for 18.6% of all topics, show a quick rise followed by a monotone decay. The biggest cluster, cluster 2, accounts for 28.1% of all topics and is characterized by a very quick rise just 1 h before the peak and a quicker decay than clusters 1 and 3. Finally, topics in clusters 4, 5, and 6 stay popular for more than 3 days, and experience a small peak on the first day and a larger one on the second day.

Figure 4 shows the distribution of popularity decay after the peak. We find that the popularity decay can be approximately described by a power-law distribution. We extract the exponents using a least-squares fit on the logarithm of the data.
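A least-squares fit of this kind can be sketched as follows. This is a minimal sketch, assuming that the popularity after the peak decays as \( n\left( t \right)\sim \left( {t - t_{peak} } \right)^{ - \gamma } \), so that \( \gamma \) is the negative slope of a straight line fitted to the log-log data. The symbol \( \gamma \), the function name and the synthetic series are assumptions made for illustration.

```python
import numpy as np


def decay_exponent(series, peak_pos=24):
    """Fit log n(t) = c - gamma * log(t - t_peak) after the peak; return gamma."""
    x = np.asarray(series, dtype=float)
    tail = x[peak_pos + 1:]
    t = np.arange(1, len(tail) + 1)  # hours elapsed since the peak
    mask = tail > 0                  # the logarithm needs positive values
    slope, _intercept = np.polyfit(np.log(t[mask]), np.log(tail[mask]), 1)
    return -slope


# Synthetic 72 h series whose tail decays exactly as t^(-1.5).
synthetic = np.zeros(72)
synthetic[24] = 100.0
synthetic[25:] = 100.0 * np.arange(1, 48) ** -1.5
print(decay_exponent(synthetic))
```

Intervals with zero popularity are dropped before taking logarithms, which is one possible convention among several.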

Fig. 4 Popularity decay exponents obtained using a least-squares fit. This figure shows the distribution of popularity decay after the peak

In Fig. 5, we describe the relationship between two statistical values, namely, the peak fraction of popularity and the exponent of popularity decay. The results show that the exponent of popularity decay is positively correlated with the peak fraction.

Fig. 5 The relationship between the two statistical values. This figure describes the relationship between the peak fraction of popularity and the exponent of popularity decay

Spatial patterns of popularity dynamics (SPPD)

We calculate the location probability vector \( P\left( {topic_{i} } \right) = \left[ {p\left( {location_{j} |topic_{i} } \right)} \right] \) of all 1167 topics, as well as the exponent of topic spatial popularity distribution \( \beta \). We find that the maximum of probability \( \hbox{max} \left\{ {p\left( {location|topic} \right)} \right\} \) for each topic is approximately positively correlated with the exponent of topic spatial popularity distribution \( \beta \) (in Fig. 6).

Fig. 6 The relationship between the two statistical values. This figure shows the relationship between the exponent of topic spatial popularity distribution \( \beta \) and the maximum of probability for each topic \( \hbox{max} \left\{ {p\left( {location|topic} \right)} \right\} \)

Before we determine the threshold \( \theta \), we need to know the distribution of the maximum probability \( \hbox{max} \left\{ {p\left( {location|topic_{i} } \right)} \right\},\;1 \le i \le 1167 \). Figure 7a shows the distribution of the maximum probability for each topic. As can be seen from Fig. 7a, the maximum probability is mainly concentrated in the intervals [0.05, 0.15) and [0.15, 0.25), which contain 32% and 25% of the topics, respectively. Figure 7b shows the ratios of global topics and local topics as functions of the threshold \( \theta \).
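The dependence of the two ratios on the threshold can be sketched as below. This is a minimal sketch that applies the rule of Eq. (3) to a list of per-topic maximum probabilities; the function name, the probability values and the thresholds are hypothetical and do not come from the dataset.

```python
def local_global_ratio(max_probs, thetas):
    """For each theta, return (theta, share of local topics, share of global topics)."""
    n = len(max_probs)
    rows = []
    for theta in thetas:
        # A topic is local when its maximum location probability exceeds theta.
        local = sum(1 for p in max_probs if p > theta)
        rows.append((theta, local / n, (n - local) / n))
    return rows


# Hypothetical maximum probabilities of four topics.
max_probs = [0.1, 0.2, 0.6, 0.9]
for theta, local_share, global_share in local_global_ratio(max_probs, [0.15, 0.5]):
    print(theta, local_share, global_share)
```

As the threshold grows, fewer topics pass the test of Eq. (3), so the share of local topics can only stay the same or fall.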

Fig. 7 Maximum probability and threshold \( \theta \). a The distribution of the maximum probability for each topic; b the ratios of global topics and local topics as functions of the threshold \( \theta \)

In addition, we characterize the spatial popularity of each cluster and extract the exponents using a least-squares fit (Fig. 8). Tables 5 and 6 give further descriptive statistics for each of the six clusters. In Fig. 8, we find that the spatial popularity of topics in each cluster follows a power-law distribution \( p\left( l \right)\sim l^{ - \beta } \), where \( l \) represents the spatial location. The exponent \( \beta \) represents the heterogeneity of the spatial popularity distribution.

Fig. 8 Spatial distribution of topic popularity for each cluster. This figure indicates spatial patterns of popularity dynamics

Table 5 Statistics of the clusters from Fig. 8
Table 6 Interpretation of statistics

Figure 9 shows the relationship between the average exponent of the spatial distribution and the average probability of location. From Fig. 9, we find that the two statistics are positively correlated.

Fig. 9 The relationship between the two statistical values. This figure shows the relationship between the average exponent of the spatial distribution and the average probability of location

Conclusion

With the development of data collection and tracking technology, social media research has sprung up. As a kind of social media, micro-blogs are widely used for sensing the real world. The popularity of micro-blogging topics is an important measure for evaluating the influence of an event. We studied the popularity of topics along the spatial and temporal dimensions, which helps us better understand the dissemination of public opinion. In our research, we addressed two problems: (1) the classification of hot topics in the micro-blogging system, by clustering their popularity time series with a clustering algorithm; and (2) the spatial distribution characteristics of hot topics. The results show that the temporal popularity of hot topics decays rapidly, and that the decay follows a power-law form. In addition, the higher the peak fraction of popularity, the faster the popularity disappears. On the other hand, the spatial distribution of topics is very broad. The maximum location probability is mainly concentrated in the intervals [0.05, 0.15) and [0.15, 0.25), which contain 32% and 25% of the topics, respectively. This shows that most of the hot topics are global topics. Taken together, these results characterize the temporal and spatial popularity dynamics of online topics, and they can serve as a reference for studying the influence of online topics and the evolution of public opinion.