Abstract

With the growth of online commerce, companies have created virtual communities (VCs) where users can create posts and reply to posts about the company’s products. VCs can be represented as networks, with users as nodes and relationships between users as edges. Information propagates through edges. In VC studies, it is important to know how the number of topics concerning the product grows over time and what network features make a user more influential than others in the information-spreading process. The existing literature has not provided a quantitative method with which to determine key points during the topic emergence process. Also, few researchers have considered the link between multilayer physical features and the nodes’ spreading influence. In this paper, we present two new ideas to enrich network theory as applied to VCs: a novel application of an adjusted coefficient of determination to topic growth and an adjustment to the Jaccard coefficient to measure the connection between two users. A two-layer network model was first used to study the spread of topics through a VC. A random forest method was then applied to rank various factors that might determine an individual user’s importance in topic spreading through a VC. Our research provides insightful ways for enterprises to mine information from VCs.

1. Introduction

Virtual communities (VCs) provide an interactive experience that, if positive, may instil customer loyalty [1]. They enable consumers to learn the functions of products and follow up conveniently with buying online, as well as provide a channel for receiving customer feedback, which plays an important role in product innovation [2, 3]. Mining information provided by consumers in VCs enables companies to adjust the next generation of their products to improve customer satisfaction [4].

Complex network theory has been a major tool in the study of the physical structure and dynamic processes of social, biological, and technological networks [5]. In analyses of information spreading, users are represented as nodes [6]; these nodes may reside in multiple possible states, depending on whether they have learned information and whether they can transmit it to a neighbour [7–9]. Among real-world VCs, social networks such as Weibo, WeChat, Twitter, and Facebook have different physical structures, leading to different patterns of topic transmission.

Consumer VCs are different from traditional social networks in that they centre around products. They provide a real-time look into customers’ experiences with a product from the date of release. Users in VCs may post their feedback for other users to view, which, in the best-case scenario, may encourage loyalty among existing consumers while encouraging new consumers to buy in. Thus, understanding consumer VCs can provide insight into public opinion trends, helping the company maintain existing markets or develop new markets [10].

In consumer VCs, information is transmitted via posts, which are about particular topics [11]. Networks change over time; thus, time dimensions, i.e., temporal networks, have been incorporated into network analyses [12]. For consumer VCs, the identification of key time points during topic emergence can provide an angle for addressing the consulting service appropriately. Additionally, there are multiple ways for users in VCs to interact with one another, of which “replying” and “searching for interest” are common ways to get information. Mining the influential users and specifying the important network features are also critical for improving the efficiency of multilayer network information transformation [13].

Thus, this paper tries to answer the following questions: (i) Following the introduction of a new product, how does the number of topics concerning the product grow over time? (ii) How do topics spread through a VC? What features characterise the influence of individual users during the information-spreading process?

Here, we focused specifically on the Huawei P10/P10 Plus for our case study. Two new concepts are introduced. First, the coefficient of determination was applied to the growth of topics after a new product was introduced to yield the node sequential emergence coefficient of determination (NSECD), which was used to identify the moment when most of the growth had finished. Second, the Jaccard coefficient was adjusted to gain a new measure of similarity between two users in a VC. Subsequently, a two-layered network model, representing two ways of spreading information, was introduced to study the spread of a topic through a VC and identify the most influential users. The random forest method was then applied to rank the importance of various factors affecting a user’s impact on spreading topics through a VC. Finally, some suggestions are provided for future research.

The remainder of this paper is organised as follows. Section 2 provides a literature review on existing methods. Section 3 describes the Huawei P10/P10 Plus dataset and its preprocessing. Section 4 introduces the NSECD statistic for studying the growth of new topics after the introduction of a new product. Section 5 proposes the adjusted Jaccard coefficient and the two-layered network model. Later in the section, simulations carried out to identify key users in the network are described, with the random forest method used to find features important to users’ information-spreading performance. Section 6 recaps suggestions for enterprise management of VCs and proposes directions for further research.

2. Literature Review

Online social networks exert a major influence on life today [14]. Sange et al. found that online social networks provide a platform for spreading both objective facts and fake news [15]. Park et al. noted the rapid growth of mobile devices such as smartphones and examined rapid information propagation in mobile social networks [16]. Up to now, the spreading of information has been investigated intensively in interdisciplinary fields [17].

Research into information propagation in VCs has two main foci: influence factor analysis and propagation path analysis. Influence factor analysis attempts to identify which factors make a node influential in a network; such factors may include gender, age, beliefs, etc. On the other hand, propagation path analysis studies the way in which information is transmitted through the network by, for instance, assigning weights to edges and setting transmission probabilities based on these weights.

We first provide examples of influence factor analysis. Li et al. developed a multinomial naive Bayes classifier, categorised microblog posts based on content, and found that the information type has a significant influence on propagation patterns in terms of scale and topological features [18]. Zeng and Zhu proposed an emotional model for information propagation based on the emotional states of network users [19]. Hsu developed an integrated conceptual model and explored the effects of brand-evangelism-related behavioural decisions of enterprises on VC members [20]. Wan et al. used a least squares support vector machine to study consumer electronics supply chains [21]. However, all of the above studies were mainly analyses of the influencing factors but did not consider the differences among VCs.

With regard to propagation form, Huo and Cheng established a modified ignorant-wiseman-spreader-stifler model to analyse the spread of rumours through a network [22]. Xu et al. proposed a new iterative algorithm called SpectralRank, which assumes that a node’s propagation capability is proportional to the number of neighbouring nodes after adding a ground node to the network [23]. Shao et al. introduced the NL centrality algorithm to identify influential nodes in a network; the algorithm considers both the semilocal structure of a node and its topological position [24]. Wang et al. proposed a method based on an integral k-shell to identify the influential nodes in a command-and-control network [25]. Escalante and Odehnal proposed a deterministic SIRS-type model for rumour propagation and applied it in simulations with two types of rumours: an original rumour, followed by a second counteracting rumour based on a complex network [26]. Li et al. presented the Potential Concentration Label method to help locate multiple sources of contagion under a susceptible-infected-recovered model [27]. Zhang et al. introduced a susceptible-infected-true-removed model of rumour spreading to account for members in a network who know or can discern the truth [28]. Xiong et al. introduced the location concept to a local social network model [29] and further extended it to the recommendation system via information spreading in a local-based social network [30, 31]. In their recent research, Xiong et al. combined the location and temporal effects of a social network, proposing constructive advice on dynamic management [32]. Zhang et al. studied networks that can be subdivided into smaller groups called communities and proposed a node-ranking algorithm called AI Rank using two factors: attractive power (which measures the number of followers a node has compared to its neighbours) and initiating power (which accounts for the communities that a node’s neighbours belong to) [33].

Although these studies considered node importance ranking together with the topology of the network, they generally treated networks as single-layered. This does not account for the fact that there may be more than one way for information to propagate through a network. Hence, a multilayer network model may be more appropriate. We propose a two-layered model in this paper.

3. Data Preprocessing and Description

The Pollen Club is the official VC for Huawei’s products, including smartphones, laptops, and other electronic devices. Each user is assigned a unique identifier with which they can express their opinions about products freely, look through other users’ posts, and reply to posts [34].

For our initial dataset, we selected 2000 web pages about the Huawei P10/P10 Plus, containing 2,392,035 posts about the product. After removing duplicate and spam posts, we retained 57,560 original posts and 826,328 reply posts by 129,362 users as our dataset. The data were acquired directly from club.huawei.com to avoid interview effects [35].

Next, core topics were extracted from the posts. In a previous study [36], 100 topics about phones were selected initially (see Table 1). After sorting out those with higher frequency, the remaining topics were grouped into three categories: system, software, and hardware, according to their features, as shown in Table 2.

The remainder of this paper focuses on 61 of these topics. These topics did not all appear on the first day of our dataset, but emerged over the course of the study as users bought and used the product and formulated questions about the product. In the next section, we will discuss the emergence of topics.

4. Dynamic Analysis of Topic Emergence

According to the classic product lifecycle, a new product goes through three stages: emergence, growth, and maturation. Emergence refers to the period before the product’s launch [37]. Growth refers to a period of high consumer activity after the product launch, when consumers who had been eagerly awaiting the launch are ready to buy the product. Maturity refers to a period of low consumer activity afterwards, when consumers may continue to buy the product but do so at a slower pace because enthusiastic consumers have already done so.

For VCs, the transition point from growth to maturation is of interest to the company, because it signals the point after which fewer resources should be needed to monitor and respond to VC posts concerning the product. The purpose of this section is to introduce a statistic that will identify such a transition point.

For this purpose, we used the growth in the cumulative number of topics from the 61 topics up to a given date to measure the VC’s interest in the product. We could, for instance, select the date at which the cumulative number of topics reached 90% or 95% of 61. However, any such choice would involve some uncertainty regarding which threshold to use. Instead, we propose a transition point that avoids such an arbitrary threshold choice.

Based on the adjusted coefficient of determination in statistics [38], we define the NSECD as follows:

$$R_t = 1 - \left(1 - \frac{m_t}{n}\right)\frac{T-1}{T-t}, \quad t = 1, \ldots, T-1,$$

where $R_t$ is the NSECD value on the $t$th day, $m_t$ is the cumulative number of new topics up to the $t$th day, $n$ is the total number of topics, and $T$ is the total number of days.

As $t$ increases, $m_t$ increases while $T-t$ decreases. We expect that in a typical dataset, strong growth in new topics early in the study period will lead $R_t$ to increase initially, while faster growth in the factor $(T-1)/(T-t)$ together with a saturation of new topics will lead $R_t$ to decrease later on. Our key moment will be when this function reaches its maximum; i.e.,

$$t^{*} = \operatorname*{arg\,max}_{1 \le t \le T-1} R_t.$$

Figure 1 shows the NSECD on each day in our dataset. The maximum occurred on Day 33; accordingly, the key moment was taken as $t^{*} = 33$.
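To make the computation concrete, the following sketch evaluates the NSECD curve and locates the key moment. It assumes the adjusted-$R^2$-style form $R_t = 1 - (1 - m_t/n)(T-1)/(T-t)$, which is our reading of the definition, and the topic counts below are illustrative rather than the Huawei data.

```python
import numpy as np

def nsecd_key_moment(m, n):
    """Compute the NSECD curve R_t = 1 - (1 - m_t/n) * (T-1)/(T-t)
    for t = 1..T-1 and return (R, t_star), where t_star maximises R_t.
    m : cumulative number of distinct topics observed by each day (length T)
    n : total number of topics under study
    """
    m = np.asarray(m, dtype=float)
    T = len(m)
    t = np.arange(1, T)                       # R_T is undefined (T - t = 0)
    R = 1.0 - (1.0 - m[:-1] / n) * (T - 1) / (T - t)
    t_star = int(np.argmax(R)) + 1            # convert 0-based index to day number
    return R, t_star

# Illustrative data: rapid early topic growth followed by saturation.
m = [5, 12, 25, 40, 50, 55, 58, 60, 60, 61]
R, t_star = nsecd_key_moment(m, n=61)
```

Under this form, $R_t$ rises while new topics appear quickly and falls once the cumulative count saturates, so the maximiser $t^{*}$ marks the growth-to-maturity transition without an arbitrary percentage threshold.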

To summarise the lifecycle of topic growth in our case study, the period from the start of the study to the product launch date (Day 9) was considered the “emergence” stage. The period from the product launch date to our key moment (Day 33) was the “growth” stage. The period from our key moment to the end of the study was considered the “maturity” stage.

In Figure 2, 12 new topics were added during the “emergence” stage, all on Day 9, the final day of the stage. This reflected early interest in the Huawei P10/P10 Plus.

The “growth” stage continued the trend of new topics from the last day of the emergence stage. New topics appeared rapidly early in this stage, as users shared their opinions towards the product from different aspects, but the rate at which new topics appeared slowed towards the end. By the end of this stage, 58 topics had emerged, 95.08% of the total number of topics considered.

The “maturity” stage witnessed even slower growth in new topics compared to the earlier stages, as most topics had already appeared earlier.

5. Network Modelling

In this section, we first introduce a two-layer network model and then perform information-propagation simulations to determine which users are the most effective at spreading information in a VC.

5.1. Structure of the Network

Here, we establish a two-layered network model for VCs. Each user is represented by a node that occurs on both layers. The two layers correspond to two ways in which a user in a VC may interact with another user: by replying to their posts or by searching for their posts.

The first layer is the “flow of information by replies (FIR)” network (denoted as $G_1$). Given two users $u_i$ and $u_j$, let $w_{ij}^{(1)}$ be the number of times $u_i$ replied to a post by $u_j$ within the dataset. If $w_{ij}^{(1)} > 0$, then an arrow from $u_j$ to $u_i$ is drawn, representing the direction of information flow. The FIR network consists of $V$, the set of all users; $W_1$, the set of all weights; and $E_1$, the set of all arrows.

The second layer is called the “flow of information by interest (FII)” network (denoted as $G_2$). It is inspired by the idea that two users are likely to search for each other’s posts only when they share the same interests. Given two users $u_i$ and $u_j$, we can construct a measure $s_{ij}$ representing the commonality of their interests and draw an edge between them when $s_{ij}$ is above some preset threshold $\theta$. The FII network consists of $V$, the set of all users; $W_2$, the set of all weights; and $E_2$, the set of all edges. It remains to define each weight $s_{ij}$.

Let $n$ be the total number of topics and let the topics have a fixed order. Then, let $\mathbf{x}_i$ be the vector comprising the numbers of user $u_i$’s main posts concerning each of the $n$ topics. Because some users may focus on only a portion of the topics, most entries in $\mathbf{x}_i$ are zero. As such, a Jaccard coefficient is taken into consideration. Let $T_i = \{k : x_{ik} > 0\}$, i.e., the set of all topics for which $u_i$ has at least one main post. By definition, the Jaccard coefficient of $u_i$ and $u_j$ can be calculated as follows:

$$J_{ij} = \frac{|T_i \cap T_j|}{|T_i \cup T_j|},$$

where $|T_i \cap T_j|$ and $|T_i \cup T_j|$ denote the number of elements in the intersection and union, respectively, of the topics mentioned by users $u_i$ and $u_j$.

The disadvantage of the Jaccard coefficient is that it does not distinguish between a pair of casual users who post about only one topic, which happens to be the same, and a pair of enthusiastic users who post about many shared topics. For example, consider the following two situations:

Situation 1: users $u_i$ and $u_j$ post only on “system” and “updates.”
Situation 2: users $u_i$ and $u_j$ post only on “system.”

Based on equation (3), the Jaccard coefficient assigns a weight of 1 to both situations. However, this may not be appropriate, because the users in Situation 1 may be more active in the VC and, hence, more likely to exchange information by searching. Thus, we propose the following adjusted Jaccard coefficient:

$$s_{ij} = \frac{|T_i \cap T_j|}{n}.$$

In our case study, $n = 61$. Based on equation (4), $s_{ij} = 2/61 \approx 0.033$ in Situation 1 and $s_{ij} = 1/61 \approx 0.016$ in Situation 2. Like the Jaccard coefficient, the adjusted Jaccard coefficient is always between 0 and 1 and is 0 when the two users share no common topics. However, unlike the classical Jaccard coefficient, it is 1 only when both users post on all $n$ topics.
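The contrast between the two coefficients can be checked directly. This is a minimal sketch, assuming the adjusted coefficient takes the form $|T_i \cap T_j|/n$ (consistent with the worked values for $n = 61$); the topic names are illustrative.

```python
def jaccard(Ti, Tj):
    """Classical Jaccard coefficient of two users' topic sets."""
    union = Ti | Tj
    return len(Ti & Tj) / len(union) if union else 0.0

def adjusted_jaccard(Ti, Tj, n):
    """Adjusted coefficient: shared topics relative to all n topics (assumed form)."""
    return len(Ti & Tj) / n

n = 61
sit1 = ({"system", "updates"}, {"system", "updates"})  # enthusiastic pair
sit2 = ({"system"}, {"system"})                        # casual pair

# Classical Jaccard cannot tell the two situations apart ...
assert jaccard(*sit1) == jaccard(*sit2) == 1.0
# ... but the adjusted coefficient ranks the enthusiastic pair higher.
assert adjusted_jaccard(*sit1, n) > adjusted_jaccard(*sit2, n)
```

The adjusted form trades the Jaccard coefficient's scale invariance for sensitivity to overall activity, which is what distinguishes enthusiastic users from casual ones here.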

In Figure 3, we plot the distribution of values of $s_{ij}$ over all pairs of users in our case study. Figures 3(a) and 3(b) show that the overwhelming majority of pairs of users were associated with a small value of $s_{ij}$. A small $s_{ij}$ means that there is little possibility of information spreading between the pair. In our analysis, an edge was drawn between users $u_i$ and $u_j$ only when $s_{ij}$ exceeded the preset threshold $\theta$. This made the FII network sparse. Notably, the difference in the vertical-axis ranges between Figures 3(a) and 3(b) is caused by the difference in the bar intervals.
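Building the sparse FII layer then amounts to keeping only the pairs whose similarity clears the threshold. The sketch below uses the same assumed similarity form $|T_i \cap T_j|/n$; the threshold and toy topic sets are illustrative, not the values used in the study.

```python
from itertools import combinations

def fii_edges(topic_sets, n, theta):
    """Return {(i, j): s_ij} for all user pairs with s_ij = |T_i & T_j| / n > theta."""
    edges = {}
    for i, j in combinations(sorted(topic_sets), 2):
        s = len(topic_sets[i] & topic_sets[j]) / n
        if s > theta:
            edges[(i, j)] = s
    return edges

# Illustrative users and topic sets.
topics = {
    "u1": {"battery", "camera", "system"},
    "u2": {"battery", "camera"},
    "u3": {"screen"},
}
edges = fii_edges(topics, n=61, theta=1 / 61)   # keep pairs sharing 2+ topics
```

Because most user pairs share no topics, the dictionary of surviving edges stays small, which is exactly the sparsity observed in Figure 3.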

Gephi software was used to draw illustrations of both layers for our case study. The results are shown in Figure 4.

To illustrate how our two-layer network model works, consider the following simple example with seven users, as shown in Figure 5.

Here, nodes a through g represent the seven users. Users may send or receive information through the FIR or FII networks simultaneously. For example, user a can receive information from user b or c in the FIR network, or from user b or d in the FII network; user a can send information to user b or d in the FII network. Information can spread through both channels simultaneously, and the larger the weight of an edge, the more likely information is to flow through it at any given step in either layer.

5.2. Information Propagation and User Importance

Next, the two-layered network model will be expanded with information transmission mechanisms to simulate the flow of information through the VC. Suppose we are interested in the propagation of a specific piece of information through the network. The information-propagation model is based on a simple two-state framework; that is, at any given time, a node is in one of two possible states:

(1) A susceptible state, representing users who have not yet received the information
(2) An infected state, representing users who have received the information

Time is treated as discrete. At each time point $t$, infected nodes have a certain probability of transmitting the information to their susceptible out-neighbours. The spread of the information is thus a random process. The mathematical model of this process is described below.

For convenience, the susceptible and infected states are denoted by $S$ and $I$, respectively. We denote the FIR layer and the FII layer as $\alpha = 1$ and $\alpha = 2$, respectively. The key notations are listed as follows:

(i) $s_i(t)$ is the state of user $i$ at time $t$
(ii) $X_i^S(t)$ and $X_i^I(t)$ are the indicator variables for user $i$ being in state $S$ or $I$, respectively, at time $t$
(iii) $N_1^{\mathrm{out}}(i)$ is the set of out-neighbours of user $i$ in the FIR network
(iv) $N_2(i)$ is the set of neighbours of user $i$ in the FII network
(v) $\mathcal{T}(i)$ is the group of target users that user $i$ may spread information to
(vi) $\mathbb{1}[\cdot]$ is the indicator function; it takes the value 1 when the assertion inside is true and 0 when it is false

The key mechanisms of this model are as follows:

(1) For each user $i$, each day $t$ (except the last step), and each layer $\alpha$, a threshold $u_i^{\alpha}(t)$ is drawn from the uniform distribution $U(0,1)$. This draw decides whether user $i$ becomes infected at time $t$ by its infected in-neighbours on layer $\alpha$; more probable infections are always prioritised over less probable ones. A low value of $u_i^{\alpha}(t)$ means that the node is easily infected, whereas a high value means that it is difficult to infect.

(2) For each user $i$, each layer $\alpha$, and each in-neighbour $j$ of $i$, a value $p_{ji}^{\alpha}$ measures the infectiousness of the channel from $j$ to $i$. Large values indicate that $j$ easily infects $i$, whereas small values indicate that $j$ has difficulty infecting $i$. The values $p_{ji}^{\alpha}$ depend only on the network structure described in the previous section.

Note that the out-neighbours and in-neighbours of a node differ only when the layer’s edges are directed, as in the FIR network here. If the layer’s edges are undirected, out-neighbours and in-neighbours coincide and are simply called neighbours.

At the starting time $t = 0$, a single node $i_0$ is infected while all others are susceptible; i.e., $s_{i_0}(0) = I$ while $s_j(0) = S$ for $j \neq i_0$. Inductively, assume we know $s_i(t)$ for all $i$. Then, the node states on the next day are given as

$$s_i(t+1) = \begin{cases} I, & \text{if } X_i^I(t+1) = 1, \\ S, & \text{otherwise,} \end{cases}$$

where $X_i^I(t+1)$ can be determined by

$$X_i^I(t+1) = \max\left( X_i^I(t),\; \max_{\alpha \in \{1,2\}} \mathbb{1}\left[ \max_{j \in N_{\alpha}^{\mathrm{in}}(i)} X_j^I(t)\, p_{ji}^{\alpha} > u_i^{\alpha}(t) \right] \right),$$

and where the values of $p_{ji}^{\alpha}$ are calculated from the layer weights as follows:

$$p_{ji}^{\alpha} = \frac{w_{ji}^{\alpha}}{\sum_{k \in N_{\alpha}^{\mathrm{in}}(i)} w_{ki}^{\alpha}}.$$

This means that infected nodes stay infected, whereas noninfected nodes become infected if, on some layer, one of their infected in-neighbours has a strong enough connection to overcome the given node’s threshold at that time and layer. The information-spreading process begins with a single infected node, while all other nodes are susceptible. Any node can be used as the starting node. The process ends either when no infected node has any noninfected out-neighbours or when $t$ reaches a specified time limit. In our simulations, we used a time limit of $t_{\max} = 10$. Note that because the spreading process depends on the random draws $u_i^{\alpha}(t)$, the process itself is random.

A standard way to measure node $i$’s importance in a network is by the extent of an infection that begins at $i$ [39]. To this end, we define the information-spreading rate of a node as $r_i = F_i / N$, where $F_i$ is the number of infected nodes at the end of a spreading process started at node $i$ and $N$ is the total number of nodes. Because the spread of information depends on the random variables $u_i^{\alpha}(t)$, $r_i$ is itself a random variable. Its expectation is too complicated to calculate explicitly; however, it can be estimated by repeated sampling as follows:

$$\hat{r}_i = \frac{1}{K} \sum_{k=1}^{K} r_i^{(k)},$$

where $K$ is the number of trials used, i.e., the number of simulations with $i$ as the starting node, and $r_i^{(k)}$ is the value obtained in the $k$th trial. Due to the large size of our dataset and the resulting high computation time, $K$ was set to 30. Users were then ranked according to their mean information-spreading rate across these 30 trials.

The simulation procedure, implemented in MATLAB (MathWorks, Natick, MA, USA), is as follows. Let $v_0$ be the node where the information starts. We track the spread of information through the network via the set of infected nodes, denoted as $I$, and the set of noninfected nodes, denoted as $S$:

Step 1: begin with the network nodes and weighted edges and the starting node $v_0$; initialise $I = \{v_0\}$ and $S = V \setminus \{v_0\}$.
Step 2: if $v_0$ has no out-neighbours, halt and output $r = |I|/N$. Otherwise, initialise $t = 1$.
Step 3: while $t \le t_{\max}$:
Step 3a: let $A$ be the set of users in $S$ that are targets of at least one infected user. If $A$ is empty, exit the while loop; otherwise, continue.
Step 3b: for each $i$ in $A$, (1) draw $u_i^{1}(t)$ and $u_i^{2}(t)$ from $U(0,1)$, and (2) calculate $X_i^I(t)$ as above. If $X_i^I(t) = 1$, insert $i$ into $I$ and remove it from $S$.
Step 3c: update $t = t + 1$.
Step 4: output $r = |I|/N$.
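The procedure above can be sketched in Python (the paper's own implementation is in MATLAB). This is a simplified sketch: the channel infectiousness values `p` are assumed to come from normalised layer weights, and each contact uses an independent uniform draw in place of the per-node, per-layer threshold; the small network at the bottom is illustrative.

```python
import random

def spread(layers, v0, n_nodes, t_max=10, rng=None):
    """One run of the two-layer spreading process.
    layers : list of dicts mapping node -> {out_neighbour: infection_probability}
    Returns the spreading rate |I| / N at the end of the run."""
    rng = rng or random.Random()
    infected = {v0}
    for _ in range(t_max):
        newly = set()
        for adj in layers:                      # FIR layer, then FII layer
            for u in infected:
                for v, p in adj.get(u, {}).items():
                    # infection succeeds when channel strength beats an
                    # independent uniform draw (simplified threshold mechanism)
                    if v not in infected and p > rng.random():
                        newly.add(v)
        if not newly:                           # no new infections: process halts
            break
        infected |= newly
    return len(infected) / n_nodes

def mean_rate(layers, v0, n_nodes, trials=30, seed=0):
    """Monte Carlo estimate of the expected spreading rate (K = 30 trials)."""
    rng = random.Random(seed)
    return sum(spread(layers, v0, n_nodes, rng=rng) for _ in range(trials)) / trials

# Toy two-layer network: directed FIR chain and symmetric FII edge, all p = 1.
fir = {"a": {"b": 1.0}, "b": {"c": 1.0}}
fii = {"a": {"d": 1.0}, "d": {"a": 1.0}}
rate = mean_rate([fir, fii], "a", n_nodes=5)   # the fifth node stays isolated
```

With all channel probabilities set to 1, the run is deterministic and the estimate equals the single-run rate; with weight-derived probabilities, the 30-trial average plays the role of $\hat{r}_i$ from the previous paragraph.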

An example of a two-layered network and an information-spreading history with node a as the starting node is provided in Figure 6.

Notice that node a cannot spread information directly to some nodes in the FIR layer but may reach them via the FII layer; from there, the information can spread onward through the FIR layer. For instance, although a cannot reach node e directly in the FIR layer, the information can first pass through the FII layer and then reach e through the FIR layer.

The algorithm was run on this simple example once with each node as the starting node. Table 3 shows the average spreading rate across 30 trials for the example network in Figure 5.

Four of the nodes had the highest mean spreading rates in their corresponding simulations, each able to spread information to all nodes except the isolated node. Of these four, one node performed best in terms of average spreading rate. Moreover, two of the four were more efficient by 1% than node a, although node a had more links than either of them. This suggests that these nodes occupy special positions in the network, which matches the graph.

The algorithm was then applied to the real Pollen Club dataset described in Section 3. The results for the top 20 of the 129,362 users, ranked by mean spreading rate across 30 trials in MATLAB, are summarised in Table 4. The standard deviations were modest for only 30 trials and could likely be reduced further by running more trials.

In Table 4, IUG and OUG stand for “intermediary user group” and “ordinary user group,” respectively [36]. The OUG refers to customers who bought Huawei products and registered for the Pollen Club. The IUG refers to customers who received official training from Huawei and are willing to answer questions from other customers.

Notably, an IUG member tops the list, indicating that the Pollen Club is largely organised by the customers themselves, saving the company the time and effort of doing so. The number next to each OUG member is the user level; there are 12 levels in total, with higher levels indicating greater user experience. Users advance their levels by joining activities in the Pollen Club. As can be seen, except for User 4948, the OUG members in our top 20 generally have high user levels. This shows that our ranking of users by spreading rate aligns well with Huawei’s own method of ranking users.

To assess our model’s effectiveness, comparisons were conducted using the probability 1 transformation model described in the Appendix. The results from our model were more consistent with trends observed in the real data.

5.3. Feature Selection of the Spreading Process

In this section, we investigate the relationships between the spreading rate and network features that can be computed directly, without the need for repeated simulations. Twenty-two network features [40] (denoted as $x_1, \ldots, x_{22}$) of the two-layer model are considered for each node: 10 are associated with the FII layer and 12 with the FIR layer. The 22 features are listed in Table 5.

Each feature is normalised to have a mean of 0 and a variance of 1. We then ask which of these features can predict the spreading rate. This provides insight into which features may cause a node to spread information more efficiently through the network. We ran Breiman’s random forest algorithm [41] using scikit-learn [42]. Random forest is an ensemble method for modelling regression in a nonlinear way.

Recall that the random forest algorithm randomly selects a subset of the users and a subset of the network features, then forms a tree by selecting, at each tree node, a network feature and a boundary value for that feature. This splits the users into two branches at each such node, continuing until, in every branch, the number of users is at most some threshold value. Given a new vector of feature values, each tree predicts a value for the spreading rate; the forest as a whole then predicts by averaging the individual trees’ predictions. The key parameters in this process are as follows:

(i) $n_{\mathrm{tree}}$: the number of trees used. Increasing this value should decrease variance without leading to overfitting [43]. As suggested in reference [44], the number of trees was set to 500.
(ii) $m_{\mathrm{try}}$: the number of features selected in each tree. We followed the suggestion in [44] to take $m_{\mathrm{try}} = p/3$, where $p$ is the total number of features.
(iii) $n_{\mathrm{sample}}$: the number of users to sample in each tree. We used all the users, i.e., $n_{\mathrm{sample}} = N$.
(iv) $l_{\max}$: the maximum leaf size, which controls when the construction of each tree halts. We used $l_{\max} = 1$; i.e., we continued splitting until there was only one user in each branch.

Recall that after running on a training set, the random forest algorithm outputs a regression function. Given a new feature vector $\mathbf{x}$, the regression function outputs a predicted spreading rate $\hat{r}$, which can then be compared with the spreading rates from the simulations. We performed five-fold cross-validation on the Pollen Club dataset together with the simulated spreading rates. Table 6 shows the resulting $R^2$ values when the regression functions were tested against the training set and the testing set.
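The fitting step can be reproduced with scikit-learn along the following lines. This sketch runs on synthetic data (the real inputs are the 22 normalised network features and the simulated spreading rates); the parameter settings mirror the values listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
N, p = 300, 22
X = rng.normal(size=(N, p))                    # stand-in for normalised features
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=N)  # stand-in rates

rf = RandomForestRegressor(
    n_estimators=500,        # number of trees
    max_features=p // 3,     # features tried per split (p/3 rule for regression)
    min_samples_leaf=1,      # grow trees until leaves hold a single user
    random_state=0,
)
r2_cv = cross_val_score(rf, X, y, cv=5, scoring="r2")  # five-fold CV R^2
rf.fit(X, y)
importances = rf.feature_importances_          # variance-reduction importances
```

Note that scikit-learn's `max_features` restricts the candidate features per split rather than per tree; the `feature_importances_` attribute accumulates the variance decreases described in the next paragraph.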

The random forest algorithm can also be used to rank the importance of the network features. Each tree ranks its selected features according to the decrease in variance at the corresponding tree node; that is, the variance in spreading rates at the parent node is compared with the sum of the variances at the two child nodes, and the greater the decrease, the more important the feature is according to that tree. These variance decreases are computed for every tree and, finally, summed across trees for each feature to give the feature’s importance.

To check stability, we divided the dataset into 10 pieces and checked whether each piece yielded a similar ranking of the network features. The results are shown in Table 7. We used Kendall’s W-test [45] to assess the degree of agreement among the rankings across the ten runs and obtained a W value of 0.7570, corresponding to $p < 0.001$ and indicating very high confidence in a genuine agreement among the rankings.
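Kendall's W can be computed directly from the rank matrix; a minimal sketch follows, with the 10 x 22 rank matrix of Table 7 replaced by a small illustrative one. The p value uses the standard chi-square approximation $\chi^2 = k(n-1)W$ with $n-1$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def kendalls_w(ranks):
    """Kendall's coefficient of concordance for a (k raters x n items)
    rank matrix with ranks 1..n in each row and no ties."""
    k, n = ranks.shape
    col_sums = ranks.sum(axis=0)
    S = ((col_sums - col_sums.mean()) ** 2).sum()
    return 12.0 * S / (k**2 * (n**3 - n))

def w_test_p(ranks):
    """Return (W, p) using the large-sample chi-square approximation."""
    k, n = ranks.shape
    W = kendalls_w(ranks)
    return W, chi2.sf(k * (n - 1) * W, df=n - 1)

# Three hypothetical runs ranking five features in near-perfect agreement.
ranks = np.array([[1, 2, 3, 4, 5],
                  [1, 2, 3, 4, 5],
                  [2, 1, 3, 4, 5]])
W, p = w_test_p(ranks)
```

A W of 1 means all runs produced identical rankings and 0 means no agreement; with the study's $k = 10$ runs and $n = 22$ features, $W = 0.7570$ yields a chi-square statistic near 159 on 21 degrees of freedom, hence the very small p value.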

Lastly, we obtained a single overall ranking based on the whole dataset. The results are shown in Figure 7. The top nine features all belong to the FIR layer; the most important are closeness centrality and harmonic closeness centrality in the FIR layer. In the FII layer, the most important feature is eigencentrality, indicating that users in central positions in the FII network can affect information spreading. The quality of a user’s neighbours in the FII network also plays an important role in information transmission.

6. Conclusions

In this paper, we proposed the NSECD to identify the key moment in topic growth in a VC after a new product is introduced. A two-layer model was developed for assessing information propagation in a VC, where information can flow among users either by replies to posts (the FIR layer) or searching for topics of common interest (the FII layer). We applied this model to our case study, which focused on the P10/P10 Plus device in Huawei’s Pollen Club, to identify which users were most effective at spreading information through the network. Lastly, we compared these results with commonly used network features using a random forest algorithm and found that spreading effectiveness correlated best with closeness centrality and harmonic closeness centrality in the FIR layer and eigencentrality in the FII layer.

We have two suggestions for how our model may be improved in future research. First, the infectiousness formula in the FIR layer could be modified to consider not just the quantity of post replies but also their quality; for instance, natural language analysis [46] could be used to score the quality of posts. Second, the network model could be extended to more than two layers. For example, many VCs enable users to follow other users; a third layer could capture these follower relations.

In conclusion, this research introduces new concepts for network theory and provides suggestions for how companies could manage their VCs.

Appendix

Comparisons with a probability 1 transformation model: the probability 1 transformation model is defined as follows. The information-propagation model is again based on the simple two-state framework; that is, at any given time, a node is in one of two possible states. In contrast to the model described in Section 5.2, if node $i$ is infected at time $t$, the states of its neighbours in all layers turn into infected states at time $t+1$. This means that once a node is infected, it transmits the information to all of its neighbours with probability 1 (Figure 8).

To illustrate this clearly, the same example used in Figure 5 in Section 5.2 is considered. If node a is chosen as the initial source of the information, the information will spread to all other nodes in both layers, except for the isolated node, in four steps. Because the probability 1 model involves no uncertainty, a single run suffices. To match the model described in the main text, the maximum spreading time was set to 10. The column headings of Table 8 have the same meanings as those of Table 4. The results shown in Table 8 suggest that lower-grade OUG members are more influential, which contradicts the meaning of the grade. This is because the probability 1 model considers only the degrees of the nodes, ignoring weight preference and uncertainty. Our model was thus more effective in predicting how information spreads among the nodes than the probability 1 model.

Data Availability

All data, the models used during the study that appear in the submitted article, and the original data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (no. 71974115), the Innovation Method Fund of China (2018IM020200), and a theme-based research grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (project no. T12-710/16-R).