1 Introduction

Because data recording the communication between investors are scarce, uncovering the paths along which information spreads among investors is a great challenge. Ozsoylev et al. [22] first proposed the empirical investor network (EIN) as a novel representation of the information diffusion network, based on investors’ order placements: two investors are said to be connected if they placed the same type (ask or bid) of orders within a short time window (usually 30 seconds). The hypothesis underlying the EIN is that, when new information arrives, it spreads from source nodes to peripheral nodes in the investor social network, and the time lags with which the information reaches different investors determine the lags between their order placements. However, no direct evidence has been found to support this assumption, which is crucial for applying the EIN to determine asset pricing dynamics and to understand the trading behaviors and profitability of investors. As the social network is the backbone of information diffusion, the EIN can be regarded as a proxy for the investor social network. We thus propose to check the validity of the EIN construction by studying structures specific to social networks, such as the degree distribution of the entire network and the layer structure of ego networks. As a reference and for comparison, we also test the hierarchical structures present in cellphone communication networks (CN), which are usually considered to be information spreading networks. We find that the EIN and CN share very similar network structures, characterized by the same distribution of (weighted) degree, the same layer structure of ego networks, and the same distribution of scale ratio, giving credence to the hypothesis that the EIN uncovers a significant part of the information spreading path between investors.

The contributions of this paper are as follows. First, in contrast to the strand of literature that focuses on the layered structure in western social networks, such as cellphone communication networks in Europe, the Christmas card exchange network in the UK, and the online social networks of Facebook and Twitter [7, 13, 20], we empirically uncover the layered structure in Chinese social networks, including the EIN and CN. Second, by investigating the fine structures of the EIN and CN, we find for the first time that their layered structures are highly similar, which complements the existing studies on Dunbar’s number and the EIN. Third, Ozsoylev et al. [22] conclude that the EIN captures information diffusion between investors based on the evidence that central investors in the EIN trade earlier and gain higher returns than their peripheral neighbors. We contribute complementary evidence, from the perspective of social network structure, that the EIN reveals the information spreading path between investors.

The present work is related to the research on Dunbar’s number and its generalised discrete hierarchical structure in social networks. Recall that Dunbar’s number of about 150 represents the average size of the personal ego network, i.e., the group of people with whom one can typically maintain stable social relationships given cognitive limits [3, 4]. Furthermore, the social relations in human and animal networks have been found to form layer structures, with each layer representing a different level of emotional closeness [5, 6]. These layers have approximately the configuration of 3–5, 10–15, 30–50, and 100–200 alters from the inner layer to the outer layer [30]. Many empirical ego networks are found to exhibit such layer structures, including the network abstracted from the exchange of Christmas cards [13], hunter-gatherer social networks [11, 30], and online societies in virtual worlds [8].

Another strand of literature relevant to our work uses cellphone and internet communication data, which make it possible to test classical social theories empirically on large populations. For example, the weak tie theory [9] has been validated for cellphone communication networks [17, 21]. Such data have also been used to verify the hierarchical layer structures in social networks [20]. Arnaboldi et al. [1] found that co-author networks in academic fields also have discrete hierarchical structures. By analyzing online social networks from Facebook and Twitter, Dunbar et al. [7] found that the ego networks exhibit limited size and hierarchical structures. More importantly, such a layer structure can be considered a “social fingerprint” of a specific individual, because it is stable and not affected by changes of friends [24].

This paper is organized as follows. Data and methods are given in Sect. 2. Section 3 presents the results on the degree distribution, clustering, and theoretical model fits. Section 4 concludes.

2 Data and methods

2.1 Empirical investor networks

Our empirical investor networks (EIN) are constructed from the order flows of the 100 stocks included in the Shenzhen 100 index (399004). The order flow data span the whole year of 2013. Following Ozsoylev et al. [22], on each trading day, the EIN is obtained by connecting investors if they submit at least 3 buy (or sell) orders for the same stock within 30 seconds. By aggregating the daily EINs, we obtain the annual EIN, which contains 381,345 nodes and 8,143,541 links. Ozsoylev et al. [22] argue that the links in the EIN may reflect the potential channels of information diffusion among investors, indicating the existence of localized structures in the social networks formed by investors. Thus, the more often a link between two investors occurs, the higher the probability that a social connection exists between them. We further employ a statistically validated method [2, 10, 12, 19, 25, 27] to check whether two investors are connected merely by chance, which provides us with the statistically validated empirical investor network, abbreviated as SVEIN.
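The daily linking rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order records are hypothetical tuples `(investor, stock, side, t)` with `t` in seconds, and a co-occurrence is counted whenever two different investors place same-side orders on the same stock at most 30 seconds apart. The exact matching convention of Ozsoylev et al. [22] may differ.

```python
from collections import defaultdict


def daily_ein(orders, window=30, min_count=3):
    """Return {frozenset({a, b}): count} for investor pairs linked on one day.

    orders: iterable of (investor, stock, side, t), with t in seconds.
    The tuple layout and the pair-counting rule are illustrative assumptions.
    """
    by_key = defaultdict(list)
    for investor, stock, side, t in orders:
        by_key[(stock, side)].append((t, investor))

    counts = defaultdict(int)
    for group in by_key.values():
        group.sort(key=lambda item: item[0])  # order by submission time
        start = 0
        for end, (t_end, inv_end) in enumerate(group):
            # Slide the window so group[start:end] lies within `window` seconds.
            while t_end - group[start][0] > window:
                start += 1
            for _, inv_prev in group[start:end]:
                if inv_prev != inv_end:
                    counts[frozenset((inv_prev, inv_end))] += 1

    # Keep only pairs that co-occur at least `min_count` times.
    return {pair: c for pair, c in counts.items() if c >= min_count}


orders = [
    ("A", "s1", "buy", 0), ("B", "s1", "buy", 10),
    ("A", "s1", "buy", 100), ("B", "s1", "buy", 120),
    ("A", "s1", "buy", 200), ("B", "s1", "buy", 229),
    ("C", "s1", "sell", 5),
]
links = daily_ein(orders)
```

In this toy input, investors A and B place buy orders close together three times, so they are the only linked pair; the co-occurrence count kept for each pair can serve as the link weight when daily networks are aggregated.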

2.2 Cellphone communication network

The cellphone call records, obtained from one Chinese cellphone operator, cover the periods from June 28th to July 24th and from October 1st to December 31st in 2010. After excluding October 12th, November 5th, 6th, 13th, 21st and 27th, and December 6th, 8th, 21st and 22nd, on which the data were missing, we have a total of 109 days. The data contain 91,911,735 cellphone users and 4,599,472,652 calls. As we cannot access the call records of other cellphone operators, only the call records in which both subscribers belong to the data provider are included in our analysis, which leaves 1,173,501,607 records. Since the frequency of calls may represent the intimacy between friends, we assume that the higher the communication frequency between two cellphone users, the stronger their intimacy. We exclude the users who are identified as robots, telecom fraud, and telephone sales [15]. Finally, we build cellphone communication networks based on the reciprocal calls between normal users. The statistically validated method mentioned above is also employed to remove the random calls, which provides us with the statistically validated cellphone communication network, abbreviated as SVCN.

2.3 Statistically validated method

As is well known, the EIN and CN contain a great deal of noise: for instance, two investors may submit orders at the same time by pure coincidence, and callers may dial wrong numbers. This suggests removing such irrelevant signals by testing whether two nodes are randomly connected. For this, we employ a statistically validated method, proposed by Tumminello et al. [25] and used in different systems [2, 10, 12, 19, 27], to extract the links that are not randomly generated.

For two given nodes i and j, the purpose of the statistical validation is to check whether i preferentially connects to j. We take the EIN as an example to illustrate the method. Let us denote by N the total number of transactions between investors in the EIN, by \(N_{ic}\) the number of transactions initiated by investor i, by \(N_{jr}\) the number of transactions matched by investor j, and by \(X = N_{icjr}\) the number of transactions initiated by investor i and matched by investor j. We can then calculate the probability of observing X co-occurrences via the following equation [25, 26],

$$ H(X|N,N_{ic},N_{jr}) = \frac{C^{X}_{N_{ic}}C^{N_{jr} - X}_{N - N_{ic}}}{C^{N_{jr}}_{N}}, $$
(1)

where \(C^{X}_{N_{ic}}\) is a binomial coefficient. We can also estimate the p-value associated with the observed \(N_{icjr}\) as follows:

$$ p(N_{icjr}) = 1 - \sum^{N_{icjr} - 1}_{X = 0} H(X|N,N_{ic},N_{jr}). $$
(2)

For the EIN, we need to perform \(2 \times 8\text{,}143\text{,}541 = 16\text{,}287\text{,}082\) tests. The corresponding Bonferroni correction of our multiple testing hypothesis is \(p_{b} = 0.01/N_{E}\) where \(N_{E}=N(N-1)/2\) is the maximal possible number of edges. If the estimated \(p(N_{icjr})\) is less than \(p_{b}\), we can infer that investor i preferentially connects to investor j. Otherwise, we conclude that the edge pointed from i to j is randomly generated.

For a given edge between node i and node j in the CN, we estimate the p-value for the number of calls \(N_{jcir}\) initiated by j and received by i in a similar way. We need to conduct \(2 \times 296\text{,}928\text{,}030 = 593\text{,}856\text{,}060\) tests, and the Bonferroni correction is again set as \(p_{b} = 0.01/N_{E}\). When \(p(N_{icjr})\) is less than \(p_{b}\), individual i preferentially calls individual j. Only when the two conditions that (1) i preferentially calls j and (2) j preferentially calls i are simultaneously satisfied do we conclude that the edge between i and j is significant.
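As an illustration of Eqs. (1) and (2), the following Python sketch evaluates the hypergeometric probability and the p-value with exact integer binomial coefficients. The function and argument names are hypothetical, and the number of tests used in the Bonferroni threshold is left as a parameter. The p-value is summed over the upper tail, which equals Eq. (2) mathematically but avoids the loss of precision in computing one minus a number very close to one.

```python
from math import comb


def hypergeom_pmf(x, n_total, n_ic, n_jr):
    """Eq. (1): probability of exactly x co-occurrences under random matching."""
    if x < 0 or x > n_ic or n_jr - x < 0 or n_jr - x > n_total - n_ic:
        return 0.0
    return comb(n_ic, x) * comb(n_total - n_ic, n_jr - x) / comb(n_total, n_jr)


def p_value(x_obs, n_total, n_ic, n_jr):
    """Eq. (2) written as an upper-tail sum: P(X >= x_obs)."""
    upper = min(n_ic, n_jr)
    return sum(hypergeom_pmf(x, n_total, n_ic, n_jr)
               for x in range(x_obs, upper + 1))


def is_validated(x_obs, n_total, n_ic, n_jr, n_tests, alpha=0.01):
    """Bonferroni rule: keep the directed link if p < alpha / n_tests."""
    return p_value(x_obs, n_total, n_ic, n_jr) < alpha / n_tests


# Toy numbers: 10 transactions, i initiates 4, j matches 5, 3 in common.
p = p_value(3, 10, 4, 5)
```

For the toy numbers the p-value is 66/252, about 0.26, so the link would not be validated. Exact binomial coefficients become slow for totals in the millions; a log-gamma formulation is a common alternative at that scale.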

2.4 Clustering method

Figure 1 illustrates the layer structure of a typical ego network. The ego in the center is surrounded by the alters, who have direct connections with the ego. The alters usually form a layer structure, in which their emotional closeness to the ego decreases from the inner layer to the outer layer. The theoretical Dunbar Circle corresponds to a four-layer hierarchical structure with cumulative sizes of 5, 15, 50, and 150 from inside to outside. We employ two clustering algorithms, the k-means algorithm and the head-to-tail (\(H/T\)) break algorithm [14], to detect the layer structures of the ego networks in the SVEIN and SVCN based on the activity frequencies on links. The k-means algorithm is implemented with the R package Ckmeans.1d.dp [28]. The optimal number of clusters is determined according to the Bayesian information criterion (BIC). In the \(H/T\) break algorithm, the data are split into two parts at the data mean \(m_{1}\), and the head part, in which all values are larger than \(m_{1}\), is further split into two parts at the head mean \(m_{2}\). This process iterates until the head is no longer heavy-tailed. The \(H/T\) break algorithm was proposed to cluster data with a heavy-tailed distribution, which is the case for the link weights in the SVEIN and SVCN.
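The \(H/T\) break procedure can be sketched as follows. The stopping rule is an assumption: the text only says that iteration stops once the head is no longer heavy-tailed, and this sketch uses the common convention that the head must contain at most 40% of the values being split (the hypothetical parameter `head_limit`). It is one variant of the procedure, not necessarily the one used for the results reported here.

```python
def head_tail_breaks(values, head_limit=0.4):
    """Return the successive means used as break points."""
    breaks = []
    data = list(values)
    while len(data) > 1:
        mean = sum(data) / len(data)
        head = [v for v in data if v > mean]
        # Stop when there is no head or the head is not a clear minority.
        if not head or len(head) / len(data) > head_limit:
            break
        breaks.append(mean)
        data = head  # recurse into the head only
    return breaks


def assign_classes(values, breaks):
    """Class index = number of break points a value exceeds (0 = tail)."""
    return [sum(v > b for b in breaks) for v in values]


weights = [1] * 10 + [20] * 3 + [200]
breaks = head_tail_breaks(weights)
classes = assign_classes([1, 20, 200], breaks)
```

With the toy weights, two break points are found (about 19.3 and 65), which separate the values into three classes. In an ego network, the class with the largest link weights would correspond to the innermost layer.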

Figure 1

Illustration of the theoretical Dunbar Circle in ego networks. The square in the center represents the ego and the circles around it are the alters, who have direct connections with the ego. The circle size is proportional to the emotional closeness between the alter and the ego. According to the emotional closeness, the alters form a hierarchical structure with different layers, in which their closeness to the ego decreases from the inner layer to the outer layer. The theoretical Dunbar Circle corresponds to a four-layer hierarchical structure with cumulative sizes of 5, 15, 50, and 150 from inside to outside

3 Results

3.1 Degree distribution

We first report the descriptive statistics of both filtered networks. As reported in Panel A of Table 1, in the SVEIN 2.23%, 6.39%, and 91.37% of the 21,806 users have degrees in the ranges \(k > 100\), \(50 < k \le 100\), and \(k < 50\), respectively. Their average degrees and standard deviations are 142.9 and 38.5, 68.8 and 13.9, and 10.0 and 11.8, giving coefficients of variation (standard deviation/mean) of 26.95%, 20.22%, and 117.95%. Their average weighted degrees and standard deviations are 18,487.1 and 10,984.6, 5504.3 and 2935.4, and 477.0 and 1134.

Table 1 Statistical descriptions of SVEIN and SVCN. k denotes the degree of users in the network

In Panel B of Table 1, we find that the numbers of users in the SVCN with degree \(k > 100\), \(50 < k \le 100\), and \(k < 50\) are 60,748, 177,076, and 3,930,604, accounting for 1.46%, 4.25%, and 94.29% of the users, respectively. The corresponding average degrees and standard deviations are 142.2 and 45.8, 69.4 and 13.7, and 8.1 and 10, resulting in coefficients of variation of 32.23%, 19.79%, and 124.08%. Their average weighted degrees and standard deviations are 1544.7 and 775, 780.3 and 410.9, and 92.1 and 161.7. The absolute number of nodes with \(k > 100\) in the SVEIN is much smaller than that in the SVCN, but the relative numbers are very close to each other. According to the descriptive statistics, both filtered networks exhibit great similarities in their degree distributions.

We further fit the empirical degree and weighted degree distributions of the SVEIN and SVCN with four distributions, namely the power-law, the normal, the exponential, and the log-normal distribution,

$$\begin{aligned} &f_{P}(x) = \frac{\alpha -1}{x_{\min }} \biggl(\frac{x}{x_{\min }} \biggr)^{-\alpha }, \quad\alpha >1, \end{aligned}$$
(3)
$$\begin{aligned} &f_{N}(x) = \frac{1}{\sqrt{2\pi } \sigma _{N} } \exp \biggl[ - \frac{(x-\mu _{N})^{2}}{2\sigma _{N}^{2}} \biggr], \end{aligned}$$
(4)
$$\begin{aligned} &f_{E}(x) = \lambda e^{-\lambda x},\quad x>0 \end{aligned}$$
(5)
$$\begin{aligned} &f_{L}(x) = \frac{1}{\sqrt{2\pi } \sigma _{L} x} \exp \biggl[- \frac{(\ln x-\mu _{L})^{2}}{2\sigma _{L}^{2}} \biggr]. \end{aligned}$$
(6)

The parameters of these distributions are obtained by Maximum Likelihood Estimation (MLE). The results are listed in Table 2. Kolmogorov–Smirnov (KS) tests are also conducted to check whether the (weighted) degrees are drawn from the four distributions. The null hypothesis is that the data set follows the distribution being tested. One finds that, for both networks, the samples of the degree with \(k>0\) and the weighted degree with \(k>0\) and \(k>100\) conform precisely to none of the four distributions. This is not surprising, given the large sizes of our data sets, for which even slight deviations lead to a rejection of the null hypothesis. However, we can still compare the goodness of the fits of the four distributions using the Akaike information criterion (AIC) listed in Table 2. Except for the sample with \(k>100\) in the SVEIN, the log-normal distribution has the smallest AIC value. Thus, among the four distributions, the log-normal distribution fits the empirical degree distributions best.
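The fitting and model comparison step can be illustrated with closed-form maximum likelihood estimates for Eqs. (3)–(6). The sketch below makes several assumptions that are not stated in the text: the data are treated as continuous, \(x_{\min}\) of the power law is fixed to the sample minimum rather than estimated, and the function name `fit_and_aic` is hypothetical. The AIC is computed as \(2m - 2\ln \hat{L}\), where m is the number of fitted parameters and \(\hat{L}\) the maximized likelihood.

```python
import math


def fit_and_aic(data):
    """Closed-form MLE fits and AIC values for the four candidate laws."""
    n = len(data)
    logs = [math.log(x) for x in data]
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    mean_log = sum(logs) / n
    var_log = sum((v - mean_log) ** 2 for v in logs) / n

    # Power law, Eq. (3); x_min is fixed to the sample minimum (assumption).
    x_min = min(data)
    sum_log_ratio = sum(logs) - n * math.log(x_min)
    alpha = 1.0 + n / sum_log_ratio
    ll_power = n * math.log((alpha - 1.0) / x_min) - alpha * sum_log_ratio

    # Normal, Eq. (4): log-likelihood evaluated at the MLE.
    ll_normal = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

    # Exponential, Eq. (5).
    lam = 1.0 / mean
    ll_exp = n * math.log(lam) - lam * sum(data)

    # Log-normal, Eq. (6): log-likelihood evaluated at the MLE.
    ll_lognormal = (-sum(logs)
                    - 0.5 * n * (math.log(2.0 * math.pi * var_log) + 1.0))

    def aic(loglik, n_params):
        return 2.0 * n_params - 2.0 * loglik

    return {
        "power-law": {"params": (alpha,), "aic": aic(ll_power, 1)},
        "normal": {"params": (mean, math.sqrt(var)), "aic": aic(ll_normal, 2)},
        "exponential": {"params": (lam,), "aic": aic(ll_exp, 1)},
        "log-normal": {"params": (mean_log, math.sqrt(var_log)),
                       "aic": aic(ll_lognormal, 2)},
    }


fits = fit_and_aic([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0])
best = min(fits, key=lambda name: fits[name]["aic"])
```

The distribution with the smallest AIC is preferred, which is the comparison reported in Table 2. The sketch requires strictly positive data with at least two distinct values.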

Table 2 Results of fitting the (weighted) degrees to the power-law, normal, exponential, and log-normal distributions for the SVEIN and SVCN, and of statistically testing whether the (weighted) degrees are drawn from the four distributions. The symbols ∗, ∗∗, and ∗∗∗ indicate the significance levels of 5%, 1%, and 0.1%, respectively

The results of Table 2 strongly suggest that the correct distribution of degrees is a mixture of at least two log-normal distributions, one for small k and one for large k. Roughly, we can find a threshold \(k_{H}\) such that the degrees less than \(k_{H}\) are fitted by the left-truncated log-normal distribution and the degrees greater than \(k_{H}\) are fitted by the right-truncated log-normal distribution. Following Refs. [16, 29], the threshold \(k_{H}\) can be estimated by minimizing the following residual,

$$ R = \frac{ \biggl\{ \sum_{i}^{n_{s}} \biggl[\frac{K_{i,\mathrm{fit}}^{s}-K_{i,\mathrm{emp}}^{s}}{K_{i,\mathrm{fit}}^{s}+K_{i,\mathrm{emp}}^{s}} \biggr] + \sum_{j}^{n_{l}} \biggl[\frac{K_{j,\mathrm{fit}}^{l}-K_{j,\mathrm{emp}}^{l}}{K_{j,\mathrm{fit}}^{l}+K_{j,\mathrm{emp}}^{l}} \biggr] \biggr\} ^{\frac{1}{2}}}{\sqrt{n_{s}+n_{l}}}, $$
(7)

where \(K_{\mathrm{{fit}}}\) and \(K_{\mathrm{{emp}}}\) represent the fitted distribution and the empirical distribution, the superscripts s and l stand for the samples less than and greater than the threshold \(k_{H}\), and n is the sample size. The parameters of both truncated distributions are determined through Maximum Likelihood Estimation (MLE). Figure 2(a) and (b) illustrate the fitting residuals as a function of the possible thresholds for the degrees of the SVEIN and SVCN. We find that the optimal thresholds are 152 and 48 for the SVEIN and SVCN, respectively. The corresponding right-truncated and left-truncated degree distributions are plotted in Fig. 2(c)–(f) for the SVEIN and SVCN. The solid lines in each panel represent the best fits to the truncated log-normal distributions. For the weighted degrees of both networks, we perform the same analysis and illustrate the results in Fig. 3. The optimal thresholds are 374 and 653 for the weighted degrees of the SVEIN and SVCN, respectively. One can see that the empirical distributions agree well with the fitted distributions in Figs. 2 and 3, which supports the conclusion that the (weighted) degrees of both networks conform to a mixed log-normal distribution.

Figure 2

Results of the optimal truncated distribution of degrees for SVEIN (a), (c), (e) and SVCN (b), (d), (f). (a), (b) Plots of the fitting residual (Eq. (7)) as a function of threshold \(k_{H}\). (c), (d) Plots of the right-truncated degree distributions. (e), (f) Plots of the left-truncated degree distributions

Figure 3

Results of the optimal truncated distribution of weighted degrees for SVEIN (a), (c), (e) and SVCN (b), (d), (f). (a), (b) Plots of the fitting residual (Eq. (7)) as a function of threshold \(k_{H}\). (c), (d) Plots of the right-truncated degree distributions. (e), (f) Plots of the left-truncated degree distributions

As is well known, the log-normal distribution plays an important role in describing natural phenomena in which growth is driven by the accumulation of many small percentage changes (growth rates), which are additive on the logarithmic scale. If each percentage change is small enough, their sum on the logarithmic scale tends to be normally distributed according to the central limit theorem, which means that the quantity itself follows a log-normal distribution on the linear scale. One intriguing feature of such a growth process is that the growth rate is independent of the current size. According to the log-normal degree distributions, one can infer that the growth rate of one’s “friends” should be independent of one’s current number of “friends” in the SVEIN and SVCN.
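This argument can be written compactly. As a sketch in notation that is not part of the original analysis, assume that a quantity \(x_{t}\), such as the degree of a node, changes at each step by a small random relative amount \(\varepsilon_{t} > -1\), where the \(\varepsilon_{t}\) are independent of \(x_{t}\), identically distributed, and of finite variance. Then

$$ x_{T} = x_{0}\prod_{t=1}^{T}(1+\varepsilon_{t}) \quad \Longrightarrow\quad \ln x_{T} = \ln x_{0} + \sum_{t=1}^{T}\ln (1+\varepsilon_{t}). $$

For large T, the sum on the right-hand side is approximately normally distributed by the central limit theorem, so that \(x_{T}\) is approximately log-normally distributed.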

3.2 Clusters

The layer structures in ego networks are usually determined based on the emotional closeness on links [23]. Here, we cannot measure the emotional closeness directly. As an alternative, we employ the number of order placements in the EIN and the number of calls in the CN as a proxy for the emotional closeness on links. For a given node with n links, we first normalize the number of order placements (respectively, the number of calls) \(W_{i}\) (\(i = 1, 2, 3, \ldots, n\)) on each link via the following equation,

$$ \hat{W_{i}}=\frac{W_{i}-W_{\min }}{W_{\max }-W_{\min }}, $$
(8)

where \(W_{\min } = \min (\{W_{i}\})\) and \(W_{\max } = \max (\{W_{i}\})\). Equation (8) ensures \(0 \le \hat{W_{i}} \le 1\). The presence of natural breaks (associated with network layers) should then be reflected in the existence of sharp peaks in the distributions of \(\hat{W_{i}}\). We thus plot the distribution of the normalized weights \(\hat{W_{i}}\) in Fig. 4 for both networks. As shown in Fig. 4(a), no break can be observed for the SVEIN. A possible explanation is that the data sample of the SVEIN is too small. In contrast, there is a significant peak at around 0.1 for the SVCN, as illustrated in Fig. 4(b), which corresponds to the natural break \(\hat{W_{i}} \approx 0.1 = 15/150\), i.e. the second layer at 15 of Dunbar’s discrete hierarchy. In the following, we use the clustering algorithms (k-means and \(H/T\) break) to uncover the discrete hierarchical structure of the nodes with \(k > 100\) based on the normalized weights \(\hat{W_{i}}\).
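The normalization of Eq. (8) is a min-max rescaling of the link weights of one ego. A minimal sketch is given below; the function name is hypothetical, and mapping all weights to zero when every link has the same weight is an assumption, since Eq. (8) is undefined in that case.

```python
def normalize_weights(weights):
    """Eq. (8): rescale the link weights of one ego to the interval [0, 1]."""
    w_min, w_max = min(weights), max(weights)
    if w_max == w_min:
        # Eq. (8) is undefined here; returning zeros is an assumption.
        return [0.0 for _ in weights]
    return [(w - w_min) / (w_max - w_min) for w in weights]


normalized = normalize_weights([2, 4, 10])
```

For the toy weights 2, 4, and 10, the normalized values are 0, 0.25, and 1.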

Figure 4

Probability distribution of the normalized weights \({\hat{W_{i}}}\). (a) SVEIN. (b) SVCN

Figure 5 shows the percentage of users who have the same number of layers according to the k-means and \(H/T\) break clustering algorithms. As shown in Fig. 5(a) and (b), the alters of investors with degree \(k > 100\) in the SVEIN are mainly divided into 2–4 classes by the k-means algorithm and into 4–6 classes by the \(H/T\) break algorithm. We also find that the alters of 56.9% of the investors can be grouped into 5 layers. In order to measure the similarity and robustness of the clustering results, we further estimate the Jaccard coefficient between the clustering results of the two algorithms for the same user. The average Jaccard coefficient over all users is 0.11. As illustrated in Fig. 5(c) and (d), in the SVCN the alters of the users with degree \(k > 100\) are mainly divided into 3–6 classes by the k-means algorithm and into 4–5 classes by the \(H/T\) break algorithm. The average Jaccard coefficient of the clustering results is 0.23. Our results indicate that the degree of overlap between the clusters from the two algorithms is low.
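The text does not spell out how the Jaccard coefficient between two clusterings is computed. One common definition, used here purely as an assumption, counts pairs of alters: the number of pairs placed in the same cluster by both algorithms, divided by the number of pairs placed in the same cluster by at least one of them.

```python
from itertools import combinations


def clustering_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same alters."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b    # pair co-clustered by both algorithms
        either += same_a or same_b   # pair co-clustered by at least one
    # If no pair is ever co-clustered, treat the clusterings as identical.
    return both / either if either else 1.0


jaccard = clustering_jaccard([0, 0, 1, 1], [0, 0, 0, 1])
```

In the toy example, one pair is co-clustered by both labelings and four pairs by at least one, giving a coefficient of 0.25. Identical clusterings give 1, and values near 0 indicate little agreement.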

Figure 5

Plots of the percentage of the users who have the same number of layers in the SVEIN (a), (b) and SVCN (c), (d) based on the k-means (a), (c) and \(H/T\) (b), (d) break algorithm

Table 3 compares the clustering results for the users with degree \(k > 100\) in both networks based on the k-means and \(H/T\) break algorithms. The results of the two clustering algorithms for the SVEIN are reported in Panel A of Table 3. We find that 43% of the users with degree \(k > 100\) in the SVEIN are grouped into 3 layers, and the average cumulative numbers of alters in these layers are 10.9, 45.8 and 141.7. The last two layers correspond to the middle two layers of the empirical discrete hierarchical structure, and the first layer seems to correspond to the amalgamation of the first two layers of the empirical structure reported in Refs. [13, 30]. The \(H/T\) break algorithm reveals that the alters of about 90% of the investors exhibit a configuration with 5 or 6 layers. One can observe that the numbers of alters in the outer four layers are very close to the theoretical Dunbar Circle of 5, 15, 50, and 150. The number of alters in the inner one or two layers is only 1–3.

Table 3 Comparison of the clustering results for the users with degree \(k >100\) based on the k-means and \(H/T\) break algorithms for the SVEIN and SVCN. N and f represent the total number and the percentage of users. \(n_{k}\) stands for the cumulative number of users in the k-th layer. \(\langle r \rangle \) is the average scale ratio

Panel B of Table 3 lists the cumulative number of friends in each layer for the SVCN. For the k-means algorithm, we find that 16,918 users (a fraction of 41.1%) have a four-layer structure. The average cumulative numbers of alters from inside to outside are 3.0, 12.8, 42.8 and 132.0, which are in agreement with the discrete hierarchical structure 3–5, 10–15, 30–50, and 100–200 reported in Refs. [13, 30]. The corresponding scale ratio is 3.22, which is close to the scaling ratio of 3. We also find that 15,209 users have a five-layer structure with average cumulative numbers of 2.1, 7.3, 20.4, 54.2, and 141.4. Apart from the inner layer \(n_{1} = 2.1\), the numbers of alters in the outer four layers are very close to the hierarchical structure reported in Refs. [13, 30]. For the \(H/T\) break algorithm, 29,125 users (about 50.2%) exhibit a four-layer structure, and the average cumulative numbers of alters are 2.1, 8.7, 33.4 and 133.9. There are 25,539 users (about 44.1%) whose alters can be classified into 5 layers, and the average cumulative numbers of alters in successive layers are 1.2, 3.8, 11.7, 39.5 and 147.6.

Both clustering algorithms reveal a similar discrete hierarchical structure in cellphone networks. We find that there is an extra innermost layer (1.2–2.1), with about 1–2 alters, for the users with four layers in their ego networks. We further fix the number of clusters to 4 for the k-means algorithm and estimate the cumulative numbers of alters in each layer, obtaining 2.5, 10.3, 36.8, and 142.2. In addition, we perform the clustering analysis on the link activities of the ego networks of investors with degrees \(50 < k < 100\) by means of the k-means algorithm. We find that 621 investors (about 44.9%) have a two-layer structure with cumulative sizes of 19.8 and 67.2, which is close to the middle two layers of the reported hierarchical structure [13, 30].
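Clustering one-dimensional link weights into a fixed number of groups can be solved exactly by dynamic programming, which is the idea behind optimal one-dimensional k-means. The sketch below is a simplified stand-in written for illustration, not the R implementation used for the results above: it takes the number of clusters as given and does not perform BIC-based selection. The helper `cumulative_layer_sizes` assumes that the cluster with the largest weights forms the innermost layer.

```python
def kmeans_1d(values, k):
    """Optimal 1D k-means by dynamic programming (requires len(values) >= k).

    Returns k clusters as lists of values, ordered from smallest to largest.
    """
    xs = sorted(values)
    n = len(xs)
    # Prefix sums give the within-cluster sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Sum of squared deviations from the mean of xs[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting xs[:j] into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Backtrack the cluster boundaries from the last cluster to the first.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return clusters


def cumulative_layer_sizes(clusters):
    """Cumulative numbers of alters, starting from the strongest-tie cluster."""
    sizes, total = [], 0
    for cluster in reversed(clusters):
        total += len(cluster)
        sizes.append(total)
    return sizes


clusters = kmeans_1d([1, 2, 3, 10, 11, 12, 50], 3)
layers = cumulative_layer_sizes(clusters)
```

The running time grows as the number of clusters times the square of the number of values, which is unproblematic for ego networks with a few hundred alters.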

The empirical hierarchical structures of the personal ego networks in the SVEIN and SVCN are compatible with the structure of 3–5, 10–15, 30–50, 100–200 from the inner to the outer layer, which is close to the theoretical Dunbar Circle. The average empirical scaling ratio is also close to the theoretical value of 3 [18].
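The scaling ratio is the ratio between the cumulative sizes of successive layers. As a small worked example, the sketch below computes these ratios for the theoretical Dunbar Circle of 5, 15, 50, and 150; the function name is hypothetical.

```python
def scale_ratios(cumulative_sizes):
    """Ratios between the cumulative sizes of successive layers."""
    return [outer / inner
            for inner, outer in zip(cumulative_sizes, cumulative_sizes[1:])]


ratios = scale_ratios([5, 15, 50, 150])
mean_ratio = sum(ratios) / len(ratios)
```

The three ratios are 3, 10/3, and 3, with a mean of about 3.1. The averages \(\langle r \rangle \) reported in Table 3 need not coincide with ratios formed in this way from the average layer sizes.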

Figures 6 and 7 show the distributions of the numbers of alters in each layer for the egos having degree \(k > 100\) in the SVEIN and SVCN. We only show the nodes whose personal ego networks have the layer structures listed in the corresponding figure captions. For both networks, the clustering results of the two algorithms are not in agreement with each other, as reflected by the low values of their Jaccard coefficients. An intriguing phenomenon is that the empirical distributions of the number of alters in each layer are well fitted by log-normal distributions, as shown by the solid curves. Such log-normal distributions are robust with respect to the choice of clustering algorithm, in agreement with the results for the online social networks of Facebook and Twitter [7].

Figure 6

Probability distribution of the number of alters in different layers for the SVEIN. The solid curves represent the best log-normal fits to the empirical distribution. (a) Egos with three layers obtained from the k-means algorithm. (b) Egos with four layers obtained from the k-means algorithm. (c) Egos with five layers obtained from the \(H/T\) break algorithm. (d) Egos with six layers obtained from the \(H/T\) break algorithm

Figure 7

Probability distribution of the number of alters in different layers for the SVCN. The solid curves represent the best log-normal fits to the empirical distribution. (a) Egos with four layers obtained from the k-means algorithm. (b) Egos with five layers obtained from the k-means algorithm. (c) Egos with four layers obtained from the \(H/T\) break algorithm. (d) Egos with five layers obtained from the \(H/T\) break algorithm

3.3 Fits to the theoretical model

We further fit the clustering results to the theoretical model of layer structures in personal social networks [24]. According to this model, the probability that the alters of an individual are divided into \(\pmb{\ell }=(\ell _{1}, \ell _{2},\ldots,\ell _{r})\) is calculated as follows

$$ P(\pmb{\ell } |\mathcal{L},\mu,N) = \mathscr{B} \biggl(L, \frac{\mathcal{L}}{N-1},N-1 \biggr) \biggl( \frac{e^{\mu }-1}{e^{\mu r}-1} \biggr)^{L} \dbinom{L}{\pmb{\ell }}e^{ \mu \sum _{k=0}^{r-1} k\ell _{k+1} }, $$
(9)

where \(\pmb{\ell }=(\ell _{1},\ell _{2},\ldots,\ell _{r})\) represents the number of alters in each layer. \(\mathcal{L}\) represents the sum of the expected numbers of alters in each layer and is equal to the total number of alters L. N is the total number of individuals in the network. \(\mathscr{B}(L,p,N) = \binom{N}{L}p^{L}(1-p)^{N-L}\) represents a binomial distribution. There is a unique parameter μ in the model, which is an indicator of the discrete hierarchy of the ego network. The parameter μ is approximately equal to the logarithm of the scale ratio \(\log (r)\) between the cumulative numbers of individuals in successive layers, if the personal investment (time and energy) decreases linearly with the layers [24].
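Equation (9) can be evaluated numerically in log space to avoid overflow of the factorials. The sketch below is a direct transcription of the formula as written, with hypothetical function names; it assumes \(\mu \neq 0\) and \(0 < L < N-1\), and by default sets \(\mathcal{L} = L\) as stated in the text.

```python
from math import exp, expm1, lgamma, log, log1p


def log_binom_pmf(k, p, n):
    """Logarithm of the binomial probability C(n, k) p^k (1 - p)^(n - k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log1p(-p))


def log_layer_prob(layers, mu, n_individuals, expected_total=None):
    """Logarithm of Eq. (9) for layer sizes `layers` = (l_1, ..., l_r)."""
    r = len(layers)
    total = sum(layers)  # L, the total number of alters
    if expected_total is None:
        expected_total = total  # the text sets script-L equal to L
    p = expected_total / (n_individuals - 1)

    # Multinomial coefficient L! / (l_1! ... l_r!).
    log_multinomial = lgamma(total + 1) - sum(lgamma(l + 1) for l in layers)
    # Normalizing factor ((e^mu - 1) / (e^(mu r) - 1))^L.
    log_norm = total * log(expm1(mu) / expm1(mu * r))
    # Exponent mu * sum_{k=0}^{r-1} k * l_{k+1}.
    exponent = mu * sum(k * l for k, l in enumerate(layers))

    return (log_binom_pmf(total, p, n_individuals - 1)
            + log_norm + log_multinomial + exponent)


value = exp(log_layer_prob((1, 3, 8), mu=1.1, n_individuals=1000))
```

A useful consistency check is that, for a fixed total L, the probabilities of all possible layer configurations sum to the binomial factor, because the remaining factors form a normalized multinomial distribution over the layers.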

Once the empirical hierarchical structure of egos is obtained, we calculate the average scale ratio \(\langle r \rangle \) between adjacent cumulative layers based on the model proposed by Tamarit et al. [24]. The estimated theoretical scale ratios of both algorithms are listed in the last column of Table 3. For the SVEIN, the k-means algorithm indicates that the users are preferentially divided into the group having a three-layer structure, while the \(H/T\) break algorithm finds that the ego networks exhibit a configuration of five layers. Their scale ratios are very close to the scaling ratio of 3 discovered by Zhou et al. [30]. However, we find significant differences in the average scale ratio between the two clustering algorithms for the SVCN. On average, the scale ratio of the \(H/T\) break algorithm is larger than 3.5 and the scale ratio obtained with the k-means algorithm is smaller than 3.5. Both clustering algorithms reveal that most of the users exhibit a four-layer structure in their ego networks, for which the scale ratios are 3.2 and 4.0, respectively, which are roughly compatible with the scale ratio reported in Ref. [30].

Figures 8 and 9 show the distributions of the estimated average scale ratios for the egos having the same layer structure in both networks. We find that the scale ratio distributions given by Tamarit’s model conform to log-normal distributions for both clustering algorithms. The \(\chi ^{2}\) test, KS test and AD test cannot reject the null hypothesis that the scale ratios are log-normally distributed at the significance level of 5%. The solid curves in Figs. 8 and 9 are the best fits to the log-normal distributions. The scaling ratios given by \(\exp (\hat{\mu })\) are located in the range of 2.5–3.3, which is compatible with the scaling ratio of 3 discovered by Zhou et al. [30]. Our results reveal that the ego networks in the SVEIN exhibit very similar layer structures to those in the SVCN, supporting the view that the SVEIN captures the information spreading channels between investors.

Figure 8

Empirical distribution of the scale ratios for the egos with different layers based on different cluster algorithms for the SVEIN. (a) Three layers and k-means algorithm. (b) Four layers and k-means algorithm. (c) Five layers and \(H/T\) break algorithm. (d) Six layers and \(H/T\) break algorithm

Figure 9

Empirical distribution of the scale ratios for the egos with different layers based on different cluster algorithms for the SVCN. (a) Four layers and k-means algorithm. (b) Five layers and k-means algorithm. (c) Four layers and \(H/T\) break algorithm. (d) Five layers and \(H/T\) break algorithm

4 Conclusion

We have performed a comparative analysis to detect the layer structures in Empirical Investor Networks and Cellphone Communication Networks. The layer structures have been quantified by two clustering algorithms, namely the k-means and \(H/T\) break algorithms. Both clustering algorithms reveal that there are two types of inner structure in both networks: one exhibits a layer structure similar to that of the theoretical Dunbar Circle, while the other has an additional inner layer, which is also found in Facebook and Twitter datasets [7]. Furthermore, we find that both networks have a similar scale ratio (close to 3). More interestingly, these scale ratios remain stable even when old alters are replaced by new alters. By fitting our empirical clustering results to the theoretical model of layer structures [24], we confirm that the scale ratios of different egos follow a log-normal distribution for both networks. Our results provide strong evidence that the structures of ego networks in the EIN and CN exhibit great similarities, which indicates that the EIN captures the information spreading routes between investors and validates the underlying assumption of the EIN.

The Dunbar Circle, which refers to the layered structure of ego social networks, is ubiquitous in online and offline social networks [7, 13, 20]. In such a layered structure, the size of each layer increases as the emotional closeness decreases, which can be attributed to the fact that constrained cognitive capacity limits the number of emotionally close social relationships an individual can maintain [4]. Our work demonstrates that the EIN shares a very similar layered structure with the CN, which supports the view that the number of neighbors with whom an investor exchanges information is governed by his or her cognitive ability. Furthermore, the Dunbar Circle also reveals that the ego shares information with the investors in inner layers more often than with those in outer layers, indicating the existence of tight cliques of investors who exchange trading information.