
Temporal locality-aware sampling for accurate triangle counting in real graph streams

  • Regular Paper
  • Published in The VLDB Journal

Abstract

If we cannot store all edges in a dynamic graph, which edges should we store to estimate the triangle count accurately? Counting triangles (i.e., cliques of size three) is a fundamental graph problem with many applications in social network analysis, web mining, anomaly detection, etc. Recently, much effort has been made to accurately estimate the counts of global triangles (i.e., all triangles) and local triangles (i.e., all triangles incident to each node) in large dynamic graphs, especially with limited space. While existing algorithms use sampling techniques that ignore temporal dependencies among edges, we observe temporal locality in the formation of triangles in real dynamic graphs: future edges are more likely to form triangles with recent edges than with older edges. In this work, we propose a family of single-pass streaming algorithms called Waiting-Room Sampling (WRS) for estimating the counts of global and local triangles in a fully dynamic graph, where edges are inserted and deleted over time, within a fixed memory budget. WRS exploits the temporal locality by always storing the most recent edges, which future edges are most likely to form triangles with, in the waiting room, while it uses reservoir sampling and its variant for the remaining edges. Our theoretical and empirical analyses show that WRS is (a) fast and ‘any time’: it runs in linear time and always maintains up-to-date estimates while the input graph evolves, (b) effective: it yields up to 47% smaller estimation error than its best competitors, and (c) theoretically sound: it gives unbiased estimates with small variances under the temporal locality.
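The waiting-room idea in the abstract can be sketched in a few lines. The sketch below is a minimal illustration for insertion-only streams, not the authors' implementation: the class name, the waiting-room ratio used in the example, and the unweighted `discovered` counter are assumptions; the actual WRS estimator additionally reweights each discovered triangle by the inverse of its sampling probability.

```python
import random
from collections import defaultdict, deque

class WaitingRoomSketch:
    """Minimal sketch of waiting-room sampling for an insertion-only edge
    stream: the alpha*k most recent edges are always kept (the waiting room);
    edges evicted from it are offered to a reservoir sample of size
    (1-alpha)*k.  Triangle discovery is counted unweighted here, whereas the
    actual WRS estimator reweights discoveries by inverse sampling probability."""

    def __init__(self, k, alpha=0.3, seed=0):
        self.room = deque()                     # most recent edges (FIFO)
        self.room_cap = max(1, int(k * alpha))
        self.res = []                           # reservoir of older edges
        self.res_cap = max(1, k - self.room_cap)
        self.n_seen = 0                         # edges offered to the reservoir
        self.adj = defaultdict(set)             # graph induced by stored edges
        self.rng = random.Random(seed)
        self.discovered = 0                     # raw (unweighted) triangle count

    def _unlink(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def insert(self, u, v):
        # count the triangles the new edge closes with currently stored edges
        self.discovered += len(self.adj[u] & self.adj[v])
        # the new edge always enters the waiting room
        self.room.append((u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)
        if len(self.room) > self.room_cap:
            self._offer_to_reservoir(self.room.popleft())

    def _offer_to_reservoir(self, edge):
        # standard reservoir sampling (Vitter) over edges evicted from the room
        self.n_seen += 1
        if len(self.res) < self.res_cap:
            self.res.append(edge)
            return
        i = self.rng.randrange(self.n_seen)
        if i < self.res_cap:
            self._unlink(*self.res[i])          # replaced edge leaves the graph
            self.res[i] = edge
        else:
            self._unlink(*edge)                 # evicted edge is dropped
```

Feeding the six edges of a 4-clique with a budget large enough to hold them all, the sketch discovers all four of its triangles.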


Notes

  1. A preliminary version of WRS for insertion-only graph streams was presented in [35]. This work is an extended version of [35] with (a) a new algorithm for fully dynamic graph streams, (b) theoretical analyses on its accuracy and complexity, and (c) additional experiments with more datasets, competitors, and evaluation metrics.

  2. That is, for each measure, we computed the measure of the estimates in each trial and then reported the mean of the computed values. We did not compute each measure on the mean of the estimates obtained from all trials.

  3. For Mascot and \({\textsc {ThinkD}}_{\textsc {FAST}}\), we set the sampling probability p so that the expected number of sampled edges is equal to the memory budget.

  4. All the considered algorithms were always better than setting all estimates to zero in terms of global error and rank correlation. In terms of local error, however, they were not when the memory budget k was extremely small, since in that case the variances of the estimates of local triangle counts were very large.

  5. Specifically, we ran the Knuth shuffle [22] while skipping different fractions of iterations.
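The scheme in footnote 5 can be sketched as a Fisher–Yates (Knuth) shuffle [22] whose individual swap steps are skipped with a fixed probability; the function name and the interpretation of “skipping iterations” as skipping swaps are assumptions.

```python
import random

def partial_knuth_shuffle(seq, skip_fraction, seed=None):
    """Fisher-Yates (Knuth) shuffle that skips a given fraction of its swap
    iterations, leaving the sequence only partially randomized.
    skip_fraction=0 gives a full uniform shuffle; skip_fraction=1 leaves the
    original (temporally local) order untouched."""
    rng = random.Random(seed)
    out = list(seq)
    for i in range(len(out) - 1, 0, -1):
        if rng.random() < skip_fraction:
            continue                  # skip this swap iteration
        j = rng.randrange(i + 1)
        out[i], out[j] = out[j], out[i]
    return out
```

With `skip_fraction = 0` the result is a uniform random permutation (no temporal locality is preserved); with `skip_fraction = 1` the original, fully local edge order is kept.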

References

  1. Ahmed, N.K., Duffield, N., Neville, J., Kompella, R.: Graph sample and hold: a framework for big-graph analytics. In: KDD, pp. 1446–1455 (2014)

  2. Ahmed, N.K., Duffield, N., Willke, T.L., Rossi, R.A.: On sampling from massive graph streams. PVLDB 10(11), 1430–1441 (2017)


  3. Arifuzzaman, S., Khan, M., Marathe, M.: Patric: A parallel algorithm for counting triangles in massive networks. In: CIKM, pp. 529–538 (2013)

  4. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp. 623–632 (2002)

  5. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)


  6. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient algorithms for large-scale local triangle counting. ACM Trans. Knowl. Discov. Data 4(3), 13 (2010)


  7. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, pp. 6–6 (2006)

  8. De Stefani, L., Epasto, A., Riondato, M., Upfal, E.: Trièst: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM Trans. Knowl. Discov. Data 11(4), 43 (2017)


  9. Eckmann, J.P., Moses, E.: Curvature of co-links uncovers hidden thematic layers in the world wide web. PNAS 99(9), 5825–5829 (2002)


  10. Erdös, P.: On the structure of linear graphs. Isr. J. Math. 1(3), 156–160 (1963)


  11. Etemadi, R., Lu, J., Tsin, Y.H.: Efficient estimation of triangles in very large graphs. In: CIKM, pp. 1251–1260 (2016)

  12. Gehrke, J., Ginsparg, P., Kleinberg, J.: Overview of the 2003 KDD Cup. ACM SIGKDD Explor. Newsl. 5(2), 149–151 (2003)


  13. Gemulla, R., Lehner, W., Haas, P.J.: Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–201 (2008)


  14. Hall, B.H., Jaffe, A.B., Trajtenberg, M.: The NBER patent citation data file: lessons, insights and methodological tools. Tech. rep., National Bureau of Economic Research (2001)

  15. Han, G., Sethu, H.: Edge sample and discard: a new algorithm for counting triangles in large dynamic graphs. In: ASONAM, pp. 44–49 (2017)

  16. Hu, X., Tao, Y., Chung, C.W.: I/O-efficient algorithms on triangle listing and counting. ACM Trans. Database Syst. 39(4), 27 (2014)


  17. Jha, M., Seshadhri, C., Pinar, A.: A space efficient streaming algorithm for triangle counting using the birthday paradox. In: KDD, pp. 589–597 (2013)

  18. Jung, M., Lim, Y., Lee, S., Kang, U.: Furl: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams. Data Min. Knowl. Discov. 33(5), 1225–1253 (2019)


  19. Kallaugher, J., Price, E.: A hybrid sampling scheme for triangle counting. In: SODA, pp. 1778–1797 (2017)

  20. Kim, J., Han, W.S., Lee, S., Park, K., Yu, H.: Opt: A new framework for overlapped and parallel triangulation in large-scale graphs. In: SIGMOD, pp. 637–648 (2014)

  21. Klimt, B., Yang, Y.: Introducing the Enron corpus. In: CEAS (2004)

  22. Knuth, D.E.: Seminumerical algorithms. Art Comput. Program. 2, 139–140 (1997)


  23. Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. In: WAW, pp. 15–24 (2010)

  24. Kutzkov, K., Pagh, R.: On the streaming complexity of computing local clustering coefficients. In: WSDM, pp. 677–686 (2013)

  25. Kutzkov, K., Pagh, R.: Triangle counting in dynamic graph streams. In: SWAT, pp. 306–318 (2014)

  26. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1), 2 (2007)


  27. Lim, Y., Kang, U.: Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In: KDD, pp. 685–694 (2015)

  28. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27(1), 415–444 (2001)


  29. Mislove, A.: Online social networks: measurement, analysis, and applications to distributed information systems. Ph.D. thesis, Rice University (2009)

  30. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)


  31. Park, H.M., Myaeng, S.H., Kang, U.: Pte: Enumerating trillion triangles on distributed systems. In: KDD, pp. 1115–1124 (2016)

  32. Park, H.M., Silvestri, F., Kang, U., Pagh, R.: Mapreduce triangle enumeration with guarantees. In: CIKM, pp. 1739–1748 (2014)

  33. Pavan, A., Tangwongsan, K., Tirthapura, S., Wu, K.L.: Counting and sampling triangles from a graph stream. PVLDB 6(14), 1870–1881 (2013)


  34. Seshadhri, C., Pinar, A., Kolda, T.G.: Triadic measures on graphs: the power of wedge sampling. In: SDM, pp. 10–18 (2013)

  35. Shin, K.: Wrs: Waiting room sampling for accurate triangle counting in real graph streams. In: ICDM, pp. 1087–1092 (2017)

  36. Shin, K., Eliassi-Rad, T., Faloutsos, C.: Patterns and anomalies in k-cores of real-world graphs with applications. Knowl. Inf. Syst. 54(3), 677–710 (2018)


  37. Shin, K., Hammoud, M., Lee, E., Oh, J., Faloutsos, C.: Tri-fly: Distributed estimation of global and local triangle counts in graph streams. In: PAKDD, pp. 651–663 (2018)

  38. Shin, K., Kim, J., Hooi, B., Faloutsos, C.: Think before you discard: accurate triangle counting in graph streams with deletions. In: ECML/PKDD, pp. 141–157 (2018)

  39. Shin, K., Lee, E., Oh, J., Hammoud, M., Faloutsos, C.: Cocos: Fast and accurate distributed triangle counting in graph streams. arXiv preprint arXiv:1802.04249 (2018)

  40. Shin, K., Oh, S., Kim, J., Hooi, B., Faloutsos, C.: Fast, accurate and provable triangle counting in fully dynamic graph streams. ACM Trans. Knowl. Discov. Data 14(2), 1–39 (2020)


  41. Shun, J., Tangwongsan, K.: Multicore triangle computations without tuning. In: ICDE, pp. 149–160 (2015)

  42. Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904)


  43. Suri, S., Vassilvitskii, S.: Counting triangles and the curse of the last reducer. In: WWW, pp. 607–614 (2011)

  44. Tsourakakis, C.E.: Fast counting of triangles in large real networks without counting: algorithms and laws. In: ICDM, pp. 608–617 (2008)

  45. Tsourakakis, C.E., Kang, U., Miller, G.L., Faloutsos, C.: Doulion: counting triangles in massive graphs with a coin. In: KDD, pp. 837–846 (2009)

  46. Turk, A., Türkoglu, D.: Revisiting wedge sampling for triangle counting. In: TheWebConf, pp. 1875–1885 (2019)

  47. Türkoglu, D., Turk, A.: Edge-based wedge sampling to estimate triangle counts in very large graphs. In: ICDM, pp. 455–464 (2017)

  48. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in facebook. In: WOSN, pp. 37–42 (2009)

  49. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)


  50. Wang, P., Qi, Y., Sun, Y., Zhang, X., Tao, J., Guan, X.: Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage. PVLDB 11(2), 162–175 (2017)


  51. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications, vol. 8. Cambridge University Press, Cambridge (1994)


  52. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998)



Acknowledgements

This research was supported by Disaster-Safety Platform Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (Grant Number: 2019M3D7A1094364) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This research was also supported by the National Science Foundation under Grant No. CNS-1314632 and IIS-1408924. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

Author information


Corresponding author

Correspondence to Kijung Shin.


Appendices

Appendix: Proof of Lemma 2

We provide a proof of Lemma 2, which is based on several properties of Random Pairing (RP). We first introduce the uniformity of RP [38] in Lemma 6. Then, we present the mean and variance of the size of the reservoir \(\mathcal {R}\) [13] in Lemma 7. Throughout this section, we use the superscript (t) to indicate the value of each variable after the t-th element \(\varDelta ^{(t)}\) is processed. We let \(\mathcal {E}_{\mathcal {R}}^{(t)}\) denote the set of edges that have flowed into the reservoir \(\mathcal {R}\) from the waiting room \(\mathcal {W}\) at time t or earlier in Algorithm 4. We also let \(y_{\mathcal {R}}^{(t)}=\min (k(1-\alpha ),|\mathcal {E}_{\mathcal {R}}^{(t)}|+n_{b}^{(t)}+n_{g}^{(t)})\).

Lemma 6

(Uniformity of Random Pairing [38]) At any time t, all equally sized subsets of \(\mathcal {E}_{\mathcal {R}}^{(t)}\) are equally likely to be a subset of the reservoir \(\mathcal {R}^{(t)}\). Formally,

$$\begin{aligned}&\mathbb {P}[\mathcal {A}\subseteq \mathcal {R}^{(t)}] = \mathbb {P}[\mathcal {B}\subseteq \mathcal {R}^{(t)}], \nonumber \\&\quad \forall t\ge 1,\ \forall \mathcal {A}\ne \mathcal {B} \subseteq \mathcal {E}_{\mathcal {R}}^{(t)},\ s.t.\ |\mathcal {A}|=|\mathcal {B}|. \end{aligned}$$
(23)

Lemma 7

(Expectation, Variance of the Reservoir Size of Random Pairing [13]) The expected value and the variance of the size of the reservoir \(\mathcal {R}\) at any time t in Algorithm 4 are formulated as follows:

$$\begin{aligned} \mathbb {E}[|\mathcal {R}^{(t)}|]= & {} \frac{|\mathcal {E}_{\mathcal {R}}^{(t)}|}{|\mathcal {E}_{\mathcal {R}}^{(t)}|+d^{(t)}} \cdot y_{\mathcal {R}}^{(t)}, \end{aligned}$$
(24)
$$\begin{aligned} \mathrm {Var}[|\mathcal {R}^{(t)}|]= & {} \frac{d^{(t)} \cdot y_{\mathcal {R}}^{(t)} \cdot (|\mathcal {E}_{\mathcal {R}}^{(t)}|+d^{(t)}-y_{\mathcal {R}}^{(t)})\cdot |\mathcal {E}_{\mathcal {R}}^{(t)}|}{(|\mathcal {E}_{\mathcal {R}}^{(t)}|+d^{(t)})^{2} \cdot (|\mathcal {E}_{\mathcal {R}}^{(t)}|+d^{(t)}-1)}, \end{aligned}$$
(25)

where \(d^{(t)}=n_{b}^{(t)}+n_{g}^{(t)}\).

In Lemma 8, we formulate the probability that each edge in \(\mathcal {E}_{\mathcal {R}}\) is stored in the reservoir \(\mathcal {R}\) in \(\textsc {WRS}_{\textsc {DEL}}\).

Lemma 8

(Sampling Probability of Each Edge in Random Pairing) At any time t, the probability that each edge in \(\mathcal {E}_{\mathcal {R}}\) is stored in \(\mathcal {R}\) after the t-th element is processed in Algorithm 4 is

$$\begin{aligned} \mathbb {P}[\{u,v\}\in \mathcal {R}^{(t)}] = \frac{y^{(t)}_{\mathcal {R}}}{|\mathcal {E}_{\mathcal {R}}^{(t)}|+n_{b}^{(t)}+n_{g}^{(t)}}. \end{aligned}$$
(26)

Proof

Let \(\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t)})\) be a random variable that becomes 1 if \(\{u,v\}\in \mathcal {R}^{(t)}\) and 0 otherwise. By definition,

$$\begin{aligned} |\mathcal {R}^{(t)}|=\sum _{\{u,v\}\in \mathcal {E}_{\mathcal {R}}^{(t)}} \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t)}). \end{aligned}$$
(27)

Then, by linearity of expectation and Eq. (27),

$$\begin{aligned} \mathbb {E}[|\mathcal {R}^{(t)}|]&= \sum _{\{u,v\}\in \mathcal {E}_{\mathcal {R}}^{(t)}}\mathbb {E}[\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t)})] \nonumber \\&= \sum _{\{u,v\}\in \mathcal {E}_{\mathcal {R}}^{(t)}}\mathbb {P}[\{u,v\}\in \mathcal {R}^{(t)}]. \end{aligned}$$
(28)

Then, Eq. (26) is obtained as follows:

$$\begin{aligned} \mathbb {P}[\{u,v\}\in \mathcal {R}^{(t)}]&= \frac{1}{|\mathcal {E}_{\mathcal {R}}^{(t)}|} \sum _{\{w,x\}\in \mathcal {E}_{\mathcal {R}}^{(t)}}\mathbb {P}[\{w,x\}\in \mathcal {R}^{(t)}] \\&= \frac{\mathbb {E}[|\mathcal {R}^{(t)}|]}{|\mathcal {E}_{\mathcal {R}}^{(t)}|} = \frac{y^{(t)}_{\mathcal {R}}}{|\mathcal {E}_{\mathcal {R}}^{(t)}|+n_{b}^{(t)}+n_{g}^{(t)}}, \end{aligned}$$

where the first, second, and last equalities are from Eqs. (23), (28) and (24), respectively. \(\square \)
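As a sanity check, Eq. (26) can be verified empirically in the insertion-only special case \(n_{b}^{(t)}=n_{g}^{(t)}=0\), where Random Pairing reduces to standard reservoir sampling [49] and the inclusion probability becomes \(\min (s,n)/n\) for a reservoir of capacity s over n offered edges. The following Monte Carlo sketch (function names are illustrative) estimates the inclusion probabilities:

```python
import random
from collections import Counter

def reservoir_sample(n, s, rng):
    """Uniform size-s reservoir sample of the stream 0, 1, ..., n-1 (Vitter)."""
    res = []
    for t in range(n):
        if len(res) < s:
            res.append(t)                 # reservoir not yet full
        else:
            j = rng.randrange(t + 1)      # keep item t with probability s/(t+1)
            if j < s:
                res[j] = t
    return res

def empirical_inclusion(n=20, s=5, trials=20000, seed=0):
    """Empirical inclusion frequency of each item over many independent trials."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(trials):
        for x in reservoir_sample(n, s, rng):
            hits[x] += 1
    return {x: hits[x] / trials for x in range(n)}
```

With n = 20 and s = 5, every item's empirical inclusion frequency concentrates around 5/20 = 0.25, matching Eq. (26) with \(d^{(t)}=0\).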

Proof of Lemma 2

We prove Lemma 2 based on Lemma 8.

Proof

Without loss of generality, we assume \(e^{(1)}_{uvw}= \{v,w\}\), \(e^{(2)}_{uvw}= \{w,u\}\), and \(e^{(3)}_{uvw}= e^{(t)}= \{u,v\}\). That is, \(\{v,w\}\) arrives earlier than \(\{w,u\}\), and \(\{w,u\}\) arrives earlier than \(\{u,v\}\). When \(e^{(t)}=\{u,v\}\) arrives at time t, the triangle \(\{u,v,w\}\) is discovered if and only if both \(\{v,w\}\) and \(\{w,u\}\) are in \(\mathcal {S}^{(t-1)}\).

Note that, since \(\textsc {WRS}_{\textsc {DEL}}\) updates the triangle counts before sampling or deleting edges, \(\mathcal {E}_{\mathcal {R}}=\mathcal {E}^{(t-1)}_{\mathcal {R}}\), \(n_{b}=n_{b}^{(t-1)}\) and \(n_{g}=n_{g}^{(t-1)}\) when executing lines 6 and 8 of Algorithm 4.

If \(type_{uvw}=1\), \(\{v,w\}\) and \(\{w,u\}\) are always stored in \(\mathcal {W}^{(t-1)}\), when \(\{u,v\}\) arrives. Thus, \(\textsc {WRS}_{\textsc {DEL}}\) discovers \(\{u,v,w\}\) with probability 1.

If \(type_{uvw}=2\), when \(\{u,v\}\) arrives, \(\{w,u\}\) is always stored in \(\mathcal {W}^{(t-1)}\), while \(\{v,w\}\) cannot be in \(\mathcal {W}^{(t-1)}\) but can be in \(\mathcal {R}^{(t-1)}\). For \(\textsc {WRS}_{\textsc {DEL}}\) to discover \(\{u,v,w\}\), \(\{v,w\}\) should be in \(\mathcal {R}^{(t-1)}\), and from Eq. (26), the probability of the event is

$$\begin{aligned} \mathbb {P}[\{v,w\}\in \mathcal {R}^{(t-1)}]= \frac{y_{\mathcal {R}}^{(t-1)}}{|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}}, \end{aligned}$$

where \(d^{(t-1)}=n_{b}^{(t-1)}+n_{g}^{(t-1)}\).

If \(type_{uvw}=3\), \(\{v,w\}\) and \(\{w,u\}\) cannot be in \(\mathcal {W}^{(t-1)}\), when \(\{u,v\}\) arrives. For \(\textsc {WRS}_{\textsc {DEL}}\) to discover \(\{u,v,w\}\), both \(\{v,w\}\) and \(\{w,u\}\) should be in \(\mathcal {R}^{(t-1)}\). The probability of the event is \(\mathbb {P}[\{v,w\}\in \mathcal {R}^{(t-1)} \text { and } \{w,u\}\in \mathcal {R}^{(t-1)}]\).

Below, we formulate this joint probability by expanding the covariance sum \(\sum _{\{u,v\}\ne \{x,y\}}Cov\bigl (\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\bigr )\) in two ways and comparing the results. Each random variable \(\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)})\) is 1 if \(\{u,v\}\in \mathcal {R}^{(t-1)}\) and 0 otherwise.

First, we expand the variance of \(|\mathcal {R}^{(t-1)}|\). From Eq. (27),

$$\begin{aligned}&Var\bigl [|\mathcal {R}^{(t-1)}|\bigr ]=\sum \limits _{\{u,v\}\in \mathcal {E}^{(t-1)}_{\mathcal {R}}}\, Var\bigl [\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)})\bigr ] \\&\quad + \sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ Cov\left( \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\right) , \end{aligned}$$

and hence, the covariance sum can be expanded as

$$\begin{aligned}&\sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ Cov\left( \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\right) \nonumber \\&\quad = Var\bigl [|\mathcal {R}^{(t-1)}|\bigr ]\ - \sum \limits _{\{u,v\}\in \mathcal {E}^{(t-1)}_{\mathcal {R}}}\ \ Var\bigl [\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)})\bigr ]. \end{aligned}$$
(29)

From \(Var[x]=\mathbb {E}[x^{2}]-(\mathbb {E}[x])^{2}\), we have

$$\begin{aligned}&Var\bigl [\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)})\bigr ]\nonumber \\&\quad =\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\bigr ]-\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\bigr ]^{2}. \end{aligned}$$
(30)

Applying Eqs. (26) and (30) to Eq. (29) results in

$$\begin{aligned}&\sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ Cov\left( \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\right) \nonumber \\&\quad = Var\bigl [|\mathcal {R}^{(t-1)}|\bigr ]\ -\ \sum \limits _{{\ \ \ \ \{u,v\}\in \mathcal {E}^{(t-1)}_{\mathcal {R}}}}\ \ Var\bigl [\mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)})\bigr ] \nonumber \\&\quad = Var\bigl [|\mathcal {R}^{(t-1)}|\bigr ] \nonumber \\&\qquad - |\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot \frac{y_{\mathcal {R}}^{(t-1)}\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}-y_{\mathcal {R}}^{(t-1)})}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})^{2}}. \end{aligned}$$
(31)

Then, we directly expand the covariance sum. With \(Cov(x,y)=\mathbb {E}[xy]-\mathbb {E}[x]\cdot \mathbb {E}[y]\) and Eq. (26), the covariance sum is expanded as

$$\begin{aligned}&\sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ Cov\left( \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\right) \nonumber \\&\quad = \sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ \Bigl (\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ] \nonumber \\&\qquad -\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\bigr ]\cdot \mathbb {P}\bigl [\{x,y\}\in \mathcal {R}^{(t-1)}\bigr ] \Bigr ) \nonumber \\&\quad = \sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ \mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ] \nonumber \\&\qquad -\frac{y_{\mathcal {R}}^{(t-1)}\cdot y_{\mathcal {R}}^{(t-1)}\cdot |\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|-1)}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})^{2}}. \end{aligned}$$
(32)

Next, we obtain the sum of joint probabilities by comparing the two expansions [i.e., Eqs. (31) and (32)] as follows:

$$\begin{aligned}&\sum \limits _{{\{u,v\}\ne \{x,y\}}}\, \mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)}\cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ] \nonumber \\&\quad = \sum \limits _{{\{u,v\}\ne \{x,y\}}}\ \ Cov\left( \mathbb {1}(\{u,v\}\in \mathcal {R}^{(t-1)}), \mathbb {1}(\{x,y\}\in \mathcal {R}^{(t-1)})\right) \nonumber \\&\qquad +\frac{y_{\mathcal {R}}^{(t-1)}\cdot y_{\mathcal {R}}^{(t-1)}\cdot |\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|-1)}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})^{2}} \nonumber \\&\quad = Var\bigl [|\mathcal {R}^{(t-1)}|\bigr ] \nonumber \\&\qquad -|\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot \frac{y_{\mathcal {R}}^{(t-1)}\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}-y_{\mathcal {R}}^{(t-1)})}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})^{2}} \nonumber \\&\qquad +\frac{y_{\mathcal {R}}^{(t-1)}\cdot y_{\mathcal {R}}^{(t-1)}\cdot |\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|-1)}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})^{2}} \nonumber \\&\quad = \frac{y_{\mathcal {R}}^{(t-1)}\cdot (y_{\mathcal {R}}^{(t-1)}-1)\cdot |\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|-1)}{(|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)})\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}-1)}. \end{aligned}$$
(33)

Lastly, each joint probability \(\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)} \cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ]\) is obtained from Eqs. (23) and (33) as follows:

$$\begin{aligned}&\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)} \cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ] \\&\quad = \frac{\sum _{\{u,v\}\ne \{x,y\}}\mathbb {P}\bigl [\{u,v\}\in \mathcal {R}^{(t-1)} \cap \{x,y\}\in \mathcal {R}^{(t-1)}\bigr ]}{|\mathcal {E}^{(t-1)}_{\mathcal {R}}|\cdot (|\mathcal {E}^{(t-1)}_{\mathcal {R}}|-1)} \\&\quad = \frac{y_{\mathcal {R}}^{(t-1)}}{|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}} \times \frac{y_{\mathcal {R}}^{(t-1)}-1}{|\mathcal {E}^{(t-1)}_{\mathcal {R}}|+d^{(t-1)}-1}, \end{aligned}$$

which proves Eq. (4) in Lemma 2 when \(type_{uvw}=3\). \(\square \)
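The pairwise joint probability just derived can likewise be checked empirically in the insertion-only special case \(d^{(t-1)}=0\), where it reduces to \(\frac{s}{n}\cdot \frac{s-1}{n-1}\) for a reservoir of capacity s over n offered edges. A Monte Carlo sketch under that assumption (function names are illustrative):

```python
import random

def reservoir_sample(n, s, rng):
    """Uniform size-s reservoir sample of the stream 0, 1, ..., n-1 (Vitter)."""
    res = []
    for t in range(n):
        if len(res) < s:
            res.append(t)                 # reservoir not yet full
        else:
            j = rng.randrange(t + 1)      # keep item t with probability s/(t+1)
            if j < s:
                res[j] = t
    return res

def joint_inclusion(n=10, s=4, trials=40000, seed=7):
    """Fraction of trials in which items 0 and 1 are both in the sample."""
    rng = random.Random(seed)
    both = sum(1 for _ in range(trials)
               if {0, 1} <= set(reservoir_sample(n, s, rng)))
    return both / trials
```

With n = 10 and s = 4, the empirical joint frequency concentrates around (4/10)(3/9) ≈ 0.133, matching the closed form above.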

Appendix: Variance analysis

Fig. 13

The true and simplified variances are strongly correlated \((R^{2}>0.99)\) on a log-log scale. Both the true and simplified variances of \(\textsc {WRS}_{\textsc {INS}}\) are smaller than those of \(\mathrm{Tri}\grave{\mathrm{e}}\mathrm{st}_{\textsc {IMPR}}\) within the same memory budget

We measured the variance \(\mathrm {{Var}}[c^{(t)}]\) of the estimate of the global triangle count and the simplified version \(\tilde{\mathrm {{Var}}}[c^{(t)}]\) [i.e., Eq. (16)] while changing the memory budget k from \(30\%\) to \(50\%\) of the number of elements in the input stream. As seen in Fig. 13, while there is a clear difference between the true and simplified variances, the two values are strongly correlated (\(R^{2}>0.99\)) on a log-log scale in both the ArXiv and Facebook datasets. This strong correlation supports the validity of our simple and illuminating analysis in Sect. 5.4.2, where we compare the simplified variances of \(\textsc {WRS}_{\textsc {INS}}\) and \(\mathrm{Tri}\grave{\mathrm{e}}\mathrm{st}_{\textsc {IMPR}}\), instead of their true variances, to provide an intuition why \(\textsc {WRS}_{\textsc {INS}}\) is more accurate than \(\mathrm{Tri}\grave{\mathrm{e}}\mathrm{st}_{\textsc {IMPR}}\).


Cite this article

Lee, D., Shin, K. & Faloutsos, C. Temporal locality-aware sampling for accurate triangle counting in real graph streams. The VLDB Journal 29, 1501–1525 (2020). https://doi.org/10.1007/s00778-020-00624-7
