Skip to main content
Log in

Towards distributed node similarity search on graphs

  • Published:
World Wide Web Aims and scope Submit manuscript

Abstract

Node similarity search on graphs has wide applications in recommendation, link prediction, to name just a few. However, existing studies are insufficient due to two reasons: (i) the scale of the real-world graph is growing rapidly, and (ii) vertices are always associated with complex attributes. In this paper, we propose an efficiently distributed framework to support node similarity search on massive graphs, which considers both graph structure correlation and node attribute similarity in metric spaces. The framework consists of preprocessing stage and query stage. In the preprocessing stage, a parallel KD-tree construction (KDC) algorithm is developed to form a newly defined graph so-called hybrid graph, in order to integrate node attribute similarity into the original graph. To equally divide graph vertices into subsets, KDC adopts the KD-tree partitioning after the pivot mapping. In addition, two metric pruning rules and an optimized allocation strategy are presented to reduce communication and computation costs. In the query stage, based on the formed hybrid graph, we develop similarity search methods using random walk with restart (RWR) to measure node similarity. To boost efficiency, we derive tight bounds to rapidly shrink the search region. Extensive experiments with three real massive graphs are conducted to verify the effectiveness, efficiency, and scalability of our proposed techniques.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

Notes

  1. As suggested in [13], restart probability c is empirically set as 0.5.

  2. Giraph is available at http://giraph.apache.org/.

References

  1. Batarfi, O., Shawi, R.E., Fayoumi, A.G., Nouri, R., Beheshti, S., Barnawi, A., Sakr, S.: Large scale graph processing systems: Survey and an experimental evaluation. Clust. Comput. 18(3), 1189–1213 (2015)

    Article  Google Scholar 

  2. Batko, M., Kohoutková, P., Novak, D.: Cophir image collection under the microscope. In: SISAP , pp 47–54 (2009)

  3. Boutet, A., Kermarrec, A., Mittal, N., Taïani, F.: Being prepared in a sparse world: The case of K NN graph construction. In: ICDE, pp. 241–252 (2016)

  4. Chen, L., Gao, Y., Li, X., Jensen, C.S., Chen, G.: Efficient metric indexing for similarity search. In: ICDE, pp. 591–602 (2015)

  5. Chen, L., Gao, Y., Chen, G., Zhang, H.: Metric all-k-nearest-neighbor search. IEEE Trans. Knowl. Data Eng. 28(1), 98–112 (2016)

    Article  Google Scholar 

  6. Chen, G., Yang, K., Chen, L., Gao, Y., Zheng, B., Chen, C.: Metric similarity joins using mapreduce. IEEE Trans. Knowl. Data Eng. 29(3), 656–669 (2017)

    Article  Google Scholar 

  7. Cheng, H., Zhou, Y., Yu, J.X.: Clustering large attributed graphs: A balance between structural and attribute similarities. TKDD 5(2), 12:1–12:33 (2011)

    Article  MathSciNet  Google Scholar 

  8. Cohen, S., Kimelfeld, B., Koutrika, G.: A survey on proximity measures for social networks. In: Search Computing - Broadening Web Search, pp. 191–206 (2012)

  9. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  10. Dong, W., Charikar, M., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: WWW, pp. 577–586 (2011)

  11. Dong, Y., Zhang, J., Tang, J., Chawla, N.V., Wang, B.: Coupledlp: Link prediction in coupled networks. In: SIGKDD, pp. 199–208 (2015)

  12. Fujiwara, Y., Nakatsuji, M., Onizuka, M., Kitsuregawa, M.: Fast and exact top-k search for random walk with restart. PVLDB 5(5), 442–453 (2012)

    Google Scholar 

  13. Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Mishima, T., Onizuka, M.: Efficient ad-hoc search for personalized pagerank. In: SIGMOD, pp. 445–456 (2013)

  14. Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: GraphX: Graph processing in a distributed dataflow framework. In: OSDI, pp. 599–613 (2014)

  15. Jeh, G., Widom, J.: Simrank: A measure of structural-context similarity. In: SIGKDD, pp. 538–543 (2002)

  16. Khemmarat, S., Gao, L.: Fast top-k path-based relevance query on massive graphs. IEEE Trans. Knowl. Data Eng. 28(5), 1189–1202 (2016)

    Article  Google Scholar 

  17. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: A framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)

    Google Scholar 

  18. Ma, H., Zhu, J., Lyu, M.R., King, I.: Bridging the semantic gap between image contents and tags. IEEE Trans. Multimedia 12(5), 462–473 (2010)

    Article  Google Scholar 

  19. Maehara, T., Akiba, T., Iwata, Y., Kawarabayashi, K.: Computing personalized pagerank quickly by exploiting graph structures. PVLDB 7(12), 1023–1034 (2014)

    Google Scholar 

  20. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)

  21. Meng, F., Rui, X., Wang, Z., Xing, Y., Cao, L.: Coupled node similarity learning for community detection in attributed networks. Entropy 20(6), 471 (2018)

    Article  Google Scholar 

  22. Pan, J., Yang, H., Faloutsos, C., Duygulu, P.: Automatic multimedia cross-modal correlation discovery. In: SIGKDD, pp. 653–658 (2004)

  23. Plaku, E., Kavraki, L.E.: Distributed computation of the k nn graph for large high-dimensional point sets. J. Parallel Distrib. Comput. 67(3), 346–359 (2007)

    Article  Google Scholar 

  24. Sarkar, P., Moore, A.W.: Fast nearest-neighbor search in disk-resident graphs. In: SIGKDD, pp. 513–522 (2010)

  25. Sarkar, P., Moore, A.W.: A tractable approach to finding closest truncated-commute-time neighbors in large graphs. arXiv:1206.5259(2012)

  26. Shao, B., Wang, H., Li, Y.: Trinity: A distributed graph engine on a memory cloud. In: SIGMOD, pp. 505–516 (2013)

  27. Shin, K., Jung, J., Sael, L., Kang, U.: Bear: Block elimination approach for random walk with restart on large graphs. In: SIGMOD, pp. 1571–1585 (2015)

  28. Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From “think like a vertex” to “think like a graph”. PVLDB 7(3), 193–204 (2013)

    Google Scholar 

  29. Trad, M.R., Joly, A., Boujemaa, N.: Distributed k NN-graph approximation via hashing. In: ICMR, p. 43 (2012)

  30. Wu, Y., Jin, R., Zhang, X.: Fast and unified local search for random walk based k-nearest-neighbor query in large graphs. In: SIGMOD, pp. 1139–1150 (2014)

  31. Xu, G., Fu, B., Gu, Y.: Point-of-interest recommendations via a supervised random walk algorithm. IEEE Intell. Syst. 31(1), 15–23 (2016)

    Article  Google Scholar 

  32. Yang, D., Zhang, D., Qu, B.: Participatory cultural mapping based on collective behavior data in location-based social networks. ACM TIST 7(3), 30 (2016)

    Google Scholar 

  33. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012)

  34. Zhang, C., Shou, L., Chen, K., Chen, G., Bei, Y.: Evaluating geo-social influence in location-based social networks. In: CIKM, pp. 1442–1451 (2012)

  35. Zhang, Q., Li, M., Deng, Y., Mahadevan, S.: Measure the similarity of nodes in the complex networks. arXiv:1502.00780 (2015)

  36. Zhang, Y., Huang, K., Geng, G., Liu, C.: Fast k NN graph construction with locality sensitive hashing. In: PKDD, pp. 660–674 (2013)

  37. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. PVLDB 2(1), 718–729 (2009)

    Google Scholar 

Download references

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2018YFB1004003, the NSFC under Grants No. 61972338 and 61802344, the NSFC-Zhejiang Joint Fund under Grant No. U1609217, and the ZJU-Hikvision Joint Project. Yunjun Gao is the corresponding author of the work.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yunjun Gao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, T., Gao, Y., Zheng, B. et al. Towards distributed node similarity search on graphs. World Wide Web 23, 3025–3053 (2020). https://doi.org/10.1007/s11280-020-00819-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11280-020-00819-6

Keywords

Navigation