Abstract
A framework is proposed to simultaneously cluster objects and detect anomalies in attributed graph data. Our objective function along with the carefully constructed constraints promotes interpretability of both the clustering and anomaly detection components, as well as scalability of our method. In addition, we developed an algorithm called Outlier detection and Robust Clustering for Attributed graphs (ORCA) within this framework. ORCA is fast and convergent under mild conditions, produces high quality clustering results, and discovers anomalies that can be mapped back naturally to the features of the input data. The efficacy and efficiency of ORCA is demonstrated on real world datasets against multiple state-of-the-art techniques.
Similar content being viewed by others
Notes
Our code and datasets are publicly available at https://gitlab.com/seswar3/orca.
References
Aggarwal, C.C.: Outlier analysis. In: Data Mining, pp. 237–263. Springer (2015). https://doi.org/10.1007/978-3-319-14142-8_8
Aggarwal, C.C.: An introduction to outlier analysis. In: Outlier Analysis, pp. 1–34. Springer (2017). https://doi.org/10.1007/978-3-319-47578-3_1
Akoglu, L., McGlohon, M., Faloutsos, C.: Oddball: spotting anomalies in weighted graphs. In: Advances in Knowledge Discovery and Data Mining, pp. 410–421 (2010). https://doi.org/10.1007/978-3-642-13672-6_40
Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3), 626–688 (2015). https://doi.org/10.1007/s10618-014-0365-y
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)
Bouwmans, T., Sobral, A., Javed, S., Jung, S.K., Zahzah, E.H.: Decomposition into low-rank plus additive matrices for background/foreground separation: a review for a comparative evaluation with a large-scale dataset. Comput. Sci. Rev. 23, 1–71 (2017). https://doi.org/10.1016/j.cosrev.2016.11.001
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM (JACM) 58(3), 11 (2011). https://doi.org/10.1145/1970392.1970395
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009). https://doi.org/10.1007/s10208-009-9045-5
Chakrabarti, D., Faloutsos, C.: Graph mining: laws, generators, and algorithms. ACM Comput. Surv. (CSUR) 38(1), 2-es (2006). https://doi.org/10.1145/1132952.1132954
Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. ACM (2006). https://doi.org/10.1145/1143844.1143874
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM (2004). https://doi.org/10.1145/1014052.1014118
Drineas, P., Frieze, A., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Mach. Learn. 56(1), 9–33 (2004). https://doi.org/10.1023/B:MACH.0000033113.59016.96
Du, R., Drake, B., Park, H.: Hybrid clustering based on content and connection structure using joint nonnegative matrix factorization. J. Glob. Optim. 74, 861–877 (2017). https://doi.org/10.1007/s10898-017-0578-x
Du, R., Kuang, D., Drake, B., Park, H.: Hierarchical community detection via rank-2 symmetric nonnegative matrix factorization. Computat. Soc. Netw. 4(1), 7 (2017). https://doi.org/10.1186/s40649-017-0043-5
Dunlavy, D.M., Kolda, T.G., Acar, E.: Temporal link prediction using matrix and tensor factorizations. ACM Trans. Knowl. Discov. Data (TKDD) 5(2), 1–27 (2011). https://doi.org/10.1145/1921632.1921636
Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010). https://doi.org/10.1016/j.physrep.2009.11.002
Gao, H., Chen, Y., Lee, K., Palsetia, D., Choudhary, A.N.: Towards online spam filtering in social networks. NDSS 12, 1–16 (2012). https://doi.org/10.1109/ICDM.2011.124
Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., Han, J.: On community outliers and their efficient detection in information networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 813–822. ACM (2010). https://doi.org/10.1145/1835804.1835907
Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., Faloutsos, C.: It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 663–671. ACM (2011). https://doi.org/10.1145/2020408.2020512
Huber, P.J.: Robust Statistics, vol. 523. Wiley, Hoboken (2004). https://doi.org/10.1002/9780470434697
Kannan, R., Ballard, G., Park, H.: Mpi-faun: an mpi-based framework for alternating-updating nonnegative matrix factorization. IEEE Trans. Knowl. Data Eng. 30(3), 544–558 (2018). https://doi.org/10.1109/TKDE.2017.2767592
Kannan, R., Woo, H., Aggarwal, C.C., Park, H.: Outlier detection for text data. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 489–497. SIAM (2017). https://doi.org/10.1137/1.9781611974973.55
Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Glob. Optim. 58(2), 285–319 (2014). https://doi.org/10.1007/s10898-013-0035-4
Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261–3281 (2011). https://doi.org/10.1137/110821172
Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM (JACM) 46(5), 604–632 (1999). https://doi.org/10.1145/324133.324140
Kuang, D., Yun, S., Park, H.: Symnmf: nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Glob. Optim. 62(3), 545–574 (2015). https://doi.org/10.1007/s10898-014-0247-2
Kumar, S., Hooi, B., Makhija, D., Kumar, M., Faloutsos, C., Subrahmanian, V.: Rev2: fraudulent user prediction in rating platforms. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 333–341. ACM (2018). https://doi.org/10.1145/3159652.3159729
Lee, D.D., Seung, H.S.: (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791. https://doi.org/10.1038/44565
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001). https://doi.org/10.5555/3008751.3008829
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th International Conference on World Wide Web, pp. 695–704. ACM (2008). https://doi.org/10.1145/1367497.1367591
Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th International Conference on World Wide Web, pp. 631–640. ACM (2010). https://doi.org/10.1145/1772690.1772755
Li, J., Dani, H., Hu, X., Liu, H.: Radar: Residual analysis for anomaly detection in attributed networks. In: IJCAI, pp. 2152–2158 (2017). https://doi.org/10.24963/ijcai.2017/299
Liu, N., Huang, X., Hu, X.: Accelerated local anomaly detection via resolving attributed networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence (2017). https://doi.org/10.24963/ijcai.2017/325
Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 496–503 (2003). https://doi.org/10.5555/3041838.3041901
Mahoney, M.W., Drineas, P.: Cur matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. 106(3), 697–702 (2009). https://doi.org/10.1073/pnas.0803205106
McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Inf. Retrieval 3(2), 127–163 (2000). https://doi.org/10.1023/A:1009953814988
Muller, E., Sánchez, P.I., Mulle, Y., Bohm, K.: Ranking outlier nodes in subspaces of attributed graphs. In: 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW), pp. 216–222. IEEE (2013). https://doi.org/10.1109/ICDEW.2013.6547453
Peng, Z., Luo, M., Li, J., Liu, H., Zheng, Q.: Anomalous: a joint modeling approach for anomaly detection on attributed networks. In: IJCAI, pp. 3513–3519 (2018). https://doi.org/10.5555/3304222.3304256
Pfeiffer III, J.J., Moreno, S., La Fond, T., Neville, J., Gallagher, B.: Attributed graph models: modeling network structure with correlated attributes. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 831–842. ACM (2014). https://doi.org/10.1145/2566486.2567993
Revelle, M., Domeniconi, C., Sweeney, M., Johri, A.: Finding community topics and membership in graphs. In: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 9285, pp. 625–640. Springer (2015). https://doi.org/10.1007/978-3-319-23525-7_38
She, Y., Owen, A.B.: Outlier detection using nonconvex penalized regression. J. Am. Stat. Assoc. 106(494), 626–639 (2011). https://doi.org/10.1198/jasa.2011.tm10390
Tong, H., Lin, C.Y.: Non-negative residual matrix factorization with application to graph anomaly detection. In: Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 143–153. SIAM (2011). https://doi.org/10.1137/1.9781611972818.13
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007). https://doi.org/10.1007/s11222-007-9033-z
Wang, G., Xie, S., Liu, B., Philip, S.Y.: Review graph based online store review spammer detection. In: 2011 IEEE 11th International Conference on Data Mining (ICDM), pp. 1242–1247. IEEE (2011)
Whang, J.J., Du, R., Jung, S., Lee, G., Drake, B., Liu, Q., Kang, S., Park, H.: Mega: multi-view semi-supervised clustering of hypergraphs. Proc. VLDB Endowment 13(5), 698–711 (2020). https://doi.org/10.14778/3377369.3377378
Wright, J., Ganesh, A., Rao, S., Peng, Y., Ma, Y.: Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. In: Advances in Neural Information Processing Systems, pp. 2080–2088 (2009). https://doi.org/10.5555/2984093.2984326
Xu, H., Caramanis, C., Sanghavi, S.: Robust PCA via outlier pursuit. In: Advances in Neural Information Processing Systems, pp. 2496–2504 (2010). https://doi.org/10.5555/2997046.2997174
Yu, R., He, X., Liu, Y.: Glad: group anomaly detection in social media analysis. ACM Trans. Knowl. Discov. Data (TKDD) 10(2), 18 (2015). https://doi.org/10.1145/2811268
Acknowledgements
This material is based in part upon work supported by the U.S. National Science Foundation (NSF) under Grant Nos. OAC-1642410, CCF-1533768, and OAC-1710371. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DOE.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan)
Rights and permissions
About this article
Cite this article
Eswar, S., Kannan, R., Vuduc, R. et al. ORCA: Outlier detection and Robust Clustering for Attributed graphs. J Glob Optim 81, 967–989 (2021). https://doi.org/10.1007/s10898-021-01024-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10898-021-01024-z