Abstract
The volume of data generated by the internet and social networks is increasing every day, and there is a clear need for efficient ways of extracting useful information from it. As this information can take different forms, it is important to use all the available data representations for prediction; this is often referred to as multi-view learning. In this paper, we consider semi-supervised classification using both regular, plain, tabular data and structural information coming from a network structure (feature-rich networks). Sixteen techniques are compared; they can be divided into three families: the first uses only the plain features to fit a classification model, the second uses only the network structure, and the last combines both information sources. These three settings are investigated on 10 real-world datasets. Furthermore, network embedding and well-known autocorrelation indicators from spatial statistics are also studied. Possible applications are the automatic classification of web pages or other linked documents, of nodes in a social network, or of proteins in a biological complex system, to name a few. Based on our findings, we draw some general conclusions and offer advice on how to tackle this particular classification task: it is clearly observed that some dataset labelings are better explained by their graph structure, and others by their feature set.
Notes
Graph and network will be used interchangeably.
Recall that autocorrelation means that neighboring nodes tend to take similar values.
Hence the name autologistic.
The datasets are available at http://github.com/B-Lebichot/Research.
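The autocorrelation mentioned in the notes above can be illustrated with a minimal sketch of Moran's I, one of the classical spatial-statistics indicators, computed on a graph. It uses the textbook definition I = (n / S0) * (z' A z) / (z' z), where A is the adjacency matrix, S0 the sum of its entries, and z the centered node values; the exact variant used in the paper may differ. The function name `morans_i` and the toy path graph are assumptions made only for illustration.

```python
import numpy as np


def morans_i(adjacency, values):
    """Textbook Moran's I of a node-level variable on a (weighted) graph.

    Hypothetical helper for illustration; not taken from the paper.
    """
    A = np.asarray(adjacency, dtype=float)
    x = np.asarray(values, dtype=float)
    n = x.shape[0]
    z = x - x.mean()                  # centered node values
    total_weight = A.sum()            # S0: sum of all edge weights
    variance_term = (z ** 2).sum()
    if total_weight == 0 or variance_term == 0:
        raise ValueError("Moran's I is undefined for an empty graph or constant values")
    # Weighted cross-products of neighboring values, normalized by the variance
    return (n / total_weight) * (z @ A @ z) / variance_term


# Toy undirected path graph 0 - 1 - 2 - 3 (assumed example data)
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

clustered = morans_i(path, [1, 1, 0, 0])    # neighbors mostly agree
alternating = morans_i(path, [1, 0, 1, 0])  # neighbors always disagree
print(clustered, alternating)
```

A positive value indicates that neighboring nodes tend to take similar values, while a negative value indicates that they tend to differ; on this toy graph the clustered labeling gives a positive score and the alternating labeling a negative one.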
Acknowledgements
This work was partially supported by the Elis-IT project funded by the “Région wallonne” and the Brufence project supported by INNOVIRIS (“Région bruxelloise”), Belgium. We thank these institutions for giving us the opportunity to conduct both fundamental and applied research. We also thank the anonymous reviewers for their relevant remarks and suggestions, which helped us to significantly improve the manuscript.
Cite this article
Lebichot, B., Saerens, M. An experimental study of graph-based semi-supervised classification with additional node information. Knowl Inf Syst 62, 4337–4371 (2020). https://doi.org/10.1007/s10115-020-01500-0
DOI: https://doi.org/10.1007/s10115-020-01500-0