Information Sciences, Volume 575, October 2021, Pages 133-154
LaPOLeaF: Label propagation in an optimal leading forest

https://doi.org/10.1016/j.ins.2021.06.010

Abstract

This paper presents an efficient graph-based semi-supervised learning (GSSL) method that achieves its optimization objective without iteration. Most existing GSSL methods require iterative optimization to achieve a preset objective because they consider data points to be in peer-to-peer relationships. Additionally, existing GSSL methods must learn from scratch for unseen data because their graph structures are built specifically for a given dataset. By leveraging the partial-order relationships induced by the local density of the data and the distances between data points, we developed a novel label propagation algorithm based on the data structure of an optimal leading forest (OLeaF). Once an OLeaF has been constructed, the time complexity of our method is O(N) both for labeling the unclassified data in the dataset and for labeling newly arrived data. Therefore, the two main weaknesses of traditional GSSL are addressed. Additionally, the constructed leading forest offers good interpretability for the learning results. We scale the proposed method to big data by utilizing the block distance matrix technique and locality-sensitive hashing (LSH). Extensive experiments on datasets with different characteristics demonstrate the superior efficiency and competitive accuracy of the proposed method.

Introduction

Labeling data is laborious and expensive, yet a tremendous amount of unlabeled data is available. Therefore, semi-supervised learning (SSL) has attracted significant attention from the machine learning community [1], [2], [3], [4], [5], [6], [7], [8], [9]. Among the many types of SSL models, graph-based SSL (GSSL) has a reputation for being easy to understand through visual representations and for conveniently improving learning performance through matrix calculations. Therefore, there have been many studies in this area [1], [2], [3], [4], [5], [9], [10].

However, existing GSSL models have two apparent limitations. One is that such models typically need to solve an optimization problem in an iterative manner, resulting in low efficiency. Although some studies have focused on the efficiency of GSSL ([11]), the iterative paradigm remains unchanged. The other is that such models typically have difficulty in assigning labels to out-of-sample data because the solutions for unlabeled data are derived specifically for a particular graph. When new data are introduced, the graph changes and the entire iterative optimization process typically has to be re-executed.

We have considered the possible reasons for these limitations, and we hypothesize that the main issue is that existing models consider the relationships between neighboring data points to be peer-to-peer relationships. Because data points are considered to be equally significant in terms of representing their corresponding classes, most GSSL models attempt to optimize an objective function for each data point with equal priors or weights. However, such peer-to-peer relationships are questionable in many scenarios. For example, if a data point x_c lies at the center of the space corresponding to its class, then it has more representation power than another point x_d that diverges from the central location, even if x_c and x_d are in the same K-nearest neighbors (KNN) neighborhood. A number of researchers have noted the significance of this distinction. Recently, Li et al. proposed a measure called stability for ensemble clustering, in which they differentiated the objects within a cluster into core and halo objects [12].

Based on our doubts regarding peer-to-peer relationships, this study focused on the following partial-order assumptions. A) Neighboring data points do not have equal status. B) The label of a leader (or parent) is the weighted sum of the contributions of its followers (or children). Part A of this assumption is self-explanatory, and we elaborate on part B below. First, this assumption is intuitively reasonable, as indicated by the colloquialism “a man is known by the company he keeps.” The labels of peripheral data may change based on the model or parameter selection, but the labels of core data are relatively stable. Second, the similar concept of “a leader’s label is the weighted sum of the labels of its followers” is commonly observed in earlier machine learning literature, such as that on locally linear embedding [13], sparse representation [14], and anchor graph regularization [1]. Additionally, in Section 3, we will show that part B of the assumption is actually the manifold assumption modified for an OLeaF.
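To make part B concrete, the following minimal sketch (our illustration, not code from the paper) computes a leader's soft label as the similarity-weighted sum of its followers' label vectors. The Gaussian weighting by the child-to-leader distance and the bandwidth name dc are assumptions made for illustration.

```python
import numpy as np

def leader_label(child_labels, child_dists, dc=1.0):
    """Soft label of a leader as the weighted sum of its followers' labels.

    child_labels : (k, C) array, one soft-label row per child
    child_dists  : (k,) distances from each child to the leader
    dc           : bandwidth; closer children contribute more (assumed Gaussian)
    """
    w = np.exp(-(child_dists / dc) ** 2)   # similarity weight of each child
    label = w @ child_labels               # weighted sum of children's labels
    return label / max(w.sum(), 1e-12)     # normalize to a valid distribution

# Example: two followers of a leader, three classes
children = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
print(leader_label(children, np.array([0.5, 2.0])))  # dominated by the nearer child
```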

Mainstream methods for GSSL are closely related to spectral clustering [15]. As we will discuss in greater detail in Section 5.2, spectral clustering shares the same core concepts as the justifiable granulation principle [16] of the granular computing (GrC) community. GrC is a branch of granular mathematics in which theories, methodologies, techniques, and tools make use of multi-granularity (or multi-scale) data constructs called information granules (IGs) for problem solving [17]. The process of constructing IGs from raw data is called granulation. Among the various concrete granulation methods [18], [19], [20], local-density-based optimal granulation (LoDOG) [20] has the advantages of being non-iterative and highly accurate, regardless of the shapes of the IGs. Just as spectral clustering has inspired several GSSL methods, we adopted LoDOG to develop a novel GSSL method called label propagation in an optimal leading forest (LaPOLeaF). Fig. 1 presents the location of LaPOLeaF in the context formed by related existing works.

LaPOLeaF originates from LoDOG [20]. In LoDOG, the input data are organized into an optimal number of subtrees, and every non-center node in each subtree is led by its parent to join the micro-cluster to which the parent belongs. In an earlier related work [21], these subtrees were called leading trees (LTs); therefore, a collection of the optimal number of leading trees is called an OLeaF (a minimal construction sketch is given after the list below). LaPOLeaF performs label propagation on the structures of relatively independent subtrees in a forest, rather than on a traditional nearest-neighbor graph. Therefore, LaPOLeaF exhibits several advantages compared to other GSSL methods.

  • (a) LaPOLeaF performs label propagation in a non-iterative fashion, so it is highly efficient.

  • (b) It is convenient to learn a label for a new (out-of-sample) datum.

  • (c) The leading relationships between samples reflect the evolution from central positions to marginal zones within a particular class. Therefore, the interpretability of the learning results is enhanced.
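As referenced above, the following is a minimal O(N²) sketch of leading-tree construction in the density-peaks style of [21], [26]: every point is led by its nearest neighbor of higher local density. The Gaussian density estimate and the variable names are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def leading_tree(X, dc):
    """Build leading-tree parent pointers: every point is led by its
    nearest higher-density neighbor (a sketch of the idea in [21], [26]).

    Returns (rho, parent, delta): local density, the index of each
    point's leader (-1 for the global density peak, the root), and the
    distance delta to that leader.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    rho = np.exp(-(D / dc) ** 2).sum(axis=1)                    # Gaussian local density
    order = np.argsort(-rho)                                    # indices, densest first
    parent = np.full(n, -1)
    delta = np.full(n, np.inf)
    for rank, i in enumerate(order):
        if rank == 0:
            continue                           # the densest point has no leader
        denser = order[:rank]                  # all points denser than i
        j = denser[np.argmin(D[i, denser])]    # nearest denser point becomes the leader
        parent[i], delta[i] = j, D[i, j]
    return rho, parent, delta
```

Detaching the few nodes whose ρ·δ values stand out (the usual density-peaks criterion for cluster centers) turns the single tree into a forest of subtrees, each rooted at a micro-cluster center.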

The core concept of LaPOLeaF is applying the manifold assumption to an OLeaF to perform SSL tasks, resulting in the merits outlined above. Overall, the LaPOLeaF algorithm has a simple formulation and empirical evaluations demonstrate competitive accuracy with significantly enhanced efficiency.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. The LaPOLeaF model is presented in detail in Section 3. Section 4 describes the proposed method for scaling LaPOLeaF for big data. Section 5 analyzes computational complexity and discusses relationships with other research. Section 6 describes our experimental study. Section 7 investigates how the bandwidth parameter affects the output of LaPOLeaF. Conclusions are summarized in Section 8.

Section snippets

Related studies

Before reviewing related work and introducing our method, we first list the main notations we will use throughout this paper in Table 1.

LaPOLeaF

LaPOLeaF first performs a global optimization to construct an OLeaF, and then performs label propagation on each subtree. During each step of propagation, label information is transmitted between children and their parent within each subtree. The framework of LaPOLeaF is illustrated in Fig. 2.
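The exact propagation schedule is given in Section 3 of the paper; as a rough sketch only, under our own simplifying assumptions, one upward pass can aggregate weighted child labels into each parent, and one downward pass can fill unlabeled children from their parent:

```python
import numpy as np

def propagate_subtree(adj, root, labels, weights):
    """Two-pass propagation on one subtree (a simplified sketch, not the
    authors' exact algorithm).

    adj     : dict mapping a parent index to the list of its children
    labels  : (n, C) soft labels; an all-zero row means "unlabeled"
    weights : (n,) contribution weight of each node toward its parent
    """
    def upward(node):                     # post-order: children finish first
        for c in adj.get(node, []):
            upward(c)
            if labels[c].any():           # labeled children push weighted
                labels[node] += weights[c] * labels[c]   # evidence to the parent
    def downward(node):                   # pre-order: parents fill children
        for c in adj.get(node, []):
            if not labels[c].any():       # still-unlabeled children inherit
                labels[c] = labels[node].copy()
            downward(c)
    upward(root)
    downward(root)
    return labels
```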

Scalability of LaPOLeaF

To scale the LaPOLeaF model for big data applications, we propose two approaches. One uses a divide-and-conquer strategy and block matrix techniques to derive an exact solution, and the other is an approximation approach based on LSH. This concept is similar to the method presented by Zhang et al. [32], who scaled the DP-Clust model to large-scale data using map-reduce and LSH. However, the approach we present here trades memory usage for communication overhead. We assemble the blocked distance matrix from sub-blocks that are computed in a divide-and-conquer fashion.
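As a minimal sketch of the block-matrix idea (our illustration; the paper's exact assembly scheme and Theorem 1 are more involved), the full distance matrix can be filled one tile at a time so that only two data chunks are resident in working memory at once:

```python
import numpy as np

def blockwise_distances(X, block=1000):
    """Fill an N x N distance matrix one (block x block) tile at a time.
    Peak working memory is O(block^2 * d); the output store itself could
    be memory-mapped for big data."""
    n = len(X)
    out = np.empty((n, n), dtype=np.float32)
    for i in range(0, n, block):
        for j in range(i, n, block):       # upper-triangular tiles only
            A, B = X[i:i + block], X[j:j + block]
            tile = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
            out[i:i + block, j:j + block] = tile
            out[j:j + block, i:i + block] = tile.T   # mirror to the lower triangle
    return out
```

Each tile depends only on its two input chunks, so in a divide-and-conquer or map-reduce setting the tiles can be computed on separate workers and assembled afterwards.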

Complexity analysis

By investigating each step of Algorithm 1, we found that, with the exception of the calculation of the distance matrix, which requires exactly N(N-1)/2 basic operations (and can be accelerated using matrix computation via Theorem 1), all steps of LaPOLeaF have a time complexity that is linear in N (the size of X). Consider the label propagation part of the illustrative example on the double-moon dataset. First, 40 adjacency lists (ALs) representing the sub-LTs are computed by scanning the parent-pointer array once.
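The adjacency-list construction is linear because a single scan of the parent-pointer array suffices, as in this small sketch (the encoding of roots as parent[i] == -1 is an assumption):

```python
from collections import defaultdict

def adjacency_lists(parent):
    """One O(N) scan of the parent array yields the child list of every
    node, i.e., the ALs of all sub-LTs; roots are marked by parent[i] == -1."""
    adj, roots = defaultdict(list), []
    for i, p in enumerate(parent):
        (roots if p == -1 else adj[p]).append(i)
    return adj, roots
```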

Experimental studies

The efficiency and effectiveness of LaPOLeaF were evaluated on nine real-world datasets, three of which are small datasets from the UCI machine learning repository, while the others are larger. Information regarding the datasets is provided in Table 3. The three small datasets are used to demonstrate the effectiveness of LaPOLeaF, and the other four classification datasets are used to demonstrate its scalability and efficiency using the two scaling approaches described in Section 4.

How the parameter dc affects LaPOLeaF

The authors of [26] claimed that clustering results are robust to the parameter dc, but comments on that paper and later research have argued that dc can affect the clustering outputs to a considerable extent. Fig. 11 shows the sensitivity to the parameter percent (from which dc is derived) on three datasets: Wine, Iris, and ImageNet_2. Instead of simply labeling this parameter as sensitive or insensitive, we examine how it affects the results of LaPOLeaF.
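In the density-peaks literature that LaPOLeaF builds on, dc is commonly derived from a percent parameter as a small percentile of all pairwise distances; the sketch below follows that convention (the exact recipe used in the paper may differ):

```python
import numpy as np

def cutoff_distance(D, percent=2.0):
    """Choose dc so that roughly `percent`% of all pairwise distances fall
    below it (the common density-peaks heuristic for setting dc)."""
    upper = D[np.triu_indices_from(D, k=1)]   # each unordered pair counted once
    return float(np.percentile(upper, percent))
```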

Conclusions

Existing GSSL methods have two main weaknesses. One is low efficiency as a result of iterative optimization, and the other is the inconvenience of predicting labels for new data. This paper proposed the assumption that neighboring data points do not have equal status, but instead share partial-order relationships in the form of LTs. Additionally, we believe that the label of a center can be regarded as the weighted sum of the contributions of its followers. Based on these assumptions and the OLeaF structure constructed by LoDOG, we developed LaPOLeaF, a non-iterative label propagation method that readily accommodates out-of-sample data.

CRediT authorship contribution statement

Ji Xu: Writing - original draft, Conceptualization, Methodology, Investigation, Software. Tianrui Li: Writing - review & editing, Conceptualization. Yongming Wu: Writing - review & editing, Visualization, Software. Guoyin Wang: Writing - review & editing, Conceptualization, Methodology, Supervision.

Declaration of competing interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant 2016QY01W0200 and the National Natural Science Foundation of China under Grants 61966005 and 61936001. We are grateful to the anonymous reviewers for their valuable comments and constructive feedback.

References (49)

  • D. Wu et al., Self-training semi-supervised classification based on density peaks of data, Neurocomputing (2018).

  • J. Xu et al., Fat node leading tree for data stream clustering with density peaks, Knowl.-Based Syst. (2017).

  • M.J. Patwary et al., Sensitivity analysis on initial classifier accuracy in fuzziness based semi-supervised learning, Inf. Sci. (2019).

  • M.K. Goyal et al., Modeling of daily pan evaporation in subtropical climates using ANN, LS-SVR, fuzzy logic, and ANFIS, Expert Syst. Appl. (2014).

  • W. Liu et al., Large graph construction for scalable semi-supervised learning.

  • B. Ni et al., Learning a propagable graph for semisupervised learning: Classification and regression, IEEE Trans. Knowl. Data Eng. (2012).

  • M. Wang et al., Learning on big graph: Label inference and regularization with anchor hierarchy, IEEE Trans. Knowl. Data Eng. (2017).

  • B. Du et al., Robust graph-based semisupervised learning for noisy labeled data via maximum correntropy criterion, IEEE Trans. Cybern. (2019).

  • C. Gong et al., Ensemble teaching for hybrid label propagation, IEEE Trans. Cybern. (2019).

  • Y. Fujiwara, G. Irie, Efficient label propagation, in: International Conference on Machine Learning, 2014, pp. ...

  • S.T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, Science (2000).

  • J. Wright et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. (2009).

  • U.V. Luxburg, A tutorial on spectral clustering, Stat. Comput. (2007).

  • Y.Y. Yao, Granular computing: basic issues and possible solutions, in: Proceedings of the 5th joint conference on ...