LaPOLeaF: Label propagation in an optimal leading forest
Introduction
Labeling data is laborious and expensive, and a tremendous amount of unlabeled data is available. Therefore, semi-supervised learning (SSL) has attracted significant attention from the machine learning community [1], [2], [3], [4], [5], [6], [7], [8], [9]. Among the many types of SSL models, graph-based SSL (GSSL) has a reputation for being easy to understand through visual representations and for conveniently improving learning performance by exploiting matrix calculations. Consequently, there have been many studies in this area [1], [2], [3], [4], [5], [9], [10].
However, existing GSSL models have two apparent limitations. One is that such models typically need to solve an optimization problem in an iterative manner, resulting in low efficiency. Although some studies have focused on the efficiency of GSSL ([11]), the iterative paradigm remains unchanged. The other is that such models typically have difficulty in assigning labels to out-of-sample data because the solutions for unlabeled data are derived specifically for a particular graph. When new data are introduced, the graph changes and the entire iterative optimization process typically has to be re-executed.
We have considered the possible reasons for these limitations, and we hypothesize that the main issue is that existing models treat the relationships between neighboring data points as peer-to-peer relationships. Because data points are considered to be equally significant in terms of representing their corresponding classes, most GSSL models attempt to optimize an objective function for each data point with equal priors or weights. However, such peer-to-peer relationships are questionable in many scenarios. For example, if a data point lies at the center of the space corresponding to its class, then it will have more representation power than another point that diverges from the central location, even if the two points lie in the same K-nearest neighbors (KNN) neighborhood. A number of researchers have noted the significance of this relationship. Recently, Li et al. proposed a measure called stability for ensemble clustering, in which they differentiated the objects within a cluster as core and halo objects [12].
Based on our doubts regarding peer-to-peer relationships, this study focused on the following partial-order assumptions. A) Neighboring data points do not have equal status. B) The label of a leader (or parent) is the weighted sum of the contributions of its followers (or children). Part A of this assumption is self-explanatory, and we elaborate on part B below. First, this assumption is intuitively reasonable, as indicated by the colloquialism “a man is known by the company he keeps.” The labels of peripheral data may change based on the model or parameter selection, but the labels of core data are relatively stable. Second, the similar concept of “a leader’s label is the weighted sum of the labels of its followers” can be commonly observed in the early machine learning literature, such as that on local linear embedding [13], sparse representation [14], and anchor graph regularization [1]. Additionally, in Section 3, we will show that Part B of the assumption is actually the manifold assumption modified for an optimal leading forest (OLeaF).
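As a toy illustration of part B of the assumption (the function name and weights below are hypothetical, not the paper’s exact formulation), a parent’s soft label can be computed as the normalized weighted sum of its children’s soft labels:

```python
# Hypothetical sketch of the partial-order assumption (part B): a parent's
# soft label is the similarity-weighted sum of its children's soft labels.
# The weighting scheme here is illustrative only.

def parent_label(child_labels, weights):
    """Combine the children's soft labels (lists of class scores) into the
    parent's soft label using normalized weights."""
    total = sum(weights)
    n_classes = len(child_labels[0])
    label = [0.0] * n_classes
    for lab, w in zip(child_labels, weights):
        for c in range(n_classes):
            label[c] += (w / total) * lab[c]
    return label

# Two children: one confidently class 0, one leaning toward class 1;
# the nearer child (larger weight) dominates the parent's label.
print(parent_label([[1.0, 0.0], [0.2, 0.8]], [0.7, 0.3]))
```

The parent’s label in this toy case is approximately [0.76, 0.24], i.e., it inherits the majority view of its followers weighted by their proximity.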
Mainstream methods for GSSL are closely related to spectral clustering [15]. As we will discuss in greater detail in Section 5.2, spectral clustering shares the same core concepts as the justifiable granulation principle [16] of the granular computing (GrC) community. GrC is a subset of granular mathematics in which the theories, methodologies, techniques and tools make use of the multi-granularity (or multi-scale) data constructs called information granules (IGs) for problem solving [17]. The process of constructing IGs from raw data is called granulation. Among various concrete granulation methods [18], [19], [20], local-density-based optimal granulation (LoDOG) [20] has the advantages of being non-iterative and highly accurate, regardless of the shapes of IGs. Just as spectral clustering has inspired several GSSL methods, we adopted LoDOG to develop a novel GSSL method called label propagation in an optimal leading forest (LaPOLeaF). Fig. 1 presents the location of LaPOLeaF in the context formed by related existing works.
LaPOLeaF originates from LoDOG [20]. In LoDOG, input data are organized into an optimal number of subtrees, and every non-center node in each subtree is led by its parent to join the micro-cluster to which the parent belongs. In an earlier related work [21], these subtrees were called leading trees (LTs). Therefore a collection of the optimal number of leading trees is called an OLeaF. LaPOLeaF performs label propagation on the structures of relatively independent subtrees in a forest, rather than on a traditional nearest neighbor graph. Therefore, LaPOLeaF exhibits several advantages compared to other GSSL methods.
- (a) LaPOLeaF performs label propagation in a non-iterative fashion, so it is highly efficient.
- (b) It is convenient to learn a label for a new (out-of-sample) datum.
- (c) The leading relationships between samples reflect the evolution process from neutral positions to marginal zones within a particular class. Therefore, the interpretability of the learning results is enhanced.
The core concept of LaPOLeaF is applying the manifold assumption to an OLeaF to perform SSL tasks, resulting in the merits outlined above. Overall, the LaPOLeaF algorithm has a simple formulation and empirical evaluations demonstrate competitive accuracy with significantly enhanced efficiency.
The remainder of this paper is organized as follows. Section 2 briefly reviews related work. The LaPOLeaF model is presented in detail in Section 3. Section 4 describes the proposed method for scaling LaPOLeaF for big data. Section 5 analyzes computational complexity and discusses relationships with other research. Section 6 describes our experimental study. Section 7 investigates how the bandwidth parameter affects the output of LaPOLeaF. Conclusions are summarized in Section 8.
Related studies
Before reviewing related work and introducing our method, we first list the main notations we will use throughout this paper in Table 1.
LaPOLeaF
LaPOLeaF first performs global optimization to construct an OLeaF, and then performs label propagation on each subtree. During each step of propagation, label information is transmitted between children and their parent in each atom tree. The framework of LaPOLeaF is illustrated in Fig. 2.
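A minimal sketch of the two propagation passes on one subtree is given below, under the simplifying assumption of unit edge weights (the real model weights each child’s contribution by similarity): labels first flow from children to parents (bottom-up), then from labeled parents to any still-unlabeled children (top-down).

```python
# Simplified two-pass label propagation on a single subtree with unit
# weights. children[i] lists the child indices of node i; labels[i] is a
# soft label (list of class scores) or None; order lists node indices
# from leaves to root.

def propagate(children, labels, order):
    # Pass 1: children -> parent (bottom-up averaging of labeled children).
    for i in order:
        if labels[i] is None:
            contrib = [labels[c] for c in children[i] if labels[c] is not None]
            if contrib:
                k = len(contrib[0])
                labels[i] = [sum(l[j] for l in contrib) / len(contrib)
                             for j in range(k)]
    # Pass 2: parent -> children (top-down fill of remaining nodes).
    for i in reversed(order):
        for c in children[i]:
            if labels[c] is None and labels[i] is not None:
                labels[c] = list(labels[i])
    return labels

# Tree: node 0 is the root with children 1 and 2; node 2 has child 3.
children = [[1, 2], [], [3], []]
labels = [None, [1.0, 0.0], None, None]   # only node 1 is labeled
print(propagate(children, labels, order=[3, 1, 2, 0]))
```

Starting from a single labeled leaf, both passes together label the entire subtree, which mirrors how LaPOLeaF spreads the scarce supervision over each atom tree without iteration.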
Scalability of LaPOLeaF
To scale the LaPOLeaF model for big data applications, we propose two approaches. One uses a divide-and-conquer strategy and block matrix techniques to derive an exact solution, and the other is an approximation approach based on locality-sensitive hashing (LSH). This concept is similar to the method presented by Zhang et al. [32], who scaled the DP-Clust model for large-scale data using MapReduce and LSH. However, the approach we present here trades memory usage for communication overhead. We assemble the blocked distance
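The divide-and-conquer idea can be sketched as follows (a hedged toy version: the paper’s actual block/merge scheme is only partially visible in this snippet, and the chunking here is illustrative): split the data into chunks, compute each pairwise block independently, then assemble the full distance matrix from the blocks.

```python
# Toy block-wise assembly of a distance matrix. Each (a, b) block could be
# computed on a separate worker; here they are computed serially and
# written into the full matrix at the appropriate offsets.
import math

def block_distance_matrix(chunks):
    """chunks: list of lists of points; returns the full distance matrix."""
    data = [p for chunk in chunks for p in chunk]
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    offs = [0]
    for chunk in chunks:
        offs.append(offs[-1] + len(chunk))
    for a, ca in enumerate(chunks):          # block row
        for b, cb in enumerate(chunks):      # block column
            for i, p in enumerate(ca):
                for j, q in enumerate(cb):
                    D[offs[a] + i][offs[b] + j] = math.dist(p, q)
    return D

D = block_distance_matrix([[(0.0, 0.0)], [(3.0, 4.0)]])
print(D[0][1])  # -> 5.0
```

Because each block depends only on its two chunks, the exact matrix is recovered regardless of how the data are partitioned, which is what makes the divide-and-conquer solution exact rather than approximate.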
Complexity analysis
By investigating each step of Algorithm 1, we found that, with the exception of the calculation of the distance matrix, which requires a quadratic number of basic operations (and can be accelerated using matrix computation via Theorem 1), all other steps in LaPOLeaF have a time complexity that is linear in the number of samples. Consider the label propagation part of the illustrative example on the double-moon dataset. First, 40 adjacent lists (ALs) representing sub-LTs are computed by scanning
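Theorem 1 itself is not reproduced in this snippet, but a standard identity that enables such matrix-based acceleration is ||x − y||² = ||x||² + ||y||² − 2 x·y, so the entire squared-distance matrix can be assembled from the row norms and one Gram (inner-product) matrix. The pure-Python check below verifies the identity on toy data; in practice the Gram matrix would be one dense matrix product.

```python
# Verify the norm/Gram identity behind vectorized distance computation:
# squared Euclidean distance equals ||x||^2 + ||y||^2 - 2 x.y.
import math

def sq_dist_direct(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sq_dist_gram(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return sum(a * a for a in x) + sum(b * b for b in y) - 2 * dot

X = [(0.0, 1.0), (2.0, 3.0), (4.0, 1.5)]
for x in X:
    for y in X:
        assert math.isclose(sq_dist_direct(x, y), sq_dist_gram(x, y))
print("identity holds on toy data")
```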
Experimental studies
The efficiency and effectiveness of LaPOLeaF were evaluated on nine real-world datasets, among which three are small datasets from the UCI machine learning repository, while the others are larger. Information regarding the datasets is provided in Table 3. The three small datasets are used to demonstrate the effectiveness of LaPOLeaF, and the other four classification datasets are used to demonstrate the scalability and efficiency of LaPOLeaF using the
How the bandwidth parameter affects LaPOLeaF
The authors of [26] claimed that clustering results are robust to the bandwidth parameter, but comments on that paper and later research works have argued that it can affect the clustering outputs to a considerable extent. Fig. 11 shows the sensitivity of LaPOLeaF to the parameter percent on the Wine, Iris, and ImageNet_2 datasets. Instead of simply labeling this parameter as sensitive or insensitive, we examine how it affects the results of LaPOLeaF.
Conclusions
Existing GSSL methods have two main weaknesses. One is low efficiency resulting from iterative optimization processes, and the other is the inconvenience of predicting labels for new data. This paper proposed the assumption that neighboring data points do not have equal status, but instead share partial-order relationships in the form of LTs. Additionally, we believe that the label of a center can be regarded as the weighted sum of the contributions of its followers. Based on these assumptions and a
CRediT authorship contribution statement
Ji Xu: Writing - original draft, Conceptualization, Methodology, Investigation, Software. Tianrui Li: Writing - review & editing, Conceptualization. Yongming Wu: Writing - review & editing, Visualization, Software. Guoyin Wang: Writing - review & editing, Conceptualization, Methodology, Supervision.
Declaration of competing interest
No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.
Acknowledgments
This work was supported by the National Key Research and Development Program of China under Grant 2016QY01W0200 and the National Natural Science Foundation of China under Grants 61966005 and 61936001. We are grateful to the anonymous reviewers for their valuable comments and constructive feedback.
References (49)
- A self-training hierarchical prototype-based approach for semi-supervised classification, Inf. Sci. (2020)
- Joint auto-weighted graph fusion and scalable semi-supervised learning, Inf. Fusion (2021)
- Learning adaptive criteria weights for active semi-supervised learning, Inf. Sci. (2021)
- Consensus rate-based label propagation for semi-supervised classification, Inf. Sci. (2018)
- Semi-supervised classification via simultaneous label and discriminant embedding estimation, Inf. Sci. (2021)
- Clustering ensemble based on sample’s stability, Artif. Intell. (2019)
- Building the fundamentals of granular computing: A principle of justifiable granularity, Appl. Soft Comput. (2013)
- Data description: A general framework of information granules, Knowl.-Based Syst. (2015)
- DenPEHC: Density peak based efficient hierarchical clustering, Inf. Sci. (2016)
- New label propagation algorithm with pairwise constraints, Pattern Recogn. (2020)
- Self-training semi-supervised classification based on density peaks of data, Neurocomputing
- Fat node leading tree for data stream clustering with density peaks, Knowl.-Based Syst.
- Sensitivity analysis on initial classifier accuracy in fuzziness based semi-supervised learning, Inf. Sci.
- Modeling of daily pan evaporation in sub tropical climates using ANN, LS-SVR, fuzzy logic, and ANFIS, Expert Syst. Appl.
- Large graph construction for scalable semi-supervised learning
- Learning a propagable graph for semisupervised learning: Classification and regression, IEEE Trans. Knowl. Data Eng.
- Learning on big graph: Label inference and regularization with anchor hierarchy, IEEE Trans. Knowl. Data Eng.
- Robust graph-based semisupervised learning for noisy labeled data via maximum correntropy criterion, IEEE Trans. Cybern.
- Ensemble teaching for hybrid label propagation, IEEE Trans. Cybern.
- Nonlinear dimensionality reduction by locally linear embedding, Science
- Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell.
- A tutorial on spectral clustering, Stat. Comput.