Article

A Two-Phase Algorithm for Robust Symmetric Non-Negative Matrix Factorization

School of Mathematical Sciences, Yuquan Campus, Zhejiang University, Hangzhou 310038, China
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(9), 1757; https://doi.org/10.3390/sym13091757
Submission received: 7 August 2021 / Revised: 16 September 2021 / Accepted: 17 September 2021 / Published: 20 September 2021

Abstract
As a special class of non-negative matrix factorization, symmetric non-negative matrix factorization (SymNMF) has been widely used in the machine learning field to mine the hidden non-linear structure of data. Due to the non-negative constraint and non-convexity of SymNMF, the efficiency of existing methods is generally unsatisfactory. To tackle this issue, we propose a two-phase algorithm to solve the SymNMF problem efficiently. In the first phase, we drop the non-negative constraint of SymNMF and propose a new model with penalty terms, in order to control the negative component of the factor. Unlike previous methods, the factor sequence in this phase is not required to be non-negative, allowing fast unconstrained optimization algorithms, such as the conjugate gradient method, to be used. In the second phase, we revisit the SymNMF problem, taking the non-negative part of the solution in the first phase as the initial point. To achieve faster convergence, we propose an interpolation projected gradient (IPG) method for SymNMF, which is much more efficient than the classical projected gradient method. Our two-phase algorithm is easy to implement, with convergence guaranteed for both phases. Numerical experiments show that our algorithm performs better than others on synthetic data and unsupervised clustering tasks.

1. Introduction

A large amount of real-world data, such as images and texts, is non-negative. Non-negative matrix factorization (NMF) [1] is a special low-rank factorization approach for such non-negative data. NMF aims to decompose a non-negative matrix $X \in \mathbb{R}^{m \times n}$ into the product of two non-negative matrices $W \in \mathbb{R}^{m \times r}$ ($r \ll \min\{m, n\}$) and $H \in \mathbb{R}^{n \times r}$; that is,
$$\min_{W, H \ge 0} \left\| X - W H^T \right\|_F^2, \qquad (1)$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Due to the non-negativity of W and H, NMF has better interpretability than unconstrained matrix factorization. For images, the columns of W can be interpreted as basis images, and the rows of H give the coefficients for combining these basis images into each original image. Since the number of basis images is much smaller than the dimensionality of the original images, H can be regarded as a low-dimensional representation of X. Generally, the clustering performance on H is better than that on the original data X. Owing to these advantages, NMF has been widely used in object recognition [2], face feature extraction [3], blind source separation [4], hyper-spectral image processing [5], and many other applications. In recent years, in order to improve the representation ability of NMF, a variety of constrained NMF models have been proposed, including sparse NMF [6], orthogonal NMF [7], and graph regularized NMF [8]. For a survey of NMF, we recommend [9,10].
In recent years, symmetric non-negative matrix factorization (SymNMF) has become a competitive low-rank approximation approach in pattern mining. Given a symmetric non-negative matrix $A \in \mathbb{R}^{n \times n}$, SymNMF aims to find a non-negative matrix $H \in \mathbb{R}^{n \times r}$, in order to solve
$$\min_{H \ge 0} \frac{1}{4}\left\| A - H H^T \right\|_F^2, \qquad (2)$$
where $r \ll n$ is given in advance. SymNMF can be regarded as a special form of NMF in (1) with $W = H$. However, there are significant differences between the two approaches. In SymNMF, the matrix to be decomposed is the affinity matrix of the data, rather than the data matrix itself as in NMF. SymNMF obtains a low-dimensional representation of the data, where the non-negativity constraint on the representation makes the class information easy to distinguish. When different classes cannot be separated linearly, SymNMF often performs better than the original NMF and its variants. In addition, SymNMF is closely related to fuzzy clustering [11]. Let $x_1, \dots, x_n$ be n independent data points, and let $C_1, \dots, C_r$ be the r clusters to which these data points belong. Assuming that the probability that data point $x_i$ belongs to cluster $C_j$ is $p_{ij}$, we have $\sum_{j=1}^r p_{ij} = 1$. Given a similarity matrix $A = (a_{ij})$ such that $a_{ij}$ equals the probability that $x_i$ and $x_j$ belong to the same cluster, it was proved in [12] that the SymNMF of A can recover $P = (p_{ij})$. Thus, a fuzzy clustering of $x_1, \dots, x_n$ can be obtained.
SymNMF has been proven to be effective in cluster inference [13], image segmentation [14], genetic data analysis [15], and so on. It performs better than classical clustering approaches, such as K-means [16], PCA [17] and spectral clustering [18].

1.1. Existing Methods

Due to its powerful pattern-recognition ability, SymNMF has attracted widespread attention since its introduction. Most algorithms for solving (2) are derived from variants of NMF algorithms, mainly due to the close relationship between SymNMF and NMF. Generally speaking, the mainstream SymNMF algorithms can be divided into three families: multiplicative update, coordinate descent, and asymmetric relaxation. Next, we briefly introduce the ideas and progress of each.
Multiplicative update methods. The earliest research on the use of SymNMF in data mining can be traced back to the work of Zass and Shashua [12]. They extended the multiplicative update (MU) algorithm from NMF to SymNMF. However, their algorithm for SymNMF performed poorly on real-world data sets, mainly because the MU algorithm has no convergence guarantee and the objective function declines very slowly. In [19], He et al. proposed two algorithms based on parallel MU; namely, $\alpha$-SNMF and $\beta$-SNMF. Unlike the matrix–vector operations used by Zass and Shashua, $\alpha$-SNMF and $\beta$-SNMF use matrix–matrix operations to update the non-negative factor, H. The experiments detailed in [19] demonstrated that the computational efficiency of $\alpha$-SNMF and $\beta$-SNMF is higher than that of previous algorithms. In [20], based on the method of He et al., the authors proposed an accelerated MU (AMU) algorithm, which combines the Nesterov method and the restart strategy. Experiments have shown that the acceleration strategy proposed in [20] is more effective than $\alpha$-SNMF and $\beta$-SNMF.
Coordinate descent methods. Inspired by the high efficiency of the coordinate descent method [21] in NMF, Vandaele et al. extended it to SymNMF [22]. The authors rewrote the symmetric non-negative matrix factorization as:
$$\min_{h_1, \dots, h_r \ge 0} \left\| A - \sum_{i=1}^r h_i h_i^T \right\|_F^2,$$
and updated one of $\{h_1, \dots, h_r\}$ in each step, where $h_i$ ($1 \le i \le r$) is the i-th column vector of H. This approach admits a closed-form solution in each iteration. Their algorithm is called the block coordinate descent (BCD) method. However, BCD is not guaranteed to converge to a stationary point of SymNMF, mainly due to the possibility of multiple solutions in its update. To overcome this weakness, Shi et al. proposed an inexact block coordinate descent (IBCD) algorithm [23], which updates variables by successively minimizing a sequence of approximations of the objective function. The IBCD algorithm and its variants are guaranteed to converge to stationary solutions under very mild conditions.
Asymmetric relaxation methods. Unlike the above direct algorithms, Kuang et al. [14] relaxed (2) to
$$\min_{W, H \ge 0} \left\| A - W H^T \right\|_F^2 + \lambda \left\| W - H \right\|_F^2, \qquad (3)$$
and solved it by updating W and H alternately, where $\lambda > 0$ is a pre-determined parameter. However, a stationary point $(W^*, H^*)$ of problem (3) does not directly provide a SymNMF. In [14], the authors enforced $W^* = H^*$ and adopted $H^* H^{*T}$ as the approximation of A; however, $H^*$ may not be a stationary point of problem (2). This defect of the asymmetric relaxation was resolved in [24], where Zhu et al. proved that, if the coefficient $\lambda$ in problem (3) is large enough, the stationary point $(W^*, H^*)$ automatically satisfies $W^* = H^*$, and $H^*$ is a stationary point of (2). In addition, Zhu et al. extended the hierarchical alternating least squares (HALS) method [24] from NMF to the symmetric case. According to their report, the HALS algorithm is superior to previous algorithms in terms of speed and clustering accuracy.
Most of the above algorithms are guaranteed to converge to a stationary point of problem (2). However, they share two common flaws:
  • Existing SymNMF algorithms always restrict the iterate sequence to be non-negative. Since non-negativity-constrained problems are harder to solve than unconstrained ones, the efficiency of existing algorithms is still unsatisfactory, even when coordinate descent or asymmetric relaxation is used to speed up execution.
  • The objective function of SymNMF is a quartic function with non-negativity constraints, which is strongly non-convex and has many local optima. Starting from different initial points, existing algorithms typically converge to different local optima, leading to a loss of robustness.
These two defects limit the application of SymNMF in practice. Hence, there is an urgent need for a new SymNMF algorithm with high efficiency and robustness.

1.2. Our Contributions

In this article, we aim to find a fast and robust algorithm for SymNMF. We achieve this goal through a two-phase strategy. In the first phase, we drop the non-negativity constraint and, instead, penalize the negative part of H, pushing H as close to a non-negative matrix as possible. The unconstrained optimization problem we propose for this phase can be solved quickly with a non-linear conjugate gradient method. In the second phase, we solve problem (2) directly, using the non-negative part of the factor obtained in the first phase as the initial point. For this phase, we propose an interpolation projected gradient (IPG) method for SymNMF. IPG uses a quadratic interpolation technique to determine the step length, which efficiently yields a stationary point of problem (2). Compared with existing algorithms, our two-phase method (TPM) has the following advantages:
  • In both phases, TPM works quickly and guarantees convergence. In general, the total running time of TPM is much smaller than that of existing methods.
  • Compared with existing algorithms, TPM can obtain a SymNMF with a lower approximation error and is more robust to the initial point.
  • In real-world unsupervised clustering tasks, TPM performs significantly better than existing methods. It not only achieves higher clustering accuracy in a shorter time, but also does so very consistently.
The rest of the paper is organized as follows: in Section 2, we present the proposed two-phase method, together with the convergence analysis of both phases; in Section 3, we illustrate the superiority of our algorithm through multiple simulations and real-world examples; finally, we provide our conclusions in Section 4.

2. Two-Phase Method for SymNMF

Due to the non-convexity and non-negativity constraint of (2), existing SymNMF algorithms face two dilemmas: low efficiency and dependence on the initial point. The two-phase algorithm we propose in this section addresses these dilemmas step by step. In the first phase, we propose an unconstrained model for SymNMF. It can be solved quickly using a non-linear conjugate gradient method and provides a non-negative factor H of A with a small approximation error. In the second phase, we propose a first-order algorithm, named the interpolation projected gradient (IPG) method, which is more efficient than the classical projected gradient method. Starting from the factor obtained in the first phase, IPG quickly converges to a stationary point of (2).

2.1. The Unconstrained Model for SymNMF

In the first phase of our algorithm, we drop the non-negativity constraint in (2) and add a penalty term, in order to ensure that the decomposition factor H of A contains as few negative entries as possible. Our proposed model is:
$$\min_{H \in \mathbb{R}^{n \times r}} \left\{ f(H) := \frac{1}{4}\left\| A - H H^T \right\|_F^2 + \frac{\lambda}{2}\left\| [H]_- \right\|_F^2 \right\}, \qquad (4)$$
where $\lambda > 0$ is a pre-determined parameter and $[H]_-$ denotes the negative part of H (i.e., $[H]_- = \min\{H, 0\}$, taken entrywise). Clearly, $f(H)$ is an unconstrained differentiable function, and its gradient is easily verified to be:
$$\nabla f(H) = (H H^T - A) H + \lambda [H]_-.$$
Hence, first-order optimization methods, such as non-linear conjugate gradient methods (NCG), can be used to solve (4).
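To make the model concrete, the following is a minimal NumPy sketch (ours, not the authors' MATLAB code) of the objective (4) and its gradient; the names f and grad_f and the test sizes are illustrative.

```python
import numpy as np

# Objective (4): f(H) = (1/4)||A - HH^T||_F^2 + (lam/2)||[H]_-||_F^2,
# with [H]_- = min(H, 0) taken entrywise.
def f(H, A, lam):
    residual = A - H @ H.T
    H_neg = np.minimum(H, 0)  # negative part [H]_-
    return (0.25 * np.linalg.norm(residual, 'fro') ** 2
            + 0.5 * lam * np.linalg.norm(H_neg, 'fro') ** 2)

# Gradient of (4): (HH^T - A)H + lam * [H]_-.
def grad_f(H, A, lam):
    return (H @ H.T - A) @ H + lam * np.minimum(H, 0)

# Quick sanity check on a random instance.
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 5))
A = np.abs(H) @ np.abs(H).T
print(f(H, A, lam=10.0), np.linalg.norm(grad_f(H, A, lam=10.0), 'fro'))
```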
Basically, NCG is of the form:
$$H_{k+1} = H_k + \alpha_k D_k, \qquad (5)$$
where the step length $\alpha_k > 0$ is obtained by some line search method, and the search direction $D_k$ is determined by:
$$D_k = \begin{cases} -\nabla f(H_k), & k = 1, \\ -\nabla f(H_k) + \beta_k D_{k-1}, & k > 1, \end{cases} \qquad (6)$$
where D k 1 is the search direction of the previous step. The step length α k and combination coefficient β k determine the convergence behavior of NCG. Among the various forms of β k in the literature, the Polak–Ribiere–Polyak (PRP) formula [25]
$$\beta_k^{\mathrm{PRP}} = \frac{\left\langle \nabla f(H_k),\ \nabla f(H_k) - \nabla f(H_{k-1}) \right\rangle}{\left\| \nabla f(H_{k-1}) \right\|_F^2} \qquad (7)$$
is popular, due to its high efficiency, where $\langle \cdot, \cdot \rangle$ in (7) is the matrix inner product; that is, for given matrices $A = (a_{ij})$ and $B = (b_{ij})$, $\langle A, B \rangle = \sum_{ij} a_{ij} b_{ij}$. However, for general continuously differentiable functions, the PRP formula does not guarantee convergence. Meanwhile, the direction $D_k$ determined by (7) may fail the descent property; that is, $\langle \nabla f(H_k), D_k \rangle \ge 0$ may occur for some k.
In order to improve the convergence properties of the PRP formula, we combine the PRP method with the steepest descent method, so that the search direction always satisfies the descent property. For the current iterate, $H_k$, we consider the search direction
$$D_k(p) := -\nabla f(H_k) + \frac{1}{2^p} \beta_k^{\mathrm{PRP}} D_{k-1},$$
where p is an integer. For $p = 0$, $D_k(p)$ coincides with the direction generated by the PRP formula, while as $p \to +\infty$, $D_k(p)$ approaches the negative gradient direction. Let us consider the function:
$$\phi_k(p) := \cos\left\langle D_k(p), -\nabla f(H_k) \right\rangle = \frac{\left\langle D_k(p), -\nabla f(H_k) \right\rangle}{\left\| D_k(p) \right\|_F \left\| \nabla f(H_k) \right\|_F}.$$
Since $\lim_{p \to +\infty} \phi_k(p) = 1$, for a given $\mu \in (0, 1)$, there must exist an integer p such that $\phi_k(p) > \mu$. For each k, we take:
$$D_k = D_k(p_k^*), \quad \text{s.t.}\ p_k^* = \arg\min\left\{ p \in \mathbb{N}_+ : \phi_k(p) > \mu \right\}, \qquad (8)$$
as the search direction in (5). For each k, the angle between $D_k$ and $-\nabla f(H_k)$ is then less than $\arccos(\mu)$.
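A small sketch of this safeguarded direction rule, under the same notation (G and G_prev stand for $\nabla f(H_k)$ and $\nabla f(H_{k-1})$; the cap on halvings is our own practical safeguard, not part of the paper):

```python
import numpy as np

# Damped PRP direction: divide the PRP coefficient by 2^p until the angle
# condition phi_k(p) > mu of (8) holds.
def damped_prp_direction(G, G_prev, D_prev, mu=1e-3, max_halvings=60):
    beta = np.sum(G * (G - G_prev)) / np.linalg.norm(G_prev, 'fro') ** 2  # PRP formula (7)
    for p in range(max_halvings):
        D = -G + (beta / 2 ** p) * D_prev
        cos_angle = np.sum(D * (-G)) / (np.linalg.norm(D, 'fro') * np.linalg.norm(G, 'fro'))
        if cos_angle > mu:  # phi_k(p) > mu: a sufficiently steep descent direction
            return D
    return -G  # fall back to steepest descent
```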
After $D_k$ is determined, we need to find a suitable step length $\alpha_k$. Generally, $\alpha_k$ should at least satisfy an inexact line search criterion. The weak Wolfe conditions [26],
$$f(H_k + \alpha D_k) \le f(H_k) + \rho \alpha \left\langle \nabla f(H_k), D_k \right\rangle, \qquad (9)$$
$$\left\langle \nabla f(H_k + \alpha D_k), D_k \right\rangle \ge \sigma \left\langle \nabla f(H_k), D_k \right\rangle, \qquad (10)$$
are commonly used, where $\rho$ and $\sigma$ satisfy $0 < \rho < \sigma < 1$. In our previous work [27], we proposed a simple strategy combining bisection with interpolation, which can be used to obtain $\alpha_k$ satisfying (9) and (10). Initially, we set $\underline{\alpha}_{k,0} = 0$, which satisfies (9) but not (10), and select a relatively large $\overline{\alpha}_{k,0} > 0$ that violates (9). Starting from the interval $[\underline{\alpha}_{k,0}, \overline{\alpha}_{k,0}]$, we iteratively generate a sequence of intervals $\{[\underline{\alpha}_{k,\ell}, \overline{\alpha}_{k,\ell}]\}$, such that each $\underline{\alpha}_{k,\ell}$ satisfies (9) but not (10), while $\overline{\alpha}_{k,\ell}$ does not satisfy (9). For the current interval $[\underline{\alpha}_{k,\ell}, \overline{\alpha}_{k,\ell}]$, we consider the quadratic function $q(\alpha)$ that satisfies the interpolation conditions:
$$q(\underline{\alpha}_{k,\ell}) = f(H_k + \underline{\alpha}_{k,\ell} D_k), \quad q'(\underline{\alpha}_{k,\ell}) = \left\langle \nabla f(H_k + \underline{\alpha}_{k,\ell} D_k), D_k \right\rangle, \quad q(\overline{\alpha}_{k,\ell}) = f(H_k + \overline{\alpha}_{k,\ell} D_k). \qquad (11)$$
The minimizer $c_{k,\ell} = \arg\min_\alpha q(\alpha)$ can be obtained directly as:
$$c_{k,\ell} = \underline{\alpha}_{k,\ell} - \frac{\left( \overline{\alpha}_{k,\ell} - \underline{\alpha}_{k,\ell} \right)^2 q'(\underline{\alpha}_{k,\ell})}{2\left[ q(\overline{\alpha}_{k,\ell}) - q(\underline{\alpha}_{k,\ell}) - \left( \overline{\alpha}_{k,\ell} - \underline{\alpha}_{k,\ell} \right) q'(\underline{\alpha}_{k,\ell}) \right]}. \qquad (12)$$
We slightly modify $c_{k,\ell}$ to:
$$\tilde{c}_{k,\ell} = \max\left\{ c_{k,\ell},\ \eta \underline{\alpha}_{k,\ell} + (1 - \eta) \overline{\alpha}_{k,\ell} \right\} \in \left( \underline{\alpha}_{k,\ell}, \overline{\alpha}_{k,\ell} \right),$$
where $\eta = \frac{\sigma}{2(\sigma - \rho)}$. If the Wolfe–Powell conditions (9) and (10) hold for $\alpha = \tilde{c}_{k,\ell}$, we set $\alpha_k = \tilde{c}_{k,\ell}$. Otherwise, we continue to shrink $[\underline{\alpha}_{k,\ell}, \overline{\alpha}_{k,\ell}]$ as follows:
$$[\underline{\alpha}_{k,\ell+1}, \overline{\alpha}_{k,\ell+1}] = \begin{cases} [\underline{\alpha}_{k,\ell}, \tilde{c}_{k,\ell}], & \text{if (9) does not hold for } \alpha = \tilde{c}_{k,\ell}; \\ [\tilde{c}_{k,\ell}, \overline{\alpha}_{k,\ell}], & \text{otherwise}. \end{cases} \qquad (13)$$
By Lemma 3 in [27], for a descent direction $D_k$, an $\alpha_k$ satisfying (9) and (10) can always be obtained in finitely many steps.
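A simplified one-dimensional sketch of this bisection–interpolation search (phi and dphi stand for $\alpha \mapsto f(H_k + \alpha D_k)$ and its derivative along $D_k$; the clipping of the trial point into the interval is an extra safeguard of ours):

```python
import numpy as np

# Weak Wolfe line search by interval shrinking with quadratic interpolation,
# following (11)-(13): lo satisfies (9) but not (10); hi violates (9).
def wolfe_line_search(phi, dphi, rho=0.1, sigma=0.4, hi0=1.0, max_iter=50):
    phi0, dphi0 = phi(0.0), dphi(0.0)
    lo, hi = 0.0, hi0
    eta = sigma / (2 * (sigma - rho))
    for _ in range(max_iter):
        # interpolation minimizer (12) from phi(lo), dphi(lo), phi(hi)
        denom = 2 * (phi(hi) - phi(lo) - (hi - lo) * dphi(lo))
        c = lo - (hi - lo) ** 2 * dphi(lo) / denom if abs(denom) > 1e-16 else 0.5 * (lo + hi)
        c = np.clip(max(c, eta * lo + (1 - eta) * hi), lo, hi)  # safeguarded trial point
        if phi(c) <= phi0 + rho * c * dphi0:   # sufficient decrease (9)
            if dphi(c) >= sigma * dphi0:       # curvature condition (10)
                return c
            lo = c                             # shrink per (13): c satisfies (9), not (10)
        else:
            hi = c                             # shrink per (13): c violates (9)
    return lo

# Example on a simple quadratic with minimum at alpha = 1.
alpha = wolfe_line_search(lambda a: (a - 1.0) ** 2, lambda a: 2.0 * (a - 1.0))
print(alpha)
```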
Summarizing the above strategies for determining $D_k$ and $\alpha_k$, we propose a modified non-linear conjugate gradient method for solving (4). The details are given in Algorithm 1, with the parameters set as $\rho = 0.1$, $\sigma = 0.4$, $\mu = 10^{-3}$, $\epsilon_1 = 10^{-4}$, and $k_{\max}^{(1)} = 500$. The sequence generated by Algorithm 1 converges to a stationary point of $f(H)$, according to the following lemma [28].
Algorithm 1 Modified non-linear conjugate gradient method for solving (4).
Require: initial point $H_0$; parameters $\lambda$, $0 < \rho < \sigma < 1$, $\mu \in (0, 1)$, $\epsilon_1 > 0$, $k = 0$; and maximum number of iterations $k_{\max}^{(1)}$.
Ensure: an approximate solution $H^*$.
1: While $k \le k_{\max}^{(1)}$:
2:     Compute $f(H_k)$ and $\nabla f(H_k)$, and set $D_k$ by (8).
3:     If $\|\nabla f(H_k)\| < \epsilon_1$ or $k = k_{\max}^{(1)}$, set $H^* = H_k$ and terminate the iteration.
4:     Otherwise, compute $\alpha_k$ by alternately repeating (12) and (13) until it satisfies (9) and (10), then update $H_{k+1} = H_k + \alpha_k D_k$.
5:     Set $k = k + 1$.
6: End while
Lemma 1.
Let each $\alpha_k$ satisfy the Wolfe–Powell conditions (9) and (10) and each $D_k$ satisfy
$$\cos\left\langle D_k, -\nabla f(H_k) \right\rangle \ge \mu$$
for a fixed $\mu \in (0, 1)$. If $\nabla f$ exists and is uniformly continuous on the level set $\{H : f(H) \le f(H_0)\}$, where $H_0$ is the initial point, then the sequence $\{H_k\}$ generated by (5) is convergent and $\lim_{k \to \infty} \|\nabla f(H_k)\|_F = 0$.
By the definition of $D_k$ in (8), it is obvious that $\cos\langle D_k, -\nabla f(H_k)\rangle \ge \mu$ holds for each k. Moreover, the Wolfe–Powell conditions (9) and (10) hold for each $\alpha_k$, by the result in [27]. Therefore, $f(H_k)$ generated by Algorithm 1 is monotonically decreasing and $\lim_{k \to \infty} \|\nabla f(H_k)\|_F = 0$.
Generally, Algorithm 1 converges very fast. However, the accumulation point $H^*$ it generates might not be non-negative, although it is approximately non-negative, due to the penalty term in (4). Hence, we directly truncate $H^*$ to $[H^*]_+ = \max\{H^*, 0\}$ and regard $[H^*]_+ [H^*]_+^T$ as an approximation of A.

2.2. The Interpolation Projected Gradient Method for SymNMF

The symmetric non-negative matrix $[H^*]_+ [H^*]_+^T$ obtained from Algorithm 1 approximates A well. However, we observe that $[H^*]_+$ is generally not a stationary point of the original SymNMF problem (2). This implies that we can find $H \ge 0$ such that $\|A - H H^T\|_F < \|A - [H^*]_+ [H^*]_+^T\|_F$. This observation inspires us to solve (2) directly, starting from $[H^*]_+$, until a stationary point of (2) is found.
Among all SymNMF algorithms, the projected gradient (PG) method [29] is simple to implement and has a low computational cost. However, PG suffers from slow convergence, mainly due to the inefficiency of the method it uses to determine the step length $\alpha_k$. In this subsection, we revisit PG and modify it into a significantly faster interpolation projected gradient (IPG) method for SymNMF. Let
$$g(H) = \frac{1}{4}\left\| A - H H^T \right\|_F^2.$$
It is easy to verify that its gradient has the form:
$$\nabla g(H) = (H H^T - A) H.$$
Given $H_k$ and $0 < \nu < 1$, PG searches for a step length $\alpha_k = \alpha$ such that
$$H_k(\alpha) := \left[ H_k - \alpha \nabla g(H_k) \right]_+$$
satisfies
$$g(H_k(\alpha)) \le g(H_k) + \nu \left\langle H_k(\alpha) - H_k,\ \nabla g(H_k) \right\rangle, \qquad (16)$$
and updates $H_{k+1} = H_k(\alpha_k)$. Given an initial step length $\hat{\alpha} > 0$, PG seeks $\alpha_k$ by backtracking: if $\hat{\alpha}$ does not satisfy (16), the step length is updated as $\hat{\alpha} \leftarrow \tau \hat{\alpha}$, where $\tau \in (0, 1)$ is a parameter. It has been proven in [30] that an $\alpha_k$ satisfying (16) can be found by backtracking in finitely many steps. However, PG is generally inefficient, as the backtracking strategy does not use the derivative information of $g(H)$.
Given that $\hat{\alpha}$ does not satisfy (16), let us consider the minimizer c of $g(H_k(\alpha))$ on $[0, \hat{\alpha}]$; that is,
$$c = \arg\min_{\alpha \in [0, \hat{\alpha}]} g(H_k(\alpha)).$$
It is not easy to find c exactly, as $g(H)$ is a quartic function and $H_k(\alpha)$ is a polyline. To tackle this issue, we approximate $g(H_k(\alpha))$ in two steps. First, we approximate $H_k(\alpha)$ with the straight line:
$$\hat{H}_k(\alpha) = H_k + \frac{\alpha}{\hat{\alpha}} \left( H_k(\hat{\alpha}) - H_k \right)$$
on $[0, \hat{\alpha}]$, noting that $\hat{H}_k(0) = H_k(0) = H_k$ and $\hat{H}_k(\hat{\alpha}) = H_k(\hat{\alpha})$. Then, we approximate $g(\hat{H}_k(\alpha))$ by a quadratic function $q(\alpha)$ satisfying the interpolation conditions:
$$q(0) = g(H_k), \quad q(\hat{\alpha}) = g(H_k(\hat{\alpha})), \quad q'(0) = \left\langle \nabla g(H_k),\ \frac{1}{\hat{\alpha}} \left( H_k(\hat{\alpha}) - H_k \right) \right\rangle.$$
Thus, $q(\alpha)$ can be seen as a rough approximation of $g(H_k(\alpha))$, and its minimizer admits a closed form. We have:
Lemma 2.
The quadratic function $q(\alpha)$ opens upward, and its minimizer is:
$$c = -\frac{\hat{\alpha} \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle}{2\left[ g(H_k(\hat{\alpha})) - g(H_k) - \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle \right]}. \qquad (18)$$
Proof. 
By the definition of $q(\alpha)$, it can be represented as:
$$q(\alpha) = g(H_k) + \frac{\left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle}{\hat{\alpha}}\, \alpha + \left[ g(H_k(\hat{\alpha})) - g(H_k) - \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle \right] \frac{\alpha^2}{\hat{\alpha}^2}.$$
To prove Lemma 2, we first show:
$$\left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle \le 0. \qquad (19)$$
To show (19), we denote $U = H_k(\hat{\alpha}) - H_k$, $V = H_k - \hat{\alpha} \nabla g(H_k)$, $W = H_k$, and $Y = \nabla g(H_k)$. Then, we have:
$$\left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle = \sum_{st} u_{st} y_{st},$$
and
$$u_{st} y_{st} = \begin{cases} -\hat{\alpha} y_{st}^2, & v_{st} \ge 0; \\ -w_{st} y_{st}, & v_{st} < 0. \end{cases}$$
For negative $v_{st} = w_{st} - \hat{\alpha} y_{st}$, we have $0 \le w_{st} < \hat{\alpha} y_{st}$; thus, $-w_{st} y_{st} \le 0$ always holds. Therefore, each $u_{st} y_{st}$ is non-positive, and (19) is true. As $\hat{\alpha}$ does not satisfy (16), we further obtain:
$$g(H_k(\hat{\alpha})) - g(H_k) > \nu \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle \ge \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle,$$
which implies that
$$g(H_k(\hat{\alpha})) - g(H_k) - \left\langle \nabla g(H_k),\ H_k(\hat{\alpha}) - H_k \right\rangle > 0.$$
Hence, $q(\alpha)$ opens upward, and its minimizer is directly given by (18).    □
To keep c from being too large or too small, we modify it to:
$$\hat{c} = \min\left\{ \max\left\{ c,\ \tau_1 \hat{\alpha} \right\},\ \tau_2 \hat{\alpha} \right\}, \qquad (20)$$
where $\tau_1$ and $\tau_2$, satisfying $0 < \tau_1 < \tau_2 < 1$, are two parameters. If $\hat{c}$ satisfies (16), we set $\alpha_k = \hat{c}$. Otherwise, we update:
$$\hat{\alpha} = \hat{c}, \qquad (21)$$
and repeat (18), (20), and (21) until we find an $\alpha_k$ satisfying (16).
We summarize the whole procedure of the IPG method in Algorithm 2, with the parameters set as $\nu = 0.1$, $\tau_1 = 0.01$, $\tau_2 = 0.1$, and $k_{\max}^{(2)} = 5000$. We also terminate the iteration when the optimality gap [23] is sufficiently small; that is, when:
$$\left\| H_k - \left[ H_k - (H_k H_k^T - A) H_k \right]_+ \right\|_\infty < \epsilon_2,$$
where $[\cdot]_+$ denotes the non-negative part of a matrix, $\|\cdot\|_\infty$ denotes the largest absolute value among all entries of a matrix, and $\epsilon_2 = 10^{-8}$. It has been shown in [31] that this gap equals zero if and only if a stationary point of (2) is reached. IPG is guaranteed to converge to a stationary point of (2); we have the following theorem:
Theorem 1.
Any accumulation point $H^{**}$ generated by Algorithm 2 is a stationary point of problem (2).
Proof. 
For the current $H_k$, we claim that a step length $\alpha_k$ satisfying (16) can be found within a finite number of iterations. If the initial $\hat{\alpha} = \max\{2\alpha_{k-1}, 10^{-3}\}$ satisfies (16), $\alpha_k$ is obtained directly. Otherwise, by (20) and (21), $\hat{\alpha}$ approaches 0 as it is updated. Meanwhile, there exists an interval $(0, \epsilon)$, with $\epsilon$ a sufficiently small positive number, such that (16) holds for every $\alpha$ in it. Hence, one can always find an $\alpha_k$ satisfying (16) within finitely many iterations. In addition, there exists $\alpha \in (\alpha_k, \alpha_k / \tau_1]$ that does not satisfy (16). These properties of $\alpha_k$ satisfy the convergence conditions of Theorem 11.5.5 in [28], from which Theorem 1 follows directly.    □
Algorithm 2 Interpolation projected gradient (IPG) method for SymNMF.
Require: initial point H; parameters $\nu \in (0, 1)$, $0 < \tau_1 < \tau_2 < 1$, $\epsilon_2 > 0$, $k = 0$; and maximum number of iterations $k_{\max}^{(2)}$.
Ensure: an approximate solution $H^*$.
1: While $k \le k_{\max}^{(2)}$:
2:     Compute $g(H_k)$ and $\nabla g(H_k)$. If $\left\| H_k - \left[ H_k - (H_k H_k^T - A) H_k \right]_+ \right\|_\infty < \epsilon_2$, terminate the iteration.
3:     Set $\hat{\alpha} = \max\{2\alpha_{k-1}, 10^{-3}\}$.
4:     Alternately repeat (18), (20), and (21) until $\hat{\alpha}$ satisfies (16). Then, set $\alpha_k = \hat{\alpha}$ and update $H_{k+1} = [H_k - \alpha_k \nabla g(H_k)]_+$.
5:     Set $k = k + 1$.
6: End while
7: $H^* = H_k$.
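The core of Algorithm 2 in code, as a NumPy sketch (the iteration cap inside the inner loop is our own safety net; by the argument in Theorem 1 the loop terminates in finitely many steps anyway):

```python
import numpy as np

def g(H, A):
    return 0.25 * np.linalg.norm(A - H @ H.T, 'fro') ** 2

def grad_g(H, A):
    return (H @ H.T - A) @ H

# One IPG iteration: shrink the trial step by the interpolation rule (18)
# with safeguard (20) and update (21) until condition (16) holds.
def ipg_step(H, A, alpha_hat, nu=0.1, tau1=0.01, tau2=0.1, max_tries=100):
    G, g_H = grad_g(H, A), g(H, A)
    for _ in range(max_tries):
        H_new = np.maximum(H - alpha_hat * G, 0)     # projected point H_k(alpha)
        S = H_new - H
        if g(H_new, A) <= g_H + nu * np.sum(S * G):  # sufficient decrease (16)
            return H_new, alpha_hat
        denom = g(H_new, A) - g_H - np.sum(S * G)    # positive by Lemma 2
        c = -0.5 * alpha_hat * np.sum(S * G) / denom # interpolation minimizer (18)
        alpha_hat = min(max(c, tau1 * alpha_hat), tau2 * alpha_hat)  # (20), (21)
    return H, alpha_hat
```

Each outer iteration would then start from $\hat{\alpha} = \max\{2\alpha_{k-1}, 10^{-3}\}$ and stop once the optimality gap drops below $\epsilon_2$.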
Compared with the classical PG method, IPG performs better. To see this, we generated 20 synthetic symmetric non-negative matrices and tested IPG and PG on them. Each generated matrix $A = H H^T$ was constructed from a random non-negative factor $H \in \mathbb{R}^{200 \times 50}$, with the entries of H uniformly distributed in $[0, 1]$. We set $\tau_1 = 0.01$ and $\tau_2 = 0.1$ for IPG, and selected $\tau = 0.01, 0.02, 0.05, 0.1$ for PG. The left panel of Figure 1 shows the average SymNMF error
$$E(H) = \frac{\left\| A - H H^T \right\|_F}{\left\| A \right\|_F}$$
of each algorithm versus CPU time. The efficiency of IPG was always higher than that of PG, regardless of the parameters of PG.

2.3. The Two-Phase Method for Robust SymNMF

Combining Algorithms 1 and 2, we propose our two-phase method (TPM) for SymNMF. Each phase plays a distinct role in obtaining a stable SymNMF. The first phase alleviates the non-convexity of the problem by discarding the non-negativity constraint; for this phase, we designed an NCG algorithm that quickly obtains a symmetric non-negative approximation of A. Although the convergence point of the first phase is not a stationary point of (2), it serves as a good initial point for the second phase. For the second phase, we proposed the IPG algorithm, which continues to reduce the SymNMF error and converges to a stationary point of (2) very quickly. The combination of these two phases enables us to solve the SymNMF problem efficiently. We summarize the whole procedure of TPM in Algorithm 3.
We also compared TPM with IPG on the synthetic symmetric non-negative matrices generated above, selecting $\lambda = 10, 20, 50, 100$ for TPM. The right panel of Figure 1 shows the SymNMF error of each algorithm versus CPU time. The efficiency of TPM was always higher than that of IPG, regardless of the parameter of TPM.
Algorithm 3 Two-phase method (TPM) for SymNMF.
Require: initial point H; parameters $\lambda$, $0 < \rho < \sigma < 1$, $\mu \in (0, 1)$, $\nu = 0.1$, $0 < \tau_1 < \tau_2 < 1$, $\epsilon_1 > 0$, $\epsilon_2 > 0$; and maximum numbers of iterations for each phase $k_{\max}^{(1)}$, $k_{\max}^{(2)}$.
Ensure: a SymNMF factor of A: $H^{**}$.
1: Starting with H, run Algorithm 1 until $\|\nabla f(H_k)\| < \epsilon_1$ or the number of iterations exceeds $k_{\max}^{(1)}$. Denote the last $H_k$ obtained by Algorithm 1 as $H^*$.
2: Set $H^*_+ = \max\{H^*, 0\}$.
3: Starting with $H^*_+$, run Algorithm 2 until $\left\| H_k - \left[ H_k - (H_k H_k^T - A) H_k \right]_+ \right\|_\infty < \epsilon_2$ or the number of iterations exceeds $k_{\max}^{(2)}$. Denote the last $H_k$ obtained by Algorithm 2 as $H^{**}$.
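Putting the pieces together, here is an end-to-end sketch of TPM with deliberate simplifications: phase one uses backtracking gradient descent in place of the NCG of Algorithm 1, and phase two reuses f, grad_f, and ipg_step from the sketches above; only the two-phase structure of Algorithm 3 is faithful.

```python
import numpy as np

# A simplified TPM driver (Algorithm 3). The iteration counts are illustrative.
def tpm(A, r, lam=10.0, iters1=500, iters2=2000, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], r))
    # Phase one: minimize the unconstrained model (4); backtracking gradient
    # descent stands in for Algorithm 1.
    for _ in range(iters1):
        G, f_H, a = grad_f(H, A, lam), f(H, A, lam), 1.0
        while a > 1e-12 and f(H - a * G, A, lam) > f_H - 1e-4 * a * np.sum(G * G):
            a *= 0.5
        H = H - a * G
    # Truncation [H*]_+ and phase two: IPG on problem (2).
    H, alpha = np.maximum(H, 0), 1.0
    for _ in range(iters2):
        H, alpha = ipg_step(H, A, max(2 * alpha, 1e-3))
    return H
```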

3. Numerical Experiments and Comparisons

In this section, we report the numerical performance of our method on both synthetic and real-world data, in comparison with five state-of-the-art SymNMF approaches: alternating non-negative least squares (ANLS) [14], block coordinate descent (BCD) [22], inexact block coordinate descent (IBCD) [23], hierarchical alternating non-negative least squares (HALS) [24], and the accelerated multiplicative update (AMU) algorithm [20]. For the synthetic data sets, we focused on the decline of the objective function and the convergence behavior near the stationary point. For the real-world data sets, besides these measurements, we also focused on the clustering performance, in terms of accuracy and stability.
For ANLS, BCD, IBCD, and HALS, we used the respective MATLAB code provided by the authors. For AMU, we wrote the code ourselves and adopted the default parameter settings recommended by the authors. We implemented all experiments in MATLAB 2020b on a Windows PC with a 1.60 GHz Intel Core i5-8250U CPU and 8 GB of RAM. The parameter $\lambda$ of TPM can be set automatically, based on the sparsity of the matrix to be decomposed. Generally, one can set $\lambda$ as
$$\lambda \approx \frac{10\, nnz}{n^2}, \qquad (22)$$
where $nnz$ is the number of non-zero entries in the matrix. In our experiments, according to (22), we simply set $\lambda = 10$ for the synthetic examples (dense matrices) and $\lambda = 0.01$ for the real-world data sets (sparse matrices).
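In code, the heuristic (22) amounts to one line (the function name is ours):

```python
import numpy as np

# Heuristic (22): lambda scales with the density nnz / n^2 of A.
def default_lambda(A):
    n = A.shape[0]
    return 10.0 * np.count_nonzero(A) / n ** 2  # ~10 for dense A, ~0.01 for very sparse A
```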

3.1. Synthetic Data

For the synthetic data sets, as in Section 2.2, we generated 20 symmetric non-negative matrices $A = H H^T$ by randomly constructing a non-negative factor $H \in \mathbb{R}^{n \times r}$, where $n = 200$, $r = 50$, and the entries of H are uniformly distributed in $[0, 1]$. Note that A is a completely positive matrix [32]; that is, A has an exact SymNMF.
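A sketch of this test-matrix generation (the seed is arbitrary):

```python
import numpy as np

# A = H H^T with H uniform in [0, 1]: a completely positive matrix that
# admits an exact rank-50 SymNMF.
rng = np.random.default_rng(2021)
H_true = rng.random((200, 50))
A = H_true @ H_true.T
```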
To comprehensively evaluate the various algorithms, we adopted two criteria for the performance of SymNMF. The first one was the SymNMF error:
$$E(t) = \frac{\left\| A - H(t) H(t)^T \right\|_F}{\left\| A \right\|_F}, \qquad (23)$$
where $H(t)$ denotes the factor H achieved by an algorithm, for a given initialization, within t seconds. The other criterion was the optimality gap:
$$G(t) = \left\| H(t) - \left[ H(t) - (H(t) H(t)^T - A) H(t) \right]_+ \right\|_\infty,$$
mentioned in Section 2.2. Recall that $G(t)$ equals zero if and only if $H(t)$ is a stationary point of (2). For each matrix, we executed all algorithms 10 times with different initial points and stopped them when their elapsed time exceeded 5 s. Each initial point was randomly generated in the form $\kappa^* H_0$ [22], where $H_0$ is uniformly distributed in $[0, 1]$ and $\kappa^*$ satisfies
$$\kappa^* = \arg\min_{\kappa > 0} \left\| A - (\kappa H_0)(\kappa H_0)^T \right\|_F = \sqrt{\frac{\left\langle A H_0, H_0 \right\rangle}{\left\| H_0^T H_0 \right\|_F^2}}.$$
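A sketch of this scaled initialization (assuming the square-root form reconstructed above):

```python
import numpy as np

# kappa* H0 with kappa* = sqrt(<A H0, H0> / ||H0^T H0||_F^2).
def scaled_init(A, r, rng):
    H0 = rng.random((A.shape[0], r))
    kappa = np.sqrt(np.sum((A @ H0) * H0) / np.linalg.norm(H0.T @ H0, 'fro') ** 2)
    return kappa * H0
```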
Figure 2 shows the average E ( t ) and G ( t ) over 200 trials for each algorithm. We observed the following:
  • The efficiency of TPM was much better than that of all other algorithms. This advantage was not only reflected in the decrease in the SymNMF error, but also in the speed at which the iteration approached a stationary point.
  • TPM was capable of obtaining a SymNMF of better quality than the other algorithms. The average SymNMF error of TPM was less than $10^{-5}$, while the best of the other SymNMF algorithms only achieved an average SymNMF error of about $10^{-3}$.
The above observations demonstrate that TPM not only converged faster than the other approaches but also obtained a SymNMF with lower error.

3.2. Real-World Data

In this subsection, we evaluated the performance of TPM and other SymNMF algorithms on real-world data sets from three perspectives: the SymNMF error, the clustering accuracy, and the stability. Given the affinity matrix $A \in \mathbb{R}^{n \times n}$ of a data set $\{x_1, \dots, x_n\}$, SymNMF searches for a non-negative factor $H \in \mathbb{R}^{n \times r}$ satisfying $A \approx H H^T$, where r is the number of classes in the data set. The cluster label of each data point is determined by the position of the largest entry in the corresponding row of H; that is,
$$\ell(x_i) = \arg\max_{1 \le j \le r} h_{ij},$$
where $h_{ij}$ ($1 \le i \le n$, $1 \le j \le r$) is the entry in the i-th row and j-th column of H.
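In code, this labeling rule is a single row-wise argmax:

```python
import numpy as np

# Cluster label of x_i: index of the largest entry in the i-th row of H.
def cluster_labels(H):
    return np.argmax(H, axis=1)
```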
We conducted experiments on four data sets: COIL-20 [33] (https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php, accessed on 29 March 2021), ORL [34] (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, accessed on 19 April 2021), PIE [35] (http://www.ri.cmu.edu/research_project_detail.html?project_id=418&menu_id=261, accessed on 27 April 2021), and TDT2 [36] (https://www.ldc.upenn.edu/collaborations/past-projects, accessed on 7 May 2021). Detailed information of these data sets is given in Table 1.
The affinity matrix for each data set was constructed following the procedures given in [14] (a code sketch follows the list below); that is,
  • For the image data sets COIL-20, ORL, and PIE, we defined:
    $$e_{ij} = \begin{cases} \exp\left( -\dfrac{\| x_i - x_j \|_2^2}{\sigma_i \sigma_j} \right), & i \in N_k(j) \text{ or } j \in N_k(i); \\ 0, & \text{otherwise}, \end{cases}$$
    where $N_k(i)$ is the set of k-nearest neighbors of $x_i$, and $\sigma_i$ is the Euclidean distance between $x_i$ and its $\hat{k}$-th neighbor. We adopted $\hat{k} = 7$, as suggested in [37]. The affinity between $x_i$ and $x_j$ was defined by
    $$a_{ij} = d_i^{-1/2}\, e_{ij}\, d_j^{-1/2}, \qquad (26)$$
    where $d_i = \sum_{s=1}^n e_{is}$.
  • For the document data set TDT2, all document vectors $x_i$ were normalized to have unit 2-norm, and we defined:
    $$e_{ij} = \begin{cases} x_i^T x_j, & i \in N_k(j) \text{ or } j \in N_k(i); \\ 0, & \text{otherwise}. \end{cases}$$
    The affinity between $x_i$ and $x_j$ was again defined as in (26).
The parameter k in $N_k(i)$ determines the sparseness of the affinity matrix. Unlike the recommended $k = \lfloor \log_2(n) \rfloor + 1$ in [38], we set $k = \lfloor \log_2(n / n_c) \rfloor + 1$, where $n_c$ is the number of classes in the data set. Compared with the original setting, our k is smaller, and the connected data points are more likely to be class-consistent.
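A sketch of this graph construction for the image data sets (dense pairwise distances for clarity; the symmetric assignment E[i, j] = E[j, i] realizes the "i in N_k(j) or j in N_k(i)" rule):

```python
import numpy as np

# Self-tuning affinity with degree normalization: e_ij = exp(-||xi - xj||^2 /
# (sigma_i sigma_j)) for neighbors, then a_ij = d_i^{-1/2} e_ij d_j^{-1/2}.
def affinity_matrix(X, k, k_hat=7):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    order = np.argsort(D, axis=1)                              # order[i, 0] is i itself
    sigma = D[np.arange(n), order[:, k_hat]]                   # local scale: k_hat-th neighbor
    E = np.zeros((n, n))
    for i in range(n):
        for j in order[i, 1:k + 1]:                            # j in N_k(i), excluding self
            E[i, j] = E[j, i] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(E.sum(axis=1), 1e-12))
    return d_inv_sqrt[:, None] * E * d_inv_sqrt[None, :]
```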
We executed all algorithms 20 times with different randomly constructed initial points and stopped them when their elapsed time exceeded the time limit. For COIL-20, ORL, PIE, and TDT2, according to their sizes, the time limits were set as 20, 10, 50, and 120 s, respectively. To compare the average performance of the different algorithms, we denote
$$\hat{E}(t) = E(t) - E_{\min},$$
where $E(t)$ is defined as in (23) and $E_{\min}$ is the smallest error obtained by all algorithms over all initializations. Figure 3 displays the average $\hat{E}(t)$ for each algorithm. In all cases, TPM performed best and generated solutions with the lowest SymNMF error among all algorithms.
Besides the SymNMF error, we also compared the clustering performance of each algorithm, measured by two criteria: average correction (AC) and normalized mutual information (NMI) [39]. The average values over the 20 trials are listed in Table 2. TPM performed better than the other algorithms: on COIL-20, ORL, PIE, and TDT2, it exceeded the highest clustering accuracy of the existing methods by 8.46%, 3.09%, 6.87%, and 6.15%, respectively. To further illustrate the clustering stability of all algorithms, we drew the results of the 20 runs from different starting points as a box plot, shown in Figure 4. Unlike the other algorithms, whose results had large variance, the clustering results of TPM were highly concentrated. This demonstrates that TPM is very robust to the initial point.

4. Conclusions

In this paper, we proposed a two-phase algorithm for SymNMF. Our method is more efficient and robust than state-of-the-art algorithms and performs well in real-world clustering tasks. Moreover, it is guaranteed to converge to stationary solutions of the SymNMF problem. However, some issues remain to be explored. First, although our algorithm is highly efficient, a theoretical analysis of its convergence rate is still needed. Second, in real-world clustering tasks, the structure of the graph can significantly affect the clustering results obtained by SymNMF, so the construction of the affinity graph deserves further study. Third, we need to consider how to reduce the time complexity and storage requirements of the algorithm when processing large-scale data; the distributed-memory parallel strategy proposed in [40] provides good inspiration. In the future, we intend to carry out further research on these issues.

Author Contributions

Funding acquisition, Z.Z.; methodology B.L.; supervision, Z.Z.; validation, B.L. and X.S.; writing—original draft, B.L.; writing—review and editing, B.L., X.S. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by NSFC project 11971430 and the Major Scientific Research Project of Zhejiang Lab (No. 2019KB0AB01).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, W.; Zheng, N. Non-negative matrix factorization based methods for object recognition. Pattern Recognit. Lett. 2004, 25, 893–897. [Google Scholar] [CrossRef]
  3. David, G.; Jordi, V. Non-negative Matrix Factorization for Face Recognition. In Proceedings of the 5th Catalonian Conference on AI: Topics in Artificial Intelligence, Castellón, Spain, 24–25 October 2002; pp. 336–344. [Google Scholar]
  4. Zdunek, R.; Cichocki, A. Non-negative Matrix Factorization with Quasi-Newton Optimization. In Proceedings of the Artificial Intelligence and Soft Computing, Zakopane, Poland, 25–29 June 2006; pp. 870–879. [Google Scholar]
  5. Jose, M.B.D.; Plaza, A.; Dobigeon, N.; Parente, M.; Du, Q.; Gader, P.; Chanussot, J. Hyperspectral Unmixing Overview: Geometrical, Statistical, and Sparse Regression-Based Approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 354–379. [Google Scholar]
  6. Hoyer, P.O. Non-Negative Matrix Factorization with Sparseness Constraints. J. Mach. Learn. Res. 2004, 5, 1457–1469. [Google Scholar]
  7. Ding, C.; Li, T.; Peng, W.; Park, H. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 126–135. [Google Scholar]
  8. Cai, D.; He, X.; Han, J.; Huang, T.S. Graph Regularized Nonnegative Matrix Factorization for Data Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1548–1560. [Google Scholar]
  9. Wang, Y.; Zhang, Y. Nonnegative Matrix Factorization: A Comprehensive Review. IEEE Trans. Knowl. Data Eng. 2013, 25, 1336–1353. [Google Scholar] [CrossRef]
  10. Esposito, F. A Review on Initialization Methods for Nonnegative Matrix Factorization: Towards Omics Data Experiments. Mathematics 2021, 9, 1006. [Google Scholar] [CrossRef]
  11. Yang, M.S. A survey of fuzzy clustering. Math. Comput. Model. 1993, 18, 1–16. [Google Scholar] [CrossRef]
  12. Zass, R.; Shashua, A. A unifying approach to hard and probabilistic clustering. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, 17–21 October 2005; pp. 294–301. [Google Scholar]
  13. Raviteja, V.; Kim, Ø.R.; Gopinath, C.; Boian, A. Determination of the number of clusters by symmetric non-negative matrix factorization. In Proceedings of the Big Data III: Learning, Analytics, and Applications, Online, 12 April 2021; Volume 11730, p. 15. [Google Scholar]
  14. Kuang, D.; Yun, S.; Park, H. SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Glob. Optim. 2015, 62, 545–574. [Google Scholar] [CrossRef]
  15. Ma, Y.; Hu, X.; He, T.; Jiang, X. Hessian Regularization Based Symmetric Nonnegative Matrix Factorization for Clustering Gene Expression and Microbiome Data. Methods 2016, 111, 80–84. [Google Scholar] [CrossRef]
  16. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  17. Jolliffe, I.T. Principal Component Analysis. J. Mark. Res. 2002, 87, 513. [Google Scholar]
  18. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On Spectral Clustering: Analysis and an Algorithm. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 14 April 2001; pp. 849–856. [Google Scholar]
  19. He, Z.; Xie, S.; Zdunek, R.; Zhou, G.; Cichocki, A. Symmetric Nonnegative Matrix Factorization: Algorithms and Applications to Probabilistic Clustering. IEEE Trans. Neural Netw. 2011, 22, 2117–2131. [Google Scholar]
  20. Wang, P.; He, Z.; Lu, J.; Tan, B.; Bai, Y.; Tan, J.; Liu, T.; Lin, Z. An Accelerated Symmetric Nonnegative Matrix Factorization Algorithm Using Extrapolation. Symmetry 2020, 12, 1187. [Google Scholar] [CrossRef]
  21. Wright, S.J. Coordinate Descent Algorithms. Math. Program. 2015, 151, 3–34. [Google Scholar] [CrossRef]
  22. Vandaele, A.; Gillis, N.; Lei, Q.; Zhong, K.; Dhillon, I. Efficient and Non-Convex Coordinate Descent for Symmetric Nonnegative Matrix Factorization. IEEE Trans. Signal Process. 2016, 64, 5571–5584. [Google Scholar] [CrossRef]
  23. Shi, Q.; Sun, H.; Lu, S.; Hong, M.; Razaviyayn, M. Inexact Block Coordinate Descent Methods for Symmetric Nonnegative Matrix Factorization. IEEE Trans. Signal Process. 2017, 65, 5995–6008. [Google Scholar] [CrossRef]
  24. Zhu, Z.; Li, X.; Liu, K.; Li, Q. Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization. arXiv 2018, arXiv:1811.05642. [Google Scholar]
  25. Polyak, B.T. The conjugate gradient method in extremal problems. USSR Comput. Math. Math. Phys. 1969, 9, 94–112. [Google Scholar] [CrossRef]
  26. Nocedal, J.; Wright, S. Numerical Optimization; Springer Science & Business Media: New York, NY, USA, 2006. [Google Scholar]
  27. Zhang, Z.; Li, B. Exterior Point Method for Completely Positive Factorization. arXiv 2021, arXiv:2102.08048. [Google Scholar]
  28. Sun, W.; Yuan, Y. Optimization Theory and Methods: Nonlinear Programming; Springer Science & Business Media: New York, NY, USA, 2006. [Google Scholar]
  29. Kuang, D.; Ding, C.; Park, H. Symmetric Nonnegative Matrix Factorization for Graph Clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining (SDM), Anaheim, CA, USA, 26–28 April 2012; pp. 106–117. [Google Scholar]
  30. Calamai, P.; More, J. Projected gradient methods for linearly constrained problems. Math. Program. 1987, 39, 93–116. [Google Scholar] [CrossRef]
  31. Razaviyayn, M.; Hong, M.; Luo, Z.Q.; Pang, J.S. Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1440–1448. [Google Scholar]
  32. Shaked-Monderer, N.; Berman, A. Copositive and Completely Positive Matrices; World Scientific: Hackensack, NJ, USA, 2021. [Google Scholar]
  33. Nene, S.; Nayar, S.; Murase, H. Columbia Object Image Library (COIL-20); Tech. Rep. CUCS-005-96; Columbia University: New York, NY, USA, 1996. [Google Scholar]
  34. Samaria, F.; Harter, A. Parameterisation of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
  35. Sim, T.; Baker, S.; Bsat, M. The CMU Pose, Illumination, and Expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washinton, DC, USA, 20–21 May 2002; pp. 46–51. [Google Scholar]
  36. Fiscus, J.; Doddington, G.; Garofolo, J.; Alvin, M. NIST’s 1998 Topic Detection and Tracking evaluation (TDT2). In Proceedings of the 1999 DARPA Broadcast News Workshop, Herndon, VA, USA, 28 February–3 March 1999; pp. 19–24. [Google Scholar]
  37. Zelnik-Manor, L.; Perona, P. Self-Tuning Spectral Clustering. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 1601–1608. [Google Scholar]
  38. Luxburg, U.V. A Tutorial on Spectral Clustering. Stat. Comput. 2004, 17, 395–416. [Google Scholar] [CrossRef]
  39. Wu, W.; Jia, Y.; Kwong, S.; Hou, J. Pairwise Constraint Propagation-Induced Symmetric Nonnegative Matrix Factorization. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 6348–6361. [Google Scholar] [CrossRef]
  40. Eswar, S.; Hayashi, K.; Ballard, G.; Kannan, R.; Vuduc, R.; Park, H. Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization. In Proceedings of the SC20: The International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020; pp. 1–14. [Google Scholar]
Figure 1. The average SymNMF error versus run-time on the synthetic data.
Figure 2. The average $E(t)$ and $G(t)$ (from left to right) of six SymNMF algorithms versus run-time on the synthetic data.
Figure 3. The average $\hat{E}(t)$ of six SymNMF algorithms versus run-time on real-world data sets.
Figure 4. Box plot of the clustering results of six algorithms on four real-world data sets.
Table 1. Data sets used in our experiments.

| Data Name | Type | Dimension | # Data Points | # Clusters |
|---|---|---|---|---|
| COIL-20 | object | 32 × 32 | 1440 | 20 |
| ORL | face | 92 × 112 | 400 | 40 |
| PIE | face | 32 × 32 | 2856 | 68 |
| TDT2 | document | 36,771 | 8741 | 20 |
Table 2. Clustering performance measured by AC (%) and NMI (%) on real-world data sets.

| Metric | Data Set | TPM | ANLS | BCD | IBCD | HALS | AMU |
|---|---|---|---|---|---|---|---|
| AC | COIL-20 | 82.98 | 74.52 | 60.25 | 65.77 | 56.16 | 71.04 |
| AC | ORL | 78.00 | 74.57 | 72.55 | 74.91 | 74.30 | 74.29 |
| AC | PIE | 86.91 | 80.04 | 48.32 | 50.27 | 56.61 | 78.80 |
| AC | TDT2 | 93.30 | 87.15 | 56.31 | 69.57 | 33.59 | 79.93 |
| NMI | COIL-20 | 91.63 | 88.13 | 78.75 | 81.00 | 73.22 | 86.61 |
| NMI | ORL | 90.20 | 89.01 | 87.82 | 88.80 | 88.67 | 88.77 |
| NMI | PIE | 94.96 | 93.11 | 76.56 | 77.53 | 81.26 | 92.55 |
| NMI | TDT2 | 88.33 | 85.81 | 62.18 | 70.53 | 34.69 | 80.92 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
