1 Introduction

Principal component analysis (PCA) seeks a low-rank approximation to an input data matrix. It is arguably the most popular statistical tool in data analysis and is very effective at handling small random Gaussian noise. PCA can be accomplished easily via the singular value decomposition (SVD). However, it is also well known that PCA lacks robustness with respect to large erratic noise, i.e., a single corrupted entry can result in an approximation that is far away from the true solution. Robust PCA (RPCA) has been proposed to remove the effect of sparse gross errors. In e.g., Candes et al. (2010), Chandrasekaran et al. (2009), Cheng et al. (2015), the intractable RPCA problem and its convex relaxation are introduced, and it is shown that the two problems are equivalent, with high probability, under certain conditions on the input data. Specifically, RPCA aims to express a given data matrix \(Z\in {\mathbb {R}}^{m\times n}\) as the sum \(Z=L^*+S^*\), where \(L^*\) is the low-rank approximation to Z, and \(S^*\) is a sparse matrix that captures the additive erratic noise. Throughout this paper, we assume that \(r={{\,\mathrm{{rank}}\,}}(L^*) \ll \min (m,n)\), and that \(\mu >0\) is a given positive parameter. RPCA is in a sense a multicriteria (two criteria) problem in that it attempts to find a low-rank approximation as well as a sparse approximation. Finding Pareto optimal points can be formulated as the following weighted optimization problem:

$$\begin{aligned} \begin{aligned} \min&{{\,\mathrm{{rank}}\,}}(L) + \mu \Vert S\Vert _0\\ \text {s.t. }&L+S = Z, \end{aligned} \end{aligned}$$
(1)

where the cardinality function \(\Vert S\Vert _0\) counts the number of nonzeros of S. This problem is NP-hard due to the combinatorial nature of the \({{\,\mathrm{{rank}}\,}}\) and the cardinality functions. It is shown in e.g., Candes et al. (2010), Chandrasekaran et al. (2009), that under certain conditions (1) is equivalent, with high probability, to the following convex program:

$$\begin{aligned} \begin{aligned} \min&\Vert L\Vert _* + \mu \Vert S\Vert _1\\ \text {s.t. }&L+S = Z. \end{aligned} \end{aligned}$$
(2)

Here \(\Vert L\Vert _*\) is the nuclear norm of L, the sum of the singular values of L, and the (vector) \(\ell _1\)-norm is \(\Vert S\Vert _1:=\sum _{ij}|S_{ij}|\). The convex program (2) is known as robust principal component pursuit, RPCP.
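For concreteness, both surrogate functions in (2) are cheap to evaluate: the nuclear norm is the sum of singular values and the \(\ell _1\)-norm is an entrywise sum. The following small numpy snippet (purely illustrative; the random data are our own) computes \(\Vert L\Vert _*\), \(\Vert S\Vert _1\) and \(\Vert S\Vert _0\) for a rank-2 matrix with a single gross error.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # a rank-2 matrix
S = np.zeros((5, 4)); S[1, 3] = 10.0                           # a single gross error

nuclear_norm = np.linalg.svd(L, compute_uv=False).sum()  # ||L||_* = sum of singular values
ell1_norm = np.abs(S).sum()                              # ||S||_1 = sum_{ij} |S_ij|
cardinality = np.count_nonzero(S)                        # ||S||_0 = number of nonzeros
```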

In practice, it is possible that the matrix Z is only partially observed. That is, there exists a subset \(\hat{E}\) of the indices such that only entries \(Z_{ij}\) for \((i,j)\in {\hat{E}}\) are given. In this case, RPCA (1) and RPCP (2) need to be changed, respectively, to

$$\begin{aligned} ({\mathbf{RPCA}}) \qquad \begin{array}{lll} F_{1}(L,S) &:= \min & {{\,\mathrm{{rank}}\,}}( L ) +\mu \Vert S \Vert _0 \\ &\quad \text {s.t. } & {\mathcal {P}} _{\hat{E}}(L+S) = z, \end{array} \end{aligned}$$
(3)

and

$$\begin{aligned} ({\mathbf{RPCP}}) \qquad \begin{array}{cl} \min & \Vert L\Vert _* + \mu \Vert S\Vert _1 \\ \text {s.t. } & {\mathcal {P}} _{\hat{E}}(L+S) = z, \end{array} \end{aligned}$$
(4)

where \({\mathcal {P}} _{\hat{E}}:{\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{{\hat{E}}}\) denotes the projection onto the components with indices in \({\hat{E}}\), and the data \(z = {\mathcal {P}} _{\hat{E}}(Z)\).

The convex relaxations (2) and (4) can be reformulated as semidefinite programming (SDP) problems, e.g., Recht et al. (2010), Candes et al. (2010). Thus, in principle they can be solved by, e.g., interior point methods. However, RPCA problems arising from real applications are usually of huge scale, and interior point methods generally do not scale well. As a result, current research on algorithms for RPCA and RPCP is focused on first-order methods, see e.g., the recent survey papers (Bouwmans and Zahzah 2014; Aybat 2016; Ma and Aybat 2018). However, it is also known that first-order methods generally cannot provide high-accuracy solutions.

1.1 Main contributions

In this paper we propose a new approach to solve RPCA with partially observed data, (3). We use semidefinite programming and a novel cone facial reduction (FR) technique applied to a reformulation of (3). As a result, the size of the feasible region is significantly reduced and an efficient algorithm is used to obtain highly accurate solutions. This extends the approach in Huang and Wolkowicz (2018) for low-rank matrix completion, where no sparse gross errors are considered. In addition, our facial reduction technique is applied to the nonconvex model (3) directly, rather than within a nuclear norm minimization model as done in Huang and Wolkowicz (2018). A key ingredient here is the use of the exposing vector approach for characterizing faces of the semidefinite cone, \({{\mathcal {S}} ^n_+\,}\); see Drusvyatskiy and Wolkowicz (2017), Drusvyatskiy et al. (2017), Huang and Wolkowicz (2018). Here we develop a new technique to grow bicliques by submatrix completion. Our tests show that this new technique greatly increases the speed of the completion process and allows us to find more missing entries. In particular, we see that in many cases, even under relatively low density of the sampled data, we get exact recovery without using any SDP solver. This provides a distinct improvement over simply using the convex relaxation approach, RPCP.

1.2 Connections and differences with previous work

Our work is an extension of the previous work in Huang and Wolkowicz (2018), which only considers facial reduction on the convex approximation of the rank minimization problem. In this paper, we show in Lemma 1 that the nonconvex RPCA problem (3) is equivalent to the nonconvex SDP problem (5), and in Proposition 1 we obtain the exposing vectors for the optimal face of the nonconvex SDP problem, rather than for the convex approximation as in Huang and Wolkowicz (2018). Note that our goal is to solve the nonconvex problem in order to recover the true matrices. After facial reduction, the optimal solution clearly lies in a smaller space. Hence the convex relaxation of the now smaller-size problem generally has a better chance of recovering the true matrices than the convex relaxation of the original large-size problem. This is validated by the numerical tests in Sect. 6. Secondly, we extend the method in Huang and Wolkowicz (2018) to include the sparse component S, and we show that we can reduce the size of S using exposing vectors obtained from the low-rank component L. Thirdly, we propose a novel approach of growing bicliques in Sect. 5. This technique to grow bicliques is much more efficient when compared to previous work in the literature.

1.2.1 Organization

Further background on RPCA is given in Sect. 2; this includes SDP reformulations and graph representations. In Sect. 3 we discuss how to obtain accurate solutions for small fully sampled submatrices. This is the key to obtaining highly accurate exposing vectors for the faces used in reducing the size of the complete problem, while still maintaining low error growth that could arise due to noise. Section 4 presents the main results used in the paper. This includes facial reduction, bicliques and exposing vectors, as well as the algorithm for solving the RPCA problem accurately. In Sect. 5 we discuss a heuristic for growing the size of bicliques using a submatrix approach. We present the numerical results in Sect. 6 for both noiseless and noisy data. We conclude in Sect. 7.

2 Background

In this paper we focus on the nonconvex and nonsmooth model RPCA  in (3). Our approach starts with a reformulation using an additional semidefinite cone constraint.

2.1 Reformulating RPCA with SDP

In this section we show that (3) is equivalent to the following nonconvex and nonsmooth SDP problem:

$$\begin{aligned} \begin{aligned} {F_2(Y,S)} :=&\displaystyle {\min _{Y,S}} \, {{\,\mathrm{{rank}}\,}}(Y)+ \mu \Vert S\Vert _0\\&\text {s.t. }\,\, {\mathcal {P}} _{\hat{E}}(L + S) = z\\&\quad Y = \begin{bmatrix} W_1&L \\ L^\top&W_2 \end{bmatrix} \succeq 0. \end{aligned} \end{aligned}$$
(5)

Lemma 1

Problems (3) and (5) are equivalent in the sense that they have the same optimal solution pair \((L^*,S^*)\) and the same optimal objective value. In particular, the optimal \(L^*\) is a submatrix of the optimal \(Y^*\).

Proof

Suppose \((L^*,S^*)\) is the optimal solution of (3) with \({{\,\mathrm{{rank}}\,}}(L^*)=r<\min (m,n)\), and the compact SVD of \(L^*\) is given by \(L^*=U\varSigma V^\top\), where both \(U\in {\mathbb {R}}^{m\times r}\) and \(V\in {\mathbb {R}}^{n\times r}\) have orthonormal columns, and \(\varSigma \in {\mathbb {R}}_{++}^{r\times r}\) is a diagonal matrix with the positive singular values of \(L^*\) on its diagonal. We claim that \((Y^*,S^*)\) is an optimal solution pair of (5) with

$$\begin{aligned} Y^* = \begin{bmatrix} U \\ V \end{bmatrix} \varSigma \begin{bmatrix} U \\ V \end{bmatrix}^\top = \begin{bmatrix} U\varSigma U^\top&U\varSigma V^\top \\ V\varSigma U^\top&V\varSigma V^\top \end{bmatrix} = \begin{bmatrix} U\varSigma U^\top&L^* \\ (L^*)^\top&V\varSigma V^\top \end{bmatrix}, \end{aligned}$$
(6)

from which we get that \({{\,\mathrm{{rank}}\,}}(Y^*)={{\,\mathrm{{rank}}\,}}(L^*)=r\), and \(L^*\) is the upper right corner block of \(Y^*\). This shows that for an optimal \(Y^*\) we have \({{\,\mathrm{{rank}}\,}}(Y^*)={{\,\mathrm{{rank}}\,}}(L^*)=r\); and it emphasizes that the only change in the two problems is the addition of the semidefinite constraint.

The equivalence follows from: if there exists a better solution \((\hat{Y},\hat{S})\), with the spectral decomposition of \(\hat{Y}\) given by

$$\begin{aligned}{\hat{Y}} = \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix} \hat{\varSigma } \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix}^\top = \begin{bmatrix} \hat{U}\hat{\varSigma }\hat{U}^\top&\hat{U}\hat{\varSigma }\hat{V}^\top \\ \hat{V}\hat{\varSigma }\hat{U}^\top&\hat{V}\hat{\varSigma }\hat{V}^\top \end{bmatrix},\end{aligned}$$

then we have

$$\begin{aligned} F_1(\hat{U}\hat{\varSigma }\hat{V}^\top , \hat{S}) \le F_2(\hat{Y},\hat{S}) < F_2(Y^*,S^*) = r+\mu \Vert S^*\Vert _0 = F_1(L^*,S^*), \end{aligned}$$

where the first inequality is due to the fact \({{\,\mathrm{{rank}}\,}}(\hat{U}\hat{\varSigma }\hat{V}^\top )\le {{\,\mathrm{{rank}}\,}}(\hat{\varSigma })={{\,\mathrm{{rank}}\,}}(\hat{Y})\). This is a contradiction. \(\square\)

Remark 1

Lemma 1 indicates that, in order to solve (3), we can solve (5) instead to obtain \((Y^*,S^*)\). Then \((L^*,S^*)\) solves (3), where \(L^*\) is obtained from the upper right corner block of \(Y^*\). It appears that (5) is harder than (3) as it is much larger. However, the fact that the optimal \(L^*\) and \(Y^*\) are both of low rank enables us to greatly reduce the size of (5) to a new problem whose size is much smaller than (3). This is done by exploiting the facial structure of the semidefinite cone in (5). This is the main motivation for our approach.
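The construction (6) used in the proof of Lemma 1 is easy to verify numerically. The following sketch (numpy only; the dimensions and the random low-rank matrix are our own choices) builds Y from the compact SVD of L and checks that Y is positive semidefinite, has the same rank as L, and contains L as its upper-right block.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 5, 2
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

U, sig, Vt = np.linalg.svd(L, full_matrices=False)
U, sig, V = U[:, :r], sig[:r], Vt[:r, :].T                       # compact SVD

UV = np.vstack([U, V])                                           # stack [U; V]
Y = UV @ np.diag(sig) @ UV.T                                     # Y = [U; V] Sigma [U; V]^T as in (6)

assert np.allclose(Y[:m, m:], L)                                 # upper-right block equals L
assert np.linalg.matrix_rank(Y) == r                             # rank(Y) = rank(L) = r
assert np.all(np.linalg.eigvalsh(Y) > -1e-10)                    # Y is positive semidefinite
```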

2.2 Outline of our algorithm

We now present a brief outline of our algorithm based on the graph representation of the partially observed matrix Z and the facial reduction of the reformulation (5).

For a given partially observed data matrix \(Z\in {\mathbb {R}}^{m\times n}\), we aim to find the low-rank-plus-sparse decomposition \(Z = L^*+S^*\) with \(r={{\,\mathrm{{rank}}\,}}(L^*)\ll \min (m,n)\). The main steps of our algorithm are as follows:

  (i) Associate a bipartite graph to Z using the observed entries, and find a biclique associated with a completely determined submatrix of Z. Denote this submatrix by \(\bar{Z}\in {\mathbb {R}}^{p\times q}\), where p and q are generally much smaller than m and n, respectively.

  (ii) Find the low-rank-plus-sparse decomposition of \(\bar{Z}\), i.e., \(\bar{Z}=\bar{L}+\bar{S}\). Since \(p\ll m\) and \(q\ll n\), this problem is much easier than the original problem.

  (iii) From \(\bar{L}\) we find an exposing vector for a face of the SDP cone that contains an optimal \(Y^*\) solving (5), i.e., an exposing vector of the optimal face. Note that \(\bar{L}\) is a submatrix of \(L^*\) and thus a submatrix of \(Y^*\).

  (iv) We use the fact that the sum of exposing vectors of faces is an exposing vector of the intersection of the faces, see (9) below. We can then reformulate (5) into a new problem whose size is much smaller than (3). This new problem can then be solved efficiently and accurately.

2.3 Graph representation of a partially observed matrix Z

We associate a bipartite graph \(G_Z((U_m,V_n),{\hat{E}})\) to Z, whose node set corresponds to the union of the two sets of rows and columns of Z

$$\begin{aligned} U_m=\{1,\ldots ,m\},\; V_n=\{1,\ldots ,n\}, \end{aligned}$$

and there is an edge \((i,j) \in {\hat{E}}\), with \(i\in U_m\) and \(j\in V_n\) if \(Z_{ij}\) is observed. Note that a biclique of \(G_Z\) corresponds to a submatrix of Z whose entries are all observed. To find a biclique of \(G_Z((U_m,V_n),{\hat{E}})\), we can relate it to finding cliques in the graph \(G=(V,E)\) whose node set is \(V=\{1,\ldots ,m,m+1,\ldots ,m+n\}\) and the edge set E is

$$\begin{aligned}&E := \{ (i,j) \in V \times V: (i,j-m) \in {\hat{E}} \} \cup \{(i,j)\in V\times V: 1\le i< j\le m\}\\&\quad \cup \{(i,j)\in V\times V: m+1\le i < j\le m+n\}. \end{aligned}$$

Suppose we find a non-trivial clique of G denoted by \(C=\{i_1,\ldots , i_k\}\) whose cardinality \(k=p+q\) satisfies

$$\begin{aligned} |C\cap \{1,\ldots , m\}|=p \ne 0, \quad |C\cap \{m+1,\ldots , m+n\}|=q \ne 0. \end{aligned}$$

By removing the edges in C that have both nodes in \(\{1,\ldots ,m\}\) or both in \(\{m+1,\ldots ,m+n\}\), we obtain a biclique of \(G_Z\). We use \(\bar{C}\) to denote this biclique. This biclique then corresponds to a submatrix of Z: the entries corresponding to edges in \(\bar{C}\) are kept and the other entries of Z are removed. We use \(\bar{Z}\) to denote this matrix, whose size is \(p\times q\). Note that \(\bar{Z}\) is a submatrix of Z, and all entries of \(\bar{Z}\) are observed. Moreover, we generally keep the size of \(\bar{Z}\) much smaller than the size of Z, i.e., \(p\ll m\) and \(q\ll n\).
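As a sketch of this construction (the function name and data layout below are our own; the observed index set plays the role of \({\hat{E}}\)), the adjacency matrix of the auxiliary graph G can be built directly from the list of observed entries; a biclique of \(G_Z\) then corresponds to a clique of G that uses nodes from both sides.

```python
import numpy as np

def adjacency_of_G(m, n, observed):
    """Adjacency matrix of G = (V, E) on m+n nodes: rows {0..m-1}, columns {m..m+n-1}.
    Row nodes are mutually adjacent, column nodes are mutually adjacent, and
    row i is adjacent to column j iff Z_ij is observed."""
    A = np.zeros((m + n, m + n), dtype=bool)
    A[:m, :m] = True              # clique on the row nodes
    A[m:, m:] = True              # clique on the column nodes
    for (i, j) in observed:       # bipartite edges from observed entries
        A[i, m + j] = A[m + j, i] = True
    np.fill_diagonal(A, False)
    return A

# e.g. a 3x3 matrix with 4 observed entries
A = adjacency_of_G(3, 3, [(0, 0), (0, 1), (1, 0), (1, 1)])
```

With this Boolean representation, the clique heuristic of the next subsection can be run directly on A.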

2.4 Heuristics for finding cliques

Finding all the cliques in a graph is an NP-hard problem (Blair and Peyton 1993); the clique decision problem is one of Karp’s 21 NP-complete problems (Karp 1972). Moreover, the cost of the SVD needed for the submatrix corresponding to each clique found becomes expensive for large cliques and large submatrices. Therefore, we use a heuristic algorithm proposed in Krislock and Wolkowicz (2010) and Drusvyatskiy et al. (2017, Algorithm 2) to efficiently find many small cliques. This heuristic is briefly outlined in Algorithm 1.

[Algorithm 1: heuristic for finding small cliques in G]

Algorithm 1 is very efficient in practice because it goes through each node in the graph G only once. In total, we add at most \(K_{max}|V|\) nodes, which corresponds to \(K_{max}|V|\) row operations on the adjacency matrix of the graph. Since it is a heuristic, it is not guaranteed to find all the cliques. However, if we need more cliques, we can apply a random permutation to V and run Algorithm 1 again.
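As an illustration, here is a minimal one-pass sketch in the spirit of Algorithm 1 (the greedy tie-breaking, the function name, and the restart mechanism via a random permutation are our own choices, not necessarily the authors' exact procedure): each node is visited once and a clique containing it is grown greedily up to size \(K_{max}\).

```python
import numpy as np

def find_cliques(A, K_min, K_max, rng=None):
    """Greedy one-pass clique heuristic on a Boolean symmetric adjacency matrix A.
    Returns cliques (lists of node indices) of size between K_min and K_max."""
    nV = A.shape[0]
    order = np.arange(nV) if rng is None else rng.permutation(nV)
    cliques = []
    for v in order:
        clique = [v]
        # candidates adjacent to every node currently in the clique
        cand = set(np.flatnonzero(A[v]))
        while cand and len(clique) < K_max:
            u = cand.pop()                     # greedy choice; any tie-breaking rule works
            clique.append(u)
            cand &= set(np.flatnonzero(A[u]))  # keep only common neighbours
        if len(clique) >= K_min:
            cliques.append(sorted(clique))
    return cliques
```

Passing a random generator as the rng argument corresponds to the random-permutation restart mentioned above when more cliques are needed.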

3 Decomposing the submatrix \(\bar{Z}\) using PALM

From Sects. 2.3 and 2.4 we know that by finding a biclique of \(G_Z\), we get a submatrix of Z, denoted by \(\bar{Z}\in {\mathbb {R}}^{p\times q}\) with \(K_{min} \le p+q \le K_{max}\), whose entries are all known (sampled). Now we want to find a low-rank-plus-sparse decomposition of \(\bar{Z}\). This problem can be formulated as

$$\begin{aligned} \begin{aligned} \displaystyle {\min _{{\bar{L}},{\bar{S}}}}&\frac{1}{2}\Vert \bar{L}+\bar{S}-\bar{Z} \Vert _F^2 \\ \text {s.t. }&{{\,\mathrm{{rank}}\,}}(\bar{L}) \le \bar{r}, \quad \Vert \bar{S}\Vert _0 \le \bar{s}, \end{aligned} \end{aligned}$$
(7)

where \(\bar{r}\) and \(\bar{s}\) are given parameters to control the rank of \(\bar{L}\) and sparsity of \(\bar{S}\), respectively. There are two reasons that (7) is much easier to solve than (3). The first reason is that all entries of \(\bar{Z}\) are known, while those in Z are only partially known. The second reason is that the size of \(\bar{Z}\) is much smaller than the size of Z. We adopt the proximal alternating linearization method (PALM) (Bolte et al. 2014) to solve (7). This is summarized in Algorithm 2.

[Algorithm 2: PALM for solving (7)]
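To convey the two steps of PALM for (7), here is a minimal numpy sketch (the step size, iteration count and stopping rule are our own choices, not necessarily those of Algorithm 2): the L-update takes a gradient step on the smooth coupling term and projects onto the rank constraint via a truncated SVD, and the S-update takes the analogous step and projects onto the cardinality constraint by keeping the \(\bar{s}\) largest entries in magnitude.

```python
import numpy as np

def project_rank(M, r):
    """Nearest matrix of rank <= r (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def project_card(M, s_bar):
    """Keep the s_bar largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(M)
    if s_bar > 0:
        idx = np.unravel_index(np.argsort(np.abs(M), axis=None)[-s_bar:], M.shape)
        out[idx] = M[idx]
    return out

def palm(Zbar, r_bar, s_bar, iters=500, step=1.1):
    """PALM iterations for (7): min 0.5*||L+S-Zbar||_F^2, rank(L)<=r_bar, ||S||_0<=s_bar."""
    L = np.zeros_like(Zbar)
    S = np.zeros_like(Zbar)
    for _ in range(iters):
        L = project_rank(L - (L + S - Zbar) / step, r_bar)   # prox-linearized L-step
        S = project_card(S - (L + S - Zbar) / step, s_bar)   # prox-linearized S-step
    return L, S
```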

By introducing indicator functions, (7) can be equivalently written as

$$\begin{aligned} \min _{{\bar{L}},{\bar{S}}} \ \varPsi (\bar{L},\bar{S}) := \frac{1}{2}\Vert \bar{L}+\bar{S}-\bar{Z} \Vert _F^2 + {\mathbb {1}}(\bar{L}\mid {{\,\mathrm{{rank}}\,}}(\bar{L})\le \bar{r}) + {\mathbb {1}}(\bar{S}\mid \Vert \bar{S}\Vert _0 \le \bar{s}), \end{aligned}$$
(8)

where the indicator function \({\mathbb {1}}(X\mid \mathcal {X})\) equals 0 if \(X\in \mathcal {X}\), and equals \(+\infty\) otherwise. It is easy to verify that the objective function \(\varPsi (\bar{L},\bar{S})\) satisfies the so-called Kurdyka–Łojasiewicz property (Kurdyka 1998; Łojasiewicz 1963; Bolte et al. 2014). As a result, we have the following convergence result for Algorithm 2, which follows directly from Bolte et al. (2014, Theorem 1).

Theorem 1

The sequence \(\{\bar{L}^k, \bar{S}^k\}_{k \in \mathbb {N}}\) generated by Algorithm 2 converges to a critical point of problem (7).

Suppose we solve (7) and obtain the optimal value zero, which guarantees a global optimum. To ensure the correct \(\bar{L}\) is recovered, we need this global optimum to be unique. This happens with high probability when the rank of \(\bar{L}\) is much smaller than the size of the matrix and the incoherence conditions hold, see e.g., Candes et al. (2010), Chandrasekaran et al. (2009). Therefore, we make the following assumption throughout the remainder of the paper.

Assumption

We assume that \(\bar{r} \ll \min (p,q)\) and, in addition, that (7) has a unique global optimal solution \(\bar{L}\) that is a submatrix of \(L^*\) and has the same rank as \(L^*\).

The uniqueness also happens with higher probability when the sparsity parameter \(\bar{s}\) is smaller. Let \(D_S\) denote the density of the sparse matrix \(S^*\). We vary \(\bar{s}\) from 1 to \(\max \{1, pqD_S \}\) to control the sparsity in problem (7). We also set the target rank \(\bar{r} = r\). Thus we have the following algorithm:

[Algorithm 3: decomposition of \(\bar{Z}\) by PALM with varying \(\bar{s}\)]
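A sketch of the sweep over \(\bar{s}\) just described, reusing the hypothetical palm routine from the previous sketch: we accept the first decomposition whose residual is numerically zero, which certifies global optimality of (7).

```python
import numpy as np

def decompose_submatrix(Zbar, r, D_S, tol=1e-8):
    """Try sparsity budgets s_bar = 1, ..., max(1, round(p*q*D_S)); accept the first
    decomposition with (near-)zero residual, which certifies global optimality of (7)."""
    p, q = Zbar.shape
    for s_bar in range(1, max(1, int(round(p * q * D_S))) + 1):
        Lbar, Sbar = palm(Zbar, r, s_bar)          # palm as sketched above
        if np.linalg.norm(Lbar + Sbar - Zbar) <= tol * max(1.0, np.linalg.norm(Zbar)):
            return Lbar, Sbar
    return None, None                              # no certified decomposition found
```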

4 Facial reduction, bicliques and exposing vectors

4.1 Preliminaries on faces

We now present some of the geometric facts we need. More details can be found in e.g., Drusvyatskiy and Wolkowicz (2017), Pataki (2000), Krislock and Wolkowicz (2010), Drusvyatskiy et al. (2017). Recall that the set K is a proper convex cone if

$$\begin{aligned} K+K\subseteq K, \,\, \lambda K\subseteq K, \forall \lambda \ge 0, \,\, {\mathrm{int}}\,(K)\ne \emptyset . \end{aligned}$$

The dual cone, \(K^*\), is defined by

$$\begin{aligned} K^*=\{\phi \in {\mathbb {R}}^n: \langle \phi ,k\rangle \ge 0, \forall k\in K\}. \end{aligned}$$

A subcone \(F\subseteq K\) is a face, \(F\unlhd K\), of the convex cone K if

$$\begin{aligned} x,y \in K, x+y\in F \implies x,y \in F. \end{aligned}$$

The conjugate face, \(F^*\), is defined by \(F^*=F^\perp \cap K^*\), where \(F^\perp\) denotes the orthogonal complement of F. A face \(F\unlhd K\) is an exposed face if there exists \(\phi \in K^*\) such that \(F=\phi ^\perp \cap K\); then \(\phi\) is an exposing vector. Let T be a subset of the convex cone K; then \({{\,\mathrm{face}\,}}(T)\) is the smallest face of K containing T. It is known that a face of a face is a face and that an intersection of faces is a face. Essential for our algorithm is the following characterization of an intersection of exposed faces \(F_i\unlhd K, i=1,\ldots ,k\), see Drusvyatskiy et al. (2017),

$$\begin{aligned} F_i= K \cap \phi _i^\perp , \forall i \implies \cap _{i=1}^k F_i = \left( \sum _{i=1}^k \phi _i\right) ^\perp \cap K. \end{aligned}$$
(9)

If \(K={{\mathcal {S}} ^n_+}\), i.e., the positive semidefinite cone, then the facial structure is well understood. Faces are characterized by the ranges or nullspaces of the matrices in the face. Let \(X\in {{\mathcal {S}} ^n_+\,}\) be rank r and

$$\begin{aligned} X=\begin{bmatrix} P&Q \end{bmatrix} \begin{bmatrix} D&0 \\ 0&0 \end{bmatrix} \begin{bmatrix} P&Q \end{bmatrix}^\top \end{aligned}$$

be the (orthogonal) spectral decomposition with \(D\in {\mathcal {S}} ^r_{++}\) being a diagonal matrix. Then the smallest face of \({\mathcal {S}} ^n_+\) containing X is

$$\begin{aligned} {{\,\mathrm{face}\,}}(X) = P{\mathcal {S}} ^r_+ P^\top = {{\mathcal {S}} ^n_+\,}\cap (QQ^\top )^\perp . \end{aligned}$$

The matrix \(QQ^\top\) is an exposing vector for \({{\,\mathrm{face}\,}}(X)\). Moreover, the relative interior satisfies

$$\begin{aligned} {{\,\mathrm{{relint}}\,}}({{\,\mathrm{face}\,}}(X))= P{\mathcal {S}} ^r_{++}P^\top = {{\,\mathrm{{relint}}\,}}({{\,\mathrm{face}\,}}(\hat{X})), \quad \forall \hat{X} \in {{\,\mathrm{{relint}}\,}}({{\,\mathrm{face}\,}}(X)), \end{aligned}$$

i.e., the face and the exposing vectors are characterized by the eigenspace of any \(\hat{X}\) in the relative interior of the face.
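This characterization is simple to check numerically; the snippet below (illustrative, with a random PSD matrix of our own) forms the exposing vector \(QQ^\top\) from the eigenvectors for the zero eigenvalues and verifies that it is positive semidefinite and orthogonal to X.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 2
P0 = rng.standard_normal((n, r))
X = P0 @ P0.T                                   # X in S^n_+ with rank r

lam, Q_full = np.linalg.eigh(X)
Q = Q_full[:, lam < 1e-10]                      # eigenvectors spanning Null(X)
W = Q @ Q.T                                     # exposing vector of face(X)

assert abs(np.trace(X @ W)) < 1e-8              # <X, W> = 0
assert np.all(np.linalg.eigvalsh(W) > -1e-10)   # W is positive semidefinite
```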

For our application we use the following view of facial reduction and exposed faces.

Theorem 2

(Drusvyatskiy et al. 2015, Theorem 4.1; Drusvyatskiy and Wolkowicz 2017) Consider a linear transformation \(\mathcal {M}:{\mathcal {S}} ^n\rightarrow {\mathbb {R}}^m\) and a nonempty feasible set

$$\begin{aligned} \mathcal {F}:=\{X\in {{\mathcal {S}} ^n_+}: {\mathcal {M}}(X)=b\}, \end{aligned}$$

for some \(b\in {\mathbb {R}}^m\). Then a vector v exposes a proper face of \({\mathcal {M}}({{\mathcal {S}} ^n_+\,})\) containing b if, and only if, v satisfies the auxiliary system

$$\begin{aligned} 0\ne \mathcal {M}^*v\in {{\mathcal {S}} ^n_+\,}\quad {\text { and }}\quad \langle v,b\rangle =0. \end{aligned}$$

Let N denote the smallest face of \({\mathcal {M}}({{\mathcal {S}} ^n_+\,})\) containing b. Then the following statements are true.

  1. We always have \({{\mathcal {S}} ^n_+\,}\cap {\mathcal {M}}^{-1}N={{\,\mathrm{face}\,}}(\mathcal {F})\), the smallest face containing \(\mathcal {F}\).

  2. For any vector \(v\in {\mathbb {R}}^m\) the following equivalence holds:

    $$\begin{aligned} v \text { exposes } N \quad \Longleftrightarrow \quad \mathcal {M}^*v \text { exposes } {{\,\mathrm{face}\,}}(\mathcal {F}). \end{aligned}$$
    (10)

The result in (10) details the facial reduction process for the matrix completion problem using exposing vectors. More precisely, if \(B\succeq 0\) is a principal submatrix of the data and \({{\,\mathrm{{trace}}\,}}(VB)=0, V\succeq 0, V \ne 0\), then V provides an exposing vector for the image of the coordinate projection onto the submatrix. We can then complete V with zeros (adjoint of the coordinate projection) to get \(Y\in {{\mathcal {S}} ^n_+\,}\), an exposing vector for \(\mathcal {F}\). Define the triangular number, \(t(n)=n(n+1)/2\), and the isometry \({{\,\mathrm{{s2vec}}\,}}:{\mathcal {S}} ^n\rightarrow {{\mathbb {R}}^{{t(n)}}}\) that vectorizes the upper-triangular part of a symmetric matrix columnwise.

Corollary 1

Suppose that \(1<k<n\) and \({\mathcal {M}}\) in Theorem 2 is the coordinate projection onto the leading principal submatrix of order k, with \(m=t(k)\). Let \(B\in {\mathcal {S}} _+^k\), \(b={{\,\mathrm{{s2vec}}\,}}(B)\in {\mathbb {R}}^{t(k)}\), i.e., for \(X\in {\mathcal {S}} ^n\), we have

$$\begin{aligned} {\mathcal {M}}(X)_{ij}= b_{ij}, \quad \forall 1\le i\le j\le k. \end{aligned}$$

Let

$$\begin{aligned} 0 \ne V \in {\mathcal {S}} _+^k, \,\, {{\,\mathrm{{trace}}\,}}(VB)=0, \, v={{\,\mathrm{{s2vec}}\,}}V. \end{aligned}$$

Then \(Y={\mathcal {M}}^*v\) is an exposing vector for the feasible set \(\mathcal {F}\), i.e.,

$$\begin{aligned} {{\,\mathrm{{trace}}\,}}(Y(\mathcal {F})) = 0. \end{aligned}$$

Proof

The proof follows immediately from (10) in Theorem 2, as v exposes N and \(Y={\mathcal {M}}^*v\) is an exposing vector for \({{\,\mathrm{face}\,}}(\mathcal {F})\). \(\square\)

4.2 Exposing vector

The following lemma gives a generic rank property relating a random low-rank matrix to its submatrices.

Lemma 2

(Generic rank property, Lemma 3.6 in Huang and Wolkowicz (2018)) Let r be a positive integer and let \(Z_1 \in {\mathbb {R}}^{m\times r}\) and \(Z_2 \in {\mathbb {R}}^{n\times r}\) be continuous random variables with i.i.d. entries. Set \(Z=Z_1Z_2^\top\) and let \(X\in {\mathbb {R}}^{p \times q}\) be any submatrix of Z with \(\min (p,q)\ge r\). Then \({{\,\mathrm{{rank}}\,}}(X) = r\) with probability 1.

Based on Lemma 2, we can assume that the \(\bar{L}\) returned by Algorithm 3 has the same rank as the target low-rank matrix \(L^*\) if \(\min (p,q)> r\), i.e.,

$$\begin{aligned} \bar{Z} = \bar{L} + \bar{S}, \qquad {{\,\mathrm{{rank}}\,}}(\bar{L})= r. \end{aligned}$$
(11)

That is, we solved (7) to global optimality with objective value 0.

Proposition 1

Under the Assumption in Sect. 3, suppose that PALM (Algorithm 2) returns \(\bar{L}\) and \(\bar{S}\) such that (11) holds. Without loss of generality, we further assume the target low-rank matrix \(L^*\) can be partitioned as \(L^* =\left[ \begin{array}{c|c} L_1 & L_2 \\ \hline \bar{L} & L_3 \end{array}\right]\) (after a permutation if needed), where \(\bar{L}\in {\mathbb {R}}^{p\times q}\) and \(r\le \min (p,q)\), and the SVD of \(\bar{L}\) is given by

$$\begin{aligned} \bar{L} = \begin{bmatrix} \bar{P}&\bar{U} \end{bmatrix} \begin{bmatrix} \varSigma _r&0 \\ 0&0\end{bmatrix} \begin{bmatrix} \bar{Q}&\bar{V}\end{bmatrix}^\top ,\quad \varSigma _r\in {\mathcal {S}} _{++}^{r}. \end{aligned}$$
(12)

By adding appropriate blocks of zeros to \(\bar{U} {\bar{U}}^\top\) and \(\bar{V}{\bar{V}}^\top\) (after a permutation if needed), we get the following matrix W, which is an exposing vector for \({{\,\mathrm{face}\,}}(Y^*)\), i.e., \({{\,\mathrm{{trace}}\,}}(Y^* W) = 0\), where \(Y^*\) is the optimal solution of (5).

$$\begin{aligned} W = \left[ \begin{array}{c|ccc|c} 0 & 0 & & 0 & 0\\ \hline 0 & \bar{U} {\bar{U}}^\top & & 0 & 0\\ 0 & 0 & & 0 & 0\\ \hline 0 & 0 & & 0 & 0\\ \end{array}\right] + \left[ \begin{array}{c|ccc|c} 0 & 0 & & 0 & 0\\ \hline 0 & 0 & & 0 & 0\\ 0 & 0 & & \bar{V} {\bar{V}}^\top & 0\\ \hline 0 & 0 & & 0 & 0\\ \end{array}\right] = \left[ \begin{array}{c|ccc|c} 0 & 0 & & 0 & 0\\ \hline 0 & {\bar{U}} {\bar{U}}^\top & & 0 & 0\\ 0 & 0 & & {\bar{V}} {\bar{V}}^\top & 0\\ \hline 0 & 0 & & 0 & 0\\ \end{array}\right] . \end{aligned}$$

Proof

Lemma 1 shows the property

$$\begin{aligned} {{\,\mathrm{{rank}}\,}}(Y^*) = {{\,\mathrm{{rank}}\,}}(L^*) = r. \end{aligned}$$

Without loss of generality, after a permutation if needed, we assume that

$$\begin{aligned} Y^* &= \begin{bmatrix} U \\ P \\ Q \\ V \end{bmatrix} D \begin{bmatrix} U \\ P \\ Q \\ V \end{bmatrix}^\top = \left[ \begin{array}{c|c|c|c} U D U^\top & UDP^\top & UDQ^\top & UDV^\top \\ \hline PDU^\top & PDP^\top & PDQ^\top & PDV^\top \\ \hline QDU^\top & QDP^\top & QDQ^\top & QDV^\top \\ \hline VDU^\top & VDP^\top & VDQ^\top & VDV^\top \end{array}\right] , \\ L^* &= \left[ \begin{array}{c|c} U D Q^\top & U D V^\top \\ \hline P D Q^\top & P D V^\top \end{array}\right] , \quad \text { with } PDQ^\top = \bar{L} \text { and } D \in {\mathcal {S}} ^r_{++}. \end{aligned}$$

Now, since \(PDQ^\top = \bar{L}\), we have \({{\,\mathrm{Range}\,}}(\bar{L})\subseteq {{\,\mathrm{Range}\,}}(P)\) and \({{\,\mathrm{Range}\,}}({\bar{L}}^\top ) \subseteq {{\,\mathrm{Range}\,}}(Q)\). Next, from \({\bar{L}} = \bar{P} \varSigma _r {\bar{Q}}^\top\), we have \({{\,\mathrm{Range}\,}}(\bar{L})\subseteq {{\,\mathrm{Range}\,}}(\bar{P})\) and \({{\,\mathrm{Range}\,}}({\bar{L}}^\top ) \subseteq {{\,\mathrm{Range}\,}}({\bar{Q}})\). Since \({\bar{P}}, P, \bar{Q}, Q\) all have only r columns and \({{\,\mathrm{{rank}}\,}}(\bar{L}) = r\), it follows that \({{\,\mathrm{Range}\,}}(P) = {{\,\mathrm{Range}\,}}(\bar{P}) = {{\,\mathrm{Range}\,}}(\bar{L})\) and \({{\,\mathrm{Range}\,}}(Q) = {{\,\mathrm{Range}\,}}(\bar{Q}) = {{\,\mathrm{Range}\,}}({\bar{L}}^\top )\). We conclude that \(P^\top \bar{U} \bar{U}^\top = \bar{P}^\top \bar{U} \bar{U}^\top = 0\) and \(Q^\top \bar{V} \bar{V}^\top = \bar{Q}^\top \bar{V} \bar{V}^\top =0\), i.e., that \(Y^* W = 0\) and hence \({{\,\mathrm{{trace}}\,}}(Y^* W) = 0\). Therefore W is an exposing vector of the optimal face, the face that contains \(Y^*\), the optimal solution of (5). \(\square\)

Remark 2

We apply facial reduction to (5) (or equivalently (3)) rather than to its convex relaxation. Since the goal is to solve the nonconvex problem (5) in order to recover the true matrices, Proposition 1 shows that the exposing vector we construct actually exposes the optimal face of the nonconvex problem (5). Therefore the optimal solution lies in a smaller space and we have a better chance of obtaining the optimal solution of the nonconvex problem (5). If we apply facial reduction to the convex approximation instead, we may not obtain an exposing vector of this "optimal face".

4.3 Reducing problem size using FR

From Proposition 1, we get Algorithm 4 to find an exposing vector \(Y_{expo}\) for \({{\,\mathrm{face}\,}}(Y^*)\).

[Algorithm 4: construction of the exposing vector \(Y_{expo}\) for \({{\,\mathrm{face}\,}}(Y^*)\)]
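To convey the key step of Algorithm 4 as implied by Proposition 1 and (9), here is a sketch (function and variable names are our own): for each biclique take the SVD of the recovered \(\bar{L}\), zero-pad \(\bar{U}\bar{U}^\top\) and \(\bar{V}\bar{V}^\top\) into the corresponding row and column blocks of the \((m+n)\times (m+n)\) matrix, and sum the resulting exposing vectors.

```python
import numpy as np

def exposing_vector(m, n, bicliques, r):
    """Sum of exposing vectors over bicliques.
    bicliques: list of (rows, cols, Lbar), with Lbar the recovered low-rank block."""
    Y_expo = np.zeros((m + n, m + n))
    for rows, cols, Lbar in bicliques:
        U, _, Vt = np.linalg.svd(Lbar)
        Ubar = U[:, r:]                     # basis of the left null space of Lbar
        Vbar = Vt[r:, :].T                  # basis of the right null space of Lbar
        Y_expo[np.ix_(rows, rows)] += Ubar @ Ubar.T             # row block
        cshift = [m + j for j in cols]
        Y_expo[np.ix_(cshift, cshift)] += Vbar @ Vbar.T         # column block
    return Y_expo
```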

Remark 3

We emphasize that the sizes of the bicliques in \(\varTheta\) are small. A large biclique means an expensive SVD when finding the corresponding exposing vector. Adding many exposing vectors results in one exposing vector corresponding to the union of the bicliques after completion.

From \(Y_{expo}\) we can also find the basis for \({{\,\mathrm{Null}\,}}(Y_{expo})\) which is given by the columns of

$$\begin{aligned} V={{\,\mathrm{Null}\,}}(Y_{expo}) =\begin{bmatrix} V_P&0 \\ 0&V_Q \end{bmatrix}, \quad V_P^\top V_P=I_{{r_p}}, \,\, V_Q^\top V_Q=I_{{r_q}}, \end{aligned}$$
(13)

where \(V_P\in {\mathbb {R}}^{m\times r_p}\) and \(V_Q\in {\mathbb {R}}^{n\times r_q}\). We denote \(r_v=r_p+r_q < m+n\). Therefore \(Y^*\) (optimal solution of (5)) can be expressed as

$$\begin{aligned} Y^* = VRV^\top = \begin{bmatrix}V_P {R_p}V_P^\top&V_P {R_{{ pq}}}V_Q^\top \\ V_QR_{pq}^\top V_P^\top&V_Q {R_q}V_Q^\top \end{bmatrix}, \quad \text{ for } \text{ some } R = \begin{bmatrix}R_p&R_{pq} \\ R_{pq}^\top&R_q \end{bmatrix} \in {\mathcal {S}} ^{r_v}. \end{aligned}$$
(14)

From Lemma 1 and (13) we have \({{\,\mathrm{{rank}}\,}}(Y^*) = {{\,\mathrm{{rank}}\,}}(V_P R_{pq}V_Q^\top ) = {{\,\mathrm{{rank}}\,}}(R_{pq})\). Therefore, (5) reduces to the following problem which has a much smaller size for the low-rank component:

$$\begin{aligned} \begin{aligned} \displaystyle {\min _{R_{pq} \in {\mathbb {R}}^{r_p\times r_q},S\in {\mathbb {R}}^{m\times n}}}&{{\,\mathrm{{rank}}\,}}(R_{pq}) +\mu \Vert S\Vert _0 \\ \text {s.t. }&{\mathcal {P}} _{\hat{E}}( V_PR_{pq}V_Q^\top ) + {\mathcal {P}} _{\hat{E}}(S) = z. \end{aligned} \end{aligned}$$
(15)
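As a sketch of this size reduction (the helper name is ours): since \(Y_{expo}\) is block diagonal by construction, \(V_P\) and \(V_Q\) can be read off from the eigenvectors of its two diagonal blocks, and the low-rank unknown shrinks from \(m\times n\) to \(r_p\times r_q\).

```python
import numpy as np

def null_basis(Y_expo, m, n, tol=1e-8):
    """Orthonormal bases V_P, V_Q of the null spaces of the two diagonal blocks of Y_expo,
    as in (13)."""
    lamP, QP = np.linalg.eigh(Y_expo[:m, :m])
    lamQ, QQ = np.linalg.eigh(Y_expo[m:, m:])
    V_P = QP[:, lamP < tol]        # m x r_p
    V_Q = QQ[:, lamQ < tol]        # n x r_q
    return V_P, V_Q

# Any feasible Y of (5) exposed by Y_expo has the form (14):
#   Y = V R V^T  with  V = blockdiag(V_P, V_Q),
# and the low-rank block is L = V_P R_pq V_Q^T with R_pq of size r_p x r_q.
```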

4.4 Further reducing the size of the problem

Note that the size of S in (15) can be further reduced. In the decomposition (11), we know that \(\bar{L}\) and \(\bar{S}\) are exactly recovered by PALM when successful. Therefore, there is a subset of entries of \(S^*\) that has been successfully recovered. We can remove this subset of entries from the linear constraints in (15). We let \(\hat{E}_S\) denote the set of indices of entries of \(S^*\) that are exactly recovered by PALM, and let \(\hat{E}_{S^c} := \hat{E}\backslash \hat{E}_S\) to get:

$$\begin{aligned} \begin{aligned} \min&{{\,\mathrm{{rank}}\,}}(R_{pq})+\mu \Vert s\Vert _0 \\ \text {s.t. }&{\mathcal {P}} _{{\hat{E}}_{S}} ( V_PR_{pq}V_Q^\top ) = z_{{\hat{E}}_S} \\&{\mathcal {P}} _{{\hat{E}}_{S^c}}( V_PR_{pq}V_Q^\top ) + s = z_{{\hat{E}}_{S^c}}\\&{R_{pq} \in {\mathbb {R}}^{r_p\times r_q},\, s\in {\mathbb {R}}^{{\hat{E}}_{S^c}}}. \end{aligned} \end{aligned}$$
(16)

Moreover, the number of linear constraints in (16) can, and should, be further reduced to remove any redundant linear constraints. We use A and B to denote the matrix representations of the coefficient matrices in \({\mathcal {P}} _{{\hat{E}}_{S}}( V_PR_{pq}V_Q^\top )\) and \({\mathcal {P}} _{{\hat{E}}_{S^c}}( V_PR_{pq}V_Q^\top )\), respectively. If the ith row of B (denoted by \(B_i\)) is in the row space of A, then \(B_i\) can be removed, reducing the size of B. The algorithm that we use for this purpose is outlined in Algorithm 5.

[Algorithm 5: removal of redundant linear constraints]
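One simple way to implement the row-space test (a least-squares variant of our own, not necessarily the authors' exact procedure in Algorithm 5) is to keep a row \(B_i\) only if it is not reproduced, up to a tolerance, by a combination of the rows of A.

```python
import numpy as np

def remove_redundant_rows(A, B, tol=1e-8):
    """Drop rows of B that lie (numerically) in the row space of A."""
    keep = []
    for i, Bi in enumerate(B):
        x, *_ = np.linalg.lstsq(A.T, Bi, rcond=None)   # best combination of rows of A
        if np.linalg.norm(A.T @ x - Bi) > tol * max(1.0, np.linalg.norm(Bi)):
            keep.append(i)                              # B_i is not redundant
    return B[keep], keep
```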

Since FR typically results in very small values of \(r_p\) and \(r_q\) in (13), the first set of linear constraints in (16), i.e.,

$$\begin{aligned} {\mathcal {P}} _{{\hat{E}}_{S}} ( V_PR_{pq}V_Q^\top ) = z_{{\hat{E}}_S}, \end{aligned}$$
(17)

is usually an overdetermined linear system. As a result, we can obtain \(L^*\) by only solving this linear system.
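Since (17) is linear in \(R_{pq}\), recovering \(L^*\) in this case reduces to a small least-squares problem in the \(r_p r_q\) unknowns. A sketch (variable names ours), using the identity \((V_P R_{pq} V_Q^\top )_{ij} = (V_P)_{i,:}\, R_{pq}\, ((V_Q)_{j,:})^\top\), so that each observation contributes one row of the linear system via a Kronecker product:

```python
import numpy as np

def solve_for_L(V_P, V_Q, idx_rows, idx_cols, z_vals):
    """Least-squares solve of P_{E_S}(V_P R V_Q^T) = z for R, then L = V_P R V_Q^T."""
    r_p, r_q = V_P.shape[1], V_Q.shape[1]
    # row k of the system: kron(V_Q[j_k, :], V_P[i_k, :]) @ vec(R) = z_k  (column-major vec)
    A = np.stack([np.kron(V_Q[j], V_P[i]) for i, j in zip(idx_rows, idx_cols)])
    vecR, *_ = np.linalg.lstsq(A, z_vals, rcond=None)
    R = vecR.reshape(r_p, r_q, order="F")
    return V_P @ R @ V_Q.T
```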

Lemma 3

The target matrix \(L^*\) is unique if \({\mathcal {P}} _{{\hat{E}}_{S}}( V_PR_{pq}V_Q^\top ) = z_{{\hat{E}}_S}\) in (16) has a unique solution.

Lemma 4

If \(L^*\) is the unique optimal solution of (3), then (16) has a unique optimal solution \(R_{pq}^*\) that recovers \(L^*\):

$$\begin{aligned} L^* = V_P R_{pq}^* V_Q^\top . \end{aligned}$$

When we have enough bicliques, our numerical tests generally successfully recover \(L^*\) by solving the linear system (17). If we do not have enough bicliques, then we have to solve the nonconvex and nonsmooth problem (16). In this case, we can solve the following convex relaxation of (16):

$$\begin{aligned} \begin{aligned} \displaystyle {\min _{R_{pq} \in {\mathbb {R}}^{r_p\times r_q},s\in {\mathbb {R}}^{{\hat{E}}_{S^c}}}}&\Vert R_{pq}\Vert _* + \mu \Vert s\Vert _1 \\ \text {s.t. }&{\mathcal {P}} _{{\hat{E}}_{S}} ( V_PR_{pq}V_Q^\top ) = z_{{\hat{E}}_S}\\&{\mathcal {P}} _{{\hat{E}}_{S^c}}( V_PR_{pq}V_Q^\top ) + s = z_{{\hat{E}}_{S^c}}. \end{aligned} \end{aligned}$$
(18)

Standard first-order methods such as the alternating direction method of multipliers (ADMM) can be used to solve (18). Moreover, since the size of (18) is very small, we can also use interior point methods to solve it. The convex relaxation (18) after facial reduction is usually tighter than (4) and is more likely to successfully recover \(L^*\). We do not go into further detail in this paper.

4.5 Detecting the target rank

So far we have assumed that the target rank is known in advance; in practice this may not be the case. We now assume only that we know a range for the target rank and the density of the sparse part.

[Algorithm 6: target rank detection]
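A sketch of one way to implement the rank detection (reusing the hypothetical decompose_submatrix routine sketched in Sect. 3; this is our reading of the procedure, not necessarily the authors' exact Algorithm 6): for a fully observed submatrix, try each candidate rank in the given range and return the smallest one for which PALM certifies a zero-residual decomposition.

```python
def estimate_rank(Zbar, rank_range, D_S):
    """Return the smallest candidate rank giving a certified decomposition of Zbar."""
    for r in rank_range:                                 # e.g. range(r_lo, r_hi + 1)
        Lbar, Sbar = decompose_submatrix(Zbar, r, D_S)   # sketched in Sect. 3
        if Lbar is not None:
            return r
    return None
```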

To obtain the correct target rank, we can sample many submatrices from the observed data and run Algorithm 6 on each, then choose the rank that is estimated most frequently. We tested Algorithm 6 on random instances with different sizes, ranks and sparsity densities. The results are presented in Table 1. For example, the third row of Table 1 presents average results for 100 instances of size \(20 \times 20\), density \(D_S =0.01\) and original rank \(r = 2\). Algorithm 6 found estimated ranks \(r=[1,2,3,4,5]\) in [0, 85, 12, 3, 0] percent of the instances, respectively. We conclude that Algorithm 6 estimates the correct rank in most cases.

Table 1 Rank estimation; average of 100 instances

5 Growing bicliques by submatrix completion

We observe that when there are not enough bicliques, we may have many zero rows and zero columns in the exposing vector \(Y_{expo}\). Let \(J_1\) be the set of indices of nonzero rows and columns in the upper-left block of \(Y_{expo}\), and let \(J_2\) be the set of indices of nonzero rows and columns in the bottom-right block of \(Y_{expo}\). Now consider the submatrix \(Z_{J_1, J_2}\), obtained by taking the rows of Z whose indices are in \(J_1\) and the columns of Z whose indices are in \(J_2\). For this submatrix the corresponding problem (16) is much smaller and, since we have removed all the zero rows and columns, the exposing vectors cover all the remaining rows and columns, i.e., those of \(Z_{J_1,J_2}\). Hence we have a better chance of recovering \(Z_{J_1,J_2}\). In fact, we often get a unique solution.

After we recover the submatrix \(Z_{J_1,J_2}\), we can add the indices for all the entries in \(Z_{J_1,J_2}\) to the sampled indices set \({\hat{E}}\). The sampled indices set \({\hat{E}}\) contains a larger biclique \(J_1 \times J_2\) and has more elements. We then use Algorithm 1 again to find more bicliques. This process can be repeated until all the rows and columns in Z are recovered.
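A sketch of one growth pass (our reading of the description above; recover_submatrix stands in for solving the reduced system (17)/(18) restricted to \(Z_{J_1,J_2}\)):

```python
import numpy as np

def grow_bicliques(observed, Y_expo, m, n, recover_submatrix):
    """One growth pass: the indices J1, J2 of nonzero rows/columns of the two diagonal
    blocks of Y_expo define a submatrix Z_{J1,J2}; once recovered, all of its entries
    are added to the observed index set."""
    J1 = np.flatnonzero(np.abs(Y_expo[:m, :m]).sum(axis=1) > 0)
    J2 = np.flatnonzero(np.abs(Y_expo[m:, m:]).sum(axis=1) > 0)
    Z_sub = recover_submatrix(J1, J2)            # solve the reduced problem on (J1, J2)
    new_entries = {(i, j) for i in J1 for j in J2}
    return observed | new_entries, Z_sub         # enlarged set contains the biclique J1 x J2
```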

We illustrate the growth of bicliques and the exposing vector \(Y_{expo}\) with an example. We choose the size of Z to be \(200 \times 200\), the density of the sampled data is 0.18, and the minimum size for the bicliques is \(4 \times 4\) with target rank \(r = 2\). The original low-rank matrix L is recovered by our algorithm after 3 iterations. The example is shown in Figs. 1 and 2.

Fig. 1 Growth of set of bicliques; 3 iterations for sufficient growth

Fig. 2 Growth of \(Y_{expo}\)

Our main algorithm for recovering the low rank matrix is summarized in Algorithm 7.

[Algorithm 7: main algorithm for recovering the low-rank matrix]
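To convey the overall flow, the schematic below strings together the hypothetical routines from the earlier sketches (it is a rough outline under our own simplifications, in particular treating the entries of L fixed by the biclique decompositions as the data for (17); it is not the authors' exact Algorithm 7).

```python
import numpy as np

def recover_low_rank(Z, observed, m, n, r, D_S, K_min, K_max, max_passes=10):
    """Schematic main loop: find bicliques, decompose each with PALM, accumulate the
    exposing vector, grow the set of known entries, and solve the reduced system (17).
    Z is stored densely; only entries indexed by `observed` (a set of (i, j)) are used,
    and bicliques are fully observed by construction."""
    known = {}                                            # (i, j) -> recovered L*_{ij}
    for _ in range(max_passes):
        cliques = find_cliques(adjacency_of_G(m, n, observed), K_min, K_max)
        blocks = []
        for C in cliques:
            rows = [v for v in C if v < m]
            cols = [v - m for v in C if v >= m]
            if not rows or not cols:
                continue
            Lbar, _ = decompose_submatrix(Z[np.ix_(rows, cols)], r, D_S)
            if Lbar is None:
                continue
            blocks.append((rows, cols, Lbar))
            for a, i in enumerate(rows):
                for b, j in enumerate(cols):
                    known[(i, j)] = Lbar[a, b]            # entries of L* pinned down exactly
        Y_expo = exposing_vector(m, n, blocks, r)         # Proposition 1 and (9)
        V_P, V_Q = null_basis(Y_expo, m, n)               # (13)
        if len(known) >= V_P.shape[1] * V_Q.shape[1]:     # rough count check for (17)
            idx = list(known)
            i_s = [i for i, _ in idx]
            j_s = [j for _, j in idx]
            z_s = np.array([known[p] for p in idx])
            return solve_for_L(V_P, V_Q, i_s, j_s, z_s)   # least-squares solve of (17)
        observed = observed | set(known)                  # grow bicliques (Sect. 5), repeat
    return None
```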

Remark 4

The motivation to grow bicliques arises from the desire to improve efficiency. We note that Algorithm 1 is only a heuristic, and finding all the bicliques in a graph is an NP-hard problem. If not enough bicliques are identified, the linear system (17) may be underdetermined. Then we have to solve the convex approximation, which can lead to a failure to recover the true low-rank matrix. The biclique-growing approach uses the existing bicliques to grow one big biclique. When we add this big biclique to the sampled elements and run Algorithm 1 again, we are able to find many more bicliques efficiently; we note that the heuristic Algorithm 1 performs well when a big biclique exists. Our experiments show that this process is much more efficient than simply running Algorithm 1 many times. In fact this process is critical for the success of the method. Without the biclique-growing procedure, the proposed method takes much longer and often fails to recover the target matrices if the sampling density is low.

6 Numerical experiments

6.1 Noiseless case

We applied our Algorithm 7 to solving random noiseless instances of (3). The input data were generated in the following manner. The low-rank matrix \(L^*\) was integer valued and obtained using \(L^*=\text {round}(L_r L_c^\top )\), where \(L_r \in {\mathbb {R}}^{m \times r}\) and \(L_c \in {\mathbb {R}}^{n \times r}\) have entries drawn from the normal distribution N(0, 10). The density of the sparse matrix \(S^*\in {{\mathbb {R}}^{m \times n}}\) is denoted by \(D_S\), and its nonzero elements were integer valued, obtained by rounding uniformly distributed random elements. The matrix Z is set as \(Z = L^* + S^*\). We then randomly sample elements of Z with density \(\delta\), i.e., the ratio of observed components of Z is \(\delta\). We use \(\hat{L}\) to denote the recovery returned by our Algorithm 7.
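A sketch of this instance generation (our interpretation: N(0, 10) is taken as variance 10, and the uniform range for the gross errors is our own choice since it is not specified above):

```python
import numpy as np

def generate_instance(m, n, r, D_S, delta, seed=0):
    """Random noiseless RPCA instance: integer low-rank L*, integer sparse S*,
    and an observed index set of density delta."""
    rng = np.random.default_rng(seed)
    Lr = rng.normal(0.0, np.sqrt(10.0), (m, r))            # factors ~ N(0, 10), variance 10 assumed
    Lc = rng.normal(0.0, np.sqrt(10.0), (n, r))
    L_star = np.round(Lr @ Lc.T)                           # integer-valued low-rank part

    S_star = np.zeros((m, n))
    support = rng.random((m, n)) < D_S                     # sparse support of density D_S
    S_star[support] = np.round(rng.uniform(-100, 100, support.sum()))  # integer gross errors

    Z = L_star + S_star
    observed = {(i, j) for i in range(m) for j in range(n) if rng.random() < delta}
    return Z, L_star, S_star, observed
```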

For each set of data parameters \((m,n,\delta )\), we randomly generated 20 instances and report the averaged performance in Tables 2 and 3. In these tables, \(D_S=0.01\), \(\mu =c/(\delta \sqrt{m})\) where c is a chosen constant in [1, 10], the CPU times are in seconds, \(K_{min}\) denotes the smallest size of the cliques we found, \(K_{max}\) denotes the largest size of the cliques we found, \(\hat{r} = {{\,\mathrm{{rank}}\,}}(\hat{L})\), and “Succ” denotes the number of successfully recovered instances out of the 20 randomly generated ones. Here we say that the matrix \(L^*\) is successfully recovered if there is no difference between \(\hat{L}\) and \(L^*\), i.e., \(\Vert \hat{L}-L^*\Vert _F=0\); this is possible because all entries of \(L^*\) are integer valued. From these results we see that we can get exact recovery with very low density. Moreover, our algorithm is very efficient: even for very large problems of size \(5000 \times 5000\), we solve them in about two minutes.

Table 2 Results of Algorithm 7 for solving (3) with target rank \(r=2\)
Table 3 Results of Algorithm 7 for solving (3) with target rank \(r=3\)

We also compare our method with the fast alternating linearization method (FALM) proposed in Goldfarb et al. (2013). The results are reported in Table 4, where \(D_S=0.01\). From these results we see that, after preprocessing by facial reduction, our reduced model (16) has a higher probability of recovering the low-rank matrix than solving the convex relaxation (4) by FALM. Moreover, our algorithm, although using a second-order interior point solver, is significantly faster than the first-order method FALM. Our code is available at http://www.math.uwaterloo.ca/~hwolkowi//henry/reports/Codeforthetables.

Table 4 Comparison of Algorithm 7 and FALM

Recently there has been a trend toward matrix-factorization methods for low-rank problems, such as Sun and Luo (2016), Yi et al. (2016). Therefore we also compared our method with methods based on matrix factorization. Since Sun and Luo (2016) does not consider sparse gross errors, we only compare the fast gradient descent method of Yi et al. (2016), denoted FGD, with our method. Since only the noiseless case is considered in Yi et al. (2016), we restrict the comparison to the noiseless case. We note that in Yi et al. (2016) the authors make two assumptions about the data: (i) the low-rank matrix M needs to be \(\mu\)-incoherent; (ii) the entries of the sparse component S are “spread out”, i.e., for \(\alpha \in [0,1)\), S contains at most an \(\alpha\)-fraction of nonzero entries per row and column. In this paper we do not make these assumptions. Therefore we compare our method with FGD both when these assumptions hold and when they do not. The results are in Tables 5 and 6. From the results in the tables we see that the fast gradient descent method usually works better for large-scale problems when these two assumptions hold. However, when these two assumptions do not hold, our method recovers the true low-rank matrix with a high success rate, while the fast gradient descent method fails on all instances.

Table 5 Comparison of Algorithm 7 and FGD without assumptions (i), (ii)
Table 6 Comparison of Algorithm 7 and FGD with assumptions (i), (ii)

6.2 Noisy case

In the noisy case, we consider \(Z = L + S + \mathcal {E}\), where L, S are no longer integer valued and the elements of the matrix \(\mathcal {E}\) are random noise generated from a Gaussian distribution with standard deviation \(\sigma\). We still assume S is sparse, with density \(D_S\).

The results are in Tables 7 and 8. Each row in the tables presents the average of 20 instances.

Table 7 Comparison of Algorithm 7 and FALM (\(\sigma =10^{-4}\), \(D_S= 0.01\), CPU times in seconds)
Table 8 Comparison of Algorithm 7 and FALM (\(\sigma =10^{-5}\), \(D_S = 0.01\), CPU times in seconds)

We see that when the noise level is small (\(\sigma = 10^{-4}\) or \(10^{-5}\)), the CPU time of our Algorithm 7 is significantly smaller than that of FALM; the residual is smaller as well, which is a great advantage of our algorithm.

In the case when the noise level is relatively large, our algorithm does not perform as well as FALM. This appears to be due to the failure to recover the correct submatrices \(\bar{L}, \bar{S}\) from \(\bar{Z}\). However, if Algorithm 7 succeeds in finding a solution with the correct target rank, then it generally has a smaller residual than the first-order method FALM. If Algorithm 7 fails to find a solution with the correct target rank, FALM is generally better. We conclude that we should use Algorithm 7 first and check whether the correct target rank is achieved with a satisfactory tolerance. If not, then we should fall back on the first-order method FALM.

We note that there are popular algorithms for large-scale low-rank matrix completion, such as stochastic gradient descent (SGD) (Gemulla et al. 2011; Recht and Ré 2013; Zhuang et al. 2013) and block coordinate descent type methods (Pilászy et al. 2010; Yu et al. 2012). The method proposed in this paper can be modified to solve low-rank matrix completion problems as well, and it will be interesting in future work to compare it to those methods on general low-rank matrix completion problems. Also, in Sun and Luo (2016) a regularization term is added to control incoherence in the iterates; it will be interesting to study whether such a regularization term helps to improve recovery quality for RPCA problems.

7 Conclusion

In this paper we have shown that we can apply a facial reduction approach in combination with the nuclear norm heuristic to efficiently solve the robust PCA problem with partially observed data. This exploits the implicit degeneracy at the optimal solution resulting from the low-rank and sparse structure.

Specifically, whenever enough complete bipartite subgraphs in the data are available, we are able to find a proper face of the semidefinite cone that contains the optimal solution and results in a significant reduction in dimension. If we cannot find enough bicliques, the matrix can still be partially completed. Having an insufficient number of bicliques indicates that there is not enough initial data for our algorithm to recover the unknown elements. This is particularly true for large rank r, where larger bicliques are needed. Throughout we see that the facial reduction both regularizes the problem and reduces its size, and often allows for a solution without any refinement or need for an SDP solver.

Our preliminary numerical results are promising, as we efficiently and accurately recover solutions of large-scale problems. The numerical tests are ongoing, with improvements in using biclique algorithms rather than clique algorithms. At the current stage, we are only testing randomly generated examples. In many real-world problems, for example surveillance video with moving objects, the gross errors are not sparse enough, or, in other words, the random noise is not significantly smaller than the sparse gross errors. To solve such practical problems, we need to further refine our algorithm to better separate the random noise from the sparse gross errors, which is part of our ongoing work.

Theoretical results on exact recovery are discussed in many papers, e.g., Candes et al. (2010), Chandrasekaran et al. (2009). They use the so-called incoherence conditions, which are difficult to verify. It appears from our work above that exact recovery guarantees can be found from rigidity results in the graph of Z, i.e., in the number and density of the bicliques. Moreover, there are interesting questions on how to extend these results from simple matrix completion to general solutions of linear equations, \({\mathcal {A}}(Z)=b\), where \({\mathcal {A}}\) is some linear transformation.