Abstract

The penalty decomposition (PD) method is an effective and versatile method for sparse optimization and has been successfully applied to compressed sensing, sparse logistic regression, sparse inverse covariance selection, low-rank minimization, image restoration, and so on. However, as the penalty parameter increases, the sequence of penalty subproblems that the PD method must solve can be time consuming. In this paper, an acceleration of the penalty decomposition method is proposed for the sparse optimization problem. For each penalty parameter, this method finds only inexact solutions to the subproblems. Computational experiments on a number of test instances demonstrate the effectiveness and efficiency of the proposed method in accurately generating sparse and redundant representations of one-dimensional random signals.

1. Introduction

In this paper, we consider solving the following sparse optimization problem by an inexact penalty decomposition (iPD) method:
$$\min_{x \in X} \; f(x) \quad \text{s.t.} \quad g(x) \le 0, \; h(x) = 0, \; \|x\|_0 \le r, \tag{1}$$
where the integer $r \ge 1$ controls the sparsity of the solution, $X$ is a closed convex set in the $n$-dimensional Euclidean space $\mathbb{R}^n$, $f$ and $g$ are continuously differentiable convex functions, $h$ is an affine function, and $\|x\|_0$ denotes the number of nonzero components of $x$.

Sparse optimization solves problems whose solutions are sparse or compressible. It has attracted considerable attention in the past decade because of its broad applications, such as signal and image processing [1–3], linear regression [4], inverse problems [5], model selection [6], and machine learning [6, 7]. In these applications, most of the information of interest has, or can be coded with, a dimension that is much lower than the ambient dimension.

However, problem (1) is NP-hard even for some simple special cases [8]. Even so, many methods have been proposed for special cases of problem (1). These methods can be classified into four categories: (1) greedy methods, e.g., matching pursuit [9, 10] and greedy coordinate descent [11]; (2) $\ell_1$-norm relaxation methods, e.g., gradient projection [12, 13], iterative shrinkage-thresholding [5, 14], the iterative reweighted method [15], the alternating direction method [16], and the homotopy method [17–20]; (3) $\ell_p$-norm $(0 < p < 1)$ relaxation methods [1, 2, 21]; and (4) $\ell_0$-norm based methods, e.g., the penalty decomposition (PD) method [22], the block decomposition method [23], the iterative hard thresholding method [22, 24–29], and so on. In this paper, we mainly discuss the PD method.

The PD method was proposed by Lu and Zhang in [22] for solving the general $\ell_0$-norm minimization problem (1). It has been successfully applied to compressed sensing [22], sparse logistic regression [22], sparse inverse covariance selection [22], low-rank minimization [30], image restoration [3], and so on. Moreover, the PD method is theoretically sound: Lu and Zhang showed that any accumulation point of the sequence generated by the PD method satisfies the first-order optimality conditions of problem (1) when the Robinson condition holds. Hence, the PD method is an effective and versatile method for sparse optimization. However, since the PD method solves the subproblems exactly for each penalty parameter, it may be time consuming in practice.

In this paper, an inexact penalty decomposition (iPD) method is proposed for the sparse optimization problem (1). The iPD method finds only inexact solutions to the subproblems for each penalty parameter. In more detail, for the first, convex subproblem, the iPD method takes just one gradient step and then turns to the second, nonconvex subproblem, which can be solved by the iterative hard thresholding method [26]. After these two steps, the penalty parameter is updated. Computational experiments on a number of random instances demonstrate the effectiveness of the proposed method in accurately generating sparse and redundant representations of one-dimensional random signals.

The rest of this paper is organized as follows. Section 2 presents preliminaries, in which notation and the basic PD method are described. Section 3 presents the iPD method. Computational experiments are reported in Section 4, and conclusions are drawn in Section 5.

2. Preliminaries

2.1. Notations

In this subsection, some notation is introduced to simplify the presentation. The transpose of a vector $x$ is denoted by $x^{T}$. Unless otherwise stated, all norms are the Euclidean norm, denoted by $\|\cdot\|$. $P_{X}(\cdot)$ denotes the projection onto a set $X$. Given a vector $x$, the nonnegative part of $x$ is denoted by $[x]_{+}$, i.e., $([x]_{+})_i = \max\{x_i, 0\}$. The index set of nonzero components of a vector $x$ is denoted by $\operatorname{supp}(x)$ (called the support set), i.e., $\operatorname{supp}(x) = \{i : x_i \neq 0\}$. The size of $\operatorname{supp}(x)$ is denoted as $|\operatorname{supp}(x)|$, so that $\|x\|_0 = |\operatorname{supp}(x)|$.

Now, let us consider problem (1). It is easy to verify that problem (1) is equivalent to the following problem:
$$\min_{x \in X, \, y \in \mathbb{R}^n} \; f(x) \quad \text{s.t.} \quad g(x) \le 0, \; h(x) = 0, \; x = y, \; \|y\|_0 \le r. \tag{2}$$

The associated penalty function of problem (2) is defined as
$$q_\rho(x, y) = f(x) + \rho \left( \|[g(x)]_+\|^2 + \|h(x)\|^2 + \|x - y\|^2 \right), \tag{3}$$
where $\rho > 0$ is the penalty parameter.

For simplicity, we also denote $Y = \{ y \in \mathbb{R}^n : \|y\|_0 \le r \}$.

2.2. The PD Method

In this subsection, we review the PD method proposed in [22]. First, the outline of the PD method is presented in Algorithm 1. Then, we explain, through a random example, why the PD method can be time consuming.

Input: $\rho_0 > 0$, $\sigma > 1$, $(x^0, y^0)$ with $y^0 \in Y$;
Output: $x^*$;
(1)initialization $k \leftarrow 0$;
(2)repeat
(3) $l \leftarrow 0$, $(x_0, y_0) \leftarrow (x^k, y^k)$;
(4)repeat
(5)  $x_{l+1} \in \arg\min_{x \in X} q_{\rho_k}(x, y_l)$;
(6)  $y_{l+1} \in \arg\min_{y \in Y} q_{\rho_k}(x_{l+1}, y)$;
(7)  $l \leftarrow l + 1$;
(8)until the inner termination condition holds
(9) $x^{k+1} \leftarrow x_l$;
(10) $y^{k+1} \leftarrow y_l$;
(11) $\rho_{k+1} \leftarrow \sigma \rho_k$;
(12) $k \leftarrow k + 1$;
(13)until some termination condition is reached
(14) $x^* \leftarrow x^k$;

Remark 1. (i) The termination condition in Step 8 of Algorithm 1 is used to establish the global convergence of the PD method. In practice, the termination criterion is based on the relative change of the sequence $\{(x_l, y_l)\}$, such as requiring that the sequence satisfy
$$\frac{\max\{\|x_{l+1} - x_l\|_\infty, \|y_{l+1} - y_l\|_\infty\}}{\max\{1, \|x_l\|_\infty, \|y_l\|_\infty\}} \le \epsilon_I$$
for some $\epsilon_I > 0$. In addition, the PD method terminates the outer iterations when
$$\|x^k - y^k\|_\infty \le \epsilon_O$$
holds for some $\epsilon_O > 0$.
(ii) The second subproblem, i.e., in Step 6 of Algorithm 1,
$$\min_{y \in Y} q_{\rho_k}(x_{l+1}, y) = \min_{\|y\|_0 \le r} \|y - x_{l+1}\|^2,$$
has a closed-form solution [26]:
$$y^*_i = \begin{cases} (x_{l+1})_i, & i \in I_r(x_{l+1}), \\ 0, & \text{otherwise}, \end{cases}$$
where $(\cdot)_i$ denotes the $i$-th entry of a vector and $I_r(x)$ is an index set of $r$ components of $x$ with the largest absolute values.
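For concreteness, the following is a minimal sketch of this closed-form solution (hard thresholding) in Python/NumPy; the function name hard_threshold is ours, introduced only for illustration.

```python
import numpy as np

def hard_threshold(x, r):
    """Closed-form minimizer of ||y - x||^2 over ||y||_0 <= r:
    keep the r largest-magnitude entries of x and zero out the rest.
    Ties among equal magnitudes are broken arbitrarily by argsort."""
    y = np.zeros_like(x)
    if r > 0:
        idx = np.argsort(np.abs(x))[-r:]  # indices of the r largest |x_i|
        y[idx] = x[idx]
    return y
```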

In Step 5 of Algorithm 1, minimizing the function $q_{\rho_k}(x, y_l)$ with respect to $x$ is a convex problem, and many efficient methods exist for this purpose when $X$ is simple. However, for each penalty parameter, the PD method solves the two penalty subproblems alternately many times until the inner termination condition is reached, which is time consuming.

Consider a special case: compressed sensing [31]. One important task of compressed sensing is to find the sparsest solution of an underdetermined linear system, which is formulated as
$$\min_x \; \|x\|_0 \quad \text{s.t.} \quad Ax = b, \tag{9}$$
where $A \in \mathbb{R}^{m \times n}$ ($m < n$) is the sensing matrix and $b \in \mathbb{R}^m$ is the observation data. This problem fits the framework of (1) with $f \equiv 0$, $h(x) = Ax - b$, $X = \mathbb{R}^n$, and a prescribed sparsity level $r$. The value of $\|Ax - b\|$ is called the data fidelity, and it measures the feasibility of a solution $x$. Correspondingly, $q_\rho(x, y) = \rho(\|Ax - b\|^2 + \|x - y\|^2)$ is the penalty function of problem (9).
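As a small illustration of this specialization, the following Python/NumPy sketch evaluates the data fidelity and the penalty function, and solves the $x$-subproblem of Step 5 exactly (for $f \equiv 0$ and $X = \mathbb{R}^n$, the minimizer of $q_\rho(\cdot, y)$ satisfies the normal equations $(A^T A + I)x = A^T b + y$); all function names are ours.

```python
import numpy as np

def data_fidelity(A, x, b):
    """Data fidelity ||Ax - b||, a measure of the feasibility of x."""
    return np.linalg.norm(A @ x - b)

def penalty(A, x, y, b, rho):
    """Penalty function q_rho(x, y) = rho*(||Ax - b||^2 + ||x - y||^2)."""
    return rho * (np.linalg.norm(A @ x - b) ** 2 + np.linalg.norm(x - y) ** 2)

def x_subproblem(A, b, y):
    """Exact minimizer of ||Ax - b||^2 + ||x - y||^2 over x in R^n."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + np.eye(n), A.T @ b + y)
```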

Example 1. We generate a sparse vector $\bar{x}$ of a given length with a given number of nonzero components. These components independently follow the standard Gaussian distribution, and their locations in $\bar{x}$ are assigned randomly. Then, we create a Gaussian random matrix $A$ of the corresponding size and let $b = A\bar{x}$. We then solve this instance with the PD method package, and the process data are presented in Figure 1.
Figure 1 shows that the value of the data fidelity decreases slowly. It decreases steeply only during the first few steps for each penalty parameter, and there are many almost-null steps during the process. Moreover, the value of the penalty function increases sharply when the penalty parameter is updated. Hence, we can take just one or a few iterations for each penalty parameter to save time. In Section 3, we improve the PD method based on these observations.

3. The Proposed Method

In this section, we describe the iPD method. From the outline of Algorithm 1, we see that, for each penalty parameter $\rho_k$, the block coordinate descent method needs to solve the two minimization subproblems alternately many times, and the example in Section 2 shows that there are many almost-null steps for each penalty parameter. Hence, the original PD method may be time consuming if the block coordinate descent method converges slowly.

Motivated by the analysis in Section 2 and the above observation, we accelerate the PD method by solving each of the two penalty subproblems only once before updating the penalty parameter. For the first penalty subproblem, a gradient step is taken, and its step length is determined by a backtracking line search.

Now, we present the outline of the accelerated penalty decomposition method as follows.

Remark 2. A practical termination criterion in Step 11 of Algorithm 2 can be
$$\|x_{k+1} - y_{k+1}\|_\infty \le \epsilon$$
for some $\epsilon > 0$.

Input: $\rho_0 > 0$, $\sigma > 1$, $t_0 > 0$, $\beta \in (0, 1)$, $(x_0, y_0)$ with $y_0 \in Y$, $\epsilon > 0$;
Output: $x^*$;
(1)initialization $k \leftarrow 0$, $t \leftarrow t_0$;
(2)repeat
(3)while $q_{\rho_k}(x_k - t \nabla_x q_{\rho_k}(x_k, y_k), y_k) > q_{\rho_k}(x_k, y_k) - \frac{t}{2}\|\nabla_x q_{\rho_k}(x_k, y_k)\|^2$ do
(4)  $t \leftarrow \beta t$;
(5)  update the trial point $x_k - t \nabla_x q_{\rho_k}(x_k, y_k)$;
(6)end while
(7) $x_{k+1} \leftarrow P_X(x_k - t \nabla_x q_{\rho_k}(x_k, y_k))$;
(8) $y_{k+1} \in \arg\min_{y \in Y} q_{\rho_k}(x_{k+1}, y)$;
(9) $\rho_{k+1} \leftarrow \sigma \rho_k$;
(10) $k \leftarrow k + 1$, $t \leftarrow t_0$;
(11)until some termination condition is reached
(12) $x^* \leftarrow x_k$;
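For the compressed sensing specialization of Section 2, the whole iPD loop can be sketched in a few lines of Python/NumPy. This is a minimal illustration under the assumptions stated there ($f \equiv 0$, $h(x) = Ax - b$, $X = \mathbb{R}^n$); it reuses the hard_threshold function sketched in Section 2, and all names and default parameter values are ours, not prescribed by Algorithm 2.

```python
import numpy as np

def ipd_cs(A, b, r, rho=1.0, sigma=10.0, t0=1.0, beta=0.5,
           eps=1e-6, max_iter=500):
    """Sketch of Algorithm 2 for recovering an r-sparse x with Ax ~ b."""
    n = A.shape[1]
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        q = lambda u: rho * (np.linalg.norm(A @ u - b) ** 2
                             + np.linalg.norm(u - y) ** 2)
        grad = 2.0 * rho * (A.T @ (A @ x - b) + (x - y))
        t = t0
        # Steps 3-6: backtracking line search on the x-subproblem.
        while q(x - t * grad) > q(x) - 0.5 * t * np.dot(grad, grad):
            t *= beta
        x = x - t * grad                   # Step 7 (X = R^n, projection is trivial)
        y = hard_threshold(x, r)           # Step 8: hard thresholding
        if np.max(np.abs(x - y)) <= eps:   # termination test of Remark 2
            break
        rho *= sigma                       # Step 9: increase the penalty parameter
    return y
```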

Theorem 1. If the gradient of the function $q_\rho(x, y)$ with respect to $x$ is Lipschitz continuous (its Lipschitz constant is denoted as $L$), then the line search between Steps 3 and 6 of Algorithm 2 terminates in a finite number of iterations.

Proof. Since $\nabla_x q_\rho(x, y)$ is Lipschitz continuous with constant $L$, $q_\rho$ satisfies the descent lemma
$$q_\rho\big(x - t \nabla_x q_\rho(x, y), y\big) \le q_\rho(x, y) - t\Big(1 - \frac{Lt}{2}\Big) \|\nabla_x q_\rho(x, y)\|^2;$$
it together with $1 - Lt/2 \ge 1/2$ for $t \le 1/L$ implies that
$$q_\rho\big(x - t \nabla_x q_\rho(x, y), y\big) \le q_\rho(x, y) - \frac{t}{2} \|\nabla_x q_\rho(x, y)\|^2.$$
Then, if $t \le 1/L$, the sufficient decrease condition holds, which means that the while loop in Algorithm 2 terminates once $t \le 1/L$. Let $\bar{t}$ be the final value of $t$ after the while loop. If the loop executes at least once, the preceding trial value $\bar{t}/\beta$ failed the test, so $\bar{t}/\beta > 1/L$ holds, i.e., $\bar{t} > \beta/L$. Let $n_k$ be the number of iterations of the while loop at the $k$-th iteration. Then, one can get that
$$\beta^{n_k} t_0 = \bar{t} > \frac{\beta}{L},$$
where $t_0$ is the initial value of $t$ in the line search. Therefore,
$$n_k < 1 + \log_{1/\beta}(L t_0) < \infty.$$

4. Experiments

In this section, we implement the proposed accelerated PD method to solve the compressed sensing problem. To verify its efficiency empirically, a large number of computational experiments are performed on one-dimensional random signals. We mainly compare the performance of the proposed iPD method with that of the original PD method [22]. All experiments were performed on a personal computer with an Intel(R) Core(TM) i7-7700HQ CPU (2.80 GHz) and 8 GB of memory, using MATLAB (version R2018b).

We compare the performance of the methods by the CPU time (in seconds) required, the size of the support set of the reconstructed data $x$, the mean squared error (MSE) with respect to the true signal $\bar{x}$, defined as
$$\mathrm{MSE} = \frac{1}{n} \|x - \bar{x}\|^2,$$
the data fidelity (DF) of $x$, defined as
$$\mathrm{DF} = \|Ax - b\|,$$
and NS, the number of successfully recovered instances. We say a signal $\bar{x}$ is successfully recovered by $x$ if the positions of the nonzero components of $x$ are the same as those of $\bar{x}$ and the corresponding MSE value is below a prescribed tolerance.
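The following Python/NumPy sketch computes these metrics; the success tolerance tol is illustrative, standing in for the fixed threshold used in the experiments.

```python
import numpy as np

def metrics(A, b, x, x_true, tol=1e-4):
    """MSE, data fidelity, and the success test; `tol` is illustrative."""
    mse = np.mean((x - x_true) ** 2)        # mean squared error
    df = np.linalg.norm(A @ x - b)          # data fidelity
    same_support = np.array_equal(np.nonzero(x)[0], np.nonzero(x_true)[0])
    return mse, df, (same_support and mse < tol)
```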

4.1. Data Generation and Parameter Setting

Each instance is generated randomly with size $(m, n, K)$, where $m \times n$ is the dimension of the matrix $A$ and $K$ is the sparsity level. The elements of the matrix $A$ follow the standard Gaussian distribution. The vector $\bar{x}$ is generated with the same distribution at $K$ randomly chosen coordinates. Finally, the vector $b$ is generated by $b = A\bar{x}$.
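A sketch of this data generation in Python/NumPy (the function name and the use of a seeded generator are ours):

```python
import numpy as np

def generate_instance(m, n, K, seed=None):
    """Random instance: Gaussian A, K-sparse Gaussian x_true, b = A @ x_true."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_true = np.zeros(n)
    support = rng.choice(n, size=K, replace=False)  # random support of size K
    x_true[support] = rng.standard_normal(K)        # Gaussian nonzero entries
    return A, A @ x_true, x_true
```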

Unless otherwise stated, all parameters in the PD method are set to their default values, and the parameters of the iPD method are set as in Table 1.

4.2. Comparison with the Original PD Method

Firstly, we compare the iteration process of the iPD method with that of the PD method on a random instance. All parameters and the problem size are set as before. Figure 2 depicts the data fidelity and the penalty function value over the iteration process. From Figure 2(a), we find that the iPD method does not take many null steps, and the data fidelity values it generates decrease much faster than those of the original PD method. Furthermore, the iPD method requires only about 150 steps, while the original PD method requires about 400 steps, and the running time of the iPD method is about 7 seconds, less than half the time required by the original PD method. Moreover, the penalty function value generated by the iPD method is much more stable than that of the original PD method.

In the second experiment, we compare the accelerated PD method with the original PD method at different sampling numbers. We fix the dimension $n$ and the sparsity level $K$. For each sampling number $m$, 100 instances are generated, and the average performance of the two methods is presented in Figure 3.

From Figure 3(a), we see that the accelerated PD method requires no more than 10 seconds, while the original PD method requires much more time, and the time required by the accelerated PD method is stable across different sampling numbers. Figure 3(b) shows that the recovery rate of the accelerated PD method is higher than that of the original PD method when $m$ is larger than 600. When the sampling number is larger than 700, the accelerated PD method recovers all signals successfully. We also find that the MSE and DF values generated by the accelerated PD method are lower than those generated by the original PD method. The average number of nonzero components likewise shows that the accelerated PD method performs better.

In the next experiment, we compare the accelerated PD method with its original version for solving the compressed sensing problem at different sparsity levels $K$. All parameters are set to the same values as stated before. The average computational results over 100 instances are presented in Table 2.

From Table 2, we find that the PD method does not work well when the sparsity level is greater than 150, especially when it is greater than 200, whereas the sparsity level recovered by the iPD method can reach 200. When both methods can recover the sparse signals, the iPD method needs only about one third of the time required by the PD method. Moreover, the recovery rate of the iPD method is higher than that of the original PD method. From the MSE and DF values, we see that the signals recovered by the iPD method are more accurate than those recovered by the PD method. At one sparsity level, there is a single instance not recovered exactly by the iPD method, since it contains several very small components and one of them is missed.

5. Conclusions

In this paper, we have proposed an acceleration of the penalty decomposition method for the sparse optimization problem. The proposed method does not solve the penalty subproblems exactly; instead, it solves each penalty subproblem only once before each update of the penalty parameter. Computational experiments on a number of random instances of the compressed sensing problem show that this method enhances the performance of the penalty decomposition method: it recovers better solutions in less running time.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded in part by the Natural Science Foundation of Fujian Province of China under grant 2020J01843 and by the Science and Technology Project of the Education Bureau of Fujian, China, under grant JAT200403.