Learning Robust Low-Rank Representations
Pablo Sprechmann, Dept. of Elec. and Comp. Eng., Duke University
Alexander M. Bronstein, School of Elec. Eng., Tel Aviv University
Guillermo Sapiro, Dept. of Elec. and Comp. Eng. and Dept. of Comp. Science, Duke University

Abstract

In this paper we present a comprehensive framework for learning robust low-rank representations by combining and extending recent ideas for learning fast sparse coding regressors with structured non-convex optimization techniques. This approach connects robust principal component analysis (RPCA) with dictionary learning methods and allows its approximation via trainable encoders. We propose an efficient feed-forward architecture derived from an optimization algorithm designed to exactly solve robust low-dimensional projections. This architecture, in combination with different training objective functions, allows the regressors to be used as online approximants of the exact offline RPCA problem or as RPCA-based neural networks. Simple modifications of these encoders can handle challenging extensions, such as the inclusion of geometric data transformations. We present several examples with real data from audio and video processing. When used to approximate RPCA, our basic implementation shows several orders of magnitude speedup compared to the exact solvers with almost no performance degradation. We show the strength of the inclusion of learning into the RPCA approach on a music source separation application, where the encoders outperform the exact RPCA algorithms, which are already reported to produce state-of-the-art results on a benchmark database. Video applications are demonstrated as well. Our preliminary implementation on an iPad shows faster-than-real-time performance with minimal latency.

1 Introduction

Principal component analysis (PCA) is the most widely used statistical technique for dimensionality reduction.
However, its performance is highly sensitive to the presence of outliers, that is, samples not following the underlying low-rank model. In a series of recent works, Candès et al. (2011) and Xu et al. (2010) developed an elegant solution to this problem, in which the low-rank matrix is determined as the minimizer of a convex program. In this work, we concentrate on the noisy version of robust PCA (RPCA), obtained by solving the unconstrained program

\min_{L, O \in \mathbb{R}^{m \times n}} \frac{1}{2} \|X - L - O\|_F^2 + \lambda_* \|L\|_* + \lambda_1 \|O\|_1,   (1)

where X \in \mathbb{R}^{m \times n} is the data matrix, L is a low-rank matrix, and O is an error matrix with a sparse number of non-zero coefficients (of arbitrarily large magnitude). \|L\|_* denotes the matrix nuclear norm, defined as the sum of the singular values of L. The positive scalar parameters \lambda_* and \lambda_1 control the rank of L and the sparsity level of the outliers, respectively, and can be set automatically as in Ramirez & Sapiro (2012). This particular formulation of RPCA has attracted significant interest in the machine learning, computer vision, and signal processing communities, and was successfully used in various applications, such as Wagner et al. (2012); Peng et al. (2010); Zhou et al. (2010); Mu et al. (2011). A challenge often encountered in modern settings is that the flow of new input data is permanent. Then, the
robust low-rank model needs to be adapted constantly, since the principal directions can change over time, calling for the efficient online techniques recently proposed in Qiu & Vaswani (2011); Tan et al. (2011); Mateos & Giannakis (2012); Balzano et al. (2010). A significant amount of effort has been devoted to developing optimization algorithms for efficiently solving (1) and its noiseless formulation: Candès et al. (2011); Lin et al. (2009); Ma et al. (2011); Recht & Ré (2011). Despite the permanent progress reported in the literature, state-of-the-art algorithms still have prohibitive complexity and latency for real-time processing. In the sparse coding domain, a similar situation was encountered a few years ago. This motivated significant effort in the deep learning community, e.g., by Ranzato et al. (2007) and Goodfellow et al. (2009), aiming at overcoming this problem. Works by Jarrett et al. (2009) and Kavukcuoglu et al. (2010) proposed learning non-linear regressors capable of producing good approximations of the true sparse codes in a fixed amount of time. Gregor & LeCun (2010) introduced an approach in which the regressors are multilayer artificial neural networks with an architecture inspired by first-order optimization algorithms for solving sparse coding problems. This work was later extended to handle more general structured sparse models by Sprechmann et al. (2012a). With this motivation, in this paper we propose to extend these ideas to the RPCA context. We propose to design regressors capable of approximating online RPCA in a very fast way. To the best of our knowledge, this type of encoder has never been developed before. Our RPCA encoders are learned by minimizing various objective functions that allow their use in several different contexts. In Section 2, we present our approach to RPCA and discuss exact optimization algorithms to solve it. In Section 3, we introduce the new robust encoders and the new objective functions used for their training.
In Section 4, we present several experimental results. Conclusions are drawn in Section 5. For further details, the reader is referred to Sprechmann et al. (2012b).

2 Online RPCA via non-convex factorization

Using the results from Srebro & Shraibman (2005) and Recht et al. (2010), Mateos & Giannakis (2012) proposed to reformulate (1) as

\min_{U, S, O} \frac{1}{2} \|X - US - O\|_F^2 + \frac{\lambda_*}{2} \left( \|U\|_F^2 + \|S\|_F^2 \right) + \lambda_1 \|O\|_1,   (2)

with U \in \mathbb{R}^{m \times q}, S \in \mathbb{R}^{q \times n}, and O \in \mathbb{R}^{m \times n}. Here, q is a rough upper bound on rank(L). This decomposition reveals a lot of structure hidden in the problem. The low-rank component can now be thought of as an under-complete dictionary U, with q atoms, multiplied by a matrix S containing in its columns the corresponding coefficients for each data vector in X. This interpretation brings our problem close to that of dictionary learning in the sparse modeling domain; see Mairal et al. (2009). While this new factorized formulation drastically reduces the number of optimization variables, problem (2) is no longer convex. Fortunately, Mardani et al. (2011) showed that any stationary point of (2), {U, S, O}, satisfying \|X - US - O\|_2 \leq \lambda_* is an optimal solution of (2). Thus, problem (2) can be solved using an alternating minimization or block-coordinate scheme, in which the cost function is minimized with respect to each individual optimization variable while keeping the other ones fixed, without the risk of falling into a stationary point that is not globally optimal. This will be exploited to design our fast encoders.

Robust low-dimensional projections: Let us assume that we have already learned a low-dimensional model, U \in \mathbb{R}^{m \times q}, from some data X \approx US + O \in \mathbb{R}^{m \times n}. Suppose that we are given a new input vector x \in \mathbb{R}^m drawn from the same distribution as X. Then x can be decomposed as x \approx Us + n + o, where Us represents the low-dimensional component, n is a small perturbation, and o is a sparse outlier vector. This can be done by solving

\min_{s \in \mathbb{R}^q, o \in \mathbb{R}^m} \frac{1}{2} \|x - Us - o\|_2^2 + \frac{\lambda_*}{2} \|s\|_2^2 + \lambda_1 \|o\|_1.   (3)

Unlike dictionary learning problems, e.g.
in Mairal et al. (2009), here the columns of the dictionary U are not constrained to have unit norm. The robust low-dimensional projection (3) is a convex program that can be solved using several methods. We are interested in choosing an optimization algorithm that can be further used to define the architecture of trainable encoders for simultaneously estimating s and o. With this in mind, we choose to use the block-coordinate minimization scheme.
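At the matrix level, the same block-coordinate logic solves the batch problem (1) exactly: the minimizer in L for fixed O is singular value thresholding of X − O, and the minimizer in O for fixed L is entry-wise soft-thresholding of X − L. A minimal NumPy sketch (our illustration, not the authors' implementation; `lam_star` and `lam1` stand for \lambda_* and \lambda_1):

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * l1-norm: entry-wise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Proximal operator of tau * nuclear norm: singular value thresholding."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def rpca(X, lam_star, lam1, n_iter=100):
    """Minimize 0.5||X - L - O||_F^2 + lam_star||L||_* + lam1||O||_1
    by exact block-coordinate minimization over L and O."""
    L = np.zeros_like(X)
    O = np.zeros_like(X)
    for _ in range(n_iter):
        L = svt(X - O, lam_star)          # exact minimizer in L for fixed O
        O = soft_threshold(X - L, lam1)   # exact minimizer in O for fixed L
    return L, O
```

Note that each iteration computes a full SVD of an m × n matrix; this per-iteration cost is precisely what makes exact solvers unsuitable for real-time use.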
The solution of (3), when fixing o and s respectively, is given by

s = (U^T U + \lambda_* I)^{-1} U^T (x - o)   and   o = \pi_{\lambda}(x - Us).

Here \pi_{\lambda} is the component-wise soft-thresholding operator with parameter vector \lambda \in \mathbb{R}^m, which applies a soft-threshold \lambda_i to each component of the input vector; in our case, \lambda_i = \lambda_1 for all i. Note that we can write the estimates in a recursive manner as

o^{k+1} = \pi_{\lambda}(b^k),   b^{k+1} = b^k + W(o^{k+1} - o^k),

where k is the iteration number, b^k = x - Us^k, W = UHU^T, and H = (U^T U + \lambda_* I)^{-1}. After convergence, s can be obtained as s = HU^T(x - o).

Online RPCA: So far, we assumed that the entire data matrix X was available a priori. We now address the case when the data samples {x_t}, x_t \in \mathbb{R}^m, arrive sequentially; the index t should be understood as a temporal variable. Online RPCA aims at estimating and refining the model as the data come in. An alternating minimization algorithm for solving the online counterpart of (2) goes as follows: when a new data vector x_t is received, we first obtain its representation {s_t, o_t} as a low-dimensional projection (3) given the current model estimate, U_t. This can be done using strategies such as the block-coordinate descent methods with warm restarts described in Mairal et al. (2009), or recursive least squares as in Mateos & Giannakis (2012).

3 Online RPCA via fast trainable encoders

The main contribution of this paper is the construction of trainable regressors that can be used for approximating the solution of (3). The main idea is to build a parametric regressor z = (s, o) = h(x, Θ), with some set of parameters collectively denoted as Θ. Thus, we need to define an architecture for h and a learning algorithm in order to determine Θ. Following the fast sparse coding methods in Gregor & LeCun (2010) and Sprechmann et al. (2012a), we propose to use a feed-forward multi-layer architecture where each layer implements a single iteration of the block-coordinate minimization scheme described in Section 2.
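The recursive updates above, truncated to a fixed number of iterations, are exactly the computation that such a network unrolls. A minimal NumPy sketch (our illustration, under the notation of (3)), with W, H and the threshold vector as the quantities that later become the learnable parameters, here simply initialized from U:

```python
import numpy as np

def soft_threshold(v, lam):
    """pi_lam: component-wise soft-thresholding with threshold vector lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def init_params(U, lam_star, lam1):
    """Initialize W, H and the thresholds to replicate the exact algorithm."""
    q = U.shape[1]
    H = np.linalg.inv(U.T @ U + lam_star * np.eye(q))
    W = U @ H @ U.T
    lam = np.full(U.shape[0], lam1)       # one threshold per component
    return W, H, lam

def robust_projection(x, U, W, H, lam, n_iter=2000):
    """Solve (3) by the recursive updates
    o_{k+1} = pi_lam(b_k),  b_{k+1} = b_k + W (o_{k+1} - o_k)."""
    o = np.zeros_like(x)
    b = x - W @ x                          # b_0 = x - U s_0 with o_0 = 0
    for _ in range(n_iter):
        o_new = soft_threshold(b, lam)
        b = b + W @ (o_new - o)
        o = o_new
    s = H @ (U.T @ (x - o))                # s = H U^T (x - o) after convergence
    return s, o
```

Since \|W\|_2 < 1, the iteration is a contraction and converges linearly; a fixed, small number of iterations already gives a good approximation, which is what the truncated network exploits.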
The parameters of the network are the matrices W and H and the thresholds \lambda (extra flexibility is obtained by learning a different threshold \lambda_i for each component). The encoder architecture is depicted in Sprechmann et al. (2012b). Each layer essentially consists of the nonlinear thresholding operator \pi_{\lambda} followed by a linear operation W. The network parameters are initialized to replicate a fixed number of iterations of the exact algorithm. As a learning strategy, we propose to select the set of parameters Θ that minimizes the loss function

L(Θ) = \frac{1}{n} \sum_{j=1}^{n} L(Θ, x_j)   (4)

on a training set X = {x_1, ..., x_n}. Here, L(Θ, x_j) is a function that measures the quality of the code z_j = h(x_j, Θ) produced for the data point x_j. The selection of the objective function L determines the type of regressor that we are going to obtain, and is application-dependent. One of the simplest choices is L(Θ, x_j) = \|z_j - z^*_j\|_2^2, with z^*_j = (s^*_j, o^*_j) being the j-th columns of the decomposition of the data X = (x_1, ..., x_n) into X = US^* + O^* by the exact RPCA. This essentially trains the encoder to approximate the exact solution of the RPCA problem. In many applications, the data may not completely adhere to the assumptions of the RPCA model, and the exact solution is, therefore, not necessarily the best one. This is the case, for example, in the source separation problem presented in Huang et al. (2012), where RPCA gives a very good separation of the spectrally sparse singing voice (modeled as outliers) and the repetitive low-rank background accompaniment, yet the obtained signals are still not equal to the true voice and background tracks. In this case, one could use a collection of clean voice and background tracks, {s_j} and {o_j} respectively, to supervise the training; this leads to better separation results, as reported in Section 4. In a fully online setting, one has access neither to the exact RPCA decomposition nor to the true ground-truth separation.
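The regression loss (4) with the code-approximation choice of L(Θ, x_j) can be sketched as follows (our illustration; the hypothetical helper `encode` stands for the encoder z = h(x, Θ), and the targets are the columns of the exact RPCA codes, or of the clean reference tracks in the supervised audio setting):

```python
import numpy as np

def supervised_loss(encode, X, S_star, O_star):
    """Empirical loss (4) with L(Theta, x_j) = ||z_j - z*_j||^2, where
    z_j = (s_j, o_j) = encode(x_j) and z*_j = (s*_j, o*_j) are the target
    codes stored as the j-th columns of S_star and O_star."""
    total = 0.0
    n = X.shape[1]
    for j in range(n):                     # iterate over training vectors x_j
        s, o = encode(X[:, j])
        total += np.sum((s - S_star[:, j]) ** 2) + np.sum((o - O_star[:, j]) ** 2)
    return total / n
```

In practice this average would be minimized over Θ by stochastic gradient descent, backpropagating through the layers of the encoder.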
In that case, we propose to directly use L(Θ, x_j) = f(x_j, z_j), where f is the objective function of the low-dimensional projection (3). In this fully unsupervised case, the encoders can be viewed as stand-alone RPCA regressors, which brings them close in spirit to the sparse auto-encoders proposed by Goodfellow et al. (2009). (The same could be done by solving (3) using proximal methods instead of block-coordinate minimization.) Several applications of RPCA rely on the critical assumption that the given input vectors are aligned with respect to a group of geometric transformations. Translating the work by Peng
et al. (2010) to our setting, we propose to use the training objective function defined in (4) with L_{Tr}(Θ, x_j, α_j) = f(T_{α_j}(x_j), h(T_{α_j}(x_j), Θ)), where T_α is a parametrized operator (with a set of parameters collectively denoted as α) that applies a different geometric transformation, T_{α_j}, to each training vector x_j. In the online setting, when a new data vector x_t arrives, we compute its robust low-rank projection by minimizing L_{Tr}(Θ, x_t, α_t) over the vector α_t parameterizing the geometric transformation. In the same way, as new data arrive, the transformations of all the previously seen training vectors are updated through the minimization of a loss function L(Θ, α) with respect to α. The minimization of a loss function L(Θ) with respect to Θ or α is performed using stochastic gradient descent; see Gregor & LeCun (2010) and Sprechmann et al. (2012a) for details.

4 Experimental results

Figure 1: Robust PCA representation of several frames (top-to-bottom) from the surveillance sequence obtained using the algorithm from Lin et al. (2009) (left), and a five-layer network encoder (right). Columns in each group are, left-to-right: the reconstructed frame, its low-rank approximation (background), and the sparse outlier (foreground).

Table 1: Performance of audio separation methods on the MIR-1K dataset (GNSDR, GSNR, GSAR, and GSIR), comparing the ideal frequency mask, ADMoM RPCA (Huang et al., 2012), proximal RPCA, and the untrained, unsupervised, and supervised NN RPCA encoders; the numerical entries did not survive the transcription.

Video separation: Figure 1 shows background and foreground separation via robust PCA on the surveillance video sequence Hall of a business building, taken from Li et al. (2004). The sequence consists of images of an indoor scene shot by a static camera in a mall. The scene has a nearly constant background and walking people in the foreground. We used networks with five layers and q = 5, trained to approximate the output of the exact RPCA on a subset of the frames in the sequence.
The separation produced by the fast encoder is nearly identical to the output of the exact algorithm and to the output of the code from Lin et al. (2009), used as reference, while being considerably faster. Our Matlab implementation with built-in GPU acceleration, executed on an NVIDIA Tesla C2070 GPU, propagates a frame through a single layer of the network in merely 92 µs. This is several orders of magnitude faster than an iterative solver executed on the CPU.

Audio separation: We evaluate the separation performance of the proposed methods on the MIR-1K dataset from Hsu & Jang (2010), containing 1,000 clips extracted from 110 Chinese karaoke songs. The experimental settings closely followed those of Huang et al. (2012), to which the reader is referred for further details. As the evaluation criteria, we used the BSS-EVAL metrics developed in Vincent et al. (2006), which calculate the global normalized source-to-distortion ratio (GNSDR), source-to-artifacts ratio (GSAR), source-to-interference ratio (GSIR), and signal-to-noise ratio (GSNR). All networks used 20 layers with q = 25. The following training regimes were compared: untrained parameters initialized according to Section 2 (Untrained); unsupervised learning (Unsupervised); and training supervised by the clean voice and background tracks (Supervised). Table 1 summarizes the obtained separation performance. While unsupervised training makes the fast RPCA encoders on par with the exact RPCA (at a fraction of the computational complexity and latency of the latter), a significant improvement is achieved by the supervised regime.

5 Conclusion

By combining ideas from structured non-convex optimization with multi-layer neural networks, we have developed a comprehensive framework for the online learning of robust low-rank representations, working in real time and capable of handling large-scale applications.
A basic implementation already achieves several orders of magnitude speedup when compared to exact solvers, opening the door for practical algorithms in various applications.
References

Balzano, L., Nowak, R., and Recht, B. Online identification and tracking of subspaces from highly incomplete information. In Proc. of 48th Allerton Conf., 2010.
Candès, E., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM, 58(3), 2011.
Goodfellow, I., Le, Q., Saxe, A., Lee, H., and Ng, A. Y. Measuring invariances in deep networks. In NIPS, 2009.
Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, 2010.
Hsu, C.-L. and Jang, J.-S. R. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. on Audio, Speech, and Lang. Proc., 18(2), 2010.
Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In ICASSP, 2012.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint, 2010.
Li, L., Huang, W., Gu, I. Y.-H., and Tian, Q. Statistical modeling of complex backgrounds for foreground object detection. IEEE Trans. Image Process., 13(11), 2004.
Lin, Z., Ganesh, A., Wright, J., Wu, L., Chen, M., and Ma, Y. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Preprint, 2009.
Ma, S., Goldfarb, D., and Chen, L. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 128(1-2), 2011.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In ICML, 2009.
Mardani, M., Mateos, G., and Giannakis, G. B. Unveiling network anomalies in large-scale networks via sparsity and low rank. In Proc. of Asilomar Conf. on Signals, Systems, and Computers, 2011.
Mateos, G. and Giannakis, G. B. Robust PCA as bilinear decomposition with outlier-sparsity regularization. IEEE Trans. on Signal Process., 60(10), 2012.
Mu, Y., Dong, J., Yuan, X., and Yan, S. Accelerated low-rank visual recovery by random projection. In CVPR, 2011.
Peng, Y., Ganesh, A., Wright, J., Xu, W., and Ma, Y. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In CVPR, 2010.
Qiu, C. and Vaswani, N. Real-time robust principal components pursuit. arXiv preprint, 2011.
Ramirez, I. and Sapiro, G. Low-rank data modeling via the minimum description length principle. In ICASSP, 2012.
Ranzato, M., Huang, F. J., Boureau, Y.-L., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
Recht, B. and Ré, C. Parallel stochastic gradient algorithms for large-scale matrix completion. 2011.
Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3), 2010.
Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning efficient structured sparse models. In ICML, 2012a.
Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning Robust Low-Rank Representations. arXiv preprint, 2012b.
Srebro, N. and Shraibman, A. Rank, trace-norm and max-norm. In COLT, 2005.
Tan, W., Cheung, G., and Ma, Y. Face recovery in conference video streaming using robust principal component analysis. In ICIP, 2011.
Vincent, E., Gribonval, R., and Févotte, C. Performance measurement in blind audio source separation. IEEE Trans. on Audio, Speech, and Lang. Proc., 14(4), 2006.
Wagner, A., Wright, J., Ganesh, A., Zhou, Z., Mobahi, H., and Ma, Y. Towards a practical face recognition system: Robust alignment and illumination via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 34(2), 2012.
Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. In NIPS, 2010.
Zhou, Z., Li, X., Wright, J., Candès, E. J., and Ma, Y. Stable principal component pursuit. In ISIT, 2010.
Sparsity and image processing Aurélie Boisbunon INRIA-SAM, AYIN March 6, Why sparsity? Main advantages Dimensionality reduction Fast computation Better interpretability Image processing pattern recognition
More informationSPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari
SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari Laboratory for Advanced Brain Signal Processing Laboratory for Mathematical
More informationNTHU Rain Removal Project
People NTHU Rain Removal Project Networked Video Lab, National Tsing Hua University, Hsinchu, Taiwan Li-Wei Kang, Institute of Information Science, Academia Sinica, Taipei, Taiwan Chia-Wen Lin *, Department
More informationADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N.
ADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N. Dartmouth, MA USA Abstract: The significant progress in ultrasonic NDE systems has now
More informationModeling Visual Cortex V4 in Naturalistic Conditions with Invari. Representations
Modeling Visual Cortex V4 in Naturalistic Conditions with Invariant and Sparse Image Representations Bin Yu Departments of Statistics and EECS University of California at Berkeley Rutgers University, May
More informationFace Recognition At-a-Distance Based on Sparse-Stereo Reconstruction
Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,
More informationImage Restoration and Background Separation Using Sparse Representation Framework
Image Restoration and Background Separation Using Sparse Representation Framework Liu, Shikun Abstract In this paper, we introduce patch-based PCA denoising and k-svd dictionary learning method for the
More informationProceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong
, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA
More informationSupervised Translation-Invariant Sparse Coding
Supervised Translation-Invariant Sparse Coding Jianchao Yang,KaiYu, Thomas Huang Beckman Institute, University of Illinois at Urbana-Champaign NEC Laboratories America, Inc., Cupertino, California {jyang29,
More informationConsistent dictionary learning for signal declipping
Consistent dictionary learning for signal declipping Lucas Rencker 1 ( ), Francis Bach, Wenwu Wang 1, and Mark D. Plumbley 1 1 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford,
More informationOR-PCA with MRF for Robust Foreground Detection in Highly Dynamic Backgrounds
OR-PCA with MRF for Robust Foreground Detection in Highly Dynamic Backgrounds Sajid Javed 1, Seon Ho Oh 1, Andrews Sobral 2, Thierry Bouwmans 2 and Soon Ki Jung 1 1 School of Computer Science and Engineering,
More informationExtended Dictionary Learning : Convolutional and Multiple Feature Spaces
Extended Dictionary Learning : Convolutional and Multiple Feature Spaces Konstantina Fotiadou, Greg Tsagkatakis & Panagiotis Tsakalides kfot@ics.forth.gr, greg@ics.forth.gr, tsakalid@ics.forth.gr ICS-
More informationStacked Denoising Autoencoders for Face Pose Normalization
Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University
More informationRobust and Secure Iris Recognition
Robust and Secure Iris Recognition Vishal M. Patel University of Maryland, UMIACS pvishalm@umiacs.umd.edu IJCB 2011 Tutorial Sparse Representation and Low-Rank Representation for Biometrics Outline Iris
More informationChange-Point Estimation of High-Dimensional Streaming Data via Sketching
Change-Point Estimation of High-Dimensional Streaming Data via Sketching Yuejie Chi and Yihong Wu Electrical and Computer Engineering, The Ohio State University, Columbus, OH Electrical and Computer Engineering,
More informationCLEANING UP TOXIC WASTE: REMOVING NEFARIOUS CONTRIBUTIONS TO RECOMMENDATION SYSTEMS
CLEANING UP TOXIC WASTE: REMOVING NEFARIOUS CONTRIBUTIONS TO RECOMMENDATION SYSTEMS Adam Charles, Ali Ahmed, Aditya Joshi, Stephen Conover, Christopher Turnes, Mark Davenport Georgia Institute of Technology
More informationLearning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li
Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,
More informationConvex optimization algorithms for sparse and low-rank representations
Convex optimization algorithms for sparse and low-rank representations Lieven Vandenberghe, Hsiao-Han Chao (UCLA) ECC 2013 Tutorial Session Sparse and low-rank representation methods in control, estimation,
More informationShow, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks
Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of
More informationAdvanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude
Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,
More informationDiscriminative sparse model and dictionary learning for object category recognition
Discriative sparse model and dictionary learning for object category recognition Xiao Deng and Donghui Wang Institute of Artificial Intelligence, Zhejiang University Hangzhou, China, 31007 {yellowxiao,dhwang}@zju.edu.cn
More informationPartial Sum Minimization of Singular Values in RPCA for Low-Level Vision
2013 IEEE International Conference on Computer Vision Partial Sum Minimization of Singular Values in RPCA for Low-Level Vision Tae-Hyun Oh* KAIST Hyeongwoo Kim KAIST Yu-Wing Tai KAIST Jean-Charles Bazin
More informationSeismic data reconstruction with Generative Adversarial Networks
Seismic data reconstruction with Generative Adversarial Networks Ali Siahkoohi 1, Rajiv Kumar 1,2 and Felix J. Herrmann 2 1 Seismic Laboratory for Imaging and Modeling (SLIM), The University of British
More informationNULL SPACE CLUSTERING WITH APPLICATIONS TO MOTION SEGMENTATION AND FACE CLUSTERING
NULL SPACE CLUSTERING WITH APPLICATIONS TO MOTION SEGMENTATION AND FACE CLUSTERING Pan Ji, Yiran Zhong, Hongdong Li, Mathieu Salzmann, Australian National University, Canberra NICTA, Canberra {pan.ji,hongdong.li}@anu.edu.au,mathieu.salzmann@nicta.com.au
More informationThe Benefit of Tree Sparsity in Accelerated MRI
The Benefit of Tree Sparsity in Accelerated MRI Chen Chen and Junzhou Huang Department of Computer Science and Engineering, The University of Texas at Arlington, TX, USA 76019 Abstract. The wavelet coefficients
More informationGeneralized Lasso based Approximation of Sparse Coding for Visual Recognition
Generalized Lasso based Approximation of Sparse Coding for Visual Recognition Nobuyuki Morioka The University of New South Wales & NICTA Sydney, Australia nmorioka@cse.unsw.edu.au Shin ichi Satoh National
More informationSparse Models in Image Understanding And Computer Vision
Sparse Models in Image Understanding And Computer Vision Jayaraman J. Thiagarajan Arizona State University Collaborators Prof. Andreas Spanias Karthikeyan Natesan Ramamurthy Sparsity Sparsity of a vector
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More informationSlides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization
Slides adapted from Marshall Tappen and Bryan Russell Algorithms in Nature Non-negative matrix factorization Dimensionality Reduction The curse of dimensionality: Too many features makes it difficult to
More informationSparsity Based Regularization
9.520: Statistical Learning Theory and Applications March 8th, 200 Sparsity Based Regularization Lecturer: Lorenzo Rosasco Scribe: Ioannis Gkioulekas Introduction In previous lectures, we saw how regularization
More informationAn efficient algorithm for sparse PCA
An efficient algorithm for sparse PCA Yunlong He Georgia Institute of Technology School of Mathematics heyunlong@gatech.edu Renato D.C. Monteiro Georgia Institute of Technology School of Industrial & System
More informationarxiv: v1 [cs.lg] 20 Dec 2013
Unsupervised Feature Learning by Deep Sparse Coding Yunlong He Koray Kavukcuoglu Yun Wang Arthur Szlam Yanjun Qi arxiv:1312.5783v1 [cs.lg] 20 Dec 2013 Abstract In this paper, we propose a new unsupervised
More informationAkarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different
More informationarxiv: v1 [cs.cv] 28 Sep 2018
Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,
More informationSupplementary material: Strengthening the Effectiveness of Pedestrian Detection with Spatially Pooled Features
Supplementary material: Strengthening the Effectiveness of Pedestrian Detection with Spatially Pooled Features Sakrapee Paisitkriangkrai, Chunhua Shen, Anton van den Hengel The University of Adelaide,
More informationUniqueness in Bilinear Inverse Problems with Applications to Subspace and Sparsity Models
Uniqueness in Bilinear Inverse Problems with Applications to Subspace and Sparsity Models Yanjun Li Joint work with Kiryung Lee and Yoram Bresler Coordinated Science Laboratory Department of Electrical
More informationCombining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks
Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks Emad M. Grais, Gerard Roma, Andrew J.R. Simpson, Mark D. Plumbley Centre for Vision, Speech and Signal Processing,
More informationP257 Transform-domain Sparsity Regularization in Reconstruction of Channelized Facies
P257 Transform-domain Sparsity Regularization in Reconstruction of Channelized Facies. azemi* (University of Alberta) & H.R. Siahkoohi (University of Tehran) SUMMARY Petrophysical reservoir properties,
More informationINTERACTIVE REFINEMENT OF SUPERVISED AND SEMI-SUPERVISED SOUND SOURCE SEPARATION ESTIMATES
INTERACTIVE REFINEMENT OF SUPERVISED AND SEMI-SUPERVISED SOUND SOURCE SEPARATION ESTIMATES Nicholas J. Bryan Gautham J. Mysore Center for Computer Research in Music and Acoustics, Stanford University Adobe
More informationIllumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model
Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model TAE IN SEOL*, SUN-TAE CHUNG*, SUNHO KI**, SEONGWON CHO**, YUN-KWANG HONG*** *School of Electronic Engineering
More informationSemi-supervised Data Representation via Affinity Graph Learning
1 Semi-supervised Data Representation via Affinity Graph Learning Weiya Ren 1 1 College of Information System and Management, National University of Defense Technology, Changsha, Hunan, P.R China, 410073
More informationExploring Curve Fitting for Fingers in Egocentric Images
Exploring Curve Fitting for Fingers in Egocentric Images Akanksha Saran Robotics Institute, Carnegie Mellon University 16-811: Math Fundamentals for Robotics Final Project Report Email: asaran@andrew.cmu.edu
More informationBlind Identification of Graph Filters with Multiple Sparse Inputs
Blind Identification of Graph Filters with Multiple Sparse Inputs Santiago Segarra, Antonio G. Marques, Gonzalo Mateos & Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania
More informationModule 1 Lecture Notes 2. Optimization Problem and Model Formulation
Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 11 140311 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Motion Analysis Motivation Differential Motion Optical
More informationSparse Component Analysis (SCA) in Random-valued and Salt and Pepper Noise Removal
Sparse Component Analysis (SCA) in Random-valued and Salt and Pepper Noise Removal Hadi. Zayyani, Seyyedmajid. Valliollahzadeh Sharif University of Technology zayyani000@yahoo.com, valliollahzadeh@yahoo.com
More informationDepartment of Electronics and Communication KMP College of Engineering, Perumbavoor, Kerala, India 1 2
Vol.3, Issue 3, 2015, Page.1115-1021 Effect of Anti-Forensics and Dic.TV Method for Reducing Artifact in JPEG Decompression 1 Deepthy Mohan, 2 Sreejith.H 1 PG Scholar, 2 Assistant Professor Department
More informationConditional gradient algorithms for machine learning
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationA low rank based seismic data interpolation via frequencypatches transform and low rank space projection
A low rank based seismic data interpolation via frequencypatches transform and low rank space projection Zhengsheng Yao, Mike Galbraith and Randy Kolesar Schlumberger Summary We propose a new algorithm
More informationBSIK-SVD: A DICTIONARY-LEARNING ALGORITHM FOR BLOCK-SPARSE REPRESENTATIONS. Yongqin Zhang, Jiaying Liu, Mading Li, Zongming Guo
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) BSIK-SVD: A DICTIONARY-LEARNING ALGORITHM FOR BLOCK-SPARSE REPRESENTATIONS Yongqin Zhang, Jiaying Liu, Mading Li, Zongming
More informationDeep Generative Models Variational Autoencoders
Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative
More informationOptimization. Industrial AI Lab.
Optimization Industrial AI Lab. Optimization An important tool in 1) Engineering problem solving and 2) Decision science People optimize Nature optimizes 2 Optimization People optimize (source: http://nautil.us/blog/to-save-drowning-people-ask-yourself-what-would-light-do)
More information