Fast Ridge Regression with Randomized Principal Component Analysis and Gradient Descent


Yichao Lu and Dean P. Foster
Department of Statistics, Wharton, University of Pennsylvania, Philadelphia, PA
yichaolu@wharton.upenn.edu, foster@wharton.upenn.edu

Abstract

We propose a new two stage algorithm, LING, for large scale regression problems. LING has the same risk as the well known Ridge Regression (RR) under the fixed design setting and can be computed much faster. Our experiments show that LING performs well in terms of both prediction accuracy and computational efficiency compared with other large scale regression algorithms such as Gradient Descent, Stochastic Gradient Descent and Principal Component Regression, on both simulated and real datasets.

1 INTRODUCTION

Ridge Regression (RR) is one of the most widely applied penalized regression algorithms in machine learning. Suppose $X$ is the $n \times p$ design matrix and $Y$ is the $n \times 1$ response vector. Ridge regression solves the problem
$$\hat\beta = \arg\min_{\beta \in R^p} \|X\beta - Y\|^2 + n\lambda \|\beta\|^2, \qquad (1)$$
which has the explicit solution
$$\hat\beta = (X^\top X + n\lambda I)^{-1} X^\top Y. \qquad (2)$$
However, for modern problems with a huge design matrix $X$, computing (2) costs $O(np^2)$ FLOPS. When $p > n$ one can instead use the dual formulation of (1), which also has an explicit solution, as mentioned in (Lu et al., 2013; Saunders et al., 1998); its cost is $O(n^2 p)$ FLOPS. In summary, solving (1) exactly costs $O(np \min\{n, p\})$ FLOPS, which can be very slow.

There are faster ways to approximate (1) when computational cost is the concern. One can view RR as an optimization problem and use Gradient Descent (GD), which takes $O(np)$ FLOPS per iteration. However, the convergence speed of GD depends on the spectrum of $X$ and the magnitude of $\lambda$: when $X$ is ill conditioned and $\lambda$ is small, GD requires a huge number of iterations to converge, which makes it very slow. For huge datasets one can also apply stochastic gradient descent (SGD) (Zhang, 2004; Johnson and Zhang, 2013; Bottou, 2010), a powerful tool for solving large scale optimization problems. Another alternative for regression on huge datasets is Principal Component Regression (PCR), as mentioned in (Artemiou and Li, 2009; Jolliffe, 2005), which runs the regression only on the top $k$ principal components (PCs) of the $X$ matrix. The PCA of a huge $X$ can be computed efficiently by randomized algorithms such as (Halko et al., 2011a,b); the cost of computing the top $k$ PCs of $X$ is $O(npk)$ FLOPS. The connection between PCR and RR is well studied by (Dhillon et al., 2013). The problem with PCR is that when a large proportion of the signal sits on the bottom PCs, it has to regress on a lot of PCs, which makes it both slow and inaccurate (see later sections for detailed explanations).

In this paper we propose a two stage algorithm, LING, which is a faster way of computing the RR solution (2). (LING is the Chinese word for ridge.) A detailed description of the algorithm is given in Section 2. In Section 3 we prove that LING has the same risk as RR under the fixed design setting. In Section 4 we compare the performance of GD, SGD, PCR and LING in terms of prediction accuracy and computational efficiency on both simulated and real datasets.
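As a reference point for the cost comparison above, the exact RR solution can be computed either in the primal form (2) or in the dual form mentioned in (Lu et al., 2013; Saunders et al., 1998). The following is a minimal NumPy sketch of both; the function names, sizes and seed are choices of this sketch, not the paper's code.

```python
# Minimal sketch of exact ridge regression: primal form (2), O(n p^2), and the
# dual form, O(n^2 p), which is preferable when p > n.  Illustrative only.
import numpy as np

def ridge_primal(X, Y, lam):
    """beta = (X'X + n*lam*I)^{-1} X'Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def ridge_dual(X, Y, lam):
    """beta = X'(XX' + n*lam*I)^{-1} Y, the same estimator computed in the dual."""
    n, p = X.shape
    return X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)

rng = np.random.default_rng(0)
X, Y, lam = rng.standard_normal((50, 200)), rng.standard_normal(50), 0.1
assert np.allclose(ridge_primal(X, Y, lam), ridge_dual(X, Y, lam))
```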

2 THE ALGORITHM

2.1 DESCRIPTION OF THE ALGORITHM

LING is a two stage algorithm, and the intuition behind it is quite straightforward. We start from the observation that regressing $Y$ on $X$ (OLS) essentially projects $Y$ onto the span of $X$. Let $U_1$ denote the top $k_1$ PCs (left singular vectors) of $X$ and let $U_2$ denote the remaining PCs. The projection of $Y$ onto the span of $X$ can be decomposed into two orthogonal parts: the projection onto $U_1$ and the projection onto $U_2$. In the first stage we pick a $k_1 \ll p$, and the projection onto $U_1$ is computed directly as $\hat Y_1 = U_1 U_1^\top Y$, which is exactly the same as running a regression on the top $k_1$ PCs. For huge $X$, computing the top $k_1$ PCs exactly is very slow, so we compute $U_1$ with the faster randomized SVD algorithm proposed by (Halko et al., 2011a), described below as Algorithm 2. In the second stage we first compute $Y_r = Y - \hat Y_1$ and $X_r = X - U_1 U_1^\top X$, the residuals of $Y$ and $X$ after projecting onto $U_1$. We then compute the projection of $Y$ onto the span of $U_2$ by solving the optimization problem $\min_{\hat\gamma_2 \in R^p} \|X_r \hat\gamma_2 - Y_r\|^2$ with GD (Algorithm 3). Finally, since RR shrinks the projection of $Y$ onto $X$ (the OLS solution) via regularization, we shrink the projections in both stages accordingly: shrinkage in the first stage is applied directly to the estimated regression coefficients, and shrinkage in the second stage is applied by adding a regularization term to the optimization problem above. A detailed description of LING is given in Algorithm 1.

Algorithm 1 (LING)
Input: data matrix $X$, response $Y$; $U_1$, an orthonormal matrix whose columns are the top $k_1$ PCs of $X$; $d_1, d_2, \dots, d_{k_1}$, the top $k_1$ singular values of $X$; regularization parameter $\lambda$; an initial vector $\hat\gamma_{2,0}$; and the number of iterations $n_2$ for GD.
Output: the regression coefficients $\hat\gamma_{1,s}$ and $\hat\gamma_2$.
1. Regress $Y$ on $U_1$: let $\hat\gamma_1 = U_1^\top Y$.
2. Compute the residual of this regression: $Y_r = Y - U_1 \hat\gamma_1$.
3. Compute the residual of $X$ regressed on $U_1$: $X_r = X - U_1 U_1^\top X$.
4. Use gradient descent with optimal step size (Algorithm 3), started at $\hat\gamma_{2,0}$, to solve $\min_{\hat\gamma_2 \in R^p} \|X_r \hat\gamma_2 - Y_r\|^2 + n\lambda \|\hat\gamma_2\|^2$.
5. Compute a shrunken version of $\hat\gamma_1$ by $(\hat\gamma_{1,s})_i = \frac{d_i^2}{d_i^2 + n\lambda} (\hat\gamma_1)_i$.
6. The final estimator is $\hat Y = U_1 \hat\gamma_{1,s} + X_r \hat\gamma_2$.

Remark 1. LING can be regarded as a combination of PCR and GD. The first stage of LING is a very crude estimate of $Y$, and the second stage adds a correction to the first stage estimator. Since we do not need a very accurate estimator in the first stage, it suffices to pick a very small $k_1$, in contrast with the $k_2$ PCs needed by PCR. In the second stage, the design matrix $X_r$ is much better conditioned than the original $X$, since the directions with the largest singular values have been removed. As discussed in Section 2.2, Algorithm 3 converges much faster on a better conditioned matrix, so GD in the second stage of LING converges faster than GD applied directly to (1). These properties make LING both fast and accurate compared with GD and PCR.

Algorithm 2 (Random SVD)
Input: design matrix $X$, target dimension $k_1$, number of power iterations $i$.
Output: $U_1 \in R^{n \times k_1}$, the matrix of the top $k_1$ left singular vectors of $X$, and $d_1, d_2, \dots, d_{k_1}$, the top $k_1$ singular values of $X$.
1. Generate a random $p \times k_1$ matrix $R$ with i.i.d. standard Gaussian entries.
2. Estimate the span of the top $k_1$ left singular vectors of $X$ by $A = (XX^\top)^i X R$.
3. Use a QR decomposition to compute $Q$, an orthonormal basis of the column space of $A$.
4. Compute the SVD of the reduced matrix $Q^\top X = \tilde U \tilde D \tilde V^\top$.
5. $U_1 = Q\tilde U$ gives the top $k_1$ left singular vectors of $X$, and the diagonal elements of $\tilde D$ give the singular values.

Algorithm 3 (Gradient Descent with Optimal Step Size, GD)
Goal: solve the ridge problem $\min_{\hat\gamma \in R^p} \|X\hat\gamma - Y\|^2 + n\lambda \|\hat\gamma\|^2$.
Input: data matrix $X$, response $Y$, regularization parameter $\lambda$, number of iterations $n_1$, an initial vector $\hat\gamma_0$.
Output: $\hat\gamma_{n_1}$.
for $t = 0$ to $n_1 - 1$ do
  $Q = X^\top X + n\lambda I$
  $w_t = X^\top Y - Q\hat\gamma_t$
  $s_t = \frac{w_t^\top w_t}{w_t^\top Q w_t}$  ($s_t$ is the step size that decreases the target function the most in the direction $w_t$)
  $\hat\gamma_{t+1} = \hat\gamma_t + s_t w_t$
end for
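A minimal NumPy sketch of Algorithm 3 is given below. The helper name, the default zero initialization and the early-exit check are choices of this sketch rather than details specified by the paper; as in the cost analysis of Section 2.2, the matrix $Q$ is never formed explicitly.

```python
# Sketch of Algorithm 3: gradient descent with exact line search for
# min_gamma ||X gamma - Y||^2 + n*lam*||gamma||^2.  Illustrative only.
import numpy as np

def gd_optimal_step(X, Y, lam, n_iter, gamma0=None):
    n, p = X.shape
    gamma = np.zeros(p) if gamma0 is None else gamma0.astype(float).copy()
    XtY = X.T @ Y
    for _ in range(n_iter):
        # w = X'Y - (X'X + n*lam*I) gamma, using only matrix-vector products
        w = XtY - (X.T @ (X @ gamma) + n * lam * gamma)
        Qw = X.T @ (X @ w) + n * lam * w
        denom = w @ Qw
        if denom == 0.0:          # already at the minimizer
            break
        gamma = gamma + (w @ w) / denom * w
    return gamma
```

Each iteration costs two multiplications by $X$ and two by $X^\top$, i.e. $O(np)$ FLOPS, matching the per-iteration cost claimed for GD.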
More details on the computational cost are given in Section 2.2.

Remark 2. Algorithm 2 essentially approximates the subspace of the top left singular vectors by random projection. It provides a fast approximation of the top singular values and singular vectors of a large $X$ when computing the exact SVD is too slow. Theoretical guarantees and more detailed explanations can be found in (Halko et al., 2011a). Empirically we find in the experiments that Algorithm 2 may occasionally generate a bad subspace estimate due to randomness, which makes PCR perform badly. LING, on the other hand, is much more robust, since its second stage compensates for the signal missed in the first stage. In all the experiments we use the same small number of power iterations $i$.
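To make Remark 2 concrete, here is a minimal NumPy sketch of Algorithm 2; the default of one power iteration and the function name are assumptions of this sketch, not values prescribed by the paper.

```python
# Sketch of Algorithm 2: estimate the top-k1 left singular vectors of X via a
# Gaussian random projection followed by power iterations.  Illustrative only.
import numpy as np

def random_svd(X, k1, n_power=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    R = rng.standard_normal((p, k1))       # step 1: random Gaussian test matrix
    A = X @ R                              # step 2: A = (X X')^i X R
    for _ in range(n_power):
        A = X @ (X.T @ A)
    Q, _ = np.linalg.qr(A)                 # step 3: orthonormal basis of range(A)
    Uh, d, _ = np.linalg.svd(Q.T @ X, full_matrices=False)   # step 4: SVD of Q'X
    return Q @ Uh, d                       # step 5: approximate top-k1 PCs and singular values
```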

The shrinkage step (step 5) of Algorithm 1 is only necessary for theoretical purposes, since the goal is to approximate Ridge Regression, which shrinks the least squares estimator in all directions. In practice, shrinkage over the top $k_1$ PCs is not necessary. Usually the number of PCs selected ($k_1$) is very small, and from the bias–variance trade off perspective the variance reduction gained from shrinking over the top $k_1$ PCs is at most $O(k_1/n)$ under the fixed design setting (Dhillon et al., 2013), which is a tiny number. Moreover, since the top eigenvalues of $X^\top X$ are usually very large compared with $n\lambda$ in most real problems, the shrinkage factor $\frac{d_i^2}{d_i^2+n\lambda}$ is very close to 1 for the top PCs. We use shrinkage in Algorithm 1 because the risk of the shrunken version of LING is exactly the same as that of RR, as proved in Section 3.

Algorithm 1 can be simplified further if we skip the shrinkage step just discussed. Instead of computing the top $k_1$ PCs, the only thing we need is the subspace spanned by these PCs, since the first stage essentially projects $Y$ onto this subspace. In other words, we can replace $U_1$ in steps 1, 2 and 3 of Algorithm 1 with the $Q$ obtained in step 3 of Algorithm 2 and directly let $\hat Y = Q\hat\gamma_1 + X_r\hat\gamma_2$. In the experiments of Section 4 we use this simplified version (a sketch combining these pieces is given at the end of this section).

2.2 COMPUTATIONAL COST

We claim that the cost of LING is $O(np(k_1 + n_2))$ FLOPS, where $k_1$ is the number of PCs used in the first stage and $n_2$ is the number of GD iterations in the second stage. According to (Halko et al., 2011a), the dominating steps in Algorithm 2 are computing $(XX^\top)^i X R$ and $Q^\top X$, which cost $O(npk_1)$ FLOPS. Computing $\hat\gamma_1$ and $Y_r$ costs less than $O(npk_1)$, and computing $X_r$ costs $O(npk_1)$. So the computational cost before the GD step is $O(npk_1)$. For the GD stage, note that in every iteration $Q$ never needs to be constructed explicitly: while computing $w_t$ and $s_t$, always multiplying matrix by vector first gives a cost of $O(np)$ per iteration, so the cost of the GD stage is $O(n_2 np)$. Adding all pieces together, the cost of LING is $O(np(k_1 + n_2))$ FLOPS.

Let $n_1$ be the number of iterations required to solve (1) directly by GD and let $k_2$ be the number of PCs used by PCR. It is easy to check that the cost of GD is $O(n_1 np)$ FLOPS and the cost of PCR is $O(npk_2)$. As mentioned in Remark 1, the advantage of LING over GD and PCR is that $k_2$ and $n_1$ may have to be really large to achieve high accuracy, while much smaller values of the pair $(k_1, n_2)$ suffice for LING. In the remaining part of the paper we use "signal on certain PCs" to refer to the projection of $Y$ onto those principal components of $X$. Consider the case where the signal is widely spread among all PCs (i.e., the projection of $Y$ onto the bottom PCs of $X$ is not very small) instead of concentrating on the top ones: $k_2$ then needs to be large for PCR to perform well, since the signal on the bottom PCs is discarded by PCR. LING, however, does not need to capture all the signal in the first stage regression, since the signal left over is estimated in the second stage. Therefore LING is able to recover most of the signal even with a small $k_1$.
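Combining the pieces, the simplified LING variant used in the experiments (project onto the approximate top-$k_1$ PC subspace, then run Algorithm 3 on the residuals, with no shrinkage) might look as follows. This sketch reuses the random_svd and gd_optimal_step functions from the sketches above; it is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch of the simplified LING: stage 1 projects Y onto the (approximate)
# top-k1 PC subspace, stage 2 runs GD with optimal step size on the residuals.
# Reuses the random_svd and gd_optimal_step sketches above.
import numpy as np

def ling_fitted_values(X, Y, lam, k1, n2, rng=None):
    U1, _ = random_svd(X, k1, rng=rng)       # span(U1) equals span(Q) in Algorithm 2
    g1 = U1.T @ Y                            # stage 1: coefficients of Y on U1
    Y_r = Y - U1 @ g1                        # residual response
    X_r = X - U1 @ (U1.T @ X)                # residual design matrix
    g2 = gd_optimal_step(X_r, Y_r, lam, n2)  # stage 2: ridge GD on the residuals
    return U1 @ g1 + X_r @ g2                # Y_hat = U1*g1 + X_r*g2
```

Both stages are dominated by matrix products costing $O(npk_1)$ plus $O(np)$ per GD iteration, which matches the $O(np(k_1+n_2))$ total claimed above.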
To understand the connection between accuracy and the number of iterations in Algorithm 3, we state the following theorem from Epelman (2007).

Theorem 1. Let $g(z) = z^\top M z + q^\top z$ be a quadratic function, where $M$ is a PSD matrix, and suppose $g$ achieves its minimum at $z^*$. Apply Algorithm 3 to this minimization problem and let $z_t$ be the value of $z$ after $t$ iterations. Then the gap between $g(z_t)$ and the minimum $g(z^*)$ satisfies
$$\frac{g(z_{t+1}) - g(z^*)}{g(z_t) - g(z^*)} \le C = \left(\frac{A - a}{A + a}\right)^2, \qquad (3)$$
where $A$ and $a$ are the largest and smallest eigenvalues of the matrix $M$.

Theorem 1 shows that the suboptimality of the target function decays exponentially as the number of iterations increases, and that the speed of decay depends on the largest and smallest eigenvalues of the PSD matrix defining the quadratic objective. If we apply GD directly to solve (1), let $f_1(\beta) = \|X\beta - Y\|^2 + n\lambda\|\beta\|^2$ and assume $f_1$ reaches its minimum at $\hat\beta$. Let $\hat\beta_t$ be the coefficient after $t$ iterations and let $d_i$ denote the $i$-th singular value of $X$. Applying Theorem 1, we have
$$\frac{f_1(\hat\beta_{t+1}) - f_1(\hat\beta)}{f_1(\hat\beta_t) - f_1(\hat\beta)} \le C_1 = \left(\frac{d_1^2 - d_p^2}{d_1^2 + d_p^2 + 2n\lambda}\right)^2. \qquad (4)$$
Similarly, for the second stage of LING, let $f_2(\gamma_2) = \|X_r\gamma_2 - Y_r\|^2 + n\lambda\|\gamma_2\|^2$ and assume $f_2$ reaches its minimum at $\hat\gamma_2$. We have
$$\frac{f_2(\hat\gamma_{2,t+1}) - f_2(\hat\gamma_2)}{f_2(\hat\gamma_{2,t}) - f_2(\hat\gamma_2)} \le C_2 = \left(\frac{d_{k_1+1}^2}{d_{k_1+1}^2 + 2n\lambda}\right)^2. \qquad (5)$$
In most real problems the top few eigenvalues of $X^\top X$ are much larger than the others and than $n\lambda$, so the constant $C_1$ in (4) can be very close to 1, which makes GD converge very slowly. On the other hand, removing the top few PCs makes the constant $C_2$ in (5) significantly smaller than $C_1$. In other words, GD may take a lot of iterations to converge when solving (1) directly, while the second stage of LING takes far fewer iterations. This can also be seen in the experiments of Section 4.
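As a small numeric illustration of the gap between (4) and (5), the two constants can be computed directly from a spectrum; the spectrum, $\lambda$ and $k_1$ below are arbitrary made-up values, not taken from the paper.

```python
# Contrast the contraction constants (4) and (5) for a made-up steep spectrum.
import numpy as np

n, lam, k1 = 1000, 1e-2, 10
d = 100.0 * 0.7 ** np.arange(50)                 # arbitrary decaying singular values
nlam = n * lam

C1 = ((d[0]**2 - d[-1]**2) / (d[0]**2 + d[-1]**2 + 2 * nlam)) ** 2   # eq. (4)
C2 = (d[k1]**2 / (d[k1]**2 + 2 * nlam)) ** 2                          # eq. (5); d[k1] is d_{k1+1}
print(C1, C2)   # C1 is close to 1, C2 is far smaller, so stage-2 GD converges much faster
```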

3 THEOREMS

In this section we compute the risk of the LING estimator (defined below) under the fixed design setting. For simplicity, assume that the $U_1, D_1$ generated by Algorithm 2 give exactly the top $k_1$ left singular vectors and singular values of $X$, and that the GD in step 4 of Algorithm 1 converges to the optimal solution. Let $X = UDV^\top$ be the SVD of $X$, where $U = (U_1, U_2)$ and $V = (V_1, V_2)$. Here $U_1, V_1$ are the top $k_1$ singular vectors and $U_2, V_2$ are the bottom $p - k_1$ singular vectors. Let $D = \mathrm{diag}(D_1, D_2)$, where the $k_1 \times k_1$ block $D_1$ contains the top $k_1$ singular values $d_1 \ge d_2 \ge \dots \ge d_{k_1}$ and the $(p-k_1) \times (p-k_1)$ block $D_2$ contains the bottom $p - k_1$ singular values. Let $D_3 = \mathrm{diag}(0, D_2)$ (replace $D_1$ in $D$ by a zero matrix of the same size).

3.1 THE FIXED DESIGN MODEL

Assume $X, Y$ come from the fixed design model $Y = X\beta + \epsilon$, where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. noise terms with mean 0 and variance $\sigma^2$. Here $X$ is fixed and the randomness of $Y$ comes only from $\epsilon$. Since $X = U_1 D_1 V_1^\top + X_r$, the fixed design model can also be written as
$$Y = (U_1 D_1 V_1^\top + X_r)\beta + \epsilon = U_1\gamma_1 + X_r\gamma_2 + \epsilon,$$
where $\gamma_1 = D_1 V_1^\top \beta$ and $\gamma_2 = \beta$. We use the squared $\ell_2$ distance between $E(Y\mid X)$ (the best possible prediction given $X$ under $\ell_2$ loss) and $\hat Y = U_1\hat\gamma_{1,s} + X_r\hat\gamma_2$ (the prediction of LING) as the loss function, which we call the risk in the following discussion. Note that $E(Y\mid X) = X\beta$ is linear in $X$ under the fixed design model. The risk of LING can be written as
$$\frac{1}{n} E\big\|E(Y\mid X) - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2 = \frac{1}{n} E\big\|U_1\gamma_1 + X_r\gamma_2 - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2.$$
Because $U_1^\top X_r = 0$, the risk further decomposes into two terms:
$$\frac{1}{n} E\big\|U_1\gamma_1 + X_r\gamma_2 - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2 = \frac{1}{n} E\big\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\big\|^2 + \frac{1}{n} E\big\|X_r\gamma_2 - X_r\hat\gamma_2\big\|^2. \qquad (6)$$
Here the expectation is taken with respect to $\epsilon$. We calculate the two terms in (6) separately.

Lemma 1. With $\gamma_{1,j}$ denoting the $j$-th element of $\gamma_1$,
$$\frac{1}{n} E\big\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\big\|^2 = \frac{1}{n} \sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2}. \qquad (7)$$

Proof. Let $S$ be the $k_1 \times k_1$ diagonal matrix with $S_{j,j} = \frac{d_j^2}{d_j^2 + n\lambda}$. Then $\hat\gamma_{1,s} = S U_1^\top Y = S\gamma_1 + S U_1^\top \epsilon$ and $E(\hat\gamma_{1,s}) = S\gamma_1$. Therefore
$$\frac{1}{n} E\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\|^2 = \frac{1}{n} E\|U_1 E(\hat\gamma_{1,s}) - U_1\hat\gamma_{1,s}\|^2 + \frac{1}{n}\|U_1\gamma_1 - U_1 E(\hat\gamma_{1,s})\|^2$$
$$= \frac{1}{n} E\|U_1 S U_1^\top\epsilon\|^2 + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2 = \frac{1}{n} E\,\mathrm{Tr}(U_1 S^2 U_1^\top \epsilon\epsilon^\top) + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2$$
$$= \frac{1}{n}\mathrm{Tr}(S^2)\sigma^2 + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2 = \frac{1}{n}\sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2}.$$
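Lemma 1 is easy to check numerically. The following Monte Carlo sketch compares the empirical left-hand side of (7), averaged over noise draws, with the right-hand side, on a small made-up fixed design; all sizes and parameter values here are arbitrary.

```python
# Monte Carlo sanity check of Lemma 1 (the stage-1 term of the risk).
import numpy as np

rng = np.random.default_rng(0)
n, p, k1, lam, sigma = 200, 30, 5, 0.1, 1.0

X = rng.standard_normal((n, p))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
U1, d1 = U[:, :k1], d[:k1]
beta = rng.uniform(-0.5, 0.5, p)
gamma1 = d1 * (Vt[:k1] @ beta)                     # gamma_1 = D_1 V_1' beta
S = d1**2 / (d1**2 + n * lam)                      # shrinkage factors S_jj

# right-hand side of (7)
theory = np.sum((d1**4 * sigma**2 + (n * lam)**2 * gamma1**2)
                / (d1**2 + n * lam)**2) / n

# left-hand side of (7), averaged over noise draws
Xbeta, reps, acc = X @ beta, 5000, 0.0
for _ in range(reps):
    Y = Xbeta + sigma * rng.standard_normal(n)
    g1s = S * (U1.T @ Y)                           # shrunken stage-1 coefficients
    acc += np.sum((gamma1 - g1s) ** 2) / n         # ||U1(gamma1 - g1s)||^2 = ||gamma1 - g1s||^2
print(theory, acc / reps)                          # the two numbers should be close
```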

Now consider the second term in (6). Note that $X_r = U D_3 V^\top$. The residual $Y_r$ after the first stage can be written as $Y_r = Y - U_1\hat\gamma_1 = (I - U_1 U_1^\top)Y = X_r\gamma_2 + (I - U_1 U_1^\top)\epsilon$, and the optimal coefficient obtained in the second stage is $\hat\gamma_2 = (X_r^\top X_r + n\lambda I)^{-1} X_r^\top Y_r$. For simplicity, let $\epsilon_2 = (I - U_1 U_1^\top)\epsilon$.

Lemma 2. With $\alpha_i$ denoting the $i$-th element of $\alpha = V^\top\gamma_2$,
$$E\big\|X_r\gamma_2 - X_r\hat\gamma_2\big\|^2 = \sum_{i=k_1+1}^{p} \frac{d_i^4\sigma^2 + n^2\lambda^2 d_i^2\alpha_i^2}{(d_i^2 + n\lambda)^2}. \qquad (8)$$

Proof. First define $X_\lambda = X_r^\top X_r + n\lambda I$ and $D_\lambda = D_3^2 + n\lambda I$. Then
$$E\|X_r\gamma_2 - X_r\hat\gamma_2\|^2 = \|X_r\gamma_2 - X_r E(\hat\gamma_2)\|^2 + E\|X_r E(\hat\gamma_2) - X_r\hat\gamma_2\|^2, \qquad (9),\ (10)$$
and we consider the two terms (9) and (10) on the right-hand side separately. For the first term,
$$(9) = \|X_r X_\lambda^{-1} X_r^\top X_r\gamma_2 - X_r\gamma_2\|^2 = \|U D_3 D_\lambda^{-1} D_3^2 V^\top\gamma_2 - U D_3 V^\top\gamma_2\|^2 = \|D_3 D_\lambda^{-1} D_3^2\alpha - D_3\alpha\|^2 = \sum_{i=k_1+1}^{p} \frac{\alpha_i^2 d_i^2\, n^2\lambda^2}{(d_i^2 + n\lambda)^2}.$$
For the second term,
$$(10) = E_{\epsilon_2}\|X_r X_\lambda^{-1} X_r^\top\epsilon_2\|^2 = E_{\epsilon_2}\,\mathrm{Tr}\!\left(X_r X_\lambda^{-1} X_r^\top X_r X_\lambda^{-1} X_r^\top \epsilon_2\epsilon_2^\top\right) = \mathrm{Tr}\!\left(D_3 D_\lambda^{-1} D_3^2 D_\lambda^{-1} D_3\, E[U^\top\epsilon_2\epsilon_2^\top U]\right).$$
Since $E[U^\top\epsilon_2\epsilon_2^\top U] = \mathrm{diag}(0, I_{p-k_1})\sigma^2$ (the identity matrix with its top $k_1 \times k_1$ block replaced by zero),
$$(10) = \sum_{i=k_1+1}^{p} \frac{d_i^4}{(d_i^2 + n\lambda)^2}\sigma^2. \qquad (11)$$
Adding the two terms finishes the proof.

Plugging (7) and (8) into (6), we obtain:

Theorem 2. The risk of the LING algorithm under the fixed design setting is
$$\frac{1}{n}\sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2} + \frac{1}{n}\sum_{i=k_1+1}^{p} \frac{d_i^4\sigma^2 + n^2\lambda^2 d_i^2\alpha_i^2}{(d_i^2 + n\lambda)^2}. \qquad (12)$$

Remark 3. This risk is the same as the risk of ridge regression given by Lemma 1 in (Dhillon et al., 2013). In fact, LING gives exactly the same prediction as RR on the training dataset. This is intuitive, since on the training set LING essentially decomposes the RR solution into a first stage shrinkage predictor on the top $k_1$ PCs and a second stage predictor over the residual space, as explained in Section 2.

4 EXPERIMENTS

In this section we compare the accuracy and computational cost (measured in FLOPS) of three different algorithms for solving Ridge Regression: Gradient Descent with optimal step size (GD), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013), and LING. Here SVRG is an improved version of stochastic gradient descent which achieves exponential convergence with a constant step size. We also consider Principal Component Regression (PCR) (Artemiou and Li, 2009; Jolliffe, 2005), another common way of running large scale regression. Experiments are performed on three simulated models and two real datasets. In general, LING performs well on all three simulated datasets, while GD, SVRG and PCR each fail in some cases. On the two real datasets all algorithms give reasonable performance, with GD and LING the best. Moreover, both stages of LING require only a moderate number of matrix multiplications, each costing $O(np)$, which is much faster to run in MATLAB than SVRG, which contains a lot of loops.

Table 1: Parameter setup for the simulated data: the grids of values of $k_2$ (PCR), $n_1$ (GD), $n_3$ (SVRG) and $(k_1, n_2)$ (LING) used for Models 1, 2 and 3.

4.1 SIMULATED DATA

Three datasets are constructed from the fixed design model $Y = X\beta + \epsilon$. In the three experiments $X$ and $\beta$ are generated randomly in different ways (more details in the following sections) and i.i.d. Gaussian noise is added to $X\beta$ to obtain $Y$. Then GD, SVRG, PCR and LING are run on each dataset. For GD we try different numbers of iterations $n_1$. For SVRG we vary the number of passes through the data, denoted $n_3$; the number of iterations SVRG takes equals the number of passes times the sample size, and each iteration takes $O(p)$ FLOPS. The step size of SVRG is chosen by cross validation, but this cost is not counted in the total computational cost. Note that one advantage of GD and LING is that, due to the simple quadratic form of the target function, their step sizes can be computed directly from the data without cross validation, which would introduce extra cost. For PCR we pick different numbers of PCs ($k_2$). For LING we pick the top $k_1$ PCs in the first stage and try different numbers of iterations $n_2$ in the second stage. The computational cost and the risk of the four algorithms are computed, and the whole procedure is repeated over random generations of $X$, $\beta$ and $Y$. The risk and computational cost of the exact RR solution (2) for every dataset are also computed as a benchmark. The parameter setup for the three datasets is listed in Table 1.
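The simulated designs below all follow the same recipe: fix a spectrum, build $X = U_e D_e V_e^\top$ with random orthonormal factors, draw $\beta$ uniformly and add Gaussian noise. A generic sketch follows; the sizes, noise level and the flat spectrum chosen here are placeholders, not the paper's values.

```python
# Generic sketch of the simulated-data construction used in Models 1-3.
import numpy as np

def make_design(n, p, spectrum, rng):
    """X = U_e diag(spectrum) V_e' with random orthonormal U_e, V_e."""
    Ue, _ = np.linalg.qr(rng.standard_normal((n, p)))
    Ve, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return Ue @ np.diag(spectrum) @ Ve.T

rng = np.random.default_rng(0)
n, p, sigma = 500, 100, 1.0
spectrum = np.sort(rng.uniform(0.0, 1.0, p))[::-1]   # e.g. a flat spectrum (Model 2 style)
X = make_design(n, p, spectrum, rng)
beta = rng.uniform(-0.5, 0.5, p)
Y = X @ beta + sigma * rng.standard_normal(n)
```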
4.1.1 MODEL 1

In this model the design matrix $X$ has a steep spectrum: the top singular values of $X$ decay exponentially, so they are much larger than the rest; the spectrum is shown in Figure 4. To generate $X$, we fix a diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e V_e^\top$, where $U_e, V_e$ are two random orthonormal matrices. The elements of $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$. Under this setup most of the energy of the $X$ matrix lies in the top PCs, since the top singular values are much larger than the remaining ones, so PCR works well. But, as indicated by (4), the convergence of GD is very slow.

The computational cost and average risk of the four algorithms, together with the exact RR solution (2), averaged over the repetitions, are shown in Figure 1. Both PCR and LING work well, achieving risk close to RR at less computational cost. SVRG is worse than PCR and LING but much better than GD.

Figure 1: Model 1, risk vs. computational cost (FLOPS, log scale). PCR and LING approach the RR risk very fast. SVRG also approaches the RR risk but costs more than the previous two. GD is very slow and inaccurate.

Figure 4: Top singular values of X in Model 1.

4.1.2 MODEL 2

In this model the design matrix $X$ has a flat spectrum: the singular values of $X$ are sampled uniformly from a fixed interval, and the spectrum is shown in Figure 5. To generate $X$, we fix the diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e V_e^\top$, where $U_e, V_e$ are two random orthonormal matrices. The elements of $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$. Under this setup the signal is spread widely among all PCs, since the spectrum of $X$ is relatively flat. PCR breaks down because it fails to catch the signal on the bottom PCs, while, as indicated by (4), GD converges relatively fast due to the flat spectrum of $X$. The computational cost and average risk of the four algorithms, together with the RR solution (2), averaged over the repetitions, are shown in Figure 2. GD works best, approaching the risk of RR at the lowest computational cost. LING and SVRG also work, achieving reasonably low risk at less computational cost. PCR works poorly, as explained before.

Figure 2: Model 2, risk vs. computational cost (FLOPS, log scale). GD approaches the RR risk very fast. SVRG and LING are slower than GD but still achieve risk close to RR at less cost. PCR is slow and has a huge risk.

Figure 5: Singular values of X in Model 2.

4.1.3 MODEL 3

This model presents a special case where both PCR and GD break down. The singular values of $X$ are constructed by first sampling uniformly from a fixed interval and then multiplying the top few sampled values by a large constant; the resulting top singular values are shown in Figure 6. To generate $X$, we fix the diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e$, where $U_e$ is a random orthonormal matrix. A block of leading elements and a block of trailing elements of the coefficient vector $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$, and the other elements of $\beta$ remain 0. In this setup $X$ has orthogonal columns, which are its PCs, and the signal lies only on the top and bottom PCs. PCR will not work, since a large proportion of the signal lies on the bottom PCs. On the other hand, GD will not work either, since the top few singular values are too large compared with the others, which makes GD converge very slowly. The computational cost and risk of the four algorithms, together with the RR solution (2), averaged over the repetitions, are shown in Figure 3. LING works best in this setup. SVRG is slightly worse than LING but still approaches RR at less cost. GD converges slowly and PCR is completely off target, as explained before.

Figure 3: Model 3, risk vs. computational cost (FLOPS, log scale). LING approaches the RR risk the fastest. SVRG is slightly slower than LING. GD also approaches the RR risk but costs more than LING. PCR has a huge risk no matter how many PCs are selected.

Figure 6: Top singular values of X in Model 3.

4.2 REAL DATA

In this section we compare GD, SVRG, PCR and LING with the exact RR solution (2) on two real datasets.

4.2.1 GISETTE DATASET

The first is the Gisette dataset (Guyon et al., 2004) from the UCI repository, a binary classification task. Every row of the $6000 \times 5000$ design matrix $X$ consists of pixel features of a single digit "4" or "9", and $Y$ gives the class label. Among the 6000 samples we use 5000 for training and 1000 for testing. The classification error rate of the exact RR solution (2) serves as the benchmark.
Since the goal is to compare different algorithms for regression, we do not aim for state of the art accuracy on this dataset, as long as regression works reasonably well.

When running PCR we pick a grid of values for the number of PCs $k_2$; for GD we try a grid of iteration counts $n_1$; for SVRG we try a grid of numbers of passes $n_3$ through the data; and for LING we pick two small values of $k_1$ for the first stage and try a grid of iteration counts $n_2$ for the second stage. The computational cost and average classification error of the four algorithms, and of the exact RR solution (2), on the test set over 6 different train/test splits are shown in Figure 7. The top singular values of $X$ are shown in Figure 9. GD gets close to the RR error very fast. The two LING curves (one for each choice of $k_1$) are slower than GD at first, since some initial FLOPS are spent computing the top PCs, but after that they approach the RR error very fast. SVRG also converges to RR but costs more than the previous two algorithms. PCR performs worst in terms of both error rate and computational cost.

Figure 7: Gisette, error rate vs. computational cost (FLOPS, log scale). GD achieves a small error rate fastest. The two LING curves spend some FLOPS computing the top PCs first, but then converge very fast to a low error rate. SVRG and PCR also provide reasonably small error rates faster than the exact RR solution, but are suboptimal compared with GD and LING.

4.2.2 BUZZ IN SOCIAL MEDIA

The second dataset is the UCI Buzz in Social Media dataset, a regression task. The goal is to predict the popularity (measured by the mean number of active discussions) of a topic on Twitter over a period. The original features are statistics about the topic over that period, such as the number of discussions created and the number of new authors interacting with the topic. The original feature dimension is 77; we add quadratic interactions to make it 3080 (sketched after this subsection). To save time we use only a subset of the samples, split into a training set and a test set, and we use the MSE on the test set as the error measure. For PCR we pick a grid of values of $k_2$, for GD a grid of iteration counts $n_1$, for SVRG a grid of numbers of passes $n_3$, and for LING two small values of $k_1$ in the first stage and a grid of iteration counts $n_2$ in the second stage. The computational cost and average test MSE over 5 different train/test splits are shown in Figure 8, and the top singular values of $X$ are shown in Figure 10. GD approaches the MSE of RR very fast. LING spends some initial FLOPS computing the top PCs but converges fast after that. SVRG and PCR also achieve reasonable performance but are suboptimal compared with GD and LING. The MSE of PCR first decreases as more PCs are added to the regression but eventually goes up due to overfitting.
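The quadratic interaction expansion of the 77 Buzz features can be sketched as below. Including the squares and keeping the original columns (which gives 77 + 77·78/2 = 3080 columns) is an assumption consistent with the stated dimension, not an explicit specification of the paper.

```python
# Sketch of a quadratic-interaction expansion: append all pairwise products
# (including squares) to the original feature columns.  Illustrative only.
import numpy as np

def add_quadratic_interactions(F):
    n, q = F.shape
    i, j = np.triu_indices(q)                  # all index pairs (i, j) with i <= j
    return np.hstack([F, F[:, i] * F[:, j]])   # original features + products

F = np.random.default_rng(0).standard_normal((5, 3))
print(add_quadratic_interactions(F).shape)     # (5, 3 + 3*4/2) = (5, 9)
```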

Figure 8: Buzz, MSE vs. computational cost (FLOPS, log scale). GD and the two LING curves achieve a small MSE fast. SVRG is slower than LING and GD. PCR reaches its smallest MSE at a moderate number of PCs and then overfits.

Figure 9: Top singular values of X in the Gisette dataset.

Figure 10: Top singular values of X in the social media Buzz dataset.

5 SUMMARY

In this paper we present LING, a two stage algorithm for computing large scale Ridge Regression which is both fast and robust, in contrast to the well known approaches GD and PCR. We show that under the fixed design setting LING has the same risk as Ridge Regression, assuming convergence. In the experiments, LING achieves good performance on all datasets when compared with three other large scale regression algorithms. We conjecture that the same strategy can also be used to accelerate the convergence of stochastic gradient descent when solving Ridge Regression, since the first stage of LING essentially removes the high variance directions of $X$, which leads to variance reduction for the random gradient directions generated by SGD.

References

Andreas Artemiou and Bing Li. On principal components and regression: a statistical explanation of a natural phenomenon. Statistica Sinica, 19(4), 2009.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), Paris, France, August 2010. Springer.

Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression. Journal of Machine Learning Research, 14:1505-1511, 2013.

Marina A. Epelman. Rate of convergence of steepest descent algorithm. Lecture notes, 2007.

Isabelle Guyon, Asa Ben Hur, Steve Gunn, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17. MIT Press, 2004.

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, May 2011a.

Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580-2594, 2011b.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.

Ian Jolliffe. Principal Component Analysis. Encyclopedia of Statistics in Behavioral Science. John Wiley & Sons, 2005.

Yichao Lu, Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems (NIPS), 2013.

G. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1998.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. Omnipress, 2004.


REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER TODAY Advances in model estimation methodology Application to data that comes in rectangles Examples ESTIMATION Problems with MLE

More information

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Slide credit: http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

Practical Least-Squares for Computer Graphics

Practical Least-Squares for Computer Graphics Outline Least Squares with Generalized Errors Robust Least Squares Constrained Least Squares Practical Least-Squares for Computer Graphics Outline Least Squares with Generalized Errors Robust Least Squares

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course

More information

Topics in Machine Learning-EE 5359 Model Assessment and Selection

Topics in Machine Learning-EE 5359 Model Assessment and Selection Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

CSC 2515 Introduction to Machine Learning Assignment 2

CSC 2515 Introduction to Machine Learning Assignment 2 CSC 2515 Introduction to Machine Learning Assignment 2 Zhongtian Qiu(1002274530) Problem 1 See attached scan files for question 1. 2. Neural Network 2.1 Examine the statistics and plots of training error

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 3 Due Tuesday, October 22, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information