Fast Ridge Regression with Randomized Principal Component Analysis and Gradient Descent


Yichao Lu and Dean P. Foster
Department of Statistics, Wharton, University of Pennsylvania, Philadelphia, PA
yichaolu@wharton.upenn.edu, foster@wharton.upenn.edu

Abstract

We propose a new two stage algorithm, LING, for large scale regression problems. LING has the same risk as the well known Ridge Regression (RR) under the fixed design setting and can be computed much faster. Our experiments show that LING performs well in terms of both prediction accuracy and computational efficiency compared with other large scale regression algorithms such as Gradient Descent, Stochastic Gradient Descent and Principal Component Regression, on both simulated and real datasets.

1 INTRODUCTION

Ridge Regression (RR) is one of the most widely applied penalized regression algorithms in machine learning. Suppose $X$ is the $n \times p$ design matrix and $Y$ is the $n \times 1$ response vector. Ridge regression solves the problem
$$\hat\beta = \arg\min_{\beta \in R^p} \|X\beta - Y\|^2 + n\lambda \|\beta\|^2, \qquad (1)$$
which has the explicit solution
$$\hat\beta = (X^\top X + n\lambda I)^{-1} X^\top Y. \qquad (2)$$
However, for modern problems with a huge design matrix $X$, computing (2) costs $O(np^2)$ FLOPS. When $p > n$ one can instead use the dual formulation of (1), which also has an explicit solution, as mentioned in (Lu et al., 2013; Saunders et al., 1998); its cost is $O(n^2 p)$ FLOPS. In summary, solving (1) exactly costs $O(np \min\{n, p\})$ FLOPS, which can be very slow.

There are faster ways to approximate (1) when computational cost is the concern. One can view RR as an optimization problem and use Gradient Descent (GD), which takes $O(np)$ FLOPS per iteration. However, the convergence speed of GD depends on the spectrum of $X$ and the magnitude of $\lambda$: when $X$ is ill conditioned and $\lambda$ is small, GD requires a huge number of iterations to converge, which makes it very slow. For huge datasets one can also apply stochastic gradient descent (SGD) (Zhang, 2004; Johnson and Zhang, 2013; Bottou, 2010), a powerful tool for solving large scale optimization problems. Another alternative for regression on huge datasets is Principal Component Regression (PCR), as mentioned in (Artemiou and Li, 2009; Jolliffe, 2005), which runs the regression only on the top $k$ principal components (PCs) of the $X$ matrix. The PCA of a huge $X$ can be computed efficiently by randomized algorithms such as (Halko et al., 2011a,b); the cost of computing the top $k$ PCs of $X$ is $O(npk)$ FLOPS. The connection between PCR and RR is well studied by (Dhillon et al., 2013). The problem with PCR is that when a large proportion of the signal sits on the bottom PCs, it has to regress on a lot of PCs, which makes it both slow and inaccurate (see later sections for detailed explanations).

In this paper we propose a two stage algorithm, LING, which is a faster way of computing the RR solution (2). (LING is the Chinese word for ridge.) A detailed description of the algorithm is given in Section 2. In Section 3 we prove that LING has the same risk as RR under the fixed design setting. In Section 4 we compare the performance of GD, SGD, PCR and LING in terms of prediction accuracy and computational efficiency on both simulated and real datasets.
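As a reference point for the cost comparison above, the exact RR solution can be computed either in the primal form (2) or in the dual form mentioned in (Lu et al., 2013; Saunders et al., 1998). The following is a minimal NumPy sketch of both; the function names, sizes and seed are choices of this sketch, not the paper's code.

```python
# Minimal sketch of exact ridge regression: primal form (2), O(n p^2), and the
# dual form, O(n^2 p), which is preferable when p > n.  Illustrative only.
import numpy as np

def ridge_primal(X, Y, lam):
    """beta = (X'X + n*lam*I)^{-1} X'Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def ridge_dual(X, Y, lam):
    """beta = X'(XX' + n*lam*I)^{-1} Y, the same estimator computed in the dual."""
    n, p = X.shape
    return X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)

rng = np.random.default_rng(0)
X, Y, lam = rng.standard_normal((50, 200)), rng.standard_normal(50), 0.1
assert np.allclose(ridge_primal(X, Y, lam), ridge_dual(X, Y, lam))
```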

2 THE ALGORITHM

2.1 DESCRIPTION OF THE ALGORITHM

LING is a two stage algorithm, and the intuition behind it is quite straightforward. We start from the observation that regressing $Y$ on $X$ (OLS) essentially projects $Y$ onto the span of $X$. Let $U_1$ denote the top $k_1$ PCs (left singular vectors) of $X$ and let $U_2$ denote the remaining PCs. The projection of $Y$ onto the span of $X$ can be decomposed into two orthogonal parts: the projection onto $U_1$ and the projection onto $U_2$. In the first stage we pick a $k_1 \ll p$, and the projection onto $U_1$ is computed directly as $\hat Y_1 = U_1 U_1^\top Y$, which is exactly the same as running a regression on the top $k_1$ PCs. For huge $X$, computing the top $k_1$ PCs exactly is very slow, so we compute $U_1$ with the faster randomized SVD algorithm proposed by (Halko et al., 2011a), described below as Algorithm 2. In the second stage we first compute $Y_r = Y - \hat Y_1$ and $X_r = X - U_1 U_1^\top X$, the residuals of $Y$ and $X$ after projecting onto $U_1$. We then compute the projection of $Y$ onto the span of $U_2$ by solving the optimization problem $\min_{\hat\gamma_2 \in R^p} \|X_r \hat\gamma_2 - Y_r\|^2$ with GD (Algorithm 3). Finally, since RR shrinks the projection of $Y$ onto $X$ (the OLS solution) via regularization, we shrink the projections in both stages accordingly: shrinkage in the first stage is applied directly to the estimated regression coefficients, and shrinkage in the second stage is applied by adding a regularization term to the optimization problem above. A detailed description of LING is given in Algorithm 1.

Algorithm 1 (LING)
Input: data matrix $X$, response $Y$; $U_1$, an orthonormal matrix whose columns are the top $k_1$ PCs of $X$; $d_1, d_2, \dots, d_{k_1}$, the top $k_1$ singular values of $X$; regularization parameter $\lambda$; an initial vector $\hat\gamma_{2,0}$; and the number of iterations $n_2$ for GD.
Output: the regression coefficients $\hat\gamma_{1,s}$ and $\hat\gamma_2$.
1. Regress $Y$ on $U_1$: let $\hat\gamma_1 = U_1^\top Y$.
2. Compute the residual of this regression: $Y_r = Y - U_1 \hat\gamma_1$.
3. Compute the residual of $X$ regressed on $U_1$: $X_r = X - U_1 U_1^\top X$.
4. Use gradient descent with optimal step size (Algorithm 3), started at $\hat\gamma_{2,0}$, to solve $\min_{\hat\gamma_2 \in R^p} \|X_r \hat\gamma_2 - Y_r\|^2 + n\lambda \|\hat\gamma_2\|^2$.
5. Compute a shrunken version of $\hat\gamma_1$ by $(\hat\gamma_{1,s})_i = \frac{d_i^2}{d_i^2 + n\lambda} (\hat\gamma_1)_i$.
6. The final estimator is $\hat Y = U_1 \hat\gamma_{1,s} + X_r \hat\gamma_2$.

Remark 1. LING can be regarded as a combination of PCR and GD. The first stage of LING is a very crude estimate of $Y$, and the second stage adds a correction to the first stage estimator. Since we do not need a very accurate estimator in the first stage, it suffices to pick a very small $k_1$, in contrast with the $k_2$ PCs needed by PCR. In the second stage, the design matrix $X_r$ is much better conditioned than the original $X$, since the directions with the largest singular values have been removed. As discussed in Section 2.2, Algorithm 3 converges much faster on a better conditioned matrix, so GD in the second stage of LING converges faster than GD applied directly to (1). These properties make LING both fast and accurate compared with GD and PCR.

Algorithm 2 (Random SVD)
Input: design matrix $X$, target dimension $k_1$, number of power iterations $i$.
Output: $U_1 \in R^{n \times k_1}$, the matrix of the top $k_1$ left singular vectors of $X$, and $d_1, d_2, \dots, d_{k_1}$, the top $k_1$ singular values of $X$.
1. Generate a random $p \times k_1$ matrix $R$ with i.i.d. standard Gaussian entries.
2. Estimate the span of the top $k_1$ left singular vectors of $X$ by $A = (XX^\top)^i X R$.
3. Use a QR decomposition to compute $Q$, an orthonormal basis of the column space of $A$.
4. Compute the SVD of the reduced matrix $Q^\top X = \tilde U \tilde D \tilde V^\top$.
5. $U_1 = Q\tilde U$ gives the top $k_1$ left singular vectors of $X$, and the diagonal elements of $\tilde D$ give the singular values.

Algorithm 3 (Gradient Descent with Optimal Step Size, GD)
Goal: solve the ridge problem $\min_{\hat\gamma \in R^p} \|X\hat\gamma - Y\|^2 + n\lambda \|\hat\gamma\|^2$.
Input: data matrix $X$, response $Y$, regularization parameter $\lambda$, number of iterations $n_1$, an initial vector $\hat\gamma_0$.
Output: $\hat\gamma_{n_1}$.
for $t = 0$ to $n_1 - 1$ do
  $Q = X^\top X + n\lambda I$
  $w_t = X^\top Y - Q\hat\gamma_t$
  $s_t = \frac{w_t^\top w_t}{w_t^\top Q w_t}$  ($s_t$ is the step size that decreases the target function the most in the direction $w_t$)
  $\hat\gamma_{t+1} = \hat\gamma_t + s_t w_t$
end for
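A minimal NumPy sketch of Algorithm 3 is given below. The helper name, the default zero initialization and the early-exit check are choices of this sketch rather than details specified by the paper; as in the cost analysis of Section 2.2, the matrix $Q$ is never formed explicitly.

```python
# Sketch of Algorithm 3: gradient descent with exact line search for
# min_gamma ||X gamma - Y||^2 + n*lam*||gamma||^2.  Illustrative only.
import numpy as np

def gd_optimal_step(X, Y, lam, n_iter, gamma0=None):
    n, p = X.shape
    gamma = np.zeros(p) if gamma0 is None else gamma0.astype(float).copy()
    XtY = X.T @ Y
    for _ in range(n_iter):
        # w = X'Y - (X'X + n*lam*I) gamma, using only matrix-vector products
        w = XtY - (X.T @ (X @ gamma) + n * lam * gamma)
        Qw = X.T @ (X @ w) + n * lam * w
        denom = w @ Qw
        if denom == 0.0:          # already at the minimizer
            break
        gamma = gamma + (w @ w) / denom * w
    return gamma
```

Each iteration costs two multiplications by $X$ and two by $X^\top$, i.e. $O(np)$ FLOPS, matching the per-iteration cost claimed for GD.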
More details on the computational cost are given in Section 2.2.

Remark 2. Algorithm 2 essentially approximates the subspace of the top left singular vectors by random projection. It provides a fast approximation of the top singular values and singular vectors of a large $X$ when computing the exact SVD is too slow. Theoretical guarantees and more detailed explanations can be found in (Halko et al., 2011a). Empirically we find in the experiments that Algorithm 2 may occasionally generate a bad subspace estimate due to randomness, which makes PCR perform badly. LING, on the other hand, is much more robust, since its second stage compensates for the signal missed in the first stage. In all the experiments we use the same small number of power iterations $i$.
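To make Remark 2 concrete, here is a minimal NumPy sketch of Algorithm 2; the default of one power iteration and the function name are assumptions of this sketch, not values prescribed by the paper.

```python
# Sketch of Algorithm 2: estimate the top-k1 left singular vectors of X via a
# Gaussian random projection followed by power iterations.  Illustrative only.
import numpy as np

def random_svd(X, k1, n_power=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    R = rng.standard_normal((p, k1))       # step 1: random Gaussian test matrix
    A = X @ R                              # step 2: A = (X X')^i X R
    for _ in range(n_power):
        A = X @ (X.T @ A)
    Q, _ = np.linalg.qr(A)                 # step 3: orthonormal basis of range(A)
    Uh, d, _ = np.linalg.svd(Q.T @ X, full_matrices=False)   # step 4: SVD of Q'X
    return Q @ Uh, d                       # step 5: approximate top-k1 PCs and singular values
```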

The shrinkage step (step 5) of Algorithm 1 is only necessary for theoretical purposes, since the goal is to approximate Ridge Regression, which shrinks the least squares estimator in all directions. In practice, shrinkage over the top $k_1$ PCs is not necessary. Usually the number of PCs selected ($k_1$) is very small, and from the bias–variance trade off perspective the variance reduction gained from shrinking over the top $k_1$ PCs is at most $O(k_1/n)$ under the fixed design setting (Dhillon et al., 2013), which is a tiny number. Moreover, since the top eigenvalues of $X^\top X$ are usually very large compared with $n\lambda$ in most real problems, the shrinkage factor $\frac{d_i^2}{d_i^2+n\lambda}$ is very close to 1 for the top PCs. We use shrinkage in Algorithm 1 because the risk of the shrunken version of LING is exactly the same as that of RR, as proved in Section 3.

Algorithm 1 can be simplified further if we skip the shrinkage step just discussed. Instead of computing the top $k_1$ PCs, the only thing we need is the subspace spanned by these PCs, since the first stage essentially projects $Y$ onto this subspace. In other words, we can replace $U_1$ in steps 1, 2 and 3 of Algorithm 1 with the $Q$ obtained in step 3 of Algorithm 2 and directly let $\hat Y = Q\hat\gamma_1 + X_r\hat\gamma_2$. In the experiments of Section 4 we use this simplified version (a sketch combining these pieces is given at the end of this section).

2.2 COMPUTATIONAL COST

We claim that the cost of LING is $O(np(k_1 + n_2))$ FLOPS, where $k_1$ is the number of PCs used in the first stage and $n_2$ is the number of GD iterations in the second stage. According to (Halko et al., 2011a), the dominating steps in Algorithm 2 are computing $(XX^\top)^i X R$ and $Q^\top X$, which cost $O(npk_1)$ FLOPS. Computing $\hat\gamma_1$ and $Y_r$ costs less than $O(npk_1)$, and computing $X_r$ costs $O(npk_1)$. So the computational cost before the GD step is $O(npk_1)$. For the GD stage, note that in every iteration $Q$ never needs to be constructed explicitly: while computing $w_t$ and $s_t$, always multiplying matrix by vector first gives a cost of $O(np)$ per iteration, so the cost of the GD stage is $O(n_2 np)$. Adding all pieces together, the cost of LING is $O(np(k_1 + n_2))$ FLOPS.

Let $n_1$ be the number of iterations required to solve (1) directly by GD and let $k_2$ be the number of PCs used by PCR. It is easy to check that the cost of GD is $O(n_1 np)$ FLOPS and the cost of PCR is $O(npk_2)$. As mentioned in Remark 1, the advantage of LING over GD and PCR is that $k_2$ and $n_1$ may have to be really large to achieve high accuracy, while much smaller values of the pair $(k_1, n_2)$ suffice for LING. In the remaining part of the paper we use "signal on certain PCs" to refer to the projection of $Y$ onto those principal components of $X$. Consider the case where the signal is widely spread among all PCs (i.e., the projection of $Y$ onto the bottom PCs of $X$ is not very small) instead of concentrating on the top ones: $k_2$ then needs to be large for PCR to perform well, since the signal on the bottom PCs is discarded by PCR. LING, however, does not need to capture all the signal in the first stage regression, since the signal left over is estimated in the second stage. Therefore LING is able to recover most of the signal even with a small $k_1$.
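Combining the pieces, the simplified LING variant used in the experiments (project onto the approximate top-$k_1$ PC subspace, then run Algorithm 3 on the residuals, with no shrinkage) might look as follows. This sketch reuses the random_svd and gd_optimal_step functions from the sketches above; it is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch of the simplified LING: stage 1 projects Y onto the (approximate)
# top-k1 PC subspace, stage 2 runs GD with optimal step size on the residuals.
# Reuses the random_svd and gd_optimal_step sketches above.
import numpy as np

def ling_fitted_values(X, Y, lam, k1, n2, rng=None):
    U1, _ = random_svd(X, k1, rng=rng)       # span(U1) equals span(Q) in Algorithm 2
    g1 = U1.T @ Y                            # stage 1: coefficients of Y on U1
    Y_r = Y - U1 @ g1                        # residual response
    X_r = X - U1 @ (U1.T @ X)                # residual design matrix
    g2 = gd_optimal_step(X_r, Y_r, lam, n2)  # stage 2: ridge GD on the residuals
    return U1 @ g1 + X_r @ g2                # Y_hat = U1*g1 + X_r*g2
```

Both stages are dominated by matrix products costing $O(npk_1)$ plus $O(np)$ per GD iteration, which matches the $O(np(k_1+n_2))$ total claimed above.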
To understand the connection between accuracy and the number of iterations in Algorithm 3, we state the following theorem from Epelman (2007).

Theorem 1. Let $g(z) = z^\top M z + q^\top z$ be a quadratic function, where $M$ is a PSD matrix, and suppose $g$ achieves its minimum at $z^*$. Apply Algorithm 3 to this minimization problem and let $z_t$ be the value of $z$ after $t$ iterations. Then the gap between $g(z_t)$ and the minimum $g(z^*)$ satisfies
$$\frac{g(z_{t+1}) - g(z^*)}{g(z_t) - g(z^*)} \le C = \left(\frac{A - a}{A + a}\right)^2, \qquad (3)$$
where $A$ and $a$ are the largest and smallest eigenvalues of the matrix $M$.

Theorem 1 shows that the suboptimality of the target function decays exponentially as the number of iterations increases, and that the speed of decay depends on the largest and smallest eigenvalues of the PSD matrix defining the quadratic objective. If we apply GD directly to solve (1), let $f_1(\beta) = \|X\beta - Y\|^2 + n\lambda\|\beta\|^2$ and assume $f_1$ reaches its minimum at $\hat\beta$. Let $\hat\beta_t$ be the coefficient after $t$ iterations and let $d_i$ denote the $i$-th singular value of $X$. Applying Theorem 1, we have
$$\frac{f_1(\hat\beta_{t+1}) - f_1(\hat\beta)}{f_1(\hat\beta_t) - f_1(\hat\beta)} \le C_1 = \left(\frac{d_1^2 - d_p^2}{d_1^2 + d_p^2 + 2n\lambda}\right)^2. \qquad (4)$$
Similarly, for the second stage of LING, let $f_2(\gamma_2) = \|X_r\gamma_2 - Y_r\|^2 + n\lambda\|\gamma_2\|^2$ and assume $f_2$ reaches its minimum at $\hat\gamma_2$. We have
$$\frac{f_2(\hat\gamma_{2,t+1}) - f_2(\hat\gamma_2)}{f_2(\hat\gamma_{2,t}) - f_2(\hat\gamma_2)} \le C_2 = \left(\frac{d_{k_1+1}^2}{d_{k_1+1}^2 + 2n\lambda}\right)^2. \qquad (5)$$
In most real problems the top few eigenvalues of $X^\top X$ are much larger than the others and than $n\lambda$, so the constant $C_1$ in (4) can be very close to 1, which makes GD converge very slowly. On the other hand, removing the top few PCs makes the constant $C_2$ in (5) significantly smaller than $C_1$. In other words, GD may take a lot of iterations to converge when solving (1) directly, while the second stage of LING takes far fewer iterations. This can also be seen in the experiments of Section 4.
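As a small numeric illustration of the gap between (4) and (5), the two constants can be computed directly from a spectrum; the spectrum, $\lambda$ and $k_1$ below are arbitrary made-up values, not taken from the paper.

```python
# Contrast the contraction constants (4) and (5) for a made-up steep spectrum.
import numpy as np

n, lam, k1 = 1000, 1e-2, 10
d = 100.0 * 0.7 ** np.arange(50)                 # arbitrary decaying singular values
nlam = n * lam

C1 = ((d[0]**2 - d[-1]**2) / (d[0]**2 + d[-1]**2 + 2 * nlam)) ** 2   # eq. (4)
C2 = (d[k1]**2 / (d[k1]**2 + 2 * nlam)) ** 2                          # eq. (5); d[k1] is d_{k1+1}
print(C1, C2)   # C1 is close to 1, C2 is far smaller, so stage-2 GD converges much faster
```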

3 THEOREMS

In this section we compute the risk of the LING estimator (defined below) under the fixed design setting. For simplicity, assume that the $U_1, D_1$ generated by Algorithm 2 give exactly the top $k_1$ left singular vectors and singular values of $X$, and that the GD in step 4 of Algorithm 1 converges to the optimal solution. Let $X = UDV^\top$ be the SVD of $X$, where $U = (U_1, U_2)$ and $V = (V_1, V_2)$. Here $U_1, V_1$ are the top $k_1$ singular vectors and $U_2, V_2$ are the bottom $p - k_1$ singular vectors. Let $D = \mathrm{diag}(D_1, D_2)$, where the $k_1 \times k_1$ block $D_1$ contains the top $k_1$ singular values $d_1 \ge d_2 \ge \dots \ge d_{k_1}$ and the $(p-k_1) \times (p-k_1)$ block $D_2$ contains the bottom $p - k_1$ singular values. Let $D_3 = \mathrm{diag}(0, D_2)$ (replace $D_1$ in $D$ by a zero matrix of the same size).

3.1 THE FIXED DESIGN MODEL

Assume $X, Y$ come from the fixed design model $Y = X\beta + \epsilon$, where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. noise terms with mean 0 and variance $\sigma^2$. Here $X$ is fixed and the randomness of $Y$ comes only from $\epsilon$. Since $X = U_1 D_1 V_1^\top + X_r$, the fixed design model can also be written as
$$Y = (U_1 D_1 V_1^\top + X_r)\beta + \epsilon = U_1\gamma_1 + X_r\gamma_2 + \epsilon,$$
where $\gamma_1 = D_1 V_1^\top \beta$ and $\gamma_2 = \beta$. We use the squared $\ell_2$ distance between $E(Y\mid X)$ (the best possible prediction given $X$ under $\ell_2$ loss) and $\hat Y = U_1\hat\gamma_{1,s} + X_r\hat\gamma_2$ (the prediction of LING) as the loss function, which we call the risk in the following discussion. Note that $E(Y\mid X) = X\beta$ is linear in $X$ under the fixed design model. The risk of LING can be written as
$$\frac{1}{n} E\big\|E(Y\mid X) - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2 = \frac{1}{n} E\big\|U_1\gamma_1 + X_r\gamma_2 - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2.$$
Because $U_1^\top X_r = 0$, the risk further decomposes into two terms:
$$\frac{1}{n} E\big\|U_1\gamma_1 + X_r\gamma_2 - U_1\hat\gamma_{1,s} - X_r\hat\gamma_2\big\|^2 = \frac{1}{n} E\big\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\big\|^2 + \frac{1}{n} E\big\|X_r\gamma_2 - X_r\hat\gamma_2\big\|^2. \qquad (6)$$
Here the expectation is taken with respect to $\epsilon$. We calculate the two terms in (6) separately.

Lemma 1. With $\gamma_{1,j}$ denoting the $j$-th element of $\gamma_1$,
$$\frac{1}{n} E\big\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\big\|^2 = \frac{1}{n} \sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2}. \qquad (7)$$

Proof. Let $S$ be the $k_1 \times k_1$ diagonal matrix with $S_{j,j} = \frac{d_j^2}{d_j^2 + n\lambda}$. Then $\hat\gamma_{1,s} = S U_1^\top Y = S\gamma_1 + S U_1^\top \epsilon$ and $E(\hat\gamma_{1,s}) = S\gamma_1$. Therefore
$$\frac{1}{n} E\|U_1\gamma_1 - U_1\hat\gamma_{1,s}\|^2 = \frac{1}{n} E\|U_1 E(\hat\gamma_{1,s}) - U_1\hat\gamma_{1,s}\|^2 + \frac{1}{n}\|U_1\gamma_1 - U_1 E(\hat\gamma_{1,s})\|^2$$
$$= \frac{1}{n} E\|U_1 S U_1^\top\epsilon\|^2 + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2 = \frac{1}{n} E\,\mathrm{Tr}(U_1 S^2 U_1^\top \epsilon\epsilon^\top) + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2$$
$$= \frac{1}{n}\mathrm{Tr}(S^2)\sigma^2 + \frac{1}{n}\|\gamma_1 - S\gamma_1\|^2 = \frac{1}{n}\sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2}.$$
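Lemma 1 is easy to check numerically. The following Monte Carlo sketch compares the empirical left-hand side of (7), averaged over noise draws, with the right-hand side, on a small made-up fixed design; all sizes and parameter values here are arbitrary.

```python
# Monte Carlo sanity check of Lemma 1 (the stage-1 term of the risk).
import numpy as np

rng = np.random.default_rng(0)
n, p, k1, lam, sigma = 200, 30, 5, 0.1, 1.0

X = rng.standard_normal((n, p))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
U1, d1 = U[:, :k1], d[:k1]
beta = rng.uniform(-0.5, 0.5, p)
gamma1 = d1 * (Vt[:k1] @ beta)                     # gamma_1 = D_1 V_1' beta
S = d1**2 / (d1**2 + n * lam)                      # shrinkage factors S_jj

# right-hand side of (7)
theory = np.sum((d1**4 * sigma**2 + (n * lam)**2 * gamma1**2)
                / (d1**2 + n * lam)**2) / n

# left-hand side of (7), averaged over noise draws
Xbeta, reps, acc = X @ beta, 5000, 0.0
for _ in range(reps):
    Y = Xbeta + sigma * rng.standard_normal(n)
    g1s = S * (U1.T @ Y)                           # shrunken stage-1 coefficients
    acc += np.sum((gamma1 - g1s) ** 2) / n         # ||U1(gamma1 - g1s)||^2 = ||gamma1 - g1s||^2
print(theory, acc / reps)                          # the two numbers should be close
```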

Now consider the second term in (6). Note that $X_r = U D_3 V^\top$. The residual $Y_r$ after the first stage can be written as $Y_r = Y - U_1\hat\gamma_1 = (I - U_1 U_1^\top)Y = X_r\gamma_2 + (I - U_1 U_1^\top)\epsilon$, and the optimal coefficient obtained in the second stage is $\hat\gamma_2 = (X_r^\top X_r + n\lambda I)^{-1} X_r^\top Y_r$. For simplicity, let $\epsilon_2 = (I - U_1 U_1^\top)\epsilon$.

Lemma 2. With $\alpha_i$ denoting the $i$-th element of $\alpha = V^\top\gamma_2$,
$$E\big\|X_r\gamma_2 - X_r\hat\gamma_2\big\|^2 = \sum_{i=k_1+1}^{p} \frac{d_i^4\sigma^2 + n^2\lambda^2 d_i^2\alpha_i^2}{(d_i^2 + n\lambda)^2}. \qquad (8)$$

Proof. First define $X_\lambda = X_r^\top X_r + n\lambda I$ and $D_\lambda = D_3^2 + n\lambda I$. Then
$$E\|X_r\gamma_2 - X_r\hat\gamma_2\|^2 = \|X_r\gamma_2 - X_r E(\hat\gamma_2)\|^2 + E\|X_r E(\hat\gamma_2) - X_r\hat\gamma_2\|^2, \qquad (9),\ (10)$$
and we consider the two terms (9) and (10) on the right-hand side separately. For the first term,
$$(9) = \|X_r X_\lambda^{-1} X_r^\top X_r\gamma_2 - X_r\gamma_2\|^2 = \|U D_3 D_\lambda^{-1} D_3^2 V^\top\gamma_2 - U D_3 V^\top\gamma_2\|^2 = \|D_3 D_\lambda^{-1} D_3^2\alpha - D_3\alpha\|^2 = \sum_{i=k_1+1}^{p} \frac{\alpha_i^2 d_i^2\, n^2\lambda^2}{(d_i^2 + n\lambda)^2}.$$
For the second term,
$$(10) = E_{\epsilon_2}\|X_r X_\lambda^{-1} X_r^\top\epsilon_2\|^2 = E_{\epsilon_2}\,\mathrm{Tr}\!\left(X_r X_\lambda^{-1} X_r^\top X_r X_\lambda^{-1} X_r^\top \epsilon_2\epsilon_2^\top\right) = \mathrm{Tr}\!\left(D_3 D_\lambda^{-1} D_3^2 D_\lambda^{-1} D_3\, E[U^\top\epsilon_2\epsilon_2^\top U]\right).$$
Since $E[U^\top\epsilon_2\epsilon_2^\top U] = \mathrm{diag}(0, I_{p-k_1})\sigma^2$ (the identity matrix with its top $k_1 \times k_1$ block replaced by zero),
$$(10) = \sum_{i=k_1+1}^{p} \frac{d_i^4}{(d_i^2 + n\lambda)^2}\sigma^2. \qquad (11)$$
Adding the two terms finishes the proof.

Plugging (7) and (8) into (6), we obtain:

Theorem 2. The risk of the LING algorithm under the fixed design setting is
$$\frac{1}{n}\sum_{j=1}^{k_1} \frac{d_j^4\sigma^2 + \gamma_{1,j}^2 n^2\lambda^2}{(d_j^2 + n\lambda)^2} + \frac{1}{n}\sum_{i=k_1+1}^{p} \frac{d_i^4\sigma^2 + n^2\lambda^2 d_i^2\alpha_i^2}{(d_i^2 + n\lambda)^2}. \qquad (12)$$

Remark 3. This risk is the same as the risk of ridge regression given by Lemma 1 in (Dhillon et al., 2013). In fact, LING gives exactly the same prediction as RR on the training dataset. This is intuitive, since on the training set LING essentially decomposes the RR solution into a first stage shrinkage predictor on the top $k_1$ PCs and a second stage predictor over the residual space, as explained in Section 2.

4 EXPERIMENTS

In this section we compare the accuracy and computational cost (measured in FLOPS) of three different algorithms for solving Ridge Regression: Gradient Descent with optimal step size (GD), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013), and LING. Here SVRG is an improved version of stochastic gradient descent which achieves exponential convergence with a constant step size. We also consider Principal Component Regression (PCR) (Artemiou and Li, 2009; Jolliffe, 2005), another common way of running large scale regression. Experiments are performed on three simulated models and two real datasets. In general, LING performs well on all three simulated datasets, while GD, SVRG and PCR each fail in some cases. On the two real datasets all algorithms give reasonable performance, with GD and LING the best. Moreover, both stages of LING require only a moderate number of matrix multiplications, each costing $O(np)$, which is much faster to run in MATLAB than SVRG, which contains a lot of loops.

Table 1: Parameter setup for the simulated data: the grids of values of $k_2$ (PCR), $n_1$ (GD), $n_3$ (SVRG) and $(k_1, n_2)$ (LING) used for Models 1, 2 and 3.

4.1 SIMULATED DATA

Three datasets are constructed from the fixed design model $Y = X\beta + \epsilon$. In the three experiments $X$ and $\beta$ are generated randomly in different ways (more details in the following sections) and i.i.d. Gaussian noise is added to $X\beta$ to obtain $Y$. Then GD, SVRG, PCR and LING are run on each dataset. For GD we try different numbers of iterations $n_1$. For SVRG we vary the number of passes through the data, denoted $n_3$; the number of iterations SVRG takes equals the number of passes times the sample size, and each iteration takes $O(p)$ FLOPS. The step size of SVRG is chosen by cross validation, but this cost is not counted in the total computational cost. Note that one advantage of GD and LING is that, due to the simple quadratic form of the target function, their step sizes can be computed directly from the data without cross validation, which would introduce extra cost. For PCR we pick different numbers of PCs ($k_2$). For LING we pick the top $k_1$ PCs in the first stage and try different numbers of iterations $n_2$ in the second stage. The computational cost and the risk of the four algorithms are computed, and the whole procedure is repeated over random generations of $X$, $\beta$ and $Y$. The risk and computational cost of the exact RR solution (2) for every dataset are also computed as a benchmark. The parameter setup for the three datasets is listed in Table 1.
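The simulated designs below all follow the same recipe: fix a spectrum, build $X = U_e D_e V_e^\top$ with random orthonormal factors, draw $\beta$ uniformly and add Gaussian noise. A generic sketch follows; the sizes, noise level and the flat spectrum chosen here are placeholders, not the paper's values.

```python
# Generic sketch of the simulated-data construction used in Models 1-3.
import numpy as np

def make_design(n, p, spectrum, rng):
    """X = U_e diag(spectrum) V_e' with random orthonormal U_e, V_e."""
    Ue, _ = np.linalg.qr(rng.standard_normal((n, p)))
    Ve, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return Ue @ np.diag(spectrum) @ Ve.T

rng = np.random.default_rng(0)
n, p, sigma = 500, 100, 1.0
spectrum = np.sort(rng.uniform(0.0, 1.0, p))[::-1]   # e.g. a flat spectrum (Model 2 style)
X = make_design(n, p, spectrum, rng)
beta = rng.uniform(-0.5, 0.5, p)
Y = X @ beta + sigma * rng.standard_normal(n)
```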
4.1.1 MODEL 1

In this model the design matrix $X$ has a steep spectrum: the top singular values of $X$ decay exponentially, so they are much larger than the rest; the spectrum is shown in Figure 4. To generate $X$, we fix a diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e V_e^\top$, where $U_e, V_e$ are two random orthonormal matrices. The elements of $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$. Under this setup most of the energy of the $X$ matrix lies in the top PCs, since the top singular values are much larger than the remaining ones, so PCR works well. But, as indicated by (4), the convergence of GD is very slow.

The computational cost and average risk of the four algorithms, together with the exact RR solution (2), averaged over the repetitions, are shown in Figure 1. Both PCR and LING work well, achieving risk close to RR at less computational cost. SVRG is worse than PCR and LING but much better than GD.

Figure 1: Model 1, risk vs. computational cost (FLOPS, log scale). PCR and LING approach the RR risk very fast. SVRG also approaches the RR risk but costs more than the previous two. GD is very slow and inaccurate.

Figure 4: Top singular values of X in Model 1.

4.1.2 MODEL 2

In this model the design matrix $X$ has a flat spectrum: the singular values of $X$ are sampled uniformly from a fixed interval, and the spectrum is shown in Figure 5. To generate $X$, we fix the diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e V_e^\top$, where $U_e, V_e$ are two random orthonormal matrices. The elements of $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$. Under this setup the signal is spread widely among all PCs, since the spectrum of $X$ is relatively flat. PCR breaks down because it fails to catch the signal on the bottom PCs, while, as indicated by (4), GD converges relatively fast due to the flat spectrum of $X$. The computational cost and average risk of the four algorithms, together with the RR solution (2), averaged over the repetitions, are shown in Figure 2. GD works best, approaching the risk of RR at the lowest computational cost. LING and SVRG also work, achieving reasonably low risk at less computational cost. PCR works poorly, as explained before.

Figure 2: Model 2, risk vs. computational cost (FLOPS, log scale). GD approaches the RR risk very fast. SVRG and LING are slower than GD but still achieve risk close to RR at less cost. PCR is slow and has a huge risk.

Figure 5: Singular values of X in Model 2.

4.1.3 MODEL 3

This model presents a special case where both PCR and GD break down. The singular values of $X$ are constructed by first sampling uniformly from a fixed interval and then multiplying the top few sampled values by a large constant; the resulting top singular values are shown in Figure 6. To generate $X$, we fix the diagonal matrix $D_e$ with the designed spectrum and construct $X = U_e D_e$, where $U_e$ is a random orthonormal matrix. A block of leading elements and a block of trailing elements of the coefficient vector $\beta$ are sampled uniformly from the interval $[-0.5, 0.5]$, and the other elements of $\beta$ remain 0. In this setup $X$ has orthogonal columns, which are its PCs, and the signal lies only on the top and bottom PCs. PCR will not work, since a large proportion of the signal lies on the bottom PCs. On the other hand, GD will not work either, since the top few singular values are too large compared with the others, which makes GD converge very slowly. The computational cost and risk of the four algorithms, together with the RR solution (2), averaged over the repetitions, are shown in Figure 3. LING works best in this setup. SVRG is slightly worse than LING but still approaches RR at less cost. GD converges slowly and PCR is completely off target, as explained before.

Figure 3: Model 3, risk vs. computational cost (FLOPS, log scale). LING approaches the RR risk the fastest. SVRG is slightly slower than LING. GD also approaches the RR risk but costs more than LING. PCR has a huge risk no matter how many PCs are selected.

Figure 6: Top singular values of X in Model 3.

4.2 REAL DATA

In this section we compare GD, SVRG, PCR and LING with the exact RR solution (2) on two real datasets.

4.2.1 GISETTE DATASET

The first is the Gisette dataset (Guyon et al., 2004) from the UCI repository, a binary classification task. Every row of the $6000 \times 5000$ design matrix $X$ consists of pixel features of a single digit "4" or "9", and $Y$ gives the class label. Among the 6000 samples we use 5000 for training and 1000 for testing. The classification error rate of the exact RR solution (2) serves as the benchmark.
Since the goal is to compare different algorithms for regression, we do not aim for state of the art accuracy on this dataset, as long as regression works reasonably well.

When running PCR we pick a grid of values for the number of PCs $k_2$; for GD we try a grid of iteration counts $n_1$; for SVRG we try a grid of numbers of passes $n_3$ through the data; and for LING we pick two small values of $k_1$ for the first stage and try a grid of iteration counts $n_2$ for the second stage. The computational cost and average classification error of the four algorithms, and of the exact RR solution (2), on the test set over 6 different train/test splits are shown in Figure 7. The top singular values of $X$ are shown in Figure 9. GD gets close to the RR error very fast. The two LING curves (one for each choice of $k_1$) are slower than GD at first, since some initial FLOPS are spent computing the top PCs, but after that they approach the RR error very fast. SVRG also converges to RR but costs more than the previous two algorithms. PCR performs worst in terms of both error rate and computational cost.

Figure 7: Gisette, error rate vs. computational cost (FLOPS, log scale). GD achieves a small error rate fastest. The two LING curves spend some FLOPS computing the top PCs first, but then converge very fast to a low error rate. SVRG and PCR also provide reasonably small error rates faster than the exact RR solution, but are suboptimal compared with GD and LING.

4.2.2 BUZZ IN SOCIAL MEDIA

The second dataset is the UCI Buzz in Social Media dataset, a regression task. The goal is to predict the popularity (measured by the mean number of active discussions) of a topic on Twitter over a period. The original features are statistics about the topic over that period, such as the number of discussions created and the number of new authors interacting with the topic. The original feature dimension is 77; we add quadratic interactions to make it 3080 (sketched after this subsection). To save time we use only a subset of the samples, split into a training set and a test set, and we use the MSE on the test set as the error measure. For PCR we pick a grid of values of $k_2$, for GD a grid of iteration counts $n_1$, for SVRG a grid of numbers of passes $n_3$, and for LING two small values of $k_1$ in the first stage and a grid of iteration counts $n_2$ in the second stage. The computational cost and average test MSE over 5 different train/test splits are shown in Figure 8, and the top singular values of $X$ are shown in Figure 10. GD approaches the MSE of RR very fast. LING spends some initial FLOPS computing the top PCs but converges fast after that. SVRG and PCR also achieve reasonable performance but are suboptimal compared with GD and LING. The MSE of PCR first decreases as more PCs are added to the regression but eventually goes up due to overfitting.
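The quadratic interaction expansion of the 77 Buzz features can be sketched as below. Including the squares and keeping the original columns (which gives 77 + 77·78/2 = 3080 columns) is an assumption consistent with the stated dimension, not an explicit specification of the paper.

```python
# Sketch of a quadratic-interaction expansion: append all pairwise products
# (including squares) to the original feature columns.  Illustrative only.
import numpy as np

def add_quadratic_interactions(F):
    n, q = F.shape
    i, j = np.triu_indices(q)                  # all index pairs (i, j) with i <= j
    return np.hstack([F, F[:, i] * F[:, j]])   # original features + products

F = np.random.default_rng(0).standard_normal((5, 3))
print(add_quadratic_interactions(F).shape)     # (5, 3 + 3*4/2) = (5, 9)
```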

Figure 8: Buzz, MSE vs. computational cost (FLOPS, log scale). GD and the two LING curves achieve a small MSE fast. SVRG is slower than LING and GD. PCR reaches its smallest MSE at a moderate number of PCs and then overfits.

Figure 9: Top singular values of X in the Gisette dataset.

Figure 10: Top singular values of X in the social media Buzz dataset.

5 SUMMARY

In this paper we present LING, a two stage algorithm for computing large scale Ridge Regression which is both fast and robust, in contrast to the well known approaches GD and PCR. We show that under the fixed design setting LING has the same risk as Ridge Regression, assuming convergence. In the experiments, LING achieves good performance on all datasets when compared with three other large scale regression algorithms. We conjecture that the same strategy can also be used to accelerate the convergence of stochastic gradient descent when solving Ridge Regression, since the first stage of LING essentially removes the high variance directions of $X$, which leads to variance reduction for the random gradient directions generated by SGD.

References

Andreas Artemiou and Bing Li. On principal components and regression: a statistical explanation of a natural phenomenon. Statistica Sinica, 19(4), 2009.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), Paris, France, August 2010. Springer.

Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression. Journal of Machine Learning Research, 14:1505-1511, 2013.

Marina A. Epelman. Rate of convergence of steepest descent algorithm. Lecture notes, 2007.

Isabelle Guyon, Asa Ben Hur, Steve Gunn, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17. MIT Press, 2004.

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, May 2011a.

Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580-2594, 2011b.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.

Ian Jolliffe. Principal Component Analysis. Encyclopedia of Statistics in Behavioral Science. John Wiley & Sons, 2005.

Yichao Lu, Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems (NIPS), 2013.

G. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1998.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. Omnipress, 2004.


REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER TODAY Advances in model estimation methodology Application to data that comes in rectangles Examples ESTIMATION Problems with MLE

More information

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Slide credit: http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

Practical Least-Squares for Computer Graphics

Practical Least-Squares for Computer Graphics Outline Least Squares with Generalized Errors Robust Least Squares Constrained Least Squares Practical Least-Squares for Computer Graphics Outline Least Squares with Generalized Errors Robust Least Squares

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course

More information

Topics in Machine Learning-EE 5359 Model Assessment and Selection

Topics in Machine Learning-EE 5359 Model Assessment and Selection Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

CSC 2515 Introduction to Machine Learning Assignment 2

CSC 2515 Introduction to Machine Learning Assignment 2 CSC 2515 Introduction to Machine Learning Assignment 2 Zhongtian Qiu(1002274530) Problem 1 See attached scan files for question 1. 2. Neural Network 2.1 Examine the statistics and plots of training error

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 3 Due Tuesday, October 22, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information