An Adaptive Lasso Method for Dimensionality Reduction in Classification


An Adaptive Lasso Method for Dimensionality Reduction in Classification

Tian Tian, Gareth James, and Rand Wilcox

Abstract

Producing classification methods for high-dimensional data is becoming an increasingly important problem. In this article, we propose an adaptive Lasso search (ALS) approach which first reduces the dimension of the data space and then applies a standard classification method to the reduced space. One key advantage of ALS is that it automatically adjusts to mimic either variable selection methods, such as the Lasso, or variable combination methods, such as PCA, or something between the two approaches. The adaptivity of ALS allows it to perform well in situations where pure variable selection or pure variable combination methods fail. ALS uses the Lasso in a stochastic search algorithm to select a handful of optimal projection directions from a large number of random directions at each iteration. We demonstrate the strengths of ALS on an extensive range of simulation studies and real world data sets and compare it to many classical and modern classification methods.

Tian Tian is a Ph.D. student, Department of Psychology, University of Southern California, Los Angeles, CA (ttian@usc.edu). Gareth James is Associate Professor, Department of Information and Operations Management, University of Southern California, Los Angeles, CA (gareth@usc.edu). Rand Wilcox is Professor, Department of Psychology, University of Southern California, Los Angeles, CA (rwilcox@usc.edu).

Keywords: Lasso; Dimensionality reduction; Variable selection; Variable combination; Classification

1 Introduction

An increasingly important topic is the classification of observations into two or more predefined groups when the number of predictors, d, is larger than the number of observations, n. Many conventional classification methods, such as Fisher's discriminant rule, are not applicable if n < d. Other methods, such as classification trees, k-nearest neighbors, and logistic discriminant analysis, do not explicitly require n > d, but in practice they provide poor classification accuracy when d is ultra large. One major reason is that when n < d, the covariance matrix for each class is singular and hence unstable. A common solution is to first reduce the data into a lower, p ≪ d, dimensional space and then perform classification on the transformed data. Fan and Lv (2008) provide significant theoretical and empirical evidence for the power of such an approach. Generally, the dimension reduction is performed using a linear transformation of the form

    Z_{n×p} = X_{n×d} P_{d×p},    (1)

where X is an n-by-d data matrix and P is a d-by-p projection matrix which projects X onto a p-dimensional subspace Z (p < d). However, there are many possible approaches to choosing P. We divide the methods into supervised versus unsupervised and into variable selection versus variable combination. Variable selection techniques select a subset of relevant variables that have good predictive power, and so benefit from obtaining a small set of informative variables out of a larger set of more complex variables.
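As a simple illustration of the transformation in Equation (1), the following base-R sketch projects an n-by-d data matrix onto a p-dimensional subspace. The dimensions and the random projection used here are arbitrary choices for illustration, not a construction taken from the paper.

# Illustrative only: reduce an n-by-d data matrix to p dimensions via Z = X P.
set.seed(1)
n <- 100; d <- 50; p <- 5
X <- matrix(rnorm(n * d), n, d)   # n-by-d data matrix
P <- matrix(rnorm(d * p), d, p)   # d-by-p projection matrix (dense in this example)
Z <- X %*% P                      # n-by-p reduced data, as in Equation (1)
dim(Z)                            # 100 x 5

A classifier would then be trained on Z rather than on X.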

In this setting, P is some row permutation of the identity matrix stacked on a zero matrix, i.e. P = perm([I_p, 0_{(d-p)×p}]^T). Hence, P is considered sparse because most of its components are zeros. Many variable selection methods have been proposed and widely used in numerous areas. A great deal of attention has been paid to the L_1-penalized least squares estimator, i.e., the Lasso (Efron et al., 2004; Fu and Knight, 2000; Tibshirani, 1996; Zhao and Yu, 2006). Other methods include nearest shrunken centroids (Tibshirani et al., 2002), the Dantzig selector (Candes and Tao, 2007), SCAD (Fan and Li, 2001), the Elastic Net (Zou and Hastie, 2005), and Bayesian methods of variable selection (George and McCulloch, 1993, 1995, 1997; Mitchell and Beauchamp, 1988). These methods all use supervised learning, where the response and predictors are both utilized to obtain the subset of variables. In addition, because they always select a subset of the original variables, they provide highly interpretable results. However, because of the sparsity of P, for a given p, variable selection methods are less efficient at compressing the observed data, X. For example, they may discard potentially valuable variables which are not predictive individually but provide significant improvement in conjunction with others. In comparison, variable combination methods utilize a dense P which combines correlated variables and hence deals well with the multicollinearity which often occurs in high-dimensional data. Probably the most commonly applied method in this category is principal component analysis (PCA) (Hotelling, 1933; Pearson, 1901). Using this approach, P consists of the first p eigenvectors of the sample covariance matrix of X, and Z contains the associated PCA scores. PCA can deal with an ultra large data scale and produces the most efficient p-dimensional representation of X. However, PCA is an unsupervised learning technique, so it does not make use of the response variable to construct P.
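To make the two extremes concrete, the short sketch below (our own illustration in base R, not code from the paper) constructs a variable-selection projection, i.e. a row permutation of [I_p, 0]^T, and a PCA projection whose columns are the leading principal component loadings.

# Variable selection: P is a row permutation of [I_p, 0]^T, so Z = X %*% P
# simply extracts p of the original columns of X.
set.seed(2)
n <- 100; d <- 50; p <- 5
X <- matrix(rnorm(n * d), n, d)
selected <- sample(d, p)                 # indices of the selected variables
P_select <- matrix(0, d, p)
P_select[cbind(selected, 1:p)] <- 1      # one 1 per column; all other entries zero
Z_select <- X %*% P_select               # identical to X[, selected]

# Variable combination: P holds the first p principal component loadings (dense).
pca <- prcomp(X, center = TRUE)
P_pca <- pca$rotation[, 1:p]
Z_pca <- scale(X, center = TRUE, scale = FALSE) %*% P_pca   # the first p PCA scores

The sparsity of P_select is (dp - p)/(dp), close to one, while P_pca has essentially no zero entries.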

It has been well shown that the dimensions that best explain X will not necessarily be the best dimensions for predicting the response Y. Other variable combination methods include partial least squares regression (Wold, 1966) and multidimensional scaling (Torgerson, 1958; Young and Householder, 1938). All of these approaches yield linear combinations of variables, which makes interpretation more difficult. In this article we propose a new supervised learning method that is capable of automatically adapting the sparsity of P to generate optimal prediction accuracy. By sparsity here we refer to the fraction of zero elements of P. Our method uses the Lasso, so we call it an adaptive Lasso search (ALS) procedure. In situations where a subset of the original variables provides a good fit, ALS will utilize a sparse P matrix, while in situations where linear combinations of the variables work better, ALS will produce a denser matrix. ALS has the same advantage as other supervised methods, in that P can be designed specifically to provide the best level of prediction accuracy. However, it has the flexibility to adapt the sparsity of P so as to gain the benefits of both the variable selection and variable combination methods. ALS uses a stochastic search procedure. It starts by generating a large set of prospective columns of P with sparsity ξ_0. The Lasso is then used to select a candidate subset of good columns, or directions. Then a new set of columns with a new sparsity level ξ_1 is produced as candidates, and the Lasso must choose among the current set of good columns and the new candidates. Over time the same best columns are picked at each iteration and the process converges. At each step in the algorithm the sparsity level of the new prospective columns is adjusted according to the sparsity level of the previously chosen columns. We show through extensive simulations and real world examples that ALS is highly robust in that it generally provides comparable performance to variable selection and variable combination approaches in situations that favor each of these methods.

However, ALS can still perform well in situations where one or the other of these approaches fails. This article is organized as follows. In Section 2, we present and motivate the basic ALS methodology. We also discuss methods for implementing a preliminary data reduction before applying ALS. In Section 3 we provide an extensive simulation comparison of ALS with several other modern classification techniques such as k-nearest neighbors, support vector machines (Vapnik, 1998), random forests (Breiman, 2001), and neural networks. In Section 4 we demonstrate the performance of ALS on two real world data sets. We briefly investigate the computational cost of implementing ALS in Section 5. Section 6 concludes the article with a summary and a discussion of future areas of investigation.

2 Projection selection with ALS

2.1 General ideas and motivations

Our general approach involves choosing a classification procedure and a matrix P. We then use Equation (1) to project the data into a p-dimensional space and apply the classification procedure to the lower dimensional data. While this approach requires choosing both the classifier and the projection matrix, we concentrate only on the latter problem. There are many classification techniques that have been demonstrated to perform well on low to medium dimensional data; the more difficult question involves the best way to produce the lower dimensional data. Hence, we assume that one of these classification methods has been chosen and concentrate our attention on the choice of P.

Given a classification method, the choice of the best subspace for classification can be formulated as the following optimization problem:

    P* = arg min_P E_{X,Y} [ e(M_P(X), Y) ],    (2)

where M_P is the classification method applied to the lower dimensional data, X and Y are the predictors and response variables, and e is a loss function resulting from using M_P to predict Y. A common choice for e is the 0-1 loss function, which results in minimizing the misclassification rate (MCR). Since P is very high dimensional, solving (2) is in general a difficult problem. There are several possible approaches. One option is to assume that P is a sparse 0-1 matrix and only attempt to estimate the locations of its non-zero elements. This is the approach taken by variable selection methods. Another option is to assume P is dense but, instead of choosing P to optimize (2), choose a matrix which provides a good representation of X. This is the approach taken by PCA. One then hopes that the PCA solution will be close to that of (2). Instead of making restrictive assumptions about P, as with the variable selection approach, or failing to directly optimize (2), as with the PCA approach, we attempt to directly fit (2) without placing restrictions on P. In this type of high dimensional nonlinear optimization problem, stochastic search methods, such as genetic algorithms, simulated annealing, and evolutionary strategies, have been shown to provide superior results over more traditional deterministic procedures, because they are often able to search large parameter spaces more effectively, can be used for any class of objective function, and yield an asymptotic guarantee of convergence (Fogel, 2000; Fu, 2002; Gosavi, 2003; Liberti and Kucherenko, 2005).
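In practice, the expected loss in (2) must be estimated from data. A minimal base-R sketch of doing so for a candidate P, using a training/test split and logistic regression purely as an example of M_P (the function name and the 0.5 threshold are our own choices), is:

# Estimate the misclassification rate (0-1 loss) obtained by projecting the data
# with a candidate P and then applying a logistic regression classifier.
# y_train and y_test are 0/1 class labels.
mcr_for_P <- function(P, X_train, y_train, X_test, y_test) {
  Z_train <- X_train %*% P
  Z_test  <- X_test  %*% P
  fit   <- glm(y ~ ., data = data.frame(y = y_train, Z_train), family = binomial)
  prob  <- predict(fit, newdata = data.frame(Z_test), type = "response")
  y_hat <- as.numeric(prob > 0.5)
  mean(y_hat != y_test)          # estimated MCR for this choice of P
}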

Tian et al. (2008) used a stochastic simulated annealing algorithm in a similar situation to ours. They found that the approach was effective at producing accurate classifications. However, it required a large number of iterations to effectively search the parameter space and hence was computationally expensive. We explore an alternative stochastic process which is often even more effective at searching the parameter space and generally requires far fewer iterations.

2.2 The ALS method

The ALS procedure works by successively generating a large number, L, of potential random directions, i.e. columns. We then use Equation (1) to compute the corresponding L-dimensional data space, z_1, z_2, ..., z_L, and use a variable selection method to select a good subset of these directions to form an initial estimate for P. The sparsity level of this P is examined, and a new random set of potential columns is generated with the same average sparsity as the current P. The procedure iterates in this fashion for a fixed number of steps. Formally, the ALS procedure consists of the following steps:

Step 1. Randomly generate the current d-by-L projection matrix P_0 (p < L < d) with an expected sparsity of ξ_0.

Step 2. At the lth iteration, use (1) and P_l to obtain a preliminary reduced data space Z_l. Evaluate each variable of Z_l and select the p most important variables.

Step 3. Keep the corresponding p columns of P_l and calculate the average sparsity, ξ_l, of these columns.

Step 4. Generate L - p new columns with an average sparsity of ξ_l. Join these columns with the p columns selected in Step 3 to form a new projection matrix P_{l+1}.

Step 5. Return to Step 2 for a fixed number of iterations.

The two crucial parts of this approach are the mechanism for generating the potential columns and the procedure for choosing the best p columns. We discuss the column generation procedure in Section 2.3. As far as the choice of variables is concerned, we considered several possible variable selection methods. In the context of classification, a natural approach is to use a GLM version of the Lasso within a logistic regression framework. We examined this approach using the GLMpath methodology of Park and Hastie (2007). Interestingly, we found that simply using the standard Lasso procedure to select the variables gave similar levels of accuracy to GLMpath and required considerably less computational effort. Therefore, in our implementation of ALS we use the first p variables selected by the Lars algorithm (Tibshirani, 1996). However, in practice one could implement our methodology with any standard variable selection method, such as SCAD or the Dantzig selector.

2.3 Generating projection directions

At each iteration of the ALS algorithm we generate new potential columns with the same average sparsity level as the columns selected in the previous step. This allows the sparsity level to automatically adjust to the data set. The idea is that a data set requiring high levels of sparsity, i.e. one where a variable selection method will perform well, will tend to result in sparser columns being selected, and the current P_l will become sparser at each iteration. Alternatively, a data set that requires a denser P_l will select less sparse columns at each iteration. We define the sparsity of the current matrix P_l as

    ξ_l = (1/(dp)) Σ_{i=1}^{d} Σ_{j=1}^{p} I( p_{i,j}^{(l)} = 0 ),    (3)

where p_{i,j}^{(l)} is the (i, j)th element of P_l.
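For reference, the two ingredients just described, measuring the sparsity of a set of retained columns (Equation (3)) and extracting the first p variables to enter a Lasso path, might be sketched as follows. The Lasso step assumes the glmnet package as a convenient stand-in; the paper itself uses the LARS implementation, and any variable selection method could be substituted.

# Equation (3): the fraction of exactly-zero entries in a projection matrix.
sparsity <- function(P) mean(P == 0)

# Stand-in for "the first p variables selected by the Lasso": fit a Lasso path
# with glmnet and record the order in which variables first become non-zero.
library(glmnet)
first_p_lasso <- function(Z, y, p) {
  fit   <- glmnet(Z, y)                 # Lasso path (gaussian working model)
  beta  <- as.matrix(fit$beta)          # ncol(Z) x (number of lambda values)
  entry <- apply(beta != 0, 1, function(nz) if (any(nz)) which(nz)[1] else Inf)
  order(entry)[1:p]                     # indices of the first p variables to enter
}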

We then generate L - p new columns such that the expected fraction of zero elements over all columns is ξ_l. While the overall average sparsity is restricted to equal ξ_l, we desire some variability in the sparsity levels, so that ALS is able to select columns with higher or lower sparsity and hence adjust ξ_l for the next iteration. To achieve this goal we allow for different average sparsity levels between columns. Let ξ_j^{(l)} represent the average sparsity of the jth column in iteration l. Then in the (l+1)th iteration we generate ξ_j^{(l+1)} from a Beta(α, α(1 - ξ_l)/ξ_l) distribution. Note that E(ξ_j^{(l+1)}) = ξ_l for all values of α. We found that α = 10 produced a reasonable amount of variance in the sparsity levels. To produce the L - p columns we then simulate u_{i,j} ~ N(0, 1) and v_{i,j} ~ B(1 - ξ_j^{(l+1)}), for i = 1, ..., d and j = p+1, ..., L, and set p_{i,j}^{(l+1)} = u_{i,j} v_{i,j}, where p_{i,j}^{(l+1)} represents the (i, j)th element of P_{l+1}. Here, B(π) is the Bernoulli distribution with probability of 1 equal to π. We then combine these L - p columns with the selected p columns from iteration l to form the new intermediate projection matrix P_{l+1}. We select ξ_0 = 0.5 for the initial sparsity level, which seems to provide a reasonable compromise between the variable selection and variable combination paradigms. The full ALS algorithm is explicitly described as follows:

1. Set ξ_0 = 0.5 and generate an initial P_0.

2. Iterate until l = I:

(a) Generate L - p new projection directions P_new with (i, j)th element p_{i,j} = u_{i,j} v_{i,j}, where u_{i,j} ~ N(0, 1), v_{i,j} ~ B(1 - ξ_j), and ξ_j ~ Beta(α, α(1 - ξ_{l-1})/ξ_{l-1}).

(b) Let P_l = (P_{l-1}, P_new) and use (1) to obtain Z_l.

(c) Apply the Lasso to select p variables from Z_l.

(d) Keep the corresponding p columns of P_l to obtain the updated P_l.

(e) Calculate ξ_l for P_l using (3).

(f) Go to (a).

To implement ALS, one must choose values for the number of iterations, I, the number of random columns to generate, L, and the final number of columns chosen, p. In our experiments we used I = 500 iterations. We found this value guaranteed convergence, and often fewer iterations were required. In general, tracking the deviance of the Lasso at each iteration provided a reliable measure of convergence. The value of L influences the convergence speed of the algorithm as well as the execution time. We found that the best results were obtained by choosing L to be a relatively large value for the early iterations and then letting it decline. In particular, we set L = I/10 + 3p/2 for the first iteration and have it decline to 3p/2 by the final iteration. This approach is similar to simulated annealing, where the temperature is lowered over time. It guarantees that the Lasso has more variables to choose among at the early stages; as the search moves towards the optimum, decreasing L improves the reliability of the selected p variables in addition to speeding up the algorithm. The choice of p can obviously have an important impact on the results. In general, p should be chosen to balance classification accuracy against computational expense. One reasonable approach is to use an objective criterion, such as cross-validation, to choose p. In other situations, prior knowledge can be applied to select p. For example, in the fMRI study that we examine in Section 4, prior knowledge and assumptions about the regions of interest can be used to decide on a reasonable value for p.
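Putting these pieces together, the following base-R sketch outlines one possible implementation of the search loop, including the Beta/Bernoulli column generation and a linearly declining schedule for L. It is a simplified skeleton under our own naming conventions, not the authors' code: the variable selection step is passed in as a function (select_p), and the simple correlation ranking supplied as its default is only a placeholder for the Lasso step described above.

# Skeleton of the ALS search.  select_p(Z, y, p) must return the indices of the
# p columns of Z judged most useful (e.g. the first p variables to enter a
# Lasso path); the default below ranks columns by marginal correlation only.
als_search <- function(X, y, p, I = 500, alpha = 10,
                       select_p = function(Z, y, p)
                         order(-abs(cor(Z, as.numeric(y))))[1:p]) {
  d      <- ncol(X)
  xi_bar <- 0.5                                   # initial sparsity level, xi_0
  new_cols <- function(n_cols, xi_bar) {          # Beta/Bernoulli/Normal columns
    xi_bar <- min(max(xi_bar, 0.01), 0.99)        # keep Beta parameters valid
    xi <- rbeta(n_cols, alpha, alpha * (1 - xi_bar) / xi_bar)
    sapply(xi, function(x) rnorm(d) * rbinom(d, 1, 1 - x))
  }
  P_kept <- new_cols(p, xi_bar)                   # current set of kept columns
  for (l in 1:I) {
    # L declines roughly linearly from about I/10 + 3p/2 down to 3p/2
    L   <- ceiling((I / 10) * (1 - (l - 1) / max(I - 1, 1)) + 3 * p / 2)
    P_l <- cbind(P_kept, new_cols(L - p, xi_bar))
    Z_l <- X %*% P_l
    keep   <- select_p(Z_l, y, p)                 # choose the p best directions
    P_kept <- P_l[, keep, drop = FALSE]
    xi_bar <- mean(P_kept == 0)                   # Equation (3): update sparsity
  }
  P_kept                                          # final d-by-p projection matrix
}

The fixed Lasso search (FLS) variant used for comparison in Section 3 corresponds to holding xi_bar fixed at a chosen value instead of updating it inside the loop.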

2.4 Preliminary reduction

ALS can, in general, be applied to any data set, but in practice, when the data scale is ultra large, a preliminary reduction step is usually added to speed up the algorithm as well as to obtain better accuracy. Fan and Lv (2008) argue that a successful strategy for dealing with ultra-high dimensional data is to apply a series of dimension reductions. In our setting this strategy involves an initial reduction of the dimension to a moderate level, followed by applying ALS to the new lower dimensional data. This two stage approach potentially has two major advantages. First, Fan and Lv (2008) show that prediction accuracy can be considerably improved by removing dimensions that clearly appear to have no relationship to the response; our own simulations also reinforce this notion. Second, stochastic search algorithms such as ALS can require significant computational expense, and reducing the data dimension before applying ALS provides a large increase in efficiency. This preliminary data reduction has to be conducted carefully so as to minimize the information loss while maximizing the reduction of redundancy in the variables. The main challenges are as follows. First, the singular and unstable covariance matrix makes the estimation of parameters inaccurate. Second, high-dimensional data usually contain a large number of unnecessary variables (noise), and this noise can complicate or even mask the class information (Donoho, 2000; Fowlkes et al., 1988). Third, when d ≫ n, even if each element of the population mean vector can be accurately estimated, the aggregated estimation error can be very large. In this article, we consider three methods for reducing the data into m (m > p) dimensions. The first approach is conventional PCA, which has the effect of selecting the best m-dimensional space in terms of minimizing the mean squared deviation from the original d-dimensional space.

The second approach is the sure independence screening (SIS) method based on correlation learning (Fan and Lv, 2008). It computes the componentwise marginal correlations between each predictor and the response vector; one then selects the variables corresponding to the m largest absolute correlations. PCA is an unsupervised variable combination approach, while SIS is a supervised variable selection method. PCA may exclude important information among variables with less variability, while SIS may tend to select a redundant subset of predictors that individually have high correlations with the response but are also highly correlated among themselves. Our third dimension reduction approach, which we call PCA-SIS, attempts to leverage the best of both PCA and SIS by first using PCA to obtain n orthogonal components, then treating these components as the predictors and using SIS to select the best m in terms of correlation with the response. PCA-SIS can be thought of as a supervised version of PCA. We compare these three types of preliminary reduction methods, along with the effect of performing no initial dimension reduction, in our simulation studies. Fan and Lv (2008) argue that the dimension of the intermediate space, m, should be chosen as

    m = 2n / log(n).    (4)

We have found that this approach generally works well, so we have adopted Equation (4) for selecting m in this article.
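A compact sketch of the three preliminary reductions and of the choice of m in Equation (4) is given below; the function names and the use of ceiling rounding are our own choices.

# Preliminary reduction of an n-by-d data matrix X down to m dimensions.
choose_m <- function(n) ceiling(2 * n / log(n))          # Equation (4)

# SIS: keep the m variables with the largest absolute marginal correlation with y.
reduce_sis <- function(X, y, m) {
  X[, order(-abs(cor(X, as.numeric(y))))[1:m], drop = FALSE]
}

# PCA: keep the first m principal component scores.
reduce_pca <- function(X, m) {
  prcomp(X, center = TRUE)$x[, 1:m, drop = FALSE]
}

# PCA-SIS: compute the PCA scores first, then screen them by correlation with y.
reduce_pca_sis <- function(X, y, m) {
  scores <- prcomp(X, center = TRUE)$x                   # up to min(n, d) components
  scores[, order(-abs(cor(scores, as.numeric(y))))[1:m], drop = FALSE]
}

For example, choose_m(150) gives 60 and choose_m(38) gives 21, matching the intermediate dimensions used for the two real data sets in Section 4.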

3 Applications on simulated data

In this section, we give five simulation examples to demonstrate the performance of ALS on classification tasks. In all simulations we show results with ALS applied using the classical Logistic Discriminant Analysis (LD) classifier and using the Support Vector Machine (SVM) classifier. In addition, we also apply LD and SVM to the original full dimensional data (FD) and to low dimensional data produced using a straight Lasso method (Lars), a generalized Lasso method (GLMpath), SIS, and PCA. Lars, GLMpath, SIS, and PCA all utilize Equation (1) but compute P directly using their own methodologies.

3.1 Simulation study I: sparse case

The first simulation was designed to examine the performance of ALS in a sparse model situation, where the response is associated with only a handful of predictors. We simulated 20 training samples, each containing 100 observations and 50 predictor variables generated from a multivariate normal distribution N(0, Σ), where the diagonal elements of Σ are 1s. In order to examine model performance under multicollinearity, we set the off-diagonal elements of Σ equal to 0.5. We then rescaled the first p = 5 columns to have a standard deviation of 10. We call these columns, which have the most variability, the major columns, and the rest the minor columns. The binary response variable was generated using a standard logistic regression model after first reducing the dimension of the space using a fixed P. For each training data set, a corresponding set of 1,000 observations was generated as a test sample. We tested two scenarios by creating two different true projection matrices. The first scenario used only the p major columns to generate the response variables. Hence the true projection matrix was P = perm_1([I_p, 0]^T), where perm_1 is the row permutation that associates the unit row vectors in P with the major columns of the data matrix.
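The following base-R sketch illustrates the scenario 1 data-generating mechanism just described; the identity permutation and the unit logistic coefficients are illustrative assumptions, since the paper does not report the specific permutation or coefficient values used.

# Simulation study I, scenario 1 (sketch): compound-symmetric Gaussian
# predictors, the first 5 "major" columns rescaled to standard deviation 10,
# and a logistic response driven by those major columns.
set.seed(3)
n <- 100; d <- 50; p <- 5; rho <- 0.5
Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
X <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)    # rows distributed as N(0, Sigma)
X[, 1:p] <- 10 * X[, 1:p]                          # major columns: sd 10
P_true <- rbind(diag(p), matrix(0, d - p, p))      # [I_p, 0]^T (identity permutation assumed)
eta <- drop(X %*% P_true %*% rep(1, p))            # unit coefficients assumed
y <- rbinom(n, 1, plogis(eta))                     # binary response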

The second scenario associates the group information with p of the minor columns, which have less variability than the major columns. In this setting the true projection matrix is P = perm_2([I_p, 0]^T), where perm_2 is the row permutation that associates the unit row vectors in P with p of the minor columns. It is easy to see that the first scenario favors PCA, because the first 5 eigenvectors from PCA will tend to concentrate on the major columns from which the response variable is derived. However, in the second scenario PCA should perform poorly, because it will still select the most variable columns even though there is no group information among them. Supervised methods, such as ALS, Lars, and GLMpath, should have an advantage in the latter scenario because they always look for projections with high predictive power. We compared the test MCRs averaged over the 20 test data sets. In order to demonstrate the importance of the sparsity of the projection matrix to the classification performance, we also implemented a second version of ALS in which we fixed the sparsity of the projection matrix, ξ, so that it does not change over iterations. We call this method the fixed Lasso search (FLS) method. Presumably, FLS with a good value of the sparsity will perform better than ALS, because ALS has to adaptively search for ξ; if the initial sparsity is far from the true value, ALS will spend more time finding the true sparsity, which slows down the rate of convergence. Since the true sparsity level for these data was high, we set ξ = 0.9 for the FLS procedure. For the Lars, GLMpath, SIS, PCA, ALS, and FLS methods we reduced the data to p = 5 dimensions, using the corresponding approaches, and then applied LD and SVM. For FD we did not perform any dimension reduction. The results are shown in the first two sections of Table 1. Not surprisingly, LD generally outperformed SVM on these data because the data were generated using the logistic regression model. Of more interest is the relative performance of the seven dimension reduction methods.

As expected, PCA worked fairly well in scenario 1 but poorly in scenario 2. In both scenarios, ALS and FLS performed well. Notice that, since the true model was highly sparse, the pure variable selection methods (Lars and GLMpath) performed well too. All methods except PCA outperformed FD. Figure 1 illustrates the average deviance values, the average test MCRs, and ξ from ALS over 500 iterations for scenarios 1 and 2, respectively. As we can see, the average test MCRs and deviances in both scenarios were decreasing, which means the Lasso chose good directions from P at each iteration, and hence the classification accuracy improved gradually. It is interesting to notice that the sparsity levels for the estimated projection matrices of these two scenarios are different, although the sparsity of the true projection matrices is in fact the same: ξ = 0.98. In scenario 1, ξ decreased to about 0.28, which is far from the true sparsity level. This explains why FLS performed worse than ALS in this scenario. In scenario 2, ξ increased to about 0.8, which is close to the true value, and in this setting FLS worked better than ALS, as we expected. One possible reason for the difference is that in scenario 1, the estimated projection matrix may have many elements close to zero which make little contribution to extracting variables but which still count against its sparsity. Another possible reason is that in scenario 1, since all information is located in the major columns, the noise level is low, and the inclusion of minor columns does not strongly influence the classification accuracy. In scenario 2, on the other hand, since the noise level is high, the search must find the correct sparsity in order to improve accuracy.

Table 1: The average test MCRs and standard errors (in parentheses) on the three intermediate-scale simulated data sets.

                     FD        Lars      GLMpath   SIS       PCA       ALS       FLS
Study I (1)  LD      (0.011)   (0.012)   (0.012)   (0.021)   (0.009)   (0.007)   (0.008)
             SVM     (0.017)   (0.013)   (0.013)   (0.022)   (0.008)   (0.006)   (0.008)
Study I (2)  LD      (0.009)   (0.009)   (0.010)   (0.017)   (0.006)   (0.012)   (0.007)
             SVM     (0.018)   (0.010)   (0.011)   (0.019)   (0.007)   (0.012)   (0.007)
Study II     LD      (0.018)   (0.016)   (0.015)   (0.021)   (0.020)   (0.015)   (0.007)
             SVM     (0.015)   (0.016)   (0.016)   (0.020)   (0.022)   (0.016)   (0.007)
Study III    LD      (0.008)   (0.007)   (0.008)   (0.010)   (0.008)   (0.010)   (0.017)
             SVM     (0.009)   (0.007)   (0.008)   (0.013)   (0.010)   (0.009)   (0.017)

Figure 1: Model performance for simulation study I. The upper panels correspond to scenario 1 and the lower panels to scenario 2. Left panels: the average training deviance path; middle panels: the average test error path for the LD models; right panels: the average ξ path.

3.2 Simulation study II: dense case

In the second simulation, we investigated a relatively dense model. As in the first simulation study, we simulated the design matrix from the same multivariate normal distribution. However, unlike the sparse case, the elements of the true projection matrix were generated from the standard normal distribution. Therefore, by our definition, the sparsity of the projection matrix was close to zero. In all other respects the simulation was identical to the previous setup. Since the true projection matrix was dense, we assigned ξ = 0 to FLS. The results are reported in the middle part of Table 1. ALS and FLS clearly outperformed the other methods, with FLS providing the best results. Among the other methods, FD worked best, suggesting that performing no dimension reduction can be better than a poor dimension reduction. One possible reason for the strong performance of FD is that, since d = 50 is only a moderately large number with d < n, LD and SVM may be able to manage classification on the full dimensional data. One would not expect the performance of Lars, GLMpath, or SIS to be as good, because they are all variable selection methods and in this case all variables play a role. Plots of the training deviance path, the test MCR paths, and the sparsity path (not shown here) are similar to those for the sparse case. The sparsity for this simulation study levels off at about 0.5, which explains the superior performance of FLS.

3.3 Simulation study III: contaminated data

In order to examine the robustness of ALS against outliers, we created a situation where the distributional assumptions were violated. We generated twenty 100-by-50 training data matrices from a multivariate g-and-h distribution (Field and Genton, 2006) with g = h = (0.5, ..., 0.5)^T ∈ R^50, a zero mean vector, and a covariance matrix Σ whose diagonal elements are all 1s and whose off-diagonal elements are 0.3s.

This was a skewed and heavy-tailed distribution, so the data contained many extreme values, or outliers. As in the previous simulation, the true projection matrix was generated using a standard normal distribution. We generated corresponding test data sets with 1,000 observations each. FLS was implemented with ξ = 0.1. The results are listed at the bottom of Table 1. As in the previous simulations, ALS and FLS do better than the other methods on these data, which demonstrates that our method is not very sensitive to outliers. Our understanding of the stability of ALS and FLS is as follows. The design matrix contains many extreme values, but, unlike PCA, the projection directions are randomly generated, and the generation is independent of X. This independence weakens the effect of extreme values in X on the projected data; presumably, Z has fewer extreme values than X. Therefore, the whole search procedure is more stable.
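One way to generate such contaminated data is to apply the g-and-h transform, τ(z) = ((exp(gz) - 1)/g) exp(hz²/2), to each margin of a correlated Gaussian matrix. The sketch below is our own illustration of that construction, not code taken from the paper.

# Simulation study III (sketch): skewed, heavy-tailed predictors obtained by
# applying the g-and-h transform marginally to correlated normal variates.
set.seed(4)
n <- 100; d <- 50; g <- 0.5; h <- 0.5; rho <- 0.3
Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
Z <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)    # correlated N(0, Sigma) variates
g_and_h <- function(z, g, h) (exp(g * z) - 1) / g * exp(h * z^2 / 2)
X <- g_and_h(Z, g, h)                              # contaminated design matrix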

3.4 Simulation study IV: ultra high dimensional data

The previous simulations consider data with moderately high dimensions. In this section, we simulated an ultra high dimensional situation where n = 100, d = 1000, and a preliminary dimension reduction was needed. Among the 1000 variables, m = 50 contained the group information. That is, we first created an m-by-p projection matrix whose elements were i.i.d. standard normal, and then obtained the reduced data. The logistic regression model was used to generate the class labels based on the p = 5 dimensional reduced data. The 50 informative variables were generated from N(0, Σ), where Σ was as defined in the previous simulations. The remaining 950 variables were pure noise and were generated from N(0, Σ*), where the off-diagonal elements of Σ* were 0.2s and the diagonal elements were 1. We used the three aforementioned preliminary reduction methods, SIS, PCA, and PCA-SIS, to reduce the data to m = 50 dimensions, and then performed ALS and FLS on the intermediate data to further reduce it to p = 5 dimensions. This gave six combinations: SIS-ALS, PCA-ALS, PCA-SIS-ALS, SIS-FLS, PCA-FLS, and PCA-SIS-FLS. Presumably, if the preliminary reduction could extract the m informative variables, or linear combinations of those m variables, ALS and FLS with ξ = 0 would be able to perform well. In order to demonstrate that the improvement in classification accuracy was not mainly due to the preliminary reduction but to the ALS method itself, we also implemented Lars and GLMpath on the preliminary reduced data from SIS, PCA, and PCA-SIS. Moreover, we applied LD to the full dimensional data, in addition to using Lars and GLMpath to directly select p = 5 variables from the full 1,000 dimensional data. Twenty training samples were used to build the models, with corresponding test samples (n = 1000) generated to obtain the test MCR for each model. The average test MCRs, using the LD classifier, are shown in Table 2 along with standard errors. As we can see, SIS-ALS and SIS-FLS provided the best results; however, all six implementations of ALS and FLS were statistically indistinguishable from each other. All other methods were clearly inferior, including Lars and GLMpath even after using the same preliminary dimension reduction. This indicates that the improvement in classification accuracy by ALS was not only caused by the preliminary reduction but more by the method itself.
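For completeness, the data-generating scheme described at the start of this subsection can be sketched in base R as follows; placing the informative variables in the first 50 columns and using unit logistic coefficients are our own illustrative choices.

# Simulation study IV (sketch): 50 informative variables plus 950 pure-noise
# variables, with the class label driven by a dense 50-by-5 projection.
set.seed(5)
n <- 100; d <- 1000; m <- 50; p <- 5
Sig_info  <- matrix(0.5, m, m);         diag(Sig_info)  <- 1
Sig_noise <- matrix(0.2, d - m, d - m); diag(Sig_noise) <- 1
X_info  <- matrix(rnorm(n * m), n, m) %*% chol(Sig_info)
X_noise <- matrix(rnorm(n * (d - m)), n, d - m) %*% chol(Sig_noise)
X <- cbind(X_info, X_noise)                      # informative columns first (assumed)
P_true <- matrix(rnorm(m * p), m, p)             # dense m-by-p true projection
eta <- drop(X_info %*% P_true %*% rep(1, p))     # unit coefficients assumed
y <- rbinom(n, 1, plogis(eta))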

Numerous classical and modern methods have been developed for classification problems. We cannot hope to test them all here, but we opted to examine a representative sampling of these methods. In particular, we tested support vector machines (SVM), k-nearest neighbors (kNN), neural networks (NN), and random forests (RF). We tested each method after first applying one of the three different dimension reduction methods: SIS, PCA, and PCA-SIS. The results are shown in Table 3. As can be seen, ALS and FLS performed extremely well relative to these other approaches.

Table 2: The average test errors (LD classifier) for each combination of a preliminary reduction (none, SIS, PCA, or PCA-SIS; d -> m with m = 50) and a further reduction (none, Lars, GLMpath, SIS, PCA, ALS, or FLS; m -> p with p = 5).

Table 3: The average test errors on the preliminary reduced data by different classification methods (standard errors in parentheses).

           SVM        kNN             NN              RF
SIS        (0.024)    0.330 (0.023)   0.321 (0.027)   0.307 (0.023)
PCA        (0.018)    0.374 (0.023)   0.298 (0.019)   0.330 (0.023)
PCA-SIS    (0.024)    0.360 (0.021)   0.297 (0.018)   0.337 (0.024)

4 Applications on real data

In this section, we apply ALS to two data sets from real studies: an fMRI data set and a gene microarray data set.

4.1 fMRI brain imaging data

The fMRI data were obtained from the imaging center of the University of Southern California and consisted of observations from an individual subject performing a visual task. After preprocessing, the data contained 11,838 variables and 200 observations, representing the number of voxels and the number of recording time points, respectively. The goal was to separate the task time points (96) from the baseline time points (104). We first divided the data set into training (150) and test (50) samples. We chose p = 10, 20, 30, 40, 50, 60, values which were considered to give a good balance between execution time and classification accuracy. We first reduced the 11,838-dimensional data space to an m-dimensional space with m = 60 chosen via (4). This preliminary dimension reduction had the effect of both increasing classification accuracy and reducing computational cost. We used PCA, SIS, and PCA-SIS to perform the preliminary dimension reduction and then applied ALS, Lars, and GLMpath, for a total of nine different implementations. As a comparison, we also used Lars and GLMpath directly on the full dimensional data to produce the p-dimensional subspace. The full dimensional case was not examined here because the fitting procedure crashed when this ultra high dimensional data set was imported. Table 4 reports the average test MCRs over 20 runs using SVM as the base classifier. The ALS based methods clearly outperformed the other approaches. In particular, the classification accuracy of the three ALS based methods was ordered PCA-SIS-ALS > PCA-ALS > SIS-ALS. As shown in Figure 2, the average training deviance and the average test MCR decreased and then leveled off, indicating that the search converged to the optimal subspace. The average ξ for PCA-SIS-ALS increased to about 0.75 and then leveled off, which means ALS chose a relatively sparse matrix as the optimal projection matrix for the fMRI data.

Similar trends in ξ were observed for SIS-ALS and PCA-ALS. Table 5 shows results from the SVM, kNN, NN, and RF classification methods on the m = 60 dimensional reduced data. All four methods perform worse than ALS.

Table 4: The test errors on the fMRI data using SVM models, for p = 10, 20, 30, 40, 50, 60. The methods compared are Lars, GLMpath, SIS, PCA, SIS-ALS, PCA-ALS, PCA-SIS-ALS, SIS-Lars, SIS-GLMpath, PCA-Lars, PCA-GLMpath, PCA-SIS-Lars, and PCA-SIS-GLMpath.

4.2 Leukemia cancer gene microarray data

The second real world data set was the leukemia cancer data from a gene microarray study by Golub et al. (1999). This data set contained 72 tissue samples, each with 7,129 gene expression measurements and a label of cancer type. Among the 72 tissues, 25 were samples of acute myeloid leukemia (AML) and 47 were samples of acute lymphoblastic leukemia (ALL). Within the 38 training samples, 27 were ALL and 11 AML, and within the 34 test samples, 20 were ALL and 14 AML.

Table 5: The test errors on the preliminary reduced data (m = 60) by different classification methods. Rows: SIS, PCA, PCA-SIS; columns: SVM, kNN, NN, RF.

Figure 2: PCA-SIS-ALS on the fMRI data (p = 30). Left panel: the average training deviance path; middle panel: the average test error path; right panel: the average ξ path.

In some previous studies (Fan and Fan, 2008; Fan and Lv, 2008), 16 genes were ultimately selected, so in this study we also ultimately select p = 16 genes. The intermediate dimension was chosen to be m = 21 by Equation (4). An LD model was applied for classification. As in our simulation studies and the fMRI study, PCA-SIS-ALS provided the lowest MCR (see Table 6), and it was better than the nearest shrunken centroids method of Tibshirani et al. (2002), which obtained an MCR of 2/34 = 0.059 based on 21 selected genes (Fan and Lv, 2008). Notice that PCA-Lars and PCA-SIS-Lars also performed well on these data. Figure 3 shows the resulting graphs. As in the previous examples, the training deviance and the test MCRs decrease and then level off, and the sparsity of the projection matrix also levels off.

Table 6: The test MCRs on the leukemia cancer data for each combination of a preliminary reduction (none, SIS, PCA, or PCA-SIS; d -> m with m = 21) and a further reduction (Lars, GLMpath, SIS, PCA, or ALS; m -> p with p = 16).

5 Evaluating the computational expense of ALS

In this section we provide a brief review of the computational cost of ALS in comparison to other stochastic search procedures. We compared ALS on the fMRI data to the method of Tian et al. (2008) in terms of both classification accuracy and computational time. The approach of Tian et al. (2008) provides a useful comparison because it uses a similar model to our own but implements the search process using a simulated annealing approach.

Figure 3: PCA-SIS-ALS on the leukemia cancer data. Left panel: the average training deviance path; middle panel: the average test error path; right panel: the average ξ path.

The code was written in R and was run on a Dell Precision workstation (CPU 3.00 GHz; RAM 16.0 GB). We recorded the total CPU time (in seconds) of the R program and compared it with the time needed to run the second version of the simulated annealing method (SA2) of Tian et al. (2008) using an SVM model. PCA was applied first to reduce the data dimension to m = 80, and then both methods searched for a p = 30 dimensional subspace. Table 7 reports the time consumed and the classification accuracy on the test data when both methods use 500 iterations. Both ALS and SA2 took a similar time to run. The key difference was that ALS converged well before 500 iterations while SA2 did not; hence, the error rate for SA2 was much higher. SA2 took approximately 5000 iterations to converge, which resulted in an order of magnitude more computational effort. In addition, even when 5000 iterations were used for SA2, its average test errors were still higher than for the ALS method.
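Timings of this kind can be collected in R with system.time(); a minimal example follows, in which the matrix multiplications are only a dummy computation standing in for one run of a search procedure.

# Record the total elapsed time (in seconds) of one run of a procedure.
timing <- system.time({
  X <- matrix(rnorm(200 * 80), 200, 80)                        # dummy data
  for (i in 1:500) Z <- X %*% matrix(rnorm(80 * 30), 80, 30)   # dummy 500-iteration loop
})
timing["elapsed"]                                              # total elapsed wall-clock time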

Table 7: The execution time for each run and the average test errors on the fMRI data, for ALS (500 iterations), SA2 (500 iterations), and SA2 (5000 iterations).

6 Concluding remarks

ALS was designed to implement a supervised learning classification method with the flexibility to mimic either a variable selection or a variable combination method. It does this by adaptively adjusting the sparsity of the projection matrix used to lower the dimensionality of the original data space. We use a stochastic search procedure to address the very high dimensional predictor space. Our simulation results suggest that this approach can provide extremely competitive results relative to a large range of classical and modern classification techniques. We also examined three different preliminary dimension reduction methods, which appeared both to increase prediction accuracy and to improve computational efficiency. ALS seems to converge quickly relative to other stochastic approaches, which makes it feasible to use on large data sets. The ALS dimensionality reduction methodology could also be generalized to the regression context, where the response is a continuous variable. Further studies are planned for this setting.

References

Breiman, L. (2001), Random forests, Machine Learning, 45.

Candes, E. and Tao, T. (2007), The Dantzig selector: statistical estimation when p is much larger than n (with discussion), The Annals of Statistics, 35.

Donoho, D. L. (2000), High-dimensional data analysis: the curses and blessings of dimensionality, aide-memoire of a lecture at the AMS conference "Math Challenges of the 21st Century". Available at donoho/lectures.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least angle regression, The Annals of Statistics, 32.

Fan, J. and Fan, Y. (2008), High dimensional classification using features annealed independence rules, The Annals of Statistics, to be published.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Fan, J. and Lv, J. (2008), Sure independence screening for ultra-high dimensional feature space, Journal of the Royal Statistical Society: Series B, 70.

Field, C. and Genton, M. G. (2006), The multivariate g-and-h distribution, Technometrics, 48.

Fogel, D. B. (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2nd ed.), IEEE Press, Piscataway, NJ.

Fowlkes, E. B., Gnanadesikan, R., and Kettenring, J. R. (1988), Variable selection in clustering, Journal of Classification, 5.

Fu, M. C. (2002), Optimization for simulation: theory vs. practice (with discussion by S. Andradóttir, P. Glynn, and J. P. Kelly), INFORMS Journal on Computing, 14.

Fu, W. and Knight, K. (2000), Asymptotics for Lasso-type estimators, The Annals of Statistics, 28.

George, E. I. and McCulloch, R. E. (1993), Variable selection via Gibbs sampling, Journal of the American Statistical Association, 88.

George, E. I. and McCulloch, R. E. (1995), Stochastic search variable selection, in Markov Chain Monte Carlo in Practice, eds. W. R. Gilks, S. Richardson and D. J. Spiegelhalter, London: Chapman and Hall.

George, E. I. and McCulloch, R. E. (1997), Approaches for Bayesian variable selection, Statistica Sinica, 7.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C. and Lander, E. (1999), Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286.

Gosavi, A. (2003), Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer, Boston.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag.

Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24.

Liberti, L. and Kucherenko, S. (2005), Comparison of deterministic and stochastic approaches to global optimization, International Transactions in Operational Research, 12.

Mitchell, T. J. and Beauchamp, J. J. (1988), Bayesian variable selection in linear regression (with discussion), Journal of the American Statistical Association, 83.

Park, M.-Y. and Hastie, T. (2007), An L1 regularization-path algorithm for generalized linear models, Journal of the Royal Statistical Society: Series B, 69.

Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2.

Tian, T., Wilcox, R., and James, G. (2008), Data reduction when dealing with classification based on high-dimensional data, under review.

Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B, 58.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proceedings of the National Academy of Sciences of the United States of America, 99.

Torgerson, W. S. (1958), Theory and Methods of Scaling, New York: Wiley.

Vapnik, V. (1998), Statistical Learning Theory, Wiley.

West, M., Blanchette, C., Fressman, H., Huang, E., Ishida, S., Spang, R., Zuan, H., Marks, J. R. and Nevins, J. R. (2001), Predicting the clinical status of human breast cancer using gene expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 98.

Wold, H. (1966), Estimation of principal components and related models by iterative least squares, in Multivariate Analysis, Academic Press, New York.

Young, G. and Householder, A. S. (1938), Discussion of a set of points in terms of their mutual distances, Psychometrika, 3.

Zhao, P. and Yu, B. (2006), On model selection consistency of Lasso, Journal of Machine Learning Research, 7.

Zou, H. and Hastie, T. (2005), Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B, 67.


More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information

Yelp Recommendation System

Yelp Recommendation System Yelp Recommendation System Jason Ting, Swaroop Indra Ramaswamy Institute for Computational and Mathematical Engineering Abstract We apply principles and techniques of recommendation systems to develop

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu Breaking it Down: The World as Legos Benjamin Savage, Eric Chu To devise a general formalization for identifying objects via image processing, we suggest a two-pronged approach of identifying principal

More information

Robust Face Recognition via Sparse Representation

Robust Face Recognition via Sparse Representation Robust Face Recognition via Sparse Representation Panqu Wang Department of Electrical and Computer Engineering University of California, San Diego La Jolla, CA 92092 pawang@ucsd.edu Can Xu Department of

More information

The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R

The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R Journal of Machine Learning Research 6 (205) 553-557 Submitted /2; Revised 3/4; Published 3/5 The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R Xingguo Li Department

More information

Last time... Coryn Bailer-Jones. check and if appropriate remove outliers, errors etc. linear regression

Last time... Coryn Bailer-Jones. check and if appropriate remove outliers, errors etc. linear regression Machine learning, pattern recognition and statistical data modelling Lecture 3. Linear Methods (part 1) Coryn Bailer-Jones Last time... curse of dimensionality local methods quickly become nonlocal as

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens Faculty of Economics and Applied Economics Trimmed bagging a Christophe Croux, Kristel Joossens and Aurélie Lemmens DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) KBI 0721 Trimmed Bagging

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

An efficient algorithm for rank-1 sparse PCA

An efficient algorithm for rank-1 sparse PCA An efficient algorithm for rank- sparse PCA Yunlong He Georgia Institute of Technology School of Mathematics heyunlong@gatech.edu Renato Monteiro Georgia Institute of Technology School of Industrial &

More information

The supclust Package

The supclust Package The supclust Package May 18, 2005 Title Supervised Clustering of Genes Version 1.0-5 Date 2005-05-18 Methodology for Supervised Grouping of Predictor Variables Author Marcel Dettling and Martin Maechler

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Learning from High Dimensional fmri Data using Random Projections

Learning from High Dimensional fmri Data using Random Projections Learning from High Dimensional fmri Data using Random Projections Author: Madhu Advani December 16, 011 Introduction The term the Curse of Dimensionality refers to the difficulty of organizing and applying

More information

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400 Statistics (STAT) 1 Statistics (STAT) STAT 1200: Introductory Statistical Reasoning Statistical concepts for critically evaluation quantitative information. Descriptive statistics, probability, estimation,

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

Penalizied Logistic Regression for Classification

Penalizied Logistic Regression for Classification Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different

More information

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

scikit-learn (Machine Learning in Python)

scikit-learn (Machine Learning in Python) scikit-learn (Machine Learning in Python) (PB13007115) 2016-07-12 (PB13007115) scikit-learn (Machine Learning in Python) 2016-07-12 1 / 29 Outline 1 Introduction 2 scikit-learn examples 3 Captcha recognize

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

Lecture 06 Decision Trees I

Lecture 06 Decision Trees I Lecture 06 Decision Trees I 08 February 2016 Taylor B. Arnold Yale Statistics STAT 365/665 1/33 Problem Set #2 Posted Due February 19th Piazza site https://piazza.com/ 2/33 Last time we starting fitting

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection Volume-8, Issue-1 February 2018 International Journal of Engineering and Management Research Page Number: 194-200 The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers

More information

Voxel selection algorithms for fmri

Voxel selection algorithms for fmri Voxel selection algorithms for fmri Henryk Blasinski December 14, 2012 1 Introduction Functional Magnetic Resonance Imaging (fmri) is a technique to measure and image the Blood- Oxygen Level Dependent

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification CAMCOS Report Day December 9 th, 2015 San Jose State University Project Theme: Classification On Classification: An Empirical Study of Existing Algorithms based on two Kaggle Competitions Team 1 Team 2

More information

Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008.

Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008. Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008. Introduction There is much that is unknown regarding

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information