An Adaptive Lasso Method for Dimensionality Reduction in Classification


An Adaptive Lasso Method for Dimensionality Reduction in Classification

Tian Tian, Gareth James, and Rand Wilcox

Abstract

Producing classification methods for high-dimensional data is becoming an increasingly important problem. In this article, we propose an adaptive Lasso search (ALS) approach which first reduces the dimension of the data space and then applies a standard classification method to the reduced space. One key advantage of ALS is that it automatically adjusts to mimic either variable selection methods, such as the Lasso, or variable combination methods, such as PCA, or something between the two approaches. The adaptivity of ALS allows it to perform well in situations where pure variable selection or pure variable combination methods fail. ALS uses the Lasso in a stochastic search algorithm to select a handful of optimal projection directions from a large number of random directions at each iteration. We demonstrate the strengths of ALS on an extensive range of simulation studies and real world data sets and compare it to many classical and modern classification methods.

Tian Tian is a Ph.D. student, Department of Psychology, University of Southern California, Los Angeles, CA (ttian@usc.edu). Gareth James is Associate Professor, Department of Information and Operations Management, University of Southern California, Los Angeles, CA (gareth@usc.edu). Rand Wilcox is Professor, Department of Psychology, University of Southern California, Los Angeles, CA (rwilcox@usc.edu).

Keywords: Lasso; Dimensionality reduction; Variable selection; Variable combination; Classification

1 Introduction

An increasingly important topic is the classification of observations into two or more predefined groups when the number of predictors, d, is larger than the number of observations, n. Many conventional classification methods, such as Fisher's discriminant rule, are not applicable if n < d. Other methods, such as classification trees, k-nearest neighbors, and logistic discriminant analysis, do not explicitly require n > d, but in practice they provide poor classification accuracy when d is ultra large. One major reason is that when n < d, the covariance matrix for each class is singular and hence unstable. A common solution is to first reduce the data into a lower, p ≪ d, dimensional space and then perform classification on the transformed data. Fan and Lv (2008) provide significant theoretical and empirical evidence for the power of such an approach. Generally, the dimension reduction is performed using a linear transformation of the form

    Z_{n×p} = X_{n×d} P_{d×p},    (1)

where X is an n-by-d data matrix and P is a d-by-p projection matrix which projects X onto a p-dimensional subspace Z (p < d). However, there are many possible approaches to choosing P. We divide the methods into supervised versus unsupervised and into variable selection versus variable combination. Variable selection techniques select a subset of relevant variables that have good predictive power, and so benefit from obtaining a small set of informative variables out of a larger set of more complex variables.
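As a simple illustration of the transformation in Equation (1), the following base-R sketch projects an n-by-d data matrix onto a p-dimensional subspace. The dimensions and the random projection used here are arbitrary choices for illustration, not a construction taken from the paper.

# Illustrative only: reduce an n-by-d data matrix to p dimensions via Z = X P.
set.seed(1)
n <- 100; d <- 50; p <- 5
X <- matrix(rnorm(n * d), n, d)   # n-by-d data matrix
P <- matrix(rnorm(d * p), d, p)   # d-by-p projection matrix (dense in this example)
Z <- X %*% P                      # n-by-p reduced data, as in Equation (1)
dim(Z)                            # 100 x 5

A classifier would then be trained on Z rather than on X.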

In this setting, P is some row permutation of the identity matrix stacked on a zero matrix, i.e. P = perm([I_p, 0_{(d-p)×p}]^T). Hence, P is considered sparse because most of its components are zeros. Many variable selection methods have been proposed and widely used in numerous areas. A great deal of attention has been paid to the L_1-penalized least squares estimator, i.e., the Lasso (Efron et al., 2004; Fu and Knight, 2000; Tibshirani, 1996; Zhao and Yu, 2006). Other methods include nearest shrunken centroids (Tibshirani et al., 2002), the Dantzig selector (Candes and Tao, 2007), SCAD (Fan and Li, 2001), the Elastic Net (Zou and Hastie, 2005), and Bayesian methods of variable selection (George and McCulloch, 1993, 1995, 1997; Mitchell and Beauchamp, 1988). These methods all use supervised learning, where the response and predictors are both utilized to obtain the subset of variables. In addition, because they always select a subset of the original variables, they provide highly interpretable results. However, because of the sparsity of P, for a given p, variable selection methods are less efficient at compressing the observed data, X. For example, they may discard potentially valuable variables which are not predictive individually but provide significant improvement in conjunction with others. In comparison, variable combination methods utilize a dense P which combines correlated variables and hence deals well with the multicollinearity which often occurs in high-dimensional data. Probably the most commonly applied method in this category is principal component analysis (PCA) (Hotelling, 1933; Pearson, 1901). Using this approach, P consists of the first p eigenvectors of the sample covariance matrix of X, and Z contains the associated PCA scores. PCA can deal with an ultra large data scale and produces the most efficient p-dimensional representation of X. However, PCA is an unsupervised learning technique, so it does not make use of the response variable to construct P.
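To make the two extremes concrete, the short sketch below (our own illustration in base R, not code from the paper) constructs a variable-selection projection, i.e. a row permutation of [I_p, 0]^T, and a PCA projection whose columns are the leading principal component loadings.

# Variable selection: P is a row permutation of [I_p, 0]^T, so Z = X %*% P
# simply extracts p of the original columns of X.
set.seed(2)
n <- 100; d <- 50; p <- 5
X <- matrix(rnorm(n * d), n, d)
selected <- sample(d, p)                 # indices of the selected variables
P_select <- matrix(0, d, p)
P_select[cbind(selected, 1:p)] <- 1      # one 1 per column; all other entries zero
Z_select <- X %*% P_select               # identical to X[, selected]

# Variable combination: P holds the first p principal component loadings (dense).
pca <- prcomp(X, center = TRUE)
P_pca <- pca$rotation[, 1:p]
Z_pca <- scale(X, center = TRUE, scale = FALSE) %*% P_pca   # the first p PCA scores

The sparsity of P_select is (dp - p)/(dp), close to one, while P_pca has essentially no zero entries.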

It has been well shown that the dimensions that best explain X will not necessarily be the best dimensions for predicting the response Y. Other variable combination methods include partial least squares regression (Wold, 1966) and multidimensional scaling (Torgerson, 1958; Young and Householder, 1938). All of these approaches yield linear combinations of variables, which makes interpretation more difficult. In this article we propose a new supervised learning method that is capable of automatically adapting the sparsity of P to generate optimal prediction accuracy. By sparsity here we refer to the fraction of zero elements of P. Our method uses the Lasso, so we call it an adaptive Lasso search (ALS) procedure. In situations where a subset of the original variables provides a good fit, ALS will utilize a sparse P matrix, while in situations where linear combinations of the variables work better, ALS will produce a denser matrix. ALS has the same advantage as other supervised methods, in that P can be designed specifically to provide the best level of prediction accuracy. However, it has the flexibility to adapt the sparsity of P so as to gain the benefits of both the variable selection and variable combination methods. ALS uses a stochastic search procedure. It starts by generating a large set of prospective columns of P with sparsity ξ_0. The Lasso is then used to select a candidate subset of good columns, or directions. Then a new set of columns with a new sparsity level ξ_1 is produced as candidates, and the Lasso must choose among the current set of good columns and the new candidates. Over time the same best columns are picked at each iteration and the process converges. At each step in the algorithm the sparsity level of the new prospective columns is adjusted according to the sparsity level of the previously chosen columns. We show through extensive simulations and real world examples that ALS is highly robust in that it generally provides comparable performance to variable selection and variable combination approaches in situations that favor each of these methods.

However, ALS can still perform well in situations where one or the other of these approaches fails. This article is organized as follows. In Section 2, we present and motivate the basic ALS methodology. We also discuss methods for implementing a preliminary data reduction before applying ALS. In Section 3 we provide an extensive simulation comparison of ALS with several other modern classification techniques such as k-nearest neighbors, support vector machines (Vapnik, 1998), random forests (Breiman, 2001), and neural networks. In Section 4 we demonstrate the performance of ALS on two real world data sets. We briefly investigate the computational cost of implementing ALS in Section 5. Section 6 concludes the article with a summary and a discussion of future areas of investigation.

2 Projection selection with ALS

2.1 General ideas and motivations

Our general approach involves choosing a classification procedure and a matrix P. We then use Equation (1) to project the data into a p-dimensional space and apply the classification procedure to the lower dimensional data. While this approach requires choosing both the classifier and the projection matrix, we concentrate only on the latter problem. There are many classification techniques that have been demonstrated to perform well on low to medium dimensional data; the more difficult question involves the best way to produce the lower dimensional data. Hence, we assume that one of these classification methods has been chosen and concentrate our attention on the choice of P.

Given a classification method, the choice of the best subspace for classification can be formulated as the following optimization problem:

    P* = arg min_P E_{X,Y} [ e(M_P(X), Y) ],    (2)

where M_P is the classification method applied to the lower dimensional data, X and Y are the predictors and response variables, and e is a loss function resulting from using M_P to predict Y. A common choice for e is the 0-1 loss function, which results in minimizing the misclassification rate (MCR). Since P is very high dimensional, solving (2) is in general a difficult problem. There are several possible approaches. One option is to assume that P is a sparse 0-1 matrix and only attempt to estimate the locations of its non-zero elements. This is the approach taken by variable selection methods. Another option is to assume P is dense but, instead of choosing P to optimize (2), choose a matrix which provides a good representation of X. This is the approach taken by PCA. One then hopes that the PCA solution will be close to that of (2). Instead of making restrictive assumptions about P, as with the variable selection approach, or failing to directly optimize (2), as with the PCA approach, we attempt to directly fit (2) without placing restrictions on P. In this type of high dimensional nonlinear optimization problem, stochastic search methods, such as genetic algorithms, simulated annealing, and evolutionary strategies, have been shown to provide superior results over more traditional deterministic procedures, because they are often able to search large parameter spaces more effectively, can be used for any class of objective function, and yield an asymptotic guarantee of convergence (Fogel, 2000; Fu, 2002; Gosavi, 2003; Liberti and Kucherenko, 2005).
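In practice, the expected loss in (2) must be estimated from data. A minimal base-R sketch of doing so for a candidate P, using a training/test split and logistic regression purely as an example of M_P (the function name and the 0.5 threshold are our own choices), is:

# Estimate the misclassification rate (0-1 loss) obtained by projecting the data
# with a candidate P and then applying a logistic regression classifier.
# y_train and y_test are 0/1 class labels.
mcr_for_P <- function(P, X_train, y_train, X_test, y_test) {
  Z_train <- X_train %*% P
  Z_test  <- X_test  %*% P
  fit   <- glm(y ~ ., data = data.frame(y = y_train, Z_train), family = binomial)
  prob  <- predict(fit, newdata = data.frame(Z_test), type = "response")
  y_hat <- as.numeric(prob > 0.5)
  mean(y_hat != y_test)          # estimated MCR for this choice of P
}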

Tian et al. (2008) used a stochastic simulated annealing algorithm in a similar situation to ours. They found that the approach was effective at producing accurate classifications. However, it required a large number of iterations to effectively search the parameter space and hence was computationally expensive. We explore an alternative stochastic process which is often even more effective at searching the parameter space and generally requires far fewer iterations.

2.2 The ALS method

The ALS procedure works by successively generating a large number, L, of potential random directions, i.e. columns. We then use Equation (1) to compute the corresponding L-dimensional data space, z_1, z_2, ..., z_L, and use a variable selection method to select a good subset of these directions to form an initial estimate for P. The sparsity level of this P is examined, and a new random set of potential columns is generated with the same average sparsity as the current P. The procedure iterates in this fashion for a fixed number of steps. Formally, the ALS procedure consists of the following steps:

Step 1. Randomly generate the current d-by-L projection matrix P_0 (p < L < d) with an expected sparsity of ξ_0.

Step 2. At the lth iteration, use (1) and P_l to obtain a preliminary reduced data space Z_l. Evaluate each variable of Z_l and select the p most important variables.

Step 3. Keep the corresponding p columns of P_l and calculate the average sparsity, ξ_l, of these columns.

Step 4. Generate L - p new columns with an average sparsity of ξ_l. Join these columns with the p columns selected in Step 3 to form a new projection matrix P_{l+1}.

Step 5. Return to Step 2 for a fixed number of iterations.

The two crucial parts of this approach are the mechanism for generating the potential columns and the procedure for choosing the best p columns. We discuss the column generation procedure in Section 2.3. As far as the choice of variables is concerned, we considered several possible variable selection methods. In the context of classification, a natural approach is to use a GLM version of the Lasso within a logistic regression framework. We examined this approach using the GLMpath methodology of Park and Hastie (2007). Interestingly, we found that simply using the standard Lasso procedure to select the variables gave similar levels of accuracy to GLMpath and required considerably less computational effort. Therefore, in our implementation of ALS we use the first p variables selected by the Lars algorithm (Tibshirani, 1996). However, in practice one could implement our methodology with any standard variable selection method, such as SCAD or the Dantzig selector.

2.3 Generating projection directions

At each iteration of the ALS algorithm we generate new potential columns with the same average sparsity level as the columns selected in the previous step. This allows the sparsity level to automatically adjust to the data set. The idea is that a data set requiring high levels of sparsity, i.e. one where a variable selection method will perform well, will tend to result in sparser columns being selected, and the current P_l will become sparser at each iteration. Alternatively, a data set that requires a denser P_l will select less sparse columns at each iteration. We define the sparsity of the current matrix P_l as

    ξ_l = (1/(dp)) Σ_{i=1}^{d} Σ_{j=1}^{p} I( p_{i,j}^{(l)} = 0 ),    (3)

where p_{i,j}^{(l)} is the (i, j)th element of P_l.
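For reference, the two ingredients just described, measuring the sparsity of a set of retained columns (Equation (3)) and extracting the first p variables to enter a Lasso path, might be sketched as follows. The Lasso step assumes the glmnet package as a convenient stand-in; the paper itself uses the LARS implementation, and any variable selection method could be substituted.

# Equation (3): the fraction of exactly-zero entries in a projection matrix.
sparsity <- function(P) mean(P == 0)

# Stand-in for "the first p variables selected by the Lasso": fit a Lasso path
# with glmnet and record the order in which variables first become non-zero.
library(glmnet)
first_p_lasso <- function(Z, y, p) {
  fit   <- glmnet(Z, y)                 # Lasso path (gaussian working model)
  beta  <- as.matrix(fit$beta)          # ncol(Z) x (number of lambda values)
  entry <- apply(beta != 0, 1, function(nz) if (any(nz)) which(nz)[1] else Inf)
  order(entry)[1:p]                     # indices of the first p variables to enter
}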

We then generate L - p new columns such that the expected fraction of zero elements over all columns is ξ_l. While the overall average sparsity is restricted to equal ξ_l, we desire some variability in the sparsity levels, so that ALS is able to select columns with higher or lower sparsity and hence adjust ξ_l for the next iteration. To achieve this goal we allow for different average sparsity levels between columns. Let ξ_j^{(l)} represent the average sparsity of the jth column in iteration l. Then in the (l+1)th iteration we generate ξ_j^{(l+1)} from a Beta(α, α(1 - ξ_l)/ξ_l) distribution. Note that E(ξ_j^{(l+1)}) = ξ_l for all values of α. We found that α = 10 produced a reasonable amount of variance in the sparsity levels. To produce the L - p columns we then simulate u_{i,j} ~ N(0, 1) and v_{i,j} ~ B(1 - ξ_j^{(l+1)}), for i = 1, ..., d and j = p+1, ..., L, and set p_{i,j}^{(l+1)} = u_{i,j} v_{i,j}, where p_{i,j}^{(l+1)} represents the (i, j)th element of P_{l+1}. Here, B(π) is the Bernoulli distribution with probability of 1 equal to π. We then combine these L - p columns with the selected p columns from iteration l to form the new intermediate projection matrix P_{l+1}. We select ξ_0 = 0.5 for the initial sparsity level, which seems to provide a reasonable compromise between the variable selection and variable combination paradigms. The full ALS algorithm is explicitly described as follows:

1. Set ξ_0 = 0.5 and generate an initial P_0.

2. Iterate until l = I:

(a) Generate L - p new projection directions P_new with (i, j)th element p_{i,j} = u_{i,j} v_{i,j}, where u_{i,j} ~ N(0, 1), v_{i,j} ~ B(1 - ξ_j), and ξ_j ~ Beta(α, α(1 - ξ_{l-1})/ξ_{l-1}).

(b) Let P_l = (P_{l-1}, P_new) and use (1) to obtain Z_l.

(c) Apply the Lasso to select p variables from Z_l.

(d) Keep the corresponding p columns of P_l to obtain the updated P_l.

(e) Calculate ξ_l for P_l using (3).

(f) Go to (a).

To implement ALS, one must choose values for the number of iterations, I, the number of random columns to generate, L, and the final number of columns chosen, p. In our experiments we used I = 500 iterations. We found this value guaranteed convergence, and often fewer iterations were required. In general, tracking the deviance of the Lasso at each iteration provided a reliable measure of convergence. The value of L influences the convergence speed of the algorithm as well as the execution time. We found that the best results were obtained by choosing L to be a relatively large value for the early iterations and then letting it decline. In particular, we set L = I/10 + 3p/2 for the first iteration and have it decline to 3p/2 by the final iteration. This approach is similar to simulated annealing, where the temperature is lowered over time. It guarantees that the Lasso has more variables to choose among at the early stages; as the search moves towards the optimum, decreasing L improves the reliability of the selected p variables in addition to speeding up the algorithm. The choice of p can obviously have an important impact on the results. In general, p should be chosen to balance classification accuracy against computational expense. One reasonable approach is to use an objective criterion, such as cross-validation, to choose p. In other situations, prior knowledge can be applied to select p. For example, in the fMRI study that we examine in Section 4, prior knowledge and assumptions about the regions of interest can be used to decide on a reasonable value for p.
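Putting these pieces together, the following base-R sketch outlines one possible implementation of the search loop, including the Beta/Bernoulli column generation and a linearly declining schedule for L. It is a simplified skeleton under our own naming conventions, not the authors' code: the variable selection step is passed in as a function (select_p), and the simple correlation ranking supplied as its default is only a placeholder for the Lasso step described above.

# Skeleton of the ALS search.  select_p(Z, y, p) must return the indices of the
# p columns of Z judged most useful (e.g. the first p variables to enter a
# Lasso path); the default below ranks columns by marginal correlation only.
als_search <- function(X, y, p, I = 500, alpha = 10,
                       select_p = function(Z, y, p)
                         order(-abs(cor(Z, as.numeric(y))))[1:p]) {
  d      <- ncol(X)
  xi_bar <- 0.5                                   # initial sparsity level, xi_0
  new_cols <- function(n_cols, xi_bar) {          # Beta/Bernoulli/Normal columns
    xi_bar <- min(max(xi_bar, 0.01), 0.99)        # keep Beta parameters valid
    xi <- rbeta(n_cols, alpha, alpha * (1 - xi_bar) / xi_bar)
    sapply(xi, function(x) rnorm(d) * rbinom(d, 1, 1 - x))
  }
  P_kept <- new_cols(p, xi_bar)                   # current set of kept columns
  for (l in 1:I) {
    # L declines roughly linearly from about I/10 + 3p/2 down to 3p/2
    L   <- ceiling((I / 10) * (1 - (l - 1) / max(I - 1, 1)) + 3 * p / 2)
    P_l <- cbind(P_kept, new_cols(L - p, xi_bar))
    Z_l <- X %*% P_l
    keep   <- select_p(Z_l, y, p)                 # choose the p best directions
    P_kept <- P_l[, keep, drop = FALSE]
    xi_bar <- mean(P_kept == 0)                   # Equation (3): update sparsity
  }
  P_kept                                          # final d-by-p projection matrix
}

The fixed Lasso search (FLS) variant used for comparison in Section 3 corresponds to holding xi_bar fixed at a chosen value instead of updating it inside the loop.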

2.4 Preliminary reduction

ALS can, in general, be applied to any data set, but in practice, when the data scale is ultra large, a preliminary reduction step is usually added to speed up the algorithm as well as to obtain better accuracy. Fan and Lv (2008) argue that a successful strategy for dealing with ultra-high dimensional data is to apply a series of dimension reductions. In our setting this strategy involves an initial reduction of the dimension to a moderate level, followed by applying ALS to the new lower dimensional data. This two stage approach potentially has two major advantages. First, Fan and Lv (2008) show that prediction accuracy can be considerably improved by removing dimensions that clearly appear to have no relationship to the response; our own simulations also reinforce this notion. Second, stochastic search algorithms such as ALS can require significant computational expense, and reducing the data dimension before applying ALS provides a large increase in efficiency. This preliminary data reduction has to be conducted carefully so as to minimize the information loss while maximizing the reduction of redundancy in the variables. The main challenges are as follows. First, the singular and unstable covariance matrix makes the estimation of parameters inaccurate. Second, high-dimensional data usually contain a large number of unnecessary variables (noise), and this noise can complicate or even mask the class information (Donoho, 2000; Fowlkes et al., 1988). Third, when d ≫ n, even if each element of the population mean vector can be accurately estimated, the aggregated estimation error can be very large. In this article, we consider three methods for reducing the data into m (m > p) dimensions. The first approach is conventional PCA, which has the effect of selecting the best m-dimensional space in terms of minimizing the mean squared deviation from the original d-dimensional space.

The second approach is the sure independence screening (SIS) method based on correlation learning (Fan and Lv, 2008). It computes the componentwise marginal correlations between each predictor and the response vector; one then selects the variables corresponding to the m largest absolute correlations. PCA is an unsupervised variable combination approach, while SIS is a supervised variable selection method. PCA may exclude important information among variables with less variability, while SIS may tend to select a redundant subset of predictors that individually have high correlations with the response but are also highly correlated among themselves. Our third dimension reduction approach, which we call PCA-SIS, attempts to leverage the best of both PCA and SIS by first using PCA to obtain n orthogonal components, then treating these components as the predictors and using SIS to select the best m in terms of correlation with the response. PCA-SIS can be thought of as a supervised version of PCA. We compare these three types of preliminary reduction methods, along with the effect of performing no initial dimension reduction, in our simulation studies. Fan and Lv (2008) argue that the dimension of the intermediate space, m, should be chosen as

    m = 2n / log(n).    (4)

We have found that this approach generally works well, so we have adopted Equation (4) for selecting m in this article.
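A compact sketch of the three preliminary reductions and of the choice of m in Equation (4) is given below; the function names and the use of ceiling rounding are our own choices.

# Preliminary reduction of an n-by-d data matrix X down to m dimensions.
choose_m <- function(n) ceiling(2 * n / log(n))          # Equation (4)

# SIS: keep the m variables with the largest absolute marginal correlation with y.
reduce_sis <- function(X, y, m) {
  X[, order(-abs(cor(X, as.numeric(y))))[1:m], drop = FALSE]
}

# PCA: keep the first m principal component scores.
reduce_pca <- function(X, m) {
  prcomp(X, center = TRUE)$x[, 1:m, drop = FALSE]
}

# PCA-SIS: compute the PCA scores first, then screen them by correlation with y.
reduce_pca_sis <- function(X, y, m) {
  scores <- prcomp(X, center = TRUE)$x                   # up to min(n, d) components
  scores[, order(-abs(cor(scores, as.numeric(y))))[1:m], drop = FALSE]
}

For example, choose_m(150) gives 60 and choose_m(38) gives 21, matching the intermediate dimensions used for the two real data sets in Section 4.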

3 Applications on simulated data

In this section, we give five simulation examples to demonstrate the performance of ALS on classification tasks. In all simulations we show results with ALS applied using the classical Logistic Discriminant Analysis (LD) classifier and using the Support Vector Machine (SVM) classifier. In addition, we also apply LD and SVM to the original full dimensional data (FD) and to low dimensional data produced using a straight Lasso method (Lars), a generalized Lasso method (GLMpath), SIS, and PCA. Lars, GLMpath, SIS, and PCA all utilize Equation (1) but compute P directly using their own methodologies.

3.1 Simulation study I: sparse case

The first simulation was designed to examine the performance of ALS in a sparse model situation, where the response is associated with only a handful of predictors. We simulated 20 training samples, each containing 100 observations and 50 predictor variables generated from a multivariate normal distribution N(0, Σ), where the diagonal elements of Σ are 1s. In order to examine model performance under multicollinearity, we set the off-diagonal elements of Σ equal to 0.5. We then rescaled the first p = 5 columns to have a standard deviation of 10. We call these columns, which have the most variability, the major columns, and the rest the minor columns. The binary response variable was generated using a standard logistic regression model after first reducing the dimension of the space using a fixed P. For each training data set, a corresponding set of 1,000 observations was generated as a test sample. We tested two scenarios by creating two different true projection matrices. The first scenario used only the p major columns to generate the response variables. Hence the true projection matrix was P = perm_1([I_p, 0]^T), where perm_1 is the row permutation that associates the unit row vectors in P with the major columns of the data matrix.
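The following base-R sketch illustrates the scenario 1 data-generating mechanism just described; the identity permutation and the unit logistic coefficients are illustrative assumptions, since the paper does not report the specific permutation or coefficient values used.

# Simulation study I, scenario 1 (sketch): compound-symmetric Gaussian
# predictors, the first 5 "major" columns rescaled to standard deviation 10,
# and a logistic response driven by those major columns.
set.seed(3)
n <- 100; d <- 50; p <- 5; rho <- 0.5
Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
X <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)    # rows distributed as N(0, Sigma)
X[, 1:p] <- 10 * X[, 1:p]                          # major columns: sd 10
P_true <- rbind(diag(p), matrix(0, d - p, p))      # [I_p, 0]^T (identity permutation assumed)
eta <- drop(X %*% P_true %*% rep(1, p))            # unit coefficients assumed
y <- rbinom(n, 1, plogis(eta))                     # binary response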

The second scenario associates the group information with p of the minor columns, which have less variability than the major columns. In this setting the true projection matrix is P = perm_2([I_p, 0]^T), where perm_2 is the row permutation that associates the unit row vectors in P with p of the minor columns. It is easy to see that the first scenario favors PCA, because the first 5 eigenvectors from PCA will tend to concentrate on the major columns from which the response variable is derived. However, in the second scenario PCA should perform poorly, because it will still select the most variable columns even though there is no group information among them. Supervised methods, such as ALS, Lars, and GLMpath, should have an advantage in the latter scenario because they always look for projections with high predictive power. We compared the test MCRs averaged over the 20 test data sets. In order to demonstrate the importance of the sparsity of the projection matrix to the classification performance, we also implemented a second version of ALS in which we fixed the sparsity of the projection matrix, ξ, so that it does not change over iterations. We call this method the fixed Lasso search (FLS) method. Presumably, FLS with a good value of the sparsity will perform better than ALS, because ALS has to adaptively search for ξ; if the initial sparsity is far from the true value, ALS will spend more time finding the true sparsity, which slows down the rate of convergence. Since the true sparsity level for these data was high, we set ξ = 0.9 for the FLS procedure. For the Lars, GLMpath, SIS, PCA, ALS, and FLS methods we reduced the data to p = 5 dimensions, using the corresponding approaches, and then applied LD and SVM. For FD we did not perform any dimension reduction. The results are shown in the first two sections of Table 1. Not surprisingly, LD generally outperformed SVM on these data because the data were generated using the logistic regression model. Of more interest is the relative performance of the seven dimension reduction methods.

As expected, PCA worked fairly well in scenario 1 but poorly in scenario 2. In both scenarios, ALS and FLS performed well. Notice that, since the true model was highly sparse, the pure variable selection methods (Lars and GLMpath) performed well too. All methods except PCA outperformed FD. Figure 1 illustrates the average deviance values, the average test MCRs, and ξ from ALS over 500 iterations for scenarios 1 and 2, respectively. As we can see, the average test MCRs and deviances in both scenarios were decreasing, which means the Lasso chose good directions from P at each iteration, and hence the classification accuracy improved gradually. It is interesting to notice that the sparsity levels for the estimated projection matrices of these two scenarios are different, although the sparsity of the true projection matrices is in fact the same: ξ = 0.98. In scenario 1, ξ decreased to about 0.28, which is far from the true sparsity level. This explains why FLS performed worse than ALS in this scenario. In scenario 2, ξ increased to about 0.8, which is close to the true value, and in this setting FLS worked better than ALS, as we expected. One possible reason for the difference is that in scenario 1, the estimated projection matrix may have many elements close to zero which make little contribution to extracting variables but which still count against its sparsity. Another possible reason is that in scenario 1, since all information is located in the major columns, the noise level is low, and the inclusion of minor columns does not strongly influence the classification accuracy. In scenario 2, on the other hand, since the noise level is high, the search must find the correct sparsity in order to improve accuracy.

Table 1: The average test MCRs and standard errors (in parentheses) on the three intermediate-scale simulated data sets.

                     FD        Lars      GLMpath   SIS       PCA       ALS       FLS
Study I (1)  LD      (0.011)   (0.012)   (0.012)   (0.021)   (0.009)   (0.007)   (0.008)
             SVM     (0.017)   (0.013)   (0.013)   (0.022)   (0.008)   (0.006)   (0.008)
Study I (2)  LD      (0.009)   (0.009)   (0.010)   (0.017)   (0.006)   (0.012)   (0.007)
             SVM     (0.018)   (0.010)   (0.011)   (0.019)   (0.007)   (0.012)   (0.007)
Study II     LD      (0.018)   (0.016)   (0.015)   (0.021)   (0.020)   (0.015)   (0.007)
             SVM     (0.015)   (0.016)   (0.016)   (0.020)   (0.022)   (0.016)   (0.007)
Study III    LD      (0.008)   (0.007)   (0.008)   (0.010)   (0.008)   (0.010)   (0.017)
             SVM     (0.009)   (0.007)   (0.008)   (0.013)   (0.010)   (0.009)   (0.017)

Figure 1: Model performance for simulation study I. The upper panels correspond to scenario 1 and the lower panels to scenario 2. Left panels: the average training deviance path; middle panels: the average test error path for the LD models; right panels: the average ξ path.

3.2 Simulation study II: dense case

In the second simulation, we investigated a relatively dense model. As in the first simulation study, we simulated the design matrix from the same multivariate normal distribution. However, unlike the sparse case, the elements of the true projection matrix were generated from the standard normal distribution. Therefore, by our definition, the sparsity of the projection matrix was close to zero. In all other respects the simulation was identical to the previous setup. Since the true projection matrix was dense, we assigned ξ = 0 to FLS. The results are reported in the middle part of Table 1. ALS and FLS clearly outperformed the other methods, with FLS providing the best results. Among the other methods, FD worked best, suggesting that performing no dimension reduction can be better than a poor dimension reduction. One possible reason for the strong performance of FD is that, since d = 50 is only a moderately large number with d < n, LD and SVM may be able to manage classification on the full dimensional data. One would not expect the performance of Lars, GLMpath, or SIS to be as good, because they are all variable selection methods and in this case all variables play a role. Plots of the training deviance path, the test MCR paths, and the sparsity path (not shown here) are similar to those for the sparse case. The sparsity for this simulation study levels off at about 0.5, which explains the superior performance of FLS.

3.3 Simulation study III: contaminated data

In order to examine the robustness of ALS against outliers, we created a situation where the distributional assumptions were violated. We generated twenty 100-by-50 training data matrices from a multivariate g-and-h distribution (Field and Genton, 2006) with g = h = (0.5, ..., 0.5)^T ∈ R^50, a zero mean vector, and a covariance matrix Σ whose diagonal elements are all 1s and whose off-diagonal elements are 0.3s.

This was a skewed and heavy-tailed distribution, so the data contained many extreme values, or outliers. As in the previous simulation, the true projection matrix was generated using a standard normal distribution. We generated corresponding test data sets with 1,000 observations each. FLS was implemented with ξ = 0.1. The results are listed at the bottom of Table 1. As in the previous simulations, ALS and FLS do better than the other methods on these data, which demonstrates that our method is not very sensitive to outliers. Our understanding of the stability of ALS and FLS is as follows. The design matrix contains many extreme values, but, unlike PCA, the projection directions are randomly generated, and the generation is independent of X. This independence weakens the effect of extreme values in X on the projected data; presumably, Z has fewer extreme values than X. Therefore, the whole search procedure is more stable.
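One way to generate such contaminated data is to apply the g-and-h transform, τ(z) = ((exp(gz) - 1)/g) exp(hz²/2), to each margin of a correlated Gaussian matrix. The sketch below is our own illustration of that construction, not code taken from the paper.

# Simulation study III (sketch): skewed, heavy-tailed predictors obtained by
# applying the g-and-h transform marginally to correlated normal variates.
set.seed(4)
n <- 100; d <- 50; g <- 0.5; h <- 0.5; rho <- 0.3
Sigma <- matrix(rho, d, d); diag(Sigma) <- 1
Z <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)    # correlated N(0, Sigma) variates
g_and_h <- function(z, g, h) (exp(g * z) - 1) / g * exp(h * z^2 / 2)
X <- g_and_h(Z, g, h)                              # contaminated design matrix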

3.4 Simulation study IV: ultra high dimensional data

The previous simulations consider data with moderately high dimensions. In this section, we simulated an ultra high dimensional situation where n = 100, d = 1000, and a preliminary dimension reduction was needed. Among the 1000 variables, m = 50 contained the group information. That is, we first created an m-by-p projection matrix whose elements were i.i.d. standard normal, and then obtained the reduced data. The logistic regression model was used to generate the class labels based on the p = 5 dimensional reduced data. The 50 informative variables were generated from N(0, Σ), where Σ was as defined in the previous simulations. The remaining 950 variables were pure noise and were generated from N(0, Σ*), where the off-diagonal elements of Σ* were 0.2s and the diagonal elements were 1. We used the three aforementioned preliminary reduction methods, SIS, PCA, and PCA-SIS, to reduce the data to m = 50 dimensions, and then performed ALS and FLS on the intermediate data to further reduce it to p = 5 dimensions. This gave six combinations: SIS-ALS, PCA-ALS, PCA-SIS-ALS, SIS-FLS, PCA-FLS, and PCA-SIS-FLS. Presumably, if the preliminary reduction could extract the m informative variables, or linear combinations of those m variables, ALS and FLS with ξ = 0 would be able to perform well. In order to demonstrate that the improvement in classification accuracy was not mainly due to the preliminary reduction but to the ALS method itself, we also implemented Lars and GLMpath on the preliminary reduced data from SIS, PCA, and PCA-SIS. Moreover, we applied LD to the full dimensional data, in addition to using Lars and GLMpath to directly select p = 5 variables from the full 1,000 dimensional data. Twenty training samples were used to build the models, with corresponding test samples (n = 1000) generated to obtain the test MCR for each model. The average test MCRs, using the LD classifier, are shown in Table 2 along with standard errors. As we can see, SIS-ALS and SIS-FLS provided the best results; however, all six implementations of ALS and FLS were statistically indistinguishable from each other. All other methods were clearly inferior, including Lars and GLMpath even after using the same preliminary dimension reduction. This indicates that the improvement in classification accuracy by ALS was not only caused by the preliminary reduction but more by the method itself.
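For completeness, the data-generating scheme described at the start of this subsection can be sketched in base R as follows; placing the informative variables in the first 50 columns and using unit logistic coefficients are our own illustrative choices.

# Simulation study IV (sketch): 50 informative variables plus 950 pure-noise
# variables, with the class label driven by a dense 50-by-5 projection.
set.seed(5)
n <- 100; d <- 1000; m <- 50; p <- 5
Sig_info  <- matrix(0.5, m, m);         diag(Sig_info)  <- 1
Sig_noise <- matrix(0.2, d - m, d - m); diag(Sig_noise) <- 1
X_info  <- matrix(rnorm(n * m), n, m) %*% chol(Sig_info)
X_noise <- matrix(rnorm(n * (d - m)), n, d - m) %*% chol(Sig_noise)
X <- cbind(X_info, X_noise)                      # informative columns first (assumed)
P_true <- matrix(rnorm(m * p), m, p)             # dense m-by-p true projection
eta <- drop(X_info %*% P_true %*% rep(1, p))     # unit coefficients assumed
y <- rbinom(n, 1, plogis(eta))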

Numerous classical and modern methods have been developed for classification problems. We cannot hope to test them all here, but we opted to examine a representative sampling of these methods. In particular, we tested support vector machines (SVM), k-nearest neighbors (kNN), neural networks (NN), and random forests (RF). We tested each method after first applying one of the three different dimension reduction methods: SIS, PCA, and PCA-SIS. The results are shown in Table 3. As can be seen, ALS and FLS performed extremely well relative to these other approaches.

Table 2: The average test errors (LD classifier) for each combination of a preliminary reduction (none, SIS, PCA, or PCA-SIS; d -> m with m = 50) and a further reduction (none, Lars, GLMpath, SIS, PCA, ALS, or FLS; m -> p with p = 5).

Table 3: The average test errors on the preliminary reduced data by different classification methods (standard errors in parentheses).

           SVM        kNN             NN              RF
SIS        (0.024)    0.330 (0.023)   0.321 (0.027)   0.307 (0.023)
PCA        (0.018)    0.374 (0.023)   0.298 (0.019)   0.330 (0.023)
PCA-SIS    (0.024)    0.360 (0.021)   0.297 (0.018)   0.337 (0.024)

4 Applications on real data

In this section, we apply ALS to two data sets from real studies: an fMRI data set and a gene microarray data set.

4.1 fMRI brain imaging data

The fMRI data were obtained from the imaging center of the University of Southern California and consisted of observations from an individual subject performing a visual task. After preprocessing, the data contained 11,838 variables and 200 observations, representing the number of voxels and the number of recording time points, respectively. The goal was to separate the task time points (96) from the baseline time points (104). We first divided the data set into training (150) and test (50) samples. We chose p = 10, 20, 30, 40, 50, 60, values which were considered to give a good balance between execution time and classification accuracy. We first reduced the 11,838-dimensional data space to an m-dimensional space with m = 60 chosen via (4). This preliminary dimension reduction had the effect of both increasing classification accuracy and reducing computational cost. We used PCA, SIS, and PCA-SIS to perform the preliminary dimension reduction and then applied ALS, Lars, and GLMpath, for a total of nine different implementations. As a comparison, we also used Lars and GLMpath directly on the full dimensional data to produce the p-dimensional subspace. The full dimensional case was not examined here because the fitting procedure crashed when this ultra high dimensional data set was imported. Table 4 reports the average test MCRs over 20 runs using SVM as the base classifier. The ALS based methods clearly outperformed the other approaches. In particular, the classification accuracy of the three ALS based methods was ordered PCA-SIS-ALS > PCA-ALS > SIS-ALS. As shown in Figure 2, the average training deviance and the average test MCR decreased and then leveled off, indicating that the search converged to the optimal subspace. The average ξ for PCA-SIS-ALS increased to about 0.75 and then leveled off, which means ALS chose a relatively sparse matrix as the optimal projection matrix for the fMRI data.

Similar trends in ξ were observed for SIS-ALS and PCA-ALS. Table 5 shows results from the SVM, kNN, NN, and RF classification methods on the m = 60 dimensional reduced data. All four methods perform worse than ALS.

Table 4: The test errors on the fMRI data using SVM models, for p = 10, 20, 30, 40, 50, 60. The methods compared are Lars, GLMpath, SIS, PCA, SIS-ALS, PCA-ALS, PCA-SIS-ALS, SIS-Lars, SIS-GLMpath, PCA-Lars, PCA-GLMpath, PCA-SIS-Lars, and PCA-SIS-GLMpath.

4.2 Leukemia cancer gene microarray data

The second real world data set was the leukemia cancer data from a gene microarray study by Golub et al. (1999). This data set contained 72 tissue samples, each with 7,129 gene expression measurements and a label of cancer type. Among the 72 tissues, 25 were samples of acute myeloid leukemia (AML) and 47 were samples of acute lymphoblastic leukemia (ALL). Within the 38 training samples, 27 were ALL and 11 AML, and within the 34 test samples, 20 were ALL and 14 AML.

Table 5: The test errors on the preliminary reduced data (m = 60) by different classification methods. Rows: SIS, PCA, PCA-SIS; columns: SVM, kNN, NN, RF.

Figure 2: PCA-SIS-ALS on the fMRI data (p = 30). Left panel: the average training deviance path; middle panel: the average test error path; right panel: the average ξ path.

In some previous studies (Fan and Fan, 2008; Fan and Lv, 2008), 16 genes were ultimately selected, so in this study we also ultimately select p = 16 genes. The intermediate dimension was chosen to be m = 21 by Equation (4). An LD model was applied for classification. As in our simulation studies and the fMRI study, PCA-SIS-ALS provided the lowest MCR (see Table 6), and it was better than the nearest shrunken centroids method of Tibshirani et al. (2002), which obtained an MCR of 2/34 = 0.059 based on 21 selected genes (Fan and Lv, 2008). Notice that PCA-Lars and PCA-SIS-Lars also performed well on these data. Figure 3 shows the resulting graphs. As in the previous examples, the training deviance and the test MCRs decrease and then level off, and the sparsity of the projection matrix also levels off.

Table 6: The test MCRs on the leukemia cancer data for each combination of a preliminary reduction (none, SIS, PCA, or PCA-SIS; d -> m with m = 21) and a further reduction (Lars, GLMpath, SIS, PCA, or ALS; m -> p with p = 16).

5 Evaluating the computational expense of ALS

In this section we provide a brief review of the computational cost of ALS in comparison to other stochastic search procedures. We compared ALS on the fMRI data to the method of Tian et al. (2008) in terms of both classification accuracy and computational time. The approach of Tian et al. (2008) provides a useful comparison because it uses a similar model to our own but implements the search process using a simulated annealing approach.

Figure 3: PCA-SIS-ALS on the leukemia cancer data. Left panel: the average training deviance path; middle panel: the average test error path; right panel: the average ξ path.

The code was written in R and was run on a Dell Precision workstation (CPU 3.00 GHz; RAM 16.0 GB). We recorded the total CPU time (in seconds) of the R program and compared it with the time needed to run the second version of the simulated annealing method (SA2) of Tian et al. (2008) using an SVM model. PCA was applied first to reduce the data dimension to m = 80, and then both methods searched for a p = 30 dimensional subspace. Table 7 reports the time consumed and the classification accuracy on the test data when both methods use 500 iterations. Both ALS and SA2 took a similar time to run. The key difference was that ALS converged well before 500 iterations while SA2 did not; hence, the error rate for SA2 was much higher. SA2 took approximately 5000 iterations to converge, which resulted in an order of magnitude more computational effort. In addition, even when 5000 iterations were used for SA2, its average test errors were still higher than for the ALS method.
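Timings of this kind can be collected in R with system.time(); a minimal example follows, in which the matrix multiplications are only a dummy computation standing in for one run of a search procedure.

# Record the total elapsed time (in seconds) of one run of a procedure.
timing <- system.time({
  X <- matrix(rnorm(200 * 80), 200, 80)                        # dummy data
  for (i in 1:500) Z <- X %*% matrix(rnorm(80 * 30), 80, 30)   # dummy 500-iteration loop
})
timing["elapsed"]                                              # total elapsed wall-clock time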

Table 7: The execution time for each run and the average test errors on the fMRI data, for ALS (500 iterations), SA2 (500 iterations), and SA2 (5000 iterations).

6 Concluding remarks

ALS was designed to implement a supervised learning classification method with the flexibility to mimic either a variable selection or a variable combination method. It does this by adaptively adjusting the sparsity of the projection matrix used to lower the dimensionality of the original data space. We use a stochastic search procedure to address the very high dimensional predictor space. Our simulation results suggest that this approach can provide extremely competitive results relative to a large range of classical and modern classification techniques. We also examined three different preliminary dimension reduction methods, which appeared both to increase prediction accuracy and to improve computational efficiency. ALS seems to converge quickly relative to other stochastic approaches, which makes it feasible to use on large data sets. The ALS dimensionality reduction methodology could also be generalized to the regression context, where the response is a continuous variable. Further studies are planned for this setting.

References

Breiman, L. (2001), Random forests, Machine Learning, 45.

Candes, E. and Tao, T. (2007), The Dantzig selector: statistical estimation when p is much larger than n (with discussion), The Annals of Statistics, 35.

Donoho, D. L. (2000), High-dimensional data analysis: the curses and blessings of dimensionality, aide-memoire of a lecture at the AMS conference "Math Challenges of the 21st Century". Available at donoho/lectures.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least angle regression, The Annals of Statistics, 32.

Fan, J. and Fan, Y. (2008), High dimensional classification using features annealed independence rules, The Annals of Statistics, to be published.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Fan, J. and Lv, J. (2008), Sure independence screening for ultra-high dimensional feature space, Journal of the Royal Statistical Society: Series B, 70.

Field, C. and Genton, M. G. (2006), The multivariate g-and-h distribution, Technometrics, 48.

Fogel, D. B. (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2nd ed.), IEEE Press, Piscataway, NJ.

Fowlkes, E. B., Gnanadesikan, R., and Kettenring, J. R. (1988), Variable selection in clustering, Journal of Classification, 5.

Fu, M. C. (2002), Optimization for simulation: theory vs. practice (with discussion by S. Andradóttir, P. Glynn, and J. P. Kelly), INFORMS Journal on Computing, 14.

Fu, W. and Knight, K. (2000), Asymptotics for Lasso-type estimators, The Annals of Statistics, 28.

George, E. I. and McCulloch, R. E. (1993), Variable selection via Gibbs sampling, Journal of the American Statistical Association, 88.

George, E. I. and McCulloch, R. E. (1995), Stochastic search variable selection, in Markov Chain Monte Carlo in Practice, eds. W. R. Gilks, S. Richardson and D. J. Spiegelhalter, London: Chapman and Hall.

George, E. I. and McCulloch, R. E. (1997), Approaches for Bayesian variable selection, Statistica Sinica, 7.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C. and Lander, E. (1999), Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286.

Gosavi, A. (2003), Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer, Boston.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag.

Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24.

Liberti, L. and Kucherenko, S. (2005), Comparison of deterministic and stochastic approaches to global optimization, International Transactions in Operational Research, 12.

Mitchell, T. J. and Beauchamp, J. J. (1988), Bayesian variable selection in linear regression (with discussion), Journal of the American Statistical Association, 83.

Park, M.-Y. and Hastie, T. (2007), An L1 regularization-path algorithm for generalized linear models, Journal of the Royal Statistical Society: Series B, 69.

Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2.

Tian, T., Wilcox, R., and James, G. (2008), Data reduction when dealing with classification based on high-dimensional data, under review.

Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B, 58.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proceedings of the National Academy of Sciences of the United States of America, 99.

Torgerson, W. S. (1958), Theory and Methods of Scaling, New York: Wiley.

Vapnik, V. (1998), Statistical Learning Theory, Wiley.

West, M., Blanchette, C., Fressman, H., Huang, E., Ishida, S., Spang, R., Zuan, H., Marks, J. R. and Nevins, J. R. (2001), Predicting the clinical status of human breast cancer using gene expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 98.

Wold, H. (1966), Estimation of principal components and related models by iterative least squares, in Multivariate Analysis, Academic Press, New York.

Young, G. and Householder, A. S. (1938), Discussion of a set of points in terms of their mutual distances, Psychometrika, 3.

Zhao, P. and Yu, B. (2006), On model selection consistency of Lasso, Journal of Machine Learning Research, 7.

Zou, H. and Hastie, T. (2005), Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B, 67.


More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information

Yelp Recommendation System

Yelp Recommendation System Yelp Recommendation System Jason Ting, Swaroop Indra Ramaswamy Institute for Computational and Mathematical Engineering Abstract We apply principles and techniques of recommendation systems to develop

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu Breaking it Down: The World as Legos Benjamin Savage, Eric Chu To devise a general formalization for identifying objects via image processing, we suggest a two-pronged approach of identifying principal

More information

Robust Face Recognition via Sparse Representation

Robust Face Recognition via Sparse Representation Robust Face Recognition via Sparse Representation Panqu Wang Department of Electrical and Computer Engineering University of California, San Diego La Jolla, CA 92092 pawang@ucsd.edu Can Xu Department of

More information

The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R

The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R Journal of Machine Learning Research 6 (205) 553-557 Submitted /2; Revised 3/4; Published 3/5 The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R Xingguo Li Department

More information

Last time... Coryn Bailer-Jones. check and if appropriate remove outliers, errors etc. linear regression

Last time... Coryn Bailer-Jones. check and if appropriate remove outliers, errors etc. linear regression Machine learning, pattern recognition and statistical data modelling Lecture 3. Linear Methods (part 1) Coryn Bailer-Jones Last time... curse of dimensionality local methods quickly become nonlocal as

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens Faculty of Economics and Applied Economics Trimmed bagging a Christophe Croux, Kristel Joossens and Aurélie Lemmens DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) KBI 0721 Trimmed Bagging

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

An efficient algorithm for rank-1 sparse PCA

An efficient algorithm for rank-1 sparse PCA An efficient algorithm for rank- sparse PCA Yunlong He Georgia Institute of Technology School of Mathematics heyunlong@gatech.edu Renato Monteiro Georgia Institute of Technology School of Industrial &

More information

The supclust Package

The supclust Package The supclust Package May 18, 2005 Title Supervised Clustering of Genes Version 1.0-5 Date 2005-05-18 Methodology for Supervised Grouping of Predictor Variables Author Marcel Dettling and Martin Maechler

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Learning from High Dimensional fmri Data using Random Projections

Learning from High Dimensional fmri Data using Random Projections Learning from High Dimensional fmri Data using Random Projections Author: Madhu Advani December 16, 011 Introduction The term the Curse of Dimensionality refers to the difficulty of organizing and applying

More information

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400 Statistics (STAT) 1 Statistics (STAT) STAT 1200: Introductory Statistical Reasoning Statistical concepts for critically evaluation quantitative information. Descriptive statistics, probability, estimation,

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

Penalizied Logistic Regression for Classification

Penalizied Logistic Regression for Classification Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different

More information

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

scikit-learn (Machine Learning in Python)

scikit-learn (Machine Learning in Python) scikit-learn (Machine Learning in Python) (PB13007115) 2016-07-12 (PB13007115) scikit-learn (Machine Learning in Python) 2016-07-12 1 / 29 Outline 1 Introduction 2 scikit-learn examples 3 Captcha recognize

More information

An Empirical Comparison of Spectral Learning Methods for Classification

An Empirical Comparison of Spectral Learning Methods for Classification An Empirical Comparison of Spectral Learning Methods for Classification Adam Drake and Dan Ventura Computer Science Department Brigham Young University, Provo, UT 84602 USA Email: adam drake1@yahoo.com,

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

Lecture 06 Decision Trees I

Lecture 06 Decision Trees I Lecture 06 Decision Trees I 08 February 2016 Taylor B. Arnold Yale Statistics STAT 365/665 1/33 Problem Set #2 Posted Due February 19th Piazza site https://piazza.com/ 2/33 Last time we starting fitting

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection Volume-8, Issue-1 February 2018 International Journal of Engineering and Management Research Page Number: 194-200 The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers

More information

Voxel selection algorithms for fmri

Voxel selection algorithms for fmri Voxel selection algorithms for fmri Henryk Blasinski December 14, 2012 1 Introduction Functional Magnetic Resonance Imaging (fmri) is a technique to measure and image the Blood- Oxygen Level Dependent

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification CAMCOS Report Day December 9 th, 2015 San Jose State University Project Theme: Classification On Classification: An Empirical Study of Existing Algorithms based on two Kaggle Competitions Team 1 Team 2

More information

Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008.

Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008. Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008. Introduction There is much that is unknown regarding

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information