Feature Selection Using a Piecewise Linear Network
Jiang Li, Member, IEEE, Michael T. Manry, Pramod L. Narasimha, Student Member, IEEE, and Changhua Yu
IEEE Transactions on Neural Networks, vol. 17, no. 5, September 2006

Abstract: We present an efficient feature selection algorithm for the general regression problem, which utilizes a piecewise linear orthonormal least squares (OLS) procedure. The algorithm 1) determines an appropriate piecewise linear network (PLN) model for the given data set, 2) applies the OLS procedure to the PLN model, and 3) searches for useful feature subsets using a floating search algorithm. The floating search prevents the nesting effect. The proposed algorithm is computationally very efficient because only one data pass is required. Several examples are given to demonstrate the effectiveness of the proposed algorithm.

Index Terms: Feature selection, regression, piecewise linear network (PLN), orthonormal least squares (OLS), floating search.

Manuscript received May 7, 2005; revised January 31. This work was supported by the Advanced Technology Program of the State of Texas. J. Li was with the Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA; he is now with the Department of Radiology, Warren G. Magnuson Clinical Center, National Institutes of Health, Bethesda, MD, USA (li@wcn.uta.edu). M. T. Manry and P. L. Narasimha are with the Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA (manry@uta.edu). C. Yu was with the Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA; he is now with Fastvdo LLC, Columbia, MD, USA.

I. INTRODUCTION

FEATURE or input variable selection plays a very important role in many multivariate modeling problems, where the best subset of features is not known. We focus on feature selection for regression tasks in this paper. Irrelevant features can lead to several problems when nonlinear networks are used for modeling [1]: 1) training an unnecessarily large network requires more computational resources and memory, 2) high dimensional data may have the curse of dimensionality problem if the available data is limited, and 3) training algorithms for large networks can also have convergence difficulties and poor generalization. The goal of feature selection is to generate a compact subset of features that leads to an accurate model based on the limited amount of data.

Feature selection methods usually consist of three steps: feature evaluation, feature subset search, and feature subset validation (stopping criterion). Feature evaluation estimates the fitness of features based on a user chosen criterion. Feature subset search attempts to find a combination of the features that maximizes the fitness criterion. Feature subset validation is used to determine the number of features that is sufficient for the given data.

Feature evaluation for regression tasks may be divided into filter and wrapper approaches in the same way as its counterpart in classification [2], [3]. A filter approach estimates the fitness values for features without any actual model assumed between outputs and inputs of the data.
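For illustration only (this example is not part of the original paper), a filter-style evaluation can be as simple as ranking candidate features by their squared Pearson correlation with the output; the Python sketch below assumes generic NumPy arrays, and the data in the usage lines are synthetic.

```python
import numpy as np

def correlation_filter_ranking(X, y):
    """Rank features by squared Pearson correlation with a single output.

    X: (num_patterns, num_features) candidate features
    y: (num_patterns,) desired output
    Returns the feature indices sorted from most to least correlated, and the scores.
    """
    Xc = X - X.mean(axis=0)                    # center the features
    yc = y - y.mean()                          # center the output
    cov = Xc.T @ yc / len(y)                   # covariance of each feature with y
    r2 = cov**2 / (Xc.var(axis=0) * yc.var() + 1e-12)   # squared correlation coefficients
    return np.argsort(-r2), r2

# Synthetic usage: features 1 and 4 (0-based indices) drive the output, so they rank first.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))
y = 2.0 * X[:, 1] + 0.5 * X[:, 4] + 0.1 * rng.standard_normal(500)
order, scores = correlation_filter_ranking(X, y)
print(order[:3])
```

Such a ranking ignores interactions among features, which is exactly the limitation that the wrapper and floating-search machinery developed later in the paper addresses.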
A feature can be selected or discarded based upon predefined criteria such as mutual information [4], [5], principal component analysis [6], [7], independent component analysis [1], [8], class separability measures [9], [10], or variable ranking [11]. Filter approaches have the advantage of computational efficiency, but they do not take into account the biases of regression models. On the other hand, wrapper approaches calculate the fitness of a feature subset by actually implementing a full regression model. A wrapper approach usually selects a model, optimizes the parameters, measures the fitness of the features, and selects the feature subset which has the largest fitness value. The neural network paradigms most related to the wrapper approach are pruning methods [12]-[14], growing methods [15], [16], and hybrid methods [17], [18]. Usually, wrapper approaches give better results than filter approaches, because both the modeling and selection procedures are based on similar models [2]. However, wrapper approaches have higher computational complexity, since evaluation of the fitness values requires passes through the data. Note that feature selection algorithms are not restricted to selecting input subsets; they can be extended to hidden unit pruning in neural networks [19].

There is another feature selection methodology, neither a filter nor a wrapper method, which is based on a trained neural network [20]-[22]. Here, the fitness of one feature is calculated from parameters of the trained network. For example, one can evaluate the summation of all the weight magnitudes connected to an input [23]. However, the summation of weight magnitudes is fixed once the network is trained, so it does not reflect the dynamical interaction among the features: adding one correlated feature into a set of features has a big effect on the magnitudes of the other weights in the network.

Search algorithms used in feature selection are often growing or pruning methods. The former approach adds features to the current feature subset and is sometimes called a bottom-up search. The latter approach removes features from the current feature subset until a satisfactory result is obtained, and is sometimes called a top-down search. However, both approaches suffer from the so-called nesting effect: in the top-down search, a discarded feature cannot be reselected, while in the bottom-up search a feature cannot be discarded once selected [24]. Since the fitness of a set of features depends upon the correlations among them, a feature with a high fitness value does not necessarily belong to the optimal subset. From the literature on feature selection for classification, we know that the optimal search algorithm is branch and bound [25], which requires a monotonic criterion function. Although branch and bound is very efficient compared to exhaustive search, it still becomes impractical for data sets with more than about 30 features. Attempts to prevent the nesting effect while retaining efficiency include the plus-l-minus-r search method [26] and the floating search algorithm [24].

The drawback of the plus-l-minus-r algorithm is that there is no theoretical way of predicting the values of l and r that achieve the best feature subset. The floating search algorithm is an excellent tradeoff between the nesting effect and computational efficiency, and there are no parameters to be determined. Though the floating search was originally proposed for classification tasks, it is readily extended to regression applications if the fitness criterion is appropriately formulated.

To perform feature subset validation, the available data may be further divided into training and validation sets. The search procedure is performed on the training set, and the search stops when a stopping criterion, such as a significance test of the errors on the validation set, is satisfied [3]. If the data size is sufficiently large with respect to the complexity of the regression model, Fisher tests can be used; otherwise, a leave-one-out method may be a good alternative [27]. Stoppiglia et al. [28] proposed to append to the set of candidate features a probe feature, which is a random variable; all features that are ranked below the probe feature should be discarded. They proved that this is closely related to the Fisher test. Regularization and D-optimality [29]-[31] have also been proposed for determining the number of features that should be included in the final model.

The orthonormal least squares (OLS) procedure [32] has been applied to construct sparse kernel regression models [29], [30], [33], [34]. It has also been combined with a lattice-partitioning piecewise linear network (PLN) for the construction of neurofuzzy networks [31]. In this paper, we develop an efficient wrapper type of feature selection algorithm for regression utilizing a modified Gram-Schmidt OLS procedure. The floating search method is extended to the regression case to prevent the nesting effect, and we evaluate the fitness values of feature subsets based upon the OLS procedure in a PLN.

The paper is organized as follows. In Section II, we review the OLS procedure for forward selection in a linear regression network. In Section III, an overview of the floating search algorithm is presented and applied to feature selection for the linear network of Section II. The proposed piecewise linear orthonormal floating search algorithm is given in Section IV. Section V discusses connections between the proposed algorithm and other related algorithms. The predicted mean square error (MSE) for an equivalent multilayer perceptron (MLP) is described in Section VI. Numerical examples are given in Section VII, and we conclude the paper in Section VIII.

II. OLS PROCEDURE FOR FORWARD SELECTION

In this section, we review forward selection in a linear regression network. An efficient implementation was given in terms of OLS and the modified Gram-Schmidt algorithm [32]. OLS is the basis for our proposed algorithm, and the modified Gram-Schmidt algorithm makes OLS very efficient.

A. Orthonormal Linear System

Given a set of Nv data pairs, where each input vector contains the N candidate features and each desired output vector has M components, consider the multiple-input multiple-output (MIMO) regression model of the form given in (1) and (2), where y_p(i) is the desired value of the pth output for the ith pattern and e_p(i) is the error between y_p(i) and the model output. Here, we let the (N + 1)th regressor equal one to handle output thresholds.
In this model, w_kp is the weight from the kth feature to the pth output, x(k) is the kth feature or regressor, N is the total number of candidate features, and M is the number of outputs. Substituting (1) into (2) and collecting the patterns, features, weights, and errors into matrices as in (3)-(7), the regression model can be written in the matrix form (8), Y = XW + E, where Y is the Nv x M matrix of desired outputs, X is the Nv x (N + 1) matrix of regressors, W is the (N + 1) x M weight matrix, and E is the error matrix. The task of feature selection for regression is to select the most significant subset from the set of N available features. The OLS procedure selects features to form a set of orthonormal bases in a forward manner. Suppose X can be decomposed as in (9) and (10), X = QA, where the columns of Q are orthonormal and A is an upper triangular transformation matrix.
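As a concrete illustration of this decomposition (not taken from the paper; the names X, Y, Q, A, W and the array shapes are assumptions consistent with the notation above), the sketch below uses NumPy's QR factorization as a stand-in for the modified Gram-Schmidt procedure of Appendix I, and verifies the orthonormality condition and the weight transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, N, M = 200, 4, 2
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])   # constant regressor appended
Y = X @ rng.normal(size=(N + 1, M)) + 0.1 * rng.normal(size=(Nv, M))

Q, A = np.linalg.qr(X)                    # Q has orthonormal columns, A is upper triangular
print(np.allclose(Q.T @ Q, np.eye(N + 1)))    # Q^T Q = I, the orthonormality condition
print(np.allclose(Q @ A, X))                  # the decomposition reproduces X

G = Q.T @ Y                               # least squares weights of the orthonormal system, cf. (17)
W = np.linalg.solve(A, G)                 # weights transformed back to the original system, cf. (18)
print(np.allclose(X @ W, Q @ G))          # both parameterizations give the same model output
```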

3 LI et al.: FEATURE SELECTION USING A PIECEWISE LINEAR NETWORK 1103 and (11) Here,, where denotes the identity matrix. Note that in [32] was decomposed to an orthogonal basis as B. Forward OLS Procedure The forward OLS procedure has been successfully applied to construct kernel regression models and neurofuzzy systems [29] [31], [33], [34]. System (16) consists of subsystems and can be denoted as follows: where with (12) (13) (19) where is the weight matrix connecting to the th output, and. Multiplying (19) by its transpose and time averaging, the following is easily derived: where is the square of the length of.defining, we get (20) and (14) (15) In other words, in order to get the orthonormal basis, the th row in the transformation matrix is normalized by the length of. The regression model of (8) now becomes (16) where denotes the weights for the orthonormal system. The least square solution for (16) is (17) which is projection of the desired output onto the othonormal bases. The weights for the original system can be easily obtained as (18) If none of the features are linearly dependent on the others, the decomposition (9) and the transformation (18) are always possible. The decomposition procedure is done using the algorithm given in Appendix I in which the data is accessed only once. The orthonormal weights are obtained based only on the autocorrelation and cross-correlation matrices. Once these matrices are calculated through one data pass, the orthonormal procedure only utilizes these matrices, whose sizes are usually much smaller than those of the original data. Therefore, a very efficient algorithm can be implemented. The variance or energy for the th output contains two parts, the variance explained by the features, and, the unexplained variance for the th output. The error reduction ratio for the th output due to the th feature is defined as (21) The most significant features are forward selected according to the value of. At the first step, we calculate for the th feature treating it as the first feature to be orthonormalized; then, is obtained using (21). The feature is selected if it produces the largest value of. At the second step, the aforementioned steps are repeated for the remaining features. For multiple output systems, one could apply the forward selection procedure to each output separately and obtain a different feature subset for each output system. This is necessary if each subsystem has different dynamics (see [31]). However, if the multiple output systems have similar dynamics, the selected subset features for each output may not differ much from one to another. We can then make a tradeoff between the complexity and the accuracy as follows. We keep the same feature subset for each output, and the error reduction ratio (21) for the th feature is modified as (22) which is the total error reduction ratio of the th feature for all of the output systems. The feature is selected if it has the largest in the forward selection procedure. Note that the denominator in (21) is a constant for all features; we can just simply ignore it.
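A minimal sketch of this forward OLS search is given below (Python/NumPy, written directly on the data matrix rather than on the one-pass autocorrelation and cross-correlation matrices the paper actually uses); the score accumulated for each candidate is the explained output energy summed over all outputs, i.e., the numerator of (22), since the denominator is the same for every feature.

```python
import numpy as np

def forward_ols_select(X, Y, n_select):
    """Greedy forward selection by summed error-reduction ratio.

    X: (Nv, N) candidate features, Y: (Nv, M) desired outputs.
    Returns the selected feature indices, most significant first.
    """
    Nv, N = X.shape
    selected, basis = [], []                    # chosen indices and orthonormal basis vectors
    for _ in range(n_select):
        best_idx, best_gain, best_q = None, -np.inf, None
        for k in range(N):
            if k in selected:
                continue
            v = X[:, k].astype(float).copy()
            for q in basis:                     # orthogonalize against already selected bases
                v -= (q @ v) * q
            norm = np.linalg.norm(v)
            if norm < 1e-10:                    # linearly dependent feature: zero fitness
                continue
            q_new = v / norm
            gain = ((q_new @ Y) ** 2).sum()     # energy explained, summed over all outputs
            if gain > best_gain:
                best_idx, best_gain, best_q = k, gain, q_new
        if best_idx is None:
            break
        selected.append(best_idx)
        basis.append(best_q)
    return selected
```

Dividing each gain by the total output energy would give the error reduction ratios of (21) and (22) exactly, but, as noted above, this does not change the selection order.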

4 1104 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 III. FLOATING FEATURE SEARCH METHOD FOR REGRESSION The drawback of the forward selection procedure in Section II-B is the nesting effect which means that a feature cannot be discarded once selected. In this section, we review the floating search method for feature selection, and again in the linear network. It is well known that when selecting a subset of from the given features, the optimal solution involves evaluating all possible subsets of size. It is obvious that the number of combinations needing to be evaluated increases exponentially with or. Though the branch and bound [25] algorithm reduces search time significantly, it is still not feasible for even a mild value of or. For these reasons, some tradeoffs between the optimality and efficiency of the algorithm have been made. Pudil et al. [24] proposed a floating search method for classification, which is a near-optimal solution that significantly reduces the computational load. The floating search method consists of forward and backward floating selection. Both of them have similar performances, but the forward one is computationally faster. We, therefore, extended the forward floating search algorithm for regression in this paper. In order to describe the floating search algorithm, we first introduce the following definitions. Let be a set of features from of available features. Definition 1: The individual fitness of one feature is (23) which is the total variance explained for all outputs due to the th feature. This is a general measure of the fitness value of one feature regardless of the selection order of in the orthonormal procedure. Definition 2: The total fitness of a set of features is measured as (24) which is the total variance explained for all the outputs due to all features in. Here, the features in are made orthonormal to each other according to the order as they are selected. Definition 3: The fitness of in is defined as (25) In other words, is equivalent to the general fitness value of calculated using (23), where is the last feature in that is made orthonormal to the other features in. This measure is used to identify which feature in the selected feature pool is the least significant one. The least significant feature in can be identified in the following procedure: For each, where, let be the last feature to be made orthonormal to other features in ; the fitness of is then calculated as (25). This procedure is repeated times; is identified as the least significant feature in if (26) Definition 4: The fitness of with respect to, where,is (27) The goal of this measure is to identify which feature in the remaining feature pool is the most significant one. The most significant feature in is identified as follows: For each feature, make it orthonormal to and calculate the fitness value of as (23). This process is repeated times and the most significant feature with respect to is identified as if (28) Definition 1 is a general measure that constructs basis for the other three definitions. Definition 2 is a measure of the stopping criterion for the continuation of the conditional deletion. Definition 3 is used in the conditional deletion, and Definition 4 is used in the adding one feature step in the forward floating search algorithm. See Appendix II for the details of the forward floating search algorithm. IV. PIECEWISE LINEAR ORTHONORMAL FLOATING SEARCH ALGORITHM We have introduced the floating search algorithm for a linear system. 
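The floating control flow itself can be summarized by the generic sketch below, which is a paraphrase of the procedure in Appendix II rather than the authors' implementation; it works with any subset-fitness function J(S), for example the total explained variance of Definition 2 computed by an OLS routine.

```python
def floating_forward_search(features, fitness, k_max):
    """Sequential forward floating selection (SFFS) skeleton.

    features: candidate feature indices
    fitness:  function mapping a tuple of indices to a scalar J(S)
    k_max:    largest subset size to build
    Returns a dict: best[k] = best subset of size k encountered.
    """
    current, best = [], {0: ()}
    while len(current) < k_max:
        # Step 1: add the most significant feature with respect to the current set.
        add = max((f for f in features if f not in current),
                  key=lambda f: fitness(tuple(current) + (f,)))
        current.append(add)
        best[len(current)] = max(best.get(len(current), ()), tuple(current),
                                 key=lambda s: fitness(s) if s else float("-inf"))
        # Steps 2 and 3: conditionally delete least significant features while the
        # reduced subset beats the best subset of that size found so far.
        while len(current) > 2:
            drop = min(current, key=lambda f: fitness(tuple(x for x in current if x != f)))
            reduced = tuple(x for x in current if x != drop)
            if fitness(reduced) > fitness(best[len(reduced)]):
                current = list(reduced)
                best[len(current)] = tuple(current)
            else:
                break
    return best
```

Because a deletion is accepted only when it improves on a previously recorded subset of the same size, the nesting effect of plain forward selection is avoided without an exhaustive search.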
However, feature selection for a nonlinear system should be based on a nonlinear model. In this section, we extend the floating search algorithm for linear regression to the piecewise linear regression case. Some important issues about this algorithm are also addressed. A. Piecewise Linear Orthonormal System It has been shown that a PLN model can approximate a nonlinear system adequately [35]. A PLN often employs a clustering method to partition the feature space into a hierarchy of regions (or clusters), where simple hyperplanes are fitted to the local data. Thus local linear models construct local approximations to nonlinear mappings. As the number of training patterns tends to infinity, partition based estimates of the regression function converge to the true function and the MSE of mapping converges to that corresponding to Bayes rules [36], [37]. Thus PLNs are consistent nonparametric estimators, and feature selection based on the PLN model should be more accurate. Some

5 LI et al.: FEATURE SELECTION USING A PIECEWISE LINEAR NETWORK 1105 successful applications of these kinds of local processors can be found in [35] and [38] [41]. The regression model (8) for the PLN can be rewritten as (29) where the superscript denotes to which the input vector is assigned. For each cluster, we apply the modified Gram Schmidt procedure to the data belonging to that cluster, yielding (30) If the output systems have similar dynamics, we could use the same feature space partition for all the systems. Suppose we initially partition the feature space into clusters, the floating search algorithm based on the PLN model is defined the same as in Section III except that the fitness definitions (23) through (24) should be modified as follows. 1) The individual fitness of one feature is 2) The total fitness of is (31) (32) There are two important issues to be noted for this algorithm. First, we need to determine an appropriate number of clusters for partitioning the feature space. Second, in the algorithm the autocorrelation and cross-correlation matrices for the clusters are calculated only once and are used in the whole search procedure without being recalculated. By using one data pass we significantly reduce the computational load. However, using the same matrices for different size feature subsets implies that we keep the partition unchanged while the feature subset size changes. This could have some deleterious effects on the selection result, and Section IV-B discusses these two issues. B. Number of Clusters and Partition of Feature Space Determining the number of clusters in a PLN for a given set of data is a model validation or model selection problem. Model selection is a difficult task because it is not possible just simply to choose the model that fits the data best: More complex models always fit data better, but bad generalizations often result. Bayesian methods [42] employ Occam s Razor to penalize unnecessarily complex models in the selection procedure. Growing methods [15], [43], [16], pruning methods [44], [45], [12] [14], Akaike s information criterion [46], and kernel principal component analysis (kernel-pca) [47] also have been investigated for model selection. In this paper, we utilized a crossvalidation (CV) method to determine an appropriate number of PLN clusters for a given data set. We selected a model based on a curve produced in the CV procedure. Initially, the feature space of the training data set is partitioned into a large number of clusters using a self-organizing-map (SOM) [48]. For each cluster, a linear regression model is then designed, and the total unexplained variance (training error) is calculated. The trained PLN model is then applied to the validation data set to get its validation error. Our goal is to find the PLN structure such that its validation error reaches the minimum. A cluster is pruned if its elimination leads to the smallest increase of the training error, and the remaining local linear models are redesigned if necessary. The pruning procedure continues till only one cluster remains. Finally, we produce curves of the training and validation errors versus the number of clusters. We find the minimum value on the validation error curve, and the number of clusters corresponding to the minimum is chosen for the PLN model. C. 
Effect of the Fixed Feature Space Partition In the feature selection procedure, once the number of clusters is determined, the algorithm uses the SOM to partition the feature space and accumulates the autocorrelation and cross-correlation matrices for each cluster. These matrices remain unchanged during the whole feature search procedure. This implies that the algorithm uses the initial partition, which involves all features, for any feature subspace. The advantage of doing this is that we can significantly reduce the computational load of the algorithm. One could repartition the subfeature space and recalculate the autocorrelation and cross-correlation matrices for each feature subset, which may produce a more accurate estimate of the unexplained variance for the selected features, but this is not feasible for data with a large number of features. However, our approach may produce an optimistic estimate of the unexplained variance, for a small feature subset. D. Algorithm Description Assuming that we have a training data set and a validation data set, we describe the proposed floating-ols algorithm as follows. 1) Initialize, the number of features that need to be selected, as. could be set to a small number in the case that is large. 2) Determine the number of clusters for the PLN model using the method described in Section IV-B. 3) Design an -cluster PLN model for the training data, and accumulate autocorrelation and cross-correlation matrices for each of the clusters. 4) Use the piecewise linear orthonormal floating search algorithm to find the most important features from the available features. The first two features are selected by the forward OLS procedure, based on the fitness value calculated using (27). In the conditional deletion and the continuation of the conditional deletion procedures of the floating search algorithm (Appendix II), the fitness value of one feature

6 1106 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 in the currently selected feature set is calculated by (25), and the fitness of a set of features is calculated by (32). The piecewise orthonormal floating search algorithm continues till the most significant features are selected. 5) Apply the PLN model on the validation data set to get validation errors for each of the selected feature subset. 6) The feature subset which has the minimum validation error is determined as the final selected feature subset. In the case that multiple subsets have validation errors close to the minimum validation error, the feature subset that has the smallest size but whose validation error is not bigger than 105% of the minimum validation error is chosen as the final selected feature subset. Remarks: 1) in this algorithm, all the searching efforts are handled with autocorrelation and cross-correlation matrices of the PLN model; the original data is not used; 2) a linearly dependent feature will be assigned zero-valued weights, and, therefore, its fitness value is zero; 3) in step 6), we heuristically choose the final selected feature subset based on its validation error, i.e., its validation error is not bigger than 105% of the minimum validation error. Optimal stopping criteria are important for feature selection, but this topic is out of the scope of this paper. Other stopping criteria can be used for the developed algorithm. V. RELATED WORK There are many feature or variable selection algorithms in the literature [3]. In this section, we discuss some connections between the proposed algorithm and previous ones. A. Correlation Criteria The Pearson correlation coefficient between feature the desired output is defined as and (33) where denotes the covariance and the variance. This coefficient is also the cosine between and if they have been centered. In a linear regression, the square of is the variance of the th output explained by the th feature. The use of can be extended to the case of two-class classification [49]. For our case, we have used the modified Gram Schmidt procedure to calculate output variances explained by features in an efficient way. Our method is equivalent to the general correlation criteria between outputs and the decorrelated features, and the PLN was used to handle nonlinear regressions. The interactions among features have also been taken into account in the floating search. Note that our algorithm is readily extended to the classification case. B. Gram Schmidt Procedure for Classification Mao [9], [10] proposed an orthogonal forward feature selection algorithm using the Gram Schmidt transformation for classification tasks. The motivation of employing the orthogonal decomposition of features is to alleviate the correlation among features, because the orthogonal procedure can decorrelate features and the features can be selected independently. The criterion he used is the Mahalanobis distance measure [10] (34) where is the mean vector of data samples in class., and and are the covariance matrices of class and class, respectively. Under the orthogonality condition, the covariance matrix is diagonal and the Mahalanobis distance is easy to calculate. They selected feature subsets by a similar orthogonal forward procedure, where one feature was selected if it had the largest Mahalanobis distance value. 
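A compact sketch of such an orthogonal forward procedure for a two-class problem is shown below; the per-feature separation score (m1 - m2)^2 / (s1^2 + s2^2) used here is a simplified stand-in for the Mahalanobis criterion (34), which becomes diagonal once the features have been decorrelated, and the variable names are illustrative rather than Mao's.

```python
import numpy as np

def orthogonal_forward_classification(X, labels, n_select):
    """Forward feature selection for two classes on Gram-Schmidt-decorrelated features.

    X: (Nv, N) feature matrix; labels: (Nv,) array of 0/1 class labels.
    """
    Nv, N = X.shape
    selected, basis = [], []
    for _ in range(n_select):
        best_idx, best_score, best_q = None, -np.inf, None
        for k in range(N):
            if k in selected:
                continue
            v = X[:, k].astype(float).copy()
            for q in basis:                      # decorrelate against the chosen features
                v -= (q @ v) * q
            norm = np.linalg.norm(v)
            if norm < 1e-10:
                continue
            q_new = v / norm
            a, b = q_new[labels == 0], q_new[labels == 1]
            score = (a.mean() - b.mean())**2 / (a.var() + b.var() + 1e-12)
            if score > best_score:
                best_idx, best_score, best_q = k, score, q_new
        if best_idx is None:
            break
        selected.append(best_idx)
        basis.append(best_q)
    return selected
```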
Mao showed that if the correlations between candidate features are trivial, employing the orthogonal transform does not make much difference, but the orthogonal algorithm provides improvements if severe correlations exist. Other criteria, including the Bhattacharyya distance and the Chernoff probability measure, can be used in the orthogonal procedure as well [9]. This method belongs to the filter category, where no actual classifier is involved in the selection procedure. The advantage of using an orthogonal process is that it decorrelates the original features so that they can be selected independently. Also, physically meaningless features in the Gram-Schmidt space can be linked back to the same number of variables in the original feature space, which makes the approach suitable for feature selection. However, since the nesting effect results from correlations among features as well as correlations between outputs and features, decorrelation of the features does not necessarily solve the nesting effect, because it does not take into account the correlations of the features with the outputs. This phenomenon is illustrated in Example 1, where our proposed algorithm correctly solves the problem.

C. Gram-Schmidt Procedure for Regression

The Gram-Schmidt procedure has also been used for regression problems. Stoppiglia et al. [28] use sequential orthogonal feature selection for linear regression tasks. If nonlinearity between inputs and outputs exists, Rivals et al. [27] first construct a polynomial of monomials of the inputs, up to a chosen degree, to explain this nonlinear relationship; for example, the polynomial of degree 2 involves a constant term, the linear terms, and the pairwise products of the inputs. However, not all of the terms are significant for explaining the output. A forward sequential orthogonal selection procedure based on the Gram-Schmidt procedure is then performed to select the most significant monomials for the final regression model. They handle the nonlinearity between inputs and outputs using high-degree polynomial terms, while in our method the nonlinearity is dealt with using a piecewise linear model.
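For the polynomial route just described, a degree-2 dictionary can be generated and then screened with a forward orthogonal selector such as the forward_ols_select sketch of Section II-B; the helper below only builds the monomials (constant, linear, and pairwise-product terms), and its names are illustrative rather than taken from [27] or [28].

```python
import numpy as np
from itertools import combinations_with_replacement

def degree2_monomials(X):
    """Expand (Nv, N) inputs into all monomials of degree at most two.

    Returns the expanded matrix and a readable label for each column.
    """
    Nv, N = X.shape
    cols, names = [np.ones(Nv)], ["1"]                       # constant term
    for i in range(N):                                       # linear terms
        cols.append(X[:, i])
        names.append(f"x{i+1}")
    for i, j in combinations_with_replacement(range(N), 2):  # quadratic terms
        cols.append(X[:, i] * X[:, j])
        names.append(f"x{i+1}*x{j+1}")
    return np.column_stack(cols), names

# The expanded matrix can then be ranked by a forward orthogonal selection to keep
# only the monomials that significantly explain the output.
```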

D. Principal Component Regression and Partial Least Squares Regression

Principal component regression (PCR) and partial least squares regression (PLS) are two multivariate regression techniques used when the number of observations is small compared to the number of features, or when collinearity exists among the features [50], [51]. In such cases, ordinary multiple regression is not appropriate because overfitting is highly likely. Both PCR and PLS produce a factor score matrix as a linear combination of the original feature set, as in (35), where the transformation matrix is called the loading and is an orthogonal or orthonormal matrix. After the decomposition, a linear regression from the factor scores to the outputs is performed as in (36). Once the corresponding regression weights are computed, the regression model (36) is equivalent to (37). The columns of the score matrix are called latent factors. If the number of extracted latent factors is greater than or equal to the rank of the original feature space, both PCR and PLS are equivalent to ordinary multiple regression. PCR and PLS differ in the way they extract the latent factors: PCR chooses the factors to explain as much of the feature variance as possible, even though they may not be relevant to the outputs; it is usually performed by a singular value decomposition (SVD) of the centered feature matrix. On the other hand, PLS chooses the factors to explain as much of the covariance between the features and the outputs as possible, which is based on the SVD of their cross-product when both are centered. Both PCR and PLS are feature extraction methods, where there is no clear physical meaning in the extracted feature space, since each latent factor is a linear combination of all the original features. However, feature selection algorithms using the Gram-Schmidt procedure have a clear physical meaning, because the orthogonal features are linear combinations of the features that have already been selected, and they are readily transformed back to the original feature space.

VI. PREDICTED MSE FOR AN EQUIVALENT MLP

The MLP with nonlinear units in a single hidden layer has been established as a universal function approximator [52], [53]. Even though it has been shown that the PLN model can train well, the resulting mapping is discontinuous at the boundaries between adjacent clusters. When a discontinuous mapping is unacceptable, we may use a global network such as the MLP. Chandrasekaran et al. [54] proposed a method for sizing the MLP based on the assumption that a PLN with the same theoretical pattern storage capacity as the MLP will have the same training error. For comparison purposes, we implement an MLP-based feature selection algorithm in this paper. As a heuristic, we choose the number of hidden units in the MLP-based feature selection algorithm so that its pattern storage capacity is the same as that of the PLN model found for the data set.

A. Pattern Storage of PLN and MLP

A linear network can memorize a number of patterns equal to the number of parameters used to calculate each output. The pattern storage capacity of the PLN is the number of clusters multiplied by the storage capacity per cluster, as in (38). It is well known that the Volterra filter has a storage capacity equal to the number of coefficients associated with one output. The MLP's pattern storage capacity is bounded above by the ratio of the total number of free parameters in the network to the number of outputs, and it has been shown that this bound is fairly tight [55]. Therefore, this bound can be taken as the MLP's storage capacity, as in (39).

B. Equivalent MLP

Equating (38) and (39) yields (40). This formula helps us to estimate the number of hidden units in an MLP equivalent to a PLN with a given number of clusters, and it can be divided into two cases. 1) One output: In this case, (40) reduces to a simpler expression. 2) Multiple outputs: In this case, overly large MLP networks result when the desired outputs are correlated and many free parameters are redundant.
An SVD technique is employed to detect whether outputs can be compressed without significantly degrading the MSE performance. Compressing the outputs, we can predict a smaller, less complex MLP. The resulting MSE after we compress the outputs down to is given by (41) where is the th singular value of the desired outputs covariance matrix. VII. SIMULATION STUDIES We compared our proposed algorithm with four other algorithms on an artificial data set having one output and with three algorithms on two real data sets having multiple outputs. In the experiments, we divided each data set into a training set, a validation set, and a testing set. The training and validation sets were used for feature selection. The testing set was used to evaluate the selected feature subsets by the ten-fold CV method, using MLPs trained on the feature subsets. We used the paired test to verify if the testing errors for different feature subsets were significantly different in the CV. In this section, we first introduce four additional feature selection algorithms and then give three examples.
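The evaluation protocol can be sketched as follows (a simplification, not the paper's code): both feature subsets are scored with the same ten folds, a ridge-regularized linear model stands in for the MLPs actually trained in the experiments, and SciPy's paired t-test decides significance at the 95% level used in the text.

```python
import numpy as np
from scipy.stats import ttest_rel

def _fold_mse(Xtr, Ytr, Xte, Yte, ridge=1e-3):
    """Test MSE of a ridge linear model (bias column appended) on one fold."""
    Xtr = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    Xte = np.hstack([Xte, np.ones((len(Xte), 1))])
    W = np.linalg.solve(Xtr.T @ Xtr + ridge * np.eye(Xtr.shape[1]), Xtr.T @ Ytr)
    return np.mean((Yte - Xte @ W) ** 2)

def compare_subsets(X, Y, subset_a, subset_b, n_folds=10, seed=0):
    """Ten-fold CV MSEs of two feature subsets and a paired t-test at the 95% level."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    mse_a, mse_b = [], []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        for subset, out in ((subset_a, mse_a), (subset_b, mse_b)):
            out.append(_fold_mse(X[train_idx][:, subset], Y[train_idx],
                                 X[test_idx][:, subset], Y[test_idx]))
    _, p_value = ttest_rel(mse_a, mse_b)
    return np.mean(mse_a), np.mean(mse_b), p_value < 0.05   # True if significantly different
```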

8 1108 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 A. Feature Selection Algorithms The algorithms compared include two PLN model-based algorithms, where the OLS procedure is used. The third algorithm is an MLP-based feature selection algorithm, and the fourth algorithm is based on the support vector machine (SVM). 1) Search by Importance: There are many feature selection algorithms that are based on weight magnitudes in a trained network [23]. We implemented such method, which we denote the importance method, based upon the PLN model. We first designed the PLN model for the given data, where the number of clusters was determined by CV as described in Section IV-B. The importance of a feature was calculated as the magnitude summation of the weights from the feature to all outputs among all clusters in the PLN. The selection order of features was based on their importance values, with the most important feature selected first. For each cluster, sets of linear equations were solved for weights, using the conjugate gradient method. Once the feature order was calculated, we made all features orthonormal to each other according to the selection order and calculated the training and validation regression errors for each feature subset. The regression error was defined as the standard MSE. 2) Forward-OLS: The forward-ols feature selection algorithm is the same as floating-ols, except that forward-ols uses the forward OLS search (see Section II-B) based on the PLN model. Both the importance and forward-ols algorithms search feature subsets based on the training data, and the weights for the orthonormal system are transferred back using (18). Networks using the chosen subsets are then applied to the validation data. Both algorithms use the same method as the floating-ols algorithm (see Section IV-B) to determine the size of the final selected feature subset. 3) Leray: An MLP-Based Algorithm: Leray and Gallinari [22] proposed an MLP feature selection algorithm, denoted Leray, based on the optimal cell damage (OCD) algorithm [20]. Here, the decision on pruning a weight is made according to a relevance criterion often named the weight saliency. The weight is pruned if it has a low salience. The saliency of the th feature is defined as the saliency summation of all the weights connected to the th feature Saliency Saliency (42) where denotes all possible weights connecting the th feature to either a hidden unit or an output. Using an order two Taylor expansion of the MSE and a diagonal approximation for the Hessian matrix, the saliency of the th feature is defined as Saliency Leray prunes one feature at a time, and the MLP is retrained after each deletion. For a stopping criterion, Leray uses a variation of the selection according to an estimate of the generalization error method, which estimates the generalization performance on a validation set. Since several subsets may have statistically similar performances, Leray uses the Fisher test to compare each model s performance with that of the best model. Then it chooses the smallest feature subset whose performance is similar to the best one. For a fair comparison in this paper, we used the same method as the floating-ols algorithm to determine the final feature subset. We compared the validation MSE of any feature subset with that of the best feature subset, and chose the smallest feature subset whose MSE was not bigger than 105% of that for the best feature subset. 
We selected this algorithm for comparison because it outperforms many other MLP-based algorithms, as reported in [22].

4) An SVM-Based Algorithm: Bi et al. [56] proposed a dimensionality reduction methodology via sparse SVM to perform variable ranking and selection, and to construct a final nonlinear model for the data. The method exploits the fact that a linear SVM with 1-norm regularization inherently performs variable selection, and the distribution of the linear model weights provides a mechanism for ranking and interpreting the effects of variables. This algorithm was particularly designed for systems with one output. In such situations, we find a function that minimizes the regularized risk functional [56] in (43), which combines a loss function of the desired output for each pattern with a regularization operator weighted by the regularization parameter. In [56], the regularized risk functional is defined as in (44). By minimizing (44), a linear model with a weight vector and a threshold is constructed. After the linear model is constructed, features are ranked by their weight magnitudes. To reduce the weight variability, the algorithm runs several times and the feature ranking is obtained based on the average weight magnitudes. Three random features are added to the data, and the average of their weight magnitudes is used as a threshold for determining the final selected features. The features whose weight magnitudes are bigger than this threshold are included in the final model.

B. Example 1

Toydata: This is an artificial data set which contains six features (x1 through x6) and one output defined by (45), where x1 through x5 are uniformly distributed on [0, 1], x1 and x4 are identical, all other features are independent of each other, and the additive output noise is white Gaussian with zero mean and unit variance. Note that x6 is independent of the output.

Fig. 1. Regression errors of the PLN on the training and validation toydata sets.
TABLE I. Average weight magnitudes for the toydata by the SVM-based algorithm.
TABLE II. Ten-fold CV results on the testing toydata for the final selected feature subsets.
TABLE III. Ten-fold CV results on the testing toydata set if the feature subset size is fixed at 3.

We generated samples for both the training and validation sets, respectively, and samples for the testing set. Using the method described in Section IV-B, we first produced PLN training and validation error curves on the training set and validation set, respectively, and plotted them in Fig. 1. We observed that the number of clusters corresponding to the minimum of the validation error was 21, with a validation error close to the unit noise variance, so the 21-cluster PLN is a good model for this nonlinear system. We then ran the three PLN-based algorithms for feature selection, where the PLN had 21 clusters. We also ran the Leray algorithm on this data, where the MLP used 20 hidden units, as calculated using (40). To run the SVM-based algorithm, the training and validation data sets were combined and repartitioned randomly in each of ten runs. The average weight magnitudes over the ten runs are shown in Table I. The final model was determined to contain five features, with the random feature (the sixth) being excluded. Although the weight magnitudes for the two identical features (the first and the fourth) are different, feature one was not successfully excluded from the final model. The final feature subsets selected by the five algorithms are shown in Table II. We used ten-fold CV on the testing set to evaluate their performances. The equivalent MLPs we used in the ten-fold CV all had 20 hidden units, calculated using (40). Paired t-tests (at a 95% confidence level) of the CV results showed that the final subsets selected by the five algorithms perform statistically similarly. However, if the feature size is fixed at 3, which is the correct feature size for toydata, Table III shows the testing errors of the CV results for the five feature subsets selected by the algorithms. The minimum MSE is shown in bold.
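To make the structure of this data set concrete, the sketch below generates data with the stated properties: x4 is an exact copy of x1, x6 is independent of the output, and unit-variance white Gaussian noise is added to the target. The target function f, the uniform distribution assumed for x6, and the sample count are hypothetical stand-ins, since (45) itself and the exact set sizes are not reproduced above.

```python
import numpy as np

def make_toydata_like(n_samples, seed=0):
    """Generate a data set with the toydata's qualitative structure."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 6))   # distribution of x6 assumed uniform here
    X[:, 3] = X[:, 0]                                # x4 is identical to x1
    # Hypothetical target using features 1, 2, and 5; the paper's (45) is not recoverable.
    f = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + X[:, 4]
    y = f + rng.standard_normal(n_samples)           # white Gaussian noise, zero mean, unit variance
    return X, y                                      # x6 (column 5) never enters the target

X_train, y_train = make_toydata_like(2000)           # illustrative sample count
```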

10 1110 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 Fig. 2. Ten-fold CV results on the testing toydata set for all size of feature subsets selected by the five algorithms. Paired -tests were performed between the minimum MSE and the other MSEs. A sign indicates that the MSE is significantly different from the corresponding minimum MSE at the 95% confidence level. The results clearly show that the feature subset selected by the floating-ols algorithm is significantly better than those of the other algorithms. The CV MSE is very close to the noise variance, which means that features one, two, and five are adequate to explain most of the variance for the output. For this data set, only the floating-ols algorithm selected the correct compact feature subset for the regression task. Feature subsets selected by the other algorithms contain unnecessary features, especially the identical feature four and feature one, though all finally selected subsets give statistically similar results. Fig. 2 shows the averages of ten-fold CV MSE results on the testing data set for all size of feature subsets selected by different algorithms, which gives a broad overview for each of the feature selection algorithm. It is clear that the floating-ols algorithm obtained the best or one of the best results for each case. We will mention that the SVM algorithm was specifically designed for a challenging problem in quantitative structural-activity relationships (QSAR) analysis with the goal of prediction the bioactivity of molecules. Each molecule has many potential features ( ) that may be highly correlated with each other or irrelevant to the desired output, and the feature size is much bigger than that of the sample size. There is an assumption that a linear model may be adequate for modeling the data. This may explain why the algorithm failed to identify the correct features for modeling the toydata. The SVM-based algorithm was designed for the one output system only. For a fair comparison, it will not be included in the multiple output data set experiments. C. Example 2 Twod: This training file is used in the remote sensing task of inverting the surface scattering parameters from an inhomogeneous layer above a homogeneous half space, where both interfaces are randomly rough [57]. The parameters to be inverted are the effective permittivity of the surface, the normalized root-mean-square (rms) height, the normalized surface correlation length, the optical depth, and single scattering albedo of an inhomogeneous irregular layer above a homogeneous half space from backscattering measurements. The data files contain 1238, 530, and 1000 patterns for training, validation, and testing, respectively. The features consist of eight theoretical values of backscattering coefficient parameters at V and H polarization and four incident angles. The outputs were the corresponding values of permittivity, upper surface height, lower surface height, normalized upper surface correlation length, normalized lower surface correlation length, optical depth, and single scattering albedo which had a jointly uniform probability density function (pdf). In this experiment, we tested wether the proposed feature selection algorithm was able to reject noise features. To this end, we added four noise features to the data sets with zero means and standard deviations of 1, 2, 3, and 4, respectively. The training, validation, and testing data sets thus had 12 features (9 12 features were noises) and seven outputs. 
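The noise-feature augmentation used in this experiment is straightforward to reproduce; the sketch below appends four zero-mean Gaussian columns with standard deviations 1 through 4 to an existing feature matrix (the array name is illustrative).

```python
import numpy as np

def add_noise_features(X, stds=(1.0, 2.0, 3.0, 4.0), seed=0):
    """Append zero-mean Gaussian noise features with the given standard deviations."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], len(stds))) * np.asarray(stds)
    return np.hstack([X, noise])     # original features first, the noise features last

# X_aug = add_noise_features(X_twod)   # 8 real features become 12 columns in total
```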
The number of clusters in the PLN model was determined as 14. We thus ran the three PLN based feature selection algorithms with a 14-cluster PLN model. We also ran the Leray algorithm with 13 hidden units on this data. The number of hidden units was calculated using (40) with, because the seven outputs can be compressed down to one with less than 1% increase in training MSE of an equivalent MLP.
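The output-compression check behind this choice can be sketched as follows: the eigenvalues of the desired-output covariance matrix show how many principal components retain essentially all of the output variance. The 1% threshold mirrors the text, but the exact expression (41) for the resulting MSE is not reproduced here.

```python
import numpy as np

def compressible_output_dim(Y, tol=0.01):
    """Smallest number of principal output components that loses at most `tol` of the variance."""
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Yc / len(Yc)                     # covariance matrix of the desired outputs
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues in descending order
    kept = np.cumsum(eigvals) / eigvals.sum()     # cumulative fraction of output variance
    return int(np.searchsorted(kept, 1.0 - tol) + 1)

# If compressible_output_dim(Y_train) returns 1, the seven outputs behave like a single
# output for the purpose of sizing the equivalent MLP in (40).
```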

11 LI et al.: FEATURE SELECTION USING A PIECEWISE LINEAR NETWORK 1111 Fig. 3. Ten-fold CV results on the testing twod data set for all size of feature subsets selected by the four algorithms. TABLE IV TEN-FOLD CV RESULTS ON THE TESTING TWOD DATA SET IF FEATURE SUBSET SIZE IS FIXED AT 6 All four algorithms selected the same feature subset containing all the original features (they are all relevant to the outputs) and successfully rejected the four added noise features. Fig. 3 shows the averages of ten-fold CV MSE results on the testing data set for all size of feature subsets selected by different algorithms. It is clear from Fig. 3 that the floating-ols algorithm is the best algorithm in this example. It was outperformed only when the feature subset size is 1, where the Leray method was the best algorithm. Though there are other subsets (for example, size 5 and 7), where the floating-ols did not get the best results, the paired -test showed that the difference between the best algorithm and the floating-ols algorithm is not significant. For most of the other size feature subsets, the floating-ols algorithm outperformed the other three algorithms. Table IV shows that it is statistically better than the others if the feature subset size is fixed at 6. D. Example 3 Speech: The speech samples are first pre-emphasized and converted to the frequency domain by taking the discrete Fourier transform (DFT). Then they are passed through mel filter banks and the inverse DFT is applied to the output to get mel-frequency cepstrum coefficients (MFCC). Each of MFCC(n), MFCC(n)-MFCC(n-1), and MFCC(n)-MFCC(n-2) would have 13 features, which results in a total of 39 features. The desired outputs are likelihoods for the beginning, middle, and ends of 39 phonemes, resulting in 117 outputs. The data files contain 1405, 585, and 2039 patterns for training, validation, and testing, respectively. The number of clusters in the PLN model was determined to be eight. We thus ran the three PLN-based feature selection algorithms based on an eight-cluster PLN model. We also ran the Leray feature selection with an MLP having seven hidden units. Again, the number of hidden units for the MLP was calculated using (40), where, since the 117 outputs are highly correlated and can be compressed down to one with less than 1% increase in training MSE of an equivalent MLP. The sizes of the final selected feature subsets were determined to be 13, 12, 12, and 17 by the importance, forward-ols, floating-ols and Leray algorithms, respectively. Table V shows the ten-fold CV MSE and the paired -test results for evaluating the selected feature subsets and the full feature set using equivalent MLPs (all have seven hidden units). The paired test showed that there are no significant differences among the feature subsets selected by the three PLN based algorithms, and the feature subsets selected by the three PLN-based algorithms are statistically better than that by the Leray algorithm and the full feature set. It is clear that using all features in the MLP network for this data are not appropriate, since only 12 out of the 39 features are adequate to model the data. All the PLN-based algorithms selected about 12 features for the final feature subset, and they performed better than the full feature set. The Leray

12 1112 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER 2006 Fig. 4. Ten-fold CV results on the testing speech data set for all size of feature subsets selected by the four algorithms. TABLE V TEN-FOLD CV RESULTS ON THE TESTING SPEECH DATA SET FOR THE FINAL SELECTED FEATURE SUBSETS TABLE VI CPU TIME NEEDED FOR THE FOUR ALGORITHMS algorithm kept 17 features and performed worse than the other algorithms. Fig. 4 shows the average ten-fold CV MSE results on the testing data set for all feature subset sizes. Since the size of the final feature subsets selected by different algorithms were less than 20, we ran the ten-fold CV only for these relevant feature subsets with a size of less than 20. It is clear that the floating-ols algorithm is one of the best algorithms compared to others, and the Leray algorithm is the worst. For all the feature subsets, the forward-ols algorithm is as good as the floating-ols algorithm which is a sign that a forward-ols search may be adequate for this highly correlated data. E. Computational Efficiency We compared the computational efficiencies of the four algorithms in terms of the CPU times used in each experiment. Table VI shows the CPU time used by the four algorithms in each experiment. The importance and forward-ols algorithms used similar amount of CPU time in each experiment. The floating-ols algorithm needed less than one more second for the first two experiments and 8 s more for the third experiment to complete the search tasks. The Leray algorithm had a much longer computation time than the other three. It needed at least six times more CPU time than the others. For example, in Experiment 2, it cost 104 s of CPU time while the PLN-based algorithms needed about 17 s. VIII. CONCLUSION We have developed a novel feature selection approach for nonlinear regression problems. The algorithm first determines an appropriate PLN model for the given data by cross validation. It then applies the OLS procedure to the PLN. Finally, useful features are chosen by the floating search. The nesting effect associated with step forward or step backward search is prevented. The fitness of a feature subset was calculated in the OLS procedure during the selection process based on the autocorrelation and cross-correlation matrices of the original data. This makes the algorithm very efficient because it passes through the data only once to accumulate the correlation matrices. The floating search algorithm always finds the best or one of the best feature subsets, among those selected by the other compared algorithms. The contributions of this paper are two fold: The way nonlinearities are handled and the usage of the floating search

in the OLS procedure. To our knowledge, this is the first time that the floating search has been utilized in an OLS-based feature selection algorithm. The examples show that it successfully detects the interactions among features and those between the features and the outputs.

APPENDIX I
THE MODIFIED GRAM-SCHMIDT PROCEDURE

The normal or modified Gram-Schmidt procedure [32] is a recursive process that requires scalar products between raw basis functions and orthonormal basis functions. The disadvantage of this procedure is that one pass through the training data is required to obtain each new basis function. In the following, a more useful form of the Schmidt process is reviewed, which enables us to express the orthonormal system in terms of autocorrelation elements. Rewrite (9) as (46); since A is an upper triangular matrix, its inverse is also upper triangular. Define the transformation coefficients in (47). From (46) and (47), the mth orthonormal basis function can be expressed as in (48); the first basis function is obtained as in (49), with the coefficient in (50), where (51) is the autocorrelation of the raw features. For values of m between 2 and N + 1, the coefficients are first found as in (52), then obtained as in (53), and finally the new coefficients for the mth basis function are found as in (54). Using (17), the weights for the orthonormal system can be obtained as in (55), where the cross-correlation matrix is defined in (56); specifically, the weights from the mth basis to the pth output can be expressed as in (57). If the mth feature is linearly dependent on the previous basis functions, we simply obtain all zero-valued weights for that feature; this eliminates the numerical problem, and the linearly dependent feature will not contribute to the explanation of the output variance. Using (18), the weights for the original system can be readily found as in (58). This process enables us to calculate all of the orthonormal bases in terms of the autocorrelation matrix and the cross-correlation matrix, which can be obtained by passing through the data only once before the orthonormal process.

APPENDIX II
DESCRIPTION OF THE FLOATING SEARCH ALGORITHM

The following is a description of the floating search feature selection algorithm [24]. Initialize the selected subset, and use the forward least squares method to form the first two feature subsets. Suppose we have already selected k features from the N available features. The fitness value and corresponding members for each subset size have been stored, and we then do as follows.

14 1114 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 5, SEPTEMBER ) Adding one feature: Find the most significant feature, say, in the set of with respect to using (28), yielding (59) 2) Conditional deletion: Using (26), we find the least significant feature, say, in the set of. Then, increment as, and (60) return to step 1). However, if is the least significant feature in the set of, then delete from and form a set of as Update as (61) (62) and return to step 1). Otherwise, go to step 3). 3) Continuation of the conditional deletion: Find the least significant feature, say, in the set. If, then set, update using (62), and return to step 1). Otherwise, delete from to form a new set, and set. If, set and and return to step 1). Otherwise, repeat step 3). ACKNOWLEDGMENT The authors would like to thank Dr. J. Bi for useful discussions and the result she provided for the SVM-based feature selection algorithm. REFERENCES [1] A. D. Back and T. P. Trappenberg, Selecting inputs for modelling using normalized higher order statistics and independent component analysis, IEEE Trans. Neural Netw., vol. 12, no. 3, pp , May [2] R. Kohavi and G. John, Wrappers for feature subset selection, Artif. Intell., vol. 97, no. 1 2, pp , [3] I. Guyon and A. Elisseeff, An introduction to variable feature selection, J. Mach. Learn. Res., vol. 3, pp , [4] T. W. S. Chow and D. Huang, Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information, IEEE Trans. Neural Netw., vol. 16, no. 1, pp , Jan [5] V. Sindhwani, S. Rakshit, D. Deodhare, D. Erdogmus, J. Principe, and P. Niyogi, Feature selection in MLPs and SVMs based on maximum output information, IEEE Trans. Neural Netw., vol. 15, no. 4, pp , Jul [6] N. Kambhatla and T. K. Leen, Dimension reduction by local principal component analysis, Neural Comput., vol. 9, no. 7, pp , [7] J. T. Kwok and I. W. Tsang, The pre-image problem in kernel methods, IEEE Trans. Neural Netw., vol. 15, no. 6, pp , Nov [8] M. D. Plumbley and E. Oja, A nonnegative PCA algorithm for independent component analysis, IEEE Trans. Neural Netw., vol. 15, no. 1, pp , Jan [9] K. Z. Mao, Fast orthogonal forward selection algorithm for feature subset selection, IEEE Trans. Neural Netw., vol. 13, no. 5, pp , Sep [10] K. Z. Mao, Orthogonal forward selection and backward elimination algorithms for feature subset selection, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp , Feb [11] R. Caruana and V. De Sa, Benefitting from the variables that variable selection discards, J. Mach. Learn. Res., vol. 3, pp , [12] Ponnapalli, A formal selection and pruning algorithm for feedforward artificial network optimization, IEEE Trans. Neural Netw., vol. 10, no. 4, pp , Jul [13] L. K. Kansen and C. E. Rasmussen, Pruning from adaptive regularization, Neural Comput., vol. 6, no. 6, pp , [14] R. Reed, Pruning algorithms A survey, IEEE Trans. Neural Netw., vol. 4, no. 5, pp , Sep [15] S. E. Fahlman and C. lebiére, The cascade correlation learning architecture, in Advances in Neural Information Processing Systems2, 1993, pp , San Mateo, CA: Morgan Kaufmann. [16] T. Y. Kwok and D. Y. Yeung, Constructive algorithms for structure learning in feedforward neural networks for regression problems, IEEE Trans. Neural Netw., vol. 8, no. 3, pp , May [17] I. Rivals and L. Personnaz, Neural-network construction and selection in nonlinear modeling, IEEE Trans. Neural Netw., vol. 14, no. 4, pp , Jul [18] G. B. Huang, P. Saratchandran, and N. 
ACKNOWLEDGMENT

The authors would like to thank Dr. J. Bi for useful discussions and for the result she provided for the SVM-based feature selection algorithm.

REFERENCES

[1] A. D. Back and T. P. Trappenberg, Selecting inputs for modelling using normalized higher order statistics and independent component analysis, IEEE Trans. Neural Netw., vol. 12, no. 3, pp. , May.
[2] R. Kohavi and G. John, Wrappers for feature subset selection, Artif. Intell., vol. 97, no. 1-2, pp. .
[3] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., vol. 3, pp. .
[4] T. W. S. Chow and D. Huang, Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information, IEEE Trans. Neural Netw., vol. 16, no. 1, pp. , Jan.
[5] V. Sindhwani, S. Rakshit, D. Deodhare, D. Erdogmus, J. Principe, and P. Niyogi, Feature selection in MLPs and SVMs based on maximum output information, IEEE Trans. Neural Netw., vol. 15, no. 4, pp. , Jul.
[6] N. Kambhatla and T. K. Leen, Dimension reduction by local principal component analysis, Neural Comput., vol. 9, no. 7, pp. .
[7] J. T. Kwok and I. W. Tsang, The pre-image problem in kernel methods, IEEE Trans. Neural Netw., vol. 15, no. 6, pp. , Nov.
[8] M. D. Plumbley and E. Oja, A nonnegative PCA algorithm for independent component analysis, IEEE Trans. Neural Netw., vol. 15, no. 1, pp. , Jan.
[9] K. Z. Mao, Fast orthogonal forward selection algorithm for feature subset selection, IEEE Trans. Neural Netw., vol. 13, no. 5, pp. , Sep.
[10] K. Z. Mao, Orthogonal forward selection and backward elimination algorithms for feature subset selection, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. , Feb.
[11] R. Caruana and V. De Sa, Benefitting from the variables that variable selection discards, J. Mach. Learn. Res., vol. 3, pp. .
[12] Ponnapalli, A formal selection and pruning algorithm for feedforward artificial network optimization, IEEE Trans. Neural Netw., vol. 10, no. 4, pp. , Jul.
[13] L. K. Hansen and C. E. Rasmussen, Pruning from adaptive regularization, Neural Comput., vol. 6, no. 6, pp. .
[14] R. Reed, Pruning algorithms - a survey, IEEE Trans. Neural Netw., vol. 4, no. 5, pp. , Sep.
[15] S. E. Fahlman and C. Lebiere, The cascade correlation learning architecture, in Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann, 1993, pp. .
[16] T. Y. Kwok and D. Y. Yeung, Constructive algorithms for structure learning in feedforward neural networks for regression problems, IEEE Trans. Neural Netw., vol. 8, no. 3, pp. , May.
[17] I. Rivals and L. Personnaz, Neural-network construction and selection in nonlinear modeling, IEEE Trans. Neural Netw., vol. 14, no. 4, pp. , Jul.
[18] G. B. Huang, P. Saratchandran, and N. Sundararajan, A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation, IEEE Trans. Neural Netw., vol. 16, no. 1, pp. , Jan.
[19] F. J. Maldonado and M. T. Manry, Optimal pruning of feed-forward neural networks based upon the Schmidt procedure, in 36th Asilomar Conf. Signals, Systems and Computers, 2002, pp. .
[20] T. Cibas, F. F. Soulié, P. Gallinari, and S. Raudys, Variable selection with neural networks, Neurocomput., vol. 12, pp. .
[21] K. L. Priddy, S. K. Rogers, D. W. Ruck, G. L. Tarr, and M. Kabrisky, Bayesian selection of important features for feedforward neural networks, Neurocomput., vol. 5, no. 2-3, pp. .
[22] P. Leray and P. Gallinari, Feature selection with neural networks, Behaviormetrika, vol. 26, Jan.
[23] I. V. Tetko, A. E. P. Villa, and D. J. Livingstone, Neural network studies. 2. Variable selection, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. .
[24] P. Pudil, J. Novovičová, and J. Kittler, Floating search methods in feature selection, Pattern Recognit. Lett., vol. 15, pp. .
[25] P. M. Narendra and K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans. Comput., vol. C-26, no. 9, pp. , Sep.
[26] S. D. Stearns, On selecting features for pattern classifiers, in Proc. 3rd Int. Conf. Pattern Recognition, pp. .
[27] I. Rivals and L. Personnaz, MLPs (mono-layer polynomials and multilayer perceptrons) for nonlinear modeling, J. Mach. Learn. Res., vol. 3, pp. .
[28] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar, Ranking a random feature for variable and feature selection, J. Mach. Learn. Res., vol. 3, pp. .
[29] S. Chen, X. Hong, and C. J. Harris, Sparse kernel regression modelling using combined locally regularised orthogonal least squares and D-optimality experimental design, IEEE Trans. Autom. Control, vol. 48, no. 6, pp. , Jun.
[30] S. Chen, X. Hong, C. J. Harris, and P. M. Sharkey, Sparse modelling using orthogonal forward regression with PRESS statistic and regularization, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 2, pp. , Apr.
[31] X. Hong and C. J. Harris, Variable selection algorithm for the construction of MIMO operating point dependent neurofuzzy networks, IEEE Trans. Fuzzy Syst., vol. 9, no. 1, pp. , Feb.
[32] S. Chen, S. A. Billings, and W. Luo, Orthogonal least squares methods and their application to non-linear system identification, Int. J. Control, vol. 50, no. 5, pp. .
[33] S. Chen, C. F. N. Cowan, and P. M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Trans. Neural Netw., vol. 2, no. 2, pp. , Mar.

[34] S. Chen, Locally regularised orthogonal least squares algorithm for the construction of sparse kernel regression models, in Proc. 6th Int. Conf. Signal Processing, 2002, vol. 2, pp. .
[35] S. A. Billings and W. S. F. Voon, Piecewise linear identification of nonlinear systems, Int. J. Control, vol. 46, pp. .
[36] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth.
[37] J. H. Friedman, Multivariate adaptive regression splines, Ann. Statistics, vol. 19, no. 1, pp. .
[38] J. S. Albus, A new approach to manipulator control: The cerebellar model articulation controller (CMAC), J. Dynam. Syst., Meas. Control, Trans. ASME, vol. 97, no. 3, pp. .
[39] J. S. Albus, Data storage in the cerebellar model articulation controller (CMAC), J. Dynam. Syst., Meas. Control, Trans. ASME, vol. 97, no. 3, pp. .
[40] K. K. Kim, A Local Approach for Sizing the Multilayer Perceptron, Ph.D. dissertation, Dept. Elect. Eng., Univ. Texas at Arlington, Arlington.
[41] S. Subbarayan, K. Kim, M. T. Manry, V. Devarajan, and H. Chen, Modular neural network architecture using piecewise linear mapping, in 30th Asilomar Conf. Signals, Systems and Computers, vol. 2, pp. .
[42] D. J. C. MacKay, Bayesian interpolation, Neural Comput., vol. 4, no. 3, pp. .
[43] F. L. Chung and T. Lee, Network-growth approach to design of feedforward neural networks, IEE Proc. Control Theory Appl., vol. 142, no. 5, pp. .
[44] Y. Hirose, K. Yamashita, and S. Hijiya, Back-propagation algorithm that varies the number of hidden units, Neural Netw., vol. 4, pp. .
[45] D. DeMers and G. Cottrell, Nonlinear dimensionality reduction, in Advances in Neural Information Processing Systems. San Mateo, CA: Morgan Kaufmann, 1993, pp. .
[46] N. Murata, S. Yoshizawa, and S. Amari, Network information criterion - determining the number of hidden units for an artificial neural network model, IEEE Trans. Neural Netw., vol. 5, no. 6, pp. , Nov.
[47] B. Schölkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press.
[48] T. Kohonen, Self-Organization and Associative Memory, 3rd ed. Heidelberg, Germany: Springer-Verlag.
[49] T. Furey, N. C. Duffy, N. Bednarski, D. M. Schummer, and D. Haussler, Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, vol. 16, pp. .
[50] A. R. McIntosh, F. L. Bookstein, J. V. Haxby, and C. L. Grady, Spatial pattern analysis of functional brain images using partial least squares, Neuroimage, vol. 3, pp. .
[51] H. Wold, Estimation of principal components and related models by iterative least squares, in Multivariate Analysis. New York: Academic Press, 1966, pp. .
[52] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall.
[53] M. Leshno, V. Lin, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw., vol. 6, no. 6, pp. .
[54] H. Chandrasekaran, K. K. Kim, and M. T. Manry, Sizing of the multilayer perceptron via modular networks, in Proc. Neural Networks for Signal Processing IX (NNSP 99), Madison, WI, Aug. 1999, pp. .
[55] A. Gopalakrishnan, M. S. Chen, X. Jiang, and M. T. Manry, Constructive proof of efficient pattern storage in the multilayer perceptron, in 27th Asilomar Conf. Signals, Systems and Computers, 1993, vol. 1, pp. .
[56] J. Bi, K. P. Bennett, M. Embrechts, C. M. Breneman, and M. Song, Dimensionality reduction via sparse support vector machines, J. Mach. Learn. Res., vol. 3, pp. , Mar.
[57] M. S. Dawson, A. K. Fung, and M. T. Manry, Surface parameter retrieval using fast learning neural networks, Remote Sens. Rev., vol. 7, no. 1, pp. 1-18.

Jiang Li (S'01-M'05) received the B.S. degree in electrical engineering from Shanghai Jiaotong University, Shanghai, China, in 1992, the M.S. degree in automation from Tsinghua University, Beijing, China, in 2000, and the Ph.D. degree in electrical engineering from the University of Texas at Arlington (UTA), Arlington. Currently, he holds a postdoctoral position at the National Institutes of Health, Bethesda, MD, where he designs algorithms for computer-aided colon cancer detection. His research interests include medical image processing, machine learning, and signal processing for communication. Dr. Li is a member of Sigma Xi.

Michael T. Manry was born in Houston, TX. He received the B.S., M.S., and Ph.D. degrees in electrical engineering in 1971, 1973, and 1976, respectively, from The University of Texas at Austin, Austin. After working there for two years as an Assistant Professor, he joined Schlumberger Well Services, Houston, TX, where he developed signal processing algorithms for magnetic resonance well logging and sonic well logging. He joined the Department of Electrical Engineering at the University of Texas at Arlington (UTA), Arlington, in 1982 and now holds the rank of Professor. In Summer 1989, he developed neural networks for the Image Processing Laboratory, Texas Instruments, Dallas, TX. His recent work, sponsored by the Advanced Technology Program of the state of Texas, E-Systems, Mobil Research, and NASA, has involved the development of techniques for the analysis and fast design of neural networks for image processing, parameter estimation, and pattern classification. He has served as a consultant for the Office of Missile Electronic Warfare at White Sands Missile Range, MICOM (Missile Command) at Redstone Arsenal, NSF, Texas Instruments, Geophysics International, Halliburton Logging Services, Mobil Research, and Verity Instruments.

Pramod L. Narasimha (S'04) received the B.E. degree in telecommunications engineering from Bangalore University, Bangalore, India, in 2000 and the M.S. degree in electrical engineering from the University of Texas at Arlington (UTA), Arlington. Currently, he is working toward the Ph.D. degree at UTA. He joined the Neural Networks and Image Processing Lab in the Electrical Engineering Department, UTA, as a Research Assistant. His research interests focus on neural networks, pattern recognition, and image and signal processing.

Changhua Yu received the B.S. degree in electrical engineering from Huazhong University of Science and Technology, Wuhan, China, in 1995, the M.Sc. degree in automation from Shanghai Jiaotong University, Shanghai, China, in 1998, and the Ph.D. degree in electrical engineering from the University of Texas at Arlington (UTA), Arlington. Currently, he is a member of the technical staff at Fastvdo LLC, Columbia, MD. His main research interests include neural networks, image processing, and pattern recognition.
