Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach
1 Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach Soha Ahmed 1, Mengjie Zhang 1, and Lifeng Peng 2 1 School of Engineering and Computer Science 2 School of Biological Sciences Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand {soha.ahmed,mengjie.zhang}@ecs.vuw.ac.nz, lifeng.peng@vuw.ac.nz Abstract. Biomarker discovery using mass spectrometry (MS) data is very useful in disease detection and drug discovery. The process of biomarker discovery in MS data must start with feature selection as the number of features in MS data is extremely large (e.g. thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP based approach works by using the features selected by two feature selection metrics, namely information gain (IG) and relief-f (REFS-F) in the terminal set. The feature selection performance of the proposed approach is examined and compared with IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. Meanwhile, GP is also used as a classification method in the experiments and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method can select a smaller number of features with better classification performance than IG and REFS-F using NB, SVMs and J48. In addition, GP as a classification method also outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets. 
1 Introduction

Mass spectrometry (MS) is a tool for measuring the molecular masses of chemical compounds, and these masses are necessary to identify the species of proteins and metabolites [1]. Inside the instrument, the sample molecules are converted to ions in the ionization source, and these ions are then passed to the mass analyzer to measure their mass-to-charge ratios (m/z). MS can work in either the full scan mode, which measures the parent ions' m/z (called the MS spectrum, which contains m/z ratios and their corresponding intensities), or the tandem MS mode, which selects and fragments the ion of interest from the parent ion list and subsequently records the m/z ratios of its daughter ions (called the MS/MS

L. Vanneschi, W.S. Bush, and M. Giacobini (Eds.): EvoBIO 2013, LNCS 7833, c Springer-Verlag Berlin Heidelberg 2013
spectrum, which contains the m/z ratios and their corresponding intensities of the fragment ions) to aid the elucidation of the structure of the molecules [1]. The MS machine is usually coupled with separation techniques, either gas chromatography (GC) or liquid chromatography (LC), at the front end to separate the different molecules before MS detection, reducing the complexity of the MS spectrum and increasing the detection coverage. Accordingly, the data acquired with an LC instrumental set-up is called the LC MS/MS spectrum, which contains the features of retention time, scan numbers, and m/z ratios with the corresponding intensities of the parent and fragment ions [1]. Due to the nature of MS data, a data set typically has a very small number of instances, but each instance is represented by a huge number of features (typically several thousand), which makes the classification of MS data a very challenging problem [2]. Several machine learning and optimisation techniques have been used for feature selection and classification of MS data. For example, principal component analysis and linear discriminant analysis [3, 4] and the random forest algorithm [5] have been used for feature extraction, dimensionality reduction and classification on MS data. Support vector machines (SVMs), k-nearest-neighbour and quadratic discriminant analysis [6] have been used for MS data classification. Genetic algorithms and the t-test [7] have also been used for feature selection and classification. SVM-RFE selects features by assigning a weight to each feature through training a linear SVM and removing the lowest-weight features; the resulting feature subsets are then passed to SVM classifiers to classify proteomics MS data sets [8]. However, many of these classification techniques cannot easily handle such a huge number of features, and feature selection and dimensionality reduction often need to take place before classification is performed.
However, the two separate processes are sometimes poorly connected; for example, the features selected by one method (e.g. decision trees) may not perform well for classification using a different method (e.g. SVMs). Genetic programming (GP) is one of the evolutionary computation algorithms [9]. It starts with a random initial population to search for a solution to a given task, and over a number of generations it modifies the initial solutions through a set of genetic operators guided by the fitness function [10]. GP has the capability to select features implicitly [11] by incorporating useful features during the evolution of programs. Only very recently have a small number of works used GP for feature selection and classification of bio-data (e.g. [12,13]). For example, GP has been used for peptide quantification of MS data and the measurement of protein quantities within a biological sample [12]. However, GP has seldom been used for feature selection and classification of MS data. The overall goal of this paper is to investigate a GP based approach to automatic feature selection and classification of MS data, which typically has a huge number of features and a small number of examples. To achieve feature selection, we embed two existing feature selection metrics, information gain (IG) [14] and relief-f (REFS-F) [15,16], into the GP system. The features selected
by GP will be compared with the original features and the features selected by the above two metrics alone on five MS data sets, using three common classifiers, namely NB, SVMs, and J48 decision trees (J48). Specifically, we investigate the following objectives: what primitives and fitness function can be used to develop this GP system; whether embedding multiple feature selection metrics into GP can improve the feature selection process; whether GP as a feature selection method can automatically select a small number of features that achieve better classification performance than using all the original features; whether the new set of features selected by GP outperforms the features selected by IG and REFS-F alone; and whether GP as a classification method outperforms the NB, SVMs and J48 classifiers. The remainder of the paper is organised as follows. Section 2 describes the data sets and the preprocessing steps. Section 3 describes the new GP approach. Section 4 presents the results with discussions. Section 5 presents the conclusions and the future work.

2 Data Sets and Preprocessing

Five MS data sets are used in the experiments, as shown in Table 1. They are the high resolution SELDI-TOF ovarian cancer data set and the low resolution SELDI-TOF ovarian cancer data set [17], the premalignant pancreatic cancer data set [2], the Arcene data set [18], and the spike-in LC MS/MS data set [19] from Georgetown University.

Table 1. Data Sets

Name of the Data Set             No. of Features   Size of Data Set
Premalignant Pancreatic Cancer
High Resolution Ovarian Cancer   15,
Low Resolution Ovarian Cancer    15,
Arcene                           10,
Spike-In                         10,

Data Preprocessing

Due to the nature (e.g. a lot of noise) and types (e.g. MS, LC MS/MS) of the MS data, different preprocessing methods are applied. We use the bioinformatics toolbox in Matlab [20] for this purpose.
Low and high resolution ovarian cancer data sets: To obtain the same m/z points across all MS spectra [17], the resampling algorithm in the toolbox is used. Then, the
Fig. 1. Preprocessing steps of the low resolution ovarian cancer data set. (a) and (b) represent the resampling and the baseline adjustment of the first signal, respectively. (c) is an example alignment of 6 samples. Finally, (d) represents the normalization of the samples using AUC.

baseline correction is adopted to subtract the background noise and remove the low intensity peaks. Firstly, the baseline is estimated by calculating the minimum within a window size of 50 m/z points for the high resolution data; for the low resolution data, the window size is 500 m/z points. Afterwards, the varying baseline is regressed and the resulting baseline is subtracted [17]. Due to mis-calibration of mass spectrometers, systematic shifts can appear in repeated experiments; therefore, the alignment of the spectrograms is the third step in the preprocessing framework. Finally, each spectrum is normalized using the area under the curve (AUC). Figure 1 shows the four preprocessing steps for the low resolution ovarian cancer data set as an example.

Premalignant pancreatic cancer data set: Similar to the previous two data sets, baseline adjustment, filtering and normalization are used here. The baseline is estimated by segmenting the whole spectrum into windows with a size of 200 m/z ratio intensities; the mean values of these windows are then used as the estimate of the baseline at that intensity. To perform regression, a piecewise cubic interpolation method is used [2]. After this step, noise is filtered using the
Fig. 2. Peak extraction from the raw data spectrum of the spike-in data set (extracted peaks above 0.75 x the base peak intensity for each scan; axes: mass/charge (m/z), relative ion intensity, retention time in seconds).

Gaussian kernel filter. Finally, normalization is performed using the area under the curve, where the maximum intensity value for each m/z ratio is rescaled to 100.

Arcene data set: This data set is available after preprocessing, where the m/z is limited to between 200 and 10,000 [18] by treating the m/z values under 200 and over 10,000 as noise. Afterwards, the technical repeats are averaged, the baseline is removed, and then smoothing and alignment take place. The preprocessing of these four MS data sets is low-level; therefore, the number of features remains the same after preprocessing.

Spike-in data set: This is an LC MS/MS data set, which consists of scan numbers, LC retention times, m/z ratios and the intensities corresponding to the m/z ratios [1]. Ideally, the same molecule detected in the same LC MS/MS run should have the same molecular weight, intensity and retention time, but due to variation in experimental conditions and the co-elution of molecular ions this is not always the case. Therefore, before computational analysis can take place, the data has to undergo different preprocessing steps from those for the full scan MS data. The first step is to extract peaks from the data by clustering peaks into significant and noisy peaks and removing the noisy peaks using the toolbox. Figure 2 shows the raw data and the data after peak extraction. The second step is to filter the peaks to further remove noise from each scan. The filtering is performed using a percentile of the base peak intensity, where the base peak is the most intense peak found in each scan. In order to produce the centroid data from the raw signal, the peak preserving resampling method is adopted.
Finally, peak alignment is used to remove fluctuations in the data. After the preprocessing, the number of features becomes 847 for this data set.

3 The Approach

There are many feature selection methods used for dimensionality reduction and improving classification accuracy. Each of these methods can select a different set of features according to the criteria of its selection process. Some of the features
Fig. 3. Overview of the approach

selected may be powerful, while other features may not be so relevant [21]. We hypothesize that combining the features of different feature selection metrics may improve the feature selection performance. Our aim is to use GP to guide the feature selection process by combining two well-known feature selection metrics, IG and REFS-F, to produce a new, smaller set of features that effectively improves the power of the selected features in terms of classification accuracy. The two metrics are chosen due to their wide application in the literature [14,15] and also because they show distinct characteristics. Tree-based GP [11] was used here, and the proposed method has three steps: (1) we use the two feature ranking techniques, IG and REFS-F, to rank the features; (2) the top features ranked by the two metrics are used as terminals of the GP method, where the intrinsic capability of GP is used to search for good combinations from those features to form a (hopefully) better set of features; (3) the features selected by the best GP evolved program are used by the three classifiers (and GP) for classification. Figure 3 shows an overview of the proposed GP approach. In the rest of this section, we describe the feature selection metrics, the terminal set, the function set, the fitness function, and the parameter settings of the proposed method.

3.1 Feature Selection Metrics

The two feature selection metrics, IG [14] and REFS-F [15,16], are used to rank the importance of the individual features. We briefly describe them here. Information Gain (IG) determines the amount of information gained about a class when a certain feature is present or absent [21]. It is defined as follows:

IG(\tilde{X}, C) = \sum_{C \in \{C_1, C_2\}} \sum_{\tilde{X} \in \{X, \bar{X}\}} P(\tilde{X}, C) \log \frac{P(\tilde{X}, C)}{P(\tilde{X})\,P(C)}    (1)

where X and \bar{X} denote the presence and the absence of the feature, and the healthy and diseased classes are denoted by C_1 and C_2.
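As an illustration of Eq. (1), the following is a minimal sketch (hypothetical helper names, not the authors' implementation) that computes the information gain of a single binarized feature over a two-class sample set:

```python
import math

def information_gain(feature_present, labels):
    """Information gain of a binary feature per Eq. (1).

    feature_present: list of bools (X present vs. absent per sample).
    labels: list of class labels (two classes, e.g. healthy/diseased).
    """
    n = len(labels)
    classes = sorted(set(labels))
    ig = 0.0
    for c in classes:
        for x in (True, False):
            # joint and marginal probabilities estimated by counting
            p_xc = sum(1 for f, l in zip(feature_present, labels)
                       if f == x and l == c) / n
            p_x = sum(1 for f in feature_present if f == x) / n
            p_c = sum(1 for l in labels if l == c) / n
            if p_xc > 0:  # 0 * log(...) contributes nothing
                ig += p_xc * math.log(p_xc / (p_x * p_c))
    return ig
```

A feature that perfectly separates the two classes attains the maximum gain (log 2 in nats for balanced classes), while a class-independent feature scores zero.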
Relief-F (REFS-F) searches for the two nearest neighbours of a given example, one from the same class (the hit) and the other from a different class (the miss) [15, 16], and calculates the worth of a feature, which is given by:

W(X) = P(\text{different value of } X \mid \text{nearest example from a different class}) - P(\text{different value of } X \mid \text{nearest example from the same class})    (2)

where P denotes probability. A good feature should differentiate between instances belonging to different classes and should have the same value for examples from the same class.

3.2 Terminal and Function Sets

The MS data is represented by (m/z, Int) = (m/z, Int_1, ..., Int_n), where m/z is a vector of the measured m/z ratios and Int_i is the corresponding intensity for the ith sample. The objective is to predict the class label based on the intensity profile [22]. For the five data sets used, there are two classes, and the class labels are defined as class1 and class2, respectively.

Terminal Set. As stated earlier, the goal of GP is to further select a smaller number of features from the feature pool selected by IG and REFS-F. The rationale is as follows. Firstly, features which are individually good are often correlated or redundant, and the combination of all the individually high-rank features often does not perform as well as a mixture of some individually high-rank and low-rank features. Secondly, the two metrics IG and REFS-F rank individual features based on different criteria, as stated above, so we expect that combinations of the two groups of features might lead to better performance. Thirdly, using GP to further select features from the feature pool selected by IG and REFS-F can reduce the search space and possibly the computational cost.
Finally, GP has an implicit feature selection capability, and we expect GP to automatically select some individual features from those chosen by the two metrics and combine them via the operators in the function set to form a small feature set that results in better classification performance. Thus, in this approach, we used the top 50 features from each of the two metrics (IG and REFS-F) to form the terminal set with 100 feature terminals, in addition to randomly generated constant terminals.

Function Set. Besides the four commonly used basic arithmetic operators +, -, * and %, we also used the square root (sqrt) and max functions, and a conditional operator ifte. The % is a protected division, which performs the usual division except that division by zero returns zero. The use of the sqrt, max and ifte functions aims to evolve complex, non-linear functions for feature selection and classification. The ifte operator returns its second argument if the first argument is less than zero, and its third argument otherwise. The sqrt is also protected: if its argument is negative, the absolute value is used.
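The protected operators described above can be sketched as follows (a minimal illustration in Python; the actual system is built on the Java-based ECJ package):

```python
import math

def protected_div(a, b):
    # % in the paper: ordinary division, except division by zero returns zero
    return a / b if b != 0 else 0.0

def protected_sqrt(a):
    # protected square root: a negative argument uses its absolute value
    return math.sqrt(abs(a))

def ifte(a, b, c):
    # conditional operator: second argument if the first is less than zero,
    # third argument otherwise
    return b if a < 0 else c
```

Protecting the operators this way guarantees that every randomly generated program tree evaluates to a finite number, so no evolved individual fails at fitness evaluation time.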
Table 2. GP parameter values

Parameter            Value
Initial Population   Ramped Half-and-Half
Tree Depth           5-17
Generations          30
Mutation Rate        0.15
Crossover Rate       0.8
Elitism              0.05
Population Size      1000
Selection Method     Tournament Method
Tournament Size      5

3.3 Fitness Function

As our main goal is to produce a subset of features that not only reduces the search space but also yields better classification accuracy, we define the fitness function to be the classification accuracy, which is evaluated after filtering the input data according to the subset of features selected by the evolved program. Thus the GP framework maximizes the fitness, such that the generated programs with their associated features lead to improved classification performance. For a specific instance in the training set, if the program output is less than zero, the instance is classified as class1; otherwise as class2. This fitness function is used in the GP system for both feature selection and classification, which form a single process in this approach.

3.4 Experimental Setup

In order to evaluate the performance of our proposed method for feature selection and classification, we conducted a number of experiments on the five different MS data sets. We used the ECJ package [23] for GP. The Weka package [24] was used for running IG and REFS-F for feature selection and for running NB, SVMs, and J48 for classification. For the GP system, the initial population was generated using the ramped half-and-half method [10], the individual program tree depth is a minimum of 5 and a maximum of 17, and the population size is 1000. Tournament selection is used with a tournament size of 5. The standard subtree crossover and mutation [10] are used. Elitism is applied to make sure the best individual in the next generation is not worse than that in the current generation.
The evolution terminates when either the fitness reaches 100% or the maximum number of generations (30) is reached. The experiments on each data set are repeated for 50 independent runs with different random seeds, and the features selected by the best run are used for evaluation. Table 2 summarizes the GP parameters used in our method. All the data sets were divided in half for training and testing, except for the spike-in data set, for which the leave-one-out cross validation method was used as the number of examples in this data set is too small.
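The fitness evaluation and sign-based classification described above can be sketched as follows (a hypothetical illustration; `program` stands for any evolved expression over the selected feature values, not the authors' actual code):

```python
def classify(program, sample):
    # sign-based split of the program output space:
    # output less than zero -> class1, otherwise class2
    return "class1" if program(sample) < 0 else "class2"

def fitness(program, samples, labels):
    # fitness = classification accuracy on the training set
    correct = sum(1 for s, l in zip(samples, labels)
                  if classify(program, s) == l)
    return correct / len(labels)

# toy "evolved program" comparing two selected intensity features
prog = lambda s: s[0] - s[1]
```

With two training samples (1.0, 2.0) labelled class1 and (3.0, 1.0) labelled class2, `prog` classifies both correctly, so its fitness is 1.0 and evolution would terminate at the 100% stopping criterion.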
4 Results and Discussion

Table 3 shows the classification accuracy of SVMs, NB and J48 on the five MS test data sets, using all the original features (Org), the top 50 features selected by IG, the top 50 features selected by REFS-F, and the features selected by the proposed GP method. The accuracy of GP as a classifier using the four sets of features is also included in this table for further comparison.

4.1 GP Feature Selection Performance

As can be seen from Table 3, on the five data sets, the numbers of features selected by GP are only 31, 41, 25, 47 and 29. These numbers are not only much smaller than the total number of features (100) used by the GP system, but also smaller than the numbers of features selected by IG or REFS-F alone. The reason for selecting the top 50 ranked features is that the performance degraded when fewer features were used. We can also observe that in most cases, a classifier (SVMs, NB, J48 or GP) using the features selected by GP achieved much better classification performance than using all the original features. The only exception is the case of SVMs on the Ovarian cancer low and Arcene data sets, where slightly better performance was achieved using all the original features. This is mainly because the SVMs method is supposed to (or claimed to) have feature selection ability, although it cannot always select good features and achieve good performance on problems with a huge number of features (such as the premalignant pancreatic cancer data set). The NB and J48 methods, on the other hand, cannot cope with a huge number of original features to achieve good performance.
By further inspection and comparison, we can observe that in most cases, using the top 50 features selected by IG or REFS-F alone improved the classification performance compared to using all the original features, which indicates that most classifiers cope better with a relatively small number of features than with thousands of features. However, on the Ovarian cancer low and Arcene data sets, the SVMs method showed worse classification performance using features selected by IG and REFS-F alone than using all the original features, suggesting that the top 50 features selected by IG or REFS-F alone are individually good but their combination does not perform well, possibly because some individually bad features are not included in the top 50 but might still play some role in classification. The results also show that using the smaller number of features selected by GP generally further improved the classification performance compared to using the 50 features selected by IG or REFS-F, for all four classifiers. This suggests that GP is able to select better combinations of features from the top 100 features selected by IG and REFS-F (50 each), and that these combinations consist of a smaller number of features that led to better classification performance. Further inspection of the selected feature sets reveals that GP actually selected some high-rank features and also some low-rank features from each of the top 50 features. This is consistent with our earlier hypothesis that the feature
Table 3. Experimental Results

Data set                  Methods   #Features   SVMs   NB   J48   GP
Ovarian cancer high       Org
                          IG        50
                          REFS-F    50
                          GP
Ovarian cancer low        Org
                          IG        50
                          REFS-F    50
                          GP
Premalignant pancreatic   Org
                          IG        50
                          REFS-F    50
                          GP
Arcene                    Org
                          IG        50
                          REFS-F    50
                          GP
Spike-in                  Org
                          IG        50
                          REFS-F    50
                          GP

subset combining individually good and not-very-good features can lead to better classification performance.

4.2 GP Classification Performance

To investigate the ability of GP for classification, the last column of Table 3 includes the results of GP as a classifier using the four different sets of features as terminals (plus random constant terminals). Table 3 shows that GP as a classifier generally performed much better than NB and J48 for almost all four feature sets. Compared to SVMs, GP still performed better in most cases, except on the Spike-in data set, where both achieve the ideal performance, and on the Ovarian cancer low resolution data set, where SVMs performed better than GP when using all the original features. These results demonstrate that GP as a classifier can perform better than the NB and J48 classifiers, and comparably with or even better than SVMs on these problems. The good performance of GP and SVMs might be because SVMs is primarily developed for binary classification, and tree-based GP is also well suited to binary classification due to the natural splitting of the program output space between positive and negative values for the two classes. Another factor is that GP has a natural ability to construct high-level features from the original low-level features using the operators in the function set, which might also contribute to the good performance. These will need to be further investigated in the future.

4.3 Further Discussions

In the spike-in data set, human experts identified 13 features (biomarkers) that can successfully solve the problem. The proposed GP system successfully detected 6 of the 13 features with 100% accuracy.
This suggests that there exist
other good combinations of features that domain experts could not identify. In other words, the proposed GP system has the potential to guide humans in identifying biomarkers. This interesting topic will also be investigated in the future. Inspection of the details of performance on both the training and the test sets for the five problems reveals that all four classification methods generally suffer from overfitting: the classification accuracy on the test set was considerably worse than that on the training set. This is mainly because these bio-data sets typically have a huge number of features but only a small number of instances. Clearly, more instances are required for training the classifiers to reduce overfitting. However, due to the nature of the biological experiments for generating MS data, it is impractical to substantially increase the number of instances. Feature selection approaches can improve this situation, as demonstrated in the present study. Further investigation is necessary to completely solve this problem.

5 Conclusions and Future Work

The main goal of this paper was to investigate a GP approach to automatic feature selection and classification for MS data, which is characterized by a huge number of features and a small number of instances. This goal was successfully achieved by developing a GP system that takes the top features selected by IG and REFS-F, plus random constants, as terminals, and uses classification accuracy as the fitness function. The results show that the proposed GP approach selected a smaller number of features than IG and REFS-F, and these selected features resulted in better classification performance than the top features selected by IG and REFS-F alone and all the original features on the five problems, using the SVMs, NB, J48 and GP classifiers.
GP as a classifier also generally outperformed the other three classifiers on these five data sets. The results also suggest that combining multiple feature selection metrics using GP can improve the classification performance. As future work, we will investigate whether combining more metrics using GP can further improve the classification performance. This relates to another interesting but challenging research direction, i.e. automatic construction of high-level features from low-level features for feature/dimension reduction. In addition, we will further investigate the feature selection ability of GP for MS data to address the overfitting problem in MS data with a small number of examples.

References

1. Listgarten, J., Emili, A.: Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4 (2005)
2. Ge, G., Wong, G.W.: Classification of premalignant pancreatic cancer mass spectrometry data using decision tree ensembles. BMC Bioinformatics 9(1), 275 (2008)
3. Lin, Q., Peng, Q., Yao, F., Pan, X.F., Xiong, L.W., Wang, Y., Geng, J.F., Feng, J.X., Han, B.H., Bao, G.L., Yang, Y., Wang, X., Jin, L., Guo, W., Wang, J.C.: A classification method based on principal components of SELDI spectra to diagnose of lung adenocarcinoma. PLoS ONE 7, e34457 (2012)
4. He, S., Cooper, H.J., Ward, D.G., Yao, X., Heath, J.K.: Analysis of premalignant pancreatic cancer mass spectrometry data for biomarker selection using a group search optimizer. Transactions of the Institute of Measurement and Control 34 (2011)
5. Satten, G.A., Datta, S., Moura, H., Woolfitt, A.R., da G. Carvalho, M., Carlone, G.M., De, B.K., Pavlopoulos, A., Barr, J.R.: Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens. Bioinformatics 20(17) (2004)
6. Wagner, M., Naik, D., Pothen, A.: Protocols for disease classification from mass spectrometry data. Proteomics 3(9) (2003)
7. Li, L., Tang, H., Wu, Z., Gong, J., Gruidl, M., Zou, J., Tockman, M., Clark, R.A.: Data mining techniques for cancer detection using serum proteomic profiling. Artificial Intelligence in Medicine 32(2) (2004)
8. Jong, K., Marchiori, E., Sebag, M., van der Vaart, A.: Feature selection in proteomic pattern data with support vector machines (2004)
9. Langdon, W.B., Poli, R., McPhee, N.F., Koza, J.R.: Genetic Programming: An Introduction and Tutorial, with a Survey of Techniques and Applications. In: Fulcher, J., Jain, L.C. (eds.) Computational Intelligence: A Compendium. SCI, vol. 115. Springer, Heidelberg (2008)
10. Poli, R., Langdon, W.B., McPhee, N.F.: A Field Guide to Genetic Programming. Lulu Enterprises, UK Ltd. (2008)
11. Neshatian, K., Zhang, M., Andreae, P.: Genetic Programming for Feature Ranking in Classification Problems. In: Li, X., Kirley, M., Zhang, M., Green, D., Ciesielski, V., Abbass, H.A., Michalewicz, Z., Hendtlass, T., Deb, K., Tan, K.C., Branke, J., Shi, Y. (eds.) SEAL 2008. LNCS, vol.
5361. Springer, Heidelberg (2008)
12. Paul, T.K., Iba, H.: Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (2009)
13. Lv, Y., Guo, Y., Sun, H., Zhang, M., Wang, J.: Feature extraction using composite individual genetic programming: An application to mass classification. Applied Mechanics and Materials 198 (2012)
14. Sebastiani, F., Ricerche, C.N.D.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1-47 (2002)
15. Sun, Y., Wu, D.: A relief based feature extraction algorithm. In: SDM (2008)
16. Kononenko, I.: Estimating Attributes: Analysis and Extensions of RELIEF. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784. Springer, Heidelberg (1994)
17. Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A.: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 (2002)
18. Guyon, I., Gunn, S.R., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. In: NIPS (2004)
19. Tuli, L., Tsai, T.H., Varghese, R., Xiao, J.F., Cheema, A., Ressom, H.: Using a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science 10, 13 (2012)
20. Cai, J., Smith, D., Xia, X., Yuen, K.Y.: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics 6(1), 64 (2005)
21. Sandin, I., Andrade, G., Viegas, F., Madeira, D., da Rocha, L.C., Salles, T., Goncalves, M.A.: Aggressive and effective feature selection using genetic programming. In: IEEE Congress on Evolutionary Computation. IEEE (2012)
22. Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19(13) (2003)
23. White, D.R.: Software review: the ECJ toolkit. Genetic Programming and Evolvable Machines (2012)
24. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11(1) (2009)