Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach

Soha Ahmed 1, Mengjie Zhang 1, and Lifeng Peng 2

1 School of Engineering and Computer Science
2 School of Biological Sciences
Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
{soha.ahmed,mengjie.zhang}@ecs.vuw.ac.nz, lifeng.peng@vuw.ac.nz

Abstract. Biomarker discovery using mass spectrometry (MS) data is very useful in disease detection and drug discovery. The process of biomarker discovery in MS data must start with feature selection as the number of features in MS data is extremely large (e.g. thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP based approach works by using the features selected by two feature selection metrics, namely information gain (IG) and relief-f (REFS-F) in the terminal set. The feature selection performance of the proposed approach is examined and compared with IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. Meanwhile, GP is also used as a classification method in the experiments and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method can select a smaller number of features with better classification performance than IG and REFS-F using NB, SVMs and J48. In addition, GP as a classification method also outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.
L. Vanneschi, W.S. Bush, and M. Giacobini (Eds.): EvoBIO 2013, LNCS 7833, © Springer-Verlag Berlin Heidelberg 2013

1 Introduction

Mass spectrometry (MS) is a tool for measuring the molecular masses of chemical compounds, and these masses are necessary to identify the species of proteins and metabolites [1]. Inside the instrument, the sample molecules are converted to ions in the ionization source, and these ions are then passed to the mass analyzer, which measures their mass-to-charge ratios (m/z). MS can work in either the full scan mode, which measures the m/z of the parent ions (producing an MS spectrum, which contains m/z ratios and their corresponding intensities), or the tandem MS mode, which selects and fragments the ion of interest from the parent ion list and subsequently records the m/z ratios of its daughter ions (producing an MS/MS spectrum, which contains the m/z ratios and corresponding intensities of the fragment ions) to aid the elucidation of the structure of the molecules [1]. The MS machine is usually coupled with a separation technique placed in front of it, either gas chromatography (GC) or liquid chromatography (LC), to separate the different molecules before MS detection, which reduces the complexity of the MS spectrum and increases the detection coverage. Accordingly, the data acquired with the LC instrumental setup is called an LC-MS/MS spectrum, which contains the features of retention time, scan numbers, and m/z ratios and the corresponding intensities of the parent and fragment ions [1].

Due to the nature of MS data, a data set typically has a very small number of instances, but each instance is represented by a huge number of features (typically several thousand), which makes the classification of MS data a very challenging problem [2]. Several machine learning and optimisation techniques have been used for feature selection and classification of MS data. For example, principal component analysis and linear discriminant analysis [3, 4] and the random forest algorithm [5] have been used for feature extraction, dimensionality reduction and classification on MS data. Support vector machines (SVMs), k-nearest-neighbour and quadratic discriminant analysis [6] have been used for MS data classification. Genetic algorithms and the t-test [7] have also been used for feature selection and classification. SVM-RFE has been used to select features by giving a weight to each feature through training a linear SVM and removing the lowest weight features; the resulting feature subsets are passed to SVM classifiers to classify the proteomics MS data sets [8]. However, many of these classification techniques cannot easily handle such a huge number of features, and feature selection and dimensionality reduction of the data often need to take place before classification is performed.
However, the two separate processes are sometimes poorly connected: for example, the features selected by one method (e.g. decision trees) may not perform well for classification using a different method (e.g. SVMs).

Genetic programming (GP) is one of the evolutionary computation algorithms [9]. It starts with a random initial population to search for a solution to a given task, and over a number of generations it modifies the initial solutions through a set of genetic operators guided by the fitness function [10]. GP has the capability to select features implicitly [11] by incorporating useful features during the evolution of programs. Only recently have a small number of works used GP for feature selection and classification of bio-data (e.g. [12,13]). For example, GP has been used for peptide quantification of MS data and the measurement of protein quantities within a biological sample [12]. However, GP has very seldom been used for feature selection and classification of MS data.

The overall goal of this paper is to investigate a GP based approach to automatic feature selection and classification of MS data, which typically has a huge number of features and a small number of examples. To achieve feature selection, we will embed two existing feature selection metrics, information gain (IG) [14] and relief-f (REFS-F) [15,16], into the GP system. The features selected by GP will be compared with the original features and the features selected by the above two metrics alone on five MS data sets using three common classifiers, namely NB, SVMs, and J48 decision trees (J48). Specifically, we investigate the following objectives: (1) what primitives and fitness function can be used to develop this GP system; (2) whether embedding multiple feature selection metrics into GP can improve the feature selection process; (3) whether GP as a feature selection method can automatically select a small number of features that can achieve better classification performance than using all the original features; (4) whether the new set of features selected by GP outperforms the features selected by IG and REFS-F alone; and (5) whether GP as a classification method outperforms the naive Bayes, SVMs and J48 classifiers.

The remainder of the paper is organised as follows. Section 2 describes the data sets and the preprocessing steps. Section 3 describes the new GP approach. Section 4 presents the results with discussions. Section 5 presents the conclusions and the future work.

2 Data Sets and Preprocessing

Five MS data sets are used in the experiments, as shown in Table 1. They are the high resolution SELDI-TOF ovarian cancer data set and the low resolution SELDI-TOF ovarian cancer data set [17], the premalignant pancreatic cancer data set [2], the Arcene data set [18], and the spike-in LC-MS/MS data set [19] from Georgetown University.

Table 1. Data Sets

Name of the Data Set              No. of Features    Size of Data Set
Premalignant Pancreatic Cancer
High Resolution Ovarian Cancer    15,
Low Resolution Ovarian Cancer     15,
Arcene                            10,
Spike-In                          10,

2.1 Data Preprocessing

Due to the nature (e.g. a lot of noise) and types (e.g. MS, LC-MS/MS) of the MS data, different preprocessing methods are applied. We use the bioinformatics toolbox in Matlab [20] for this purpose.
Low and high resolution ovarian cancer data sets: To obtain the same m/z points in all MS spectra [17], the resampling algorithm in the toolbox is used. Then, baseline correction is adopted to subtract the background noise and remove the low intensity peaks. First, the baseline is estimated by calculating the minimum values within a window of 50 m/z points for the high resolution data; for the low resolution data, the window size is 500 m/z points. Afterwards, the varying baseline is regressed and the resulting baseline is subtracted [17]. Due to mis-calibration of mass spectrometers, systematic shifts can appear in repeated experiments. Therefore, the alignment of the spectrograms is the third step in the preprocessing framework. Finally, each spectrum is normalized using the area under the curve (AUC). Figure 1 shows the four steps of preprocessing of the low resolution ovarian cancer data set as an example.

Fig. 1. Preprocessing steps of the low resolution ovarian cancer data set. (a) and (b) represent the resampling and the baseline adjustment of the first signal, respectively. (c) is an example alignment of 6 samples. Finally, (d) represents the normalization of the samples using AUC.

Premalignant pancreatic cancer data set: Similar to the steps for the previous two data sets, baseline adjustment, filtering and normalization are used here. The baseline is estimated by segmenting each spectrum into windows with a size of 200 m/z ratio intensities. The mean values of these windows are then used as the estimate of the baseline value at that intensity. To perform regression, a piecewise cubic interpolation method is used [2]. After this step, noise is filtered using the Gaussian kernel filter. Finally, normalization is performed using the area under the curve, where the maximum value of intensity for each m/z ratio is rescaled to 100.

Arcene data set: This data set is available after preprocessing, where the m/z range is limited to between 200 and 10,000 [18] by considering the m/z values under 200 and over 10,000 as noise. Afterwards, the technical repeats are averaged, the baseline is removed, and then smoothing and alignment take place.

The preprocessing of these four MS data sets is at a low level; therefore, the number of features remains the same after preprocessing.

Fig. 2. Peak extraction from raw data spectrum of the spike-in data set. (Panel title: Extracted Peaks Above (0.75 x Base Peak Intensity) for Each Scan; axes: Mass/Charge (M/Z), Relative Ion Intensity, Retention Time (seconds).)

Spike-in data set: The data set is an LC-MS/MS data set, which consists of values of scan numbers, LC retention time, m/z ratios and the intensities corresponding to the m/z ratios [1]. Ideally, the same molecule detected in the same LC-MS/MS should have the same molecular weight, intensity and retention time, but due to variation in experimental conditions and the co-elution of molecular ions this is not always the case. Therefore, before computational analysis can take place, the data has to undergo preprocessing steps different from those for the full scan MS data. The first step is to extract peaks from the data by clustering significant peaks and noisy peaks and removing the noisy peaks using the toolkit. Figure 2 shows the raw data and the data after peak extraction. The second step is to filter the peaks to further remove noise from each scan. The filtering is performed using a percentile of the base peak intensity, where the base peak is the most intense peak found in each scan. In order to produce the centroid data from the raw signal, the peak preserving resampling method is adopted.
Finally, the alignment of the peaks is used to remove the fluctuation of the data. After the preprocessing, the number of features becomes 847 for this data set.

3 The Approach

There are many feature selection methods used for dimensionality reduction and improving classification accuracy. Each of these methods can select different sets of features according to the criteria of the selection process. Some of the features selected may be powerful while other features may not be so relevant [21]. We hypothesize that combining features of different feature selection metrics may improve the feature selection performance. Our aim is to use GP to guide the feature selection process by combining two well known feature selection metrics, IG and REFS-F, and to produce a new and smaller set of features that can effectively improve the power of the selected features in terms of classification accuracy. The two metrics are chosen due to their wide application in the literature [14,15] and also because they show distinct characteristics.

Tree-based GP [11] is used here, and the proposed method has three steps: (1) we use the two feature ranking techniques IG and REFS-F to rank the features; (2) the top features ranked by the two metrics are used as terminals of the GP method, where the intrinsic capability of GP is used to search for good combinations of those features to form a (hopefully) better set of features; (3) the features selected by the best GP evolved program are used by the three classifiers (and GP) for classification. Figure 3 shows an overview of the proposed GP approach. In the rest of this section, we describe the feature selection metrics, the terminal set, the function set, the fitness function, and the parameter settings for the proposed method.

Fig. 3. Overview of the approach

3.1 Feature Selection Metrics

The two feature selection metrics, IG [14] and REFS-F [15,16], are used to rank the importance of the individual features. We briefly describe them here.

Information Gain (IG) determines the amount of information gained about a class when a certain feature is present or absent [21]. It is defined as follows:

$$\mathrm{IG}(X, C_i) = \sum_{C \in \{C_1, C_2\}} \; \sum_{X' \in \{X, \bar{X}\}} P(X', C) \log \frac{P(X', C)}{P(X')\,P(C)} \qquad (1)$$

where $X$ and $\bar{X}$ denote the presence and the absence of the feature, and the healthy and diseased classes are denoted by $C_1$ and $C_2$.
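As a rough illustration of Equation (1), the following Python sketch computes the information gain of a single feature that has already been reduced to present/absent values. The function name `information_gain`, the base-2 logarithm and the idea of binarising intensities beforehand are assumptions made for this sketch; the paper itself relies on the Weka implementation and does not specify these details.

```python
import math
from collections import Counter


def information_gain(present, labels):
    """Information gain of one binary feature, following Equation (1).

    present: list of booleans, True if the feature is present in the sample.
    labels:  list of class labels, one per sample (two classes in the paper).
    The base-2 logarithm is an assumption; Equation (1) leaves the base open.
    """
    n = len(labels)
    joint = Counter(zip(present, labels))   # counts for P(X', C)
    feature_counts = Counter(present)       # counts for P(X')
    class_counts = Counter(labels)          # counts for P(C)

    ig = 0.0
    # Only combinations that actually occur are visited, so the usual
    # convention 0 * log(0) = 0 is respected automatically.
    for (x, c), count in joint.items():
        p_xc = count / n
        p_x = feature_counts[x] / n
        p_c = class_counts[c] / n
        ig += p_xc * math.log2(p_xc / (p_x * p_c))
    return ig


if __name__ == "__main__":
    labels = ["C1", "C1", "C2", "C2"]
    # A feature that separates the two classes perfectly.
    print(information_gain([True, True, False, False], labels))
    # A feature that is independent of the class.
    print(information_gain([True, False, True, False], labels))
```

Ranking all features by this score and keeping the highest ones corresponds to the IG-based ranking step; how ties and continuous intensities are handled would depend on the tool actually used.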

Relief-F (REFS-F) searches for the two nearest neighbors of a given example, one from the same class (hit) and the other from a different class (miss) [15,16], and calculates the worth of the feature, which is given by:

$$W(X) = P(\text{different value of } X \mid \text{nearest example from the different class}) - P(\text{different value of } X \mid \text{nearest example from the same class}) \qquad (2)$$

where $P$ refers to probability. A good feature should differentiate between instances belonging to different classes and should have the same value for examples from the same class.

3.2 Terminal and Function Sets

The MS data is represented by $(m/z, Int) = (m/z, Int_1, \ldots, Int_n)$, where $m/z$ is a vector of the measured m/z ratios, and $Int_i$ is the corresponding intensity for the $i$th sample. The objective is to predict the class label based on the intensity profile [22]. For the five data sets used, there are two classes, and the class labels can be defined as class1 or class2, respectively.

Terminal Set. As stated earlier, the goal of GP is to further select a smaller number of features from the feature pool selected by IG and REFS-F. The rationale is as follows. Firstly, the features that are individually the best are often correlated or redundant. The combination of all individually high-rank features often does not perform as well as a mixture of some individually high-rank and low-rank features. Secondly, the two metrics IG and REFS-F rank individual features based on different criteria, as stated above, and accordingly we expect that combining the two groups of features might lead to better performance. Thirdly, using GP to further select features from the feature pool selected by IG and REFS-F can reduce the search space and possibly the computational cost. Finally, GP has an implicit feature selection capability, and we expect GP to automatically select some individual features from those chosen by the two metrics and combine them via the operators in the function set to form a small feature set that can result in better classification performance. Thus, in this approach, we used the top 50 features from each of the two metrics (IG and REFS-F) to form the terminal set with 100 feature terminals, in addition to randomly generated constant terminals.

Function Set. Besides the four commonly used basic arithmetic operations +, -, × and %, we also used the square root (√) and max functions, and a conditional operator ifte. The % is a protected division, which performs the same operation as ordinary division except that division by zero returns zero. The use of the √, max and ifte functions aims to evolve complex and non-linear functions for feature selection and classification. The ifte function returns the second argument if the first argument is less than zero, and returns the third argument otherwise. The √ is also protected: if the argument is negative, its absolute value is used.
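The protected operators in the function set, and the way an evolved program implicitly selects features, can be sketched in Python as follows. This is only an illustration: the nested-tuple tree representation, the feature names such as `f3`, and the helper names `evaluate` and `used_features` are hypothetical and are not taken from the ECJ implementation used in the paper.

```python
import math


def protected_div(a, b):
    """Protected division (%): division by zero returns zero."""
    return 0.0 if b == 0 else a / b


def protected_sqrt(a):
    """Protected square root: a negative argument is replaced by its absolute value."""
    return math.sqrt(abs(a))


def ifte(first, second, third):
    """Return the second argument if the first is less than zero, else the third."""
    return second if first < 0 else third


FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "%": protected_div,
    "sqrt": protected_sqrt,
    "max": max,
    "ifte": ifte,
}


def evaluate(node, features):
    """Evaluate a program tree on one sample.

    A tree is a nested tuple (operator, child, ...), a feature name (str),
    or a numeric constant. This representation is a hypothetical one.
    """
    if isinstance(node, tuple):
        op, *children = node
        return FUNCTIONS[op](*(evaluate(child, features) for child in children))
    if isinstance(node, str):
        return features[node]
    return node


def used_features(node):
    """Collect the feature terminals appearing in a tree (the implicitly selected features)."""
    if isinstance(node, tuple):
        return set().union(*(used_features(child) for child in node[1:]))
    return {node} if isinstance(node, str) else set()


if __name__ == "__main__":
    tree = ("ifte", ("-", "f3", 0.5), ("%", "f7", "f3"), ("sqrt", "f12"))
    sample = {"f3": 0.2, "f7": 1.0, "f12": -4.0}
    print(evaluate(tree, sample))
    print(sorted(used_features(tree)))
```

In this sketch the set returned by `used_features` plays the role of the feature subset selected by an evolved program: only terminals that survive in the tree are passed on to the classifiers.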

Table 2. GP parameter values

GP Parameter          Parameter Value
Initial Population    Ramped Half-and-Half
Tree Depth            5-17
Generations           30
Mutation Rate         0.15
Crossover Rate        0.8
Elitism               0.05
Population Size       1000
Selection Method      Tournament Method
Tournament Size       5

3.3 Fitness Function

As our main goal is to produce a subset of features that not only reduces the search space but also yields a better classification accuracy for the proposed GP approach, we define the fitness function to be the classification accuracy, which is evaluated after filtering the input data according to the subset of features selected by the evolved program. Thus the GP framework maximizes the fitness, such that the generated programs with their associated features lead to improved classification performance. For a specific instance in the training set, if the program output is 0, the instance is classified as class1; otherwise as class2. This fitness function is used in the GP system for both feature selection and classification, which is a single process in this approach.

3.4 Experimental Setup

In order to evaluate the performance of our proposed method for feature selection and classification, we conducted a number of experiments on the five different MS data sets. We used the ECJ package [23] for GP. The Weka package [24] was used for running IG and REFS-F for feature selection and running NB, SVMs, and J48 for classification.

For the GP system, the initial population was generated using the ramped half-and-half method [10], the individual program tree depth is between a minimum of 5 and a maximum of 17, and the population size is 1000. Tournament selection is used with a tournament size of 5. The standard subtree crossover and mutation [10] are used. Elitism is applied to make sure the best individual in the next generation is not worse than that in the current generation. The evolution is terminated when either the fitness reaches 100% or the maximum number of generations (30 generations) is reached. The experiments on each data set are repeated for 50 independent runs with different random seeds. The features selected by the best run are used for evaluation. Table 2 summarizes the GP parameters used in our method.

All the data sets were divided into half for training and half for testing, except for the spike-in data set, for which the leave-one-out cross validation method was used as the number of examples in this data set is too small.
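The accuracy-based fitness function of Section 3.3 can be sketched in Python as below. The decision rule here is an assumption: the sketch treats a non-positive program output as class1 and a positive output as class2, which is one plausible reading of the thresholding described above. The function names and the toy program are hypothetical and do not come from the ECJ setup.

```python
def classify(output):
    """Map a program output to a class label.

    ASSUMPTION: non-positive output -> class1, positive output -> class2.
    The exact threshold rule is not fully recoverable from the text.
    """
    return "class1" if output <= 0 else "class2"


def fitness(program, instances, labels):
    """Classification accuracy of an evolved program on a set of instances.

    program:   any callable mapping one instance to a number
               (a stand-in for an evolved GP tree).
    instances: list of samples, here dicts of feature name -> intensity.
    labels:    list of true class labels ("class1" or "class2").
    """
    correct = sum(
        classify(program(instance)) == label
        for instance, label in zip(instances, labels)
    )
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical stand-in for an evolved program using two feature terminals.
    program = lambda x: x["f1"] - x["f2"]
    instances = [
        {"f1": 1, "f2": 2},
        {"f1": 3, "f2": 1},
        {"f1": 0, "f2": 0},
        {"f1": 5, "f2": 1},
    ]
    labels = ["class1", "class2", "class2", "class2"]
    print(fitness(program, instances, labels))
```

A run would stop early once this value reaches 1.0 on the training data, which mirrors the termination criterion stated in the experimental setup.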

4 Results and Discussion

Table 3 shows the classification accuracy of the SVMs, NB and J48 on the five MS test data sets, using all the original features (Org), the top 50 features selected by IG, the top 50 features selected by REFS-F, and the features selected by the proposed GP method. The accuracy of GP as a classifier using the four sets of features is also included in this table for further comparison.

4.1 GP Feature Selection Performance

As can be seen from Table 3, in the five data sets, the numbers of features selected by GP are only 31, 41, 25, 47 and 29. These numbers are not only much smaller than the total number of features (100) supplied to the GP system as terminals, but also smaller than the numbers of features selected by IG or REFS-F alone. The reason for selecting the top 50 ranked features is that the performance degraded when fewer features were used.

We can also observe that in most cases, a classifier (SVMs, NB, J48 or GP) using the features selected by GP achieved much better classification performance than using all the original features. The only exception is the case of SVMs on the Ovarian cancer low and Arcene data sets, which achieved slightly better performance by using all the original features. This is mainly because the SVMs method is supposed to (or claimed to) have feature selection ability, although it cannot always select good features and achieve good performance on problems with a huge number of features (such as the premalignant pancreatic cancer data set). The NB and J48 methods, on the other hand, cannot cope with a huge number of original features to achieve good performance.
By further inspection and comparison, we can observe that in most cases, using the top 50 features selected by IG or REFS-F alone improved the classification performance compared to using all the original features, which indicates that most of the classifiers deal better with a relatively small number of features than with thousands of features. However, on the Ovarian cancer low and Arcene data sets, the SVMs method showed worse classification performance using the features selected by IG and REFS-F alone than using all the original features. This suggests that the top 50 features selected by IG or REFS-F alone are individually good but do not perform well in combination, possibly because some individually bad features that are not included in the top 50 might still play some role in classification.

The results also show that using the smaller number of features selected by GP generally further improved the classification performance compared to using the 50 features selected by IG or REFS-F, for all four classifiers. This suggests that GP is able to select better combinations of features from the top 100 features selected by IG and REFS-F alone (50 each), and that these combinations consist of a smaller number of features that lead to better classification performance. Further inspection of the selected feature sets reveals that GP actually selected some high-rank features and also some low-rank features from each of the top 50 features. This is consistent with our earlier hypothesis that a feature subset combining individually good and not-very-good features can lead to better classification performance.

Table 3. Experimental Results. Columns: Data set, Methods, #Features, SVMs, NB, J48, GP. Rows: for each data set (Ovarian cancer high, Ovarian cancer low, Premalignant pancreatic, Arcene, Spike-in), the methods Org, IG, REFS-F and GP.

4.2 GP Classification Performance

To investigate the ability of GP for classification, in the last column of Table 3 we also included the results of GP as a classifier using the four different sets of features as terminals (all plus random constant terminals). Table 3 shows that GP as a classifier generally performed much better than NB and J48 for almost all of the four feature sets. Compared to SVMs, GP still performed much better in most cases, except for the Spike-in data set, where both of them achieve the ideal performance, and for the Ovarian low resolution data set, where SVMs performed better than GP when using all the original features. These results demonstrate that GP as a classifier can perform better than the NB and J48 classifiers, and is comparable with or even better than SVMs for these problems. The good performance of GP and SVMs might be because SVMs were primarily developed for binary classification, and tree based GP is also good for binary classification due to the natural splitting of the program output space between positive and negative values for the two classes. Another factor is that GP has a natural ability to construct high-level features from the original low-level features using the operators in the function set, which might also contribute to the good performance. These points will need to be further investigated in the future.

4.3 Further Discussions

In the spike-in data set, human experts identified 13 features (biomarkers) that can successfully solve the problem. The proposed GP system successfully detected 6 of the 13 features with 100% accuracy. This suggests that there exist other good combinations of features that domain experts could not identify. In other words, the proposed GP system has the potential to guide humans in identifying biomarkers. This interesting topic will also be investigated in the future.

Inspection of the details of performance on both the training and the test sets for the five problems reveals that all four classification methods generally have an overfitting problem: the classification accuracy on the test set was considerably worse than that on the training set. This is mainly because these bio-data sets typically have a huge number of features but only a small number of instances. Clearly, more instances are required for training the classifier to reduce overfitting. However, due to the nature of the biological experiments that generate the MS data, it is impractical to substantially increase the number of instances. Feature selection approaches can improve this situation, as demonstrated in the present study. Further investigation is necessary to completely solve this problem.

5 Conclusions and Future Work

The main goal of this paper was to investigate a GP approach to automatic feature selection and classification for MS data, which is characterized by a huge number of features and a small number of instances. This goal was successfully achieved by developing a GP system that takes the top features selected by IG and REFS-F alone and random constants as terminals, and the classification accuracy as the fitness function. The results show that the proposed GP approach selected a smaller number of features than IG and REFS-F, and these selected features resulted in better classification performance than the top features selected by IG and REFS-F alone and all the original features on the five problems using the SVMs, NB, J48 and GP classifiers.
GP as a classifier also generally outperformed the other three classifiers on these five data sets. The results also suggest that combining multiple feature selection metrics using GP can improve the classification performance.

As future work, we will investigate whether combining more metrics using GP can further improve the classification performance. This will relate to another interesting but challenging research direction, i.e. automatic construction of high-level features from low-level features for feature/dimension reduction. In addition, we will further investigate the feature selection ability of GP for MS data to address the overfitting problem in MS data with a small number of examples.

References

1. Listgarten, J., Emili, A.: Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4 (2005)
2. Ge, G., Wong, G.W.: Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics 9(1), 275 (2008)

3. Lin, Q., Peng, Q., Yao, F., Pan, X.F., Xiong, L.W., Wang, Y., Geng, J.F., Feng, J.X., Han, B.H., Bao, G.L., Yang, Y., Wang, X., Jin, L., Guo, W., Wang, J.C.: A classification method based on principal components of seldi spectra to diagnose of lung adenocarcinoma. PLoS ONE 7, e34457 (2012)
4. He, S., Cooper, H.J., Ward, D.G., Yao, X., Heath, J.K.: Analysis of premalignant pancreatic cancer mass spectrometry data for biomarker selection using a group search optimizer. Transactions of the Institute of Measurement and Control 34 (2011)
5. Satten, G.A., Datta, S., Moura, H., Woolfitt, A.R., da G. Carvalho, M., Carlone, G.M., De, B.K., Pavlopoulos, A., Barr, J.R.: Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens. Bioinformatics 20(17) (2004)
6. Wagner, M., Naik, D., Pothen, A.: Protocols for disease classification from mass spectrometry data. Proteomics 3(9) (2003)
7. Li, L., Tang, H., Wu, Z., Gong, J., Gruidl, M., Zou, J., Tockman, M., Clark, R.A.: Data mining techniques for cancer detection using serum proteomic profiling. Artificial Intelligence in Medicine 32(2) (2004)
8. Jong, K., Marchiori, E., Sebag, M., Vaart, A.V.D.: Feature selection in proteomic pattern data with support vector machines (2004)
9. Langdon, W.B., Poli, R., McPhee, N.F., Koza, J.R.: Genetic Programming: An Introduction and Tutorial, with a Survey of Techniques and Applications. In: Fulcher, J., Jain, L.C. (eds.) Computational Intelligence: A Compendium. SCI, vol. 115. Springer, Heidelberg (2008)
10. Poli, R., Langdon, W.B., McPhee, N.F.: A field guide to genetic programming. Lulu Enterprises, UK Ltd. (2008)
11. Neshatian, K., Zhang, M., Andreae, P.: Genetic Programming for Feature Ranking in Classification Problems. In: Li, X., Kirley, M., Zhang, M., Green, D., Ciesielski, V., Abbass, H.A., Michalewicz, Z., Hendtlass, T., Deb, K., Tan, K.C., Branke, J., Shi, Y. (eds.) SEAL. LNCS, vol. 5361. Springer, Heidelberg (2008)
12. Paul, T.K., Iba, H.: Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (2009)
13. Lv, Y., Guo, Y., Sun, H., Zhang, M., Wang, J.: Feature extraction using composite individual genetic programming: An application to mass classification. Applied Mechanics and Materials 198 (2012)
14. Sebastiani, F., Ricerche, C.N.D.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1-47 (2002)
15. Sun, Y., Wu, D.: A relief based feature extraction algorithm. In: SDM (2008)
16. Kononenko, I.: Estimating Attributes: Analysis and Extensions of RELIEF. In: Bergadano, F., De Raedt, L. (eds.) ECML. LNCS, vol. 784. Springer, Heidelberg (1994)
17. Petricoin, Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A.: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 (2002)
18. Guyon, I., Gunn, S.R., Ben-Hur, A., Dror, G.: Result analysis of the nips 2003 feature selection challenge. In: NIPS (2004)
19. Tuli, L., Tsai, T.H., Varghese, R., Xiao, J.F., Cheema, A., Ressom, H.: Using a spike-in experiment to evaluate analysis of LC-MS data. Proteome Science 10, 13 (2012)

20. Cai, J., Smith, D., Xia, X., Yuen, K.Y.: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics 6(1), 64 (2005)
21. Sandin, I., Andrade, G., Viegas, F., Madeira, D., da Rocha, L.C., Salles, T., Goncalves, M.A.: Aggressive and effective feature selection using genetic programming. In: IEEE Congress on Evolutionary Computation. IEEE (2012)
22. Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19(13) (2003)
23. White, D.R.: Software review: the ecj toolkit. Genetic Programming and Evolvable Machines (2012)
24. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explorations 11(1) (2009)


More information

Using a genetic algorithm for editing k-nearest neighbor classifiers

Using a genetic algorithm for editing k-nearest neighbor classifiers Using a genetic algorithm for editing k-nearest neighbor classifiers R. Gil-Pita 1 and X. Yao 23 1 Teoría de la Señal y Comunicaciones, Universidad de Alcalá, Madrid (SPAIN) 2 Computer Sciences Department,

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Rotation Invariant Finger Vein Recognition *

Rotation Invariant Finger Vein Recognition * Rotation Invariant Finger Vein Recognition * Shaohua Pang, Yilong Yin **, Gongping Yang, and Yanan Li School of Computer Science and Technology, Shandong University, Jinan, China pangshaohua11271987@126.com,

More information

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8 Tutorial 3 1 / 8 Overview Non-Parametrics Models Definitions KNN Ensemble Methods Definitions, Examples Random Forests Clustering Definitions, Examples k-means Clustering 2 / 8 Non-Parametrics Models Definitions

More information

A Comparative Study of Linear Encoding in Genetic Programming

A Comparative Study of Linear Encoding in Genetic Programming 2011 Ninth International Conference on ICT and Knowledge A Comparative Study of Linear Encoding in Genetic Programming Yuttana Suttasupa, Suppat Rungraungsilp, Suwat Pinyopan, Pravit Wungchusunti, Prabhas

More information

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

A Naïve Soft Computing based Approach for Gene Expression Data Analysis Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information