A New Implementation of Recursive Feature Elimination Algorithm for Gene Selection from Microarray Data
2009 World Congress on Computer Science and Information Engineering

Sihua Peng 1, Xiaoping Liu 2, Jiyang Yu 1, Zhizhen Wan 3, and Xiaoning Peng 4*

1 Department of Pathology, School of Medicine, Zhejiang University; 2 College of Life Science and Technology, Xinjiang University; 3 College of Computer Science and Engineering, Zhejiang University; 4 School of Medicine, Hunan Normal University

Abstract

We propose a new approach to gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). The SSiCP gene selection algorithm was evaluated on the NCI60 and GCM benchmark datasets, reaching accuracies of 96.6% and 95.5% in 10-fold cross-validation, respectively. Furthermore, SSiCP outperformed recently published algorithms when applied to two further multi-cancer data sets. Computational evidence indicated that SSiCP avoids overfitting effectively. Compared with various other gene selection algorithms, the implementation of SSiCP is very simple, and all the computational experiments are repeatable.

1. Introduction

Cancer classification is a very important step in the diagnosis and treatment of cancer. Without correct identification of the cancer type, it is almost impossible to achieve a good therapeutic effect. Many in-depth studies of cancer identification and classification based on cDNA microarray technology have been carried out [1, 2]. For binary classification problems, such as tumour versus normal tissue [3], or one subtype of a tumour versus another [4], molecular classification using gene expression profiles has achieved a very high degree of accuracy. For the classification of multiple tumour types, however, the accuracy has yet to be improved [5-10].
Because of the high dimensionality, the excessive noise, and the relatively small sample sizes of DNA microarray data, this problem has become a focus of intense interest in the data mining of gene expression profiles. Many conventional classification methods perform very poorly on data with a large number of cancer types [11], such as the NCI60 data set (9 types of cancer) [5] and the GCM data set (14 types of cancer) [6]. Recently, to meet the challenge of multi-cancer classification, investigators have proposed many new approaches. Xu et al. used semi-supervised ellipsoid ARTMAP and particle swarm optimization, with competitive performance [12]. Cai et al. proposed a new algorithm that introduces a measurement quantifying the difference in class discrimination strength between two genes [13]. Zhou et al. [14] recently put forward the MSVM-RFE algorithms, four extensions of the well-known SVM-RFE algorithm [15]. However, obtaining higher classification accuracy while selecting fewer genes should be possible with more powerful data mining algorithms.

In this paper, we propose a new approach to gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). SSiCP, which is neither SVM-RFE nor an extension of SVM-RFE [15], is a new SVM-based implementation of the RFE feature selection methodology. The results show that our strategy is very effective, with a fast calculation procedure.

* To whom correspondence should be addressed: Xiaoning Peng, PhD, Hunan Normal University School of Medicine, No. 81 Jiatongjie, Changsha, Hunan Province, P.R. China (pxiaoning@hunnu.edu.cn).

2. Materials and Methods

2.1 Data sets

NCI60 dataset [5]
The NCI60 data set was described by Ross et al. [5] and can be downloaded from (wi.mit.edu/mpr/nci60/nci_60.expression.scfrs.txt). There are 60 samples in this data set, each measuring the expression of 7129 genes, and the samples span nine cancer types.

GCM dataset [6, 7]

The original GCM dataset contains 198 samples from 14 classes of cancers [6]. A subset of the original GCM dataset is employed in this study, which was downloaded from the web site (view&paper_id=114).

Human Carcinomas Dataset (HCD174) [8]

The HCD174 dataset contains 174 samples in 11 classes.

Central Nervous System Embryonal Tumors dataset (CNS) [9]

The CNS dataset contains 42 samples with 7129 gene probes.

2.2 Gene pre-selection

Without gene pre-selection, computation becomes a time-consuming task because of the very high dimensionality of the feature space. After gene pre-selection, we obtain a few dozen to a few hundred differentially expressed genes. Based on this reduced gene subset, the second step of gene selection is carried out smoothly, with the calculation burden greatly reduced. As our algorithm is built on the Weka platform, we tested several feature selection methods in Weka. After calculation and comparison, we chose the chi-squared test-based feature selection algorithm, named "ChiSquaredAttributeEval" in Weka, as our gene pre-selection algorithm. The chi-squared (χ²) method evaluates features individually by measuring their χ² statistic with respect to the classes. After calculating the χ² value of all considered features, we sorted the values in descending order, since the larger the χ² value, the more important the feature [16].

2.3 RFE: Recursive Feature Elimination

RFE is an iterative procedure, which can be described as follows.
1. Train the classifier.
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.
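The χ² pre-selection of Section 2.2 can be sketched in a few lines. This is a minimal pure-Python illustration, not the Weka ChiSquaredAttributeEval implementation itself; it assumes each feature has already been discretized into a small number of bins, and the helper names (`chi_squared_score`, `pre_select`) are our own:

```python
from collections import Counter

def chi_squared_score(feature_bins, labels):
    """Chi-squared statistic of one (discretized) feature vs. the class labels.

    feature_bins: list of bin indices, one per sample.
    labels:       list of class labels, one per sample.
    """
    n = len(labels)
    bin_counts = Counter(feature_bins)
    class_counts = Counter(labels)
    observed = Counter(zip(feature_bins, labels))
    chi2 = 0.0
    for b, nb in bin_counts.items():
        for c, nc in class_counts.items():
            expected = nb * nc / n          # E = row total * column total / n
            o = observed.get((b, c), 0)
            chi2 += (o - expected) ** 2 / expected
    return chi2

def pre_select(features, labels, k):
    """Rank features by chi-squared score (largest first) and keep the top k."""
    scores = [(chi_squared_score(f, labels), i) for i, f in enumerate(features)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

A feature whose bins align perfectly with the class labels scores highest, so it survives pre-selection ahead of a class-independent one, which is exactly the ordering the paper relies on.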
In the SVM-RFE algorithm proposed by Guyon et al., the main steps are as follows [15].
1. Train the classifier: α = SVM-train(X, y).
2. Compute the weight vector: w = Σ_k α_k y_k x_k.
3. Compute the ranking criteria: c_i = (w_i)².
4. Find the feature with the smallest ranking criterion: f = argmin_i(c_i).
5. Eliminate the feature f.

2.4 Feature selection methodology

Step-by-step feature reduction

The SSiCP algorithm is not a kind of wrapper algorithm [17]. In SSiCP, we do not use a search method, but we do employ an evaluation function to guide the elimination of features step by step. To some extent, SSiCP is similar to SVM-RFE in two respects: both algorithms are SVM-based, and both employ the recursive feature elimination (RFE) methodology. Nevertheless, they are completely different algorithms. The innovation of our algorithm is the feature elimination criterion. Briefly, we consider one feature at a time: if the classification accuracy without this feature increases (or is equal to the original value), we remove the feature permanently; otherwise we restore it. Thus SSiCP does not rank the features by any ranking criterion. The key steps of the proposed algorithm are as follows, where A denotes the current cross-validation accuracy and A' the accuracy after temporarily eliminating a feature.

Step 1. Train the classifier with n features (genes), and compute the accuracy A with m-fold cross-validation.
Step 2. Eliminate a feature f temporarily, and compute the accuracy A' with m-fold cross-validation.
Step 3. If A' ≥ A, remove the feature f permanently and set A = A'; if A' < A, restore the feature f. If all the retained features have been restored once without any increase, a local maximum of the accuracy has been reached, and A is kept at this value.
Step 4. If n = 2, stop the calculation. If n > 2, go to Step 2.

The above steps are the key points of our algorithm; the details are shown in Fig. 1.
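The steps above can be sketched as follows. This is a minimal illustration, not the authors' Weka implementation: `cv_accuracy` is a placeholder for the m-fold cross-validated SMO classifier, and the loop structure is our reading of Steps 1-4:

```python
def ssicp(features, cv_accuracy, min_features=2):
    """Step-by-step improvement of classification performance (SSiCP), sketched.

    features:    list of feature identifiers (e.g. gene names).
    cv_accuracy: callable mapping a feature subset -> cross-validation accuracy.
    Returns the retained feature subset.
    """
    retained = list(features)
    best = cv_accuracy(retained)                      # Step 1: baseline accuracy
    improved = True
    while improved and len(retained) > min_features:  # Step 4: stop at 2 features
        improved = False
        for f in list(retained):                      # Step 2: try each feature
            if len(retained) <= min_features:
                break
            trial = [g for g in retained if g != f]   # drop f temporarily
            acc = cv_accuracy(trial)
            if acc >= best:                           # Step 3: keep the removal
                retained = trial
                best = acc
                improved = True
            # ...otherwise f is restored (retained is left unchanged)
        # a full pass with no removal means a local maximum of the accuracy
    return retained
```

With a toy accuracy function that rewards two informative genes and mildly penalizes extra ones, the loop strips out all the noise features while the informative pair survives, mirroring how the paper's criterion removes a gene only when accuracy does not drop.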
Fig. 1 Schematic map of the feature reduction algorithm.

2.5 Confirmation of the classification algorithm in the second step of feature selection

By comparing seven classification algorithms, including the Naive Bayes classifier, the BayesNet classifier, SMO (the sequential minimal optimization algorithm for training a support vector classifier), KStar, LMT (logistic model trees), J48, and SL (a classifier for building linear logistic regression models), all as implemented in Weka, we determined which classification algorithm provided the best performance. By running the seven classification algorithms on the GCM and NCI60 data sets, the optimal algorithm was selected. Subsequent calculation results showed that SMO outperformed all of the other six algorithms.

2.6 Parameter selection in Weka

When we used an SVM for the classification task, the choice of kernel function was a key factor in obtaining better performance. For the classification of microarray datasets, relatively better classification performance has been achieved using the polynomial kernel function [10]. After testing the four kernel functions available in Weka (NormalizedPolyKernel, PolyKernel, RBFKernel, and StringKernel), it was also clear that the best results were achieved with PolyKernel.

2.7 Overfitting evaluation of the SSiCP algorithm

As with any machine learning algorithm, the issue of overfitting must be addressed. Of the four datasets, HCD174 (174 instances) contains more instances than GCM, NCI60, and CNS. Therefore, to evaluate the overfitting behaviour of the SSiCP algorithm, the HCD174 dataset was partitioned into two parts: a training set and a test set. A classifier model was obtained by running the SSiCP algorithm on the training set, with its ten-fold cross-validation accuracy denoted x1. The classifier model was then evaluated on the independent test set, with accuracy denoted x2. If there is little difference between x1 and x2, we conclude that SSiCP can avoid overfitting effectively.

3. Results

3.1 Initial noise removal and comparison of classification algorithms

The NCI60 and GCM datasets are generally considered benchmark datasets for microarray data mining, so they are often used to test the performance of a new algorithm. Therefore, seven classification algorithms commonly used in data mining were employed with these two datasets. First, we obtained the computational results with and without feature pre-selection (using the χ² test-based feature selection algorithm). The results suggested that after initial pre-selection of the features, the classification performance improved considerably, indicating that the noise in the microarray datasets was removed to a certain extent. The results also indicated that on both the NCI60 and GCM data, the SMO algorithm was superior to the other algorithms. After feature (gene) pre-selection, 208 genes were selected from the NCI60 data set and 150 genes from the GCM data set.

3.2 Gene selection based on step-by-step improvement of classification performance

By calling the main package of Weka to run our algorithm, the computations were carried out using the NCI60 and GCM datasets, and the gene selection results of the above seven algorithms were obtained
(Fig. 2 and Fig. 3). Clearly, the SMO algorithm again outperformed the other six algorithms.

Fig. 2 Classification performance comparison of the seven algorithms on the NCI60 data set. The maximal accuracy of 96.6% was obtained using the SMO algorithm with 24 genes (red).

Fig. 3 Classification performance comparison of the seven algorithms on the GCM data set. The maximal accuracy of 95.5% was obtained using the SMO algorithm with 28 genes (red).

3.3 Comparison of computational results using the four data sets

Through the above comparisons, the SMO algorithm was selected as the classifier embedded in our algorithm. This SMO-based algorithm was then applied to the other two datasets, CNS and HCD174. In the calculations, we generally chose the following parameters: ten-fold cross-validation, the PolyKernel kernel function, and the standardization data filter type, with the remaining parameters set to their default values. The results are shown in Table 1.

Table 1 Accuracy comparison of multi-class classification using the four data sets (%). Rows: Su, Pomeroy, Yeang, Peng, Lin, Xu, Cai, Zhou, and this study; columns: NCI60, GCM, CNS, HCD174.

Overfitting evaluation

The HCD174 dataset was divided into a training set with 142 instances and a test set with 32 instances. Running SSiCP on the HCD174 training set, a classifier
model including 49 features was obtained, with an accuracy of 95.8% in ten-fold cross-validation. The independent test set from HCD174 was then used to evaluate the classifier model, yielding an accuracy of 93.8%. From 95.8% to 93.8%, the accuracy declined only slightly, suggesting that SSiCP avoids overfitting efficaciously.

4. Discussion

Comparing the results obtained on the four datasets, our algorithm was superior to all the other algorithms in classification accuracy except that of Cai et al., which achieved slightly higher accuracy than ours (97.3% versus 97.1%, Table 1), whereas the number of genes we selected was far smaller than theirs (80 versus 37, Table 1).

The advantages of wrapper-based techniques for feature selection are well established [17], so a comparison should be made between wrapper-based approaches and the SSiCP algorithm. First, it has recently been recognized that wrapper-based techniques have the potential to overfit the training data [18], while SSiCP has shown, in computational experiments, the ability to overcome overfitting. Second, wrapper-based techniques must employ a heuristic search method to explore feature subsets in a large state space, placing a heavy computational burden on the computer. Instead of searching states in a huge space, SSiCP uses step-by-step improvement of classification accuracy to reduce the feature space, resulting in a fast computation procedure and a simple implementation.

5. References

[1] Golub, T.R., Slonim, D.K., Tamayo, P., et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286, 1999.
[2] Bittner, M., et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406, 2000.
[3] Furey, T.S., et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16, 2000.
[4] Alizadeh, A.A., et al.
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, 2000.
[5] Ross, D.T., et al. Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics 24, 2000.
[6] Ramaswamy, S., Tamayo, P., Rifkin, R., et al. Multiclass cancer diagnosis using tumor gene expression signatures, Proc. Natl. Acad. Sci. USA 98, 2001.
[7] Lu, J., Getz, G., Miska, E.A., et al. MicroRNA expression profiles classify human cancers, Nature 435, 2005.
[8] Su, A.I., et al. Molecular classification of human carcinomas by use of gene expression signatures, Cancer Research 61, 2001.
[9] Pomeroy, S.L., et al. Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415, 2002.
[10] Peng, S.H., Xu, Q.H., Ling, X.B., et al. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines, FEBS Letters 555, 2003.
[11] Li, T., et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20, 2004.
[12] Xu, R., et al. Multiclass cancer classification using semisupervised ellipsoid ARTMAP and particle swarm optimization with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 2007.
[13] Cai, Z.P., et al. Selecting dissimilar genes for multi-class classification, an application in cancer subtyping, BMC Bioinformatics 8, 2007, Art. No. 206.
[14] Zhou, X. and Tuck, D.P. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data, Bioinformatics 23, 2007.
[15] Guyon, I., et al. Gene selection for cancer classification using support vector machines, Machine Learning 46, 2002.
[16] Liu, H. and Setiono, R. Chi2: Feature selection and discretization of numeric attributes.
In: Proceedings of the IEEE 7th International Conference on Tools with Artificial Intelligence, 1995.
[17] Kohavi, R. and John, G.H. Wrappers for feature subset selection, Artificial Intelligence 97, 1997.
[18] Reunanen, J. Overfitting in making comparisons between variable selection methods, Journal of Machine Learning Research 3, 2003.
More information2.5 A STORM-TYPE CLASSIFIER USING SUPPORT VECTOR MACHINES AND FUZZY LOGIC
2.5 A STORM-TYPE CLASSIFIER USING SUPPORT VECTOR MACHINES AND FUZZY LOGIC Jennifer Abernethy* 1,2 and John K. Williams 2 1 University of Colorado, Boulder, Colorado 2 National Center for Atmospheric Research,
More informationGood Cell, Bad Cell: Classification of Segmented Images for Suitable Quantification and Analysis
Cell, Cell: Classification of Segmented Images for Suitable Quantification and Analysis Derek Macklin, Haisam Islam, Jonathan Lu December 4, 22 Abstract While open-source tools exist to automatically segment
More informationFEATURE SELECTION TECHNIQUES
CHAPTER-2 FEATURE SELECTION TECHNIQUES 2.1. INTRODUCTION Dimensionality reduction through the choice of an appropriate feature subset selection, results in multiple uses including performance upgrading,
More informationA Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu
More informationSUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018
SUPERVISED LEARNING METHODS Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018 2 CHOICE OF ML You cannot know which algorithm will work
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationstepwisecm: Stepwise Classification of Cancer Samples using High-dimensional Data Sets
stepwisecm: Stepwise Classification of Cancer Samples using High-dimensional Data Sets Askar Obulkasim Department of Epidemiology and Biostatistics, VU University Medical Center P.O. Box 7075, 1007 MB
More informationGene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients
1 Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients 1,2 Keyue Ding, Ph.D. Nov. 8, 2014 1 NCIC Clinical Trials Group, Kingston, Ontario, Canada 2 Dept. Public
More informationHotel Recommendation Based on Hybrid Model
Hotel Recommendation Based on Hybrid Model Jing WANG, Jiajun SUN, Zhendong LIN Abstract: This project develops a hybrid model that combines content-based with collaborative filtering (CF) for hotel recommendation.
More informationDESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES
EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset
More informationThe digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand).
http://waikato.researchgateway.ac.nz/ Research Commons at the University of Waikato Copyright Statement: The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). The thesis
More informationInformation Integration of Partially Labeled Data
Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de
More informationClustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford
Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically
More informationMachine learning techniques for binary classification of microarray data with correlation-based gene selection
Machine learning techniques for binary classification of microarray data with correlation-based gene selection By Patrik Svensson Master thesis, 15 hp Department of Statistics Uppsala University Supervisor:
More informationInteractive Text Mining with Iterative Denoising
Interactive Text Mining with Iterative Denoising, PhD kegiles@vcu.edu www.people.vcu.edu/~kegiles Assistant Professor Department of Statistics and Operations Research Virginia Commonwealth University Interactive
More informationUsing Google s PageRank Algorithm to Identify Important Attributes of Genes
Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105
More informationFeature Selection and Classification for Small Gene Sets
Feature Selection and Classification for Small Gene Sets Gregor Stiglic 1,2, Juan J. Rodriguez 3, and Peter Kokol 1,2 1 University of Maribor, Faculty of Health Sciences, Zitna ulica 15, 2000 Maribor,
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More informationUnsupervised Feature Selection for Sparse Data
Unsupervised Feature Selection for Sparse Data Artur Ferreira 1,3 Mário Figueiredo 2,3 1- Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL 2- Instituto Superior Técnico, Lisboa, PORTUGAL 3-
More informationChapter 8 The C 4.5*stat algorithm
109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationTopics In Feature Selection
Topics In Feature Selection CSI 5388 Theme Presentation Joe Burpee 2005/2/16 Feature Selection (FS) aka Attribute Selection Witten and Frank book Section 7.1 Liu site http://athena.csee.umbc.edu/idm02/
More informationChapter 22 Information Gain, Correlation and Support Vector Machines
Chapter 22 Information Gain, Correlation and Support Vector Machines Danny Roobaert, Grigoris Karakoulas, and Nitesh V. Chawla Customer Behavior Analytics Retail Risk Management Canadian Imperial Bank
More informationAn Adaptive Threshold LBP Algorithm for Face Recognition
An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationA Feature Selection Method to Handle Imbalanced Data in Text Classification
A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University
More informationA hybrid of discrete particle swarm optimization and support vector machine for gene selection and molecular classification of cancer
A hybrid of discrete particle swarm optimization and support vector machine for gene selection and molecular classification of cancer Adithya Sagar Cornell University, New York 1.0 Introduction: Cancer
More informationCS229 Lecture notes. Raphael John Lamarre Townshend
CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationBIOINFORMATICS. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures
BIOINFORMATICS Vol. 19 no. 14 2003, pages 1800 1807 DOI: 10.1093/bioinformatics/btg238 New algorithms for multi-class cancer diagnosis using tumor gene expression signatures A. M. Bagirov, B. Ferguson,
More informationEstimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees
Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationCover Page. The handle holds various files of this Leiden University dissertation.
Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:
More informationClassification and Regression Trees
Classification and Regression Trees David S. Rosenberg New York University April 3, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 April 3, 2018 1 / 51 Contents 1 Trees 2 Regression
More information8/19/13. Computational problems. Introduction to Algorithm
I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship
More informationSubject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram.
Subject Copy paste feature into the diagram. When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the
More informationAn Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm
Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy
More information