Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.
Americo Pereira, Jan Otto

ABSTRACT

In this paper we explain what feature selection is and what it is used for. We discuss feature selection methods and present them in the context of classification, that is, using only supervised learning. Our goal is to present applications of these techniques in bioinformatics and to match each application with the appropriate type of method. We also discuss how to deal with small sample sizes and the problems related to them. Finally, we present some areas in bioinformatics that can be improved using feature selection methods.

1 - INTRODUCTION

The need for feature selection (FS) techniques has grown in recent years because of the size of the data that must be analyzed in areas related to bioinformatics. FS is one of many dimensionality reduction techniques. What distinguishes it from the others is that it picks a subset of variables from the original data without transforming them.

2 - FEATURE SELECTION TECHNIQUES

The objectives of FS are:
- avoiding overfitting
- improving model performance
- gaining a deeper insight into the underlying processes that generated the data

Unfortunately, the optimal parameters of a model built from the full feature set are not always the same as the optimal parameters of a model built from the feature subset selected by FS. FS must therefore be used with care, because there is a danger of losing information, and the optimal model settings have to be found again for the new set of features.

Feature selection techniques can be divided into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods.

2.1 - FILTER TECHNIQUES

These techniques assess the relevance of features by looking only at the intrinsic properties of the data.
In most cases a feature relevance score is calculated and low-scoring features are removed. The remaining subset of features is then presented as input to the classification algorithm. The advantages of filter techniques are that they scale easily to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. Most filters, however, are univariate: they score each feature separately and so ignore dependencies between features. Multivariate filter techniques were introduced to address this problem, but they are not perfect either, as they capture feature dependencies only to a certain degree.
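The scoring-and-thresholding step of a univariate filter can be illustrated with a minimal sketch. It assumes a two-class problem with labels 0 and 1, at least two samples per class, and a Welch t-statistic as the relevance score; the function names `t_score` and `filter_select` and the toy data are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def t_score(values, labels):
    """Absolute Welch t-statistic of one feature between class 0 and class 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    # Standard error of the difference between the two class means.
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_select(X, labels, k):
    """Return the indices of the k highest-scoring feature columns."""
    n_features = len(X[0])
    scores = [t_score([row[j] for row in X], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (assumed): 6 samples, 3 features; only feature 0 separates the classes.
X = [
    [1.0, 5.1, 0.3],
    [1.2, 4.9, 0.9],
    [0.9, 5.0, 0.5],
    [3.1, 5.2, 0.4],
    [2.9, 4.8, 0.8],
    [3.0, 5.0, 0.6],
]
y = [0, 0, 0, 1, 1, 1]
selected = filter_select(X, y, k=1)
print(selected)
```

Note that the classifier never appears in this code, which is what makes the approach fast and classifier-independent, and also why it cannot account for feature dependencies.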
2.2 - WRAPPER TECHNIQUES

Unlike filter techniques, wrapper methods embed the model hypothesis search within the feature subset search. In this setup a search procedure is defined over the space of possible feature subsets, and various subsets of features are generated. Each candidate subset is evaluated by training and testing a specific classification model on it. Because the number of subsets grows exponentially with the number of features, an exhaustive search for the best subset is infeasible, so heuristic search is used instead. The search can be carried out with deterministic or randomized algorithms. The common disadvantages of these techniques are that they have a higher risk of overfitting than filter techniques and that they are very computationally intensive, especially if building the classifier has a high computational cost.
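As a rough illustration of a deterministic wrapper, the sketch below runs a greedy sequential forward search and scores each candidate subset by the leave-one-out accuracy of a nearest-centroid classifier. Both the classifier and the names `loo_accuracy` and `forward_selection` are assumptions made for this example, not methods prescribed by the review; the sketch also assumes every class has at least two samples.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroids[label] = [sum(row[f] for row in rows) / len(rows) for f in features]
        point = [X[i][f] for f in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def forward_selection(X, y, max_features):
    """Greedy sequential forward search over feature subsets."""
    selected, best_score = [], 0.0
    candidates = set(range(len(X[0])))
    while candidates and len(selected) < max_features:
        # The -f entry breaks score ties in favour of the lower feature index.
        scored = [(loo_accuracy(X, y, selected + [f]), -f, f) for f in candidates]
        score, _, feature = max(scored)
        if score <= best_score:
            break  # no candidate improves the wrapped classifier
        selected.append(feature)
        candidates.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (assumed): only feature 0 separates the two classes.
X = [
    [1.0, 5.1, 0.3],
    [1.2, 4.9, 0.9],
    [0.9, 5.0, 0.5],
    [3.1, 5.2, 0.4],
    [2.9, 4.8, 0.8],
    [3.0, 5.0, 0.6],
]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, max_features=2)
print(selected, score)
```

Every call to `loo_accuracy` retrains the classifier once per sample, which makes the computational cost of wrappers visible even on this tiny example.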
2.3 - EMBEDDED TECHNIQUES

In embedded techniques the search space of the feature selection algorithm is a combination of the feature space and the hypothesis space. The search for an optimal feature subset is built into the construction of the classifier itself. A great advantage is that these techniques are far less computationally intensive than wrapper methods.

3 - APPLICATIONS IN BIOINFORMATICS

3.1 - SEQUENCE ANALYSIS

Sequence analysis is one of the most traditional areas of bioinformatics. The problems encountered in this area can be divided into two types, which differ in scope. If we want to focus on general characteristics and reason from statistical features of the whole sequence, we perform content analysis. If, on the other hand, we want to detect the presence of a particular motif or some specific aberration in a sequence, we are interested in only some small part or parts of the sequence, and we perform signal analysis. In both cases the sequence contains a lot of data that is either irrelevant or redundant, and that is exactly where FS can be exploited.

3.1.1 FS dedicated to content analysis (multivariate filter)

Because the features are derived from an ordered sequence, keeping them ordered is often beneficial, as it preserves dependencies between adjacent features. That is why Markov models were used in the first approach mentioned in the Saeys review [1]; improvements to it are still being developed, but the basic idea remains the same. Genetic algorithms and SVMs have also been used to score feature subsets.

3.1.2 FS dedicated to signal analysis (wrapper)

Signal analysis is usually performed to recognize binding sites or other parts of a sequence with a special function. For feature selection it is best to relate motifs to the gene expression level through regression models. Motifs can then be pruned so that the retained motifs fit their regression models best.
Which motifs are discarded by FS depends on their threshold number of misclassification (TNoM) score with respect to the regression models. Motifs can be ranked by importance using the P-value, which is derived directly from the TNoM score.

3.2 - MICROARRAY ANALYSIS

The main characteristics of the datasets produced by microarrays are large dimensionality and small sample sizes. What makes the task more challenging is that the analysis also has to cope with noise and variability.

3.2.1 Univariate (filter only)

The reasons why this approach is the most widely used are:
- the output is easy to understand
- it is faster than multivariate approaches
- the selection is somewhat easier to validate with biological lab methods
- experts usually do not feel the need to consider gene interactions
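The TNoM score mentioned above can be sketched for a single feature. The code follows one common reading of TNoM, namely the smallest number of errors made by any single threshold on the feature, in either orientation; it assumes binary labels coded as 0 and 1, and the function name `tnom` is hypothetical. The derivation of a P-value from the score is not shown.

```python
def tnom(values, labels):
    """Smallest number of errors any single threshold makes on a 0/1 labelling."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    # A threshold below all values puts every sample on one side.
    best = min(total_pos, n - total_pos)
    pos_left = 0
    for i, (value, label) in enumerate(pairs, start=1):
        pos_left += label
        if i < n and pairs[i][0] == value:
            continue  # a threshold cannot separate equal values
        neg_left = i - pos_left
        # Rule A: left side predicted 0, right side predicted 1.
        errors_a = pos_left + ((n - i) - (total_pos - pos_left))
        # Rule B: the opposite orientation.
        errors_b = neg_left + (total_pos - pos_left)
        best = min(best, errors_a, errors_b)
    return best


labels = [0, 0, 0, 1, 1, 1]
separable = tnom([1.0, 1.2, 0.9, 3.1, 2.9, 3.0], labels)
mixed = tnom([0.3, 0.9, 0.5, 0.4, 0.8, 0.6], labels)
print(separable, mixed)  # prints "0 2"
```

A feature with a low TNoM is a good single-threshold discriminator, so ranking features by ascending TNoM gives a simple model-free univariate filter.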
The simplest heuristic techniques include setting a threshold on the difference in gene expression between the states, and detecting the threshold point in each gene that minimizes the TNoM. Univariate techniques have also been developed in two further directions.

Parametric methods: Parametric methods assume a given distribution from which the samples have been generated, so the choice of distribution should be justified before they are used. Unfortunately, the samples are so small that it is very hard even to validate such a choice. The most standard choice is a Gaussian distribution.

Model-free methods: These work like parametric methods, but estimate the reference distribution from random permutations of the data, which makes them more robust against outliers.

There is also another group of non-parametric methods which, instead of trying to identify differentially expressed genes at the whole-population level, are able to capture genes that are significantly dysregulated in only a subset of samples. These methods can select genes containing specific patterns that are missed by the previously mentioned metrics.

3.2.2 Multivariate

Filter methods: The application of multivariate filter methods ranges from simple bivariate interactions to more advanced solutions exploring higher-order interactions, such as correlation-based feature selection (CFS) and several variants of the Markov blanket filter method.

Wrapper methods: In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics, although a few use sequential search techniques. An interesting hybrid filter-wrapper approach crosses a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method. Another characteristic of any wrapper procedure is the scoring function used to evaluate each gene subset found.
As the 0-1 accuracy measure allows for comparison with previous work, the vast majority of papers use this measure.

Embedded methods: The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests in an embedded way to calculate the importance of each gene. Another line of embedded FS techniques uses the weights of each feature in linear classifiers, such as SVMs and logistic regression. These weights reflect the relevance of each gene in a multivariate way and allow genes with very small weights to be removed.

Partly because of the higher computational complexity of wrapper and, to a lesser degree, embedded approaches, these techniques have not received as much interest as filter proposals. However, an advisable practice is to pre-reduce the search space using a univariate filter method and only then apply wrapper or embedded methods, thereby fitting the computation time to the available resources.

3.3 - MASS SPECTRA ANALYSIS

Mass spectrometry technology is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling. A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis. A low-resolution profile can contain up to data points in the spectrum between 500 and m/z. That is why the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets.
3.3.1 Filter methods: Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples, we need to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a feature ( variables!). On the other hand, a great deal of current studies perform aggressive feature extraction procedures that can limit the number of variables to as few as 500. FS has to be low-cost in this case. As in the domain of microarray analysis, univariate filter techniques seem to be the most common techniques used, although the use of embedded techniques is certainly emerging as an alternative. Multivariate filter techniques, on the other hand, are still somewhat underrepresented.

3.3.2 Wrapper methods: In most papers that take a wrapper approach, different types of population-based randomized heuristics are used as search engines: genetic algorithms, particle swarm optimization and ant colony procedures. It is worth noting that improvements to these methods tend to reduce the initial number of variables.

Variations of the popular method originally proposed for gene expression domains, which uses the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the mass spectrometry domain. A neural network classifier (using the weights of the input masses to rank the importance of the features) and different types of decision tree-based algorithms (including random forests) are alternatives to this strategy.

4 - DEALING WITH SMALL SAMPLE DOMAINS

When models are built from small samples, problems start to appear: the risk of overfitting and the imprecision of the models grow as the amount of training data shrinks.
This poses a great challenge to many modeling problems in bioinformatics. Two approaches have been proposed to overcome these problems in feature selection: the use of adequate evaluation criteria, and the use of stable and robust feature selection models.

4.1 - ADEQUATE EVALUATION CRITERIA

In some studies a discriminative subset of features is selected from the full dataset, and the same subset is then used to test the final model. Since the model is also built using this subset, the same samples end up being used for both training and testing. To get a better estimate of the accuracy, the feature selection process must be external to the test: it has to be repeated on the training data alone at each stage of testing the model. Bolstered error estimation [2] is an example of a method that gives a good estimate of predictive accuracy in small sample domains.

4.2 - ENSEMBLE FEATURE SELECTION APPROACHES

The idea of using ensembles is that, instead of using just one feature selection method and accepting its outcome, we can combine different feature selection methods to obtain better results. This is useful because a given optimal feature subset is not necessarily the only optimal one. Although ensembles are computationally more complex and require more resources, they give decent results even for small sample domains, so the extra computational cost is compensated by better results. The random forest method is a particular example of an ensemble, based on a collection of decision trees. It can be used to estimate the relevance of each feature, which helps in selecting interesting features.
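The ensemble idea can be sketched in a simplified form. Instead of combining different selection methods, this example swaps in an easier variant: the same simple filter (absolute difference of class means) is run on many bootstrap resamples, and the selection frequency of each feature is reported as a stability measure. The names `mean_gap`, `top_k` and `ensemble_select`, the stratified resampling and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def mean_gap(values, labels):
    """Absolute difference between the two class means of one feature."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def top_k(X, y, k):
    """Indices of the k features with the largest class-mean gap."""
    scores = [mean_gap([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def ensemble_select(X, y, k, rounds=50, seed=0):
    """Selection frequency of each feature across stratified bootstrap resamples."""
    rng = random.Random(seed)
    by_class = {c: [i for i, label in enumerate(y) if label == c] for c in (0, 1)}
    counts = Counter()
    for _ in range(rounds):
        # Resample within each class so that no class ever becomes empty.
        idx = [rng.choice(by_class[c]) for c in (0, 1) for _ in by_class[c]]
        counts.update(top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return {j: counts[j] / rounds for j in range(len(X[0]))}


# Toy data (assumed): only feature 0 separates the two classes.
X = [
    [1.0, 5.1, 0.3],
    [1.2, 4.9, 0.9],
    [0.9, 5.0, 0.5],
    [3.1, 5.2, 0.4],
    [2.9, 4.8, 0.8],
    [3.0, 5.0, 0.6],
]
y = [0, 0, 0, 1, 1, 1]
freq = ensemble_select(X, y, k=1)
print(freq)
```

Features that are selected in nearly every resample are the robust ones; features whose frequency fluctuates are likely artifacts of the small sample.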
5 - FEATURE SELECTION IN UPCOMING DOMAINS

5.1 - SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS

Single nucleotide polymorphisms (SNPs) are mutations at a single nucleotide position that occurred during evolution and were passed on through heredity, accounting for most of the genetic variation among different individuals. They are used in many disease-gene association studies, and their number is estimated to be 7 million in the human genome. Because of this, there is a need to select a subset of SNPs that is sufficiently informative and yet small enough to reduce the genotyping overhead, which is an important step towards disease-gene association.

5.2 - TEXT AND LITERATURE MINING

Text and literature mining extracts information from texts, which can then be used to build classification models. A common representation of these texts and documents is the bag-of-words. In this representation each word in the text is a variable or feature whose frequency is counted. As a result, for some texts the dimensionality of the data is very large and the data is sparse, which is why feature selection must be used to choose the fundamental features. Although feature selection is common in text classification, in the context of bioinformatics it is still not well developed. For the clustering and classification of biomedical documents, most of the methods developed by the text mining community are expected to be applicable.

6 - CONCLUSION

Feature selection is becoming very important because the amount of data that has to be processed in bioinformatics and other areas is constantly growing in both size and dimensionality. Nevertheless, not everyone is aware of the various types of feature selection methods, and many practitioners simply pick a univariate filter without considering other options.
Prudent use of feature selection methods helps us deal better with two issues in bioinformatics: large input dimensionality and small sample sizes. We hope the reader feels convinced to make the effort to find the proper FS method for each problem that needs dimensionality reduction before actually applying it.

REFERENCES

[1] Yvan Saeys, Iñaki Inza and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. 2007.
[2] Chao Sima. High-dimensional bolstered error estimation. 2011.
More informationAccelerating Unique Strategy for Centroid Priming in K-Means Clustering
IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering
More informationMotion Estimation for Video Coding Standards
Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationData preprocessing Functional Programming and Intelligent Algorithms
Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute
More informationCS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008
CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More information5 Learning hypothesis classes (16 points)
5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated
More informationA FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM
A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationCISC 4631 Data Mining
CISC 4631 Data Mining Lecture 03: Nearest Neighbor Learning Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Prof. R. Mooney (UT Austin) Prof E. Keogh (UCR), Prof. F.
More informationKnowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA
Knowledge Discovery Javier Béjar URL - Spring 2019 CS - MIA Knowledge Discovery (KDD) Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics
More informationR (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.
Assignment No. 4 Title: SD Module- Data Science with R Program R (2) C (4) V (2) T (2) Total (10) Dated Sign Data analysis case study using R for readily available data set using any one machine learning
More informationREMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationMass Spec Data Post-Processing Software. ClinProTools. Wayne Xu, Ph.D. Supercomputing Institute Phone: Help:
Mass Spec Data Post-Processing Software ClinProTools Presenter: Wayne Xu, Ph.D Supercomputing Institute Email: Phone: Help: wxu@msi.umn.edu (612) 624-1447 help@msi.umn.edu (612) 626-0802 Aug. 24,Thur.
More informationData mining fundamentals
Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of
More informationParticle Swarm Optimization applied to Pattern Recognition
Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationBasic Data Mining Technique
Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationIntroduction to Data Mining and Data Analytics
1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns
More informationDepartment of Computer Science & Engineering The Graduate School, Chung-Ang University. CAU Artificial Intelligence LAB
Department of Computer Science & Engineering The Graduate School, Chung-Ang University CAU Artificial Intelligence LAB 1 / 17 Text data is exploding on internet because of the appearance of SNS, such as
More informationUniversity of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka
Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should
More informationWrapper Feature Selection using Discrete Cuckoo Optimization Algorithm Abstract S.J. Mousavirad and H. Ebrahimpour-Komleh* 1 Department of Computer and Electrical Engineering, University of Kashan, Kashan,
More informationA Keypoint Descriptor Inspired by Retinal Computation
A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement
More informationInformation Fusion Dr. B. K. Panigrahi
Information Fusion By Dr. B. K. Panigrahi Asst. Professor Department of Electrical Engineering IIT Delhi, New Delhi-110016 01/12/2007 1 Introduction Classification OUTLINE K-fold cross Validation Feature
More informationUser Guide Written By Yasser EL-Manzalawy
User Guide Written By Yasser EL-Manzalawy 1 Copyright Gennotate development team Introduction As large amounts of genome sequence data are becoming available nowadays, the development of reliable and efficient
More informationD B M G Data Base and Data Mining Group of Politecnico di Torino
DataBase and Data Mining Group of Data mining fundamentals Data Base and Data Mining Group of Data analysis Most companies own huge databases containing operational data textual documents experiment results
More informationFEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION
FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT
More informationCOMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization
COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18 Lecture 6: k-nn Cross-validation Regularization LEARNING METHODS Lazy vs eager learning Eager learning generalizes training data before
More informationClassification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging
1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant
More informationLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization
More informationData Preprocessing. Data Preprocessing
Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should
More informationLab 9. Julia Janicki. Introduction
Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support
More information