Clustering and Classification
Jean Yee Hwa Yang, University of California, San Francisco
Institute for Mathematical Sciences, National University of Singapore, January 2-6, 2004

Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects.
- Unsupervised: the classes are unknown and we want to discover them from the data (cluster analysis).
- Supervised: the classes are predefined and we want to use a training (learning) set of labeled objects to form a classifier for classification of future observations.

Basic principles of clustering
Aim: to group observations that are similar, based on predefined criteria.
Issues:
- Which genes / arrays to use?
- Which similarity or dissimilarity measure?
- Which clustering algorithm?
It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context specific.
Clustering expression data
A typical workflow: for each gene, calculate a summary statistic and/or adjusted p-value to obtain a set of candidate differentially expressed genes; choose a similarity metric and a clustering algorithm; cluster; then follow with biological verification and descriptive interpretation.

Which similarity or dissimilarity measure?
A metric is a measure of the similarity or dissimilarity between two data objects, and is used to form data points into clusters. Two main classes:
- Correlation coefficients: compare the shape of expression curves. Two types: centered and un-centered.
- Distance metrics: city block (Manhattan) distance and Euclidean distance.

Correlation (a measure between -1 and 1)
Pearson correlation coefficient: r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right), where S_x and S_y are the standard deviations of x and y. Others include Spearman's ρ and Kendall's τ.
Potential pitfalls: correlation = 1 captures only positive association; you can use the absolute correlation to capture both positive and negative correlation.
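The snippet below is a minimal Python sketch, not part of the original lecture: it computes a correlation-based dissimilarity (1 - r, or 1 - |r| for the absolute-correlation variant mentioned above) between two hypothetical expression profiles using numpy.

```python
# Sketch: correlation-based dissimilarity between two expression profiles.
# The profiles and the helper function are illustrative assumptions.
import numpy as np

def correlation_distance(x, y, absolute=False):
    """Return 1 - Pearson correlation (or 1 - |correlation|)."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - (abs(r) if absolute else r)

x = np.array([0.2, 1.1, 2.3, 3.0, 4.2])   # hypothetical gene profile
y = np.array([0.1, 0.9, 2.5, 3.1, 4.0])   # similar shape -> distance near 0
print(correlation_distance(x, y))
print(correlation_distance(x, -y, absolute=True))   # negative correlation is also "close"
```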
Distance metrics
- City block (Manhattan) distance: d(X, Y) = \sum_i | x_i - y_i |. Sum of differences across dimensions; less sensitive to outliers; gives diamond-shaped clusters.
- Euclidean distance: d(X, Y) = \sqrt{ \sum_i ( x_i - y_i )^2 }. The most commonly used distance; gives sphere-shaped clusters; corresponds to the geometric distance in the multidimensional space.
Here gene X = (x_1, ..., x_n) and gene Y = (y_1, ..., y_n).

Euclidean vs correlation (I)
[Figure: two genes X and Y plotted against conditions 1 and 2, contrasting what Euclidean distance and correlation each regard as "close".]

Distance between clusters
Between-cluster dissimilarity measures: single linkage (smallest pairwise distance), complete linkage (largest pairwise distance), average (mean) linkage, and distance between centroids.

Clustering algorithms
Clustering algorithms come in two basic flavors: partitioning and hierarchical.
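For comparison, here is a small sketch (again illustrative rather than from the lecture; the data matrix is invented) computing Manhattan, Euclidean and correlation-based pairwise distances with scipy:

```python
# Sketch: pairwise Manhattan, Euclidean and correlation distances between
# genes (rows) measured over a few conditions (columns); data are invented.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 3.2],
              [5.0, 0.5, 2.0]])

print(squareform(pdist(X, metric="cityblock")))    # sum_i |x_i - y_i|
print(squareform(pdist(X, metric="euclidean")))    # sqrt(sum_i (x_i - y_i)^2)
print(squareform(pdist(X, metric="correlation")))  # 1 - Pearson correlation
```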
Partitioning methods
Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups. Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimizing the within-cluster sums of squares. [Figures: example partitions with K = 2 and K = 4.]
Examples: k-means, self-organizing maps (SOM), PAM, etc.; fuzzy partitioning needs a stochastic model, e.g. Gaussian mixtures.

Hierarchical methods
Hierarchical clustering methods produce a tree or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level. The tree can be built in two distinct ways:
- bottom-up: agglomerative clustering;
- top-down: divisive clustering.
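As an illustration of the partitioning approach described above, a k-means sketch with scikit-learn; the simulated two-group data and the choice k = 2 are assumptions made for the example:

```python
# Sketch: k-means with a pre-specified k, minimizing within-cluster sums of squares.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)),   # 40 hypothetical samples, 5 genes
               rng.normal(3, 1, (20, 5))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # cluster assignment of each observation
print(km.inertia_)   # within-cluster sum of squares (the criterion being minimized)
```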
Tree re-ordering?
[Figure: agglomerative clustering of five points illustrated in two-dimensional space; the merge sequence (e.g. {1,2}, {3,4}, {1,2,5}, {1,2,3,4,5}) is drawn as a dendrogram, whose leaves may be re-ordered without changing the clustering.]

Partitioning vs. hierarchical
- Partitioning. Advantages: optimal for certain criteria; genes are automatically assigned to clusters. Disadvantages: an initial k is needed; often requires long computation times; all genes are forced into a cluster.
- Hierarchical. Advantages: faster computation; visual. Disadvantages: unrelated genes are eventually joined; rigid, cannot correct later for erroneous decisions made earlier; hard to define clusters.

Clustering microarray data
Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space. Examples:
- We can cluster cell samples (columns), e.g. (1) for identification (profiles): here we might want to estimate the number of different neuron cell types in a set of samples, based on gene expression; or (2) for the identification of new / unknown tumor classes using gene expression profiles.
- We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
- We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.
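A corresponding sketch of agglomerative (bottom-up) clustering with scipy, again on invented data: the tree is built once and a partition is obtained for any k by cutting it.

```python
# Sketch: build an average-linkage tree, then cut it at several values of k.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 4)),
               rng.normal(4, 1, (10, 4))])

Z = linkage(X, method="average", metric="euclidean")
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")  # partition with k clusters
    print(k, labels)
```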
Clustering both cell samples and genes
- Clustering cell samples: discovering sub-groups. Example taken from Alizadeh et al. (Nature, February 2000), "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling".
- Clustering genes: finding different patterns in the data. Example: yeast cell cycle (Cho et al., 1998), SOM with 828 genes, taken from Tamayo et al. (PNAS, 1999).

Some other issues in clustering
- Two-way clustering.
- Comparing trees.
- Estimating the number of clusters (a silhouette-based sketch follows below):
  - silhouette width (used in PAM);
  - gap statistic, ref: Tibshirani, Walther and Hastie, "Estimating the number of clusters in a dataset via the gap statistic";
  - Clest, ref: Dudoit & Fridlyand, "A prediction-based resampling method for estimating the number of clusters in a dataset".
- Note: select a subset of genes using their class association before clustering.
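The sketch below (illustrative data and parameter range, not from the lecture) uses the average silhouette width to compare candidate numbers of clusters, in the spirit of the silhouette criterion listed above:

```python
# Sketch: pick the number of clusters by the average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 6)),
               rng.normal(5, 1, (15, 6)),
               rng.normal(10, 1, (15, 6))])   # three simulated groups

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # larger is better
```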
Summary
Which clustering method should I use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters?
Keep in mind:
- Clustering cannot not work: every clustering method will return clusters.
- Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide biological proof.

Comparison of clustering and single-gene approaches for microarray data analysis
- Cluster analysis: (1) usually sits outside the normal framework of statistical inference; (2) is less appropriate when only a few genes are likely to change; (3) needs lots of experiments.
- Single-gene approaches: (1) may be too noisy in general to show much; (2) may not reveal coordinated effects of positively correlated genes; (3) are harder to relate to pathways.

Basic principles of discrimination
Each object is associated with a class label (or response) Y ∈ {1, 2, ..., K} and a feature vector (vector of predictor variables) of G measurements X = (X_1, ..., X_G). Aim: predict Y from X.
Example: objects with predefined classes {1, 2, ..., K}, class label Y = 2 and feature vector X = {colour, shape}; given a new object with X = {red, square}, what is Y? Building a classification rule involves a classification procedure, feature selection, performance assessment and comparison studies.
Discrimination / classification workflow
A learning set (data with known classes) is used with a classification technique to build a classification rule; the rule is then used for prediction, assigning classes to data with unknown classes.

Example 1: predefined classes are clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs); objects are arrays; feature vectors are gene expression measurements. The classification rule is applied to a new array to ask, e.g., good prognosis (metastasis > 5 yrs)? Reference: L. van 't Veer et al. (2002), "Gene expression profiling predicts clinical outcome of breast cancer", Nature, January 2002.

Example 2: predefined classes are tumor types (B-ALL, T-ALL, AML); objects are arrays; feature vectors are gene expression measurements; the rule assigns a class (e.g. T-ALL?) to a new array. Reference: Golub et al. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science 286(5439).

Classification rule
A classification rule comprises the classification procedure, feature selection, parameters (pre-determined or estimable), the distance measure, and aggregation methods; its performance is assessed e.g. by cross-validation. One can think of the classification rule as a black box; some methods provide more insight into the box than others. Performance assessment needs to be carried out for every classification rule.
Classification rule: maximum likelihood discriminant rule
A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest. For known class-conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by C(X) = \mathrm{argmax}_k \, p_k(X).

Gaussian ML discriminant rules
For multivariate Gaussian (normal) class densities X | Y = k ~ N(\mu_k, \Sigma_k), the ML classifier is
C(X) = \mathrm{argmin}_k \{ (X - \mu_k)' \Sigma_k^{-1} (X - \mu_k) + \log |\Sigma_k| \}.
In general this is a quadratic rule (quadratic discriminant analysis, QDA). In practice, the population mean vectors \mu_k and covariance matrices \Sigma_k are estimated by the corresponding sample quantities.

ML discriminant rules, special cases
- [DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix \Sigma = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2).
- [DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices \Sigma_k = \mathrm{diag}(\sigma_{1k}^2, ..., \sigma_{pk}^2).
Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (with a wrong variance calculation).

Nearest neighbor classification
Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation). The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
- find the k observations in the learning set closest to X;
- predict the class of X by majority vote, i.e. choose the class that is most common among those k observations.
The number of neighbors k can be chosen by cross-validation (more on this later).
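The following is a minimal numpy sketch of DLDA, not the authors' code: class means and a pooled per-gene (diagonal) variance are estimated, and an observation is assigned to the class with the smallest variance-scaled distance, assuming equal class priors. The simulated data are illustrative.

```python
# Sketch: diagonal linear discriminant analysis (common diagonal covariance).
import numpy as np

def dlda_fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # pooled per-gene variance: residuals from the class means, divided by n - K
    resid = np.vstack([X[y == k] - X[y == k].mean(axis=0) for k in classes])
    var = resid.var(axis=0, ddof=len(classes))
    return classes, means, var

def dlda_predict(X, classes, means, var):
    # distance of each observation to each class mean, scaled by the pooled variances
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(1, 1, (20, 10))])
y = np.array([0] * 20 + [1] * 20)
classes, means, var = dlda_fit(X, y)
print((dlda_predict(X, classes, means, var) == y).mean())  # resubstitution accuracy
```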
Nearest neighbor rule
[Figure: illustration of the nearest neighbor decision rule.]

Classification tree
Partition the feature space into a set of rectangles, then fit a simple model in each one. Binary tree-structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier. [Figure: example tree with splits on the expression of Gene 1 (M_{i1}) and Gene 2 (M_{i2} > 0.18).]

Three aspects of tree construction (a grow-then-prune sketch follows below):
- Split selection rule: for example, at each node choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
- Split-stopping: for example, grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate.
- Class assignment: for example, for each terminal node, choose the class minimizing the resubstitution estimate of the misclassification probability, given that a case falls into this node.
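A sketch of the grow-then-prune procedure with scikit-learn (not from the lecture; the simulated data and the use of cost-complexity pruning are assumptions): a large tree is grown on the Gini criterion and cross-validation selects the pruning level.

```python
# Sketch: grow a large tree, then use CV over cost-complexity pruning levels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # hypothetical two-class labels

full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas   # candidate subtrees

cv = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=0),
                  {"ccp_alpha": list(alphas)}, cv=5).fit(X, y)
print(cv.best_params_, round(cv.best_score_, 3))
```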
Classification with SVMs
Other classifiers include neural networks, logistic regression, projection pursuit, and Bayesian belief networks.

Why select features?

Explicit feature selection
- One-gene-at-a-time approaches: genes are ranked based on the value of a univariate test statistic, such as a t- or F-statistic (or their non-parametric variants, Wilcoxon / Kruskal-Wallis), or a p-value. Possible meta-parameters include the number of genes G or a p-value cutoff; a formal choice of these parameters may be achieved by cross-validation or bootstrap procedures.
- Multivariate approaches: more refined feature selection procedures consider the joint distribution of the expression measures, in order to detect genes with weak main effects but possibly strong interactions. Examples: Bo & Jonassen (2002), subset selection procedures for screening gene pairs to be used in classification; Breiman (1999), ranking genes according to an importance statistic defined in terms of prediction accuracy.
[Figure: leukemia data, 3 classes — correlation plots with no feature selection, top-100 feature selection, and selection based on variance.]
Note that tree building itself does not involve explicit feature selection.
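A one-gene-at-a-time selection sketch (invented data; the number of genes kept, G = 50, is arbitrary here and would in practice be chosen by cross-validation, as noted above):

```python
# Sketch: rank genes by an F-statistic and keep the top G of them.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 500))        # 60 arrays x 500 genes
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :10] += 1.5                 # make the first 10 genes differential

selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:50]
print(top_genes[:10])                 # indices of the top-ranked genes
```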
Implicit feature selection
Feature selection may also be performed implicitly by the classification rule itself. In classification trees, features are selected at each step based on the reduction in impurity, and the number of features used (i.e. the size of the tree) is determined by pruning the tree using cross-validation. Thus feature selection is an inherent part of tree building, and pruning deals with over-fitting. Shrinkage methods and adaptive distance functions may be used for LDA and kNN.

Performance assessment
Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the time of the initial classifier-building phase. One needs to estimate future performance based on what is available, often the same set that is used to build the classifier. Performance can be assessed by cross-validation, a test set, or independent testing on a future dataset.

Performance assessment (I)
- Resubstitution estimation: error rate on the learning set. Problem: downward bias.
- Test set estimation: (1) divide the learning set into two subsets, L and T, build the classifier on L and compute the error rate on T; or (2) build the classifier on the training set L and compute the error rate on an independent test set T. L and T must be independent and identically distributed (i.i.d.). Problem: reduced effective sample size.
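A small sketch of test-set estimation (simulated data; the split ratio and the k-NN classifier are arbitrary choices): the classifier is built on L only, and the error rate is reported on T.

```python
# Sketch: split the learning set into L and T, train on L, report error on T.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 30))
y = (X[:, 0] > 0).astype(int)

X_L, X_T, y_L, y_T = train_test_split(X, y, test_size=0.33, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_L, y_L)
print("resubstitution error:", 1 - clf.score(X_L, y_L))   # downward biased
print("test set error:      ", 1 - clf.score(X_T, y_T))
```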
Performance assessment (II)
- V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Classifiers are built leaving one subset out; test set error rates are computed on the left-out subset and averaged. Bias-variance tradeoff: smaller V can give larger bias but smaller variance. Computationally intensive.
- Leave-one-out cross-validation (LOOCV): the special case V = n. Works well for stable classifiers (e.g. k-NN).

Performance assessment (III)
It is common practice to do feature selection on the full learning set and then use CV only for model building and classification. However, usually the features are unknown and the intended inference includes feature selection; in that case CV estimates as above tend to be downward biased. Features (variables) should be selected only from the learning set used to build the model, and not from the entire set (the pipeline sketch at the end of this page illustrates this).

Another component of a classification rule: aggregating classifiers. The training set X_1, X_2, ..., X_100 is resampled many times (resamples 1, 2, ..., 500), a classifier is built on each resample (classifiers 1, 2, ..., 500), and the classifiers are combined into an aggregate classifier. Examples: bagging, boosting, random forests.
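The pipeline sketch referred to above (simulated, signal-free data; the gene filter and k-NN classifier are arbitrary): wrapping feature selection and the classifier together ensures genes are re-selected within each cross-validation fold, so the CV error estimate stays honest.

```python
# Sketch: feature selection inside the CV loop via a pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(72, 1000))       # 72 arrays, 1000 genes, no real signal
y = np.array([0] * 36 + [1] * 36)

pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV error:", round(1 - scores.mean(), 3))   # close to 0.5, as it should be
```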
Aggregating classifiers: bagging (a bagging sketch follows at the end of this page)
[Figure: the training set of arrays X_1, X_2, ..., X_100 is resampled with replacement (resamples X*_1, ..., X*_100, repeated 500 times), a tree is grown on each resample (trees 1 to 500), and for a test sample the trees vote, e.g. 90% class 1, 10% class 2.]

Comparison study
Leukemia data, Golub et al. (1999): n = 72 samples, G = 3,571 genes, 3 classes (B-cell ALL, T-cell ALL, AML). Reference: S. Dudoit, J. Fridlyand, and T. P. Speed (2002), "Comparison of discrimination methods for the classification of tumors using gene expression data", Journal of the American Statistical Association, Vol. 97, No. 457.

Results
In the main comparison, NN and DLDA had the smallest error rates, and aggregation improved the performance of CART classifiers. [Figure: leukemia data, 3 classes — test set error rates over 150 learning set / test set runs.] For the leukemia datasets, increasing the number of genes to G = 200 didn't greatly affect the performance of the various classifiers.
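A bagging sketch (simulated data; 500 resamples as in the diagram above, but the rest of the setup is invented). scikit-learn's BaggingClassifier grows a tree on each bootstrap resample by default and lets the trees vote.

```python
# Sketch: bagging of classification trees vs a single tree.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_L, X_T, y_L, y_T = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_L, y_L)
bag = BaggingClassifier(n_estimators=500, random_state=0).fit(X_L, y_L)  # trees by default
print("single tree error:", 1 - tree.score(X_T, y_T))
print("bagged trees error:", 1 - bag.score(X_T, y_T))
```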
Comparison study discussion (I)
- Diagonal LDA: ignoring the correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take gene interactions into account.
- Classification trees are capable of handling and revealing interactions between variables. In addition, aggregated tree classifiers have useful by-products: prediction votes and variable importance statistics.
- Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into the mechanisms underlying the class distinctions.

Summary (I)
- Bias-variance trade-off: simple classifiers do well on small datasets. As the number of samples increases, we expect classifiers capable of considering higher-order interactions (and aggregated classifiers) to have an edge.
- Cross-validation: it is of utmost importance to cross-validate for every parameter that has been chosen based on the data, including meta-parameters: what and how many features, how many neighbors, pooled or unpooled variance, and the classifier itself. If this is not done, it is possible to wrongly declare discrimination power when there is none.

Summary (II)
- Generalization error rate estimation: it is necessary to keep the sampling scheme in mind. Thousands and thousands of independent samples from a variety of sources are needed to address the true performance of a classifier; we are not at that point yet with microarray studies. The van 't Veer et al. (2002) study is probably the only study to date with ~300 test samples.

Case studies: breast cancer prognosis
- Learning set: bad vs good prognosis; 295 samples selected from the Netherlands Cancer Institute tissue bank.
- Classification rule: feature selection by correlation with the class labels (very similar to a t-test); cross-validation used to select 70 genes.
- Results: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.
- Agendia (formed by researchers from the Netherlands Cancer Institute) plans to start clinical trials in October: (1) 3,000 subjects [Health Council of the Netherlands]; (2) 5,000 subjects, funded by the New York-based Avon Foundation. Custom arrays including the 70 genes plus controls are made by Agilent.
References:
- Reference 1 (retrospective study): L. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, January 2002.
- Reference 2 (retrospective study): M. van de Vijver et al., "A gene expression signature as a predictor of survival in breast cancer", The New England Journal of Medicine, December 2002.
- Reference 3 (prospective trials, August 2003): clinical trials.
Acknowledgements
- Clustering: Agnes Paquet, David Erle, Andrea Barczac, and the UCSF Sandler Genomics Core Facility.
- Discrimination: Jane Fridlyand, Mark Segal, Terry Speed, Sandrine Dudoit.