
Classification

Supervised vs unsupervised clustering
- Cluster analysis: classes are not known a priori.
- Classification: classes are defined a priori; sometimes called supervised clustering.
- Extract useful features, based on the known class labels, that separate the classes in the training set.
- Assign new objects to classes based on rules developed on the training set.

Different classification methods
Statistical methods: often aim to classify as well as to identify marker genes that characterize the different classes.
- Linear discriminant analysis
- Nearest neighbors
- Logistic regression
- Classification and regression trees
Computer science methods: place less emphasis on parsimony and interpretation.
- Bayesian networks
- Neural networks
- Support vector machines

General notation for classification
The data are an expression matrix X with G genes (rows) and n samples (columns); each training sample carries a known class label.

Toy example
Space: two genes, each with a finite range of expression values.

Constructing and evaluating classifiers
Training data: used to construct the classifiers.
Cross-validation is often used in the training process:
- Leave-one-out: asymptotically equivalent to leave-n_v-out (see Shao, J. (1993), "Linear Model Selection by Cross-Validation", JASA, for details)
Test data: a separate set of data used to evaluate performance.
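
As an illustration (not from the slides), here is a minimal sketch of leave-one-out and K-fold cross-validation using scikit-learn; the toy data and the choice of classifier are placeholders.

```python
# A minimal sketch of leave-one-out and 5-fold cross-validation;
# X (n samples x p genes) and y are synthetic placeholder data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))      # 40 samples, 10 features (toy data)
y = rng.integers(0, 2, size=40)    # binary class labels

clf = KNeighborsClassifier(n_neighbors=3)

# Leave-one-out: n fits, each holding out a single sample for testing.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# 5-fold CV: a cheaper alternative that holds out n/5 samples per fold.
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_acc = cross_val_score(clf, X, y, cv=cv5).mean()
print(loo_acc, kfold_acc)
```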

Bias-variance tradeoff
[Figure: error versus model complexity (low to high). Training error decreases steadily with complexity; test error falls and then rises again. The low-complexity end has high bias and low variance; the high-complexity end has low bias and high variance.]

Nearest-neighbors discriminant rule
- The training set has samples with known classes.
- Define a distance measure: Euclidean, 1 - correlation, Mahalanobis.
- For each sample in the test set, find the k closest neighbors.
- Predict the class by majority vote.
- How to choose k: usually by cross-validation.
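
A minimal sketch, assuming scikit-learn is available, of choosing k by cross-validation and then classifying by majority vote as described above; the data here are synthetic placeholders.

```python
# Choose k by 5-fold cross-validation, then fit the final k-NN classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 2, size=60)

# Score each candidate k with cross-validation and keep the best one.
candidate_ks = [1, 3, 5, 7, 9]
cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_train, y_train, cv=5).mean()
             for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)

# Final rule: majority vote among the best_k closest (Euclidean) neighbors.
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
```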

Fisher's linear discriminant analysis

S_{pooled} = [(N_1 - 1) S_1 + (N_2 - 1) S_2] / (N_1 + N_2 - 2)
Discriminant rule: assign x to class 1 if
(\bar{x}_1 - \bar{x}_2)^T S_{pooled}^{-1} [x - (\bar{x}_1 + \bar{x}_2)/2] > 0,
otherwise to class 2. With microarray data, S_{pooled} is often singular, so a generalized inverse of S, denoted S^-, is often used in its place.
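
Below is an illustrative NumPy sketch of this two-class rule, using the Moore-Penrose pseudoinverse to play the role of the generalized inverse S^-; the function and variable names are mine, not from the slides.

```python
# Two-class discriminant rule with pooled covariance; pinv handles the
# singular S_pooled that is typical for microarray data (p >> n).
import numpy as np

def fisher_rule(X1, X2):
    """X1, X2: (N_k x p) training samples from classes 1 and 2.
    Returns a function classifying a new sample x into class 1 or 2."""
    N1, N2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    S_pooled = ((N1 - 1) * S1 + (N2 - 1) * S2) / (N1 + N2 - 2)
    w = np.linalg.pinv(S_pooled) @ (m1 - m2)   # generalized inverse S^-
    midpoint = (m1 + m2) / 2
    return lambda x: 1 if w @ (x - midpoint) > 0 else 2
```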

Fisher's linear discriminant analysis
More generally, for c > 2 classes: maximize the ratio of the between-class to the within-class sum of squares.

The problem is equivalent to maximizing v^T B v subject to v^T W v = 1, where B and W are the between-class and within-class sum-of-squares matrices. Solution: find the eigenvalues of W^{-1} B, and use the eigenvector v with the largest eigenvalue to form the discriminant v^T x.
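
An illustrative NumPy sketch of this eigen-solution, with B and W the between- and within-class scatter matrices as defined above; a pseudoinverse stands in for W^{-1} in case W is singular.

```python
# Fisher discriminant directions from the eigenvectors of pinv(W) @ B.
import numpy as np

def fisher_directions(X, y):
    """X: (n x p) data matrix, y: class labels. Returns the discriminant
    directions as columns, sorted by decreasing eigenvalue."""
    overall = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))   # between-class scatter
    W = np.zeros((p, p))   # within-class scatter
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - overall, mk - overall)
        W += (Xk - mk).T @ (Xk - mk)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(W) @ B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order]  # column 0 is the leading eigenvector v
```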

Maximum likelihood discriminant rule
ML discriminant rule: assign x to the class k whose class-conditional density is largest, C(x) = argmax_k Pr(x | y = k).
Recall Bayes' rule: Pr(y = k | x) ∝ Pr(x | y = k) Pr(y = k).
The sample ML discriminant rule plugs estimated densities into the ML rule; when the prior class probabilities are equal, the ML rule coincides with the Bayes rule.

Maximum likelihood discriminant rule: special cases
- Linear discriminant analysis
- Diagonal quadratic discriminant analysis (DQDA): class densities have diagonal covariance matrices, which may differ across classes.
- Diagonal linear discriminant analysis (DLDA): class densities share one common diagonal covariance matrix.
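
As a hedged illustration of DLDA (the function name and structure are mine, not a standard API): estimate one shared diagonal covariance and assign a sample to the class with the smallest standardized distance to its mean.

```python
# DLDA sketch: pooled per-gene variances, shared across classes.
import numpy as np

def dlda_fit_predict(X, y, x_new):
    """X: (n x p) training data, y: labels, x_new: one test sample."""
    classes = np.unique(y)
    # Pooled within-class variance for each gene (one shared diagonal).
    var = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        var += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    var /= (len(X) - len(classes))
    # Assign to the class minimizing the diagonal Mahalanobis distance.
    dists = [(((x_new - X[y == k].mean(axis=0)) ** 2) / var).sum()
             for k in classes]
    return classes[int(np.argmin(dists))]
```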

Weighted gene voting scheme
A variant of the sample ML rule with the same diagonal covariance in both classes. For the two-class case, classify a sample with gene expression profile x = (x_1, x_2, ..., x_p); the vote from each gene j is the weighted distance v_j = a_j (x_j - b_j), where a_j = (\bar{x}_{1j} - \bar{x}_{2j}) / (s_{1j} + s_{2j}) and b_j = (\bar{x}_{1j} + \bar{x}_{2j}) / 2. Classify to class 1 if \sum_j v_j > 0. In Golub et al. (1999), s_{1j} + s_{2j} is used instead of the pooled standard deviation.
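
A NumPy sketch of the voting rule as reconstructed above; all names are illustrative, and the per-gene weights and vote sum mirror the formulas just given.

```python
# Weighted gene voting (Golub et al. 1999 style): each gene casts a
# weighted vote, and the sign of the total decides the class.
import numpy as np

def weighted_vote(X1, X2, x_new):
    """X1, X2: (N_k x p) training samples; x_new: one test profile."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1 = X1.std(axis=0, ddof=1)
    s2 = X2.std(axis=0, ddof=1)
    a = (m1 - m2) / (s1 + s2)   # per-gene weight a_j
    b = (m1 + m2) / 2           # per-gene decision point b_j
    votes = a * (x_new - b)     # v_j = a_j (x_j - b_j)
    return 1 if votes.sum() > 0 else 2
```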

Logistic discriminant function

Nearest centroid discriminant rule
A variant of the Bayes rule: ignoring the covariance terms and assuming the same variance matrix for all K classes, if the prior class probabilities all equal 1/K, the rule assigns x to the class with the closest mean (centroid).
Q: filter genes or not? How to filter genes?
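
For illustration, scikit-learn's NearestCentroid implements precisely this closest-centroid assignment; a minimal sketch with placeholder data:

```python
# Nearest-centroid classification: store one centroid per class,
# then assign each sample to the class of the closest centroid.
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
y = rng.integers(0, 3, size=30)

clf = NearestCentroid().fit(X, y)   # computes the K class centroids
print(clf.predict(X[:5]))           # sample -> class of nearest centroid
```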

Nearest shrunken centroid method
Prediction Analysis for Microarrays (PAM): centroid-distance classification, regularized by shrinking the centroids. For gene i (1..G) and class k (samples j = 1..n, in K classes), the standardized centroid difference is
d_{ik} = (\bar{x}_{ik} - \bar{x}_i) / [m_k (s_i + s_0)],
where s_i is the pooled within-class standard deviation of gene i, s_0 is a small positive offset, and m_k is a normalizing constant chosen so that m_k s_i is the standard error of the numerator.

Centroid: each gene's class centroid deviates from the overall center, but some genes are not associated with the classes. Keep gene i only if its statistic d is large enough (larger than Δ), i.e., soft thresholding:
d' = d - Δ if d > Δ; d' = d + Δ if d < -Δ; d' = 0 otherwise.

Soft thresholding vs hard thresholding
Both shrink the values within the threshold to 0. Hard (direct) thresholding leaves the other values intact; soft thresholding shrinks every value toward 0 by Δ.

Shrunken centroid: the class centroid is shrunken back to the global mean whenever its deviation from it is not large enough to survive the threshold. Lastly: how to choose Δ? Typically by cross-validation.
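
A minimal NumPy sketch contrasting the two thresholding rules on the statistics d; the function names are illustrative.

```python
# Hard vs soft thresholding of the per-gene statistics d at threshold Delta.
import numpy as np

def hard_threshold(d, delta):
    # Zero out small values; leave the rest intact.
    return np.where(np.abs(d) > delta, d, 0.0)

def soft_threshold(d, delta):
    # Zero out small values AND shrink the rest toward 0 by delta.
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

d = np.array([-2.0, -0.5, 0.2, 1.5])
print(hard_threshold(d, 1.0))   # [-2.   0.   0.   1.5]
print(soft_threshold(d, 1.0))   # [-1.   0.   0.   0.5]
```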

Discriminant rule and probability
For one test sample x*, compute a discriminant score for each class from its distance to that class's shrunken centroid (standardized by s_i + s_0 and adjusted for the class prior), and classify to the class with the best score; the same scores can be converted into estimated class probabilities.

Classification trees
Split the data using a set of binary decisions. The root node (containing all data points) has a certain impurity, and each split reduces impurity: it is highest at the root and lowest (0) at a pure leaf node. Measures of impurity: entropy and the Gini index. Prune the tree to prevent overfitting.
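
For concreteness, a minimal sketch of the two impurity measures named above, computed from the class counts at a node:

```python
# Node impurity from class counts: entropy and Gini index.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

print(entropy([5, 5]), gini([5, 5]))    # maximal impurity: 1.0, 0.5
print(entropy([10, 0]), gini([10, 0]))  # pure node: 0.0, 0.0
```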

A separating hyperplane in the feature space may correspond to a non-linear boundary in the input space. The figure shows the classification boundary (solid line) in a two-dimensional input space as well as the accompanying soft margins (dotted lines). Positive and negative examples fall on opposite sides of the decision boundary. The support vectors (circled) are the points lying closest to the decision boundary.
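
As an illustration of the figure's point, a minimal scikit-learn sketch: an RBF-kernel SVM fits a linear hyperplane in feature space that corresponds to a non-linear boundary in the 2-D input space. The dataset and parameters here are placeholders, not from the slides.

```python
# Soft-margin SVM with an RBF kernel on a dataset that is not linearly
# separable in the input space (concentric circles).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, noise=0.1, factor=0.4, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)  # hyperplane lives in feature space
print("number of support vectors:", svm.support_vectors_.shape[0])
print("training accuracy:", svm.score(X, y))
```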

Resources for learning SVMs and their application to microarrays
- Furey, T. S., et al. (2000), "Support vector machine classification and validation of cancer tissue samples using microarray expression data", Bioinformatics.
- "Support Vector Machine Classification of Microarray Gene Expression Data", http://www.cse.ucsc.edu/research/compbio/genex/genextr2html/genex.html
- "Classifying Microarray Data Using Support Vector Machines", http://cbcl.mit.edu/projects/cbcl/publications/ps/svmmicro.pdf