Supervised vs unsupervised clustering


Harold Gaines

1 Classification

2 Supervised vs unsupervised clustering
- Cluster analysis: classes are not known a priori.
- Classification: classes are defined a priori; sometimes called supervised clustering.
- Extract useful features, based on known class labels, that separate the classes in the training set.
- Assign new objects to classes based on rules developed on the training set.

3 Different classification methods
- Statistical methods: often aim to classify as well as to identify marker genes that characterize the different classes
  - Linear discriminant analysis
  - Nearest neighbors
  - Logistic regression
  - Classification and regression tree
- Computer science methods: do not emphasize parsimony or interpretation
  - Bayesian network
  - Neural network
  - Support vector machine

4 General notation for classification
$X_{G \times n}$: the expression matrix for G genes and n samples.

5 Toy example Space: 2 genes, finite range of expression measure.

6 Constructing and evaluating classifiers
- Training data: used to construct the classifiers.
- Cross-validation: often used within the training process.
  - Leave-one-out
  - Leave-$n_\nu$-out
  (see Shao J (1993), "Linear Model Selection by Cross-Validation", JASA, for the asymptotic properties of both)
- Test data: a separate set of data used to evaluate performance.
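As an illustration of the leave-one-out idea, here is a minimal sketch in plain Python. The `predict` callback, the `one_nn` rule, and the toy two-gene data are assumptions made for this example; they are not part of the original slides.

```python
def leave_one_out_error(samples, labels, predict):
    """Fraction of samples misclassified when each one is held out in turn.

    `predict(train_x, train_y, x)` is a hypothetical callback: it builds a
    classifier from the training part and returns a predicted label for x.
    """
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # everything except sample i
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def one_nn(train_x, train_y, x):
    """1-nearest-neighbor rule with squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]


# Toy data (assumed): two genes, two well-separated classes.
samples = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
labels = ["A", "A", "A", "B", "B", "B"]
loo_err = leave_one_out_error(samples, labels, one_nn)
```

The error estimated this way is only used for tuning within the training process; a separate test set is still needed to evaluate the final classifier.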

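11 Bias-variance tradeoff
[Figure: training error and test error plotted against model complexity (low to high). Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.]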

14 Nearest-neighbors discriminant rule
- The training set has samples with known classes.
- Define a distance measure: Euclidean, 1 − correlation, or Mahalanobis.
- For each sample in a test set, find the k closest neighbors in the training set.
- Predict the class by majority vote.
- How to choose k: usually by cross-validation.
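The steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not a tuned implementation: the function names and toy data are assumptions, Euclidean distance is used by default, and ties in the vote go to the label encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, x, k=3, dist=euclidean):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    # Indices of training samples, sorted from closest to farthest.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set (assumed): two genes, two classes.
train_x = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
pred_low = knn_predict(train_x, train_y, (1.0, 1.0), k=3)
pred_high = knn_predict(train_x, train_y, (3.0, 3.0), k=3)
```

Another distance, such as 1 − correlation, can be passed through the `dist` argument without changing the voting step.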

15 Fisher's linear discriminant analysis

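16 $S_{\text{pooled}} = [(N_1 - 1) S_1 + (N_2 - 1) S_2] / (N_1 + N_2 - 2)$
Discriminant rule: assign $x$ to class 1 if $(\bar{x}_1 - \bar{x}_2)^\top S_{\text{pooled}}^{-1} \left(x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right) \ge 0$, otherwise to class 2.
With microarray data, $S$ is often singular, and a generalized inverse of $S$, denoted $S^{-}$, is often used.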
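A minimal sketch of the two-class rule for the two-gene case, assuming the standard form of Fisher's rule (project x minus the midpoint of the class means onto the inverse pooled covariance times the mean difference). The helper names and data are hypothetical. The explicit 2×2 inverse assumes a nonsingular pooled covariance matrix; with many genes and few samples a generalized inverse would take its place.

```python
def mean_vec(rows):
    """Per-gene mean for 2-gene data."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def cov2(rows, mean):
    """2x2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b] / (n - 1)
    return s


def fisher_rule(class1, class2):
    """Return a function that maps a 2-gene profile x to class 1 or 2."""
    m1, m2 = mean_vec(class1), mean_vec(class2)
    s1, s2 = cov2(class1, m1), cov2(class2, m2)
    n1, n2 = len(class1), len(class2)
    # Pooled covariance: [(N1-1)S1 + (N2-1)S2] / (N1+N2-2)
    sp = [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
           for b in range(2)] for a in range(2)]
    det = sp[0][0] * sp[1][1] - sp[0][1] * sp[1][0]  # assumed nonzero here
    inv = [[sp[1][1] / det, -sp[0][1] / det],
           [-sp[1][0] / det, sp[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # Discriminant direction w = S_pooled^{-1} (mean1 - mean2)
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2]

    def classify(x):
        score = w[0] * (x[0] - mid[0]) + w[1] * (x[1] - mid[1])
        return 1 if score >= 0 else 2

    return classify


# Toy data (assumed): two genes, four samples per class.
class1 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
class2 = [(4.0, 5.0), (5.0, 4.0), (4.5, 5.5), (5.5, 4.5)]
classify = fisher_rule(class1, class2)
```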

18 Fisher's linear discriminant analysis
More generally, for c > 2 classes: maximize the ratio of between-class to within-class sums of squares.

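19 The problem is equivalent to maximizing $a^\top B a$ subject to $a^\top W a = 1$, where $B$ and $W$ are the between-class and within-class sum-of-squares matrices.
Solution: find the eigenvalues of $W^{-1} B$.
Use the eigenvector $v$ with the largest eigenvalue to form the discriminant variable $v^\top x$.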

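20 Maximum likelihood discriminant rule
- ML discriminant rule: $C(x) = \arg\max_k \Pr(x \mid y = k)$
- Recall Bayes rule: $C(x) = \arg\max_k \Pr(y = k \mid x) = \arg\max_k \pi_k \Pr(x \mid y = k)$, where $\pi_k$ is the prior probability of class $k$. With equal priors, the Bayes rule reduces to the ML discriminant rule.
- Sample ML discriminant rule: the class densities are estimated from the training set.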

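21 Maximum likelihood discriminant rule: special cases
- Linear discriminant analysis (LDA): class densities share a common covariance matrix
- Diagonal quadratic discriminant analysis (DQDA): class densities have diagonal covariance matrices
- Diagonal linear discriminant analysis (DLDA): class densities have the same diagonal covariance matrix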
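To make the two diagonal special cases concrete, here is a minimal sketch assuming Gaussian class densities and equal priors. Under those assumptions DQDA minimizes the sum over genes of (x_j − mean_kj)² / var_kj + log var_kj, and DLDA minimizes the sum of (x_j − mean_kj)² / var_j with a pooled per-gene variance. Function names and data are hypothetical.

```python
import math


def class_stats(rows):
    """Per-gene mean and sample variance (divisor n - 1) for one class."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
                 for j in range(p)]
    return means, variances


def dqda_predict(classes, x):
    """DQDA: class-specific diagonal covariance. classes maps label -> rows."""
    scores = {}
    for label, rows in classes.items():
        means, variances = class_stats(rows)
        scores[label] = sum((xj - m) ** 2 / v + math.log(v)
                            for xj, m, v in zip(x, means, variances))
    return min(scores, key=scores.get)


def dlda_predict(classes, x):
    """DLDA: one diagonal covariance shared by all classes (pooled variances)."""
    stats = {label: class_stats(rows) for label, rows in classes.items()}
    dof = sum(len(rows) for rows in classes.values()) - len(classes)
    pooled = [sum((len(classes[lab]) - 1) * stats[lab][1][j] for lab in classes) / dof
              for j in range(len(x))]
    scores = {lab: sum((xj - m) ** 2 / v
                       for xj, m, v in zip(x, stats[lab][0], pooled))
              for lab in classes}
    return min(scores, key=scores.get)


# Toy data (assumed): two genes, four samples per class.
classes = {
    "A": [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)],
    "B": [(4.0, 5.0), (5.0, 4.0), (4.5, 5.5), (5.5, 4.5)],
}
dlda_label = dlda_predict(classes, (1.2, 1.4))
dqda_label = dqda_predict(classes, (5.0, 4.8))
```

Both functions need a nonzero variance for every gene; a gene that is constant within a class would have to be filtered out or regularized first.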

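22 Weighted gene voting scheme
- Variant of sample ML with the same diagonal covariance.
- For the two-class case, classify a sample with gene expression profile $x = (x_1, x_2, \ldots, x_p)$.
- The vote from each gene $j$ is a weighted distance: $v_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{\hat{\sigma}_j^2} \left(x_j - \frac{\bar{x}_{1j} + \bar{x}_{2j}}{2}\right)$
- Classify to class 1 if $\sum_j v_j \ge 0$, i.e., $\sum_j (x_j - \bar{x}_{1j})^2 / \hat{\sigma}_j^2 \le \sum_j (x_j - \bar{x}_{2j})^2 / \hat{\sigma}_j^2$
- In Golub et al (1999), $\hat{\sigma}_{1j} + \hat{\sigma}_{2j}$ is used instead of $\hat{\sigma}_j^2$.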
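A minimal sketch of per-gene voting for two classes. It assumes each gene's vote is (mean1 − mean2) / scale × (x_j − midpoint of the two class means), with the pooled variance as the scale in the sample-ML variant and the sum of the two class standard deviations in the Golub-style variant. Names, the `golub` flag, and the data are assumptions for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_var(values):
    """Sample variance with divisor n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def gene_votes(class1, class2, x, golub=False):
    """Return the per-gene votes; positive votes favor class 1."""
    n1, n2 = len(class1), len(class2)
    votes = []
    for j, xj in enumerate(x):
        g1 = [row[j] for row in class1]
        g2 = [row[j] for row in class2]
        m1, m2 = mean(g1), mean(g2)
        v1, v2 = sample_var(g1), sample_var(g2)
        if golub:
            # Golub-style scale: sum of the two class standard deviations.
            scale = math.sqrt(v1) + math.sqrt(v2)
        else:
            # Sample-ML (DLDA) scale: pooled within-class variance.
            scale = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        votes.append((m1 - m2) / scale * (xj - (m1 + m2) / 2))
    return votes


def classify_by_vote(class1, class2, x, golub=False):
    """Class 1 if the summed votes are non-negative, otherwise class 2."""
    return 1 if sum(gene_votes(class1, class2, x, golub)) >= 0 else 2


# Toy data (assumed): two genes, four samples per class.
class1 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
class2 = [(4.0, 5.0), (5.0, 4.0), (4.5, 5.5), (5.5, 4.5)]
```

The two variants weight the genes differently, so they can disagree on samples near the boundary even though they agree on clear-cut cases like the ones in this toy data.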

23 Logistic discriminant function

24 Nearest centroid discriminant rule
- Variant of the Bayes rule.
- Ignore the covariance terms and assume the same variance matrix for all classes k.
- If the prior class probabilities are all equal to 1/K, the rule assigns x to the class with the closest mean (centroid).
- Q: filter genes or not? How to filter genes?
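With equal priors and a common variance, the rule reduces to a distance comparison, which a short sketch makes explicit. The function names and data below are assumptions for illustration; no gene filtering is applied.

```python
def centroid(rows):
    """Per-gene mean of the training samples in one class."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def nearest_centroid(classes, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    cents = {label: centroid(rows) for label, rows in classes.items()}

    def sq_dist(c):
        return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

    return min(cents, key=lambda label: sq_dist(cents[label]))


# Toy data (assumed): two genes, four samples per class.
classes = {
    "A": [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)],
    "B": [(4.0, 5.0), (5.0, 4.0), (4.5, 5.5), (5.5, 4.5)],
}
```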

25 Nearest shrunken centroid method
- Prediction Analysis for Microarrays (PAM)
- Centroid distance classification
- Regularize by shrinking the centroids
- Gene i (1 to G), sample j (1 to n, in K classes); for gene i and class k:
  $d_{ik} = (\bar{x}_{ik} - \bar{x}_i) / [m_k (s_i + s_0)]$
- $s_i$ is the pooled within-class standard deviation of gene i; notice that $m_k s_i$ is the standard error of $\bar{x}_{ik} - \bar{x}_i$

26 Centroid: each gene in each class centroid deviates from the overall center.
- Some genes are not associated with the classes.
- Let's keep gene i only if its statistic d is large enough (larger than Δ in absolute value), i.e., d' = d − Δ if d > Δ; d' = d + Δ if d < −Δ; and d' = 0 otherwise.
- This is soft thresholding.

27 Soft thresholding/hard thresholding
- Both shrink the values within the threshold to 0.
- Hard (direct) thresholding leaves the other values intact.
- Soft thresholding shrinks every other value toward 0 as well.
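The difference between the two operators is easiest to see side by side. In this minimal sketch the function names are assumptions, and `delta` plays the role of Δ.

```python
def soft_threshold(d, delta):
    """Shrink d toward 0 by delta; values within [-delta, delta] become 0."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0


def hard_threshold(d, delta):
    """Set values within [-delta, delta] to 0 and leave the others unchanged."""
    return d if abs(d) > delta else 0.0


# Example statistics (assumed values) thresholded at delta = 1.0.
stats = [2.5, -2.5, 0.4, -0.9]
soft = [soft_threshold(d, 1.0) for d in stats]
hard = [hard_threshold(d, 1.0) for d in stats]
```

Values that survive the threshold keep their full size under hard thresholding but are reduced by `delta` in magnitude under soft thresholding.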

28 Shrunken centroid
- A class centroid is shrunken to the global mean for a gene when its difference from the overall center is not significant.
- Lastly: how to choose Δ?

31 Discriminant rule and probability For one test sample

35 Split data using a set of binary decisions
- The root node (with all data points) has a certain impurity; splitting reduces impurity.
- Impurity is highest at the root and lowest (0) at a leaf node.
- Measures of impurity: entropy, Gini index.
- Prune the tree to prevent overfitting.
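The two impurity measures, and the impurity reduction achieved by one binary split, can be sketched as follows. The function names and labels are assumptions for illustration; the reduction is computed as the parent impurity minus the size-weighted impurities of the two children.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (impurity(parent)
            - (len(left) / n) * impurity(left)
            - (len(right) / n) * impurity(right))


# A root node with two balanced classes, split perfectly into pure leaves.
root = ["a", "a", "b", "b"]
left, right = ["a", "a"], ["b", "b"]
gain = impurity_reduction(root, left, right)
```

A tree-growing procedure would evaluate this reduction for every candidate split and keep the best one; pruning then removes splits whose reduction does not hold up on held-out data.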

38 A separating hyperplane in the feature space may correspond to a non-linear boundary in the input space. The figure shows the classification boundary (solid line) in a two-dimensional input space as well as the accompanying soft margins (dotted lines). Positive and negative examples fall on opposite sides of the decision boundary. The support vectors (circled) are the points lying closest to the decision boundary.

40 Resources for learning SVM and application in microarrays
- SVM classification and validation of cancer tissue samples using microarray expression data (T S Furey et al, 2000, Bioinformatics)
- Support Vector Machine Classification of Microarray Gene Expression Data x.html
- Classifying Microarray Data Using Support Vector Machines
