Statistisches Data Mining (StDM), Week 5. Oliver Dürr, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften, oliver.duerr@zhaw.ch. Winterthur, 17 October 2017
Multitasking reduces learning efficiency: no laptops during the theory lecture. Lids closed or almost closed (sleep mode).
Overview of the semester. Part I (Unsupervised Learning):
- Dimension reduction: PCA
- Similarities, distances between objects: Euclidean, L-norms, Gower
- Visualizing similarities (in 2D): MDS, t-SNE
- Clustering: K-means, hierarchical clustering
Part II (Supervised Learning)
Clustering (see Section 10.3, Clustering Methods, in ISLR)
Now again in line with ISLR; see Section 10.3, Clustering Methods. Aims: PCA (and other dimension reduction methods) looks to find a low-dimensional representation of the observations; clustering looks to find homogeneous subgroups among the observations. Examples of applications: personalized medicine (segment patients into subgroups needing different medication), market segmentation.
Descriptive and unsupervised: Cluster Analysis. Cluster analysis or clustering is the task of assigning a set of objects into groups. Objects in the same cluster should be more similar to each other than to those in other clusters. To perform clustering one must define a measure of similarity or distance based on the observed values describing different properties of the objects, e.g. the Euclidean distance
$\mathrm{dist}(o_k, o_l) = \sqrt{\sum_{i=1}^{p} (x_{ki} - x_{li})^2}$
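The Euclidean distance above is straightforward to compute; a minimal sketch in Python (the course material itself uses R, so treat this as an illustration only):

```python
import math

def euclidean(o_k, o_l):
    # dist(o_k, o_l) = sqrt( sum_i (x_ki - x_li)^2 ) over the p observed properties
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o_k, o_l)))

# two objects described by p = 2 properties
print(euclidean((0, 0), (3, 4)))  # -> 5.0
```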
The notion of a cluster can be ambiguous: how many clusters? [The same point set interpreted as two clusters, four clusters, or six clusters.]
Types of Clustering. Important distinction between hierarchical and partitional sets of clusters:
- Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Partitional Clustering (Recap)
What is optimized in K-means Clustering? The goal in K-means is to partition the observations into K clusters such that the total within-cluster variation (WCV), summed over all K clusters $C_k$, is as small as possible:
$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \mathrm{WCV}(C_k)$
WCV is often based on squared Euclidean distances between data points $i$ and $i'$:
$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$
where $|C_k|$ denotes the number of observations in the kth cluster and p is the number of variables (dimensions).
Partitioning Clustering: K-means. Given a number of objects and an initial (randomly chosen) set of cluster centers, assign each object to the closest cluster center.
Partitioning Clustering: K-means + Update the coordinates of each cluster center to the average coordinate of the objects associated with it
Partitioning Clustering: K-means. Re-assign each object to the closest cluster centre.
Partitioning Clustering: K-means. Recalculate the co-ordinates of each cluster centre according to the average co-ordinate of the objects associated with it.
Partitioning Clustering: K-means. Repeat these steps until no reassignment is possible (or the maximal number of iterations has been reached).
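The K-means loop above can be sketched in a few lines of Python; the toy data and the fixed random seed are assumptions for illustration, not part of the slides:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial (randomly chosen) cluster centres
    clusters = []
    for _ in range(iters):
        # assignment step: each object goes to its closest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each centre to the mean of its assigned objects
        new_centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:  # no reassignment possible -> stop
            break
        centers = new_centers
    return centers, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

On this toy data the two obvious clumps are recovered regardless of the initial centres.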
Hierarchical Clustering
Types of Clustering. Important distinction between hierarchical and partitional sets of clusters:
- Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. K-means clustering needs K!
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. Constructs a complete tree of dependencies; no need to specify K in advance.
Hierarchical Clustering. [Dendrogram on points p1, p2, p3, p4.] From bottom to top: agglomerative (the usual way, done here). From top to bottom: divisive.
How to do hierarchical clustering? Without proof: the number of dendrograms with n leaves is (2n − 3)! / [2^(n−2) (n − 2)!]

Number of leaves | Number of possible dendrograms
2 | 1
3 | 3
4 | 15
5 | 105
... | ...
10 | 34,459,425

Since we cannot test all possible dendrograms, we have to search heuristically. We could do this
- Bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
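The counting formula can be checked directly; a small Python sketch reproducing the table above:

```python
from math import factorial

def n_dendrograms(n):
    # number of possible dendrograms with n leaves: (2n - 3)! / (2^(n-2) * (n - 2)!)
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, n_dendrograms(n))  # 1, 3, 15, 105, 34459425
```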
Dissimilarity between samples or observations. Any dissimilarity we have seen before can be used:
- Euclidean
- Manhattan
- simple matching coefficient
- Jaccard dissimilarity
- Gower's dissimilarity
- etc.
Agglomerative Hierarchical Clustering. Problem: we need a generalization of the distance between single objects to a distance between groups of objects. What is the distance between two such groups? → Linkage
Dissimilarity between clusters: Linkages
- Single link (min): smallest distance between point-pairs linking both clusters
- Complete link (max): largest distance between point-pairs
- Average: average distance over all point-pairs
- Ward's: in this method we try to minimize the variance of the merged clusters
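The first three linkage rules differ only in how the point-pair distances are aggregated; a minimal Python sketch (Ward's, which works on variances rather than pairwise distances, is omitted, and the two toy clusters are made up for illustration):

```python
import math

def pair_dists(A, B):
    # Euclidean distances of all point-pairs linking cluster A and cluster B
    return [math.dist(a, b) for a in A for b in B]

def single_link(A, B):    # smallest point-pair distance
    return min(pair_dists(A, B))

def complete_link(A, B):  # largest point-pair distance
    return max(pair_dists(A, B))

def average_link(A, B):   # average over all point-pair distances
    d = pair_dists(A, B)
    return sum(d) / len(d)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # -> 3.0 6.0 4.5
```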
How to read a dendrogram. The position of a join node on the distance scale indicates the distance between the merged clusters (this distance depends on the linkage method); in R: clust$height. For example, if two clusters are merged at height 22, the distance between those clusters was 22. When you read a dendrogram, you want to determine at what stage the distance between the clusters that are combined is large: look for large distances between sequential join nodes (here vertical lines).
Simple example: draw the dendrogram for the one-dimensional data matrix. Use the Euclidean distances and single linkage.
Simple example
x = c(1, 3, 6, 6.5)                     # one-dimensional data
names(x) = c('1', '3', '6', '6.5')      # label the observations with their values
d = dist(x)                             # Euclidean distance matrix
cluster = hclust(d, method = 'single')  # single-linkage hierarchical clustering
plot(cluster, hang = -10, axes = FALSE) # draw the dendrogram
axis(2)                                 # add the distance scale
Compare linkage methods. Single linkage produces long and skinny clusters; Ward's often produces very well-separated clusters; average linkage yields more rounded clusters. Generally, clustering is an exploratory tool: use the linkage which produces the best results. [Dendrograms of the same data under average linkage, single linkage, and Ward's linkage.]
Cluster results depend on the data structure, the distances and the linkage methods. Data: we simulated two 2D Gaussian clusters with very different sizes. [Panels: single linkage, complete linkage, average linkage.] Ward's likes to produce clusters of equal sizes.
Heatmaps [see code]
What is a classification task? Classification is a prediction method. Idea: train a classifier based on training data (examples with known class labels) and use the classifier to classify new test observations with unknown class labels. [Pictures: dromedaries, camels, and an unknown animal.] Which features should we use to describe an observation (animal)?
Feature extraction

ID of animal | Class label | Number of legs | Number of bumps | Length of legs [cm]
1 | Dromedary | 4 | 1 | 98
2 | Camel | 3 | 2 | 87
... | ... | ... | ... | ...
150 | Camel | 4 | 2 | 103

Defining appropriate features is essential for the success of the classification task! It is not always as simple as in this example: features can be combined into new features or selected.
Data Matrix

Class labels Y: Dromedary, Camel, Camel, ... Labels are categorical; if the labels are continuous → regression.

Data matrix X with several features (can also contain categorical values):

Number of legs | Number of bumps | Length of legs [cm]
4 | 1 | 98
3 | 2 | 87
4 | 2 | 103

One row is called a feature vector. In classification, aka supervised learning, we try to predict the class labels using the features.
Principal Idea of Classification

Training data:

id | Type | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
1 | setosa | 5.1 | 3.5 | 1.4 | 0.2
2 | setosa | 4.9 | 3 | 1.4 | 0.2
3 | virginica | 3.3 | 3.2 | 1.6 | 0.5
4 | setosa | 5.1 | 3.5 | 1.4 | 0.2
... | ... | ... | ... | ... | ...
150 | virginica | 4.9 | 3 | 1.4 | 0.2

Learn a classifier from the training data (classifiers: neural networks, decision trees, ...).

Unknown data / test data (the classifier predicts the Type):

id | Type | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
1 | ? | 3.1 | 3.5 | 1.4 | 0.2
2 | ? | 4.9 | 3 | 1.4 | 0.2
3 | ? | 3.3 | 3.2 | 1.6 | 0.5
4 | ? | 5.1 | 3.5 | 1.4 | 0.2

Note: to evaluate the performance, a part of the labelled data is not used to train the classifier but left aside to check the performance of the classifier on new data.
Examples of classification tasks:
- Sentiment analysis: is a given text, e.g. a tweet about a product, positive, negative or neutral? "The movie XXX is actually neither that funny, nor super witty" → negative
- Churn in marketing: predict which customers want to quit and offer them a discount
- Face detection: image (array of pixels) → John
K-Nearest-Neighbors in a nutshell. Idea of kNN classification:
- Start with an observation x_0 with unknown class label
- Find the k training observations that have the smallest distance to x_0
- Use the majority class among the k neighbors as the class label for x_0
R functions to know: knn from package class
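The three steps above can be sketched directly (the slides use knn from the R package class; this Python version with made-up toy data only illustrates the idea):

```python
import math
from collections import Counter

def knn_predict(x0, X_train, y_train, k=3):
    # find the k training observations with the smallest distance to x0
    nearest = sorted(range(len(X_train)), key=lambda i: math.dist(x0, X_train[i]))[:k]
    # use the majority class among the k neighbours as the label for x0
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict((0.5, 0.5), X, y, k=3))  # -> a
```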
kNN classifier with 3 class labels after training with k=1. [Left: data with true class labels. Right: the trained classifier (kNN with k=1).]
The effect of K
The effect of K. Which k to use? Let's quantify the error / accuracy.
Accuracy as performance measure. Evaluate prediction accuracy on data. Confusion matrix:

                      ACTUAL CLASS
                      Class=Yes   Class=No
PREDICTED  Class=Yes  a (TP)      c (FP)
CLASS      Class=No   b (FN)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Simply count # correct / all. For an ideal classifier the off-diagonal entries should be zero: c = 0, b = 0, or Accuracy = 1. Source: Tan, Steinbach, Kumar
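The accuracy formula in terms of the confusion-matrix entries, with made-up counts for illustration:

```python
def accuracy(tp, fn, fp, tn):
    # Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical confusion matrix: a = 40, b = 10, c = 5, d = 45
print(accuracy(tp=40, fn=10, fp=5, tn=45))  # -> 0.85
```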
Types of Errors
- Training error or in-sample error: error on the data that have been used to train the model
- Test error or out-of-sample error: error on previously unseen records (out of sample)
- Overfitting phenomenon: the model fits the training data well (small training error) but shows a high generalization error
Perfect vs. simple classifier. [Scatter plots of a simple and a perfect classifier in the (x_1, x_2) plane.] Which is better? Check on a test set (don't use all your labelled data to train).
Cross validation of the simple classifier. Training set: 6/29 ≈ 20% misclassification. Test set: 2/25 = 8% misclassification. [Train and test scatter plots.]
Cross validation of the perfect classifier. Training set: 0% misclassification. Test set: 8/25 = 32% misclassification. [Train and test scatter plots.]
Cross validation of the perfect classifier. [Combined plot of the train and test sets.]
Which one to use?
What is the right level of complexity? [Plot of error versus model complexity 1/k.]
What is the right level of complexity? Example from ISLR.
Occam's razor. A more complex model always performs better on training data than a simpler model. Models should therefore be evaluated on test data to determine the generalization error. If comparing performance on training data, the model complexity should be taken into account (penalize for complexity). Given two models with similar generalization errors, one should prefer the simpler model over the more complex one. "Make everything as simple as possible, but not simpler." (A. Einstein)