Statistisches Data Mining (StDM), Week 5. Oliver Dürr, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften, oliver.duerr@zhaw.ch. Winterthur, 17 October 2017

Multitasking lowers learning efficiency: no laptops during the theory lectures; lids closed or almost closed (sleep mode).

Overview of the semester. Part I (Unsupervised Learning): dimension reduction (PCA); similarities and distances between objects (Euclidean, L-norms, Gower); visualizing similarities in 2D (MDS, t-SNE); clustering (k-means, hierarchical clustering). Part II (Supervised Learning).

Clustering. See Section 10.3, Clustering Methods, in ISLR. [figure: scatter plot of the data with axes x1 and x2]

Now again in line with ISLR: see Section 10.3, Clustering Methods. Aims: PCA (and other dimension reduction methods) looks for a low-dimensional representation of the observations; clustering looks for homogeneous subgroups among the observations. Examples of applications: personalized medicine (segment patients into subgroups needing different medication), market segmentation.

Descriptive and unsupervised: cluster analysis. Cluster analysis or clustering is the task of assigning a set of objects into groups. Objects in the same cluster should be more similar to each other than to objects in other clusters. [figure: scatter plot of grouped points with axes x1 and x2] To perform clustering one must define a measure of similarity or distance based on the observed values describing the different properties of the objects, e.g. the Euclidean distance $\mathrm{dist}(o_k, o_l) = \sqrt{\sum_{i=1}^{p} (x_{ki} - x_{li})^2}$.

The notion of a cluster can be ambiguous. How many clusters are there? [figure: the same points grouped into two, four, or six clusters]

Types of Clustering. An important distinction is between hierarchical and partitional sets of clusters. Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering (Recap)

What is optimized in k-means clustering? The goal in k-means is to partition the observations into K clusters such that the total within-cluster variation (WCV), summed over all K clusters $C_k$, is as small as possible. The WCV is usually based on squared Euclidean distances between data points $i$ and $i'$:

$W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$

where $|C_k|$ denotes the number of observations in the kth cluster and p is the number of variables (dimensions). K-means seeks the assignment that minimizes $\sum_{k=1}^{K} W(C_k)$.
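
As a small sketch (data and variable names are made up here, not from the lecture), the WCV of a k-means fit can be computed directly from this pairwise formula and compared with the tot.withinss value reported by R's kmeans(); the pairwise form equals exactly twice the sum of squared distances to the cluster centroids.

## Sketch: within-cluster variation (WCV) for a k-means fit on made-up data.
## Names (X, km, wcv_pairwise) are illustrative, not from the lecture.
set.seed(1)
X  <- matrix(rnorm(100 * 2), ncol = 2)      # 100 observations, p = 2
km <- kmeans(X, centers = 3, nstart = 20)   # K = 3 clusters

# W(C_k): (1 / |C_k|) * sum over all pairs i, i' in C_k of their
# squared Euclidean distance
wcv_pairwise <- sapply(1:3, function(k) {
  Xk <- X[km$cluster == k, , drop = FALSE]
  sum(as.matrix(dist(Xk))^2) / nrow(Xk)
})

# The pairwise form is twice the sum of squared distances to the
# cluster centroid, i.e. twice kmeans()'s within-cluster sum of squares.
sum(wcv_pairwise)     # total WCV
2 * km$tot.withinss   # same value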

Partitioning Clustering: K-means. Given a number of objects and an initial (randomly chosen) set of cluster centers, assign each object to the closest cluster center.

Partitioning Clustering: K-means. Update the coordinates of each cluster center to the average coordinates of the objects assigned to it.

Partitioning Clustering: K-means. Re-assign each object to the closest cluster center.

Partitioning Clustering: K-means. Recalculate the coordinates of each cluster center as the average coordinates of the objects assigned to it.

Partitioning Clustering: K-means. Repeat these steps until no reassignment occurs (or the maximal number of iterations has been reached).
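
A minimal sketch of this loop in R (plain Lloyd's algorithm on made-up data; all names are illustrative, and in practice one would simply call kmeans()):

## Sketch of the k-means loop described above (Lloyd's algorithm).
## Empty clusters are not handled in this short sketch.
set.seed(42)
X <- matrix(rnorm(60 * 2), ncol = 2)    # toy data: 60 objects, 2 features
K <- 3
centers <- X[sample(nrow(X), K), ]      # random initial cluster centers

for (iter in 1:100) {                   # cap the number of iterations
  # 1) assign each object to the closest cluster center
  d2      <- sapply(1:K, function(k) colSums((t(X) - centers[k, ])^2))
  cluster <- max.col(-d2)               # index of the smallest distance
  # 2) move each center to the average of the objects assigned to it
  new_centers <- t(sapply(1:K, function(k)
    colMeans(X[cluster == k, , drop = FALSE])))
  # 3) stop when no center moves any more
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}
table(cluster)                          # cluster sizes after convergence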

Hierarchical Clustering

Types of Clustering. An important distinction is between hierarchical and partitional sets of clusters. Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset; k-means clustering needs K to be specified. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree; it constructs a complete tree of dependencies, so there is no need to specify K in advance.

Hierarchical Clustering. [figure: points p1 to p4 and the corresponding dendrogram] From bottom to top: agglomerative (the usual way, and the one used here). From top to bottom: divisive.

How to do hierarchical clustering? Without proof: the number of possible dendrograms with n leaves is $(2n-3)!\,/\,[2^{\,n-2}\,(n-2)!]$. For example, for n = 4 this gives $5!/(2^2 \cdot 2!) = 120/8 = 15$.

Number of leaves | Number of possible dendrograms
2 | 1
3 | 3
4 | 15
5 | 105
... | ...
10 | 34,459,425

Since we cannot test all possible dendrograms, we have to search heuristically. We can do this:

Bottom-up (agglomerative): start with each item in its own cluster, find the best pair to merge into a new cluster, and repeat until all clusters are fused together.

Top-down (divisive): start with all the data in a single cluster, consider every possible way to divide the cluster into two, choose the best division, and recurse on both sides.

Dissimilarity between samples or observations. Any dissimilarity we have seen before can be used:
- Euclidean
- Manhattan
- simple matching coefficient
- Jaccard dissimilarity
- Gower's dissimilarity
- etc.
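
A hedged sketch of how such dissimilarities are obtained in R: dist() is base R, and Gower's dissimilarity is assumed here to come from daisy() in the cluster package (which ships with R); the small data frames are made up for illustration.

## Sketch: computing different dissimilarities in R (illustrative data).
library(cluster)                    # for daisy() and Gower's dissimilarity

num <- data.frame(x1 = c(1, 3, 6), x2 = c(2, 0, 4))
dist(num, method = "euclidean")     # Euclidean distances
dist(num, method = "manhattan")     # Manhattan (city-block) distances

bin <- matrix(c(1, 0, 1, 1,
                0, 0, 1, 1), nrow = 2, byrow = TRUE)
dist(bin, method = "binary")        # Jaccard dissimilarity on 0/1 data

mixed <- data.frame(height = c(170, 182, 165),
                    smoker = factor(c("yes", "no", "no")))
daisy(mixed, metric = "gower")      # Gower's dissimilarity for mixed data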

Agglomerative Hierarchical Clustering. [figure: data points in feature space (Feature 1 vs. Feature 2) and the resulting dendrogram] Problem: we need a generalization of the distance between individual objects to a distance between groups of objects. What is the distance between two such groups? This is defined by the linkage.

Dissimilarity between clusters: linkages.
Single link (min): the smallest distance between point pairs linking both clusters.
Complete link (max): the largest distance between point pairs linking both clusters.
Average: the average distance between point pairs linking both clusters.
Ward's: in this method we try to minimize the variance within the merged clusters.

How to read a dendrogram. The position of a join node on the distance scale indicates the distance between the clusters being merged (this distance depends on the linkage method). For example, if two clusters are merged at a height of 22, the distance between those clusters was 22 (in R: clust$height). When reading a dendrogram, you want to determine at what stage the distance between the clusters that are combined becomes large, i.e. you look for large gaps between successive join nodes (long vertical lines).
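
A small sketch of how to inspect these join heights in R and cut the tree where the jump between successive heights is large (the data set and cut height are illustrative):

## Sketch: inspecting join heights of a dendrogram and cutting the tree.
d     <- dist(USArrests)            # built-in example data
clust <- hclust(d, method = "complete")

clust$height                        # distances at which clusters are merged
diff(clust$height)                  # large jumps suggest a good place to cut

plot(clust)
abline(h = 150, lty = 2)            # visualise a candidate cut height
cutree(clust, h = 150)              # cluster labels for that cut
# cutree(clust, k = 4) would instead ask directly for 4 clusters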

Simple example: draw the dendrogram for the one-dimensional data matrix with the values 1, 3, 6 and 6.5. Use Euclidean distances and single linkage.

Simple example in R:

# one-dimensional data matrix
x = c(1, 3, 6, 6.5)
names(x) = c('1', '3', '6', '6.5')

d = dist(x)                              # Euclidean distances between the points
cluster = hclust(d, method = 'single')   # single-linkage hierarchical clustering

plot(cluster, hang = -10, axes = FALSE)  # dendrogram with all leaves at the same level
axis(2)                                  # add the distance axis

Compare linkage methods. Single linkage produces long and skinny clusters. Ward's often produces very clearly separated clusters. Average linkage yields more rounded clusters. In general, clustering is an exploratory tool: use the linkage which produces the most useful results. [figure: dendrograms of the same data under average linkage, single linkage, and Ward's linkage]

Cluster results depend on the data structure, the distance, and the linkage method. Data: we simulated two 2D Gaussian clusters of very different sizes. [figure: clusterings obtained with single, complete, and average linkage] Ward's tends to produce clusters of equal sizes.
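
A sketch reproducing this kind of experiment (simulated data with illustrative parameters, not the exact values used on the slide):

## Sketch: two 2D Gaussian clusters of very different sizes, clustered
## with several linkage methods (illustrative parameters).
set.seed(7)
big   <- matrix(rnorm(200 * 2, mean = 0), ncol = 2)
small <- matrix(rnorm(10  * 2, mean = 4), ncol = 2)
X     <- rbind(big, small)
d     <- dist(X)

par(mfrow = c(2, 2))
for (m in c("single", "complete", "average", "ward.D2")) {
  plot(hclust(d, method = m), labels = FALSE, main = m)
}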

Heatmaps [see code]
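
The slide refers to accompanying code; as a minimal hedged sketch, base R's heatmap() clusters rows and columns hierarchically and displays the reordered data matrix (the data set and linkage choice below are illustrative):

## Sketch: a clustered heatmap with base R.
X <- as.matrix(mtcars)                  # built-in numeric data set
heatmap(X, scale = "column",            # scale each variable before colouring
        hclustfun = function(d) hclust(d, method = "average"))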

What is a classification task? Classification is a prediction method. Idea: train a classifier on training data (examples with known class labels) and use the classifier to classify new test observations with unknown class labels. [figure: pictures of dromedaries and camels, plus a new animal whose class is unknown] Which features should we use to describe an observation (an animal)?

Feature extraction

ID of animal | Class label | Number of legs | Number of bumps | Length of legs [cm]
1 | Dromedary | 4 | 1 | 98
2 | Camel | 3 | 2 | 87
... | ... | ... | ... | ...
150 | Camel | 4 | 2 | 103

Defining appropriate features is essential for the success of the classification task! It is not always as simple as in this example: features can be combined into new features or selected.

Data matrix

Y (class labels): Dromedary, Camel, Camel, ... Labels are categorical; with continuous labels we would be doing regression instead.

X (data matrix with several features, which can also contain categorical values):

Number of legs | Number of bumps | Length of legs [cm]
4 | 1 | 98
3 | 2 | 87
4 | 2 | 103

One row is called a feature vector. In classification, a supervised-learning task, we try to predict the class labels Y from the features X.

Principal idea of classification. Training data (class labels are known):

id | Type | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
1 | setosa | 5.1 | 3.5 | 1.4 | 0.2
2 | setosa | 4.9 | 3.0 | 1.4 | 0.2
3 | virginica | 3.3 | 3.2 | 1.6 | 0.5
4 | setosa | 5.1 | 3.5 | 1.4 | 0.2
... | ... | ... | ... | ... | ...
150 | virginica | 4.9 | 3.0 | 1.4 | 0.2

Learn a classifier from the training data (for example a neural network or a decision tree), then use it to predict the type of unknown data / test data:

id | Type | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
1 | ? | 3.1 | 3.5 | 1.4 | 0.2
2 | ? | 4.9 | 3.0 | 1.4 | 0.2
3 | ? | 3.3 | 3.2 | 1.6 | 0.5
4 | ? | 5.1 | 3.5 | 1.4 | 0.2

Note: to evaluate the performance, part of the labelled data is not used to train the classifier but left aside to check how well the classifier performs on new data.

Examples of classification tasks. Sentiment analysis: is a given text, e.g. a tweet about a product, positive, negative or neutral? ("The movie XXX is actually neither that funny, nor super witty" → negative.) Churn in marketing: predict which customers want to quit and offer them a discount. Face detection: image (array of pixels) → John.

K-Nearest-Neighbors (kNN) in a nutshell. Idea of kNN classification:
- Start with an observation x0 with unknown class label.
- Find the k training observations that have the smallest distance to x0.
- Use the majority class among these k neighbours as the class label for x0.
R function to know: knn from the package class.
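
A hedged sketch using knn() from the class package on the built-in iris data (the train/test split and variable names are illustrative, not from the lecture):

## Sketch: k-nearest-neighbour classification with class::knn (iris data).
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)            # illustrative train/test split
train <- iris[ idx, 1:4]
test  <- iris[-idx, 1:4]

pred <- knn(train = train, test = test,
            cl = iris$Species[idx], k = 3)  # majority vote among 3 neighbours

table(predicted = pred, actual = iris$Species[-idx])
mean(pred == iris$Species[-idx])            # accuracy on the held-out data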

kNN classifier with 3 class labels after training with k = 1. [figure: left, the data with their true class labels; right, the trained classifier (kNN with k = 1), both plotted over x1 and x2]

The effect of K

The effect of K. Which k should we use? Let's quantify the error / accuracy.

Accuracy as a performance measure. Evaluate the prediction accuracy on data using a confusion matrix:

                            ACTUAL CLASS
                            Class=Yes   Class=No
PREDICTED CLASS  Class=Yes  a (TP)      c (FP)
                 Class=No   b (FN)      d (TN)

a: TP (true positives), b: FN (false negatives), c: FP (false positives), d: TN (true negatives). For an ideal classifier the off-diagonal entries are zero (b = 0, c = 0), i.e. Accuracy = 1.

$\text{Accuracy} = \frac{a+d}{a+b+c+d} = \frac{TP+TN}{TP+TN+FP+FN}$

Simply count the number of correct predictions divided by the total number of predictions. (Source: Tan, Steinbach, Kumar)
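
A small sketch in R (the predicted and actual labels are made up, purely for illustration): the confusion matrix is a cross-table of predictions against true labels, and the accuracy is the sum of its diagonal divided by the total count.

## Sketch: confusion matrix and accuracy for a two-class problem.
actual    <- factor(c("Yes","Yes","No","No","Yes","No","No","Yes","No","No"))
predicted <- factor(c("Yes","No", "No","No","Yes","Yes","No","Yes","No","No"))

cm <- table(Predicted = predicted, Actual = actual)
cm                                  # counts of TP, FP, FN, TN
sum(diag(cm)) / sum(cm)             # Accuracy = (TP + TN) / total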

Types of errors. Training error or in-sample error: the error on the data that have been used to train the model. Test error or out-of-sample error: the error on previously unseen records (out of sample). Overfitting phenomenon: the model fits the training data well (small training error) but shows a high generalization error.

Perfect vs. simple classifier. [figure: the same training data with the decision boundary of a simple classifier and of a perfect classifier, plotted over x1 and x2] Which is better? Check on a test set (don't use all your labelled data for training).

Cross validation of the simple classifier. Training set: 6/29 ≈ 20% misclassification. Test set: 2/25 = 8% misclassification. [figure: train/test split with the decision boundary of the simple classifier, plotted over x1 and x2]

Cross validation of the perfect classifier. Training set: 0% misclassification. Test set: 8/25 = 32% misclassification. [figure: train/test split with the decision boundary of the perfect classifier, plotted over x1 and x2]

Cross validation of the perfect classifier (continued). [figure: another train/test split with the decision boundary of the perfect classifier, plotted over x1 and x2]

Which one to use?

What is the right level of complexity? [figure: error as a function of 1/k]
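
A hedged sketch of how one might quantify this for kNN: compute the training and test error for a range of k on the iris data and plot them against 1/k (the split and all names are illustrative, not from the lecture):

## Sketch: training vs. test error of kNN as a function of k (iris data).
library(class)

set.seed(2)
idx   <- sample(nrow(iris), 100)
train <- iris[ idx, 1:4];  y_tr <- iris$Species[idx]
test  <- iris[-idx, 1:4];  y_te <- iris$Species[-idx]

ks        <- 1:25
train_err <- sapply(ks, function(k) mean(knn(train, train, cl = y_tr, k = k) != y_tr))
test_err  <- sapply(ks, function(k) mean(knn(train, test,  cl = y_tr, k = k) != y_te))

# plot against 1/k: model complexity increases to the right
plot(1 / ks, train_err, type = "b", ylim = range(c(train_err, test_err)),
     xlab = "1/k", ylab = "misclassification error")
lines(1 / ks, test_err, type = "b", col = "red")
legend("topleft", legend = c("training error", "test error"),
       col = c("black", "red"), lty = 1)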

What is the right level of complexity? Example from ISLR.

Occam's razor. A more complex model always performs better on the training data than a simpler model. Models should therefore be evaluated on test data to determine the generalization error. If comparing performance on training data, the model complexity should be taken into account (penalize for complexity). Given two models with similar generalization errors, one should prefer the simpler model over the more complex one. "Make everything as simple as possible, but not simpler." (A. Einstein)