Statistics 202: Statistical Aspects of Data Mining
- Toby Little
- 5 years ago
1 Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 11 = Chapter 8
Agenda:
1) Reminder about final exam
2) Finish Chapter 5
3) Chapter 8
2 Class Project
The class project is due on August 15th at 11:59 PM. If you turn it in early, I will try to grade it within the next 48 hours so you have an idea of whether you should take the final. Please submit your relevance predictions on the test set as well.
3 Introduction to Data Mining by Tan, Steinbach, Kumar Chapter 5: Classification: Alternative Techniques
4 Naive Bayes Classifier (Section 5.3.3, page 231)
The naive Bayes classifier assumes that the x attributes are conditionally independent given the class attribute y.
Thus,
$$P(Y \mid X) = \frac{P(Y)\,P(X \mid Y)}{P(X)} = \frac{P(Y)\,P(X_1 \mid Y) \cdots P(X_d \mid Y)}{P(X)}$$
Then for any x you choose the class y that gives you the largest numerator.
You estimate the $P(X_i \mid Y)$ values based on the data (see next slide).
5 How to Estimate the $P(X_i \mid Y)$
For categorical x's, just use counts (although some people modify this to fix problems with zero or small counts, see page 236).
For continuous x's, fit some distribution function. The normal distribution using the observed sample mean and observed sample standard deviation is popular.
The normal probability density function is given by
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^2}$$
where μ is the mean and σ is the standard deviation.
6 Example of the Naive Bayes Classifier
For this data, use naive Bayes to classify an observation with X = (Refund = No, Married, Income = 120K).

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
7 Example of the Naive Bayes Classifier
For this data, use naive Bayes to classify an observation with X = (Refund = No, Married, Income = 120K).

P(refund=yes | yes) = 0/3     P(refund=yes | no) = 3/7
P(refund=no | yes) = 3/3      P(refund=no | no) = 4/7
P(ms=single | yes) = 2/3      P(ms=single | no) = 2/7
P(ms=divorced | yes) = 1/3    P(ms=divorced | no) = 1/7
P(ms=married | yes) = 0/3     P(ms=married | no) = 4/7

Given yes, income has mean = 90 and sd = 5, so P(120 | yes) = dnorm(120,90,5) = 1.2*10^-9.
Given no, income has mean = 110 and sd = 54.5, so P(120 | no) = dnorm(120,110,54.5) = .0072.

P(yes | X) = 3/10 * 1/P(X) * 3/3 * 0/3 * 1.2*10^-9 < P(no | X) = 7/10 * 1/P(X) * 4/7 * 4/7 * .0072
So we classify this X as NO.
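The calculation on this slide can be reproduced in a few lines of R. The following is only a sketch: the names `train` and `nb_score` are made up for illustration, the data frame is typed in from the table in the example, and the score is the numerator P(Y) * P(X_1 | Y) * ... * P(X_d | Y), since P(X) is the same for both classes.

```r
# Training data typed in from the slide (income in thousands)
train <- data.frame(
  refund  = c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"),
  marital = c("Single", "Married", "Single", "Married", "Divorced",
              "Married", "Divorced", "Single", "Married", "Single"),
  income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  evade   = c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: numerator of P(Y | X) for one class.
# Categorical attributes use counts; income uses a normal density
# with the sample mean and sample standard deviation of that class.
nb_score <- function(cls, refund, marital, income) {
  d <- train[train$evade == cls, ]
  prior <- nrow(d) / nrow(train)
  prior *
    mean(d$refund == refund) *
    mean(d$marital == marital) *
    dnorm(income, mean(d$income), sd(d$income))
}

score_yes <- nb_score("Yes", "No", "Married", 120)
score_no  <- nb_score("No",  "No", "Married", 120)
prediction <- if (score_yes > score_no) "Yes" else "No"
prediction
```

Because no training record with Evade = Yes is married, the score for yes is exactly zero, which is the zero-count problem mentioned on the earlier slide.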
8 Introduction to Data Mining by Tan, Steinbach, Kumar
Chapter 8: Cluster Analysis
9 What is Cluster Analysis?
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both (page 487).
It is similar to classification, only now we don't know the answer (we don't have the labels).
For this reason, clustering is often called unsupervised learning while classification is often called supervised learning (page 491, but the book says "classification" instead of "learning").
Note that there also exists semi-supervised learning, which is a combination of both and is a hot research area right now.
10 What is Cluster Analysis?
Because there is no right answer, your book characterizes clustering as an exercise in descriptive statistics rather than prediction.
Cluster analysis groups data objects based only on information found in the data that describes the objects and their similarities (page 490).
The goal is that objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups (page 490).
11 Examples of Clustering (P. 488)
- Information Retrieval: search engine documents = cluster documents about the same person
- Climate: clusters = regions of similar climate
- Psychology and Medicine: patterns in spatial or temporal distribution of a disease
- Business: segment customers into groups for marketing activities
12 Two Reasons for Clustering (P. 488)
Clustering for Understanding (see examples from previous slide)
Clustering for Utility
- Summarizing: different algorithms can run faster on a data set summarized by clustering
- Compression: storing cluster information is more efficient than storing the entire data (example: quantization)
- Finding Nearest Neighbors
13 How Many Clusters is Tricky/Subjective
[Figure panels: How many clusters? / Two Clusters / Four Clusters / Six Clusters]
17 K-Means Clustering
K-means clustering is one of the most common/popular techniques.
Each cluster is associated with a centroid (center point); this is often the mean, and it is the cluster prototype.
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified ahead of time.
18 K-Means Clustering
The most common version of k-means minimizes the sum of the squared distances of each point from its cluster center (page 500):
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(c_i, x)$$
For a given set of cluster centers, (obviously) each point should be matched to the nearest center.
For a given cluster, the best center is the mean.
The basic algorithm is to iterate over these two relationships.
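As a rough illustration of the SSE criterion, the sketch below computes it directly from a set of centers and cluster assignments and compares it with the `tot.withinss` value that `kmeans()` reports. The function name `sse` and the simulated data are assumptions made for this example only.

```r
# Hypothetical helper: sum of squared Euclidean distances from each point
# to the center of the cluster it is assigned to
sse <- function(x, centers, cluster) {
  x <- as.matrix(x)
  sum((x - centers[cluster, , drop = FALSE])^2)
}

# Simulated 2-dimensional data with two well-separated groups (made up)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(x, 2)
sse(x, fit$centers, fit$cluster)   # SSE computed from the definition
fit$tot.withinss                   # the same quantity as reported by kmeans()
```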
19 K-Means Clustering Algorithms
This is Algorithm 8.1 on page 497 of your text.
Other algorithms also exist.
In R, the function kmeans() does k-means clustering; no special package or library is needed.
20 In class exercise #41:
Use kmeans() in R with all the default values to find the k=2 solution for the 2-dimensional data in cluster.csv. Plot the data. Also plot the fitted cluster centers using a different color. Color the points according to their cluster membership.
Solution:
    x<-read.csv("cluster.csv",header=FALSE)
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    fit<-kmeans(x,2)
    points(fit$centers,pch=19,col="blue",cex=2)
    points(x,col=fit$cluster,pch=19)
24 In class exercise #42:
Use kmeans() in R with all the default values to find the k=2 solution for the first two columns of the sonar training data in sonar_train.csv. Plot these two columns. Also plot the fitted cluster centers using a different color. Color the points according to their cluster membership.
Solution:
    data<-read.csv("sonar_train.csv",header=FALSE)
    x<-data[,1:2]
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    fit<-kmeans(x,2)
    points(fit$centers,pch=19,col="blue",cex=2)
    points(x,col=fit$cluster,pch=19)
29 In class exercise #43:
Graphically compare the cluster memberships from the previous problem to the actual labels in the training data.
Solution:
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    y<-data[,61]
    points(x,col=2+2*y,pch=19)
31 In class exercise #44:
For the previous exercise, compute the misclassification error that would result if you used your clustering rule to classify the data.
Solution:
    sum(fit$cluster*2-3==y)/length(y)
(the *2-3 part just forces both vectors to use the same labels)
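One caveat worth keeping in mind: kmeans() numbers its clusters arbitrarily, so the expression above counts how often cluster*2-3 equals y under whichever numbering happened to come out. A hedged sketch of a version that does not depend on the numbering is given below; `cluster_error` is a hypothetical helper, and the vectors are toy values rather than the sonar data.

```r
# Hypothetical helper: error rate of a 2-cluster solution against labels in {-1, 1},
# taking the better of the two possible cluster-to-label assignments
cluster_error <- function(cluster, y) {
  agree <- mean(cluster * 2 - 3 == y)  # fraction matching when cluster 1 -> -1, cluster 2 -> 1
  min(agree, 1 - agree)                # error under the better of the two assignments
}

# Toy illustration with made-up values
y       <- c(-1, -1, -1, 1, 1, 1)
cluster <- c( 2,  2,  1, 1, 1, 1)
cluster_error(cluster, y)
```

Here only one of the six points disagrees with the majority pattern, so the function returns 1/6 regardless of which cluster is called 1 and which is called 2.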
33 In class exercise #45:
Repeat the previous exercise using all 60 columns.
Solution:
    x<-data[,1:60]
    fit<-kmeans(x,2)
    sum(fit$cluster*2-3==y)/length(y)
35 In class exercise #46:
Consider the one-dimensional data set given by x<-c(1,2,3,5,6,7,8) (I left out 4 on purpose). Starting with initial cluster center values of 1 and 2, carry out algorithm 8.1 by hand until convergence for k=2 clusters.
36 In class exercise #47:
Repeat the previous exercise by writing a loop in R and verify that the final answer is the same.
Solution:
    x<-c(1,2,3,5,6,7,8)
    center1<-1
    center2<-2
    for (k in 2:10){
      cluster1<-x[abs(x-center1[k-1])<=abs(x-center2[k-1])]
      cluster2<-x[abs(x-center1[k-1])>abs(x-center2[k-1])]
      center1[k]<-mean(cluster1)
      center2[k]<-mean(cluster2)
    }
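The loop in the solution runs a fixed number of passes. A possible variant that stops as soon as the centers no longer change is sketched here; the variable names are illustrative, and it assumes neither cluster ever becomes empty, which holds for this data set.

```r
x <- c(1, 2, 3, 5, 6, 7, 8)
centers <- c(1, 2)                      # initial centers from the exercise

repeat {
  # Assignment step: TRUE where a point is at least as close to the first center
  in_first <- abs(x - centers[1]) <= abs(x - centers[2])
  # Update step: each center becomes the mean of the points assigned to it
  new_centers <- c(mean(x[in_first]), mean(x[!in_first]))
  if (all(new_centers == centers)) break  # converged: centers did not move
  centers <- new_centers
}
centers
```

Tracing it by hand: the first pass puts only 1 in the first cluster (centers 1 and 31/6), the second pass gives clusters {1,2,3} and {5,6,7,8} (centers 2 and 6.5), and the third pass leaves them unchanged.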
38 In class exercise #48:
Verify that the kmeans function in R gives the same solution for the previous exercise when you use all of the default values.
Solution:
    kmeans(x,2)
40 Measuring Distance
Many of the techniques for clustering and classification rely on some notion of distance.
Section 2.4 in the book discusses different ways of measuring distance (dissimilarity).
For numeric variables, the distance you are used to is called Euclidean distance, but other methods exist.
For categorical variables, or mixtures of categorical and numeric variables, it is tricky to compute distance.
Remember, scaling is important if scales differ.
41 Euclidean Distance (P.69)
Euclidean distance is the usual method of computing distance that you are used to.
In 1 dimension it is the absolute value of the difference.
In 2 dimensions it is the Pythagorean Theorem.
In more than 2 dimensions it is just a generalization of the Pythagorean Theorem:
$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
In R, the function dist() computes distances.
42 In class exercise #49:
Compute the distance between the points c(2,2) and c(5,7) by hand and verify that the function dist in R gives the same value.
Solution:
    x1<-c(2,2)
    x2<-c(5,7)
    data<-matrix(c(x1,x2),nrow=2,byrow=TRUE)
    dist(data)
44 In class exercise #50:
Compute the distance between the points c(2,2,3) and c(5,7,10) by hand and verify that the function dist in R gives the same value.
Solution:
    x1<-c(2,2,3)
    x2<-c(5,7,10)
    data<-matrix(c(x1,x2),nrow=2,byrow=TRUE)
    dist(data)
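For the "by hand" half of these two exercises, the formula can also be typed out directly and compared with dist(). This is a small sketch; `euclid` is a made-up helper name.

```r
# Hypothetical helper: Euclidean distance written out from the formula
euclid <- function(p, q) sqrt(sum((p - q)^2))

# Exercise #49: sqrt(3^2 + 5^2) = sqrt(34)
d2 <- euclid(c(2, 2), c(5, 7))
d2_check <- as.numeric(dist(rbind(c(2, 2), c(5, 7))))

# Exercise #50: sqrt(3^2 + 5^2 + 7^2) = sqrt(83)
d3 <- euclid(c(2, 2, 3), c(5, 7, 10))
d3_check <- as.numeric(dist(rbind(c(2, 2, 3), c(5, 7, 10))))

c(d2, d2_check, d3, d3_check)
```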
46 Cosine Similarity (P.75)
This is a common measure for computing the similarity between two documents by using their word counts for every possible word.
In practice, some normalization is used to account for different forms of the same word, differing document lengths, and different word frequencies.
The cosine similarity is defined as
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
where $\cdot$ indicates vector dot product and $\lVert d \rVert$ is the norm of vector d.
This can be interpreted geometrically as the cosine of the angle between the vectors.
47 Cosine Similarity Examples
a=(2,5) b=(2,5) Cosine=1
a=(0,1) b=(0,7) Cosine=1
a=(0,1) b=(1,0) Cosine=0
a=(0,1) b=(1,1) Cosine=.71
a=(3,2,0,5,0,0,0,2,0,0) b=(1,0,0,0,0,0,0,1,0,2) Cosine=.31
a=(0,0,0,0,0,1,1,1,1,1) b=(1,1,1,1,1,1,1,1,1,1) Cosine=.71
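These values can be checked with a short function. The sketch below assumes a helper named `cosine`, which is not a base R function and is defined here from the dot-product formula.

```r
# Hypothetical helper: cosine similarity as dot product over product of norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cos_same   <- cosine(c(2, 5), c(2, 5))   # identical vectors
cos_scaled <- cosine(c(0, 1), c(0, 7))   # same direction, different length
cos_orth   <- cosine(c(0, 1), c(1, 0))   # perpendicular vectors
cos_diag   <- cosine(c(0, 1), c(1, 1))
cos_docs   <- cosine(c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0),
                     c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2))

round(c(cos_same, cos_scaled, cos_orth, cos_diag, cos_docs), 2)
```

The second example shows why cosine similarity is convenient for documents of different lengths: multiplying a vector by a positive constant does not change the result.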
DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines
DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationCSE4334/5334 DATA MINING
CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy
More information9 Classification: KNN and SVM
CSE4334/5334 Data Mining 9 Classification: KNN and SVM Chengkai Li Department of Computer Science and Engineering University of Texas at Arlington Fall 2017 (Slides courtesy of Pang-Ning Tan, Michael Steinbach
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationMIS2502: Data Analytics Clustering and Segmentation. Jing Gong
MIS2502: Data Analytics Clustering and Segmentation Jing Gong gong@temple.edu http://community.mis.temple.edu/gong What is Cluster Analysis? Grouping data so that elements in a group will be Similar (or
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationClustering & Dimensionality Reduction. 273A Intro Machine Learning
Clustering & Dimensionality Reduction 273A Intro Machine Learning What is Unsupervised Learning? In supervised learning we were given attributes & targets (e.g. class labels). In unsupervised learning
More informationMachine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves
Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationIntroduction to Clustering
Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of
More informationStatistics 202: Statistical Aspects of Data Mining
Statistics 202: Statistical Aspects of Data Mining Professor Rajan Patel Lecture 9 = More of Chapter 5 Agenda: 1) Lecture over more of Chapter 5 1 Introduction to Data Mining by Tan, Steinbach, Kumar Chapter
More informationHomework #4 Programming Assignment Due: 11:59 pm, November 4, 2018
CSCI 567, Fall 18 Haipeng Luo Homework #4 Programming Assignment Due: 11:59 pm, ovember 4, 2018 General instructions Your repository will have now a directory P4/. Please do not change the name of this
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationCIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]
CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationChapter 5: Outlier Detection
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.
More informationClassification. Instructor: Wei Ding
Classification Decision Tree Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Preliminaries Each data record is characterized by a tuple (x, y), where x is the attribute
More informationIntroduction to Computer Science
DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationData Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes
0 Data Mining: Data What is Data? Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Collection of data objects and their attributes An attribute is a property or characteristic
More informationData Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining
10 Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Data? Collection of data objects
More informationMachine Learning - Clustering. CS102 Fall 2017
Machine Learning - Fall 2017 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for
More informationCLUSTERING. JELENA JOVANOVIĆ Web:
CLUSTERING JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is clustering? Application domains K-Means clustering Understanding it through an example The K-Means algorithm
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationPart I. Instructor: Wei Ding
Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set
More informationLecture-17: Clustering with K-Means (Contd: DT + Random Forest)
Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More information10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors
Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple
More information数据挖掘 Introduction to Data Mining
数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,
More informationClassification and Regression
Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining
More informationLecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017
Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last
More informationCSE 158. Web Mining and Recommender Systems. Midterm recap
CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158
More informationLecture Notes for Chapter 5
Classifcation - Alternative Techniques Lecture Notes for Chapter 5 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Rule-Based Classifier
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationIBL and clustering. Relationship of IBL with CBR
IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationClustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford
Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationCSE 494/598 Lecture-11: Clustering & Classification
CSE 494/598 Lecture-11: Clustering & Classification LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **With permission, content adapted from last year s slides and from Intro to DM dmbook@cs.umn.edu
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationCluster Analysis: Basic Concepts and Algorithms
7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationUnsupervised Learning I: K-Means Clustering
Unsupervised Learning I: K-Means Clustering Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552 (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationK-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.
K-means clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 K-means Outline K-means, K-medoids Choosing the number of clusters: Gap test, silhouette plot. Mixture
More information9.1. K-means Clustering
424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific
More informationMixture Models and the EM Algorithm
Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationCS7267 MACHINE LEARNING NEAREST NEIGHBOR ALGORITHM. Mingon Kang, PhD Computer Science, Kennesaw State University
CS7267 MACHINE LEARNING NEAREST NEIGHBOR ALGORITHM Mingon Kang, PhD Computer Science, Kennesaw State University KNN K-Nearest Neighbors (KNN) Simple, but very powerful classification algorithm Classifies
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar