MIS2502: Data Analytics Clustering and Segmentation. Jing Gong

Similar documents
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Machine Learning - Clustering. CS102 Fall 2017

University of Florida CISE department Gator Engineering. Clustering Part 2

Cluster Analysis: Basic Concepts and Algorithms

Clustering Basic Concepts and Algorithms 1

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised Learning : Clustering

Cluster Analysis. Ying Shen, SSE, Tongji University

Introduction to Clustering

数据挖掘 Introduction to Data Mining

CSE 5243 INTRO. TO DATA MINING

Lecture Notes for Chapter 8. Introduction to Data Mining

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Cluster analysis. Agnieszka Nowak - Brzezinska

Statistics 202: Statistical Aspects of Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Warehousing and Machine Learning

Clustering fundamentals

Unsupervised Learning

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering CS 550: Machine Learning

Clustering Part 2. A Partitional Clustering

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

Data Preprocessing. Slides by: Shree Jaswal

Clustering: Overview and K-means algorithm

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

Unsupervised Learning I: K-Means Clustering

Data Mining. Lecture 03: Nearest Neighbor Learning

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

Gene Clustering & Classification

CS7267 MACHINE LEARNING

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

Cluster Analysis: Basic Concepts and Algorithms

Clustering: Overview and K-means algorithm

CISC 4631 Data Mining

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

CSE 5243 INTRO. TO DATA MINING

Exploratory Analysis: Clustering

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Application of K-Means Clustering Methodology to Cost Estimation

CHAPTER 4: CLUSTER ANALYSIS

Jarek Szlichta

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

ECLT 5810 Data Preprocessing. Prof. Wai Lam

University of Florida CISE department Gator Engineering. Clustering Part 5

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

CS570: Introduction to Data Mining

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Clustering Part 4 DBSCAN

CS570: Introduction to Data Mining

Data Mining: Models and Methods

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Supervised vs. Unsupervised Learning

Cluster Analysis. CSE634 Data Mining

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Clustering algorithms and autoencoders for anomaly detection

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Introduction to Computer Science

CS570: Introduction to Data Mining

Unsupervised Learning

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

Online Social Networks and Media. Community detection

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Introduction to Data Mining

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 5243 INTRO. TO DATA MINING

Preprocessing DWML, /33

University of Florida CISE department Gator Engineering. Clustering Part 4

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

statistical mapping outline

What to come. There will be a few more topics we will cover on supervised learning

ECLT 5810 Clustering

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Clustering Lecture 3: Hierarchical Methods

DBSCAN. Presented by: Garrett Poppe

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007

Information Retrieval and Web Search Engines

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

Understanding Clustering Supervising the unsupervised

DS504/CS586: Big Data Analytics Big Data Clustering II

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

Data mining, 4 cu Lecture 6:

Descriptive Statistics

Transcription:

MIS2502: Data Analytics Clustering and Segmentation Jing Gong gong@temple.edu http://community.mis.temple.edu/gong

What is Cluster Analysis? Grouping data so that elements in a group will be Similar (or related) to one another Different (or unrelated) from elements in other groups Distance within clusters is minimized Distance between clusters is maximized http://www.baseball.bornbybits.com/blog/uploaded_images/ Takashi_Saito-703616.gif

Applications Understanding data Group related documents for browsing Create groups of similar customers Discover which stocks have similar price fluctuations Summarizing data Reduce the size of large data sets Data in similar groups can be combined into a single data point

Even more examples Marketing Discover distinct customer groups for targeted promotions Insurance Finding good customers (low claim costs, reliable premium payments) Healthcare Find patients with high-risk behaviors

What cluster analysis is NOT Manual ( supervised ) classification People simply place items into categories Simple categorization by attributes Dividing students into groups by last name The clusters must come from the data, not from external specifications. Creating the buckets beforehand is categorization, but not clustering.

(Partitional) Clustering Three distinct groups emerge, but some curveballs behave more like splitters. some splitters look more like fastballs.

Clusters can be ambiguous How many clusters? 6 2 4 The difference is the threshold you set. How distinct must a cluster be to be it s own cluster? adapted from Tan, Steinbach, and Kumar. Introduction to Data Mining (2004)

K-means (partitional) Choose K clusters Select K points as initial centroids Assign all points to clusters based on distance The K-means algorithm is one method for doing partitional clustering Yes Recompute the centroid of each cluster Did the center change? No DONE!

K-Means Demonstration Here is the initial data set

K-Means Demonstration Choose K points as initial centroids

K-Means Demonstration Assign data points according to distance

K-Means Demonstration Recalculate the centroids

K-Means Demonstration And re-assign the points

K-Means Demonstration And keep doing that until you settle on a final set of clusters

Choosing the initial centroids It matters Choosing the right number Choosing the right initial location Bad choices create bad groupings They won t make sense within the context of the problem Unrelated data points will be included in the same group

Example of Poor Initialization This may work mathematically but the clusters don t make much sense.

Evaluating K-Means Clusters On the previous slides, we did it visually, but there is a mathematical test Sum-of-Squares Error (SSE) The distance to the nearest cluster center How close does each point get to the center? SSE = K i= 1 x C i dist ( m, x) This just means For each cluster ii, compute distance from a point (xx) to the cluster center (mm ii ) Square that distance (so sign isn t an issue) Add them all together 2 i

Example: Evaluating Clusters Cluster 1 Cluster 2 1 1.3 3 3.3 2 1.5 SSE 1 = 1 2 + 1.3 2 + 2 2 = 1 + 1.69 + 4 = 6.69 SSE 2 = 3 2 + 3.3 2 + 1.5 2 = 9 + 10.89 + 2.25 = 22.14 Considerations Lower individual cluster SSE = a better cluster Lower total SSE = a better set of clusters More clusters will reduce SSE Reducing SSE within a cluster increases cohesion (we want that)

Pre-processing: Getting the right centroids There s no single, best way to choose initial centroids Normalize the data Reduces dispersion and influence of outliers Adjusts for differences in scale (income in dollars versus age in years) Remove outliers altogether Also reduces dispersion that can skew the cluster centroids They don t represent the population anyway

Limitations of K-Means Clustering K-Means gives unreliable results when Clusters vary widely in size Clusters vary widely in density Clusters are not in rounded shapes The data set has a lot of outliers The clusters may never make sense. In that case, the data may just not be well-suited for clustering!

Similarity between clusters (inter-cluster) Most common: distance between centroids Also can use SSE Look at distance between cluster 1 s points and other centroids You d want to maximize SSE between clusters Cluster 1 Cluster 5 Increasing SSE across clusters increases separation (we want that)

Figuring out if our clusters are good Good means Meaningful Useful Provides insight The pitfalls Poor clusters reveal incorrect associations Poor clusters reveal inconclusive associations There might be room for improvement and we can t tell This is somewhat subjective and depends upon the expectations of the analyst

The Keys to Successful Clustering We want high cohesion within clusters (minimize differences) Low SSE, high correlation And high separation between clusters (maximize differences) High SSE, low correlation In R, cohesion is measured by within cluster sum of squares error and separation measured by between cluster sum of squares error Choose the right number of clusters Choose the right initial centroids No easy way to do this Trial-and-error, knowledge of the problem, and looking at the output