Data Mining Clustering
Jingpeng Li

Supervised Learning
F(x): true function (usually not known)
D: training sample (x, F(x)), e.g. patient records with a known outcome label:
    57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 -> 0
    78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 -> 1
    69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 -> 0
    18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -> 0
    54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 -> 1
G(x): model learned from D, used to predict the label of an unseen record:
    71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 -> ?
Goal: E[(F(x) - G(x))^2] is small (near zero) for future samples

Supervised Learning: Training and Test Data
Training dataset (input attributes, with the known output labels listed alongside):
    Input:
    57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
    78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
    69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
    18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
    84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
    89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
    49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
    40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
    Output (one label per row): 0 1 0 1 1 0 1 0 0 1 0
Test dataset (output to be predicted):
    71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 -> ?

Un-Supervised Learning
Data set: the same eleven records as above, but with no output column at all.

Supervised vs. Un-Supervised Learning
Supervised:
    y = f(x): true function
    D: labeled training set, D = {x_i, F(x_i)}
    Learn G(x): model trained to predict labels of new cases
    Goal: E[(F(x) - G(x))^2] close to 0
    Well-defined criterion: mean squared error
Un-supervised:
    y = ?: no true function
    D: unlabeled data set, D = {x_i}
    Learn: ?   Goal: ?   Well-defined criterion: ?

Clustering (Unsupervised Learning)
What we have:
    o Data set D
    o Similarity/distance metric
What we need to do:
    o Find a partitioning of the data, i.e. groups of similar/close items
In other words: find the natural grouping of instances given unlabeled data.

Similarity
Groups of similar customers:
    Similar demographics
    Similar buying behavior
    Similar health
Similar products:
    Similar cost
    Similar function
    Similar store
Similarity is usually domain/problem specific.

Similarity: Distance Functions
Numeric data:
    Euclidean distance
    Manhattan distance
Categorical data (0/1 indicating presence/absence):
    Hamming distance (number of attributes that differ)
    Jaccard coefficient (fraction of shared 1s among the attributes where either record has a 1)
Combined numeric and categorical data:
    Weighted normalized distance

Manhattan & Euclidean Distance
Consider two records x = (x_1, ..., x_d) and y = (y_1, ..., y_d). The distance of order p is:
    d(x, y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^(1/p)
Special cases:
    p = 1: Manhattan distance
        d(x, y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
    p = 2: Euclidean distance
        d(x, y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_d - y_d)^2)
    p = infinity: Chebyshev distance
        d(x, y) = max(|x_1 - y_1|, |x_2 - y_2|, ..., |x_d - y_d|)

Manhattan & Euclidean Distances: Example
For x = (5, 5) and y = (9, 8):
    Euclidean: dist(x, y) = sqrt(4^2 + 3^2) = 5
    Manhattan: dist(x, y) = 4 + 3 = 7
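To make these concrete, here is a minimal Python sketch of the distance measures above. It is not part of the original lecture, and the example vectors are arbitrary illustrations.

    import math

    def euclidean(x, y):
        # p = 2: square root of the sum of squared coordinate differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):
        # p = 1: sum of absolute coordinate differences
        return sum(abs(a - b) for a, b in zip(x, y))

    def chebyshev(x, y):
        # p = infinity: largest single coordinate difference
        return max(abs(a - b) for a, b in zip(x, y))

    def hamming(x, y):
        # categorical 0/1 data: number of positions that differ
        return sum(1 for a, b in zip(x, y) if a != b)

    def jaccard(x, y):
        # categorical 0/1 data: shared 1s divided by positions where either record has a 1
        both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
        return both / either if either else 1.0

    print(euclidean((5, 5), (9, 8)))               # 5.0, as in the worked example
    print(manhattan((5, 5), (9, 8)))               # 7
    print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))     # 2
    print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))     # 0.5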

Mean Clustering
We will look at an approach to clustering numeric data based on picking a number of mean values, one for each cluster.
You hopefully know that the mean (average) of a data set of size S is:
    mean = (1/S) * (sum of all the values in the data set)

More Than One Mean
What if we suspect that our data set is actually a number of data sets mixed together?
    Each one has a mean value of its own
    But we don't know which data point belongs to which set
Clustering algorithms separate out the data and calculate the means.

Mean Clustering Target
Imagine we think there are 5 clusters in our data.
    We want to calculate 5 means: m_1, m_2, m_3, m_4, m_5
    And assign each data point x_i to one mean only
    That would lead to 5 data sets, S_1 ... S_5

Aim
The target is to minimise the total squared distance between the data points and the means to which they are assigned:
    arg min over the partition S = {S_1, ..., S_k} of  sum_{i=1..k} sum_{x_j in S_i} ||x_j - m_i||^2
That is, for each cluster we add up the squared distances of its points from its mean, and we want the assignment that makes this total as small as possible.

K-Means Clustering Algorithm
Goal: minimize the sum of squared distances from all data points to their cluster means.
Algorithm:
    o Pick K different points from the data and assume they are the cluster centers
      (chosen at random, the first K points, or K well-separated points)
    o Repeat until stabilization:
        Assign each point to the closest cluster center
        Generate new cluster centers (the mean of each cluster)

K-Means Example
Imagine a machine that works in two distinct states, e.g. fast and slow.
The mean temperature might be 50 for the slow speed and 80 for the fast speed.
[Scatter plot of temperature (0 to 100) against time (0 to 200), with readings concentrated in two bands around 50 and 80.]
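A minimal sketch of the k-means loop just described, for one-dimensional numeric data such as the temperature readings in the example. This is my own illustration, not the lecture's implementation, and the readings list is made up.

    import random

    def kmeans(points, k, iterations=100):
        # pick K different points from the data as the initial cluster centres
        centres = random.sample(points, k)
        for _ in range(iterations):
            # assign each point to the closest cluster centre
            clusters = [[] for _ in range(k)]
            for x in points:
                nearest = min(range(k), key=lambda i: abs(x - centres[i]))
                clusters[nearest].append(x)
            # generate new cluster centres: the mean of each cluster
            new_centres = [sum(c) / len(c) if c else centres[i]
                           for i, c in enumerate(clusters)]
            if new_centres == centres:   # stabilization: centres stopped moving
                break
            centres = new_centres
        return centres, clusters

    # toy 1-D "temperature" data around two states at roughly 50 and 80
    readings = [48, 52, 51, 49, 79, 81, 78, 82, 50, 80]
    centres, clusters = kmeans(readings, k=2)
    print(centres)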

K-Means Example (continued)
The machine might have a number of distinct states, all with differing acceptable ranges of temperature, pressure, etc.
We don't know what these different states are, nor how many there are of them.
A clustering algorithm will find them; K-means does so by finding the middle point of each.

K-Means Disadvantages
    o It only gives the mean of each cluster, which tells you nothing about its shape. You must assume the clusters are round, but they rarely are.
    o You need to know K before you start.
    o The distance measure, in its simple form, assumes that all attribute ranges are equally important.

Hierarchical Clustering Algorithm
Algorithm:
    o Start with the same number of clusters as you have data points: every point is a cluster of its own
    o Find the two clusters that are closest together and join them into one
    o Calculate their new centre
    o Repeat until you have the desired number of clusters, or everything is in one cluster
[Dendrogram: points a, b, c, d, e, f merging as bc, de, def, bcdef, abcdef.]

Hierarchical Clustering / Minimum Spanning Tree
Looks for clusters within clusters:
    Cluster 1 (the root) is the whole data set
    That splits into a small number of subsets
    Each subset splits into 0 or more subsets, and so on
[Figure: the clusters and the corresponding dendrogram.]
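A minimal bottom-up sketch of the procedure above, again for one-dimensional data. Treating "closest" as the distance between cluster centres is my assumption; the slides do not fix a particular linkage.

    def centre(cluster):
        # a cluster's centre is the mean of its points
        return sum(cluster) / len(cluster)

    def hierarchical(points, target_clusters=1):
        # start with every point as a cluster of its own
        clusters = [[x] for x in points]
        while len(clusters) > target_clusters:
            # find the two clusters whose centres are closest together
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = abs(centre(clusters[i]) - centre(clusters[j]))
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            # join them into one cluster; its centre is recomputed on the next pass
            merged = clusters[i] + clusters[j]
            clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        return clusters

    print(hierarchical([1, 2, 9, 10, 11, 30], target_clusters=3))
    # [[30], [1, 2], [9, 10, 11]]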

Qualities of a Cluster
The cluster hierarchy may store other data about its clusters:
    o Population size: how many data points are in that cluster?
    o Variance and range: how far from the centre does most of the data lie?
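One way to compute those per-cluster summaries for the clusters produced above, using only the Python standard library; this is my own sketch, not from the slides.

    import statistics

    def cluster_summary(cluster):
        # population size, variance, and range of one cluster of numeric points
        return {
            "size": len(cluster),
            "variance": statistics.pvariance(cluster),
            "range": (min(cluster), max(cluster)),
        }

    for c in [[1, 2], [9, 10, 11], [30]]:
        print(cluster_summary(c))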

Association Rules

Market Basket Analysis
Consider a shopping cart filled with several items. Market basket analysis tries to answer questions such as:
    What do customers buy together?
        e.g. 80% of customers purchase items X, Y and Z together (i.e. {X, Y, Z})
    In what pattern do customers purchase items?
        e.g. 60% of customers who purchase X and Y also buy Z (i.e. {X, Y} -> {Z})

Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items, produce dependency rules that predict the occurrence of some items based on the occurrences of other items.
Market-basket transactions:
    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke
Example association rules:
    {Diaper} -> {Beer}
    {Beer, Bread} -> {Milk}
    {Milk, Bread} -> {Eggs, Coke}

Itemsets and Support
Itemset:
    o A collection of one or more items, e.g. {Bread, Milk, Diaper}
    o k-itemset: an itemset that contains k items
Support count (sigma):
    o Frequency of occurrence of an itemset
    o E.g. sigma({Bread, Milk, Diaper}) = 2
Support (s):
    o Fraction of transactions that contain an itemset
    o E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent itemset:
    o An itemset whose support is greater than or equal to a minsup threshold

Rule Evaluation Metrics
An association rule is expressed as X -> Y, where X and Y are itemsets. Example: {Milk, Diaper} -> {Beer}
Support (s) = fraction of transactions that contain both X and Y
    o 40% of customers buy milk, diaper and beer
Confidence (c) = fraction of the transactions containing X that also contain Y
    o If someone buys milk and diaper, they will buy beer 67% of the time
For the transactions in the table above:
    s = sigma(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
    c = sigma(Milk, Diaper, Beer) / sigma(Milk, Diaper) = 2/3 = 0.67
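The worked numbers above can be reproduced with a few lines of Python over the five example transactions; this is a sketch of my own, not code from the lecture.

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset):
        # sigma: number of transactions containing every item in the itemset
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset):
        # fraction of all transactions containing the itemset
        return support_count(itemset) / len(transactions)

    def confidence(lhs, rhs):
        # of the transactions containing lhs, the fraction that also contain rhs
        return support_count(lhs | rhs) / support_count(lhs)

    print(support({"Milk", "Diaper", "Beer"}))        # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}))   # 0.666...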

Direction
The direction is important: X -> Y is not the same as Y -> X.
For example, 80% of people who buy a torch buy batteries, but only 5% of people who buy batteries buy a torch.

Mining Association Rules
Rules derived from the itemset {Milk, Diaper, Beer} (using the transactions above):
    {Milk, Diaper} -> {Beer}   (s = 0.4, c = 0.67)
    {Milk, Beer} -> {Diaper}   (s = 0.4, c = 1.0)
    {Diaper, Beer} -> {Milk}   (s = 0.4, c = 0.67)
    {Beer} -> {Milk, Diaper}   (s = 0.4, c = 0.67)
    {Diaper} -> {Milk, Beer}   (s = 0.4, c = 0.5)
    {Milk} -> {Diaper, Beer}   (s = 0.4, c = 0.5)
Observations:
    o Computationally expensive: all the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
    o Rules originating from the same itemset have identical support but can have different confidence

Finding the Rules
The Apriori algorithm works as follows:
    1. Find all the acceptable itemsets (support)
    2. Use them to generate acceptable rules (confidence)
So, we find all the itemsets with more than our chosen support and combine them into every possible rule, keeping those with an acceptable confidence.

Step 1: Generate Itemsets
    1. Find all the acceptable itemsets of size 1
    2. Use the items from step 1 to generate all itemsets of size two and count their support; keep those that are supported
    3. Repeat for increasingly large itemsets until none of the current size are supported
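A level-wise sketch of step 1, reusing the transactions list and support() helper from the sketch above. For brevity it builds each level's candidates from all combinations of the items that survived the previous level, rather than the full Apriori join-and-prune step, so it illustrates the idea only.

    from itertools import combinations

    def frequent_itemsets(minsup):
        # level 1: frequent individual items
        current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})
                   if support(frozenset([i])) >= minsup]
        frequent, size = [], 1
        while current:
            frequent.extend(current)
            size += 1
            # next level: candidates one item larger, built from items still present
            alive = sorted({i for s in current for i in s})
            current = [frozenset(c) for c in combinations(alive, size)
                       if support(frozenset(c)) >= minsup]
        return frequent

    for s in frequent_itemsets(minsup=0.4):
        print(sorted(s))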

Example
With a minimum support of 20%:
    Bread = 40%: keep
    Milk = 60%: keep
    Porcini = 2%: discard
    {Bread, Milk} = 30%: keep
    {Bread, Milk, Sardines} = 15%: discard
These are NOT rules yet, just itemsets.

Step 2: Generate Rules
Generate every combination X -> Y from the acceptable itemsets, where X and Y have an empty intersection.
That is, where nothing in X appears in Y, and vice-versa.

Example
    {Bread} -> {Milk} is good
    {Bread, Milk} -> {Coffee} is good
    {Bread} -> {Bread, Milk} is not allowed

Finally
Discard all the rules that have a confidence score lower than some predefined target.
Remember, confidence is the percentage of the baskets containing the left-hand side of the rule that also contain the right-hand side.
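A sketch of step 2 and the final confidence filter under the same assumptions: every non-empty proper subset of a frequent itemset is tried as the left-hand side X, the remaining items form Y (so X and Y never overlap), and rules below the confidence target are discarded. It reuses frequent_itemsets() and support() from the previous sketches.

    from itertools import combinations

    def generate_rules(frequent, support, min_confidence):
        rules = []
        for itemset in frequent:
            if len(itemset) < 2:
                continue
            # every non-empty proper subset of the itemset becomes a left-hand side X
            for r in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(itemset, r)):
                    rhs = itemset - lhs        # X and Y share no items by construction
                    conf = support(itemset) / support(lhs)
                    if conf >= min_confidence:
                        rules.append((sorted(lhs), sorted(rhs), support(itemset), conf))
        return rules

    # using frequent_itemsets() and support() from the earlier sketches:
    for lhs, rhs, s, c in generate_rules(frequent_itemsets(0.4), support, min_confidence=0.7):
        print(lhs, "->", rhs, " s =", round(s, 2), " c =", round(c, 2))
    # keeps {Beer, Milk} -> {Diaper} (c = 1.0) but drops {Milk} -> {Beer, Diaper} (c = 0.5)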