Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science


Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology, Department of Computer Science. 2016.

Road map: What is Cluster Analysis? Characteristics of Clustering. Applications of Cluster Analysis. Clustering: Application Examples. Basic Steps to Develop a Clustering Task. Quality: What Is Good Clustering? k-means Clustering. The K-Means Clustering Method. What is the problem of the k-means Method? The K-Medoids Clustering Method. Department of CS- DM - UHD

What is Cluster Analysis? Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). Intra-cluster distances are minimized; inter-cluster distances are maximized.
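As a minimal sketch of these two criteria, the snippet below (the two toy 2-D clusters are made-up data, not from the slides) compares the average intra-cluster distance with the average inter-cluster distance:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical, well-separated 2-D clusters.
cluster_a = [(1, 1), (1, 2), (2, 1)]
cluster_b = [(8, 8), (8, 9), (9, 8)]

# Average pairwise distance WITHIN cluster A (intra-cluster).
intra = [euclidean(p, q) for i, p in enumerate(cluster_a)
         for q in cluster_a[i + 1:]]
# Average distance BETWEEN the two clusters (inter-cluster).
inter = [euclidean(p, q) for p in cluster_a for q in cluster_b]

print(sum(intra) / len(intra) < sum(inter) / len(inter))  # True: a good grouping
```

For a good clustering, the intra-cluster average is small relative to the inter-cluster average, as the comparison confirms here.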

Understanding data mining clustering methods: for example, you could first cluster your customer data into groups that have similar structures and then plan a different marketing operation for each group.

What is the difference between supervised and unsupervised learning? In supervised learning, output datasets are provided and used to train the machine to produce the desired outputs, whereas in unsupervised learning no output datasets are provided; instead, the data are clustered into different classes.

Characteristics of Clustering. Cluster analysis (or clustering) finds similarities between data according to the characteristics found in the data, and groups similar data objects into clusters. It is unsupervised learning: there are no predefined classes. Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.

Clustering: Application Examples. 1. Biology: families, genes, species. 2. Information retrieval: document clustering. 3. Land use: identification of areas of similar land use in an earth-observation database. 4. Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs.

Clustering: Application Examples. 5. City planning: identifying groups of houses according to house type, value, and geographical location. 6. Earthquake studies: observed earthquake epicenters should be clustered. 7. Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data. 8. Economic science: market research.

Quality: What Is Good Clustering? 1. A good clustering method will produce high-quality clusters, with high intra-class similarity and low inter-class similarity. 2. The quality of a clustering method depends on the similarity measure used by the method, its implementation, and its ability to discover some or all of the hidden patterns.

Major Clustering Approaches: 1. Partitioning approach. 2. Hierarchical approach. 3. Density-based approach. 4. Grid-based approach.

Partitioning Algorithms: Basic Concept. Partitional clustering decomposes a data set into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K partitions of the data, with each partition representing a cluster. That is, it classifies the data into K groups satisfying the following requirements: (1) each group contains at least one point, and (2) each point belongs to exactly one group.

Partitioning Algorithms: Basic Concept. Methods: the k-means and k-medoids algorithms. 1. k-means: each cluster is represented by the center of the cluster. 2. k-medoids, or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

k-means Clustering. Given the input set S and a fixed integer k, a k-clustering algorithm must return a partition of S into k subsets. K-means is the most common partitioning algorithm. At each step, K-means re-assigns each record in the dataset to exactly one of the current clusters: a record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.

K-Means Clustering Algorithm. 1. Separate the objects (data points) into K clusters. 2. Compute each cluster center (centroid) as the average of all the data points in the cluster. 3. Assign each data point to the cluster whose centroid is nearest (using a distance function). 4. Go back to Step 2; stop when the assignments no longer change.
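The four steps above can be sketched in a few lines of Python; the toy 2-D data, the choice K = 2, and the fixed random seed are illustrative assumptions, not part of the slides:

```python
import math
import random

def kmeans(points, k, seed=0):
    """Minimal k-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Step 1: pick K rows as initial centers
    while True:
        # Step 3: assign each point to the cluster whose center is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the component-wise mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # Step 4: stop when nothing changes
            return centers, clusters
        centers = new_centers

# Two obvious groups of 2-D points (made-up data).
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(data, k=2)
```

On this data the loop settles on the two visually obvious groups, with centroids (4/3, 4/3) and (25/3, 25/3).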

The K-Means Clustering Method: Example (K = 2). Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; and repeat the update/reassign cycle until nothing changes. [Figure: a sequence of scatter plots showing the points after each assignment and mean-update step.]

K-means algorithm. 1. It accepts as input the number of clusters K to group the data into, and the dataset to cluster. 2. It then creates the first K initial clusters by choosing K rows of data at random from the dataset. For example, if the dataset has N rows of data and K clusters are needed, the K initial clusters are created by selecting K records at random from the dataset. Each of the initial clusters formed has just one row of data.
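Steps 1 and 2 can be sketched as follows; the dataset rows, K = 3, and the fixed seed are illustrative assumptions, not values from the slides:

```python
import random

# Hypothetical dataset: each row is (Age, Height, Weight).
dataset = [(20, 170, 110), (30, 160, 120), (25, 175, 140),
           (40, 180, 190), (22, 165, 115)]

k = 3
rng = random.Random(42)                    # fixed seed so the example is repeatable
initial_rows = rng.sample(dataset, k)      # choose K distinct rows at random

# Each initial cluster starts with exactly one row of data.
initial_clusters = [[row] for row in initial_rows]
print(len(initial_clusters))  # 3
```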

K-means algorithm. 3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. Each of the first K initial clusters contains only one record, and the arithmetic mean of a cluster with one record is simply the set of values that make up that record. For example, suppose the dataset is a set of Height, Weight and Age measurements for students at UHD, where a record P in the dataset S is represented by a Height, Weight and Age measurement, P = {Age, Height, Weight}. Then a record containing the measurements of a student John would be represented as John = {20, 170, 110}, where John's Age = 20 years, Height = 1.70 metres and Weight = 110 pounds. Since there is only one record in each initial cluster, the arithmetic mean of the cluster whose only member is John's record = {20, 170, 110}.

K-means algorithm. 4. Next, K-Means assigns each record in the dataset to exactly one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan (city-block) distance. 5. K-Means then re-assigns each record to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset; the arithmetic mean of a cluster is the component-wise mean of all the records in that cluster. For example, if a cluster contains the two records John = {20, 170, 110} and Henry = {30, 160, 120}, then the arithmetic mean P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2 and Weight_mean = (110 + 120)/2. The arithmetic mean of this cluster = {25, 165, 115}. This new arithmetic mean becomes the center of the cluster; following the same procedure, new cluster centers are formed for all the existing clusters.
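The arithmetic in the John/Henry example can be checked directly. The snippet below computes both distance measures between the two records and the new cluster center (the record values are those from the example above):

```python
import math

# Records from the example: (Age, Height, Weight).
john = (20, 170, 110)
henry = (30, 160, 120)

# Euclidean and Manhattan (city-block) distance between the two records.
euclidean = math.dist(john, henry)
manhattan = sum(abs(a - b) for a, b in zip(john, henry))

# New cluster center: the component-wise arithmetic mean of the records.
mean = tuple(sum(v) / 2 for v in zip(john, henry))
print(mean)  # (25.0, 165.0, 115.0)
```

The Manhattan distance here is 10 + 10 + 10 = 30, and the Euclidean distance is the square root of 300.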

K-means algorithm. 6. K-Means re-assigns each record in the dataset to exactly one of the new clusters formed, again assigning each record or data point to the nearest cluster using the measure of distance or similarity. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when a new iteration of the algorithm creates no new clusters, that is, when the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed and the k-means procedure is complete.

What is the problem of the k-means method? The k-means algorithm is sensitive to noise and outliers, since an object with an extremely large value can substantially distort the distribution of the data. K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster. [Figure: two scatter plots contrasting a mean-based cluster center with a medoid.]
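A one-dimensional sketch (with made-up numbers) of why the mean is distorted by an outlier while the medoid is not:

```python
# One extreme outlier drags the mean far from the bulk of the data,
# while the medoid (the most centrally located actual point) stays put.
points = [1, 2, 3, 4, 100]          # 100 is an outlier

mean = sum(points) / len(points)    # pulled toward the outlier

# Medoid: the point with the smallest total distance to all other points.
medoid = min(points, key=lambda p: sum(abs(p - q) for q in points))
print(mean, medoid)  # 22.0 3
```

The mean lands at 22.0, nowhere near the bulk of the data, while the medoid remains at 3.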

The K-Medoids Clustering Method. Find representative objects, called medoids, in clusters. PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1987): 1. Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering. 2. PAM works effectively for small data sets, but does not scale well to large data sets.

Typical k-medoids algorithm (PAM), K = 2. Arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Then, looping until no change: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random (here total cost = 26), and swap O and O_random only if the quality is improved. [Figure: scatter plots illustrating the initial assignment and the swap evaluation.]
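A minimal sketch of the PAM idea, assuming Euclidean distance, arbitrary (first-k) initial medoids, and made-up 2-D data; real PAM implementations evaluate candidate swaps more carefully than this greedy loop:

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy PAM sketch: swap a medoid with a non-medoid while the cost improves."""
    medoids = points[:k]                     # arbitrary initial medoids
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for m, o in product(medoids, points):
            if o in medoids:
                continue                     # only medoid/non-medoid swaps
            candidate = [o if x == m else x for x in medoids]
            c = total_cost(points, candidate)
            if c < cost:                     # keep the swap only if quality improves
                medoids, cost, improved = candidate, c, True
                break
    return medoids

# Two obvious groups of 2-D points (made-up data).
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(pam(data, 2))
```

On this data the loop ends with one medoid in each of the two groups; since medoids must be actual data points, they remain meaningful even when outliers are added.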
