COMP33111: Tutorial and lab exercise 7


Guide answers for Part 1: Understanding clustering

1. Explain the main differences between classification and clustering.

The main differences should include: unsupervised vs. supervised learning; the number of classes being known in advance for classification; the interpretation of the outcomes; prediction vs. exploration; classification being easier to evaluate; etc.

2. An example dataset consists of five products whose sales in two regions are shown below. Cluster these products into two groups using the k-means algorithm, the Euclidean distance, and products A and E as initial cluster members.

data point   product   region 1   region 2
1            A         22         21
2            B         19         20
3            C         18         22
4            D         1          3
5            E         4          2

Consider the data as 2-dimensional vectors with attributes region 1 and region 2.

Step 1: centroids C1 = A(22, 21) and C2 = E(4, 2). Assign each product to its nearest centroid, e.g. dist(B, C1) = sqrt((19-22)^2 + (20-21)^2) = sqrt(10) = 3.16.

product     C1      C2      cluster
A(22, 21)   0       26.17   1
B(19, 20)   3.16    23.43   1
C(18, 22)   4.12    24.41   1
D(1, 3)     27.66   3.16    2
E(4, 2)     26.17   0       2

Step 2: compute the new centroids C1(19.67, 21) and C2(2.5, 2.5).

Step 1 (again): after one more iteration there is no change in cluster membership, so the two clusters are {A, B, C} and {D, E}.

3. Cluster the data from the previous example using the k-means algorithm, the Manhattan distance, and products A and E as initial cluster members.

Use the same procedure as above, e.g. dist(B, C2) = |19 - 4| + |20 - 2| = 33; the table after the first step should be:

product     C1    C2    cluster
A(22, 21)   0     37    1
B(19, 20)   4     33    1
C(18, 22)   5     34    1
D(1, 3)     39    4     2
E(4, 2)     37    0     2
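The same computation can be scripted. Below is a minimal sketch in plain Python (illustrative only, not part of the official answer): the product coordinates and the initial centroids A and E are taken from the tables above, and the dist argument switches between the Euclidean run of question 2 and the Manhattan run of question 3.

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

points = {"A": (22, 21), "B": (19, 20), "C": (18, 22), "D": (1, 3), "E": (4, 2)}

def kmeans(dist, centroids):
    while True:
        # Assignment step: each product joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(name)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1))
               for c in clusters]
        if new == centroids:  # centroids (and hence assignments) are stable
            return clusters
        centroids = new

print(kmeans(euclidean, [points["A"], points["E"]]))  # [['A', 'B', 'C'], ['D', 'E']]
print(kmeans(manhattan, [points["A"], points["E"]]))  # same two clusters

Both runs converge after the second assignment step, matching the tables above.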

4. Briefly describe the idea of agglomerative clustering. What is the difference between the single and complete linkage methods for measuring inter-cluster distances?

See slides 46 and 48-52 of Lecture 7 (Clustering).

5. Cluster the data from question 2 using agglomerative clustering with the single linkage method. The distance between points (i.e. products) should be calculated using the Euclidean distance. Compare the results.

Distances between products:

product     A       B       C       D       E
A(22, 21)   0       3.16    4.12    27.66   26.17
B(19, 20)           0       2.24    24.76   23.43
C(18, 22)                   0       25.50   24.41
D(1, 3)                             0       3.16
E(4, 2)                                     0

Step 1: the initial clusters are {A}, {B}, {C}, {D}, {E}, with the distances as above.

Step 2: the minimal distance is between {B} and {C}, so merge them into {B, C}.

Step 3: re-calculate the inter-cluster distances (single linkage = MIN):

cluster   A       B, C    D       E
A         0       3.16    27.66   26.17
B, C              0       24.76   23.43
D                         0       3.16
E                                 0

Step 2: the minimal distance is now between {A} and {B, C}, so merge them into {A, B, C}. (Note: {D} and {E} are at the same minimal distance, so they could be merged first instead.)

Step 3: re-calculate the inter-cluster distances:

cluster   A, B, C   D       E
A, B, C   0         24.76   23.43
D                   0       3.16
E                           0

Step 2: the minimal distance is between {D} and {E}, so merge them into {D, E}.

Step 3: re-calculate the inter-cluster distances:

cluster   A, B, C   D, E
A, B, C   0         23.43
D, E                0

Finally, merge the remaining two clusters into {A, B, C, D, E}. The resulting dendrogram (leaf order C, B, A, D, E) joins B and C at height 2.24, adds A at 3.16, joins D and E at 3.16, and merges {A, B, C} with {D, E} at 23.43.
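The merge sequence can also be traced with a short script. The following is a minimal sketch in plain Python (illustrative only): it uses the same five products and prints every merge with its single-linkage distance, which can be checked against the tables above.

points = {"A": (22, 21), "B": (19, 20), "C": (18, 22), "D": (1, 3), "E": (4, 2)}

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(c1, c2):
    # Single linkage: inter-cluster distance = MIN over all point pairs.
    return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

clusters = [{name} for name in points]
while len(clusters) > 1:
    # Find and merge the closest pair of clusters.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]))
    print(f"merge {sorted(clusters[i])} + {sorted(clusters[j])} "
          f"at distance {single_link(clusters[i], clusters[j]):.2f}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
               + [clusters[i] | clusters[j]]

Running it prints the merges B+C (2.24), A+{B, C} (3.16), D+E (3.16), and finally {A, B, C}+{D, E} (23.43), i.e. exactly the steps above; the tie at 3.16 is broken by the order in which pairs are scanned.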

Guide answers for Part 2: Clustering in WEKA

B1. The resulting dendrogram

[Figure: WEKA's cluster tree for the weather data. Every internal node and leaf is numbered and labelled with its instance count, e.g. leaf 2 [1], node 4 [2], ..., leaf 20 [1]; each of the 14 instances ends up in its own single-instance leaf.]

Input data:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Clustered data:

@relation weather_clustered
@attribute Instance_number numeric
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5,cluster6,cluster7,cluster8,cluster9,cluster10,cluster11,cluster12,cluster13,cluster14,cluster15,cluster16,cluster17,cluster18,cluster19,cluster20}

@data
0,sunny,85,85,FALSE,no,cluster5
1,sunny,80,90,TRUE,no,cluster7
2,overcast,83,86,FALSE,yes,cluster10
3,rainy,70,96,FALSE,yes,cluster15
4,rainy,68,80,FALSE,yes,cluster14
5,rainy,65,70,TRUE,no,cluster2
6,overcast,64,65,TRUE,yes,cluster18
7,sunny,72,95,FALSE,no,cluster6
8,sunny,69,70,FALSE,yes,cluster16
9,rainy,75,80,FALSE,yes,cluster12
10,sunny,75,70,TRUE,yes,cluster19
11,overcast,72,90,TRUE,yes,cluster20
12,overcast,81,75,FALSE,yes,cluster11
13,rainy,71,91,TRUE,no,cluster3
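For comparison, the weather data can also be clustered hierarchically outside WEKA. The sketch below (an assumption-laden illustration, not a reproduction of WEKA's Cobweb output above) requires Python with numpy and scipy installed, and assumes the input data listing has been saved locally as weather.arff; it clusters on the two numeric attributes only.

import numpy as np
from scipy.io import arff
from scipy.cluster.hierarchy import linkage

data, meta = arff.loadarff("weather.arff")  # file path is an assumption
# Use only the numeric attributes, temperature and humidity.
X = np.column_stack([data["temperature"], data["humidity"]])

# Single-linkage, Euclidean, as in Part 1. Each row of Z records one merge:
# [cluster index 1, cluster index 2, merge distance, size of the new cluster].
Z = linkage(X, method="single", metric="euclidean")
print(Z)

The printed merge table plays the same role as the dendrogram in question 5: reading it bottom-up shows which instances join first and at what distance.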