CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008


Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

NAME: Prof. Ruiz

Problem I:   General Data Mining        (/25 points)
Problem II:  Numeric Predictions        (/40 points)
Problem III: Instance Based Learning    (/10 points)
Problem IV:  Clustering                 (/35 points)
TOTAL SCORE:                            (/100 points + 10 points of extra credit)

Instructions:
Show your work and justify your answers.
Use the space provided to write your answers.
Ask in case of doubt.

Problem I. General Data Mining [25 points]

1. [9 points] What is the difference between supervised and unsupervised discretization? Explain precisely.

In supervised discretization, the target attribute is used to help determine appropriate bins for the attribute being discretized. In unsupervised discretization, the attribute being discretized is split into bins that are determined without taking the target attribute (if any) into consideration.
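To make the contrast concrete, here is a small Python sketch (not part of the exam; the function names and the equal-width vs. class-boundary binning schemes are illustrative assumptions). Unsupervised discretization chooses cut points from the attribute values alone, while supervised discretization also consults the class labels when choosing them.

def equal_width_cut_points(values, k):
    """Unsupervised: k equal-width bins; the class labels are never consulted."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]        # k-1 cut points

def class_boundary_cut_points(values, labels):
    """Supervised (simplified): cut between consecutive values whose class
    labels differ, after sorting by the attribute value."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
            if c1 != c2]

weights = [1835, 1937, 2130, 2372, 2665, 3504, 4335, 4615]
classes = ["high", "high", "high", "high", "high", "low", "low", "low"]
print(equal_width_cut_points(weights, 3))           # uses the values only
print(class_boundary_cut_points(weights, classes))  # also uses the class labels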

2. [8 points] What is the difference between classification and regression?

In classification, the target attribute is nominal. In regression, the target attribute is numeric.

3. [8 points] What is stratified sampling?

A stratified sample is a sample that preserves the distribution of the target attribute.
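A short Python sketch of stratified sampling (illustrative only, not from the exam): each class, i.e., each stratum, is sampled separately with the same fraction, so the class distribution of the sample matches that of the full dataset.

import random
from collections import defaultdict

def stratified_sample(instances, labels, fraction, seed=0):
    """Sample the same fraction from every class to preserve the class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_class[label].append(instance)
    sample = []
    for label, group in by_class.items():
        k = round(len(group) * fraction)               # keep the class proportion
        sample.extend((inst, label) for inst in rng.sample(group, k))
    return sample

labels = ["4cyl"] * 9 + ["8cyl"] * 3                   # a 9:3 class distribution
instances = list(range(12))
print(stratified_sample(instances, labels, fraction=1/3))
# -> 3 instances labeled "4cyl" and 1 labeled "8cyl", preserving the 9:3 ratio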

Problem II. Numeric Predictions [40 points]

Consider the following dataset, which is a small subset adapted from the Auto Miles Per Gallon (MPG) dataset available at the University of California Irvine (UCI) Data Repository. Note that instances have been labeled with a number in parentheses so that you can refer to them in your solutions.

      car name (disregard)   cylinders   weight   model year   mpg
(1)   chevrolet              8           3504     70           18
(2)   chevrolet              4           2950     82           27
(3)   chevrolet              4           2395     82           34
(4)   toyota                 4           2372     70           24
(5)   toyota                 4           2155     76           28
(6)   toyota                 4           2665     82           32
(7)   volkswagen             4           1835     70           26
(8)   volkswagen             4           1937     76           29
(9)   volkswagen             4           2130     82           44
(10)  ford                   8           4615     70           10
(11)  ford                   4           2665     82           28
(12)  ford                   8           4335     77           16

The purpose of this problem is to construct a model/regression tree to predict the attribute mpg (miles per gallon) using the three predicting attributes (cylinders, weight, model year). The partial tree below is the result of applying the model/regression tree construction algorithm. The tree contains 5 leaves, marked LM1, LM2, LM3, LM4, and LM5.

For the solutions of Problem II, see the solutions of Exam 2, CS4445 B term 2006, available at:
http://www.cs.wpi.edu/~cs4445/b06/exams/solutions_exam2_cs4445_b06.html

cylinders <= 6 :
|   model year <= 79 :
|   |   model year <= 73 : LM1    This leaf contains instances: (4) and (7)
|   |   model year > 73 :  LM2    This leaf contains instances: (5) and (8)
|   model year > 79 :
|   |   weight <= ? : LM3    This leaf contains instances:
|   |   weight > ? :  LM4    This leaf contains instances:
cylinders > 6 : LM5    This leaf contains instances: (1), (10), and (12)

1. Constructing the remaining internal node

We need to determine the correct split point for the attribute weight to split the node marked with a "?" in the tree above. That is, the split point x of weight that will have the highest SDR.

1. (5 Points) Relevant data instances
List all the data instances that need to be considered when splitting the node marked with "?" above. Suggestion: list these instances sorted in increasing order by weight value. (The table below is provided for your convenience, but you might need fewer or more rows than those provided.)

car name (disregard)   cylinders   weight   model year   mpg

2. (5 Points) Candidate split points
List all the candidate split points for the attribute weight that need to be considered to find the correct value for "?" in the nodes "weight <= ?" and "weight > ?" in the tree above.
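The sketch below illustrates the procedure behind parts 1 and 2 (it is not the official solution, which the exam defers to the linked B term 2006 page): keep the instances that reach the "?" node, i.e., those with cylinders <= 6 and model year > 79, sort them by weight, and take the midpoints between consecutive distinct weight values as candidate split points. The midpoint convention is an assumption here; the course materials define the exact convention to use.

# Illustrative sketch, not the official exam solution.
# Each instance is (id, cylinders, weight, model year, mpg).
data = [
    (1, 8, 3504, 70, 18), (2, 4, 2950, 82, 27), (3, 4, 2395, 82, 34),
    (4, 4, 2372, 70, 24), (5, 4, 2155, 76, 28), (6, 4, 2665, 82, 32),
    (7, 4, 1835, 70, 26), (8, 4, 1937, 76, 29), (9, 4, 2130, 82, 44),
    (10, 8, 4615, 70, 10), (11, 4, 2665, 82, 28), (12, 8, 4335, 77, 16),
]

# Instances reaching the "?" node follow the path cylinders <= 6, model year > 79.
node = sorted((row for row in data if row[1] <= 6 and row[3] > 79),
              key=lambda row: row[2])                  # sort by weight
print([row[0] for row in node])                        # ids of the relevant instances

# Candidate split points: midpoints between consecutive distinct weight values.
weights = sorted({row[2] for row in node})
print([(w1 + w2) / 2 for w1, w2 in zip(weights, weights[1:])])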

3. (12 Points) Evaluating candidate split points
Compute the SDR (Standard Deviation Reduction) of each of the candidate split points that you listed above. For your convenience, the following standard deviations (std) are provided:

std({44, 34, 32, 28, 27}) = 6.8
std({44}) = 0                    std({34, 32, 28, 27}) = 3.3
std({44, 34}) = 7.1              std({32, 28, 27}) = 2.6
std({44, 34, 32, 28}) = 6.8      std({27}) = 0

SHOW YOUR WORK AND STATE EXPLICITLY WHAT FORMULA(S) YOU ARE USING.

4. (2 Points) Choosing the best candidate split point
According to the SDRs that you computed above, select the best split point.

5. (3 Points) Completing the tree
Replace the "?" marks in the following tree with the split point that you determined to be the correct one. ALSO, WRITE DOWN WHICH INSTANCES BELONG TO leaves LM3 and LM4.

cylinders <= 6 :
|   model year <= 79 :
|   |   model year <= 73 : LM1    This leaf contains instances: (4) and (7)
|   |   model year > 73 :  LM2    This leaf contains instances: (5) and (8)
|   model year > 79 :
|   |   weight <= ? : LM3    This leaf contains instances: ?
|   |   weight > ? :  LM4    This leaf contains instances: ?
cylinders > 6 : LM5    This leaf contains instances: (1), (10), and (12)
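For part 3, a common form of the standard deviation reduction of a split of a set T into subsets T1 and T2 is SDR = std(T) - (|T1|/|T|) * std(T1) - (|T2|/|T|) * std(T2). The sketch below applies this formula to each candidate split of the "?" node; it is illustrative only, not the official solution, and it assumes the sample standard deviation, which matches the std values listed above.

# Illustrative sketch, not the official exam solution.
from statistics import stdev

def std(values):
    # Sample standard deviation; the std of a single value is taken as 0,
    # matching the values provided in the exam (e.g., std({44}) = 0).
    return stdev(values) if len(values) > 1 else 0.0

def sdr(parent, left, right):
    # SDR = std(parent) - |left|/|parent| * std(left) - |right|/|parent| * std(right)
    n = len(parent)
    return std(parent) - len(left) / n * std(left) - len(right) / n * std(right)

# Weights and mpg values of the instances reaching the "?" node, sorted by weight.
weights = [2130, 2395, 2665, 2665, 2950]
mpg     = [  44,   34,   32,   28,   27]

cuts = sorted(set(weights))
for cut in [(w1 + w2) / 2 for w1, w2 in zip(cuts, cuts[1:])]:
    left  = [m for w, m in zip(weights, mpg) if w <= cut]
    right = [m for w, m in zip(weights, mpg) if w > cut]
    print(cut, round(sdr(mpg, left, right), 2))   # the best split has the highest SDR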

2. Constructing the leaves of the tree

1. (5 Points) Regression Tree
Assume that we will use the tree above as a regression tree. DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction). CALCULATE the precise value that the leaf marked as LM5 in the tree above will output. Show your work.

LM5 =

2. (8 Points) Model Tree
Assume that we will use the tree above as a model tree. DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction). ILLUSTRATE what the function/formula that the leaf marked as LM5 in the tree above will use to produce its output is like. (You don't have to produce the precise function; just illustrate what the function will be like.) To simplify your answer, you can disregard the nominal attribute car name.

LM5 =
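As an illustrative sketch of the two leaf types (not the official solution, which is on the linked page): a regression tree leaf outputs the mean mpg of the training instances that reach it, while a model tree leaf outputs the value of a linear function of the predicting attributes fitted to those instances, of the form w0 + w1*cylinders + w2*weight + w3*model_year.

# Illustrative sketch, not the official exam solution.
import numpy as np

# LM5 receives instances (1), (10) and (12): (cylinders, weight, model year, mpg).
leaf = [(8, 3504, 70, 18), (8, 4615, 70, 10), (8, 4335, 77, 16)]
mpg = [row[3] for row in leaf]

# Regression tree leaf: the mean mpg of the instances in the leaf.
print(sum(mpg) / len(mpg))

# Model tree leaf: a linear model fitted to the leaf's instances
# (least squares via numpy, purely to illustrate the shape of the function).
X = np.array([[1, c, w, y] for c, w, y, _ in leaf], dtype=float)
w0, w1, w2, w3 = np.linalg.lstsq(X, np.array(mpg, dtype=float), rcond=None)[0]
print(f"LM5(x) = {w0:.2f} + {w1:.2f}*cylinders + {w2:.4f}*weight + {w3:.2f}*model_year")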

Problem III. Instance Based Learning [10 points]

Consider a dataset with a nominal target attribute (i.e., a nominal CLASS). Suppose that the dataset contains n instances. Consider the IBn classifier for this dataset, i.e., the n nearest neighbors classifier (without distance weighting). That is, the classifier that will use all the n nearest neighbors of a test instance to assign a classification (without any distance weighting) to the test instance.

1. [5 Points] What will be the classification assigned to a test instance? Explain.

2. [5 Points] What other classifier that you know always outputs the same result as this n nearest neighbors classifier (without distance weighting)? Explain your answer.

For the solutions of Problem III, see the solutions of Exam 2, CS4445 D term 2004, available at:
http://web.cs.wpi.edu/~cs444x/d04/exams/solutions_exam2_cs444x_d04.html
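A small sketch of the idea behind both questions (not the official solution, which is on the linked D term 2004 page): when k equals the number of training instances and there is no distance weighting, every test instance gets the entire training set as its neighborhood, so the prediction is always the overall majority class, exactly what a majority-class (ZeroR) classifier outputs.

# Illustrative sketch: k = n nearest neighbors without distance weighting
# behaves like a majority-class (ZeroR) classifier.
from collections import Counter

def knn_predict(train, test_point, k, distance):
    neighbors = sorted(train, key=lambda inst: distance(inst[0], test_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def zero_r(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

train = [((1.0,), "a"), ((2.0,), "a"), ((3.0,), "b"), ((8.5,), "b"), ((9.0,), "a")]
euclid = lambda x, y: abs(x[0] - y[0])

for test in [(0.0,), (5.0,), (100.0,)]:
    assert knn_predict(train, test, k=len(train), distance=euclid) == zero_r(train)
print("k = n nearest neighbors agrees with ZeroR on every test point tried")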

Problem IV. Clustering [35 points]

Consider the following variation of the k-means clustering algorithm, called the k-medoids clustering algorithm. The k-medoids algorithm is identical to the k-means algorithm except that instead of using centroids (which might not be data instances in the dataset) as cluster representatives, it uses medoids (which are guaranteed to be data instances in the dataset) as cluster representatives.

Remember that the centroid of a cluster is constructed as the average over all data instances in the cluster. In contrast, the medoid of a cluster is the data instance in the cluster that minimizes the average dissimilarity (i.e., average distance) to all the other data instances in the same cluster. That is, if a cluster C contains data instances a1, ..., an, and d(ai, aj) denotes the distance between instances ai and aj (this is the same distance function used to cluster the dataset), then the medoid of the cluster C = {a1, ..., an} is the data instance ai in C for which the average distance from ai to the other data instances in the cluster is the smallest, among all ai, 1 <= i <= n. This medoid is a most centrally located point in the cluster.

1. Assume that the following 3 instances belong to the same cluster and that the Euclidean distance is used to measure the distance between two data instances over all four attributes (car name, cylinders, model year, and mpg), WITHOUT using normalization or attribute weights.

     car name    cylinders   model year   mpg
a1   chevrolet   8           70           18
a2   chevrolet   4           82           27
a3   toyota      4           70           24

a. [5 points] Calculate the centroid of this cluster of 3 instances. Show your work.

For the centroid, we need to take the average of each attribute above:

  car name:     average(chevrolet, chevrolet, toyota) = chevrolet
  cylinders:    average(8, 4, 4) = 5.33
  model year:   average(70, 82, 70) = 74
  mpg:          average(18, 27, 24) = 23
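The same centroid computation as a Python sketch (illustrative): the mean is used for the numeric attributes and the mode (most common value) for the nominal attribute car name.

# Illustrative sketch of the centroid computation above.
from statistics import mean, mode

cluster = [
    {"car name": "chevrolet", "cylinders": 8, "model year": 70, "mpg": 18},  # a1
    {"car name": "chevrolet", "cylinders": 4, "model year": 82, "mpg": 27},  # a2
    {"car name": "toyota",    "cylinders": 4, "model year": 70, "mpg": 24},  # a3
]

centroid = {
    "car name":   mode(inst["car name"] for inst in cluster),   # mode for the nominal attribute
    "cylinders":  mean(inst["cylinders"] for inst in cluster),  # mean for the numeric attributes
    "model year": mean(inst["model year"] for inst in cluster),
    "mpg":        mean(inst["mpg"] for inst in cluster),
}
print(centroid)   # car name: chevrolet, cylinders: 5.33, model year: 74, mpg: 23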

b. [15 points] Calculate the medoid of this cluster of 3 instances. Show your work.

For this, you need to calculate the following Euclidean distances (for the nominal attribute car name, two equal values contribute 0 to the squared distance and two different values contribute 1):

d(a1, a2) = sqrt(0 + (8-4)^2 + (70-82)^2 + (18-27)^2) = sqrt(241) = 15.52
d(a1, a3) = sqrt(1 + (8-4)^2 + (70-70)^2 + (18-24)^2) = sqrt(53) = 7.28
d(a2, a3) = sqrt(1 + (4-4)^2 + (82-70)^2 + (27-24)^2) = sqrt(154) = 12.4

Now calculate:

The average distance from a1 to the other two instances in the cluster:
½ (15.52 + 7.28) = ½ (22.8) = 11.4
(If you divided by 3 instead of 2, following the definition of average distance in the problem statement above, that's fine too, as it won't change the relative order of the averages.)

The average distance from a2 to the other two instances in the cluster:
½ (15.52 + 12.4) = ½ (27.92) = 13.96

The average distance from a3 to the other two instances in the cluster:
½ (7.28 + 12.4) = ½ (19.68) = 9.84

Then, which data instance is the medoid of this cluster, and why?

a3 is the medoid, as it is the instance with the lowest average distance to the other instances in the cluster.
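The same medoid computation as a Python sketch (illustrative), using the convention above that a mismatch on the nominal attribute car name contributes 1 to the squared distance and a match contributes 0:

# Illustrative sketch of the medoid computation above.
from math import sqrt

# Each instance is (car name, cylinders, model year, mpg).
cluster = {
    "a1": ("chevrolet", 8, 70, 18),
    "a2": ("chevrolet", 4, 82, 27),
    "a3": ("toyota",    4, 70, 24),
}

def dist(x, y):
    # Euclidean distance; the nominal car name contributes 0 if equal, 1 otherwise.
    nominal = 0 if x[0] == y[0] else 1
    return sqrt(nominal + sum((xi - yi) ** 2 for xi, yi in zip(x[1:], y[1:])))

for name, inst in cluster.items():
    others = [other for other_name, other in cluster.items() if other_name != name]
    avg = sum(dist(inst, other) for other in others) / len(others)
    print(name, round(avg, 2))   # a3 has the smallest average distance, so it is the medoid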

2. [5 points] In the k-means algorithm, is the centroid of a cluster always unique? Yes? No? Justify your answer.

No, the centroid might not be unique if there are nominal attributes in the dataset. The average/mode of a nominal attribute might not be unique, as there might be two or more nominal values that are tied for most common. For example, average/mode(chevrolet, chevrolet, ford, toyota, toyota) is either chevrolet or toyota, as both are equally common and the most common among the values of the attribute. In contrast, the average of a numeric attribute is unique.

3. [5 points] In the k-medoids algorithm, is the medoid of a cluster always unique? Yes? No? Justify your answer.

No, the medoid might not be unique EVEN if there are NO nominal attributes in the dataset. Note that two or more data instances in the cluster might have the same average distance to the other instances in the cluster. For example, consider a cluster with 3 instances located at the vertices of an equilateral triangle in a two-dimensional space. Then, the average distance of each data instance to the others is the same. Hence, any of the 3 instances might be the chosen medoid.

4. [5 points] The k-medoids algorithm is in general more robust to noise and outliers than the k-means algorithm. Explain why. Illustrate with an example.

Outliers in a cluster skew the centroid towards the outliers, as the centroid is the average of all the instances in the cluster. In contrast, the medoid of a cluster tends to be the most centrally located point in the cluster. As an example, consider the following cluster of 4 points, the one to the right being a clear outlier:

[Figure: four cluster points, one of them an outlier to the right; the medoid lies among the three central points, while the centroid is pulled towards the outlier.]
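A numeric illustration of the same point (a sketch with made-up two-dimensional points, not from the exam): with three tightly grouped points and one outlier, the centroid is dragged towards the outlier, while the medoid remains one of the three central points.

# Illustrative sketch: centroid vs. medoid in the presence of an outlier.
from math import dist   # Euclidean distance (Python 3.8+)

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]   # the last point is an outlier

centroid = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def avg_dist(point, points):
    return sum(dist(point, other) for other in points if other != point) / (len(points) - 1)

medoid = min(cluster, key=lambda p: avg_dist(p, cluster))

print("centroid:", centroid)   # (2.75, 2.75): pulled towards the outlier
print("medoid:  ", medoid)     # (1.0, 0.0): still one of the three central points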