Final Exam DATA MINING I - 1DL360

Similar documents
DATA MINING II - 1DL460

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Association Pattern Mining. Lijun Zhang

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Jarek Szlichta

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

DATA MINING - 1DL105, 1DL111

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

CS 229 Midterm Review

K-Nearest Neighbour (Continued) Dr. Xiaowei Huang

Data Mining Concepts

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Principles of Algorithm Design

3. Data Preprocessing. 3.1 Introduction

2. Data Preprocessing

Oracle9i Data Mining. Data Sheet August 2002

Credit card Fraud Detection using Predictive Modeling: a Review

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Outline. Project Update Data Mining: Answers without Queries. Principles of Information and Database Management 198:336 Week 12 Apr 25 Matthew Stone

Network Traffic Measurements and Analysis

Lecture outline. Decision-tree classification

2. Discovery of Association Rules

Association Rules. Berlin Chen References:

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Part 3. Associations Rules

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Using Machine Learning to Optimize Storage Systems

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

PATTERN DISCOVERY IN TIME-ORIENTED DATA

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Clustering in Data Mining

Chapter 4: Algorithms CS 795

Chapter 4: Association analysis:

Machine Learning in Real World: C4.5

The exam is closed book, closed notes except your one-page cheat sheet.

Homework Assignment #3

Algorithms: Decision Trees

Contents. Preface to the Second Edition

10-701/15-781, Fall 2006, Final

Clustering algorithms and autoencoders for anomaly detection

Exam Advanced Data Mining Date: Time:

Comparison of FP tree and Apriori Algorithm

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Name: UW CSE 473 Midterm, Fall 2014

Data Mining Lecture 8: Decision Trees

Data Mining Clustering

Decision trees. Decision trees are useful to a large degree because of their simplicity and interpretability

TIM 50 - Business Information Systems

Final Examination CSE 100 UCSD (Practice)

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

Classification with Decision Tree Induction

14 Data Compression by Huffman Encoding

Classification and Regression

Nearest neighbor classification DSE 220

Computer Science Foundation Exam

TIM 50 - Business Information Systems

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Supervised and Unsupervised Learning (II)

Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)

COMPARISON OF K-MEAN ALGORITHM & APRIORI ALGORITHM AN ANALYSIS

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

A Program demonstrating Gini Index Classification

Chapter 5: Outlier Detection

Data mining. Classification k-nn Classifier. Piotr Paszek. (Piotr Paszek) Data mining k-nn 1 / 20

Decision Tree Learning

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Knowledge Discovery and Data Mining

DATA MINING II - 1DL460. Spring 2014

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

Chapter 4: Algorithms CS 795

Frequent Pattern Mining

Lecture 25: Review I

Data Mining Classification - Part 1 -

Chapter 2: Classification & Prediction

2. A Bernoulli distribution has the following likelihood function for a data set D: N 1 N 1 + N 0

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

DATA MINING II - 1DL460

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Data warehouses Decision support The multidimensional model OLAP queries

List of Exercises: Data Mining 1 December 12th, 2015

ECLT 5810 Evaluation of Classification Quality

Introduction to Data Mining

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

1 Document Classification [60 points]

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Section 4 General Factorial Tutorials

CPSC 211, Sections : Data Structures and Implementations, Honors Final Exam May 4, 2001

Data Mining Algorithms

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

Managing Data Resources

COMP 465: Data Mining Still More on Clustering

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier


Uppsala University
Department of Information Technology
Kjell Orsborn

Final Exam 2012-10-17
DATA MINING I - 1DL360

Date: Wednesday, October 17, 2012
Time: 08:00-13:00
Teacher on duty: Kjell Orsborn, phone 471 11 54 or 070 425 06 91

Instructions: Read through the complete exam and note any unclear directives before you start solving the questions. The following guidelines hold:

- Write readably and clearly! Answers that cannot be read can obviously not result in any points, and unclear formulations can be misunderstood.
- Assumptions outside of what is stated in the question must be explained. Any assumptions made should not alter the given question.
- Write your answer on only one side of the paper and use a new sheet for each new question, to simplify the correction process and to avoid possible misunderstandings.
- Please write your name on each page you hand in. When you are finished, please staple these pages together in an order that corresponds to the order of the questions.

NOTE! This examination contains 40 points in total and their distribution between sub-questions is clearly identifiable. Note that you will get credit only for answers that are correct. To pass, you must score at least 22. The examiner reserves the right to lower these numbers. You are allowed to use dictionaries to and from English and a calculator, but no other material.

1. Data mining: 6 pts

Discuss (briefly) whether or not each of the following activities is a data mining task.

(a) Dividing the customers of a company according to their profitability.
Answer: No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

(b) Predicting the outcomes of tossing a (fair) pair of dice.
Answer: No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and we needed to estimate the probabilities of each outcome from data, then this would be more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus we would not consider it to be data mining.

(c) Predicting the future stock price of a company using historical records.
Answer: Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modelling. We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.

(d) Monitoring the heart rate of a patient for abnormalities.
Answer: Yes. We would build a model of the normal behavior of the heart rate and raise an alarm when an unusual heart behavior occurred. This would involve the area of data mining known as anomaly detection. It could also be considered a classification problem if we had examples of both normal and abnormal heart behavior.

(e) Extracting the frequencies of a sound wave.
Answer: No. This is signal processing.

(f) Monitoring and predicting failures in a hydropower plant.
Answer: Yes. In this case,...

2. Classification: 8 pts

Consider the training examples shown in Table 1 for a binary classification problem.

Table 1:

Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  -
4         F   F   4.0  +
5         F   T   7.0  -
6         F   T   3.0  -
7         F   F   8.0  -
8         T   F   7.0  +
9         F   T   5.0  -

(a) The entropy is given by Entropy(t) = - Σ_{i=0}^{c-1} p(i|t) log2 p(i|t), where c is the number of classes and p(i|t) is the relative frequency of class i at node t. What is the entropy of this collection of training examples with respect to the positive class? (1 pt)

Answer: There are four positive examples and five negative examples, so P(+) = 4/9 and P(-) = 5/9. The entropy of the training examples is -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911.

(b) What are the information gains of a1 and a2 relative to these training examples? (3 pts)

Answer: For attribute a1, the corresponding counts are:

a1   +   -
T    3   1
F    1   4

The weighted entropy after splitting on a1 is (4/9)[-(3/4) log2(3/4) - (1/4) log2(1/4)] + (5/9)[-(1/5) log2(1/5) - (4/5) log2(4/5)] = 0.7616. Therefore, the information gain for a1 is 0.9911 - 0.7616 = 0.2294.

For attribute a2, the corresponding counts are:

a2   +   -
T    2   3
F    2   2

The weighted entropy after splitting on a2 is (5/9)[-(2/5) log2(2/5) - (3/5) log2(3/5)] + (4/9)[-(2/4) log2(2/4) - (2/4) log2(2/4)] = 0.9839. Therefore, the information gain for a2 is 0.9911 - 0.9839 = 0.0072.

(c) For a3, which is a continuous attribute, compute the information gain for every possible split. (3 pts)

Answer:

a3    Class  Split point  Entropy  Info gain
1.0   +      2.0          0.8484   0.1427
3.0   -      3.5          0.9885   0.0026
4.0   +      4.5          0.9183   0.0728
5.0   -
5.0   -      5.5          0.9839   0.0072
6.0   +      6.5          0.9728   0.0183
7.0   +
7.0   -      7.5          0.8889   0.1022

The best split for a3 occurs at split point 2.0.
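These entropy and information-gain figures can be cross-checked with a small Python sketch. This is my own illustration (the dataset encoding and the helper names entropy and info_gain are not part of the exam); it reproduces the values in parts (a)-(c).

```python
import math

# Table 1 encoded as (a1, a2, a3, target); the helpers below are illustrative.
data = [
    ('T', 'T', 1.0, '+'), ('T', 'T', 6.0, '+'), ('T', 'F', 5.0, '-'),
    ('F', 'F', 4.0, '+'), ('F', 'T', 7.0, '-'), ('F', 'T', 3.0, '-'),
    ('F', 'F', 8.0, '-'), ('T', 'F', 7.0, '+'), ('F', 'T', 5.0, '-'),
]
labels = [row[3] for row in data]

def entropy(lab):
    """Entropy(t) = -sum_i p(i|t) log2 p(i|t) over the class labels in lab."""
    probs = [lab.count(cls) / len(lab) for cls in set(lab)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(groups):
    """Information gain = parent entropy minus weighted child entropies."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

print(round(entropy(labels), 4))                      # (a) 0.9911

for idx, name in [(0, 'a1'), (1, 'a2')]:              # (b) gains for a1 and a2
    groups = [[r[3] for r in data if r[idx] == v] for v in ('T', 'F')]
    print(name, round(info_gain(groups), 4))          # a1 ~0.2294, a2 ~0.0072

for split in (2.0, 3.5, 4.5, 5.5, 6.5, 7.5):          # (c) splits on a3
    groups = [[r[3] for r in data if r[2] <= split],
              [r[3] for r in data if r[2] > split]]
    print(split, round(info_gain(groups), 4))         # best gain at split 2.0
```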

(d) What is the best split (among a1, a2, and a3) according to the information gain? (1 pt)

Answer: According to information gain, a1 produces the best split.

3. Clustering: 8 pts

(a) Explain the definition of a core point in DBSCAN. (1 pt)

(b) Explain the definition of a centroid in k-means. (1 pt)

(c) The following is a set of one-dimensional points: {1, 1, 2, 3, 5, 8, 13, 21, 33, 54}. Perform two iterations of k-means on these points using the two initial centroids 0 and 11. (2 pts)

(d) Assume that you have to explore a large data set of high dimensionality. You know nothing about the distribution of the data. In no more than one page of text, discuss the following. (4 pts)
i. How can k-means and DBSCAN be used to find the number of clusters in that data?
ii. Explain how PCA can help find the dimensions where clusters separate.
iii. Explain why PCA might neglect cluster separation in some dimensions.
iv. Can k-means or DBSCAN be applied in a way that would help you find the dimensions in which the clusters separate?

Answer:
(a) A core point in DBSCAN is a point that has at least MinPts points in its Eps-neighborhood.

(b) The centroid in k-means is the average (mean) of all points in the cluster.

(c) First iteration: compute the distance between each data point and each initial centroid. Points 1, 1, 2, 3, 5 are assigned to c1 and points 8, 13, 21, 33, 54 to c2. New c1 = 12/5 = 2.4, new c2 = 129/5 = 25.8. Second iteration: compute the distance between each data point and each updated centroid. Points 1, 1, 2, 3, 5, 8, 13 are assigned to c1 and points 21, 33, 54 to c2. New c1 = 33/7 ≈ 4.7, new c2 = 108/3 = 36.
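For reference, a minimal one-dimensional k-means sketch (my own illustration, not part of the exam) that reproduces the two iterations in answer (c):

```python
# Two iterations of 1-D k-means with initial centroids 0 and 11,
# reproducing the assignments and centroid updates given in answer (c).
points = [1, 1, 2, 3, 5, 8, 13, 21, 33, 54]
centroids = [0.0, 11.0]

for iteration in (1, 2):
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: each centroid becomes the mean of its assigned points.
    centroids = [sum(c) / len(c) for c in clusters]
    print(iteration, clusters, centroids)

# Iteration 1: c1 = {1, 1, 2, 3, 5} -> 2.4,   c2 = {8, 13, 21, 33, 54} -> 25.8
# Iteration 2: c1 = {1, ..., 13}    -> ~4.71, c2 = {21, 33, 54}        -> 36.0
```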

(d) I would like to see at least the following:
i. Try k-means with increasing k: for which k does the MSE start to converge? Try DBSCAN with different parameter settings.
ii. PCA finds the dimensions with high variance; we will find the clusters that separate along those dimensions.
iii. If clusters separate in a dimension with low variance, PCA will overlook that separation.
iv. Perform k-means and DBSCAN with different subsets of the dimensions. If we find a set of dimensions d that does not separate any clusters, try to re-run the clustering on all dimensions except d. If we still get the same number of clusters, the dimensions in d are probably not significant.

4. Association analysis: 6 pts

Consider the data set shown in Table 2.

Table 2: Market basket transactions for question 4.

Customer ID  Transaction ID  Items bought
1            1001            {i1, i4, i5}
1            1024            {i1, i2, i3, i5}
2            1012            {i1, i2, i4, i5}
2            1031            {i1, i3, i4, i5}
3            1015            {i2, i3, i5}
3            1022            {i2, i4, i5}
4            1029            {i3, i4}
4            1040            {i1, i2, i3}
5            1033            {i1, i4, i5}
5            1038            {i1, i2, i5}

(a) Compute the support for the itemsets {i5}, {i2, i4}, and {i2, i4, i5} by treating each transaction ID as a market basket. (1 pt)
Answer:
s({i5}) = 8/10 = 0.8
s({i2, i4}) = 2/10 = 0.2
s({i2, i4, i5}) = 2/10 = 0.2

(b) Use the results in part 4a to compute the confidence for the association rules {i2, i4} → {i5} and {i5} → {i2, i4}. (1 pt)
Answer:
c({i2, i4} → {i5}) = 0.2/0.2 = 100%
c({i5} → {i2, i4}) = 0.2/0.8 = 25%

(c) Is confidence a symmetric measure? (1 pt)
Answer: No, confidence is not a symmetric measure.

(d) Repeat part 4a by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise). (1 pt)
Answer:
s({i5}) = 4/5 = 0.8
s({i2, i4}) = 5/5 = 1
s({i2, i4, i5}) = 4/5 = 0.8

(e) Use the results in part 4d to compute the confidence for the association rules {i2, i4} → {i5} and {i5} → {i2, i4}. (1 pt)
Answer:
c({i2, i4} → {i5}) = 0.8/1 = 80%
c({i5} → {i2, i4}) = 0.8/0.8 = 100%

(f) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or between c1 and c2. (1 pt)
Answer: There are no apparent relationships between s1, s2, c1, and c2.
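The support and confidence values in question 4 can be recomputed with a short Python sketch. The encoding of Table 2 and the helper names support and confidence below are my own assumptions, not part of the exam:

```python
# Table 2 as transaction ID -> set of items, plus the customer grouping.
transactions = {
    1001: {'i1', 'i4', 'i5'}, 1024: {'i1', 'i2', 'i3', 'i5'},
    1012: {'i1', 'i2', 'i4', 'i5'}, 1031: {'i1', 'i3', 'i4', 'i5'},
    1015: {'i2', 'i3', 'i5'}, 1022: {'i2', 'i4', 'i5'},
    1029: {'i3', 'i4'}, 1040: {'i1', 'i2', 'i3'},
    1033: {'i1', 'i4', 'i5'}, 1038: {'i1', 'i2', 'i5'},
}
customers = {1: [1001, 1024], 2: [1012, 1031], 3: [1015, 1022],
             4: [1029, 1040], 5: [1033, 1038]}

def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Confidence of the rule lhs -> rhs: s(lhs union rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

# (a)-(c): each transaction ID is a basket.
tx = list(transactions.values())
print(support({'i5'}, tx), support({'i2', 'i4'}, tx),            # 0.8, 0.2
      support({'i2', 'i4', 'i5'}, tx))                           # 0.2
print(confidence({'i2', 'i4'}, {'i5'}, tx),                      # 1.0  (100%)
      confidence({'i5'}, {'i2', 'i4'}, tx))                      # 0.25 (25%)

# (d)-(e): each customer ID is a basket (union of that customer's transactions).
cust = [set().union(*(transactions[t] for t in tids)) for tids in customers.values()]
print(support({'i5'}, cust), support({'i2', 'i4'}, cust),        # 0.8, 1.0
      support({'i2', 'i4', 'i5'}, cust))                         # 0.8
print(confidence({'i2', 'i4'}, {'i5'}, cust),                    # 0.8  (80%)
      confidence({'i5'}, {'i2', 'i4'}, cust))                    # 1.0  (100%)
```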

5. Association analysis and hash trees: 6 pts

The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the following set of candidate 3-itemsets:

{1, 2, 3}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {2, 4, 6}, {2, 4, 7}, {3, 4, 6}, {4, 5, 6}

(a) Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:

Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.

Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.

Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node. (3 pts)

Answer: See Figure 1.

Figure 1: Hash tree for problem 5. Leaf contents: L1 = {1,3,4}; L2 = {1,2,3}, {1,2,5}; L3 = {1,2,6}, {3,4,6}; L4 = {2,3,4}, {4,5,6}; L5 = {2,4,6}; L6 = {2,4,5}, {2,4,7}.

(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there? (1 pt)
Answer: There are 6 leaf nodes and 5 internal nodes.
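For illustration, here is a small Python sketch of the insertion procedure described in part (a). The Node class and bucket function are my own assumptions, not part of the exam; under those assumptions the sketch reproduces the node counts given in part (b).

```python
MAXSIZE = 2   # leaf capacity before splitting (maxsize in the question)
K = 3         # size of the candidate itemsets

def bucket(item):
    """Hash function: odd items go to the left child (0), even to the right (1)."""
    return 0 if item % 2 == 1 else 1

class Node:
    def __init__(self, depth):
        self.depth = depth      # the root has depth 0
        self.itemsets = []      # only used while the node is a leaf
        self.children = None    # [left, right] once converted to an internal node

    def insert(self, itemset):
        if self.children is not None:                 # internal node: descend
            self.children[bucket(itemset[self.depth])].insert(itemset)
        elif self.depth == K or len(self.itemsets) < MAXSIZE:
            self.itemsets.append(itemset)             # conditions 1 and 2
        else:                                         # condition 3: split the leaf
            self.children = [Node(self.depth + 1), Node(self.depth + 1)]
            for old in self.itemsets + [itemset]:
                self.children[bucket(old[self.depth])].insert(old)
            self.itemsets = []

def count_nodes(node):
    """Return (number of leaf nodes, number of internal nodes)."""
    if node.children is None:
        return 1, 0
    counts = [count_nodes(child) for child in node.children]
    return sum(l for l, _ in counts), 1 + sum(i for _, i in counts)

candidates = [(1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 4), (2, 3, 4),
              (2, 4, 5), (2, 4, 6), (2, 4, 7), (3, 4, 6), (4, 5, 6)]
root = Node(0)
for c in candidates:
    root.insert(c)

print(count_nodes(root))   # (6, 5): 6 leaf nodes and 5 internal nodes, as in (b)
```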

(c) Consider a transaction that contains the following items: {1, 2, 3, 5, 6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction? (2 pts)
Answer: The leaf nodes L1, L2, L3, and L4 will be checked against the transaction. The candidate 3-itemsets contained in the transaction are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}.

6. Data mining and privacy: 6 pts

(a) Explain what deidentification of a data set in data mining is and its possible advantages or disadvantages. (2 pts)
Answer: Deidentification of a data object is the process of removing information that can reveal the object's identity and thereby expose potentially private and sensitive information. The advantage is that privacy is protected, but it also means losing some possibilities of retrieving detailed information, and it can be difficult to achieve a high level of privacy protection without removing too much useful information. Even if all directly identifying information is removed, it may still be possible to recover the correct identity with high probability by combining publicly available data with non-directly identifying attributes of the data object. Techniques such as k-anonymity can mitigate this problem to some extent.

(b) Explain what restriction-based inference protection is and give an example of a specific type of restriction-based technique. (2 pts)
Answer: Restriction-based techniques prevent or restrict certain types of statistical queries. Examples include:
- query-set size control
- expanded query-set size control
- query-set overlap control
- audit-based control

(c) Explain what perturbation-based inference protection is and give an example of a specific type of perturbation-based technique. (2 pts)
Answer: Perturbation-based techniques modify the information that is stored or presented. Examples include:
- data swapping
- random-sample queries
- fixed perturbation
- query-based perturbation
- rounding (systematic, random, controlled)
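As a toy illustration of the two families of techniques (entirely my own sketch, with hypothetical data and names): the first function restricts which queries are answered (query-set size control), while the second perturbs what is reported (random rounding of counts).

```python
import random

# Hypothetical microdata: person -> salary.
salaries = {'anna': 41000, 'bo': 39000, 'cara': 52000, 'dan': 47000, 'eva': 45000}

MIN_QUERY_SET_SIZE = 3   # refuse statistics computed over too few individuals

def average_salary(names):
    """Restriction-based protection: query-set size control."""
    if len(names) < MIN_QUERY_SET_SIZE:
        raise ValueError("query refused: query set is below the minimum size")
    return sum(salaries[n] for n in names) / len(names)

def rounded_count(predicate, base=5):
    """Perturbation-based protection: counts randomly rounded to a multiple of base."""
    exact = sum(1 for v in salaries.values() if predicate(v))
    return base * round(exact / base + random.uniform(-0.5, 0.5))

print(average_salary(['anna', 'bo', 'cara']))   # answered: three individuals
print(rounded_count(lambda s: s > 40000))       # perturbed: 0 or 5, never the exact 4
```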

Good Luck! / Kjell