Data Mining Chapter 4. Algorithms: The Basic Methods
(Covering algorithms, Association rules, Linear models, Instance-based learning, Clustering)

Covering approach
At each stage you identify a rule that covers some of the instances (Fig. 4.6(a)).

A set of rules covering the a's:
  if x > 1.2 and y > 2.6 then class = a
  if x > 1.4 and y > 2.4 then class = a
A set of rules covering the b's:
  if x ≤ 1.2 then class = b
  if x > 1.2 and y ≤ 2.6 then class = b

Covering algorithm
Choose an attribute-value pair that maximizes the probability of the desired classification:
  include as many instances of the desired class as possible
  exclude as many instances of other classes as possible
Weka rules: the PRISM method

A basic rule learner
Maximize the accuracy p/t, where p = positive examples covered and t = total instances covered.

Contact lens problem
IF ? THEN recommendation = hard
  Age = young: 2/8
  Age = pre-presbyopic: 1/8
  Age = presbyopic: 1/8
  Spectacle prescription = myope: 3/12
  Spectacle prescription = hypermetrope: 1/12
  Astigmatism = no: 0/12
  Astigmatism = yes: 4/12
  Tear production rate = reduced: 0/12
  Tear production rate = normal: 4/12

Accuracy p/t
Break ties by choosing the condition with the largest p.
Select the largest fraction: 4/12 (astigmatism = yes and tear production rate = normal tie, so choose at random).
IF astigmatism = yes THEN recommendation = hard

Refinement
IF astigmatism = yes AND ? THEN recommendation = hard
  Age = young: 2/4
  Age = pre-presbyopic: 1/4
  Age = presbyopic: 1/4
  Spectacle prescription = myope: 3/6
  Spectacle prescription = hypermetrope: 1/6
  Tear production rate = reduced: 0/6
  Tear production rate = normal: 4/6
IF astigmatism = yes AND tear production rate = normal THEN recommendation = hard

Exact rules
IF astigmatism = yes AND tear production rate = normal AND ? THEN recommendation = hard
  Age = young: 2/2
  Age = pre-presbyopic: 1/2
  Age = presbyopic: 1/2
  Spectacle prescription = myope: 3/3 (greater coverage)
  Spectacle prescription = hypermetrope: 1/3
IF astigmatism = yes AND tear production rate = normal AND spectacle prescription = myope THEN recommendation = hard

Checking the coverage
The rule above covers 3 of the 4 "hard" instances, so delete those 3 instances and look for another rule:
IF ? THEN recommendation = hard
  best first choice: age = young (coverage: 7)
IF age = young AND astigmatism = yes AND tear production rate = normal THEN recommendation = hard (1/1)
This rule covers 2 of the original instances.

For the other classes
The same process is repeated for soft and none.
PRISM keeps adding clauses to each rule until it is perfect: it generates only correct rules.

Rules vs. trees
Tree: takes all classes into account.
Rules: one class at a time (more compact!).
Rules as a decision list: applied in order; execution stops as soon as one rule fires.
Order-independent rules: independent nuggets of knowledge.
  Disadvantage: it is not clear what to do when conflicting rules apply, e.g., rules that assign different classes to the same instance.
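To make the covering procedure concrete, here is a minimal sketch of a PRISM-style loop: grow a rule for one class by repeatedly adding the attribute-value test with the best p/t ratio (ties broken by larger p), stop when the rule is perfect, remove the covered instances, and repeat. This is an illustration under simplifying assumptions, not Weka's PRISM; the tiny dataset and names are hypothetical.

```python
def prism_rules_for_class(data, attributes, cls, class_attr="class"):
    """Covering loop in the spirit of PRISM: build perfect rules for one class,
    removing covered instances after each rule. `data` is a list of dicts."""
    remaining = list(data)
    rules = []
    while any(x[class_attr] == cls for x in remaining):
        rule, covered = [], remaining
        while any(x[class_attr] != cls for x in covered):       # rule not yet perfect
            best, best_score = None, (-1.0, 0)
            for attr in attributes:
                if any(attr == a for a, _ in rule):              # attribute already tested
                    continue
                for val in {x[attr] for x in covered}:
                    t = [x for x in covered if x[attr] == val]
                    p = sum(x[class_attr] == cls for x in t)
                    score = (p / len(t), p)                      # accuracy p/t, ties -> larger p
                    if score > best_score:
                        best, best_score = (attr, val), score
            if best is None:                                     # contradictory duplicates: give up
                break
            rule.append(best)
            covered = [x for x in covered if x[best[0]] == best[1]]
        rules.append(rule)
        remaining = [x for x in remaining                        # delete the instances it covers
                     if not all(x[a] == v for a, v in rule)]
    return rules

# Hypothetical toy data with two attributes.
data = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
    {"outlook": "sunny", "windy": "no", "class": "play"},
]
print(prism_rules_for_class(data, ["outlook", "windy"], "play"))
# -> [[('outlook', 'sunny'), ('windy', 'no')]]
```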

Mining association rules
Weka: the Apriori algorithm.
Coverage = support; accuracy = confidence.
Look for association rules with high coverage; an attribute-value pair is an item.
Item sets: Table 4.10 lists the item sets for the weather data with coverage 2 or greater (one-item sets, two-item sets, three-item sets, and six four-item sets).

A three-item set with a coverage of 4 (Table 4.10): humidity = normal, windy = false, play = yes.
It gives 7 potential rules:
  IF humidity = normal AND windy = false THEN play = yes                        4/4
  IF humidity = normal AND play = yes THEN windy = false                        4/6
  IF windy = false AND play = yes THEN humidity = normal                        4/6
  IF humidity = normal THEN windy = false AND play = yes                        4/7
  IF windy = false THEN humidity = normal AND play = yes                        4/8
  IF play = yes THEN humidity = normal AND windy = false                        4/9
  IF (empty) THEN humidity = normal AND windy = false AND play = yes            4/14
Here 4 is the coverage and 4/4, 4/6, ... are the accuracies.
If the minimum specified accuracy is 100%, only the first rule qualifies.
Table 4.11 shows the final rule set.
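The seven rules above are simply all ways of splitting one item set into an antecedent and a consequent, with coverage and accuracy counted over the data. A small sketch, assuming the standard 14-instance nominal weather dataset distributed with Weka (that dataset, and the function names, are assumptions made for illustration):

```python
from itertools import combinations

# Assumed 14-instance nominal weather data (as distributed with Weka).
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(ATTRS, row)) for row in ROWS]

def covers(instance, items):
    """True if the instance contains every attribute=value pair in `items`."""
    return all(instance[a] == v for a, v in items)

def rules_from_itemset(itemset, data):
    """Enumerate every rule 'antecedent => consequent' derivable from one item set
    and report its coverage (support) and accuracy (confidence)."""
    items = list(itemset)
    support = sum(covers(x, items) for x in data)         # coverage of the whole item set
    for r in range(len(items)):                           # r = size of the antecedent
        for antecedent in combinations(items, r):
            t = sum(covers(x, antecedent) for x in data)  # instances the IF-part covers
            consequent = tuple(it for it in items if it not in antecedent)
            yield antecedent, consequent, support, t

itemset = [("humidity", "normal"), ("windy", "false"), ("play", "yes")]
for ante, cons, p, t in rules_from_itemset(itemset, DATA):
    print(f"IF {ante} THEN {cons}: coverage {p}, accuracy {p}/{t}")
```

Run over the weather data, this reproduces the coverage and accuracy figures listed above.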

Linear models
Numeric prediction: linear regression.
The class and all attributes are numeric.
Express the class as a linear combination of the attributes:
  x = w_0 + w_1 a_1 + w_2 a_2 + ... + w_k a_k
where x is the class, the w_j are weights, and the a_j are attribute values.

Minimizing the sum of the squared differences
Predicted value for the first instance (the superscript (1) denotes the first training instance, and a_0 always has the value 1):
  w_0 a_0^{(1)} + w_1 a_1^{(1)} + ... + w_k a_k^{(1)} = \sum_{j=0}^{k} w_j a_j^{(1)}
The sum of the squared differences over n training instances, where x^{(i)} is the i-th instance's actual class value:
  \sum_{i=1}^{n} ( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} )^2
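A brief sketch of fitting such weights by least squares with NumPy; the data here is synthetic and purely illustrative, and np.linalg.lstsq stands in for working out the minimization by hand:

```python
import numpy as np

# Hypothetical data: 20 instances, k = 3 numeric attributes, a known linear target.
rng = np.random.default_rng(0)
A = rng.uniform(size=(20, 3))
x = 1.5 + A @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.01, size=20)

A1 = np.hstack([np.ones((20, 1)), A])        # prepend the constant attribute a_0 = 1
w, *_ = np.linalg.lstsq(A1, x, rcond=None)   # weights minimizing the sum of squared differences
print(w)                                     # approximately [1.5, 2.0, -1.0, 0.5]

pred = A1 @ w                                # predicted class values
sse = np.sum((x - pred) ** 2)                # the quantity being minimized
```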

Disadvantages of linear models
Linearity: for a nonlinear dependency, the method can only find the best-fitting straight line (in the least mean-squared-difference sense).
Still, linear models serve well as building blocks for more complex learning methods.

Linear classification using the perceptron
From a biological viewpoint, a mathematical model of the operation of the brain; a method of representing functions using networks.

Neural networks
Input units (nodes), hidden units, output units; links with (numeric) weights.
Network structure: feed-forward (unidirectional, no cycles).
Input function in_i, activation function g.
[Figure: a small feed-forward network with inputs I_1, I_2, hidden units H_3, H_4, output O_5 and weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}, w_{3,5}, w_{4,5}.]

Each unit applies its activation function to the weighted sum of its inputs:
  a_i = g(in_i) = g( \sum_j w_{j,i} a_j )
For the network above:
  a_5 = g(w_{3,5} a_3 + w_{4,5} a_4)
      = g( w_{3,5} g(w_{1,3} a_1 + w_{2,3} a_2) + w_{4,5} g(w_{1,4} a_1 + w_{2,4} a_2) )
which is a complex nonlinear function of the inputs.
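A tiny sketch of that forward computation for the 2-2-1 network, with a sigmoid chosen as g and hypothetical weight values:

```python
import math

def g(x):
    """Sigmoid activation (one common choice for g)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(a1, a2, w):
    """Forward pass through the 2-2-1 network sketched above.
    `w` maps (from, to) unit indices to weights; all values are hypothetical."""
    a3 = g(w[(1, 3)] * a1 + w[(2, 3)] * a2)   # hidden unit H3
    a4 = g(w[(1, 4)] * a1 + w[(2, 4)] * a2)   # hidden unit H4
    a5 = g(w[(3, 5)] * a3 + w[(4, 5)] * a4)   # output unit O5
    return a5

weights = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.3, (2, 4): 0.8, (3, 5): 1.2, (4, 5): -0.7}
print(forward(1.0, 0.0, weights))
```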

[Figure: a threshold unit with inputs a_1, ..., a_n, weights w_1, ..., w_n and threshold T.]
If a_1 w_1 + a_2 w_2 + ... + a_n w_n > T, the example is classified as positive; otherwise as negative.

Activation function g
  1 (firing) when the input is greater than the threshold; 0 (no firing) otherwise.
  Hard threshold: 1 for positive, 0 or -1 for negative.
  Sigmoid function: a smooth transition, giving a predicted value anywhere between 0 and 1.

Perceptrons
Layered feed-forward networks; a single-layer perceptron has no hidden layer.
Weight update rule
  Predicted output for the single output unit: O; correct output: T; error: T - O.
  If the error is positive, O needs to be increased; if it is negative, O needs to be decreased.
  With learning rate (gain factor) η and activation I_j of input j:
    W_j <- W_j + η × I_j × Error

Example: classification errors lead to changes in the weights.
  When the misclassified instance is positive: Δw_i = η v_i
  When the misclassified instance is negative: Δw_i = -η v_i

Worked example
Initial hypothesis: classify as positive if 1.0 Height + (-2.0) Girth ≥ 1.5.
η = 0.04; instances (Girth, Height) = {(1.75, 6.0), (2.0, 5.0), (2.5, 5.0), (3.0, 6.25)}
  Positives = {(1.75, 6.0), (2.0, 5.0)}; Negatives = {(2.5, 5.0), (3.0, 6.25)}
The misclassified instance is positive: (2.0, 5.0)
  Δw for the threshold = 0.04 × 1.0 = 0.04
  Δw for girth = 0.04 × 2.0 = 0.08
  Δw for height = 0.04 × 5.0 = 0.2
  New hypothesis: 1.2 Height + (-1.92) Girth ≥ 1.54

The misclassified instance is negative: (3.0, 6.25)
  Δw for the threshold = 0.04
  Δw for girth = 0.04 × 3.0 = 0.12
  Δw for height = 0.04 × 6.25 = 0.25
  New hypothesis: 0.95 Height + (-2.04) Girth ≥ 1.5
Final revised hypothesis: 1.15 Height + (-1.96) Girth ≥ 1.54
If the training set is linearly separable, the perceptron is guaranteed to converge in a finite number of iterations; it can still give useful approximations when the target concept is not linearly separable.
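A minimal perceptron training sketch using the update rule W_j <- W_j + η × I_j × Error. The threshold is folded in as a weight on a constant -1 input, which is a common formulation rather than the exact bookkeeping used in the worked example above; the function name and epoch cap are assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=0.04, epochs=10_000):
    """Sketch of perceptron training. X: (n, k) numeric attributes; y: labels in {0, 1}.
    `epochs` is a generous cap; the loop stops as soon as an epoch is error-free."""
    Xb = np.hstack([X, -np.ones((len(X), 1))])     # last column plays the role of the threshold
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            out = 1 if xi @ w > 0 else 0           # hard-threshold activation
            err = target - out                     # Error = T - O
            if err != 0:
                w += eta * err * xi                # W_j <- W_j + eta * I_j * Error
                errors += 1
        if errors == 0:                            # converged: every instance classified correctly
            break
    return w

# Girth/Height instances from the slide, used purely as illustration.
X = np.array([[1.75, 6.0], [2.0, 5.0], [2.5, 5.0], [3.0, 6.25]])
y = np.array([1, 1, 0, 0])
print(train_perceptron(X, y))
```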

Instance-based learning
Once the nearest training instance has been located, its class is predicted for the test instance.
Distance function: determines which member of the training set is closest to an unknown test instance.
Euclidean distance between an instance with values a_1^{(1)}, a_2^{(1)}, ..., a_k^{(1)} and one with values a_1^{(2)}, a_2^{(2)}, ..., a_k^{(2)} (k attributes; superscripts index instances): based on the sum of squares
  (a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + ... + (a_k^{(1)} - a_k^{(2)})^2
Normalization to [0, 1]:
  a_i = (v_i - min v_i) / (max v_i - min v_i)
Nominal attributes: if the values are the same the difference is 0, otherwise it is 1.
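A short sketch of normalized Euclidean 1-NN on hypothetical numeric data (names and values are illustrative):

```python
import numpy as np

def normalize(X):
    """Scale each numeric attribute to [0, 1] using min/max from the training data."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo), lo, hi

def nearest_neighbor(X_train, y_train, query):
    """1-NN: return the class of the training instance closest in Euclidean distance."""
    d = np.sqrt(((X_train - query) ** 2).sum(axis=1))   # distance to every training instance
    return y_train[np.argmin(d)]

# Hypothetical data: 5 training instances with 2 numeric attributes.
X = np.array([[2.0, 30.0], [1.0, 10.0], [3.0, 20.0], [2.5, 25.0], [1.5, 15.0]])
y = np.array(["a", "b", "a", "a", "b"])
Xn, lo, hi = normalize(X)
q = (np.array([1.2, 12.0]) - lo) / (hi - lo)            # normalize the query the same way
print(nearest_neighbor(Xn, y, q))                        # -> "b"
```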

Finding nearest neighbors efficiently
The naive way to find which member of the training set is closest to an unknown test instance is to calculate the distance from every member of the training set and select the smallest, which is linear in the number of training instances.
Instead, represent the training set as a tree.
kd-tree: stores a set of points in k-dimensional space, where k is the number of attributes.
[Figure: an example kd-tree; the root splits the points horizontally, the next level vertically.]

Speeding up nearest-neighbor calculations
[Figure: seven training points stored in a kd-tree; the root splits horizontally (h), the next level vertically (v), and descending the tree leads to the leaf region containing the target, whose stored point is marked "closest".]
The point found this way is a good first approximation to the nearest neighbor, and locating it takes time proportional to the depth of the tree, roughly log_2 n for n training instances.
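A compact kd-tree sketch (median splits, cycling through the axes) with the descend-then-backtrack search just described; it is illustrative rather than Weka's implementation, and the points are arbitrary:

```python
from collections import namedtuple
import math

Node = namedtuple("Node", "point left right axis")

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_kdtree(points, depth=0):
    """Build a kd-tree by splitting on the median, alternating the splitting axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1),
                axis)

def nearest(node, target, best=None):
    """Descend toward the target, then backtrack, pruning branches that cannot
    contain a point closer than the best found so far."""
    if node is None:
        return best
    if best is None or dist(node.point, target) < dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    close, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(close, target, best)
    if abs(diff) < dist(best, target):      # the splitting plane is closer than the current best,
        best = nearest(far, target, best)   # so the far side may still hold the answer
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))                # -> (8, 1)
```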

Using hyperspheres, not hyperrectangles
Squares are not the best shape because of their corners.
Ball tree (Fig. 4.14)
  Internal nodes store the center and radius of their ball; leaf nodes store the points they contain.
Splitting method
  Choose the point in the ball that is farthest from its center, then choose a second point that is farthest from the first one.
  Assign all data points in the ball to the closer of these two cluster centers.
  Then compute the centroid of each cluster and the minimum radius required for it to enclose all its data points.
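The splitting method can be sketched directly; a hypothetical helper that splits one ball into two child balls, following the steps above:

```python
import numpy as np

def split_ball(points):
    """One ball-tree split (illustrative sketch). `points` is an (n, k) array of the
    points inside the ball being split."""
    center = points.mean(axis=0)
    # 1. the point farthest from the ball's center
    p1 = points[np.argmax(np.linalg.norm(points - center, axis=1))]
    # 2. the point farthest from the first one
    p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]
    # 3. assign every point to the closer of the two provisional centers
    to_p1 = np.linalg.norm(points - p1, axis=1) <= np.linalg.norm(points - p2, axis=1)
    balls = []
    for cluster in (points[to_p1], points[~to_p1]):
        # 4. each child ball gets the cluster centroid and the smallest enclosing radius
        centroid = cluster.mean(axis=0)
        radius = np.linalg.norm(cluster - centroid, axis=1).max()
        balls.append((centroid, radius, cluster))
    return balls

pts = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
for centroid, radius, members in split_ball(pts):
    print(centroid, radius, len(members))
```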

To use a ball tree to find the nearest neighbor to a given target
  Traverse the tree from the top down to locate the leaf that contains the target, and find the closest point to the target in that ball; this gives an upper bound on the nearest-neighbor distance.
  Then back up and consider sibling nodes: if the distance from the target to the sibling's center exceeds the sibling's radius plus the current upper bound, it cannot possibly contain a closer point and can be ruled out (Fig. 4.15). Otherwise the sibling must be examined by descending the tree further.
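The rule-out test itself is a single comparison; a tiny illustrative helper (the names are assumptions):

```python
import numpy as np

def can_rule_out(target, sibling_center, sibling_radius, upper_bound):
    """A sibling ball cannot contain a closer point than the current best if the
    distance from the target to its center exceeds its radius plus the upper bound."""
    gap = np.linalg.norm(np.asarray(target) - np.asarray(sibling_center))
    return gap > sibling_radius + upper_bound

# Example: best-so-far distance 1.0; a ball of radius 2.0 centered 4.0 away can be skipped.
print(can_rule_out([0.0, 0.0], [4.0, 0.0], 2.0, 1.0))   # True
```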

Clustering techniques
Clustering is used when there is no class to be predicted but the instances are to be divided into natural groups (unsupervised learning).
Iterative distance-based clustering: k-means
  Specify in advance how many clusters are being sought: k.
  k points are chosen at random as cluster centers.
  All instances are assigned to their closest cluster center according to the Euclidean distance metric.
  The mean of the instances in each cluster is calculated.

These means are taken to be the new center values for their respective clusters, and the process is repeated until the cluster centers have stabilized.
k-means minimizes
  V = \sum_{i=1}^{k} \sum_{x_j \in S_i} || x_j - μ_i ||^2
where k is the number of clusters, S_i (i = 1, 2, ..., k) are the clusters, and μ_i is the mean of the points x_j \in S_i.

The overall effect
  Minimizes the total squared distance from all points to their cluster centers.
  The minimum reached is a local optimum, not necessarily a global one, and the final clusters are sensitive to the initial cluster centers.
  Remedy: run the algorithm several times with different initial choices and choose the best final result, i.e., the one with the smallest total squared distance.
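A minimal k-means sketch following the steps above (random initial centers, assign, re-average, repeat until stable), together with the restart strategy of keeping the run with the smallest total squared distance; the data is synthetic and the function name is an assumption.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Sketch of k-means. X: (n, d) numeric data; k chosen in advance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # k random instances as centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # point-to-center distances
        assign = d.argmin(axis=1)                                # nearest center for each instance
        new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                    # centers have stabilized
            break
        centers = new_centers
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    V = sum(((X[assign == i] - centers[i]) ** 2).sum() for i in range(k))  # total squared distance
    return centers, assign, V

# Synthetic data: three blobs. Because the result depends on the initial centers,
# run several times and keep the run with the smallest V, as suggested above.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(30, 2))
               for c in ((0, 0), (5, 5), (0, 5))])
best = min((kmeans(X, k=3, seed=s) for s in range(10)), key=lambda r: r[2])
print(best[2])
```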

http://cis.catholic.ac.kr/sunoh