Data Mining. Covering algorithms.


Kristian Hampton
6 years ago

Data Mining
Chapter 4. Algorithms: The Basic Methods
(Covering algorithms, association rules, linear models, instance-based learning, clustering)

Covering approach
  At each stage, identify a rule that covers some of the instances.
  Fig. 4.6 (a)

A set of rules covering the a's
  if x > 1.2 and y > 2.6 then class = a
  if x > 1.4 and y > 2.4 then class = a
A set of rules covering the b's
  if x ≤ 1.2 then class = b
  if x > 1.2 and y ≤ 2.6 then class = b

Covering algorithm
  Choose an attribute-value pair that maximizes the probability of the desired classification:
    include as many instances of the desired class as possible;
    exclude as many instances of other classes as possible.
  Weka rules: PRISM method

A basic rule learner
  Maximize the accuracy p / t, where t is the number of instances the rule covers and p is the number of those that are positive examples of the class.

Contact lens problem
  IF ? THEN recommendation = hard
    Age = young                              2/8
    Age = pre-presbyopic                     1/8
    Age = presbyopic                         1/8
    Spectacle prescription = myope           3/12
    Spectacle prescription = hypermetrope    1/12
    Astigmatism = no                         0/12
    Astigmatism = yes                        4/12
    Tear production rate = reduced           0/12
    Tear production rate = normal            4/12

Accuracy p / t
  Break ties by choosing the condition with the largest p.
  Select the largest fraction: 4/12. Two conditions tie at 4/12, so one is chosen at random:
    IF astigmatism = yes THEN recommendation = hard

Refinement
  IF astigmatism = yes AND ? THEN recommendation = hard
    Age = young                              2/4
    Age = pre-presbyopic                     1/4
    Age = presbyopic                         1/4
    Spectacle prescription = myope           3/6
    Spectacle prescription = hypermetrope    1/6
    Tear production rate = reduced           0/6
    Tear production rate = normal            4/6
  Best choice (4/6):
    IF astigmatism = yes AND tear production rate = normal THEN recommendation = hard

Exact rules
  IF astigmatism = yes AND tear production rate = normal AND ? THEN recommendation = hard
    Age = young                              2/2
    Age = pre-presbyopic                     1/2
    Age = presbyopic                         1/2
    Spectacle prescription = myope           3/3   (greater coverage)
    Spectacle prescription = hypermetrope    1/3
  Two conditions are exact (2/2 and 3/3); the one with the greater coverage is chosen:
    IF astigmatism = yes AND tear production rate = normal AND spectacle prescription = myope THEN recommendation = hard

Checking the coverage
  The rule above covers 3 of the 4 hard instances.
  Delete these 3 instances and look for another rule.
    IF ? THEN recommendation = hard
    Best choice: age = young (coverage: 7)
    IF age = young AND astigmatism = yes AND tear production rate = normal THEN recommendation = hard    1/1
  This rule covers 2 of the original instances.
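The rule-growing procedure walked through above can be sketched in a few lines of Python. This is a minimal illustration, not Weka's PRISM implementation. The `label` function is an assumption: it encodes a 24-instance contact lens data set whose class labels reproduce the p/t counts shown on the slides. Where the slides break a remaining tie at random, this sketch simply keeps the first candidate it meets. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Attribute values; the 24 instances are every combination of these.
VALUES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle": ["myope", "hypermetrope"],
    "astigmatism": ["no", "yes"],
    "tear": ["reduced", "normal"],
}
ATTRS = list(VALUES)


def label(age, spectacle, astigmatism, tear):
    """Assumed class labels, chosen to match the p/t counts on the slides."""
    if tear == "reduced":
        return "none"
    if astigmatism == "no":
        return "none" if (age, spectacle) == ("presbyopic", "myope") else "soft"
    return "hard" if spectacle == "myope" or age == "young" else "none"


def build_data():
    rows = []
    for combo in product(*(VALUES[a] for a in ATTRS)):
        row = dict(zip(ATTRS, combo))
        row["cls"] = label(*combo)
        rows.append(row)
    return rows


def best_condition(rows, target, rule):
    """Pick the attribute-value pair with the highest p/t; ties go to larger p."""
    best_key, best_pair = None, None
    for attr in ATTRS:
        if attr in rule:          # each attribute is used at most once per rule
            continue
        for value in VALUES[attr]:
            covered = [r for r in rows if r[attr] == value]
            if not covered:
                continue
            p = sum(r["cls"] == target for r in covered)
            key = (Fraction(p, len(covered)), p)
            if best_key is None or key > best_key:
                best_key, best_pair = key, (attr, value)
    return best_pair


def learn_rule(rows, target):
    """Add conditions until the rule covers only the target class."""
    rule, covered = {}, rows
    while any(r["cls"] != target for r in covered) and len(rule) < len(ATTRS):
        attr, value = best_condition(covered, target, rule)
        rule[attr] = value
        covered = [r for r in covered if r[attr] == value]
    return rule


def prism(rows, target):
    """Learn rules for one class, removing covered instances after each rule."""
    rules, remaining = [], list(rows)
    while any(r["cls"] == target for r in remaining):
        rule = learn_rule(remaining, target)
        rules.append(rule)
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in rule.items())]
    return rules


if __name__ == "__main__":
    for rule in prism(build_data(), "hard"):
        print(rule)
```

Under these assumptions, `prism(build_data(), "hard")` should reproduce the two rules derived on the slides: first astigmatism = yes, tear = normal, spectacle = myope, then age = young, astigmatism = yes, tear = normal.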

For another class
  Repeat the process for soft and none.
  PRISM: add clauses to each rule until it is perfect, so only correct rules are generated.

Rules vs. trees
  Tree: takes all classes into account
  Rule: concentrates on one class at a time (more compact!)
  Decision list
    Rules are applied in order.
    Execution stops as soon as one rule applies.
  Order-independent rules
    Independent nuggets of knowledge
    Disadvantage: it is not clear what to do when conflicting rules apply, e.g., rules that predict different classes for the same instance.

Mining association rules
  Association rules
    Weka: Apriori algorithm
    Coverage = support; accuracy = confidence
  Look for association rules with high coverage.
    Item: an attribute-value pair
    Item set: a set of items
  Table 4.10: item sets for the weather data with coverage 2 or greater
    One-item sets, two-item sets, three-item sets, and 6 four-item sets

Mining association rules
  A three-item set with a coverage of 4 (Table 4.10):
    humidity = normal, windy = false, play = yes
  7 potential rules:
    IF humidity = normal AND windy = false THEN play = yes                  4/4
    IF humidity = normal AND play = yes THEN windy = false                  4/6
    IF windy = false AND play = yes THEN humidity = normal                  4/6
    IF humidity = normal THEN windy = false AND play = yes                  4/7
    IF windy = false THEN humidity = normal AND play = yes                  4/8
    IF play = yes THEN humidity = normal AND windy = false                  4/9
    IF (nothing) THEN humidity = normal AND windy = false AND play = yes    4/14
  The numerator, 4, is the coverage; the fraction (e.g., 4/4) is the accuracy.
  With a minimum specified accuracy of 100%, only the 1st rule qualifies.
  Table 4.11: the final rule set
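The way one item set yields 7 candidate rules can be sketched as follows. This is an illustrative enumeration, not the Apriori algorithm itself. The `WEATHER` table is an assumption: it is the commonly used 14-instance nominal weather data, whose counts agree with the fractions listed above. The function names are hypothetical.

```python
from itertools import combinations

COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
# Assumed data: the 14-instance nominal weather data set.
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in WEATHER]


def coverage(items):
    """Number of instances that contain every (attribute, value) item."""
    return sum(all(row[a] == v for a, v in items) for row in ROWS)


def candidate_rules(item_set):
    """Return (antecedent, consequent, coverage, antecedent_count) tuples.

    Every split of the item set with a non-empty consequent is a rule;
    its accuracy is coverage / antecedent_count.
    """
    items = sorted(item_set.items())
    support = coverage(items)
    rules = []
    for size in range(len(items)):  # antecedent sizes 0 .. len(items) - 1
        for antecedent in combinations(items, size):
            consequent = [item for item in items if item not in antecedent]
            rules.append((dict(antecedent), dict(consequent),
                          support, coverage(antecedent)))
    return rules


if __name__ == "__main__":
    item_set = {"humidity": "normal", "windy": "false", "play": "yes"}
    for antecedent, consequent, p, t in candidate_rules(item_set):
        print(f"IF {antecedent} THEN {consequent}  {p}/{t}")
```

Keeping only the rules whose accuracy meets the minimum (here, those with `p == t`) leaves the single 4/4 rule.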

Linear models
  Numeric prediction: linear regression
    The class and all attributes are numeric.
    Express the class as a linear combination of the attributes:
      $x = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k$
    where x is the class, $w_j$ are the weights, and $a_j$ are the attribute values.

Linear models
  Minimize the sum of the squared differences.
    Predicted value for the 1st instance (the superscript (1) marks the 1st instance; $a_0$ is always 1):
      $w_0 a_0^{(1)} + w_1 a_1^{(1)} + \dots + w_k a_k^{(1)} = \sum_{j=0}^{k} w_j a_j^{(1)}$
    Sum of the squared differences over the training data:
      $\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^2$
    where n is the number of training instances and $x^{(i)}$ is the ith instance's actual class.
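To make the squared-error criterion concrete, here is a small sketch. `squared_error` evaluates the sum above for any weight vector, and `fit_line` gives the closed-form least-squares weights for the special case of a single attribute (k = 1). The function names and the toy data are assumptions for illustration; the slides do not prescribe a particular solution method.

```python
def squared_error(weights, instances, targets):
    """Sum over instances of (x_i - sum_j w_j * a_ij)^2, taking a_i0 = 1."""
    total = 0.0
    for attrs, x in zip(instances, targets):
        predicted = weights[0] + sum(w * a for w, a in zip(weights[1:], attrs))
        total += (x - predicted) ** 2
    return total


def fit_line(a, x):
    """Least-squares w0, w1 for one attribute: x is approximated by w0 + w1 * a."""
    n = len(a)
    mean_a = sum(a) / n
    mean_x = sum(x) / n
    s_ax = sum((ai - mean_a) * (xi - mean_x) for ai, xi in zip(a, x))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    w1 = s_ax / s_aa
    w0 = mean_x - w1 * mean_a
    return w0, w1


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]          # hypothetical attribute values
    x = [3.1, 4.9, 7.2, 8.8]          # hypothetical class values
    w0, w1 = fit_line(a, x)
    print(w0, w1, squared_error([w0, w1], [[ai] for ai in a], x))
```

Any other choice of weights for the same data gives a squared error at least as large as the fitted one, which is what "minimizing the sum of the squared differences" means.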

Linear models
  Disadvantage: linearity
    If the data exhibit a nonlinear dependency, the best-fitting straight line is still found, where "best" means the least mean-squared difference.
  Linear models serve well as building blocks for more complex learning methods.

Linear classification using the perceptron
  From a biological viewpoint: a mathematical model for the operation of the brain
  A method of representing functions using networks

Linear classification using the perceptron
  Neural networks
    Input units (nodes), hidden units, output units
    Links with (numeric) weights
    Network structure: feed-forward (unidirectional, no cycles)
    Input function $in_i$, activation function g
  [Figure: feed-forward network with input units I1 and I2, hidden units H3 and H4, output unit O5, and weights w13, w14, w23, w24, w35, w45]

Linear classification using the perceptron
  [Figure: a single unit; input links feed $in_i$, g is applied, and $a_i$ is passed to the output links]
    $a_i = g(in_i) = g\left( \sum_j w_{j,i} a_j \right)$
    $a_5 = g(w_{3,5} a_3 + w_{4,5} a_4) = g\big( w_{3,5}\, g(w_{1,3} a_1 + w_{2,3} a_2) + w_{4,5}\, g(w_{1,4} a_1 + w_{2,4} a_2) \big)$
  The result is a complex nonlinear function.
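The expression for $a_5$ can be evaluated directly. The sketch below is a minimal forward pass for the two-input, two-hidden-unit network in the figure; the dictionary keyed by `(j, i)` link and the example weights are hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    """Smooth activation function with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(a1, a2, w, g=sigmoid):
    """Compute a5 for inputs a1, a2; w[(j, i)] is the weight on link j -> i."""
    a3 = g(w[1, 3] * a1 + w[2, 3] * a2)   # hidden unit H3
    a4 = g(w[1, 4] * a1 + w[2, 4] * a2)   # hidden unit H4
    return g(w[3, 5] * a3 + w[4, 5] * a4)  # output unit O5


if __name__ == "__main__":
    # Hypothetical weights, chosen only to show the call.
    weights = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.3,
               (2, 4): 0.8, (3, 5): 1.2, (4, 5): -0.7}
    print(forward(1.0, 0.0, weights))
```

Passing a different `g` (for example a hard threshold) changes the activation function without changing the network structure.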

Linear classification using the perceptron
  [Figure: a threshold unit with inputs a_1, a_2, ..., a_n, weights w_1, w_2, ..., w_n, threshold T, and a +/- output]
  If $a_1 w_1 + a_2 w_2 + \dots + a_n w_n > T$ then positive example, else negative example.

Linear classification using the perceptron
  Activation function g
    Hard threshold
      1 (firing): when the input is greater than its threshold
      0 (no firing): otherwise
      1: positive; 0 or -1: negative
    Sigmoid function: smooth transition
      Gives a predicted value anywhere between 0 and 1

Linear classification using the perceptron
  Perceptrons
    Layered feed-forward networks
    Single-layer: no hidden layer
  Weight update rule
    Predicted output for the single output unit: O
    Correct output: T
    Error: T − O
    If the error is positive, increase O; if it is negative, decrease O.
      $W_j \leftarrow W_j + \eta \times I_j \times \mathrm{Error}$
    where η is the learning rate (gain factor) and $I_j$ is the activation of input j.

Linear classification using the perceptron
  Example: classification errors lead to changes in the weights.
    When the misclassified instance is positive, $\Delta w_i = +\eta v_i$
    When the misclassified instance is negative, $\Delta w_i = -\eta v_i$

Linear classification using the perceptron
  Initial hypothesis: 1.0 Height […] Girth […] 1.5
  η = 0.04
  Instances (Girth, Height) = {(1.75, 6.0), (2.0, 5.0), (2.5, 5.0), (3.0, 6.25)}
  Positives = {(1.75, 6.0), (2.0, 5.0)}
  Negatives = {(2.5, 5.0), (3.0, 6.25)}
  The misclassified instance is positive: (2.0, 5.0)
    Δw for threshold = 0.04
    Δw for girth = 0.04 × 2.0 = 0.08
    Δw for height = 0.04 × 5.0 = 0.20
  [Figure: hypothesis line plotted on Height (H) against Girth (G)]

Linear classification using the perceptron
  The misclassified instance is negative: (3.0, 6.25)
    Δw for threshold = −0.04
    Δw for girth = −0.04 × 3.0 = −0.12
    Δw for height = −0.04 × 6.25 = −0.25
  [Figure: revised hypothesis line plotted on Height (H) against Girth (G)]
  Final revised hypothesis: 1.15 Height […] Girth […] 1.54
  If the training set is linearly separable, the perceptron learning rule is guaranteed to converge in a finite number of iterations.
  It gives useful approximations even when the target concepts are not linearly separable.
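The update rule from the example can be run on the four instances given above. The sketch below is a minimal perceptron trainer under two assumptions that differ from the slides: the threshold is folded in as a weight on a constant input of 1, and the weights start at zero rather than at the slides' initial hypothesis (whose coefficients are not fully recoverable here). The function names are hypothetical.

```python
def predict(w, x):
    """Positive if the weighted sum (x[0] = 1 is the constant input) exceeds 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0


def update(w, x, is_positive, eta):
    """Add eta * x for a misclassified positive, subtract it for a negative."""
    sign = 1.0 if is_positive else -1.0
    return [wi + sign * eta * xi for wi, xi in zip(w, x)]


def train(positives, negatives, eta=0.04, max_epochs=10000):
    """Cycle through the instances, updating on mistakes, until none remain."""
    data = [((1.0,) + tuple(p), True) for p in positives]
    data += [((1.0,) + tuple(n), False) for n in negatives]
    w = [0.0] * len(data[0][0])       # assumed starting point: all zeros
    for _ in range(max_epochs):
        mistakes = 0
        for x, is_positive in data:
            if predict(w, x) != is_positive:
                w = update(w, x, is_positive, eta)
                mistakes += 1
        if mistakes == 0:
            break
    return w


if __name__ == "__main__":
    positives = [(1.75, 6.0), (2.0, 5.0)]    # (girth, height)
    negatives = [(2.5, 5.0), (3.0, 6.25)]
    print(train(positives, negatives))
```

A single call such as `update([0.0, 0.0, 0.0], (1.0, 2.0, 5.0), True, 0.04)` reproduces the three weight changes listed for the misclassified positive instance (2.0, 5.0).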

Instance-based learning
  Once the nearest training instance has been located, its class is predicted for the test instance.
  Distance function
    Determines which member of the training set is closest to an unknown test instance
    Euclidean distance
      Distance between an instance with attribute values $a_1^{(1)}, a_2^{(1)}, \dots, a_k^{(1)}$ and one with values $a_1^{(2)}, a_2^{(2)}, \dots, a_k^{(2)}$
      (k: number of attributes; the superscript identifies the instance)

  Euclidean distance: the square root of the sum of squares
    $\sqrt{ (a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + \dots + (a_k^{(1)} - a_k^{(2)})^2 }$
  Normalization (to [0..1])
    $a_i = \dfrac{v_i - \min v_i}{\max v_i - \min v_i}$
  Nominal attributes
    If the values are the same, difference: 0. Otherwise, difference: 1.
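The distance function, the normalization, and the rule for nominal attributes combine naturally into one routine. The sketch below is one possible arrangement; the `ranges` convention (a `(min, max)` pair for numeric attributes, `None` for nominal ones) and the function names are assumptions for illustration, and it presumes max > min for every numeric attribute.

```python
import math


def normalize(v, v_min, v_max):
    """Rescale a numeric value to [0, 1] given the attribute's min and max."""
    return (v - v_min) / (v_max - v_min)


def distance(x, y, ranges):
    """Euclidean distance over normalized numeric and 0/1 nominal differences.

    ranges[i] is (min, max) for a numeric attribute, or None for a nominal one.
    """
    total = 0.0
    for xi, yi, r in zip(x, y, ranges):
        if r is None:
            diff = 0.0 if xi == yi else 1.0
        else:
            diff = normalize(xi, *r) - normalize(yi, *r)
        total += diff ** 2
    return math.sqrt(total)


def nearest_class(test, training, ranges):
    """Predict the class of the closest training instance.

    training is a list of (attribute_values, class) pairs.
    """
    _, cls = min(training, key=lambda pair: distance(test, pair[0], ranges))
    return cls


if __name__ == "__main__":
    ranges = [(0.0, 10.0), None]                 # hypothetical attributes
    training = [((1.0, "a"), "yes"), ((9.0, "b"), "no")]
    print(nearest_class((2.0, "a"), training, ranges))
```

This brute-force search compares the test instance with every training instance, which is the linear-time approach that tree structures aim to improve on.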

Finding nearest neighbors efficiently
  Goal: find which member of the training set is closest to an unknown test instance.
  Simple approach: calculate the distance from every member of the training set and select the smallest.
    The time taken is linear in the number of training instances.
  Alternative: represent the training set as a tree.
    kd-tree: stores a set of points in k-dimensional space
      k: the number of attributes

[Figure: kd-tree example, with splits labeled root (horizontally) and (vertically)]

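Speeding up nearest-neighbor calculations
  [Figure: kd-tree search for the point closest to a target; splits labeled h and v]
  The leaf region containing the target gives a good first approximation.
  Reaching that leaf takes about $\log_2 n$ steps, the depth of a well-balanced tree with n nodes.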
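A compact kd-tree illustrates both ideas: descend to the region containing the target for a first approximation, then back up and check the other side of a split only when it could hold a closer point. This is a minimal sketch, not the data structure used by any particular library; it splits at the median and cycles through the axes, which is one common choice and an assumption here. It uses `math.dist`, available in Python 3.8 and later.

```python
import math


def build(points, depth=0):
    """Build a kd-tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build(ordered[:mid], depth + 1),
        "right": build(ordered[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    # Descend first into the side that contains the target.
    best = nearest(near, target, best)
    # The far side can hold a closer point only if the splitting plane
    # is nearer to the target than the best point found so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


if __name__ == "__main__":
    pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]  # hypothetical data
    print(nearest(build(pts), (9, 2)))
```

For well-spread data most far-side subtrees are skipped, so a query touches far fewer points than the brute-force scan.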

Using hyperspheres, not hyperrectangles
  Squares are not the best shape to use, because of their corners.
  Ball tree (Fig. 4.14)
    Each node of an actual ball tree stores the center and radius of its ball.
    Leaf nodes also record the points they contain.

Splitting method
  Choose the point in the ball that is farthest from its center.
  Then choose a 2nd point that is farthest from the 1st one.
  Assign all data points in the ball to the closest one of these two cluster centers.
  Then compute the centroid of each cluster and the minimum radius required for it to enclose all the data points.
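The four steps of the splitting method translate almost line by line into code. In this sketch the ball's center is taken to be the centroid of its points, which is an assumption, and the input must contain at least two distinct points. The function names and the dictionary layout are hypothetical.

```python
import math


def centroid(points):
    """Mean of a non-empty list of points, as a tuple."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def split_ball(points):
    """Split one ball's points into two child balls (center, radius, points)."""
    center = centroid(points)                       # assumed ball center
    # Step 1: the point farthest from the center.
    first = max(points, key=lambda p: math.dist(p, center))
    # Step 2: the point farthest from the first one.
    second = max(points, key=lambda p: math.dist(p, first))
    # Step 3: assign every point to the closer of the two.
    groups = ([], [])
    for p in points:
        nearer_first = math.dist(p, first) <= math.dist(p, second)
        groups[0 if nearer_first else 1].append(p)
    # Step 4: centroid and minimum enclosing radius of each cluster.
    balls = []
    for group in groups:
        c = centroid(group)
        radius = max(math.dist(p, c) for p in group)
        balls.append({"center": c, "radius": radius, "points": group})
    return balls


if __name__ == "__main__":
    for ball in split_ball([(0, 0), (1, 0), (10, 0), (11, 0)]):
        print(ball)
```

Applying `split_ball` recursively to each child until the balls hold only a few points would produce the tree itself.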

To use a ball tree to find the nearest neighbor to a given target
  Traverse the tree from the top down to locate the leaf that contains the target, and find the closest point to the target in that ball. Its distance is an upper bound on the distance to the nearest neighbor.
  If the distance from the target to the sibling's center exceeds the sibling's radius plus the current upper bound, the sibling cannot possibly contain a closer point.
  Otherwise, the sibling must be examined by descending the tree further.
  Ruling out (Fig. 4.15)

Clustering techniques
  Clustering
    Applies when there is no class to be predicted, but the instances are to be divided into natural groups (unsupervised learning).
  Iterative distance-based clustering: k-means
    Specify in advance how many clusters are being sought: k.
    k points are chosen at random as cluster centers.
    All instances are assigned to their closest cluster center according to the Euclidean distance metric.
    The mean of the instances in each cluster is calculated.

Clustering
  These means are taken to be the new center values for their respective clusters.
  The process is repeated until the cluster centers have stabilized.
  k-means minimizes V:
    $V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
  where
    k is the number of clusters,
    $S_i$, for i = 1, 2, ..., k, are the clusters,
    $\mu_i$ is the mean of the points $x_j \in S_i$.

Clustering
  The overall effect
    Minimizes the total squared distance from all points to their cluster centers
    The minimum is a local optimum, not necessarily a global optimum.
  The final clusters are sensitive to the initial cluster centers.
    Run the algorithm several times with different initial choices and choose the best final result: the one with the smallest total squared distance.
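The loop of assigning instances and recomputing means is short enough to write out. In this sketch the initial centers are passed in explicitly instead of being drawn at random, so that a run is repeatable, and an empty cluster keeps its old center; both are choices made for illustration, as are the function names.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Put each point in the cluster of its closest center."""
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)),
                key=lambda idx: squared_distance(p, centers[idx]))
        clusters[i].append(p)
    return clusters


def k_means(points, centers, max_iterations=100):
    """Iterate assignment and mean updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        new_centers = [mean_point(c) if c else old
                       for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def total_squared_distance(centers, clusters):
    """The quantity V: squared distance of every point to its cluster center."""
    return sum(squared_distance(p, c)
               for c, cluster in zip(centers, clusters) for p in cluster)


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 10), (10, 11)]   # hypothetical data
    centers, clusters = k_means(points, [(0, 0), (10, 10)])
    print(centers, total_squared_distance(centers, clusters))
```

To follow the advice about sensitivity to the starting point, `k_means` can be called with several different initial center lists and the result with the smallest `total_squared_distance` kept.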
