Topic 1: Classification Alternatives
[Jiawei Han, Micheline Kamber, Jian Pei. 2011. Data Mining: Concepts and Techniques. 3rd Ed. Morgan Kaufmann. ISBN: 9380931913.]
Contents
2. Classification Using Frequent Patterns
3. Support Vector Machines (SVMs)
4. Classification by Backpropagation (ANNs)
5. Bayesian Belief Networks
6. Other Classification Methods
Introduction: Basic Concepts
Eager learning (e.g., a decision tree) spends a lot of time on model building (training/learning).
- Once a model has been built, classifying a test example is extremely fast.
Lazy learning (e.g., the k-nearest-neighbor classifier) does not require model building (no training).
- Classifying a test example is quite expensive, because we must compute the proximity between the test example and every individual training example.
When we want to classify an unknown (unseen) tuple, a k-nearest-neighbor (k-nn) classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple. For k-nn classification, the unknown tuple is assigned the most common class among its k nearest neighbors (i.e., the majority class of its k nearest neighbors).
Figure: The 1-, 2-, and 3-nearest neighbors of an instance x. In (b), we may randomly choose one of the class labels (i.e., + or −) to classify the data point x.
The Euclidean distance between two points or tuples X_1 = (x_11, x_12, ..., x_1n) and X_2 = (x_21, x_22, ..., x_2n) is defined as
d(X_1, X_2) = √( Σ_{i=1}^{n} (x_{1i} − x_{2i})² ).
Other distance metrics (e.g., Manhattan, Minkowski, Cosine, and Mahalanobis distance) can be used.
The importance of choosing the right value for k:
- If k is too small, the k-nn classifier may be susceptible to overfitting because of noise in the training data.
- If k is too large, the k-nn classifier may misclassify the test instance, because its list of nearest neighbors may include data points that are located far away from its neighborhood, as shown in the figure below.
Figure: k-nn classification with a large k (x is classified as − instead of +).
Algorithm v1: Basic k-nn classification algorithm
1. Find the k training instances that are closest to the unseen instance.
2. Take the most commonly occurring class label among these k instances and assign it as the class label of the unseen instance.
Algorithm v2: Basic k-nn classification algorithm
1. Let k be the number of nearest neighbors and D be the set of training examples.
2. for each test example z = (x′, y′) do
3.   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4.   Select D_z ⊆ D, the set of the k training examples closest to z.
5.   y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
6. end for
Once the k-nn list D_z is obtained, the test example is classified based on the majority class of its k nearest neighbors:
Majority voting: y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i),
where v is a class label, y_i is the class label of one of the k nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
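To make the algorithm concrete, here is a minimal Python sketch of the basic k-nn classifier with majority voting. It is an illustration only: the helper names (euclidean, knn_classify) and the toy data are assumptions, not part of the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two numeric tuples of equal length
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x_test, train, k=3):
    # train: list of (attribute_tuple, class_label) pairs
    # 1. compute the distance from x_test to every training example
    # 2. keep the k closest training examples (the neighbor set D_z)
    k_nearest = sorted(train, key=lambda ex: euclidean(x_test, ex[0]))[:k]
    # 3. majority voting: the most common class label among the k neighbors
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# toy data: two numeric attributes, class labels '+' and '-'
train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((3.0, 3.5), '-'), ((3.2, 3.0), '-')]
print(knn_classify((1.1, 1.0), train, k=3))   # -> '+'
```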
In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k.
One way to reduce the impact of k is to weight the influence of each nearest neighbor x_i according to its distance: w_i = 1 / d(x′, x_i)². As a result, training examples that are located far away from z have a weaker impact on the classification than those that are located close to z.
Using the distance-weighted voting scheme, the class label can be determined as follows:
Distance-weighted voting: y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} w_i · I(v = y_i).
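A minimal sketch of the distance-weighted voting variant described above, repeating the hypothetical euclidean helper so it is self-contained; the guard for a zero distance (an exact match between the test and a training example) is an addition the slides do not discuss.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn_classify(x_test, train, k=3):
    # keep the k training examples closest to x_test (the neighbor set D_z)
    k_nearest = sorted(train, key=lambda ex: euclidean(x_test, ex[0]))[:k]
    scores = defaultdict(float)              # accumulated weighted votes per class
    for attrs, label in k_nearest:
        d = euclidean(x_test, attrs)
        # weight w_i = 1 / d(x', x_i)^2; an exact match gets an effectively infinite weight
        w = 1.0 / (d ** 2) if d > 0 else float('inf')
        scores[label] += w
    # the class with the largest total weighted vote wins
    return max(scores, key=scores.get)
```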
k-nn classifiers can produce wrong predictions due to varying scales of the attribute values of tuples. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds).
The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb to 250 lb. If the scales of the attributes are not taken into consideration, the proximity measure may be dominated by differences in people's weights.
Data normalization (aka feature scaling): we normalize the values of each attribute before computing the proximity measure (e.g., Euclidean distance).
- This helps prevent attributes with large ranges (e.g., weight) from outweighing attributes with smaller ranges (e.g., height).
Min-max normalization (aka unity-based normalization) can be used to transform a value v of a numeric attribute A to a value v′ in the range [0, 1] by computing
v′ = (v − min_A) / (max_A − min_A) ∈ [0, 1],
where min_A and max_A are the minimum and maximum values of attribute A.
In general, min-max normalization (aka unity-based normalization) can be used to transform a value v of a numeric attribute A to a value v′ in the range [l, u] by computing
v′ = l + [(v − min_A) / (max_A − min_A)] · (u − l) ∈ [l, u],
where min_A and max_A are the minimum and maximum values of attribute A.
Note that an unseen instance may have a value of A that is less than min_A or greater than max_A. If we want to keep the adjusted numbers in the range from 0 to 1, we can simply convert any value of A that is less than min_A to 0 and any value that is greater than max_A to 1.
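A small sketch of min-max normalization with the clipping rule just described, assuming min_A and max_A are taken from the training data; min_max_scale is a hypothetical helper, not a library function.

```python
def min_max_scale(v, min_a, max_a, low=0.0, high=1.0):
    # map v from [min_a, max_a] (taken from the training data) to [low, high]
    scaled = low + (v - min_a) / (max_a - min_a) * (high - low)
    # clip values from unseen tuples that fall outside the training range
    return max(low, min(high, scaled))

# height in meters and weight in pounds, using the ranges from the example above
height_scaled = min_max_scale(1.70, 1.5, 1.85)   # ~0.571
weight_scaled = min_max_scale(160, 90, 250)      # ~0.438
print(height_scaled, weight_scaled)
```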
Dealing with non-numeric attributes: For non-numeric (e.g., nominal or categorical) attributes, a simple method is to compare the corresponding value of the non-numeric attribute in tuple X_1 with that in tuple X_2.
- If the two are identical (e.g., tuples X_1 and X_2 both have the color blue), then the difference between the two is 0.
- If the two are different (e.g., tuple X_1 is blue but tuple X_2 is red), then the difference is 1.
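A sketch of how numeric and nominal attributes could be combined in a single distance using the simple matching rule above; the attribute layout and the mixed_distance name are assumptions made for illustration.

```python
def mixed_distance(x1, x2, nominal_positions):
    # nominal_positions: indices of non-numeric (nominal/categorical) attributes
    total = 0.0
    for i, (a, b) in enumerate(zip(x1, x2)):
        if i in nominal_positions:
            # nominal attributes: difference is 0 if identical, 1 otherwise
            total += 0.0 if a == b else 1.0
        else:
            # numeric attributes: squared difference (assumed already normalized)
            total += (a - b) ** 2
    return total ** 0.5

# two tuples with normalized height, normalized weight, and a color attribute
print(mixed_distance((0.57, 0.44, 'blue'), (0.60, 0.50, 'red'), nominal_positions={2}))
```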
2. Classification Using Frequent Patterns
3. Support Vector Machines (SVMs)
4. Classification by Backpropagation (ANNs)
5. Bayesian Belief Networks
6. Other Classification Methods
- Genetic Algorithms (GAs)
- Rough Set Approach
- Fuzzy Set Approach
Summary
Exercises
References
1. Jiawei Han, Micheline Kamber, Jian Pei. 2011. Data Mining: Concepts and Techniques. 3rd Ed. Morgan Kaufmann. ISBN: 9380931913.
2. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. 2005. Introduction to Data Mining. 1st Ed. Pearson. ISBN: 0321321367.
3. Charu C. Aggarwal. 2015. Data Mining: The Textbook. Springer. ISBN: 3319141414.
4. Nong Ye. 2013. Data Mining: Theories, Algorithms, and Examples. CRC Press. ISBN: 1439808384.
5. Uday Kamath, Krishna Choppella. 2017. Mastering Java Machine Learning. Packt Publishing. ISBN: 1785880519.
Extra Slides: Distance Metrics
1. The Euclidean distance between two points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is defined as
d(x, y) = √( Σ_{i=1}^{d} (x_i − y_i)² )
(also denoted L_2(x, y) or ‖x − y‖_2).
https://en.wikipedia.org/wiki/Euclidean_distance
2. The Manhattan distance between two points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is defined as
d(x, y) = Σ_{i=1}^{d} |x_i − y_i|
(the sum of the absolute differences of their Cartesian coordinates; also denoted L_1(x, y) or ‖x − y‖_1).
https://en.wikipedia.org/wiki/Taxicab_geometry
3. The Minkowski distance between two points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is defined as
d(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^p )^(1/p)
(a generalization of both the Euclidean distance (p = 2) and the Manhattan distance (p = 1); also denoted L_p(x, y)).
https://en.wikipedia.org/wiki/Minkowski_distance
4. The Cosine distance between two points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is defined as
d(x, y) = 1 − cos(x, y) = 1 − (x · y) / (‖x‖ ‖y‖), where
- the dot (or inner) product is x · y = Σ_{i=1}^{d} x_i y_i
- the length (or magnitude) of a vector x is ‖x‖ = √( Σ_{i=1}^{d} x_i² )
https://en.wikipedia.org/wiki/Cosine_similarity
5. The Mahalanobis distance between two points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is defined as
d(x, y) = √( (x − y)^T S⁻¹ (x − y) ), where
- S is a covariance matrix (also denoted Σ),
- S⁻¹ is the inverse of S,
- (x − y)^T is the transpose of the difference vector (x − y).
https://en.wikipedia.org/wiki/Mahalanobis_distance
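As a hedged illustration, the five metrics above could be implemented with NumPy as follows; the covariance matrix S for the Mahalanobis distance is assumed to be estimated from some training data, which the slides do not specify.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))                 # L2 distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))                         # L1 distance

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)       # Lp distance (p=1: Manhattan, p=2: Euclidean)

def cosine_distance(x, y):
    # 1 - cosine similarity = 1 - (x . y) / (||x|| * ||y||)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def mahalanobis(x, y, S):
    # sqrt((x - y)^T S^-1 (x - y)), with S a covariance matrix
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 1.0])
S = np.cov(np.random.default_rng(0).normal(size=(100, 3)), rowvar=False)  # assumed covariance estimate
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3), cosine_distance(x, y), mahalanobis(x, y, S))
```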