Instance-Based Representations

- exemplars + a distance measure
- algorithm (IB1): classify based on the majority class of the k nearest neighbors
- the learned structure is not explicitly represented

k-Nearest Neighbors

Idea.
- find the k nearest neighbors to the item in the dataset
- choose the majority class of the neighbors as the class for the item

Challenges.
- choosing k
  - too low means that the result can be sensitive to noise
  - too high means that the neighborhood may include too many items from other classes
- choice of distance measure
  - should have the property that a smaller distance means a greater likelihood of belonging to the same class
  - the specific measure may depend on the domain
  - Euclidean distance becomes less discriminating as the number of attributes increases
  - may need to scale attribute values to avoid having some dominate
- combining class labels
  - majority vote can be problematic if the neighbors vary widely in distance, as all are given the same weight
  - a weighted vote weights a neighbor's vote by its distance d, commonly 1/d² (a sketch follows this list)
- classification of an item is relatively expensive
  - must locate the k nearest neighbors
  - there are improvements, e.g. condensing (eliminating stored items) and proximity graphs (to quickly find neighbors)
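A minimal sketch of distance-weighted k-NN in plain Python (the toy dataset, the helper name knn_classify, and k are illustrative, not from the slides):

    from collections import defaultdict
    import math

    def knn_classify(train, query, k=3):
        """Classify `query` by a 1/d^2-weighted vote of its k nearest
        neighbors in `train`, a list of (feature_vector, label) pairs."""
        # Euclidean distance to every stored exemplar (IB1 stores them all).
        dists = sorted((math.dist(x, query), label) for x, label in train)
        votes = defaultdict(float)
        for d, label in dists[:k]:
            if d == 0:
                return label          # exact match wins outright
            votes[label] += 1.0 / (d * d)  # weight each vote by 1/d^2
        return max(votes, key=votes.get)

    # Toy numeric data: two attributes, two classes.
    train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
             ((4.0, 4.2), "no"),  ((3.8, 4.0), "no")]
    print(knn_classify(train, (1.1, 1.0), k=3))   # -> "yes"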
Strengths.
- easy to understand and implement
- fast to build the model
- can perform well in many situations in spite of its simplicity

Support Vector Machines

Idea (binary classification, two classes).
- find the hyperplane that maximizes the margin between the two classes
- margin = the shortest distance between the plane and the item closest to it
- (figure: of three candidate hyperplanes, H1 does not separate the classes, H2 separates them but only with a small margin, H3 separates them with the maximum margin; https://en.wikipedia.org/wiki/support-vector_machine)

Example. A linear-kernel SVM trained on the weather data:

    Kernel used: Linear Kernel: K(x,y) = <x,y>

      0.543  * outlook=sunny
    + 1.0266 * outlook=overcast
    + 0.4837 * outlook=rainy
    + 0.2834 * temperature=hot
    + 0.2614 * temperature=mild
    + 0.0219 * temperature=cool
    + 1.0219 * humidity=normal
    + 0.7872 * windy=false
    + 0.1354

- each term like outlook=sunny is an indicator: 1 if the input instance matches the specified value, 0 if not
- a result < 0 denotes one class, > 0 the other (in the table below, the yes instances receive negative values)

      a b   <-- classified as
      7 2 |  a = yes
      3 2 |  b = no

    Correctly Classified Instances          9               64.2857 %

The decision value for each training instance, sorted:

    outlook   temperature  humidity  windy  play    value
    rainy     mild         normal    FALSE  yes   -2.4169
    overcast  hot          normal    FALSE  yes   -1.935
    rainy     cool         normal    FALSE  yes   -1.4514
    rainy     mild         high      FALSE  yes   -1.395
    overcast  hot          high      FALSE  yes   -1.2119
    overcast  cool         normal    TRUE   yes   -1.1526
    sunny     cool         normal    FALSE  yes   -1.1526
    overcast  mild         high      TRUE   yes   -0.6049
    sunny     mild         normal    TRUE   yes   -0.4295
    rainy     mild         high      TRUE   no    -0.4247
    rainy     cool         normal    TRUE   no    -0.3702
    sunny     mild         high      FALSE  no     0.1746
    sunny     hot          high      TRUE   no     0.3577
    sunny     hot          high      FALSE  no     0.9618
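A comparable setup sketched with scikit-learn (an assumed substitute for the package that produced the output above; its learned coefficients and decision values will differ from that run). The encode helper is hypothetical; its 8-indicator layout mirrors the attribute list above:

    from sklearn.svm import SVC

    # The 14 weather instances (outlook, temperature, humidity, windy, play).
    data = [
        ("sunny",    "hot",  "high",   True,  "no"),
        ("sunny",    "hot",  "high",   False, "no"),
        ("overcast", "hot",  "high",   False, "yes"),
        ("rainy",    "mild", "high",   False, "yes"),
        ("rainy",    "cool", "normal", False, "yes"),
        ("rainy",    "cool", "normal", True,  "no"),
        ("overcast", "cool", "normal", True,  "yes"),
        ("sunny",    "mild", "high",   False, "no"),
        ("sunny",    "cool", "normal", False, "yes"),
        ("rainy",    "mild", "normal", False, "yes"),
        ("sunny",    "mild", "normal", True,  "yes"),
        ("overcast", "mild", "high",   True,  "yes"),
        ("overcast", "hot",  "normal", False, "yes"),
        ("rainy",    "mild", "high",   True,  "no"),
    ]

    def encode(outlook, temp, humidity, windy):
        # One indicator per attribute value: (sunny, overcast, rainy,
        # hot, mild, cool, humidity=normal, windy=false).
        return [outlook == "sunny", outlook == "overcast", outlook == "rainy",
                temp == "hot", temp == "mild", temp == "cool",
                humidity == "normal", not windy]

    X = [encode(*row[:4]) for row in data]
    y = [row[4] for row in data]

    clf = SVC(kernel="linear").fit(X, y)
    # Signed distance from the separating hyperplane; the sign picks the class.
    print(clf.decision_function(X[:3]))
    print(clf.predict([encode("rainy", "mild", "normal", False)]))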
Extensions.
- use a soft margin to handle errors: allow some items to be on the wrong side of the plane
- with different kernel functions, SVMs can be used even when the classes aren't linearly separable
  (https://en.wikipedia.org/wiki/support-vector_machine)

Example. The same weather data with a quadratic (polynomial) kernel:

    Kernel used: Poly Kernel: K(x,y) = <x,y>^2.0

      0.8235 * <0 0 1 0 1 0 0 0 > * X]
    - 0.3287 * <1 0 0 0 1 0 1 0 > * X]
    + 0.5026 * <1 0 0 0 1 0 0 1 > * X]
    - 0.0933 * <0 1 0 0 0 1 1 0 > * X]
    + 0.1628 * <1 0 0 1 0 0 0 0 > * X]
    - 0.661  * <0 0 1 0 1 0 0 1 > * X]
    - 0.1146 * <1 0 0 0 0 1 1 1 > * X]
    - 0.3832 * <0 1 0 0 1 0 0 0 > * X]
    - 0.088  * <0 1 0 1 0 0 0 1 > * X]
    + 0.1799 * <0 0 1 0 0 1 1 0 > * X]
    + 0.3784

- <x,y> denotes the dot product of the vectors x and y (the dot product is the sum of the pairwise products of the components)
- X is the input instance to be classified
- <0 0 1 0 1 0 0 0 > * X refers to K(<0 0 1 0 1 0 0 0>, X); each such vector is a stored support vector

      a b   <-- classified as
      6 3 |  a = yes
      3 2 |  b = no

    Correctly Classified Instances          8               57.1429 %

The decision value for each training instance; values near ±1 indicate items lying on or near the margin:

    outlook   temperature  humidity  windy  play    value
    rainy     mild         normal    FALSE  yes   -1.8843
    overcast  hot          normal    FALSE  yes   -1.7728
    rainy     cool         normal    FALSE  yes   -1.1417
    rainy     mild         high      FALSE  yes   -1.0008
    overcast  hot          high      FALSE  yes   -1.0003
    overcast  cool         normal    TRUE   yes   -1
    sunny     cool         normal    FALSE  yes   -0.9994
    overcast  mild         high      TRUE   yes   -0.9993
    sunny     mild         normal    TRUE   yes   -0.9992
    rainy     mild         high      TRUE   no     0.999
    rainy     cool         normal    TRUE   no     0.9997
    sunny     mild         high      FALSE  no     0.9997
    sunny     hot          high      TRUE   no     1.0009
    sunny     hot          high      FALSE  no     1.2399

Extensions.
- for multiple classes, use pairwise classification (1-vs-1) or the one-against-all method (a sketch of both follows)
  - pairwise: train separate classifiers for each pairing of classes; pick the majority classification
  - one-against-all: train separate classifiers for each class to distinguish that class from everything else; pick the highest-confidence classification
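Both strategies are available as generic wrappers in scikit-learn (assumed here as an illustration; any base binary classifier could be substituted). A sketch on the Iris data, which the next example uses:

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Pairwise (1-vs-1): one classifier per pair of classes, so 3 for Iris;
    # the predicted class is the one that wins the most pairwise votes.
    ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

    # One-against-all: one classifier per class vs. everything else;
    # the predicted class is the one whose classifier is most confident.
    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

    print(ovo.predict(X[:1]), ovr.predict(X[:1]))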
SVM Multiple Classes

Example. A linear-kernel SVM on the Iris data (three classes), using pairwise classification:

    Kernel used: Linear Kernel: K(x,y) = <x,y>

    Classifier for classes: Iris setosa, Iris versicolor
      0.0459 * sepallength + 0.5219 * sepalwidth + 1.0031 * petallength + 0.4641 * petalwidth - 1.4491

    Classifier for classes: Iris setosa, Iris virginica
      0.0095 * sepallength + 0.1796 * sepalwidth + 0.5367 * petallength + 0.2946 * petalwidth - 1.5143

    Classifier for classes: Iris versicolor, Iris virginica
      0.5962 * sepallength + 0.972 * sepalwidth + 2.0313 * petallength + 2.008 * petalwidth - 6.786

       a  b  c   <-- classified as
      50  0  0 |  a = Iris setosa
       0 47  3 |  b = Iris versicolor
       0  2 48 |  c = Iris virginica

    Correctly Classified Instances        145               96.6667 %

Strengths.
- has proven to be robust and accurate in many cases
- does not require large training sets
- not sensitive to the number of dimensions
- efficient training methods
- solid theoretical foundation

Naive Bayes

Idea (binary classification, two classes).
- based on the posterior probability: the probability of an outcome given the evidence
- assume the attributes are independent
- for example, for the yes outcomes, consider separately the probability of a rainy outlook, a mild temperature, a normal humidity, and not windy
  - for independent attributes, the probability of all of these things happening at once is the product of the individual probabilities
  - also factor in the likelihood of a yes outcome
  - compare to the no outcomes
- the instance to classify: outlook=rainy, temperature=mild, humidity=normal, windy=FALSE

Compute

    ln( P(1|x) / P(0|x) ) = ln( P(x|1) P(1) / (P(x|0) P(0)) )

- P(i|x) = the probability of x belonging to class i
- P(i) = the probability of an object belonging to class i
- P(x|i) = the probability of x within class i
  - if the components of x are independent, P(x|i) can be estimated as the product of P(x_j|i) over each component x_j of x
- the sign of the log indicates whether the probability of x belonging to class 1 is larger or smaller than the probability of x belonging to class 0, so the sign of the result indicates the class
- challenge: if the probabilities are estimated from the training set, it could be the case that some P(x_j|i) = 0, which zeroes out the whole product
- solution: use Laplace smoothing: use count+1 and total+number of possible values instead (a worked sketch follows)
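A small sketch of this computation for the instance above (outlook=rainy, temperature=mild, humidity=normal, windy=FALSE), using Laplace-smoothed estimates from the 14-instance weather data; it reproduces (to four decimals) the 1.6945 value shown for this instance in the example that follows:

    import math

    # Laplace-smoothed estimate: (count + 1) / (total + number of possible values).
    def p(count, total, nvalues):
        return (count + 1) / (total + nvalues)

    # Raw counts from the weather data (9 yes, 5 no instances).
    # P(x|yes) for x = (rainy, mild, normal, FALSE), attributes assumed independent:
    p_x_yes = (p(3, 9, 3)    # outlook=rainy     | yes
             * p(4, 9, 3)    # temperature=mild  | yes
             * p(6, 9, 2)    # humidity=normal   | yes
             * p(6, 9, 2))   # windy=FALSE       | yes
    p_x_no  = (p(2, 5, 3)    # outlook=rainy     | no
             * p(2, 5, 3)    # temperature=mild  | no
             * p(1, 5, 2)    # humidity=normal   | no
             * p(2, 5, 2))   # windy=FALSE       | no

    p_yes, p_no = p(9, 14, 2), p(5, 14, 2)   # smoothed class priors (0.63, 0.38)

    # ln( P(x|1)P(1) / (P(x|0)P(0)) ): positive means class 1 (yes) is more likely.
    log_odds = math.log((p_x_yes * p_yes) / (p_x_no * p_no))
    print(round(log_odds, 4))   # ~1.6945 -> classify as yes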
Example. A naive Bayes model learned from the weather data:

    Class
    Attribute      yes     no
                 (0.63)  (0.38)
    =============================
    outlook
      sunny        3.0    4.0
      overcast     5.0    1.0
      rainy        4.0    3.0
      [total]     12.0    8.0
    temperature
      hot          3.0    3.0
      mild         5.0    3.0
      cool         4.0    2.0
      [total]     12.0    8.0
    humidity
      high         4.0    5.0
      normal       7.0    2.0
      [total]     11.0    7.0
    windy
      TRUE         4.0    4.0
      FALSE        7.0    3.0
      [total]     11.0    7.0

- the model uses Laplace smoothing, so the counts are increased by 1 and the totals are increased by the number of possible values for the attribute (this avoids 0s if there are no training instances with a given value)

      a b   <-- classified as
      7 2 |  a = yes
      4 1 |  b = no

    Correctly Classified Instances          8               57.1429 %

Example. The log-odds value computed for each training instance, sorted:

    outlook   temperature  humidity  windy  play    value
    sunny     hot          high      TRUE   no    -1.7149004637
    sunny     hot          high      FALSE  no    -0.8676026033
    rainy     mild         high      TRUE   no    -0.628710695
    sunny     mild         high      FALSE  no    -0.3567769795
    rainy     mild         high      FALSE  yes    0.2185871654
    sunny     mild         normal    TRUE   yes    0.2718316799
    overcast  mild         high      TRUE   yes    0.6930451449
    overcast  hot          high      FALSE  yes    1.0295173816
    rainy     cool         normal    TRUE   no     1.0295173816
    sunny     cool         normal    FALSE  yes    1.3014510971
    rainy     mild         normal    FALSE  yes    1.6944936852
    rainy     cool         normal    FALSE  yes    1.876815242
    overcast  cool         normal    TRUE   yes    2.3512732216
    overcast  hot          normal    FALSE  yes    2.5054239014

For more than two classes:
- compute P(x|i) P(i) for each class i
- choose the class i that maximizes P(x|i) P(i) (a worked sketch follows the next example)

Example. A naive Bayes model learned from the contact lenses data:

    Class
    Attribute            soft    hard    none
                        (0.22)  (0.19)  (0.59)
    ==========================================
    age
      young               3.0     3.0     5.0
      pre-presbyopic      3.0     2.0     6.0
      presbyopic          2.0     2.0     7.0
      [total]             8.0     7.0    18.0
    spectacle-prescrip
      myope               3.0     4.0     8.0
      hypermetrope        4.0     2.0     9.0
      [total]             7.0     6.0    17.0
    astigmatism
      no                  6.0     1.0     8.0
      yes                 1.0     5.0     9.0
      [total]             7.0     6.0    17.0
    tear-prod-rate
      reduced             1.0     1.0    13.0
      normal              6.0     5.0     4.0
      [total]             7.0     6.0    17.0

       a  b  c   <-- classified as
       4  0  1 |  a = soft
       0  1  3 |  b = hard
       1  2 12 |  c = none

    Correctly Classified Instances         17               70.8333 %
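To see where the probabilities in the next example come from, here is the multi-class rule applied by hand for its first instance (age=young, spectacle-prescrip=hypermetrope, astigmatism=yes, tear-prod-rate=normal), a small Python sketch; the fractions are read directly off the smoothed counts above:

    # P(x|i) P(i) for each class i: the product of the per-attribute
    # conditionals (e.g. P(age=young|soft) = 3.0/8.0) times the smoothed prior.
    scores = {
        "soft": (3/8)  * (4/7)  * (1/7)  * (6/7)  * (6/27),
        "hard": (3/7)  * (2/6)  * (5/6)  * (5/6)  * (5/27),
        "none": (5/18) * (9/17) * (9/17) * (4/17) * (16/27),
    }
    for cls, s in scores.items():
        print(cls, round(s, 4))            # soft 0.0058, hard 0.0184, none 0.0109
    print(max(scores, key=scores.get))     # -> hard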
Example. The values of P(x|i) P(i) for each instance in the contact lenses data; the contact-lenses column is the class label from the dataset, and the predicted class is the one with the largest value:

    age             spectacle-prescrip  astigmatism  tear-prod-rate  contact-lenses    soft    hard    none
    young           hypermetrope        yes          normal          hard            0.0058  0.0184  0.0109
    pre-presbyopic  myope               yes          normal          hard            0.0044  0.0245  0.0116
    presbyopic      myope               yes          normal          hard            0.0029  0.0245  0.0135
    young           myope               yes          normal          hard            0.0044  0.0367  0.0096
    young           myope               no           reduced         none            0.0044  0.0015  0.0279
    young           myope               yes          reduced         none            0.0007  0.0073  0.0314
    young           hypermetrope        no           reduced         none            0.0058  0.0007  0.0314
    pre-presbyopic  myope               no           reduced         none            0.0044  0.0010  0.0335
    young           hypermetrope        yes          reduced         none            0.0010  0.0037  0.0353
    pre-presbyopic  myope               yes          reduced         none            0.0007  0.0049  0.0376
    pre-presbyopic  hypermetrope        no           reduced         none            0.0058  0.0005  0.0376
    presbyopic      myope               no           reduced         none            0.0029  0.0010  0.0390
    pre-presbyopic  hypermetrope        yes          normal          none            0.0058  0.0122  0.0130
    presbyopic      hypermetrope        yes          normal          none            0.0039  0.0122  0.0152
    pre-presbyopic  hypermetrope        yes          reduced         none            0.0010  0.0024  0.0423
    presbyopic      myope               yes          reduced         none            0.0005  0.0049  0.0439
    presbyopic      hypermetrope        no           reduced         none            0.0039  0.0005  0.0439
    presbyopic      hypermetrope        yes          reduced         none            0.0006  0.0024  0.0494
    presbyopic      myope               no           normal          none            0.0175  0.0049  0.0120
    presbyopic      hypermetrope        no           normal          soft            0.0233  0.0024  0.0135
    young           myope               no           normal          soft            0.0262  0.0073  0.0086
    pre-presbyopic  myope               no           normal          soft            0.0262  0.0049  0.0103
    young           hypermetrope        no           normal          soft            0.0350  0.0037  0.0096
    pre-presbyopic  hypermetrope        no           normal          soft            0.0350  0.0024  0.0116

For numeric values:
- discretize, or
- assume a normal distribution and compute the probabilities based on it: estimate the mean μ and standard deviation σ of the attribute within each class, and take P(x_j|i) from the normal density 1/(σ√(2π)) · e^(−(x_j−μ)²/(2σ²))

Example. A naive Bayes model learned from the (numeric) Iris data, using per-class normal distributions:

    Class
    Attribute       Iris setosa  Iris versicolor  Iris virginica
                      (0.33)         (0.33)           (0.33)
    ===============================================================
    sepallength
      mean             4.9913         5.9379           6.5795
      std. dev.        0.355          0.5042           0.6353
      weight sum      50             50               50
      precision        0.1059         0.1059           0.1059
    sepalwidth
      mean             3.4015         2.7687           2.9629
      std. dev.        0.3925         0.3038           0.3088
      weight sum      50             50               50
      precision        0.1091         0.1091           0.1091
    petallength
      mean             1.4694         4.2452           5.5516
      std. dev.        0.1782         0.4712           0.5529
      weight sum      50             50               50
      precision        0.1405         0.1405           0.1405

Strengths.
- easy to implement
- easy to interpret / understand the resulting classification
- can be applied to large datasets
- tends to perform well
- frequently used in text classification and spam filtering
- many extensions / modifications

Observations.
- the assumption of independence of attributes is not necessarily a problem
  - can start with attribute selection to eliminate highly correlated attributes
  - even with correlated attributes, results based on the independence assumption aren't necessarily wrong

Ensemble Learning

Idea.
- use multiple classifiers to improve on the performance of any one (a sketch follows)
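One simple combination rule is a majority vote over heterogeneous classifiers. A hedged sketch with scikit-learn (assumed; the choice of the three base classifiers is illustrative, drawn from the methods above):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Hard voting: each base classifier casts one vote for a class.
    ensemble = VotingClassifier(estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),  # instance-based
        ("svm", SVC(kernel="linear")),                 # max-margin
        ("nb",  GaussianNB()),                         # naive Bayes
    ], voting="hard").fit(X, y)

    print(ensemble.predict(X[:1]))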
AdaBoost

- works with weak classifiers (classifiers with accuracy just above random chance)
- often uses decision stumps (single-level decision trees)

Algorithm.
- assign equal weights (1/n) to each training instance, where n = the size of the training set
- repeat for T rounds or until there is no further improvement:
  - train a classifier using the current training-set weights (if the classifier algorithm can't deal with weights directly, choose training elements in accordance with their weights)
  - test the classifier on the training examples and determine the error
  - adjust the weights based on the error, increasing the weight of incorrectly classified examples
- to classify, use weighted majority voting amongst the classifiers from each round; each classifier's weight is based on its error, so more accurate models are given higher weights (a sketch follows)

Properties.
- simple algorithm
- accurate
- often does not overfit (but it can)
- solid theoretical foundation
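A from-scratch sketch of this loop with decision stumps (numeric attributes and labels in {-1, +1} assumed; the exhaustive stump search, the weight-update rule, and the toy data are illustrative simplifications of what real implementations do):

    import numpy as np

    def adaboost(X, y, rounds=10):
        """AdaBoost with decision stumps. X: (n, d) array; y: labels in {-1, +1}.
        Returns a list of (alpha, feature, threshold, polarity) weak classifiers."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)              # equal initial weights
        model = []
        for _ in range(rounds):
            # Exhaustively pick the stump with the least weighted error.
            best = None
            for j in range(d):
                for thresh in np.unique(X[:, j]):
                    for polarity in (1, -1):
                        pred = np.where(X[:, j] <= thresh, polarity, -polarity)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thresh, polarity)
            err, j, thresh, polarity = best
            if err >= 0.5:                   # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            pred = np.where(X[:, j] <= thresh, polarity, -polarity)
            w *= np.exp(-alpha * y * pred)   # increase the weight of mistakes
            w /= w.sum()
            model.append((alpha, j, thresh, polarity))
        return model

    def predict(model, X):
        # Weighted majority vote of the stumps from each round.
        votes = sum(a * np.where(X[:, j] <= t, p, -p) for a, j, t, p in model)
        return np.sign(votes)

    # Toy data: class +1 iff both coordinates are large; no single stump separates it.
    X = np.array([[1, 1], [1, 3], [3, 1], [3, 3], [4, 4], [4, 3]])
    y = np.array([-1, -1, -1, 1, 1, 1])
    print(predict(adaboost(X, y), X))        # should reproduce y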