Machine Learning Models for Pattern Classification. Comp 473/6731

1 Machine Learning Models for Pattern Classification Comp 473/6731 November 24th 2016 Prof. Neamat El Gayar

2 Neural Networks Low-level computational algorithms Learn by example (no explicit algorithm required) Good performance for real-time applications Adaptive, generalize, robust, fault tolerant, etc. Drawbacks: Convergence to the optimal solution not guaranteed Some implementation heuristics (learning rate, stopping criteria, etc.) Learning sometimes tedious and slow Black-box architecture Interpretable model? Not easy to describe human knowledge 2

3 Topics Popular Classifiers: KNN Support Vector Machines Decision Trees Classifier testing and Evaluation 3

4 More Classifiers

5 Outline Lecture Some popular ML models for classification K-nearest neighbor (KNN) Support vector machines (SVM) Decision Trees 5

6 K-Nearest Neighbor Classifiers Learning by analogy: Tell me who your friends are and I'll tell you who you are A new example is assigned to the most common class among the (K) examples that are most similar to it. To determine the class of a new example E: Calculate the distance between E and all examples in the training set Select the K nearest examples to E in the training set Assign E to the most common class among its K nearest neighbors 6
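The three steps above translate almost directly into code. A minimal sketch in Python/NumPy (the function and variable names such as knn_predict are illustrative, not from the lecture):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Calculate the distance between the new example and every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Select the indices of the K nearest training examples
    nearest = np.argsort(dists)[:k]
    # 3. Assign the most common class among the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.1, 3.9]), k=3))  # -> 1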

7 K-Nearest Neighbor Classification (kNN) Unlike the previous learning methods, kNN does not build a model from the training data. To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d Count the number n of training instances in P that belong to class c_j Estimate Pr(c_j | d) as n/k No training is needed. Classification time is linear in the training set size for each test case. 7

8 kNN algorithm k is usually chosen empirically via a validation set or cross-validation by trying a range of k values. The distance function is crucial, but depends on the application. 8
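One hedged way to pick k in practice, assuming scikit-learn is available (the dataset and the candidate k values below are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))   # keep the k with the best mean accuracy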

9 Example: k = 6 (6NN) Classes: Government, Science, Arts. For a new point, what is Pr(science | new point)? 9

10 Discussions kNN can deal with complex and arbitrary decision boundaries. Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and in many cases as accurate as more elaborate methods. kNN is slow at classification time. kNN does not produce an understandable model. 10

11 K-Nearest Neighbor Classifier Strengths: Simple to implement and use Comprehensible: easy to explain the prediction Robust to noisy data by averaging over the k nearest neighbors Some appealing applications 11

12 Outline Lecture Some popular ML models K-nearest neighbor (KNN) Support vector machines (SVM) Decision Trees Summary & some ML Advice 12

13 Why are SVMs interesting? Discriminant-based classifiers Linear classifiers that use the kernel trick for nonlinearity Maximize generalization Support vectors represent knowledge Learning is based on a complex optimization problem 13

14 Introduction Support vector machines were invented by V. Vapnik and his co-workers in the 1970s in Russia and became known to the West in the early 1990s. SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative. Kernel functions are used for nonlinear separation. SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data. It is perhaps the best classifier for text classification. 14

15 Review of Linear Classifiers Linear classifiers One of the simplest classifiers Linear decision boundary Applicable to linearly separable tasks Perceptron One of the most popular linear classifiers Perceptron learning Given a training set containing a number of examples with labels Iteratively update the weights of a linear equation until convergence Sensitive to initialisation and example input orders during learning Could generate decision boundaries of different generalization capabilities Figure: inputs x1, x2; output y = +1 if w^T x + b >= 0 and y = -1 otherwise; decision boundary w^T x + b = 0. 15
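As a reference point for the discussion that follows, here is a minimal sketch of perceptron learning (illustrative code, not the lecture's): the weights w and bias b are updated on every misclassified example until the data is separated.

import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    # y must be in {+1, -1}; returns weights w and bias b
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the boundary)
                w += lr * y_i * x_i               # perceptron update rule
                b += lr * y_i
                errors += 1
        if errors == 0:                           # converged on separable data
            break
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # reproduces y on this separable toy set

The final w and b depend on the initialisation and the order of the examples, which is exactly the motivation for SVMs in the next slides.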

16 Motivation and Concept Perceptron learning (markers denote +1 and -1) Q: How would you classify this data set with the perceptron? 16

17 Motivation and Concept Perceptron learning (markers denote +1 and -1; decision boundary w^T x + b = 0) Q: How would you classify this data set? A: Using the perceptron learning rule to learn weights w, b 17

18 Motivation and Concept Perceptron learning (markers denote +1 and -1; decision boundary w^T x + b = 0) Q: How would you classify this data set? A: Using the perceptron learning rule to learn weights w, b Q: With different initial values of w, b and example orders, what happens? 18

19 Motivation and Concept Perceptron learning (markers denote +1 and -1; decision boundary w^T x + b = 0) Q: How would you classify this data set? A: Using the perceptron learning rule to learn weights w, b Q: With different initial values of w, b and example orders, what happens? A: Many different decision boundaries 19

20 Motivation and Concept Perceptron learning (markers denote +1 and -1; decision boundary w^T x + b = 0) Q: How would you classify this data set? A: Using the perceptron learning rule to learn weights w, b Q: With different initial values of w, b and example orders, what happens? A: Many different decision boundaries Q: Which decision boundary is the best for generalisation? 20

21 Motivation and Concept Margin of a linear classifier (markers denote +1 and -1; decision boundary w^T x + b = 0) Definition: The margin of a linear classifier is the width that the boundary could be increased by before hitting a data point. 21

22 Motivation and Concept Maximum margin (markers denote +1 and -1; decision boundary w^T x + b = 0) The maximum margin is the widest such width before hitting a data point. The maximum-margin linear classifier is the linear Support Vector Machine. 22

23 Motivation and Concept Support Vectors (markers denote +1 and -1; decision boundary w^T x + b = 0) Support Vectors are those data points that the margin pushes up against 23

24 Motivation and Concept SVM: the best solution in terms of generalisation (markers denote +1 and -1; decision boundary w^T x + b = 0; support vectors highlighted) Intuitively this feels safest. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification. The model is immune to removal of any non-support-vector data points. There's some theory (using VC dimension) related to the proposition that this is a good thing. Empirically it works very well. 24

25 SVM Learning Objectives: find appropriate weights w and bias b to minimize training errors (the same as for the perceptron) maximize the margin What is the relationship between the weights and the margin? With some analytic geometry we obtain: the support vectors satisfy y_sv (w^T x_sv + b) = 1, and the margin is M = 2 / ||w||, so maximizing the margin amounts to minimizing (1/2) w^T w. The learning rule is no longer as simple as the perceptron's: we need to search the space of w's and b's to find the widest margin that matches all the data points (support vectors). How? Using a Quadratic Programming (QP) algorithm! 25

26 Basic concepts Let the set of training examples D be {(x_1, y_1), (x_2, y_2), ..., (x_r, y_r)}, where x_i = (x_1, x_2, ..., x_n) is an input vector in a real-valued space X ⊆ R^n and y_i is its class label (output value), y_i ∈ {1, -1}. 1: positive class and -1: negative class. SVM finds a linear function of the form (w: weight vector) f(x) = ⟨w · x⟩ + b, with y_i = 1 if ⟨w · x_i⟩ + b ≥ 0 and y_i = -1 if ⟨w · x_i⟩ + b < 0.

27 The hyperplane The hyperplane that separates positive and negative training data is ⟨w · x⟩ + b = 0. It is also called the decision boundary (surface). So many possible hyperplanes, which one to choose? 27

28 Maximal margin hyperplane SVM looks for the separating hyperplane with the largest margin. Machine learning theory says this hyperplane minimizes the error bound 28

29 Linear SVM: separable case Assume the data are linearly separable. Consider a positive data point (x+, 1) and a negative one (x-, -1) that are closest to the hyperplane ⟨w · x⟩ + b = 0. We define two parallel hyperplanes, H+ and H-, that pass through x+ and x- respectively. H+ and H- are also parallel to ⟨w · x⟩ + b = 0. 29

30 Compute the margin Now let us compute the distance between the two margin hyperplanes H+ and H-. Their distance is the margin (d+ + d- in the figure). Recall from vector algebra that the (perpendicular) distance from a point x_i to the hyperplane ⟨w · x⟩ + b = 0 is |⟨w · x_i⟩ + b| / ||w|| (36), where ||w|| is the norm of w: ||w|| = sqrt(⟨w · w⟩) = sqrt(w_1^2 + w_2^2 + ... + w_n^2) (37). 30

31 Compute the margin (cont...) Let us compute d+. Instead of computing the distance from x+ to the separating hyperplane ⟨w · x⟩ + b = 0, we pick any point x_s on ⟨w · x⟩ + b = 0 and compute the distance from x_s to the hyperplane ⟨w · x⟩ + b = 1 by applying the distance Eq. (36) and noticing that ⟨w · x_s⟩ + b = 0: d+ = |⟨w · x_s⟩ + b - 1| / ||w|| = 1 / ||w|| (38), and hence margin = d+ + d- = 2 / ||w|| (39). 31

32 An optimization problem! Definition (Linear SVM: separable case): Given a set of linearly separable training examples D = {(x_1, y_1), (x_2, y_2), ..., (x_r, y_r)}, learning is to solve the following constrained minimization problem: Minimize ⟨w · w⟩ / 2 Subject to y_i(⟨w · x_i⟩ + b) ≥ 1, i = 1, 2, ..., r (40) The constraint summarizes ⟨w · x_i⟩ + b ≥ 1 for y_i = 1 and ⟨w · x_i⟩ + b ≤ -1 for y_i = -1. 32

33 The final decision boundary Finding the support vectors is equivalent to training the SVM. From the optimization we obtain the values of the α_i, which are used to compute the weight vector w and the bias b. The decision boundary is ⟨w · x⟩ + b = Σ_{i ∈ SV} α_i y_i ⟨x_i · x⟩ + b = 0 (57), where the x_i are the support vectors. Testing: use (57). Given a test instance z, compute sign(⟨w · z⟩ + b) = sign(Σ_{i ∈ SV} α_i y_i ⟨x_i · z⟩ + b) (58) If (58) returns 1, the test instance z is classified as positive; otherwise, it is classified as negative. 33
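A hedged sketch of Eq. (58) using scikit-learn (assumed available, not part of the lecture): after fitting a linear SVC, dual_coef_ holds the products α_i y_i for the support vectors, so the decision value for a test point z can be reproduced by hand and compared with decision_function.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e3).fit(X, y)

z = np.array([1.5, 1.0])
# sum over support vectors of (alpha_i * y_i) <x_i . z>, plus the bias b
manual = clf.dual_coef_ @ (clf.support_vectors_ @ z) + clf.intercept_
print(np.sign(manual), np.sign(clf.decision_function([z])))   # the two signs agree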

34 How to deal with nonlinear separation? The SVM formulations require linear separation. Real-life data sets may need nonlinear separation. To deal with nonlinear separation, the same formulation and techniques as for the linear case are still used. We only transform the input data into another space (usually of a much higher dimension) so that a linear decision boundary can separate positive and negative examples in the transformed space. The transformed space is called the feature space. The original data space is called the input space. 34

35 Space transformation The basic idea is to map the data in the input space X to a feature space F via a nonlinear mapping φ: X → F, x ↦ φ(x) (76) After the mapping, the original training data set {(x_1, y_1), (x_2, y_2), ..., (x_r, y_r)} becomes: {(φ(x_1), y_1), (φ(x_2), y_2), ..., (φ(x_r), y_r)} (77) 35

36 Geometric interpretation In this example, the transformed space is also 2-D. But usually, the number of dimensions in the feature space is much higher than that in the input space 36

37 An example space transformation Suppose our input space is 2-dimensional, and we choose the following transformation (mapping) from 2-D to 3-D: φ(x_1, x_2) = (x_1^2, x_2^2, √2 x_1 x_2) The training example ((2, 3), -1) in the input space is transformed to the following in the feature space: ((4, 9, 8.5), -1) 37

38 Kernel functions In solving the quadratic optimization problem of SVM we only require dot products ⟨φ(x) · φ(z)⟩ and never the mapped vector φ(x) in its explicit form. This is a crucial point. Good news: the explicit transformation is not needed. Thus, if we have a way to compute the dot product ⟨φ(x) · φ(z)⟩ using the input vectors x and z directly, there is no need to know the feature vector φ(x) or even φ itself. In SVM, this is done through the use of kernel functions, denoted by K: K(x, z) = ⟨φ(x) · φ(z)⟩ (82) 38

39 An example kernel function Polynomial kernel: K(x, z) = ⟨x · z⟩^d (83) Let us compute the kernel with degree d = 2 in a 2-dimensional space: x = (x_1, x_2) and z = (z_1, z_2). ⟨x · z⟩^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = ⟨(x_1^2, x_2^2, √2 x_1 x_2) · (z_1^2, z_2^2, √2 z_1 z_2)⟩ = ⟨φ(x) · φ(z)⟩ (84) This shows that the kernel ⟨x · z⟩^2 is a dot product in a transformed feature space. 39
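A quick numeric check of (84) (illustrative code, not from the slides): the degree-2 polynomial kernel ⟨x · z⟩^2 gives the same value as the explicit mapping φ(x) = (x_1^2, x_2^2, √2 x_1 x_2).

import numpy as np

def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x, z = np.array([2.0, 3.0]), np.array([1.0, -1.5])
print(np.dot(x, z) ** 2)        # kernel value K(x, z) computed in the input space
print(np.dot(phi(x), phi(z)))   # the same value via the explicit feature mapping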

40 Kernel trick The derivation in (84) is only for illustration purposes. We do not need to find the mapping function. We can simply apply the kernel function directly, by replacing all the dot products ⟨φ(x) · φ(z)⟩ in the optimization problem with the kernel function K(x, z) (e.g., the polynomial kernel ⟨x · z⟩^d in (83)). This strategy is called the kernel trick. 40

41 Commonly used kernels It is clear that the idea of kernel generalizes the dot product in the input space. This dot product is also a kernel with the feature map being the identity 41

42 Some other issues in SVM SVM works only in a real-valued space. For a categorical attribute, we need to convert its categorical values to numeric values. SVM does only two-class classification. For multi-class problems, some strategies can be applied, e.g., one-against-rest and error-correcting output coding. The hyperplane produced by SVM is hard for human users to understand. The matter is made worse by kernels. Thus, SVM is commonly used in applications that do not require human understanding. 42
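A hedged sketch of the one-against-rest strategy, assuming scikit-learn (the three-class dataset is illustrative): one binary SVM is trained per class, and the class with the largest decision value wins.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(clf.predict(X[:5]))   # multi-class predictions built from binary SVMs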

43 Kernel SVM Weight/bias solution and decision rule: the weight vector is w = Σ_{i ∈ SV} α_i y_i φ(x_i), and the bias b is obtained by averaging over the support vectors. The decision rule becomes sign(⟨w · φ(x)⟩ + b) = sign(Σ_{i ∈ SV} α_i y_i ⟨φ(x_i) · φ(x)⟩ + b) = sign(Σ_{i ∈ SV} α_i y_i K(x_i, x) + b). In practice, we don't use this transformation explicitly but a kernel on the inner product. 43

44 Conclusions (Linear) SVM is a state-of-the-art linear classifier Developed based on statistical learning theory Learning process: seeking the support vectors, i.e. the maximum margin Best generalisation performance guaranteed for linearly separable data sets Kernel trick: extending linear SVM to non-linear SVM In principle, data points are mapped onto a higher-dimensional feature space so that they are linearly separable in that space In reality, a kernel function works directly on the inner product of data points With the kernel transformation, linear SVM is extended to non-linear SVM Can be extended to multi-category classification: decompose the problem into multiple binary classification tasks, or use variants of SVM that tackle multi-category classification in a straightforward way 44

45 Outline Lecture Some popular ML models K-nearest neighbor (KNN) Support vector machines (SVM) Decision Trees Summary & some ML Advice 45

46 Example Pattern is described by a list of attributes Fruit: Colour {red, green, yellow} Texture {shiny, rough, smooth} Shape { round, thin} Taste {sweet, sour} Size {big, small} 46

47 47

48 What is a decision Tree? It is a hierarchical data structure implementing a divide-and-conquer strategy It is composed of internal decision nodes and terminal leaves Decision nodes: implement a function (test) Leaf nodes: produce an output (decision) 48

49 What makes DT interesting? Nonlinear decision boundaries Categorical data Interpretable models Extract classification Rules from model 49

50 Decision Tree for Classification A pattern is classified by a sequence of questions What is special about that? You can handle nominal data Interpretability: 1. Get insight about why a test pattern was classified as belonging to a certain class X = {sweet, yellow, thin, medium}: this is a banana because it is yellow and thin 2. Derive classification rules (logical descriptions for categories) Apple = (green AND medium) OR (red AND medium) Integrate expert knowledge Rapid classification by simple queries (using only the necessary tests) Decision trees are sometimes preferred over more accurate (NN?) but less interpretable models 50

51 51

52 Decision Trees Univariate Trees: Classification Trees Discrete Variables Continuous Variables Pruning Rule Extraction from Trees 52

53 Tree Uses Nodes, and Leaves 53

54 The figures below are such examples. This type of tree is known as an Ordinary Binary Classification Tree (OBCT). The decision hyperplanes, splitting the space into regions, are parallel to the axes of the space. Other types of partition are also possible, yet less popular. 54

55 Divide and Conquer Internal decision nodes Univariate: Uses a single attribute, x i Numeric x i : Binary split : x i > w m Discrete x i : n-way split for n possible values Multivariate: Uses all attributes, x Leaves Classification: Class labels, or proportions Regression: Numeric; r average, or local fit Learning is greedy; find the best split recursively (Breiman et al, 1984; Quinlan, 1986, 1993) 55

56 Tree Induction Creating a Tree How does it work: Start at the root (complete training data) repeat the following steps recursively { Look for the best split Split the training data into: 2 splits if the attribute is numeric n splits if the attribute is discrete Continue on the resulting splits until no more splitting is needed, then create a leaf node } 56

57 Classification Trees (ID3, CART, C4.5) For node m, N_m instances reach m, and N_m^i of them belong to class C_i: P̂(C_i | x, m) = p_m^i = N_m^i / N_m Node m is pure if p_m^i is 0 or 1. The measure of impurity is the entropy I_m = -Σ_{i=1}^{K} p_m^i log2 p_m^i 57
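A minimal sketch of the entropy impurity I_m computed from the class counts at a node (the names below are illustrative):

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()              # class proportions p_m^i, skipping empty classes
    return -np.sum(p * np.log2(p))

print(entropy([10, 10]))   # 1.0: maximally impure for two classes
print(entropy([20, 0]))    # 0: a pure node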

58 Best Split If node m is pure, generate a leaf and stop; otherwise split and continue recursively. Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to class C_i: P̂(C_i | x, m, j) = p_mj^i = N_mj^i / N_mj Overall impurity: I'_m = -Σ_{j=1}^{n} (N_mj / N_m) Σ_{i=1}^{K} p_mj^i log2 p_mj^i Find the variable and split that minimize the impurity (among all variables, and among all split positions for numeric variables). The overall impurity quantifies the goodness of a split. 58

59 Finding the best split position for numeric variables How can we choose w_m? The test at decision node m is f_m(x): x_i > w_m, which divides the input space into L_m = {x | x_i > w_m} (left branch) and R_m = {x | x_i ≤ w_m} (right branch). No need to try all values: we have at most N_m - 1 possible splits, and it is enough to test splits where adjacent points belong to different classes. For x_1 try points A, B, C, D (5, 6.5, 7.5, 8.5); for x_2 try points A, B, C, D (2.5, 3.5, 4.5, 5.5). 59
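A sketch (illustrative, not the lecture's code) of the search for the best threshold w_m on one numeric attribute: candidate thresholds are the midpoints between adjacent sorted values, and the split with the lowest weighted entropy is kept.

import numpy as np

def entropy_of_labels(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_numeric_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_w, best_imp = None, np.inf
    for j in range(len(x) - 1):
        w = (x[j] + x[j + 1]) / 2.0          # candidate threshold between adjacent points
        left, right = y[x <= w], y[x > w]
        imp = (len(left) * entropy_of_labels(left)
               + len(right) * entropy_of_labels(right)) / len(y)   # weighted impurity
        if imp < best_imp:
            best_w, best_imp = w, imp
    return best_w, best_imp

x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_numeric_split(x, y))   # threshold 5.0 gives zero impurity here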

60 Which is the best tree? For a given training data set, many trees exist that code the data with no error. We need to find the smallest among these trees. Tree size is measured as: the number of nodes in the tree and the complexity of the decision nodes. 60

61 Model Selection in Trees 61

62 Pruning Trees Remove subtrees for better generalization (decreases variance) Prepruning: early stopping (e.g., stop if the instances reaching a node are < 5% of the data) Postpruning: grow the whole tree, then prune subtrees which overfit on the pruning set Prepruning is faster; postpruning is more accurate (requires a separate pruning set) 62

63 Rule Extraction from Trees C4.5Rules (Quinlan, 1993) If-then rules =>Rule Base 63

64 Rule Extraction from Trees (cont.) A decision tree does its own feature extraction: Certain features might not be used (see feature x3) Features close to the root are globally more important Interpretability: - Allows the model to be verified by an expert - Get insight on important variables and how they affect the output decision - Describe features and their relations that characterize a certain class - Rules can also be pruned 64

65 Multivariate Trees 65

66 Remarks: A critical factor in the design is the size of the tree. Usually one grows a tree to a large size and then applies various pruning techniques. Decision trees belong to the class of unstable classifiers. This can be overcome by a number of averaging techniques. Bagging is a popular technique. Using bootstrap techniques in X, various trees are constructed, T_i, i = 1, 2, ..., B. The decision is taken according to a majority voting rule. 66

67 Outline Lecture Some popular ML models K-nearest neighbor (KNN) Support vector machines (SVM) Decision Trees Evaluation and best practices 67

68 Evaluation and best Practices

69 Classifier Performance assessment How good is the classifier that I designed? How well does it compare to competing techniques? Related question: Can an ensemble of classifiers improve performance? 69

70 Topics Accuracy and Error Measures Classifier accuracy measures Predictor error measures Classifier Evaluation: Evaluation Criteria Evaluation Methods: Holdout method and Random Resampling Cross Validation Bootstrapping Comparing classifier performance

71 Learning a Class from Examples Class C of a family car Prediction: Is car x a family car? Knowledge extraction: What do people expect from a family car? Output: Positive (+) and negative ( ) examples Input representation: x 1 : price, x 2 : engine power 71

72 Training set X X = {x^t, r^t}, t = 1, ..., N, where r^t = 1 if x^t is positive and r^t = 0 if x^t is negative, and x = [x_1, x_2]^T 72

73 Class C: (p_1 ≤ price ≤ p_2) AND (e_1 ≤ engine power ≤ e_2)

74 Hypothesis class H h(x) = 1 if h classifies x as positive, 0 if h classifies x as negative Error of h on X: E(h | X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t) 74

75 Noise and Model Complexity Use the simpler one because Simpler to use (lower computational complexity) Easier to train (lower space complexity) Easier to explain (more interpretable) Generalizes better (lower variance - Occam s razor) 75

76 Bias-Variance Trade-off Bias and variance measure the alignment or match of the learning algorithm to the classification problem. Bias measures the quality of the match (keep it low): the model has enough free parameters to match the problem and is not too simple; high bias means the model is too rigid to capture the data characteristics. Variance measures the precision of the match (keep it low): performance does not change with slight changes in the training data; this favours simple models. Procedures with increased flexibility tend to have low bias but high variance, and vice versa: the Bias-Variance Trade-off 76

77 Accuracy and Error Measures

78 Classification measures Accuracy is only one measure (error = 1 - accuracy). Accuracy is not suitable in some applications. In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection. In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class. High accuracy does not mean any intrusion is detected. E.g., with 1% intrusion, we achieve 99% accuracy by doing nothing. The class of interest is commonly called the positive class, and the rest are the negative classes. 78

79 Precision and recall measures Used in information retrieval and text classification. We use a confusion matrix to introduce them. 79

80 Precision and recall measures (cont...) p = TP / (TP + FP), r = TP / (TP + FN) Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive. Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. 80

81 An example This confusion matrix gives precision p = 100% and recall r = 1% because we only classified one positive example correctly and no negative examples wrongly. Note: precision and recall only measure classification on the positive class. 81

82 F1-value (also called F1-score) It is hard to compare two classifiers using two measures. The F1 score combines precision and recall into one measure: F1 = 2pr / (p + r), the harmonic mean of p and r. The harmonic mean of two numbers tends to be closer to the smaller of the two. For the F1-value to be large, both p and r must be large. 82

83 Measuring Error Error rate = # of errors / # of instances = (FN+FP) / N Recall = # of found positives / # of positives = TP / (TP+FN) = sensitivity = hit rate Precision = # of found positives / # of found = TP / (TP+FP) Specificity = TN / (TN+FP) False alarm rate = FP / (FP+TN) = 1 - Specificity 83
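A minimal sketch computing the measures above from raw confusion-matrix counts (the TP/FP/TN/FN numbers are made up for illustration):

TP, FP, TN, FN = 30, 10, 50, 10
N = TP + FP + TN + FN

error_rate  = (FN + FP) / N
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)           # = sensitivity = hit rate
specificity = TN / (TN + FP)
false_alarm = FP / (FP + TN)           # = 1 - specificity
f1          = 2 * precision * recall / (precision + recall)

print(error_rate, precision, recall, specificity, false_alarm, f1)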

84 ROC Curve 84

85 Evaluation of classifiers

86 Evaluating classification methods (Han & Kamber, 2 nd edition, 2006; chapter 6) Predictive accuracy/error Efficiency time to construct the model time to use the model Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability: understandable and insight provided by the model Compactness of the model: size of the tree, or the number of rules. 86

87 Algorithm Preference Criteria (Application-dependent): Misclassification error, or risk (loss functions) Training time/space complexity Testing time/space complexity Interpretability Easy programmability Cost-sensitive learning 87

88 Evaluation methods Holdout set: The available data set D is divided into two disjoint subsets, the training set D_train (for learning a model) and the test set D_test (for testing the model). Important: the training set should not be used in testing and the test set should not be used in learning. An unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.) This method is mainly used when the data set D is large. 88

89 Evaluation methods (cont...) k-fold cross-validation: The available data is partitioned into k equal-size disjoint subsets. Use each subset as the test set and combine the remaining k-1 subsets as the training set to learn a classifier. The procedure is run k times, which gives k accuracies. The final estimated accuracy of learning is the average of the k accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data is not large. 89
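A hedged sketch of k-fold cross-validation written out by hand (the classifier and the dataset plugged in are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def k_fold_accuracy(model, X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                          # k equal-size disjoint subsets
    accs = []
    for i in range(k):
        test = folds[i]                                     # one subset for testing
        train = np.concatenate(folds[:i] + folds[i + 1:])   # the remaining k-1 for training
        model.fit(X[train], y[train])
        accs.append((model.predict(X[test]) == y[test]).mean())
    return np.mean(accs)                                    # average of the k accuracies

X, y = load_iris(return_X_y=True)
print(k_fold_accuracy(KNeighborsClassifier(n_neighbors=5), X, y))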

90 Resampling and K-Fold Cross-Validation The need for multiple training/validation sets {X_i, V_i}_i: training/validation sets of fold i K-fold cross-validation: Divide X into K parts X_i, i = 1, ..., K; then V_1 = X_1, T_1 = X_2 ∪ X_3 ∪ ... ∪ X_K V_2 = X_2, T_2 = X_1 ∪ X_3 ∪ ... ∪ X_K ... V_K = X_K, T_K = X_1 ∪ X_2 ∪ ... ∪ X_{K-1} Any two training sets T_i share K-2 parts 90

91 Cross-Validation To estimate the generalization error, we need data unseen during training. We split the data as Training set (50%) Validation set (25%) Test (publication) set (25%) Resampling is used when there is little data 91

92 5×2 Cross-Validation Five times 2-fold cross-validation (Dietterich, 1998): the data X is split into two halves X_1^(i) and X_2^(i) five times, i = 1, ..., 5; each replication gives the pairs T = X_1^(i), V = X_2^(i) and T = X_2^(i), V = X_1^(i), i.e. ten training/validation pairs in total

93 Bootstrapping Draw N instances from a dataset of size N with replacement. The probability that we do not pick a particular instance after N draws is (1 - 1/N)^N ≈ e^(-1) ≈ 0.368; that is, only 36.8% is new! 93
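A quick simulation of the statement above (illustrative): after N draws with replacement, the fraction of instances never picked is close to (1 - 1/N)^N ≈ e^(-1).

import numpy as np

N = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, N, size=N)            # N draws with replacement from N instances
never_picked = N - len(np.unique(sample))      # instances that were never drawn
print(never_picked / N, (1 - 1 / N) ** N, np.exp(-1))   # all close to 0.368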

94 Evaluation methods (cont...) Leave-one-out cross-validation: This method is used when the data set is very small. It is a special case of cross-validation where each fold has only a single test example and all the rest of the data is used in training. If the original data has m examples, this is m-fold cross-validation. 94

95 Evaluation methods (cont ) Validation set: the available data is divided into three subsets, a training set, a validation set and a test set. A validation set is used frequently for estimating parameters in learning algorithms. In such cases, the values that give the best accuracy on the validation set are used as the final parameter values. Cross-validation can be used for parameter estimating as well. 95

96 More on Classifier Performance Using statistical tests for evaluation and comparison Improving accuracy: Bagging and boosting Classifier Combination 96

97 Best Practices Understand your model (strength/limitations/best practices) REASON ABOUT YOUR RESULTS (get insight on your data) 97

98 Diagnostics tell you what to try next Diagnostics for bias and variance: Variance: training error will be much lower than test error. Bias: training error will also be high. Fixes to try: Try getting more training examples. Fixes high variance. Try a smaller set of features. Fixes high variance. Try a larger set of features. Fixes high bias. Enhance the model or switch to a different model. 98

99 Good machine learning practice Understand your application problem: get an intuitive understanding of what works and what doesn't work in your problem. Convey insight about the problem, and justify your research claims: i.e., rather than saying Here's an algorithm that works, it's more interesting to say Here's an algorithm that works because of component X, and here's my justification. Error analysis: try to understand what your sources of error are. 99

100 Getting started on a problem Approach #1: Careful design. Spend a long time designing exactly the right features, collecting the right dataset, and designing the right algorithmic architecture. Implement it and hope it works. Benefit: Nicer, perhaps more scalable algorithms. May come up with new, elegant learning algorithms; contribute to basic research in machine learning. 100

101 Getting started on a problem Approach #2: Build-and-fix. Implement something quick-and-dirty. Run error analysis and diagnostics to see what's wrong with it, and fix its errors. Benefit: Will often get your application problem working more quickly. Faster time to market. 101

102 Putting it All Together! Time spent coming up with diagnostics for learning algorithms is time well spent. It's often up to how skilled you are to come up with the right diagnostics (cost-sensitive learning, time/space complexity, convergence, etc.). Error analysis and learning curves also give insight into the problem. Two approaches to applying learning algorithms: Design very carefully, then implement. Build a quick-and-dirty prototype, diagnose, and fix. 102

103 Learning curves: classification error vs. training set size for a simple classifier and a complex classifier, approaching the Bayes error. 103

104 Thank YOU!
