CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek, Fall 2015


1 CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015

2 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function which predicts the outcome from the inputs: given a training set of N example pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function f. A large class of supervised learning approaches discover h with regression methods.

3 Regression Classifiers The basic idea underlying regression is: plot all of the sample points for n-feature examples in an n-dimensional space, then find the hyperplane that best separates the positive examples from the negative examples. If you are familiar with the statistical concept of regression for prediction, this is the same idea.

4-6 Linear Classifiers [figures: the same two-class dataset (+1 and -1 points) shown with different candidate separating lines] How would you classify this data? Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

7 Linear Classifiers [figure] Any of these would be fine....but which is best? Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

8 Measuring Fit For prediction, we can measure the error of our prediction by looking at how far off our predicted value is from the actual value: compute the individual errors and sum them. Typically we are much more worried about large errors than small ones, so we square the errors. This gives us a measure of fit which is the sum of squared errors. The best-fit hypothesis can be solved for analytically (see equation 18.3 in the text).
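
To make the sum-of-squared-errors idea concrete, here is a minimal sketch in Python (NumPy assumed; the data points are invented). np.linalg.lstsq solves the same least-squares minimization that the text's equation 18.3 gives analytically:

```python
import numpy as np

# Toy training data: one feature x, continuous target y (made up).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Analytic least-squares fit: w minimizes the sum of squared errors.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ w
sse = np.sum((y - predictions) ** 2)   # the sum-of-squared-errors measure of fit
print(w, sse)
```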

9 Linear Classifiers A linear classifier is just a hypothesis determined by linear regression with a threshold added. Rather than a hard threshold we typically use a logistic function to determine the optimal cutoffs, fitted through gradient descent.
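
A hedged sketch of that idea: logistic regression fitted by batch gradient descent on an invented one-feature dataset. The sigmoid plays the role of the soft threshold; learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D two-class data (hypothetical): class 1 tends to have larger x.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([np.ones((len(X), 1)), X])     # add an intercept column

w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(1000):                         # batch gradient descent
    p = sigmoid(Xb @ w)                       # predicted probabilities
    w -= lr * Xb.T @ (p - y) / len(y)         # gradient of the log-loss

print(sigmoid(Xb @ w))   # soft threshold: probabilities rather than hard 0/1
```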

10 More on Regression Models So far we have discussed linear models. We can add dimensions to the model by including higher-order terms, such as squared or cubed values of the features. As with decision trees, we can get overfitting. If we add enough dimensions we can fit almost anything!* *For consistent data

11 Support Vector Machines A Support Vector Machine (SVM) is a classifier. It uses features of instances to decide which class each instance belongs to. It is a supervised machine-learning classifier: training cases are used to calculate parameters for a model which can then be applied to new instances to make a decision. It is a binary classifier: it distinguishes between two classes. It is currently the most popular off-the-shelf machine learning classifier.

12 Basic Idea Underlying SVMs Find a line, or a plane, or a hyperplane, that separates our classes cleanly (this is the same concept as we have seen in regression), by finding the greatest margin separating them (this is not the same concept as we have seen in regression). What does it mean?

13 Linear Classifiers [figure: the candidate separating lines again] Any of these would be fine....but which is best? Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

14 Classifier Margin [figure] Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

15 Maximum Margin [figure] The maximum margin linear classifier is the linear classifier with the maximum margin. Called a Linear Support Vector Machine (SVM). Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

16 Maximum Margin [figure] Support Vectors are those datapoints that the margin pushes up against. The maximum margin linear classifier is the linear classifier with the, um, maximum margin. Called a Linear Support Vector Machine (SVM). Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

17 Why Maximum Margin? [figure] Support Vectors are those datapoints that the margin pushes up against. The classifier is f(x,w,b) = sign(w · x - b). 1. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. 2. Empirically it works very very well. This is the simplest kind of SVM (called an LSVM). Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore
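
For illustration only, here is what a linear SVM looks like with scikit-learn (assumed available); the tiny dataset is invented, and a very large C approximates the hard maximum-margin classifier described above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable two-class data (hypothetical).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard maximum margin
clf.fit(X, y)

print(clf.support_vectors_)         # the datapoints the margin pushes against
print(clf.predict([[3, 3]]))        # classify a new point
```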

18 Concept Check For which of these could we use a basic linear SVM? A: Classify the three kinds of iris in the UC Irvine data set? B: Classify into spam and non-spam? C: Classify students into likely to pass or not? Which of these is the SVM margin? [figure: two candidate margins, labeled A and B]

19 Messy Data This is all good so far. Suppose our data aren't that neat: [figure]

20 Soft Margins Intuitively, it still looks like we can make a decent separation here. We can't make a clean margin, but we can almost do so if we allow some errors. We introduce a slack variable, which measures the degree of misclassification and adds a cost (C) for these misclassified instances. There is a tradeoff between a wide margin and classification errors: a high cost will give relatively narrow margins; a low cost will give broader margins but misclassify more data. How much we want it to cost to misclassify instances depends on our domain -- what we are trying to do.
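
A small sketch of the cost tradeoff, again assuming scikit-learn and using invented overlapping data; exact numbers will vary, but the direction of the effect is what this slide describes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: no clean linear margin exists.
X = np.vstack([rng.normal(0, 1.2, (30, 2)), rng.normal(2, 1.2, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# High C: narrow margin, few training errors tolerated.
# Low C: broader margin, more misclassified training points.
for C in (100.0, 1.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))
```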

21 Only Two Errors, Narrow Margin [figure]

22 Several Errors, Wider Margin [figure]

23 Finding the Margin Conceptually similar to the sum of squared errors for regression. First we find the maximum margin separator: minimize the error of the points that are closest to the separator line (formula in the text). The margin is then the band that touches the nearest points.

24 Evaluating SVMs Same as evaluating any other classifier. Train on sample data, evaluate on test data (why?). Look at: classification accuracy, confusion matrix, sensitivity and specificity.
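
A minimal evaluation sketch, assuming scikit-learn: the train/test split is the "evaluate on test data" step, and sensitivity and specificity can be read off the confusion matrix:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # a standard binary dataset
# Hold out a test set: training accuracy alone would reward overfitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="linear").fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred))
# Rows are true classes, columns predicted classes:
# sensitivity = TP/(TP+FN), specificity = TN/(TN+FP).
print(confusion_matrix(y_te, pred))
```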

25 More on Evaluating SVMs Overfitting: a very close fit to training data which takes advantage of irrelevant variations in instances. Performance on test data will be much lower; it may mean that your training sample isn't representative; in SVMs, it may mean that C is too high. Is the SVM actually useful? Compare to the majority classifier.

26 Non-Linearly-Separable Data Suppose we can't get a good linear separation of the data? As with regression, allowing non-linearity will give us better modeling of many data sets. In SVMs, we do this by using a kernel. A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane. Common kernels are polynomial and RBF.

27-29 Hard 1-Dimensional Dataset What can be done about this? [figure sequence: points on a line centered at x=0 that no single threshold can separate; mapping each point to a higher dimension (e.g., adding x² as a second feature) makes the classes linearly separable] Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

30 SVM Kernel Functions K(a,b) = (a · b + 1)^d is an example of an SVM kernel function. Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function. Most common is the Radial-Basis-style kernel function: K(a,b) = exp(-|a - b|² / (2σ²)). Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

31-32 Polynomial [figures: decision boundaries produced by kernel SVMs] Borrowed heavily from Andrew Moore's tutorials. Copyright 2001, 2003, Andrew W. Moore

33 Kernel Trick We don't actually have to compute the complete higher-order function. In computing the SVM we only use the dot product, so we replace it with a kernel function. This means we can work with much higher dimensions without getting hopeless performance. The kernel trick in SVMs refers to all of this: using a kernel function instead of the dot product to give us separation of non-linear data without impossible performance cost.
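
A worked example of the trick with the degree-2 polynomial kernel K(a,b) = (a·b + 1)²: the kernel value equals the dot product of explicit 6-dimensional feature maps, computed without ever building them. The feature map phi below is the standard one for this kernel, shown only for verification:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input vector."""
    a1, a2 = v
    return np.array([1.0,
                     np.sqrt(2) * a1, np.sqrt(2) * a2,
                     a1 ** 2, a2 ** 2,
                     np.sqrt(2) * a1 * a2])

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1) ** d

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# The kernel evaluates the 6-D dot product without mapping into 6-D space.
print(poly_kernel(a, b))         # (1*3 + 2*(-1) + 1)^2 = 4.0
print(np.dot(phi(a), phi(b)))    # the same value, computed the expensive way
```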

34 Why SVMs? Focusing on the instances nearest the margin pays more attention to where the differences are critical. SVMs can handle very large feature sets effectively. In practice they have been shown to work well in a variety of domains.

35 Summary SVMs are a form of supervised classifier. The basic SVM is binary and linear, but there are non-linear and multi-class extensions. One key insight and one neat trick[1]: key insight: the maximum margin separator; neat trick: the kernel trick. A good method to try first if you have no knowledge about the domain. Applicable in a wide variety of domains. [1] Artificial Intelligence, a Modern Approach, third edition, Russell and Norvig, 2010, p. 744

36 Learning by Analogy: Case-based Reasoning Case-based systems are a significant chunk of AI in their own right. A case-based system has two major components: a case base and a problem solver. The case base contains a growing set of cases, analogous to either a KB or a training set. The problem solver has a case retriever and a case reasoner, and may also have a case installer.

37 Case-based Reasoning Definition of relevant features is critical: we need the ones which influence outcomes, at the right level of granularity. The reasoner can be a complex planning and what-if reasoning system, or a simple query for missing data. It only really becomes a learning system if there is a case installer as well; then it can grow cumulatively.

38 K-Nearest Neighbor All instances form the trained system. For a new case, determine the distance to each training instance: typically Euclidean distance, Manhattan distance, or weighted distance metrics. Use the K nearest instances to determine the class.

39 Example [figure: two classes plotted on Feature 1 vs. Feature 2, with a new point marked "?" to be classified]

40 KNN: What Value for K? There is a tradeoff between looking at more neighbors (larger K), which gives a more general description, and looking at fewer neighbors, which risks overfitting again. Typically, start with K = 1, then 3, etc., until accuracy drops.
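
A from-scratch KNN sketch (NumPy assumed; the data are invented) showing the distance computation and the majority vote:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance
    nearest = np.argsort(dists)[:k]                   # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # -> "a"
```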

41 KNN Advantages and Disadvantages Advantages: incremental; training is very fast (lazy!); all information is retained; can learn quite complex relationships. Disadvantages: uses a lot of storage, since all instances are retained; slow at query time; sensitive to irrelevant features.

42 Neural Nets, the very short version A neural net consists of layers of nodes, or neurons, each of which has an activation level. Nodes of each layer receive inputs from previous layers; these are combined according to a set of weights. If the activation level is reached, the node fires and sends inputs to the next level. The initial layer is data from cases; the final layer is expected outcomes. Learning is accomplished by modifying the weights to reduce the prediction error.

43 Neurons [Figure 18.19 from R&N: a single neuron with input links, an input function in_j = Σ_i w_{i,j} a_i (including a bias weight w_{0,j} on the fixed input a_0 = 1), an activation function a_j = g(in_j), and output links. Figure 18.20 from R&N: (a) a simple network; (b) a network with an input layer, a hidden layer, and an output layer.]

44 Neural Nets, continued The typical method of modifying the weights is back-propagation: success or failure at the output node is propagated back through the nodes which contributed to that output node; essentially a form of gradient descent. The number of hidden nodes/layers is a complicated decision: too many = overfitting. Typical is to try several and evaluate.
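
A toy back-propagation sketch (not the text's notation): one hidden layer trained on XOR with plain gradient descent on the squared error. The weights, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: the classic case a single linear unit cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def add_bias(a):                      # append a constant-1 input (the bias unit)
    return np.hstack([a, np.ones((len(a), 1))])

W1 = rng.normal(0, 1, (3, 4))         # input + bias -> 4 hidden units
W2 = rng.normal(0, 1, (5, 1))         # hidden + bias -> 1 output unit
lr = 1.0
for _ in range(10000):
    h = sigmoid(add_bias(X) @ W1)     # forward pass: hidden activations
    out = sigmoid(add_bias(h) @ W2)   # forward pass: output activation
    # Back-propagation: push the output error back through the layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:-1].T) * h * (1 - h)   # drop the bias row when propagating
    W2 -= lr * add_bias(h).T @ d_out
    W1 -= lr * add_bias(X).T @ d_h

print(out.round(2))                   # should approach [0, 1, 1, 0]
```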

45 Deep Learning A modern extension of neural nets: multiple layers, with non-linear functions; each layer learns an abstraction of the previous layer's data. Scalable: can be partitioned across multiple CPUs. It has recently become of greater importance because it underlies Google's ML, and they have just open sourced it. en//people/jeff/cikm-keynote-nov2014.pdf

46 Reinforcement Learning If an agent has multiple sequential actions to perform, learning needs a different mode: each action affects available future actions; feedback may not be available after every action; the agent has a long-term goal to maximize. The agent learns a policy, a mapping from states to actions. Issues include exploration vs. exploitation, credit assignment, and generalization.
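
As a concrete (hypothetical) instance of policy learning, here is tabular Q-learning on an invented five-state corridor; the epsilon-greedy choice is the exploration-vs-exploitation tradeoff, and the update rule does the credit assignment:

```python
import numpy as np

# Tabular Q-learning sketch: states 0..4, actions 0 = left, 1 = right,
# reward only on reaching the rightmost state. All constants are arbitrary.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for _ in range(300):                              # episodes
    s = 0
    while s != n_states - 1:
        # Exploration vs. exploitation: mostly greedy, sometimes random.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Credit assignment: discounted future value flows backward through Q.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:-1])   # policy for non-terminal states: all 1 = go right
```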

47 Supervised Learning, Summary Learning systems which, given examples and results, learn a model which will yield correct results for future examples. Typically used for classification. The most widely applied ML category. Depends on getting relevant features and representative examples. Evaluated against a separate test sample; watch for overfitting. Usefulness should be checked against the majority classifier.

48 Unsupervised Learning Typically used to refer to clustering methods, which don't require training cases. There is no prior definition of the goal; the typical aim is to put similar things together: grouping search results, grouping inputs to a customer response system, purchases from a web site, census data about types and costs of housing. Combinations of hand-modeled and automatic can work very well: Google News, for instance. Still requires a good feature set.

49 Clustering Basics Collect examples. Compute similarity among examples according to some metric. Group examples together such that examples within a cluster are similar and examples in different clusters are different. Summarize each cluster. Sometimes -- assign new instances to the most similar cluster.

50-51 Clustering Example [figures: unlabeled points grouped into clusters]

52 Measures of Similarity In order to do clustering we need some kind of measure of similarity; this is basically our critic. It is a vector of values and depends on the domain: documents: bag of words, linguistic features; purchases: cost, purchaser data, item data; census data: most of what is collected. This is a similar issue to KNN. Cosine similarity is common.

53 Cosine similarity measurement Cosine similarity is a measure of similarity between two vectors obtained by measuring the cosine of the angle between them. The result of the cosine function is equal to 1 when the angle is 0, and it is less than 1 for any other angle. As the angle between the vectors shrinks, the cosine approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases. Based on home.iitk.ac.in/~mfelixor/files/non-numeric-clustering-seminar.pp
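
A direct translation of that definition into code (NumPy assumed; the bag-of-words counts are invented):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two vectors: 1.0 when they point the same way."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Bag-of-words counts for three tiny hypothetical "documents".
doc1 = np.array([2, 1, 0, 0])
doc2 = np.array([4, 2, 0, 0])   # same direction as doc1, different length
doc3 = np.array([0, 0, 3, 1])
print(cosine_similarity(doc1, doc2))   # 1.0: identical word proportions
print(cosine_similarity(doc1, doc3))   # 0.0: no shared words
```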

54 Clustering Algorithms Hierarchical: bottom-up, top-down. Flat: K-means. Probabilistic: Expectation Maximization (E-M).

55 Hierarchical Agglomerative Clustering (HAC) Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster. The history of merging forms a binary tree or hierarchy.

56 Dendrogram: Hierarchical Clustering A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
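
A bottom-up HAC sketch, assuming SciPy is available: linkage builds the merge history (the dendrogram) and fcluster performs the cut; the data are two invented blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

# Bottom-up HAC: repeatedly merge the two most similar clusters.
Z = linkage(X, method="average")        # the merge history (the dendrogram)

# "Cutting" the dendrogram: ask for 2 connected components.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                           # e.g. [1 1 1 1 1 2 2 2 2 2]
```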

57 Partitioning (Flat) Algorithms Partitioning method: construct a partition of n documents into a set of K clusters. Given: a set of documents and the number K. Find: a partition into K clusters that optimizes the chosen partitioning criterion. Globally optimal: exhaustively enumerate all partitions -- usually too expensive. Effective heuristic method: the K-means algorithm.

58 K-Means Clustering Typically we provide the number of desired clusters, k. Randomly choose k instances as seeds. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when the clustering converges or after a fixed number of iterations.

59-66 K Means Example (K=2) [figure sequence: pick seeds; reassign clusters; compute centroids (marked x); reassign clusters; compute centroids; reassign clusters; converged!]
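
A bare-bones K-means sketch in NumPy that mirrors the steps above (seed picking, reassignment, centroid computation, convergence test); the data are two invented blobs:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick seeds
    for _ in range(iters):
        # Reassign: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids (assumes no cluster goes empty on this data).
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):        # converged: nothing moved
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))   # one centroid near (0, 0), one near (4, 4)
```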

67 K-Means There is a tradeoff between having more clusters (better focus within each cluster) and having too many clusters: overfitting again. Results can vary based on random seed selection: some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. The algorithm is sensitive to outliers -- data points that are far from other data points. These could be errors in the data recording, or special data points with very different values.

68 Strengths of k-means Simple: easy to understand and to implement. Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm. K-means is the most popular clustering algorithm. In practice it performs well, especially on text.

69 REALLY Unsupervised Learning Turn the machine loose to learn on its own. Needs: a representation (we still need some idea of what we are trying to learn!), good natural language processing, and a context. People don't learn very well unsupervised. Currently there is some interesting research for instance-level knowledge. It is much harder to acquire structural or relational knowledge, but we are getting there.

70 More Aspects of Machine Learning Machine learning varies by degree of human intervention: Rote -- human builds the KB (Cyc). Human assisted -- human adds knowledge directed by the machine (Animals, Teiresias). Human scored -- human provides training cases (neural nets, ID3, CART). Completely automated -- nearest neighbor, other clustering methods.

71 More Aspects of Machine Learning Machine learning varies by degree of transparency: Hand-built KBs are by definition clear to humans. Human-aided trees like Animals are also generally clear and meaningful, and could easily be modified by humans. Inferred rules like ID3's are generally understood by humans but may not be intuitively obvious; modifying them by hand may lead to worse results. Systems like SVMs are typically black box: you can look at the models, but it's hard to interpret them in any human-meaningful way and essentially impossible to modify them by hand.

72 More Aspects of Machine Learning Machine learning varies by the goal of the process: extend a knowledge base; improve some kind of decision making, such as guessing an animal or classifying diseases; improve the overall performance of a program, such as game playing; organize large amounts of data; find patterns or "knowledge" not previously known, often to take some action.

73 The Web Machine learning is one of those fields where the web is changing everything! Three major factors: One problematic aspect of machine learning research is finding enough data -- this is NOT an issue on the web! Another problematic aspect is getting a critic -- the web offers a lot of opportunities. A third is identifying good practical uses for machine learning -- there are lots of online opportunities here.

74 Summary Machine learning is valuable both because we want to understand how humans learn and because it improves computer systems. Systems may learn representation or actions or both. There is a variety of methods, some knowledge-based and some statistical. It is currently a very active research area. The web is providing a lot of new opportunities, and Google, Amazon, and other large companies are really pushing the limits.
