1) Give decision trees to represent the following Boolean functions:


Stanley Wilson
6 years ago

1) Give decision trees to represent the following Boolean functions:

1) A ∧ ¬B
2) A ∨ [B ∧ C]
3) A XOR B
4) [A ∧ B] ∨ [C ∧ D]

Answer:

1) A ∧ ¬B
[Figure: decision tree for A ∧ ¬B]
2) A ∨ [B ∧ C]
[Figure: decision tree for A ∨ [B ∧ C]]

3) A XOR B = (A ∧ ¬B) ∨ (¬A ∧ B)
[Figure: decision tree for A XOR B]
4) [A ∧ B] ∨ [C ∧ D]
[Figure: decision tree for [A ∧ B] ∨ [C ∧ D]]
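The tree figures are not reproduced in this text, so the following is only a sketch of one possible tree per function, written as nested tests. It assumes the four functions read A ∧ ¬B, A ∨ [B ∧ C], A XOR B and [A ∧ B] ∨ [C ∧ D]. The function names are hypothetical, and other attribute orderings give equally valid trees. Each tree is checked against its formula on every truth assignment.

```python
from itertools import product


def tree_a_and_not_b(a, b):
    # Root tests A; only the A = True branch needs to test B.
    if not a:
        return False
    return not b


def tree_a_or_b_and_c(a, b, c):
    # Root tests A; the A = False branch tests B, then C.
    if a:
        return True
    if not b:
        return False
    return c


def tree_xor(a, b):
    # XOR needs B tested under both branches of A.
    if a:
        return not b
    return b


def tree_ab_or_cd(a, b, c, d):
    if a:
        if b:
            return True
    # Reached when A is False, or when A is True and B is False:
    # in a drawn tree both branches carry a copy of the same "C then D" subtree.
    if not c:
        return False
    return d


def agrees(tree, formula, n_vars):
    """True if the tree and the formula match on every truth assignment."""
    return all(
        tree(*values) == formula(*values)
        for values in product([False, True], repeat=n_vars)
    )


checks = [
    agrees(tree_a_and_not_b, lambda a, b: a and not b, 2),
    agrees(tree_a_or_b_and_c, lambda a, b, c: a or (b and c), 3),
    agrees(tree_xor, lambda a, b: a != b, 2),
    agrees(tree_ab_or_cd, lambda a, b, c, d: (a and b) or (c and d), 4),
]
print(checks)
```

The fourth function shows why disjunctions can be costly for trees: the C/D subtree has to be repeated under more than one branch.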

2) Consider the following set of training examples:

Example  A1  A2  A3  class
1        T   T   T   +
2        T   T   F   +
3        T   F   T   -
4        T   F   F   +
5        F   T   T   -
6        F   T   F   -
7        F   F   T   +
8        F   F   F   -

A) What is the entropy of this collection of training examples with respect to the target function classification?

The entropy of this collection of training examples with respect to the target function classification is E(S) = 1, because it contains equal numbers of positive and negative examples (four of each):

$E(S) = -\frac{4}{8} \log_2 \frac{4}{8} - \frac{4}{8} \log_2 \frac{4}{8} = 1$

B) What is the information gain of feature A2 relative to these training examples?

Both subsets (A2 = T and A2 = F) contain two positive and two negative examples, so each has entropy 1. The information gain of feature A2 is therefore

$Gain(S, A2) = E(S) - \sum_{v \in \{T, F\}} \frac{|S_v|}{|S|} E(S_v) = 1 - \left(\frac{4}{8} \cdot 1 + \frac{4}{8} \cdot 1\right) = 0$

C) What is the best feature relative to these training examples, using Gain Ratio?

The best feature is the one with the maximum gain ratio. Every feature splits the eight examples into two subsets of four, so SplitInformation = 1 for all three features and the gain ratio equals the information gain. A3, like A2, produces two subsets with two positive and two negative examples each, so its gain is 0. A1 splits the data into {3+, 1-} (A1 = T) and {1+, 3-} (A1 = F), each with entropy of about 0.811, so Gain(S, A1) = 1 - 0.811 ≈ 0.189. A1 is therefore the best feature.

Define the following terms; illustrate your answer with mathematical equations and drawings.

The steps required to build a face recognition application

K-fold cross-validation
- Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
- K-fold cross-validation is a common type of cross-validation in which the original sample is randomly partitioned into k mutually exclusive subsets D1, D2, ..., Dk.

- Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data.
- The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.

$acc_{cv} = \frac{1}{n} \sum_{(v_i, y_i) \in D} \delta\left(I(D \setminus D_{(i)}, v_i), y_i\right)$

where $I(D \setminus D_{(i)}, v_i)$ is the label assigned to $v_i$ by the classifier trained on all folds except the fold $D_{(i)}$ that contains $(v_i, y_i)$, and $\delta(a, b)$ is 1 if a = b and 0 otherwise.

Root-Mean-Square-Error
- The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed.
- The RMSE serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power.
- RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.
- The RMSE formula is

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$

where $\hat{y}_i$ is the predicted value and $y_i$ the observed value for the i-th of n observations.

Difference between supervised and unsupervised learning
- Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
- In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
- A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
- A supervised learning algorithm works as follows: given a set of N training examples of the form {(x_1, y_1), ..., (x_N, y_N)}, such that x_i is the feature vector and y_i is its label (class), a learning algorithm seeks a function g: X → Y, where X is the input space and Y is the output space. The function g is an element of some space of possible functions G, usually called the hypothesis space.
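To make the supervised setting concrete, here is a minimal sketch that ties together the pieces described so far: a learner picks g from a small hypothesis space, and k-fold cross-validation estimates its RMSE. The choice of straight lines fitted by least squares is an illustrative assumption, not something the notes prescribe, and the names `fit_line`, `k_fold_rmse`, `xs` and `ys` are hypothetical.

```python
import math
import random


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns the learned function g."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def k_fold_rmse(xs, ys, k, seed=0):
    """Average RMSE over k folds; each example is held out exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)         # random partition
    folds = [indices[i::k] for i in range(k)]    # k mutually exclusive subsets
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        g = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([g(xs[i]) for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k


# Noise-free toy data on the line y = 2x + 1, so the estimated error is ~0.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
print(k_fold_rmse(xs, ys, k=5))
```

Averaging per-fold scores is one way of combining the k results. Calling the function with `k=len(xs)` holds out a single observation per fold.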

It is sometimes convenient to represent g using a scoring function f: X × Y → R such that g is defined as returning the y value that gives the highest score:

$g(x) = \arg\max_{y} f(x, y)$

- Unsupervised learning is the problem of trying to find hidden structure in unlabeled data.
- Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
- This distinguishes unsupervised learning from supervised learning and reinforcement learning.

Difference between classification and regression
- Classification is a technique used to arrive at a schematic that shows the organization of the data, starting with a precursor variable.
- Regression is a prediction method that is based on an assumed or known numerical output value. This output value is the result of a series of recursive partitions, with every step having one numerical value and another group of dependent variables that branch out to another such pair.
- The difference between classification and regression is their dependent variable. For the classification tree, the dependent variables are categorical, while the regression tree has numerical dependent variables. Those of the classification tree also have a fixed number of unordered values, while those of the regression tree have either discrete but ordered values or continuous values.

Leave-one-out cross-validation
- Leave-one-out cross-validation involves using a single observation from the original sample as the validation data, and the remaining observations as the training data.
- This is repeated such that each observation in the sample is used once as the validation data.
- This is the same as K-fold cross-validation with K equal to the number of observations in the original sample.

How to solve the missing feature values problem?
- If some examples are missing values of an attribute A, estimate the missing value from other examples with known values. If node n tests A, the options are:
  - Assign the most common value of A among the other training examples at node n.
  - Assign the most common value of A among the other examples at node n with the same target attribute value.

  - Assign probability p_i to each possible value v_i of A, and assign the fraction p_i of the example to each descendant in the tree.
Classify new examples in the same fashion.

Over-fitting
- Overfitting is a significant practical difficulty for decision tree models and many other predictive models. It happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error.
- Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there is an alternative hypothesis h' ∈ H such that

$error_{train}(h) < error_{train}(h')$ and $error_{D}(h) > error_{D}(h')$

where $error_{train}$ is the error on the training set and $error_{D}$ is the error over the entire distribution D of instances.

How to avoid over-fitting of decision trees
There are several approaches to avoiding overfitting in building decision trees:
- Pre-pruning: stop growing the tree earlier, before it perfectly classifies the training set.
- Post-pruning: allow the tree to perfectly classify the training set, and then post-prune the tree. Two common variants are reduced-error pruning and rule post-pruning.
  Reduced-error pruning:
  - Split the data into a training set and a validation set.
  - Evaluate the impact on the validation set of pruning each possible node (plus those below it).
  - Greedily remove the one that most improves the validation set accuracy.
  - Repeat the above two steps until further pruning is harmful.
  This produces the smallest version of the most accurate subtree.
  Rule post-pruning:
  - Infer the decision tree from the training set, growing the tree and allowing overfitting to occur.
  - Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
  - Prune (generalize) each rule by removing any precondition whose removal improves its estimated accuracy.
  - Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
- Minimum Description Length principle: use an explicit measure of the complexity for encoding the training set and the decision tree, stopping growth of


the tree when this encoding size, size(tree) + size(misclassifications(tree)), is minimized.

Entropy and Mutual information
- Entropy is a measure of impurity in a sample S of training examples.

$Entropy(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$

where $p_{+}$ and $p_{-}$ are the proportions of positive and negative examples in S.
- Mutual information between two random variables is a measure of the mutual dependence, or the amount of shared information, between the two variables.

$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$

where H(X) and H(Y) are the marginal entropies, and H(X | Y) and H(Y | X) are the conditional entropies.
- Formally, the mutual information of two discrete random variables X and Y can be defined as

$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$

where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.

Gain Ratio
- Gain ratio is a measure that penalizes attributes by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data.

$SplitInformation(S, A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where S_i is the subset of S for which attribute A has the value v_i.

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

The Gaussian Distribution
- The normal (or Gaussian) distribution is a very commonly occurring continuous probability distribution: a function that tells the probability that an observation in some context will fall between any two real numbers.
- The simplest case of a normal distribution is known as the standard normal distribution, described by this probability density function:

$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$

- General normal distribution: any normal distribution is a version of the standard normal distribution whose domain has been stretched by a factor σ (the standard deviation) and then translated by μ (the mean value):

$f(x) = \frac{1}{\sigma} \phi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$

- The normal distribution is also often denoted by N(μ, σ²). Thus when a random variable X is distributed normally with mean μ and variance σ², we write

$X \sim N(\mu, \sigma^2)$
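As a small sketch of the stretch-and-translate relation described above, the general normal density can be computed from the standard one, and the probability of falling between two real numbers can be obtained from the cumulative distribution. The function names are hypothetical; the cumulative distribution is written with the standard library's error function `math.erf`.

```python
import math


def standard_normal_pdf(x):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2): the standard density stretched by sigma
    and translated by mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return standard_normal_pdf((x - mu) / sigma) / sigma


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_between(a, b, mu, sigma):
    """Probability that an observation falls between a and b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


# Probability of landing within one standard deviation of the mean.
within_one_sd = prob_between(-1.0, 1.0, mu=0.0, sigma=1.0)
print(round(within_one_sd, 4))
```

Dividing by σ in `normal_pdf` keeps the total area under the stretched curve equal to 1.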

Lazy Learning Algorithms
- Lazy learning algorithms are machine learning algorithms that are welcome members of procrastinators anonymous. Purely lazy learners typically have the following characteristics:
  - Defer: they delay the processing of their inputs until they receive requests for information; they simply store their inputs for future use.
  - Demand-driven: they reply to information queries by combining information from their stored (e.g., training) samples.
  - Discard: they delete the constructed query and any intermediate results.

Fuzzy k-nearest neighbors classifier
- The fuzzy k-nearest neighbors classifier can be considered a fuzzy classifier that provides an estimate of the degree of membership of an example x in each class in Ω, incorporating the distances between x and its k closest neighbors. It is represented by a mapping f: R^D → [0, 1]^C.
- Φ(x) can take one of the following forms:
  - The inverse of the distance
  - The decreasing exponential function

Euclidean Distance
- Euclidean distance is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula.
- The Euclidean distance between points p and q is the length of the line segment connecting them ($\overline{pq}$).
- In Cartesian coordinates, if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by

$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
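The Euclidean distance and the fuzzy k-nearest neighbors idea can be sketched together. The exact weighting formulas are not given in the notes, so this example assumes the inverse-distance form, with each neighbor weighted by 1 / (distance + eps), and normalizes the class weights so that the memberships sum to 1. The function names and the small constant `eps` are hypothetical choices for illustration. The classifier is also lazy in the sense described above: it only stores the samples and does all its work when queried.

```python
import math
from collections import defaultdict


def euclidean_distance(p, q):
    """Length of the segment between points p and q in n-space."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def fuzzy_knn_memberships(x, samples, labels, k=3, eps=1e-9):
    """Degree of membership of x in each class seen among its k nearest
    neighbors. Assumption: inverse-distance weights, normalized to sum to 1."""
    neighbors = sorted(
        zip(samples, labels),
        key=lambda pair: euclidean_distance(x, pair[0]),
    )[:k]
    weights = defaultdict(float)
    for point, label in neighbors:
        # eps avoids division by zero when x coincides with a stored sample.
        weights[label] += 1.0 / (euclidean_distance(x, point) + eps)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Toy data: two points of class "a" near the origin, two of class "b" far away.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["a", "a", "b", "b"]
memberships = fuzzy_knn_memberships((1.0, 1.0), samples, labels, k=3)
print(memberships)
```

A crisp prediction, if needed, is the class with the largest membership; keeping the full dictionary preserves how confident that choice is.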

Decision Tree CE-717 : Machine Learning Sharif University of Technology

M. Soleymani, Fall 2012. Some slides have been adapted from: Prof. Tom Mitchell. Decision tree: Approximating functions of usually discrete