CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott.

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models
Stephen Scott (Adapted from Ethem Alpaydin and Tom Mitchell), sscott@cse.unl.edu

Introduction
We might have reasons (domain information) to favor some hypotheses/predictions over others a priori. Bayesian methods work with probabilities, and have two main roles:
1. Provide practical learning algorithms: naïve Bayes learning and Bayesian belief network learning; combine prior knowledge (prior probabilities) with observed data; requires prior probabilities.
2. Provide a useful conceptual framework: a gold standard for evaluating other learning algorithms, and additional insight into Occam's razor.

Outline
- Bayes optimal classifier
- Naïve Bayes classifier
- Example: learning over text data
- Bayesian belief networks

Bayes Theorem
We want to know the probability that a particular label r is correct given that we have seen data D.
Conditional probability: P(r | D) = P(r ∧ D) / P(D)
Bayes theorem: P(r | D) = P(D | r) P(r) / P(D)
- P(r) = prior probability of label r (might include domain information)
- P(D) = probability of data D
- P(r | D) = posterior probability of r given D
- P(D | r) = probability (aka likelihood) of D given r
Note: P(r | D) increases with P(D | r) and P(r), and decreases with P(D).

Example: Does a patient have cancer or not? A patient takes a lab test and the result is positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(− | ¬cancer) = 0.97
Now consider a new patient for whom the test is positive. What is our diagnosis?
P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
So the diagnosis is ¬cancer (see the short sketch after the formulas below).

Basic Formulas for Probabilities
- Product rule: the probability P(A ∧ B) of a conjunction of two events A and B: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
- Sum rule: the probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
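The arithmetic in the cancer example can be reproduced in a few lines. The following is a minimal illustrative sketch (my own code, not part of the lecture) that applies Bayes' theorem and the theorem of total probability to the numbers stated on the slide; all variable names are assumptions.

```python
# Sketch: Bayes' theorem applied to the cancer-test example.
# The probabilities are the ones stated on the slide; names are illustrative.

p_cancer = 0.008                   # prior P(cancer)
p_not_cancer = 1 - p_cancer        # P(¬cancer) = 0.992
p_pos_given_cancer = 0.98          # P(+ | cancer)
p_pos_given_not_cancer = 0.03      # P(+ | ¬cancer)

# Unnormalized posteriors P(+ | r) P(r) for each label r
score_cancer = p_pos_given_cancer * p_cancer              # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# Theorem of total probability gives the evidence P(+)
p_pos = score_cancer + score_not_cancer

# Bayes' theorem: normalized posterior P(cancer | +)
posterior_cancer = score_cancer / p_pos
print(f"P(cancer | +) = {posterior_cancer:.3f}")   # ≈ 0.21
print("diagnosis:", "cancer" if score_cancer > score_not_cancer else "not cancer")
```

Note that the argmax over the unnormalized scores already determines the diagnosis; dividing by P(+) only converts them into posteriors.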

Bayes Optimal Classifier: Applying Bayes Rule
Bayes rule lets us get a handle on the most probable label for an instance.
Bayes optimal classification of instance x: argmax_{r_j ∈ R} P(r_j | x), where R is the set of possible labels (e.g., R = {+, −}).
On average, no other classifier using the same prior information and the same hypothesis space can outperform Bayes optimal! ⇒ Gold standard for classification.

Bayes Optimal Classifier
Let instance x be described by attributes ⟨x_1, x_2, ..., x_n⟩. Then the most probable label of x is:
r* = argmax_{r_j ∈ R} P(r_j | x_1, x_2, ..., x_n)
   = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j) / P(x_1, x_2, ..., x_n)
   = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j)
In other words, if we can estimate P(r_j) and P(x_1, x_2, ..., x_n | r_j) for all possibilities, then we can give a Bayes optimal prediction of the label of x for all x.
How do we estimate P(r_j) from training data? What about P(x_1, x_2, ..., x_n | r_j)?

Naïve Bayes
Problem: Estimating P(r_j) is easily done, but there are exponentially many combinations of values of x_1, ..., x_n. E.g., if we want to estimate P(Sunny, Hot, High, Weak | PlayTennis = No) from the data, we need to count, among the No-labeled instances, how many exactly match x (few or none).
Naïve Bayes assumption: P(x_1, x_2, ..., x_n | r_j) = Π_i P(x_i | r_j)
so the naïve Bayes classifier is: r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j)
Now we have only a polynomial number of probabilities to estimate.

Naïve Bayes
Along with decision trees, neural networks, nearest neighbor, SVMs, and boosting, this is one of the most practical learning methods.
When to use:
- A moderate or large training set is available
- The attributes that describe instances are conditionally independent given the classification
Successful applications: diagnosis, classifying text documents.

Naïve Bayes Algorithm
Naive_Bayes_Learn: For each target value r_j:
1. P̂(r_j) ← estimate P(r_j) = fraction of examples with label r_j
2. For each attribute value v_ik of each attribute x_i ∈ x: P̂(v_ik | r_j) ← estimate P(v_ik | r_j) = fraction of r_j-labeled instances with value v_ik
Classify_New_Instance(x): r_NB = argmax_{r_j ∈ R} P̂(r_j) Π_{x_i ∈ x} P̂(x_i | r_j)

Example: Training examples (Mitchell, Table 3.2):
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Instance x to classify: ⟨Outlk = sun, Temp = cool, Humid = high, Wind = strong⟩
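To make the learning and classification steps concrete, here is a minimal naïve Bayes sketch in Python (illustrative code, not part of the lecture) that estimates P̂(r_j) and P̂(x_i | r_j) as simple frequencies from the PlayTennis table and classifies the instance above; the function and variable names are my own.

```python
from collections import Counter, defaultdict

# PlayTennis data (Mitchell, Table 3.2): (Outlook, Temperature, Humidity, Wind) -> label
data = [
    (("Sunny","Hot","High","Weak"), "No"),      (("Sunny","Hot","High","Strong"), "No"),
    (("Overcast","Hot","High","Weak"), "Yes"),  (("Rain","Mild","High","Weak"), "Yes"),
    (("Rain","Cool","Normal","Weak"), "Yes"),   (("Rain","Cool","Normal","Strong"), "No"),
    (("Overcast","Cool","Normal","Strong"), "Yes"), (("Sunny","Mild","High","Weak"), "No"),
    (("Sunny","Cool","Normal","Weak"), "Yes"),  (("Rain","Mild","Normal","Weak"), "Yes"),
    (("Sunny","Mild","Normal","Strong"), "Yes"),(("Overcast","Mild","High","Strong"), "Yes"),
    (("Overcast","Hot","Normal","Weak"), "Yes"),(("Rain","Mild","High","Strong"), "No"),
]

# Naive_Bayes_Learn: estimate P(r_j) and P(v_ik | r_j) as frequency counts
label_counts = Counter(label for _, label in data)
cond_counts = defaultdict(Counter)          # (label, attribute index) -> value counts
for attrs, label in data:
    for i, v in enumerate(attrs):
        cond_counts[(label, i)][v] += 1

def classify(x):
    """Classify_New_Instance: argmax_r P(r) * prod_i P(x_i | r)."""
    best_label, best_score = None, -1.0
    for label, n in label_counts.items():
        score = n / len(data)                             # P(r_j)
        for i, v in enumerate(x):
            score *= cond_counts[(label, i)][v] / n       # P(x_i | r_j)
        print(label, round(score, 4))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# The instance from the slide, written with the table's attribute values
print(classify(("Sunny", "Cool", "High", "Strong")))
```

The two scores it prints match the hand computation on the next slide (≈0.0206 for No, ≈0.0053 for Yes), and the returned prediction is No.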

Naïve Bayes Example
Assign label r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j):
P(y) P(sun | y) P(cool | y) P(high | y) P(strong | y) = (9/14)(2/9)(3/9)(3/9)(3/9) = 0.0053
P(n) P(sun | n) P(cool | n) P(high | n) P(strong | n) = (5/14)(3/5)(1/5)(4/5)(3/5) = 0.0206
So r_NB = n.

Naïve Bayes: Conditional Independence
The conditional independence assumption is often violated, i.e., P(x_1, x_2, ..., x_n | r_j) ≠ Π_i P(x_i | r_j), but naïve Bayes works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(r_j | x) to be correct; we need only that
argmax_{r_j ∈ R} P̂(r_j) Π_i P̂(x_i | r_j) = argmax_{r_j ∈ R} P(r_j) P(x_1, ..., x_n | r_j).
Sufficient conditions are given in Domingos & Pazzani [1996].

Naïve Bayes: Estimating Probabilities
What if none of the training instances with target value r_j have attribute value v_ik? Then P̂(v_ik | r_j) = 0, and so P̂(r_j) Π_i P̂(v_ik | r_j) = 0.
The typical solution is to use the m-estimate:
P̂(v_ik | r_j) ← (n_c + m p) / (n + m)
where
- n is the number of training examples for which r = r_j
- n_c is the number of examples for which r = r_j and x_i = v_ik
- p is a prior estimate for P̂(v_ik | r_j)
- m is the weight given to the prior (i.e., the number of virtual examples), sometimes called pseudocounts

Naïve Bayes Example: Text Classification
Target concept Spam?: Document → {+, −}. Each document is a vector of words (one attribute per word position), e.g., x_1 = "each", x_2 = "document", etc.
Naïve Bayes is very effective despite the obvious violation of the conditional independence assumption (it treats the words in an email as independent of those around them).
Set P(+) = fraction of training emails that are spam, and P(−) = 1 − P(+).
To simplify matters, we also assume position independence, i.e., we only model which words occur in spam/not spam, not their positions ⇒ for every word w in our vocabulary, P(w | +) = probability that w appears in any position of +-labeled training emails (factoring in the prior via an m-estimate; a small sketch follows below).

Naïve Bayes: Text Classification Pseudocode [Mitchell] (pseudocode figure not reproduced here)

Bayesian Belief Networks
Sometimes the naïve Bayes assumption of conditional independence is too restrictive, but inferring probabilities is intractable without some such assumptions. Bayesian belief networks (also called Bayes nets) describe conditional independence among subsets of variables. This allows combining prior knowledge about dependencies among variables with observed training data.
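As forward-referenced above, here is a hedged sketch of the m-estimate applied to word probabilities in the spam setting. It uses a uniform prior p = 1/|vocabulary| and m = |vocabulary| virtual examples (which reduces to Laplace-style add-one smoothing), and, following the position-independence model, takes n to be the total number of word positions in the documents of a class. The data and names are made up for illustration; this is not the lecture's pseudocode.

```python
from collections import Counter

def m_estimate(n_c, n, p, m):
    """Smoothed probability estimate (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Toy spam/ham documents (made-up data, for illustration only)
spam_docs = [["win", "money", "now"], ["free", "money"]]
ham_docs  = [["meeting", "at", "noon"], ["free", "lunch", "at", "noon"]]

vocab = {w for doc in spam_docs + ham_docs for w in doc}
p = 1.0 / len(vocab)      # uniform prior estimate for each word probability
m = len(vocab)            # weight of the prior (number of virtual examples)

def word_probs(docs):
    """P(w | class) for every vocabulary word, smoothed with the m-estimate."""
    counts = Counter(w for doc in docs for w in doc)
    n = sum(counts.values())                  # total word positions with this label
    return {w: m_estimate(counts[w], n, p, m) for w in vocab}

p_word_spam = word_probs(spam_docs)
p_word_ham = word_probs(ham_docs)

# A word never seen in spam still gets a small nonzero probability,
# so a product of estimates can never be forced to zero by one unseen word.
print(p_word_spam["meeting"] > 0.0)   # True
```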

Bayesian Belief Networks: Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k).
More compactly, we write P(X | Y, Z) = P(X | Z).
Example: Thunder is conditionally independent of Rain, given Lightning: P(Thunder | Rain, Lightning) = P(Thunder | Lightning).
Naïve Bayes uses conditional independence and the product rule to justify
P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z).

Bayesian Belief Networks
[Figure: a Bayes net over Storm (S), BusTourGroup (B), Lightning (L), Campfire (C), Thunder (T), and ForestFire (F), with a conditional probability table for Campfire indexed by S,B / S,¬B / ¬S,B / ¬S,¬B.]
The network (a directed acyclic graph) represents a set of conditional independence assertions: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. E.g., given Storm and BusTourGroup, Campfire is conditionally independent of Lightning and Thunder.

Bayesian Belief Networks: Special Case
Since each node is conditionally independent of its nondescendants given its immediate predecessors, what model does this represent, given that r is the class and the x_i's are attributes?

Bayesian Belief Networks: Causality
We can think of the edges in a Bayes net as representing a causal relationship between nodes. E.g., rain causes wet grass: the probability of wet grass depends on whether there is rain.

Bayesian Belief Networks: Generative Models
We sometimes call Bayes nets generative (vs. discriminative) models, since they can be used to generate instances ⟨Y_1, ..., Y_n⟩ according to the joint distribution.
A Bayes net represents the joint probability distribution over its variables ⟨Y_1, ..., Y_n⟩, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general, for y_i = value of Y_i,
P(y_1, ..., y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i))
where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.
E.g., P(S, B, C, L, T, F) = P(S) P(B) P(C | B, S) P(L | S) P(T | L) P(F | S, L, C)

Bayesian Belief Networks: Predicting Most Likely Label
We can use a Bayes net for classification: the label r to predict is one of the variables, represented by a node. If we can determine the most likely value of r given the rest of the nodes, we can predict the label. One idea: go through all possible values of r, compute the joint distribution (previous slide) with that value and the other attribute values, and return the value that maximizes the joint.
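The factorization P(y_1, ..., y_n) = Π_i P(y_i | Parents(Y_i)) and the enumerate-the-label-values idea can be sketched as follows. The structure matches the Storm/BusTourGroup/Campfire/Lightning/Thunder/ForestFire example, but every numeric CPT entry below is made up for illustration, since the slides' tables are not reproduced here; the code and names are my own.

```python
# Sketch: joint probability of a Bayes net as a product of P(node | parents),
# and classification by enumerating the values of the label node.

parents = {"S": [], "B": [], "C": ["B", "S"], "L": ["S"], "T": ["L"], "F": ["S", "L", "C"]}

# CPTs: cpt[node][parent_values] = P(node = True | parent values); numbers are illustrative
cpt = {
    "S": {(): 0.2}, "B": {(): 0.1},
    "C": {(True, True): 0.9, (True, False): 0.4, (False, True): 0.3, (False, False): 0.05},
    "L": {(True,): 0.7, (False,): 0.1},
    "T": {(True,): 0.95, (False,): 0.05},
    "F": {},
}
for s in (True, False):          # fill P(F = True | S, L, C) for all parent combinations
    for l in (True, False):
        for c in (True, False):
            cpt["F"][(s, l, c)] = 0.1 + 0.3 * s + 0.3 * l + 0.2 * c

def joint(assignment):
    """P(y_1, ..., y_n) = prod_i P(y_i | Parents(Y_i))."""
    prob = 1.0
    for node, pars in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

def predict(label, evidence):
    """Try each value of the label node and return the one maximizing the joint."""
    scores = {v: joint({**evidence, label: v}) for v in (True, False)}
    return max(scores, key=scores.get)

# Predict Storm given full evidence on the remaining variables
evidence = {"B": True, "C": True, "L": False, "T": False, "F": False}
print(predict("S", evidence))
```

With unspecified attributes one would instead sum the joint over their values for each candidate label, which is exactly the exponential-time issue the next slides discuss.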

Bayesian Belief Networks: Predicting Most Likely Label (continued)
E.g., if Storm (S) is the label to predict and we are given the values of B, C, L, T, and F, we can use the formula to compute P(S, B, C, L, T, F) and P(¬S, B, C, L, T, F), then predict the more likely one.
This easily handles unspecified attribute values. Issue: it takes time exponential in the number of values of the unspecified attributes. A more efficient approach is Pearl's message-passing algorithm for chains, trees, and polytrees (at most one path between any pair of nodes).

Learning of Bayesian Belief Networks
There are several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables, or just some.
If the structure is known and all variables are observed, then it is as easy as training a naïve Bayes classifier: initialize the CPTs with pseudocounts; if, e.g., a training instance has S, B, and C set, then increment that count in C's table. Probability estimates come from normalizing the counts (a count-and-normalize sketch appears at the end of this section). Example counts for Campfire's table:
     S,B   S,¬B   ¬S,B   ¬S,¬B
C     4     1      8      2
¬C    6    10      2      8

Learning of Bayesian Belief Networks (continued)
Suppose the structure is known and the variables are only partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire. This is similar to training a neural network with hidden units; in fact, we can learn the network's conditional probability tables using gradient ascent, converging to a network h that (locally) maximizes P(D | h), i.e., a maximum likelihood hypothesis. We can also use the EM (expectation maximization) algorithm: use observations of the variables to predict their values in cases when they are not observed. EM has many other applications, e.g., hidden Markov models (HMMs).

Summary
- Bayesian methods combine prior knowledge with observed data; the impact of prior knowledge (when correct!) is to lower the sample complexity.
- Active research area: extending from Boolean to real-valued variables, parameterized distributions instead of tables, and more effective inference methods.
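As referenced in the learning slide above, here is an illustrative count-and-normalize sketch for the fully observed, known-structure case: each CPT entry starts at a pseudocount and is incremented whenever a training instance matches that parent/child configuration. The training tuples, pseudocount value, and names are made up; this is a sketch of the idea, not the lecture's procedure.

```python
from collections import defaultdict

# Structure of the example network: node -> list of parents
parents = {"S": [], "B": [], "C": ["S", "B"], "L": ["S"], "T": ["L"], "F": ["S", "L", "C"]}
pseudocount = 1.0

# counts[node][parent_values][node_value] = pseudocount + number of matching instances
counts = {n: defaultdict(lambda: defaultdict(lambda: pseudocount)) for n in parents}

def observe(instance):
    """Increment the count for each node's (parent values, own value) combination."""
    for node, pars in parents.items():
        key = tuple(instance[p] for p in pars)
        counts[node][key][instance[node]] += 1

def cpt_entry(node, parent_values, value):
    """Probability estimate obtained by normalizing the counts in one CPT row."""
    row = counts[node][tuple(parent_values)]
    total = sum(row[v] for v in (True, False))
    return row[value] / total

# Fully observed (made-up) training instances
data = [
    {"S": True,  "B": True,  "C": True,  "L": True,  "T": True,  "F": True},
    {"S": True,  "B": False, "C": False, "L": True,  "T": False, "F": False},
    {"S": False, "B": True,  "C": True,  "L": False, "T": False, "F": False},
]
for inst in data:
    observe(inst)

print(cpt_entry("C", (True, True), True))   # estimate of P(C | S, B)
```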