Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 20, 2011

Today: Bayes classifiers, Naïve Bayes, Gaussian Naïve Bayes
Readings: Mitchell, "Naïve Bayes and Logistic Regression" (available on the class website)

Let's learn classifiers by learning P(Y|X)
Consider Y = Wealth, X = <Gender, HoursWorked>

Gender  HrsWorked  P(rich|G,HW)  P(poor|G,HW)
F       <40.5      0.09          0.91
F       >40.5      0.21          0.79
M       <40.5      0.23          0.77
M       >40.5      0.38          0.62
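Learning P(Y|X) directly amounts to filling in a lookup table like the one above, one row per setting of <Gender, HoursWorked>. A small sketch using the table's values (illustrative only; the function name is my own):

```python
# P(rich | Gender, HoursWorked), read off the table above.
p_rich = {
    ("F", "<40.5"): 0.09,
    ("F", ">40.5"): 0.21,
    ("M", "<40.5"): 0.23,
    ("M", ">40.5"): 0.38,
}

def predict_wealth(gender: str, hours: str) -> str:
    p = p_rich[(gender, hours)]
    return "rich" if p > 0.5 else "poor"

print(predict_wealth("M", ">40.5"))  # poor, since P(rich) = 0.38 < 0.5
```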

How many parameters must we estimate?
Suppose X = <X1, ..., Xn> where the Xi and Y are boolean RVs.
To estimate P(Y | X1, X2, ..., Xn), we need one parameter for each of the 2^n possible settings of <X1, ..., Xn>.
What if we have 30 Xi's instead of 2?

Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)

Which is shorthand for:
(for all i, j)  P(Y=yi | X=xj) = P(X=xj | Y=yi) P(Y=yi) / P(X=xj)

Equivalently:
P(Y=yi | X=xj) = P(X=xj | Y=yi) P(Y=yi) / Σ_k P(X=xj | Y=yk) P(Y=yk)
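A quick back-of-the-envelope check of the parameter count above (a minimal sketch, assuming boolean Y and boolean Xi, with one free parameter per feature setting since P(Y=0|x) = 1 − P(Y=1|x)):

```python
# Parameters needed to estimate P(Y | X1, ..., Xn) directly:
# one free parameter per joint setting of <X1, ..., Xn>.
def direct_param_count(n_features: int) -> int:
    return 2 ** n_features

print(direct_param_count(2))    # 4
print(direct_param_count(30))   # 1073741824 -- over a billion parameters
```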

Can we reduce params using Bayes Rule?
Suppose X = <X1, ..., Xn> where the Xi and Y are boolean RVs.

Naïve Bayes
Naïve Bayes assumes
P(X1, ..., Xn | Y) = Π_i P(Xi | Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i ≠ j.

Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z:
(for all i, j, k)  P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
which we often write
P(X | Y, Z) = P(X | Z)
E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naïve Bayes uses the assumption that the Xi are conditionally independent, given Y. Given this assumption:
P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)
and in general:
P(X1, ..., Xn | Y) = Π_i P(Xi | Y)

How many parameters to describe P(X1, ..., Xn | Y)? P(Y)?
Without the conditional independence assumption?
With the conditional independence assumption?
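To answer the counting questions above, a small sketch (assuming boolean Y and boolean Xi; each distribution over a boolean variable needs one free parameter per conditioning context):

```python
def params_without_ci(n: int) -> int:
    # P(X1..Xn | Y): for each of the 2 values of Y, 2^n - 1 free parameters,
    # plus 1 free parameter for P(Y).
    return 2 * (2 ** n - 1) + 1

def params_with_ci(n: int) -> int:
    # With conditional independence: each P(Xi | Y) needs 1 free parameter
    # per value of Y, i.e. 2n in total, plus 1 free parameter for P(Y).
    return 2 * n + 1

print(params_without_ci(30))  # 2147483647
print(params_with_ci(30))     # 61
```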

Naïve Bayes in a Nutshell
Bayes rule:
P(Y=yk | X1, ..., Xn) = P(Y=yk) P(X1, ..., Xn | Y=yk) / Σ_j P(Y=yj) P(X1, ..., Xn | Y=yj)
Assuming conditional independence among the Xi's:
P(Y=yk | X1, ..., Xn) = P(Y=yk) Π_i P(Xi | Y=yk) / Σ_j P(Y=yj) Π_i P(Xi | Y=yj)
So, the classification rule for Xnew = <X1, ..., Xn> is:
Ynew ← argmax_yk P(Y=yk) Π_i P(Xi_new | Y=yk)

Naïve Bayes Algorithm, discrete Xi
Train_Naïve_Bayes(examples):
  for each* value yk, estimate π_k ≡ P(Y=yk)
  for each* value xij of each attribute Xi, estimate θ_ijk ≡ P(Xi=xij | Y=yk)
Classify(Xnew):
  Ynew ← argmax_yk P(Y=yk) Π_i P(Xi_new | Y=yk)
* probabilities must sum to 1, so we need estimate only n−1 of these...
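A minimal sketch of this discrete Naïve Bayes algorithm in Python (the function names are my own, not from the lecture; it uses plain MLE counts, so the zero-count issue discussed below still applies):

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    """examples: list of (x, y) pairs, where x is a tuple of discrete attribute values."""
    prior = Counter()                 # counts of each class y_k
    cond = defaultdict(Counter)       # (attribute index i, class y_k) -> value counts
    for x, y in examples:
        prior[y] += 1
        for i, xi in enumerate(x):
            cond[(i, y)][xi] += 1
    n = len(examples)
    P_y = {y: c / n for y, c in prior.items()}
    P_x_given_y = {key: {v: c / sum(cnt.values()) for v, c in cnt.items()}
                   for key, cnt in cond.items()}
    return P_y, P_x_given_y

def classify(x_new, P_y, P_x_given_y):
    best, best_logp = None, -math.inf
    for y, py in P_y.items():
        logp = math.log(py)
        for i, xi in enumerate(x_new):
            p = P_x_given_y[(i, y)].get(xi, 0.0)   # MLE: zero if value unseen with this class
            if p == 0.0:
                logp = -math.inf
                break
            logp += math.log(p)
        if logp > best_logp:
            best, best_logp = y, logp
    return best
```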

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates (MLEs):
P(Y=yk) = #D{Y=yk} / |D|
P(Xi=xij | Y=yk) = #D{Xi=xij AND Y=yk} / #D{Y=yk}
where #D{x} is the number of items in dataset D for which x holds.

Example: Live in Sq Hill? P(S | G, D, M)
S=1 iff live in Squirrel Hill
G=1 iff shop at SH Giant Eagle
D=1 iff drive to CMU
M=1 iff Rachel Maddow fan
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S | G, D, M)
S=1 iff live in Squirrel Hill; G=1 iff shop at SH Giant Eagle; D=1 iff drive to CMU; M=1 iff Rachel Maddow fan

Parameters to estimate:
P(S=1) :            P(S=0) :
P(D=1 | S=1) :      P(D=0 | S=1) :
P(D=1 | S=0) :      P(D=0 | S=0) :
P(G=1 | S=1) :      P(G=0 | S=1) :
P(G=1 | S=0) :      P(G=0 | S=0) :
P(M=1 | S=1) :      P(M=0 | S=1) :
P(M=1 | S=0) :      P(M=0 | S=0) :

Naïve Bayes: Subtlety #1
If unlucky, our MLE estimate for P(Xi | Y) might be zero (e.g., Xi = Birthday_Is_January_30_1990).
Why worry about just one parameter out of many?
What can be done to avoid this?
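Why a single zero estimate matters: in the Naïve Bayes product, one factor of zero wipes out the entire score for that class, no matter how strongly the other features support it. A small illustration with made-up numbers (not from the lecture):

```python
# Hypothetical per-feature likelihoods P(Xi = xi_new | Y = y_k) for 5 features;
# all strongly favor the class except one unseen value whose MLE estimate is 0.
likelihoods = [0.9, 0.8, 0.95, 0.0, 0.85]

class_score = 1.0
for p in likelihoods:
    class_score *= p
print(class_score)  # 0.0 -- the single zero dominates the product
```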

Estimating Parameters
Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data:
θ̂_MLE = argmax_θ P(D | θ)
Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data:
θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates:
P(Y=yk) = #D{Y=yk} / |D|
P(Xi=xij | Y=yk) = #D{Xi=xij AND Y=yk} / #D{Y=yk}
MAP estimates (Beta, Dirichlet priors):
P(Y=yk) = (#D{Y=yk} + (βk − 1)) / (|D| + Σ_m (βm − 1))
P(Xi=xij | Y=yk) = (#D{Xi=xij AND Y=yk} + (βj − 1)) / (#D{Y=yk} + Σ_j′ (βj′ − 1))
Only difference: "imaginary" examples added to the counts.
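A sketch of the MLE vs. MAP difference for a single boolean parameter with a Beta prior (the specific prior counts below are illustrative, not from the lecture):

```python
def mle(n_heads: int, n_tails: int) -> float:
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads: int, n_tails: int, beta_h: float, beta_t: float) -> float:
    # Mode of the Beta(beta_h + n_heads, beta_t + n_tails) posterior:
    # equivalent to adding (beta_h - 1) imaginary heads and (beta_t - 1) imaginary tails.
    return (n_heads + beta_h - 1) / (n_heads + n_tails + beta_h + beta_t - 2)

print(mle(0, 3))                      # 0.0 -- the zero-count problem
print(map_estimate(0, 3, 2.0, 2.0))   # 0.2 -- smoothed by imaginary examples
```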

Estimating Parameters
Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data.
Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data.

Conjugate priors [A. Singh]

Conjugate priors [A. Singh]

Naïve Bayes: Subtlety #2
Often the Xi are not really conditionally independent.
We use Naïve Bayes in many cases anyway, and it often works pretty well: it often gives the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]).
What is the effect on the estimated P(Y | X)?
Special case: what if we add two copies: Xi = Xk?
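To see the effect of violating conditional independence, a small illustration with made-up numbers: duplicating a feature counts its evidence twice, so the estimated P(Y | X) gets pushed toward 0 or 1 even though the predicted class may not change.

```python
# Two classes with equal priors; a single boolean feature X1 is observed true.
p_x_given_y1, p_x_given_y2 = 0.8, 0.4   # hypothetical likelihoods

def posterior_y1(copies: int) -> float:
    # Naive Bayes treats each duplicated copy of the feature as independent evidence.
    s1 = 0.5 * p_x_given_y1 ** copies
    s2 = 0.5 * p_x_given_y2 ** copies
    return s1 / (s1 + s2)

print(posterior_y1(1))  # ~0.667
print(posterior_y1(2))  # 0.8  -- same decision, but overconfident
```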

Special case: what if we add two copies: Xi = Xk?

Learning to classify text documents
Classify which emails are spam?
Classify which emails promise an attachment?
Classify which web pages are student home pages?
How shall we represent text documents for Naïve Bayes?

Baseline: Bag of Words Approach
Example word-count vector for a document:
aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
Zaire     0

Learning to classify document: P(Y | X), the Bag of Words model
Y is discrete valued, e.g., Spam or not.
X = <X1, X2, ..., Xn> = document
Xi is a random variable describing the word at position i in the document.
Possible values for Xi: any word wk in English.
Document = bag of words: the vector of counts for all wk's.
This vector of counts follows a ?? distribution.
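A minimal sketch of the bag-of-words representation (the vocabulary and document below are made up for illustration):

```python
from collections import Counter

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]

def bag_of_words(document: str) -> list[int]:
    # Count how often each vocabulary word appears in the document.
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

doc = "about all about africa gas oil"
print(bag_of_words(doc))  # [0, 2, 1, 1, 0, 1, 1, 0]
```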

Naïve Bayes Algorithm, discrete Xi
Train_Naïve_Bayes(examples):
  for each value yk, estimate π_k ≡ P(Y=yk)
  for each value xij of each attribute Xi, estimate θ_ijk ≡ P(Xi=xij | Y=yk)
    (the probability that word xij appears in position i, given Y=yk)
Classify(Xnew):
  Ynew ← argmax_yk P(Y=yk) Π_i P(Xi_new | Y=yk)
* Additional assumption: word probabilities are position independent, i.e., P(Xi=w | Y=yk) is the same for every position i.

MAP estimates for bag of words
MAP estimate for the multinomial:
P(word w | Y=yk) = (# occurrences of w in documents of class yk + (βw − 1)) / (total # words in documents of class yk + Σ_w′ (βw′ − 1))
What β's should we choose?
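A sketch of a bag-of-words (multinomial) Naïve Bayes text classifier with the MAP smoothing above, taking all β's equal (e.g., β = 2, i.e., one imaginary occurrence of each vocabulary word). The function names and the uniform choice of β are my own illustration, not the lecture's code:

```python
from collections import Counter, defaultdict
import math

def train_multinomial_nb(docs, labels, vocabulary, beta=2.0):
    """docs: list of token lists; labels: list of class labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)          # class -> word counts
    for tokens, y in zip(docs, labels):
        word_counts[y].update(t for t in tokens if t in vocabulary)
    priors = {y: c / len(labels) for y, c in class_counts.items()}
    likelihoods = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + (beta - 1) * len(vocabulary)
        likelihoods[y] = {w: (word_counts[y][w] + beta - 1) / total for w in vocabulary}
    return priors, likelihoods

def classify_doc(tokens, priors, likelihoods, vocabulary):
    scores = {}
    for y in priors:
        score = math.log(priors[y])
        for t in tokens:
            if t in vocabulary:
                score += math.log(likelihoods[y][t])
        scores[y] = score
    return max(scores, key=scores.get)
```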

For code and data, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").

What if we have continuous Xi?
E.g., image classification: Xi is the ith pixel.

What if we have continuous Xi?
E.g., image classification: Xi is the ith pixel, Y = mental state.
Still have:
P(Y=yk | X1, ..., Xn) = P(Y=yk) Π_i P(Xi | Y=yk) / Σ_j P(Y=yj) Π_i P(Xi | Y=yj)
Just need to decide how to represent P(Xi | Y).

What if we have continuous Xi?
E.g., image classification: Xi is the ith pixel.
Gaussian Naïve Bayes (GNB): assume
P(Xi = x | Y = yk) = (1 / (σik √(2π))) exp( −(x − μik)² / (2 σik²) )
Sometimes we assume σik is independent of Y (i.e., σi), or independent of Xi (i.e., σk), or both (i.e., σ).

Gaussian Naïve Bayes Algorithm, continuous Xi (but still discrete Y)
Train_Naïve_Bayes(examples):
  for each value yk, estimate* π_k ≡ P(Y=yk)
  for each attribute Xi, estimate the class conditional mean μik and variance σik
Classify(Xnew):
  Ynew ← argmax_yk P(Y=yk) Π_i N(Xi_new; μik, σik)
* probabilities must sum to 1, so we need estimate only n−1 of these parameters...
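A minimal sketch of Gaussian Naïve Bayes training and classification under the assumptions above (numpy-based; the function names and the small variance floor are my own additions):

```python
import numpy as np

def train_gnb(X, y):
    """X: (m, n) array of continuous features; y: length-m array of discrete labels."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # MLE variance per feature and class; small floor avoids division by zero.
    variances = {k: X[y == k].var(axis=0) + 1e-9 for k in classes}
    return priors, means, variances

def classify_gnb(x_new, priors, means, variances):
    scores = {}
    for k in priors:
        # Sum of log Gaussian densities, one per feature, plus the log prior.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[k])
                                + (x_new - means[k]) ** 2 / variances[k])
        scores[k] = np.log(priors[k]) + log_lik
    return max(scores, key=scores.get)
```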

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
μ̂ik = ( Σ_j Xi^j δ(Y^j = yk) ) / ( Σ_j δ(Y^j = yk) )
σ̂ik² = ( Σ_j (Xi^j − μ̂ik)² δ(Y^j = yk) ) / ( Σ_j δ(Y^j = yk) )
where the superscript j indexes the jth training example, i the ith feature, and k the kth class, and δ(z) = 1 if z is true, else 0.

GNB Example: Classify a person's cognitive activity, based on a brain image
Are they reading a sentence or viewing a picture?
Are they reading the word "Hammer" or "Apartment"?
Are they viewing a vertical or horizontal line?
Are they answering the question, or getting confused?
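The mean and variance estimates above can be written directly with the indicator δ. A brief numpy sketch of those two formulas (my own code, assuming X is an (m, n) array and y a length-m label vector):

```python
import numpy as np

def class_conditional_mle(X, y, k):
    """MLE of mu_ik and sigma_ik^2 for every feature i, for class y_k = k."""
    delta = (y == k).astype(float)            # delta(Y^j = y_k) for each example j
    n_k = delta.sum()                         # number of examples with Y = y_k
    mu = (X * delta[:, None]).sum(axis=0) / n_k
    var = (((X - mu) ** 2) * delta[:, None]).sum(axis=0) / n_k
    return mu, var
```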

Stimuli for our study:
[figure: example stimuli (e.g., the word "ant") presented over time; 60 distinct exemplars, presented 6 times each]

fMRI voxel means for "bottle": the means defining P(Xi | Y = "bottle")
[figure: mean fMRI activation over all stimuli; "bottle" minus mean activation; fMRI activation scale from below average to high]

Rank Accuracy: Distinguishing among 60 words

Tools vs Buildings: where does the brain encode their word meanings?
Accuracies of cubical 27-voxel Naïve Bayes classifiers centered at each voxel [accuracy scale 0.7 - 0.8]
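Rank accuracy scores how highly the classifier ranks the correct word among all candidates (1.0 = always ranked first, about 0.5 = chance). A sketch of one common definition of the metric; my own implementation, and the exact normalization used in the study may differ:

```python
def rank_accuracy(scores, true_label):
    """scores: dict mapping each candidate label to the classifier's score,
    e.g. log P(Y=y) + sum_i log P(Xi | Y=y). Returns a value in [0, 1]."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    rank = ranked.index(true_label)            # 0 = ranked first
    return 1.0 - rank / (len(ranked) - 1)      # 1.0 if first, 0.0 if last

scores = {"bottle": -10.2, "hammer": -11.7, "ant": -13.1, "apartment": -12.4}
print(rank_accuracy(scores, "hammer"))  # ranked 2nd of 4 -> 1 - 1/3 ~ 0.667
```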

What you should know:
Training and using classifiers based on Bayes rule
Conditional independence: what it is, why it's important
Naïve Bayes: what it is, why we use it so much
Training using MLE and MAP estimates
Discrete variables and continuous (Gaussian) variables

Questions:
What error will the classifier achieve if the Naïve Bayes assumption is satisfied and we have infinite training data?
Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
How can we extend Naïve Bayes if just 2 of the n Xi are dependent?
What does the decision surface of a Naïve Bayes classifier look like?
