COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning, Semester 2, 2016. Lecturer: Trevor Cohn. Lecture 20: PGM Representation

Next Lectures
Representation of joint distributions
Conditional/marginal independence
* Directed vs undirected
Probabilistic inference
* Computing other distributions from the joint
Statistical inference
* Learning parameters from (missing) data

Probabilistic Graphical Models
Marriage of graph theory and probability theory. Tool of choice for Bayesian statistical learning. We'll stick with the easier discrete case; the ideas generalise to the continuous case.

Motivation by practical importance
Many applications
* Phylogenetic trees
* Pedigrees, linkage analysis
* Error-control codes
* Speech recognition
* Document topic models
* Probabilistic parsing
* Image segmentation
Unifies many previously-discovered algorithms
* HMMs
* Kalman filters
* Mixture models
* LDA
* MRFs
* CRFs
* Logistic, linear regression

Motivation by way of comparison
Bayesian statistical learning
* Model the joint distribution of the X's, Y and the parameter r.v.'s
* Priors: marginals on parameters
* Training: update the prior to the posterior using observed data
* Prediction: output the posterior, or some function of it (MAP)
PGMs, aka Bayes nets
* Efficient joint representation
* Independence made explicit
* Trade-off between expressiveness and need for data, easy to make
* Easy for practitioners to model
* Algorithms to fit parameters, compute marginals, posterior

Everything Starts at the Joint Distribution
All joint distributions on discrete r.v.'s can be represented as tables; #rows grows exponentially with #r.v.'s.
Example: truth tables
* M Boolean r.v.'s require 2^M - 1 rows
* The table assigns a probability per row

A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 ?

[Figure: Venn-style diagram of events A, B, C annotated with the same probabilities.]
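To make the table concrete, here is a minimal Python sketch (variable names and layout are illustrative, not from the lecture) that stores this joint as a dictionary, recovers the '?' row by normalisation, and marginalises out B and C to get Pr(A = 1).

```python
# Joint distribution over three Boolean r.v.'s A, B, C, stored as a table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25,
}
# The '?' row follows from normalisation: all rows must sum to 1.
joint[(1, 1, 1)] = 1.0 - sum(joint.values())          # ~0.10

# Marginalisation: Pr(A = 1) = sum of all rows with a = 1.
pr_a1 = sum(p for (a, b, c), p in joint.items() if a == 1)
print(round(joint[(1, 1, 1)], 2), round(pr_a1, 2))    # 0.1 0.5
```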

The Good: What we can do with the joint
Probabilistic inference from the joint on r.v.'s
* Computing any other distributions involving our r.v.'s
Pattern: want a distribution, have the joint; use Bayes rule + marginalisation.
Example: naïve Bayes classifier
* Predict class y of instance x by maximising
  Pr(Y = y | X = x) = Pr(Y = y, X = x) / Pr(X = x) = Pr(Y = y, X = x) / Σ_y' Pr(X = x, Y = y')
Recall: integration (over parameters) is the continuous equivalent of summation (both are referred to as marginalisation).
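A minimal sketch of this pattern, assuming a toy joint over a single Boolean feature and a spam/ham label (none of these numbers come from the lecture): condition on X = x by keeping the matching rows of the joint and renormalising.

```python
# Pr(Y = y | X = x) = Pr(Y = y, X = x) / sum_y' Pr(X = x, Y = y')
def posterior(joint, x):
    """joint maps (x, y) -> Pr(X = x, Y = y); returns the distribution Pr(Y | X = x)."""
    norm = sum(p for (xi, y), p in joint.items() if xi == x)      # Pr(X = x)
    return {y: p / norm for (xi, y), p in joint.items() if xi == x}

# Hypothetical joint over one Boolean feature X and a label Y.
joint_xy = {(0, 'spam'): 0.10, (0, 'ham'): 0.40,
            (1, 'spam'): 0.35, (1, 'ham'): 0.15}
print(posterior(joint_xy, 1))      # ~{'spam': 0.7, 'ham': 0.3}
```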

The Bad & Ugly: Tables waaaaay too large!!
The Bad: computational complexity
* Tables have a number of rows exponential in the number of r.v.'s
* Therefore → poor space & time to marginalise
The Ugly: model complexity
* Way too flexible
* Way too many parameters to fit → need lots of data OR will overfit
Antidote: assume independence!
(The same A, B, C joint table as above, with the '?' row.)

Example: You're late!
Modelling a tardy lecturer. Boolean r.v.'s:
* T: Trevor teaches the class
* S: It is sunny (o.w. bad weather)
* L: The lecturer arrives late (o.w. on time)
Assume: Trevor is sometimes delayed by bad weather, and Trevor is more likely to be late than the co-lecturer
* Pr(S | T) = Pr(S), Pr(S) = 0.3, Pr(T) = 0.6
Lateness is not independent of weather or lecturer
* Need Pr(L | T = t, S = s) for all combinations
Need just 6 parameters.

Pr(L = true)   T = False   T = True
S = False      0.1         0.2
S = True       0.05        0.1

Independence: not a dirty word
Lazy lecturer model:

Model                              Details                            # params
Our model with S, T independence   Pr(S, T) factors to Pr(S) Pr(T)    2
                                   Pr(L | T, S) modelled in full      4
Assumption-free model              Pr(L, T, S) modelled in full       7

Independence assumptions
* Can be reasonable in light of domain expertise
* Allow us to factor → key to tractable models

Factoring Joint Distributions
Chain rule: for any ordering of the r.v.'s we can always factor
  Pr(X_1, X_2, ..., X_k) = ∏_{i=1}^{k} Pr(X_i | X_{i+1}, ..., X_k)
A model's independence assumptions correspond to dropping conditioning r.v.'s in the factors!
* Example, unconditional independence: Pr(X_1 | X_2) = Pr(X_1)
* Example, conditional independence: Pr(X_1 | X_2, X_3) = Pr(X_1 | X_2)
* Example, independent r.v.'s: Pr(X_1, ..., X_k) = ∏_{i=1}^{k} Pr(X_i)
Simpler factors speed inference and avoid overfitting.
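The sketch below (reusing the earlier A, B, C table; purely illustrative) checks numerically that the chain-rule factorisation always reproduces the joint, while the fully independent factorisation Pr(A) Pr(B) Pr(C) only matches when the independence assumption actually holds.

```python
from itertools import product

joint = {   # the A, B, C table from the earlier slide
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def marg(positions):
    """Marginal over the variables at the given index positions (0=A, 1=B, 2=C)."""
    out = {}
    for vals, p in joint.items():
        key = tuple(vals[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out

p_a, p_b, p_c, p_bc = marg((0,)), marg((1,)), marg((2,)), marg((1, 2))
for a, b, c in product((0, 1), repeat=3):
    # chain rule: Pr(A|B,C) * Pr(B|C) * Pr(C) -- always equals the joint
    chain = (joint[(a, b, c)] / p_bc[(b, c)]) * (p_bc[(b, c)] / p_c[(c,)]) * p_c[(c,)]
    # full-independence factorisation -- only equals the joint if A, B, C are independent
    indep = p_a[(a,)] * p_b[(b,)] * p_c[(c,)]
    print((a, b, c), round(chain, 3), round(indep, 3), joint[(a, b, c)])
```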

Directed PGM
Nodes: random variables. Edges (acyclic): conditional dependence
* Node table: Pr(child | parents)
* A child directly depends on its parents
Joint factorisation:
  Pr(X_1, X_2, ..., X_k) = ∏_{i=1}^{k} Pr(X_i | X_parents(i))
Tardy lecturer example: S → L ← T, with node tables Pr(S), Pr(T), Pr(L | S, T)
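A minimal sketch of the tardy-lecturer factorisation, using the parameter values from the earlier slide (the code itself is illustrative, not part of the lecture): the joint is just the product of the three node tables, and marginals come from summing the joint.

```python
pr_S = {True: 0.3, False: 0.7}                     # Pr(S)
pr_T = {True: 0.6, False: 0.4}                     # Pr(T)
pr_L_given = {                                     # Pr(L = true | S, T)
    (False, False): 0.10, (False, True): 0.20,
    (True,  False): 0.05, (True,  True): 0.10,
}

def joint(s, t, l):
    """Pr(S = s, T = t, L = l) = Pr(S) Pr(T) Pr(L | S, T)."""
    p_l = pr_L_given[(s, t)]
    return pr_S[s] * pr_T[t] * (p_l if l else 1.0 - p_l)

# e.g. the marginal probability the lecturer is late, summing out S and T
pr_late = sum(joint(s, t, True) for s in (True, False) for t in (True, False))
print(round(pr_late, 3))                           # 0.136
```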

Example: Nuclear power plant
Core temperature → temperature gauge → alarm. Model uncertainty in monitoring failure:
* GRL: gauge reads low
* CTL: core temperature low
* FG: faulty gauge
* FA: faulty alarm
* AS: alarm sounds
PGMs to the rescue! The joint Pr(CTL, FG, FA, GRL, AS) is given by
  Pr(AS | GRL, FA) Pr(GRL | CTL, FG) Pr(CTL) Pr(FG) Pr(FA)
[Figure: Bayes net with CTL and FG as parents of GRL, and GRL and FA as parents of AS.]
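A sketch of the same factorisation in code, with made-up conditional probability tables (the lecture gives only the structure, so every number below is a hypothetical placeholder); any query, e.g. Pr(AS = 1), can then be answered by brute-force enumeration of the joint.

```python
from itertools import product

pr_CTL, pr_FG, pr_FA = 0.9, 0.05, 0.02                             # hypothetical priors
pr_GRL = {(0, 0): 0.05, (0, 1): 0.50, (1, 0): 0.95, (1, 1): 0.50}  # Pr(GRL=1 | CTL, FG)
pr_AS  = {(0, 0): 0.01, (0, 1): 0.00, (1, 0): 0.90, (1, 1): 0.00}  # Pr(AS=1 | GRL, FA)

def bern(p, v):
    """Pr(V = v) for a Boolean variable with Pr(V = 1) = p."""
    return p if v else 1.0 - p

def joint(ctl, fg, fa, grl, alarm):
    """Pr(AS | GRL, FA) Pr(GRL | CTL, FG) Pr(CTL) Pr(FG) Pr(FA)."""
    return (bern(pr_CTL, ctl) * bern(pr_FG, fg) * bern(pr_FA, fa)
            * bern(pr_GRL[(ctl, fg)], grl) * bern(pr_AS[(grl, fa)], alarm))

# Query by enumeration: Pr(AS = 1), summing out the other four variables.
pr_alarm = sum(joint(*v, 1) for v in product((0, 1), repeat=4))
print(round(pr_alarm, 4))
```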

Naïve Bayes
Y ~ Bernoulli(θ); X_i | Y ~ Bernoulli(θ_{i,Y}). (Aside: Bernoulli is just Binomial with count = 1.)
  Pr(Y, X_1, ..., X_d) = Pr(X_1, ..., X_d, Y)
                       = Pr(X_1 | Y) Pr(X_2 | X_1, Y) ... Pr(X_d | X_1, ..., X_{d-1}, Y) Pr(Y)
                       = Pr(X_1 | Y) Pr(X_2 | Y) ... Pr(X_d | Y) Pr(Y)
Prediction: predict the label maximising Pr(Y | X_1, ..., X_d)
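A minimal sketch of this factorisation with made-up Bernoulli parameters (d = 3 features; nothing here is from the lecture): the joint is Pr(Y) times a product of per-feature factors, and prediction takes the label with the larger joint, which has the same argmax as Pr(Y | X).

```python
theta_y = 0.4                                          # Pr(Y = 1)
theta = {0: [0.2, 0.7, 0.1], 1: [0.8, 0.6, 0.3]}       # theta[y][i] = Pr(X_i = 1 | Y = y)

def joint(y, x):
    """Pr(Y = y, X = x) = Pr(Y = y) * prod_i Pr(X_i = x_i | Y = y)."""
    p = theta_y if y else 1.0 - theta_y
    for t, xi in zip(theta[y], x):
        p *= t if xi else 1.0 - t
    return p

def predict(x):
    # Label maximising Pr(Y | X): same argmax as the joint, since Pr(X) is constant in y.
    return max((0, 1), key=lambda y: joint(y, x))

print(predict([1, 1, 0]))                              # -> 1
```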

Short-hand for repeats: plate notation
[Figure: the d separate nodes X_1, ..., X_d are drawn as a single node X_i inside a plate labelled i = 1..d; likewise the naïve Bayes network with parent Y and children X_1, ..., X_d is drawn as Y with one plated child X_i, i = 1..d.]

PGMs: frequentist OR Bayesian
PGMs represent joints, which are central to Bayesian statistical learning. The catch is that Bayesians add a node per parameter, with that node's table being the parameter's prior.
* Y ~ Bernoulli(θ), X_i | Y ~ Bernoulli(θ_{i,Y})
* Priors: θ ~ Beta, and each θ_{i,y} ~ Beta
[Figure: the naïve Bayes network with extra parent nodes θ for Y and θ_{1,0}, θ_{1,1}, ..., θ_{d,0}, θ_{d,1} for X_1, ..., X_d.]
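As a rough illustration of "a node per parameter" (all priors and sizes below are arbitrary choices, not from the lecture), ancestral sampling of the Bayesian naïve Bayes model first draws the parameters from their Beta priors, then draws Y and the X_i given those parameters.

```python
import numpy as np
rng = np.random.default_rng(0)

d, n = 3, 5
theta = rng.beta(1.0, 1.0)                      # theta       ~ Beta(1, 1), gives Pr(Y = 1)
theta_xy = rng.beta(1.0, 1.0, size=(d, 2))      # theta_{i,y} ~ Beta(1, 1), gives Pr(X_i = 1 | Y = y)

# Ancestral sampling from the joint over parameters, Y and X_1..X_d.
for _ in range(n):
    y = int(rng.random() < theta)                       # Y   ~ Bernoulli(theta)
    x = (rng.random(d) < theta_xy[:, y]).astype(int)    # X_i ~ Bernoulli(theta_{i,Y})
    print(y, x)
```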

Example PGMs
The hidden Markov model (HMM); the lattice Markov random field (MRF)

The HMM (and Kalman Filter)
Sequential observed outputs from hidden states. The Kalman filter is the same model with continuous Gaussian r.v.'s. A CRF is the undirected analogue.
[Figure: chain of hidden states q_1, q_2, q_3, ..., q_i, each emitting an observation o_1, o_2, o_3, ..., o_i.]

HMM Applications
* NLP part-of-speech tagging: given the words in a sentence, infer the hidden parts of speech, e.g. "I love Machine Learning" → noun, verb, noun, noun
* Speech recognition: given a waveform, determine the phonemes
* Biological sequences: classification, search, alignment
* Computer vision: identify who's walking in a video, tracking

Fundamental HMM Tasks
* Evaluation. Given an HMM μ and observation sequence O, determine the likelihood Pr(O | μ). (PGM task: probabilistic inference)
* Decoding. Given an HMM μ and observation sequence O, determine the most probable hidden state sequence Q. (PGM task: MAP point estimate)
* Learning. Given an observation sequence O and a set of states, learn the parameters A, B, Π. (PGM task: statistical inference)
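For the Evaluation task, the forward algorithm computes Pr(O | μ) by dynamic programming over the chain. Below is a minimal NumPy sketch with toy parameters A, B, Π (two states, two output symbols; all values invented for illustration).

```python
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = Pr(q_{t+1} = j | q_t = i)
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, o] = Pr(o_t = o | q_t = i)
Pi = np.array([0.6, 0.4])                 # Pi[i]   = Pr(q_1 = i)

def likelihood(obs):
    """Forward algorithm: returns Pr(O | mu) for the observation sequence `obs`."""
    alpha = Pi * B[:, obs[0]]                 # alpha_1(i) = Pi_i * B_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_{t+1}(j) = (sum_i alpha_t(i) A_ij) * B_j(o_{t+1})
    return alpha.sum()                        # Pr(O | mu) = sum_i alpha_T(i)

print(likelihood([0, 1, 1, 0]))
```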

Computer Vision
Hidden square-lattice Markov random fields

Pixel labelling tasks in Computer Vision
* Semantic labelling (Gould et al. 09)
* Interactive figure-ground segmentation (Boykov & Jolly 2011)
* Denoising (Felzenszwalb & Huttenlocher 04)

What these tasks have in common
Hidden state representing the semantics of the image
* Semantic labelling: cow vs. tree vs. grass vs. sky vs. house
* Figure-ground segmentation: figure vs. ground
* Denoising: clean pixels
Pixels of the image
* What we observe of the hidden state
Remind you of HMMs?

A hidden square-lattice MRF
Hidden states: square-lattice model
* Boolean for two-class states
* Discrete for multi-class
* Continuous for denoising
Pixels: observed outputs
* Continuous, e.g. Normal
[Figure: square lattice of hidden nodes q_ij, each with an observed pixel o_ij.]

Application to sequences: CRFs
Conditional Random Field: the same model applied to sequences
* Observed outputs are words, speech, amino acids, etc.
* States are tags: part-of-speech, phone, alignment
CRFs are discriminative, modelling P(Q | O)
* versus HMMs, which are generative, modelling P(Q, O)
* undirected PGMs are more general and expressive
[Figure: chain of states q_1, q_2, q_3, ..., q_i with observations o_1, o_2, o_3, ..., o_i.]

Summary
Probabilistic graphical models
* Motivation: applications; unifies algorithms
* Motivation: ideal tool for Bayesians
* Independence lowers computational/model complexity
* PGMs: compact representation of factorised joints