Pattern Recognition
Kjell Elenius
Speech, Music and Hearing, KTH
Speech recognition 2007
March 29, 2007

Ch 4. Pattern Recognition 1 (3)
- Bayes Decision Theory
  - Minimum-Error-Rate Decision Rules
  - Discriminant Functions
- How to Construct Classifiers
  - Gaussian Classifiers
  - The Curse of Dimensionality
  - Estimating the Error Rate
  - Comparing Classifiers (McNemar's test)

Pattern Recognition 2 (3)
- Discriminative Training
  - Maximum Mutual Information Estimation
  - Minimum-Error-Rate Estimation
  - Neural networks
- Unsupervised Estimation Methods
  - Vector Quantization
  - The K-Means Algorithm
  - The EM Algorithm
  - Multivariate Gaussian Mixture Density Estimation

Pattern Recognition 3 (3)
- Classification and Regression Trees (CART)
  - Choice of question set
  - Splitting criteria
  - Growing the tree
  - Missing values and conflict resolution
  - Complex questions
  - The Right-Sized Tree

4.1.1 Minimum-Error-Rate Decision Rules
Bayes decision rule: the decision is based on choosing the candidate that maximizes the posterior probability, which results in the minimum decision error:

k = \arg\max_i P(\omega_i \mid \mathbf{x}) = \arg\max_i p(\mathbf{x} \mid \omega_i) P(\omega_i)

4.1.2 Discriminant Functions
- The decision problem viewed as a classification problem: classify unknown data into one of s known categories, using s discriminant functions
- Minimum-error-rate classifier: maximize the a posteriori probability (Bayes decision rule)
- For the two-class problem, use the likelihood ratio

  l(\mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)}

  compared against the threshold T = P(\omega_2) / P(\omega_1):
  l(\mathbf{x}) > T : decide \omega_1
  l(\mathbf{x}) < T : decide \omega_2
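As an illustration of the two-class rule above, here is a minimal Python sketch of a likelihood-ratio classifier. The one-dimensional Gaussian class-conditional densities and the priors below are made-up example values, not anything from the slides:

```python
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) and priors.
p1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # p(x | w1), assumed
p2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # p(x | w2), assumed
P1, P2 = 0.7, 0.3                                # priors P(w1), P(w2), assumed

def classify(x):
    """Likelihood-ratio test: decide w1 if l(x) > T, else w2."""
    l = p1(x) / p2(x)          # likelihood ratio l(x)
    T = P2 / P1                # threshold from the priors
    return "w1" if l > T else "w2"

print(classify(0.5), classify(1.8))   # -> w1 w2
```

Comparing l(x) against T = P(ω2)/P(ω1) is the same decision as maximizing the posterior p(x|ω_i)P(ω_i) over the two classes.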

Discriminant Functions
Fig 4.2: A classifier based on discriminant functions

Fig 4.3: Decision boundaries

4.2.1 Gaussian Classifiers
The class-conditional probability density is assumed to have a Gaussian distribution:

p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\mathsf T} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)

This density determines the decision boundary.
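A minimal sketch (not from the slides) of how such a Gaussian classifier can be evaluated in practice, working with log densities for numerical stability; the class means, covariances and priors are assumed to have been estimated elsewhere:

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log of the multivariate Gaussian density given above."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def gaussian_classify(x, means, covs, priors):
    """Pick the class with the largest log p(x|w_i) + log P(w_i)."""
    scores = [log_gaussian(x, m, S) + np.log(p)
              for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```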

4.2.2 The Curse of Dimensionality
- In theory, more features (higher dimensions or more parameters in the density function) lead to a lower classification error rate
- In practice, they may lead to worse results because there is too little training data
- This paradox is called the curse of dimensionality
- Fig 4.6: Curve fitting
- Fig 4.7: Phoneme classification

The Curse of Dimensionality: Curve Fitting
First-order (linear), second-order and 10th-order polynomial fits to the data

The Curse of Dimensionality: Two-Phoneme Classification
Error rate as a function of the number of Gaussian mixture densities and the number of training samples (2 to 10000)

4.2.3 Estimating the Error Rate
- Computing it from a parametric model is problematic (error under-estimation, bad model assumptions, very difficult)
- The recognition error on the training data is only a lower bound (warning!)
- Use independent test data
- How to partition the available speech data:
  - Holdout method
  - V-fold cross validation (leave-one-out method); see the sketch below
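A small sketch of the V-fold partitioning idea; the `train_and_test` callback is a hypothetical user-supplied function (train a classifier on the first pair of arrays, return the error rate on the second pair), not something defined in the slides:

```python
import numpy as np

def v_fold_error(data, labels, train_and_test, v=10, rng=np.random.default_rng(0)):
    """V-fold cross validation: train on v-1 folds, test on the held-out fold."""
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, v)
    errors = []
    for i in range(v):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(v) if j != i])
        errors.append(train_and_test(data[train_idx], labels[train_idx],
                                     data[test_idx], labels[test_idx]))
    return float(np.mean(errors))   # average error over the v held-out folds
```

With v equal to the number of samples this reduces to the leave-one-out method mentioned above.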

Is Algorithm/System A Better Than B?
Compare results on the same test data.
- McNemar Test: compares classification results (next slide)
- Sign Test: may be used if the results are considered as matched pairs; it only measures whether A or B is better
- Magnitude Difference Test: measures how much better A or B is

4.2.4 Comparing Classifiers: McNemar's Test
Compares two classifiers by looking at the samples where only one of them made an error.

                 Q1 correct   Q1 incorrect
Q2 correct          N00           N01
Q2 incorrect        N10           N11

With n = N01 + N10, the off-diagonal count N01 has the binomial distribution B(n, 1/2) under the null hypothesis that the classifiers have the same error rate. Test this null hypothesis (z-test).
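A small sketch of McNemar's test using the exact binomial form; the z-test on the slide is the normal approximation to the same B(n, 1/2) distribution. `scipy.stats.binomtest` is assumed available (SciPy 1.7 or later):

```python
from scipy.stats import binomtest

def mcnemar(n01, n10):
    """Two-sided McNemar test from the off-diagonal counts.

    n01: samples where only one classifier was correct (one direction)
    n10: samples where only the other classifier was correct
    Under H0 (equal error rates), n01 ~ Binomial(n01 + n10, 1/2).
    """
    n = n01 + n10
    return binomtest(n01, n, p=0.5).pvalue

# Hypothetical counts: a small p-value suggests different error rates.
print(mcnemar(15, 5))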

Confidence
- The true result is within an interval around the measured value with a certain probability (confidence level and confidence interval)
- Doddington's Rule of 30: to be 90 percent confident that the true error rate is within +/- 30 percent of the observed error rate, there must be at least 30 errors

Confidence Intervals
Figure 4.8: 95% confidence intervals for classification error rate estimation with the normal test; n = 10, 15, 20, 30, 50, 100, 250

4.3 Discriminative Training
- Maximum Likelihood Estimation models each class separately, independently of the other classes
- Discriminative training aims at models that maximize the discrimination between the classes:
  - Maximum Mutual Information Estimation (MMIE)
  - Minimum-Error-Rate Estimation
  - Neural networks

4.3.1 Maximum Mutual Information Estimation (MMIE)
Discriminative criterion: for each model to estimate, find a parameter setting that maximizes the ratio between the probability of that model and the sum over all models, i.e. maximize the posterior probability of the correct class i:

\max \; \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{\sum_k p(\mathbf{x} \mid \omega_k) P(\omega_k)}

- Gives a different result than MLE: MLE maximizes the numerator only
- Theoretically appealing but computationally expensive, since every sample is used for all classes
- Use a gradient descent algorithm
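To make the contrast with MLE concrete, a small sketch of the two per-sample objectives in log form; the arrays of log-likelihoods and log-priors are assumed to come from some existing class models:

```python
import numpy as np

def mle_score(log_lik, log_prior, k):
    """MLE-style criterion: only the correct class's own term (the numerator)."""
    return log_lik[k] + log_prior[k]

def mmie_score(log_lik, log_prior, k):
    """MMIE criterion: log posterior of the correct class k,
    log [ p(x|w_k) P(w_k) / sum_i p(x|w_i) P(w_i) ]."""
    joint = np.asarray(log_lik) + np.asarray(log_prior)
    return joint[k] - np.logaddexp.reduce(joint)
```

Raising the MMIE score requires both strengthening the correct class and weakening the competing classes, which is why every sample affects all class models.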

4.3.2 Minimum-Error-Rate Estimation
- Also called Minimum-Classification-Error (MCE) training, discriminative training, ...
- Iterative procedure (gradient descent): re-estimate the models, classify, improve the models of correctly recognized classes and suppress those of mis-recognized classes
- Computationally intensive; used for few classes
- Corrective training
  - A simpler and faster error-correcting procedure
  - Move the parameters of the correct class towards the training data
  - Move the parameters of the near-miss class away from the training data
  - Good results

4.3.3 Neural Networks
- Inspired by nerve cells in biological nervous systems
- Many simple processing elements connected into a complex network
- Single-Layer Perceptron (Fig. 4.10)
- Multi-Layer Perceptron (MLP) (Fig. 4.11)
- Back-propagation training

Artificial Neural Network (ANN)

4.3.3 Multi-Layer Perceptron
Output of the sigmoid units:

\mathbf{y} = \mathrm{sigmoid}(\mathbf{W}' \mathbf{h}_2), \qquad
y_j = \frac{1}{1 + e^{-\left( w_{0j} + \sum_i w_{ij} h_{2i} \right)}}
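A minimal sketch of the forward pass implied by the formula above, for a network with one hidden layer of sigmoid units; the weight matrix and bias names are mine, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """Applies y_j = 1 / (1 + exp(-(w_0j + sum_i w_ij * h_i))) layer by layer."""
    h = sigmoid(W1 @ x + b1)   # hidden-layer activations (h_2 in the slide)
    y = sigmoid(W2 @ h + b2)   # output layer
    return y
```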

The Back-Propagation Algorithm

4.4 Unsupervised Estimation Methods
- Vector Quantization
- The EM Algorithm
- Multivariate Gaussian Mixture Density Estimation

4.4.1 Vector Quantization (VQ)
- Described by a codebook: a set of prototype vectors (codewords)
- An input vector is replaced by the index of the codeword with the smallest distortion
- Distortion measures:
  - Euclidean (sum of squared error)
  - Mahalanobis distance (the exponential term in the Gaussian density function)
- Codebook generation algorithms:
  - The K-Means Algorithm
  - The LBG Algorithm
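A minimal sketch of VQ encoding and decoding with a Euclidean distortion measure; the codebook itself is assumed to have been trained already, e.g. with K-means or LBG as on the following slides:

```python
import numpy as np

def vq_encode(x, codebook):
    """Replace an input vector by the index of the codeword with the
    smallest Euclidean distortion."""
    distortions = ((codebook - x) ** 2).sum(axis=1)
    return int(np.argmin(distortions))

def vq_decode(index, codebook):
    """Reconstruct the prototype vector from its index."""
    return codebook[index]
```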

Vector Quantization
Partitioning of a two-dimensional space into 16 cells

The K-Means Algorithm
1. Choose an initial division of the training vectors among the codewords
2. Classify each training vector into one of the cells by choosing the closest codeword
3. Update all codewords by computing the centroids of the training vectors in each cell
4. Repeat steps 2 and 3 until the ratio between the current and previous codebook distortion is above a preset threshold (i.e. until the improvement has become small)

Comments:
- Converges to a local optimum
- The initial choice is critical
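A minimal NumPy sketch of the four steps above. Using random training vectors for the initial codebook is one common choice; the slide does not prescribe an initialization:

```python
import numpy as np

def kmeans_codebook(train, n_codewords, threshold=0.999,
                    rng=np.random.default_rng(0)):
    """K-means codebook training following the four steps on the slide."""
    # 1. Initial codebook: random training vectors (an assumed choice)
    codebook = train[rng.choice(len(train), n_codewords, replace=False)].copy()
    prev_distortion = np.inf
    while True:
        # 2. Classify each training vector to the closest codeword
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        distortion = d2[np.arange(len(train)), labels].sum()
        # 3. Update codewords as centroids of their cells
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = train[labels == k].mean(axis=0)
        # 4. Stop when the distortion ratio exceeds the preset threshold
        if distortion / prev_distortion > threshold:
            break
        prev_distortion = distortion
    return codebook, labels
```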

The LBG Algorithm
1. Initialization: set the number of cells M = 1 and find the centroid of all training data
2. Splitting: split the M cells into 2M by finding two distant points in each cell and setting these as the centroids of the 2M new cells
3. K-means stage: use the K-means algorithm to adjust the centroids for minimum distortion
4. Termination: if M equals the required codebook size, stop; otherwise go to step 2
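A compact sketch of LBG that reuses a K-means refinement step. For step 2 it uses the common perturbation-based split (scaling each centroid slightly up and down) rather than literally searching for two distant points in each cell, so it is an assumed variant of the procedure on the slide:

```python
import numpy as np

def kmeans_refine(data, codebook, n_iter=20):
    """K-means refinement of a given codebook."""
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for k in range(len(codebook)):
            if np.any(labels == k):            # keep old codeword if cell empty
                codebook[k] = data[labels == k].mean(axis=0)
    return codebook

def lbg(data, codebook_size, eps=1e-2):
    """LBG: start with one centroid, split to 2M, refine, repeat."""
    codebook = data.mean(axis=0, keepdims=True)      # step 1: M = 1
    while len(codebook) < codebook_size:
        codebook = np.vstack([codebook * (1 + eps),  # step 2: split M -> 2M
                              codebook * (1 - eps)])
        codebook = kmeans_refine(data, codebook)     # step 3: K-means stage
    return codebook                                  # step 4: stop at target size
```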

4.4.2 The Expectation Maximization (EM) Algorithm
- Used for training hidden Markov models
- A generalisation of Maximum-Likelihood Estimation
- Problem approached: estimate the (ML) distributions of several classes when the training data is not classified (e.g. into states of the models). Is it possible to train the classes anyway? (Yes, to a local maximum.)
- Simplified iterative procedure (similar to the K-means procedure for VQ):
  1. Initialise the class distributions
  2. Using the current parameters, compute the class probability for each training sample
  3. Each sample updates each class distribution according to its probability weights; the resulting maximum-likelihood estimates of the distributions replace the current ones
  4. Repeat steps 2 and 3 until convergence (it will converge)

Simplified Illustration of EM Estimation
Say three paths have been found in a training utterance, passing through state s2 at times t1, t2 and t3, respectively. The probabilities of the state sequences for the initial HMM are 0.13, 0.35 and 0.22. The new expected value for state s2 becomes

New E(s_2) = (0.13\, X(t_1) + 0.35\, X(t_2) + 0.22\, X(t_3)) / 0.70

where 0.70 = 0.13 + 0.35 + 0.22. Not as simple as it may look, though!

4.4.3 Multivariate Gaussian Mixture Density Estimation
The probability density is a weighted sum of Gaussians:

p(\mathbf{y} \mid \Phi) = \sum_{k=1}^{K} c_k\, p_k(\mathbf{y} \mid \Phi_k) = \sum_{k=1}^{K} c_k\, N(\mathbf{y}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where c_k is the probability of component k, c_k = P(X = k).

Analogy between Gaussian mixtures (GM) and VQ:
- EM algorithm vs. K-means algorithm
- VQ minimizes the codebook distortion; GM maximizes the likelihood of the observed data
- VQ performs hard assignment; EM performs soft assignment
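A minimal EM sketch for such a mixture, showing the soft-assignment E-step and the weighted ML M-step. The initialisation choices and the small covariance regularisation term are my assumptions, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(y, K, n_iter=50, rng=np.random.default_rng(0)):
    """EM for a K-component Gaussian mixture with soft assignments."""
    n, d = y.shape
    # Initialise: equal weights, random means, shared sample covariance
    c = np.full(K, 1.0 / K)
    mu = y[rng.choice(n, K, replace=False)].copy()
    Sigma = np.stack([np.cov(y.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: posterior (soft) assignment of each sample to each component
        p = np.stack([c[k] * multivariate_normal.pdf(y, mu[k], Sigma[k])
                      for k in range(K)], axis=1)          # shape (n, K)
        gamma = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted ML re-estimation of c_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)
        c = Nk / n
        mu = (gamma.T @ y) / Nk[:, None]
        for k in range(K):
            diff = y - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return c, mu, Sigma
```

Replacing the soft weights gamma with a hard argmax assignment and dropping the covariances essentially recovers the K-means procedure, which is the analogy on the slide.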

Partitioning Space into Gaussian Density Functions

4.5 Classification And Regression Trees (CART)
- A binary decision tree
- An automatic, data-driven framework for constructing a decision process based on objective criteria
- Handles data samples with mixed types and nonstandard structures
- Handles missing data; robust to outliers and mislabeled data samples
- Used in speech recognition for model tying

Binary Decision Tree for Height Classification

Steps in Constructing a CART
1. Find a set of questions
2. Put all training samples in the root node
3. Recursive algorithm: find the best combination of question and node, split that node into two new nodes, move the corresponding data into the new nodes, and repeat until the right-sized tree is obtained

Comments:
- Greedy algorithm, only locally optimal: splits are chosen without regard to subsequent splits
- Dynamic programming would help but is computationally heavy
- Works well in practice

4.5.1 Choice of Question Set
- Can be manually selected
- Automatic procedure: simple (singleton) or complex questions
  - Simple questions ask about a single variable
  - Discrete variable questions: "Does x_i belong to set S?", where S is any possible subset of the values seen in the training samples
  - Continuous variable questions: "Is x_i <= c_n?", where c_n is a midpoint between two training sample values

4.5.2 Splitting Criteria
Find the pair of node and question whose split gives:
- Discrete variables: the maximum reduction in entropy,

  \Delta H_t(q) = H_t(Y) - \left( H_l(Y) + H_r(Y) \right)

- Continuous variables: the maximum gain in likelihood,

  \Delta L_t(q) = L_1(X_1 \mid N_1) + L_2(X_2 \mid N_2) - L_X(X \mid N)

- For regression purposes: the largest reduction in squared error from a regression of the data in the node
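A small sketch of the entropy-based criterion for one candidate binary question, using the usual sample-weighted child entropies (the weighting convention is my assumption; the slide leaves it implicit):

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(Y) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_reduction(labels, answers):
    """Entropy reduction of a candidate binary question q.

    `answers` is a boolean array: True sends a sample to the left child.
    Child entropies are weighted by the fraction of samples they receive.
    """
    n = len(labels)
    left, right = labels[answers], labels[~answers]
    h_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - h_children
```

The growing procedure on the next slide stops splitting a node when the best such reduction falls below a threshold.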

4.5.3 Growing the Tree
Stop growing a node when either:
- all samples in the node belong to the same class,
- the greatest entropy reduction falls below a threshold, or
- the number of data samples in the node is too small.

4.5.4 Missing Values and Conflict Resolution
- Missing values in the input: handle with surrogate question(s)
- Conflict resolution: two questions may achieve the same entropy reduction and the same partitioning, where one question may be a sub-question of the other; select the sub-question (since it is more specific)

4.5.5 Complex Questions
- Problem: simple (one-variable) questions may result in similar leaves in different locations, over-fragmenting the data
- Solution: form composite questions from all possible combinations of the simple-question leaf nodes in the tree, using all AND and OR combinations, and select the best composite question

4.5.6 The Right-Sized Tree
- Too many splits improve classification on the training data but degrade it on the test data (curse of dimensionality)
- Use a pruning strategy to gradually cut back the overgrown tree until the minimum misclassification on the test data is achieved
- Minimum Cost-Complexity Pruning produces a sequence of trees with increasing amounts of pruning; select the best tree using either of:
  - Independent test sample estimation (fixed test data)
  - V-fold cross validation (train on v-1 parts, test on 1, and rotate)