Alex Waibel


Alex Waibel 1 16.11.2011

Organisation Literature: Introduction to The Theory of Neural Computation, Hertz, Krogh, Palmer, Santa Fe Institute; Neural Network Architectures: An Introduction, Judith Dayhoff, VNR Publishers; other publications TBD; papers on-line. Office hours: Tuesday, 12:00, or by appointment via the secretariat. Link in the study portal and at http://isl.anthropomatik.kit.edu -> lectures -> neuronale netze 2 16.11.2011

Neural Nets Terminology: Artificial Neural Networks Connectionist Models Parallel Distributed Processing (PDP) Massive Parallel Processing Multi-layer Perceptron (MLP) 3 16.11.2011

The brain is: ~ 1000 supercomputers for some tasks, ~ 1/10 of a pocket calculator for others. The brain is very good at some problems: vision, speech, language, motor control. It is very poor at others: arithmetic 4 16.11.2011

Von Neumann Computer - Neural Computation Processing: Sequential - Parallel Processors: One - Many Interaction: None - A lot Communication: Poor - Rich Processors: Fast, Accurate - Slow, Sloppy Knowledge: Local - Distributed Hardware: General Purpose - Dedicated Design: Programmed - Learned 5 16.11.2011

Why Neural Networks Massive parallelism. Massive constraint satisfaction for ill-defined input. Simple computing units. Many processing units, many interconnections. Uniformity (-> sensor fusion) Non-linear classifiers/ mapping (-> good performance) Learning/ adapting Brain like?? 6 16.11.2011

History McCulloch & Pitts, 1943 Perceptrons, Rosenblatt ~1960 Minsky, 1967 PDP, Neuro Renaissance ~1985 Since then Bio, Neuro, Brain-Science Machine Learning, Statistics Applications, Part of the Toolbox 7 16.11.2011

Using Neural Nets Classification Prediction Function Approximation Continuous Mapping Pattern Completion Coding 8 16.11.2011

Design Criteria Recognition Error Rate Training Time Recognition Time Memory Requirements Training Complexity Ease of Implementation Ease of Adaptation 9 16.11.2011

Neural Models Back-Propagation Boltzmann Machines Decision Tree Classifiers Feature Map Classifiers Learning Vector Quantizer (LVQ, LVQ2) High Order Networks Modified Nearest Neighbor Radial Basis Functions Kernels SVM etc. 10 16.11.2011

Network Specification What parameters are typically chosen by the network designer: Net Topology Node Characteristics Learning Rule Objective Function (Initial) Weights Learning Parameters 11 16.11.2011

Applications Space Robot* Autonomous Navigation* Speech Recognition and Understanding* Natural Language Processing* Music* Gesture Recognition Lip Reading Face Recognition Household Robots Signal Processing Banking, Bond Rating,... Sonar etc... 12 16.11.2011

Advanced Neural Models Time-Delay Neural Networks (Waibel) Recurrent Nets (Elman, Jordan) Higher Order Nets Modular System Construction Adaptive Architectures Hybrid Neural/Non-Neural Architectures 13 16.11.2011

Neural Nets - Design Problems Local Minima Speed of Learning Architecture must be selected Choice of Feature Representation Scaling Systems, Modularity Treatment of Temporal Features and Sequences 14 16.11.2011

Neural Nets - Modeling Neural Nets are Non-Linear Classifiers Approximate Posterior Probabilities Non-Parametric Training The Problem with Time How to Consider Context How to Operate Shift-Invariantly How to Process Pattern Sequences 15 16.11.2011

Pattern Recognition Overview Static Patterns, no dependence on Time or Sequential Order Important Notions Supervised - Unsupervised Classifiers Parametric - Non-Parametric Classifiers Linear - Non-linear Classifiers Classical Methods Bayes Classifier K-Nearest Neighbor Connectionist Methods Perceptron Multilayer Perceptrons 16 16.11.2011

Pattern Recognition 17 16.11.2011

Supervised - Unsupervised Supervised training: Class to be recognized is known for each sample in training data. Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!). Unsupervised training: Class is not known and structure is to be discovered automatically. Feature-space-reduction example: clustering, autoassociative nets 18 16.11.2011

Unsupervised Classification Classification: Classes Not Known: Find Structure (figure: samples in feature space F1-F2) 19 16.11.2011

Unsupervised Classification Classification: Classes Not Known: Find Structure Clustering How? How many? (figure: samples in feature space F1-F2) 20 16.11.2011

Supervised Classification Classification: Classes Known: Creditworthiness: Yes-No Features: Income, Age Classifiers (figure: credit-worthy vs. non-credit-worthy samples plotted over Income and Age) 21 16.11.2011

Classification Problem Is Joe credit-worthy? Features: age, income Classes: creditworthy, non-creditworthy Problem: Given Joe's income and age, should a loan be made? Other Classification Problems: Fraud Detection, Customer Selection... (figure: credit-worthy vs. non-credit-worthy samples over Income and Age) 22 16.11.2011

Classification Problem (figure: decision boundary separating credit-worthy from non-credit-worthy samples over Income and Age) 23 16.11.2011

Parametric - Non-parametric Parametric: assume underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian Classifier" Non-parametric: Don't assume distribution. Estimate probability of error or error criterion directly from training data. Examples: Parzen Window, k-nearest neighbor, perceptron... 24 16.11.2011

Bayes Rule: P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x), where P(ω_i) is the a priori probability, P(ω_i | x) is the a posteriori probability after observation of x, and p(x | ω_i) is the class-conditional probability density. 25 16.11.2011

Decide ω_1 if P(ω_1 | x) > P(ω_2 | x), else decide ω_2; the error is then minimized. Equivalently: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), otherwise decide ω_2. For the multicategory case: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i. 26 16.11.2011


Assign x to class ω_i if p(x | ω_i) P(ω_i) > p(x | ω_j) P(ω_j) for all j ≠ i; the evidence p(x) is independent of the class i. p(x | ω_i): class-conditional probability density function; P(ω_i): a priori probability. 29 16.11.2011


Need the a priori probabilities P(ω_i) (not too bad). Need the class-conditional PDFs p(x | ω_i). Problems: limited training data, limited computation, class-labelling potentially costly and error-prone, classes may not be known, good features not known. Parametric Solution: Assume that p(x | ω_i) has a particular parametric form. Most common representative: multivariate normal density 32 16.11.2011

Univariate Normal Density: p(x) = 1 / (√(2π) σ) · exp( −(x − µ)² / (2σ²) ), i.e. p(x) ~ N(µ, σ²). Multivariate Density: p(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ). Discriminant function: g_i(x) = −½ (x − µ_i)ᵀ Σ_i⁻¹ (x − µ_i) − (d/2) log(2π) − ½ log|Σ_i| + log P(ω_i) 33 16.11.2011

(Figures: class distributions and decision regions in feature space F1, F2) 34-36 16.11.2011

For each class i, need to estimate from training data: the mean vector µ_i and the covariance matrix Σ_i 37 16.11.2011

MLE, Maximum Likelihood Estimation For Multivariate Case: 38 16.11.2011
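As a hedged illustration of these ML estimates and of the resulting Gaussian discriminant from slide 33 (a sketch, not code from the lecture; the function names and array layout are assumptions):

```python
import numpy as np

def ml_estimate(samples):
    """Maximum-likelihood estimates of the mean vector and covariance matrix."""
    mu = samples.mean(axis=0)
    diff = samples - mu
    sigma = diff.T @ diff / len(samples)   # ML estimate divides by n, not n - 1
    return mu, sigma

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - d/2 log(2 pi) - 1/2 log|Sigma| + log P(omega_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

def classify(x, class_samples, priors):
    """Assign x to the class with the largest discriminant value."""
    scores = []
    for samples, prior in zip(class_samples, priors):
        mu, sigma = ml_estimate(samples)
        scores.append(gaussian_discriminant(x, mu, sigma, prior))
    return int(np.argmax(scores))
```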


Features: What and how many features should be selected? Any features? The more the better? If additional features not useful (same mean and covariance), classifier will automatically ignore them? 43 16.11.2011

Generally, adding more features indiscriminately leads to worse performance! Reason: Training Data vs. Number of Parameters Limited training data. Solution: select features carefully reduce dimensionality Principal Component Analysis 44 16.11.2011

Assumption: Single dimensions are correlated Aim: Reduce number of dimensions with minimum loss of information 45 16.11.2011

Find the axis along which the variance is highest 46 16.11.2011

Rotate the space to align with this axis 47 16.11.2011

Dimensions are uncorrelated now 48 16.11.2011

Remove dimensions with low variance => Reduction of dimensionality with minimum loss of information 49 16.11.2011
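A minimal PCA sketch along these lines, assuming the data is an n x d NumPy array and that k components are kept (illustrative only, not from the slides):

```python
import numpy as np

def pca(data, k):
    """Project data (n samples x d dims) onto the k directions of highest variance."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / len(data)    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for the symmetric covariance matrix
    order = np.argsort(eigvals)[::-1]          # sort axes by decreasing variance
    components = eigvecs[:, order[:k]]         # keep the k principal axes
    return centered @ components               # rotated, dimensionality-reduced representation
```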

The normal distribution does not model this situation well; other densities may be mathematically intractable. -> non-parametric techniques 50 16.11.2011

No assumptions about the distribution are made; estimate p(x) directly from the data. Choose a window of volume V and count the number of samples that fall inside the window: p(x) ≈ (k/n) / V, with k = count, n = number of samples. 51 16.11.2011

Problem: Volume too large -> lose resolution; volume too small -> erratic, poor estimate 52 16.11.2011

Idea: Let the volume be a function of the data: include the k nearest neighbors in the estimate. k-nearest-neighbor rule for classification: To classify sample x, find the k nearest neighbors of x, determine the class most frequently represented among those k samples (take a vote), and assign x to that class. (Figure: k = 9 neighborhood with 7 vs. 2 votes: how to classify o?) 53 16.11.2011
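A compact sketch of this voting rule (illustrative; the Euclidean distance and the array layout are assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_x, train_y, k=9):
    """Assign x the label most frequent among its k nearest training samples."""
    distances = np.linalg.norm(train_x - x, axis=1)   # Euclidean distance to all training samples
    nearest = np.argsort(distances)[:k]               # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)      # count class labels (take a vote)
    return votes.most_common(1)[0][0]
```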

For a finite number of samples n, we want k to be: large, for a reliable estimate; small, to guarantee that all k neighbors are reasonably close. => Need the training database to be large. "There is no data like more data." 54 16.11.2011

Pattern Recognition 55 16.11.2011

Unsupervised Classification Classification: Classes Not Known: Find Structure (figure: samples in feature space F1-F2) 56 16.11.2011

Unsupervised Classification Classification: Classes Not Known: Find Structure Clustering How? How many? (figure: samples in feature space F1-F2) 57 16.11.2011

Unsupervised Learning Data collection and labeling costly and time consuming Characteristics of patterns can change over time May not have insight into nature or structure of data Classes NOT known 58 16.11.2011

Mixture Densities Samples come from c classes with a priori probabilities P(ω_j). Assume: the forms of the class-conditional PDFs are known (usually normal), but the parameter vectors are unknown: µ_1, σ_1; µ_2, σ_2 59 16.11.2011

Mixture Densities Problem: Hairy Math. Simplification/Approximation: look only for the means -> Isodata 1. Choose initial means 2. Classify the n samples to the closest mean 3. Recompute the means from the samples in each class 4. Means changed? Go to step 2, else stop 60 16.11.2011
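A minimal sketch of this Isodata / k-means loop (illustrative; the random initialization, empty-cluster guard, and convergence test are assumptions):

```python
import numpy as np

def isodata(samples, c, seed=0):
    """Simplified Isodata (k-means): alternate classifying to the closest mean and updating the means."""
    rng = np.random.default_rng(seed)
    means = samples[rng.choice(len(samples), size=c, replace=False)]   # 1. choose initial means
    while True:
        # 2. classify each of the n samples to the closest mean
        labels = np.argmin(np.linalg.norm(samples[:, None] - means[None], axis=2), axis=1)
        # 3. recompute the means from the samples in each class (keep the old mean if a class is empty)
        new_means = np.array([samples[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(c)])
        # 4. means changed? go to step 2, else stop
        if np.allclose(new_means, means):
            return means, labels
        means = new_means
```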

Mixture Densities Isodata, problems: Choosing initial means Knowing number of classes Assuming distribution What is "closest"? 61 16.11.2011

Similarity / Criterion function Clustering: samples in the same class should extremize a criterion function that measures cluster quality. Example: sum-of-squared-error criterion 62 16.11.2011
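For reference, the sum-of-squared-error criterion in its standard form (added here, not transcribed from the slide):

```latex
J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2, \qquad m_i = \frac{1}{n_i} \sum_{x \in D_i} x
```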

Hierarchical Clustering Need not determine c Need not guess initial means 1. Initialize c := n 2. Find the nearest pair of distinct clusters D_i and D_j 3. Merge them and decrement c 4. If c has reached the desired number of clusters, stop; otherwise go to step 2 63 16.11.2011
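A rough sketch of this agglomerative loop, assuming a single-link (nearest-neighbor) cluster distance; the distance choice and data layout are assumptions, and the quadratic search is kept deliberately simple:

```python
import numpy as np

def agglomerative(samples, target_c):
    """Merge the nearest pair of clusters until target_c clusters remain (single-link distance)."""
    clusters = [[i] for i in range(len(samples))]          # 1. initialize c := n singleton clusters
    while len(clusters) > target_c:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: closest pair of points between the two clusters
                d = min(np.linalg.norm(samples[i] - samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)                        # 2. remember the nearest pair
        _, a, b = best
        clusters[a].extend(clusters[b])                     # 3. merge them (decrements c)
        del clusters[b]
    return clusters
```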

Dendrogram for hierarchical clustering 64 16.11.2011

Similarity What constitutes "nearest cluster"? 65 16.11.2011
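Common answers, in their standard form (added for completeness; the slides that follow use the first two):

```latex
d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert \quad \text{(nearest neighbor)}, \qquad
d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert \quad \text{(furthest neighbor)}, \qquad
d_{\text{mean}}(D_i, D_j) = \lVert m_i - m_j \rVert
```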

Three illustrative examples (panels a, b, c) 66 16.11.2011

Results of the nearest-neighbor algorithm (panels a, b, c) 67 16.11.2011

Results of the furthest-neighbor algorithm (panels a, b, c) 68 16.11.2011

Parametric - Non-parametric Parametric: assume underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian Classifier" Non-parametric: Don't assume distribution. Estimate probability of error or error criterion directly from training data. Examples: Parzen Window, k-nearest neighbor, perceptron... 69 16.11.2011

(Figure: linear decision element; the feature vector, weight vector, and threshold weight determine Class A / not Class A / no decision) 70 16.11.2011

No assumption about distributions (non-parametric) Linear Decision Surfaces Begin by supervised training (given classes of training data) Discriminant function: g(x) gives distance from decision surface Two category case: 71 16.11.2011
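The two-category rule referred to here, in its standard form (not transcribed from the slide):

```latex
g(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0, \qquad \text{decide } \omega_1 \text{ if } g(\mathbf{x}) > 0, \quad \omega_2 \text{ if } g(\mathbf{x}) < 0
```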

Each class can be surrounded by a convex polygon. The maximum safety margin is half of the distance between the polygons 72 16.11.2011

Goal: find an orientation of the line where the projected samples are well separated. 73 16.11.2011

Dimensionality Reduction: project a set of multidimensional points onto a line. The Fisher Discriminant is the function that maximizes the criterion J(w) = |m̃_1 − m̃_2|² / (s̃_1² + s̃_2²), where m̃_i is the sample mean and s̃_i² the scatter of the projected samples of class i. 74 16.11.2011

Fisher's linear discriminant: w = S_W⁻¹ (m_1 − m_2), where S_W is the within-class scatter matrix. 75 16.11.2011
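A small NumPy sketch of this projection (illustrative; the two-class array inputs are an assumption):

```python
import numpy as np

def fisher_direction(class1, class2):
    """w = S_W^-1 (m1 - m2): the direction that maximizes Fisher's criterion J(w)."""
    m1, m2 = class1.mean(axis=0), class2.mean(axis=0)
    s1 = (class1 - m1).T @ (class1 - m1)      # scatter matrix of class 1
    s2 = (class2 - m2).T @ (class2 - m2)      # scatter matrix of class 2
    sw = s1 + s2                              # within-class scatter matrix S_W
    return np.linalg.solve(sw, m1 - m2)       # solve S_W w = (m1 - m2)

# The projected 1-D samples class1 @ w and class2 @ w are then well separated along w.
```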

The Perceptron (figure: inputs 1, Age, Income with weights W0, W1, W2 feeding a single decision unit) 76 16.11.2011

Decision Function g(x) = wᵀx + w_0: g(x) > 0 -> Class A, g(x) < 0 -> not Class A, g(x) = 0 -> no decision. x: feature vector, w: weight vector, w_0: threshold weight; wᵀx denotes the scalar product. 77 16.11.2011

Perceptrons - Nonlinearities Three common non-linearities: hard limiter, threshold logic, sigmoid (figure: unit with inputs x_1, x_2, x_3 and plots of the three transfer functions) 78 16.11.2011

The Perceptron Linear Decision Element (figure: inputs x_0 = 1, x_1, ..., x_n, weighted sum, and threshold) 79 16.11.2011

Classifier Discriminant Functions Assign x to class ω_i if g_i(x) > g_j(x) for all j ≠ i, e.g. g_i(x) = p(x | ω_i) P(ω_i); p(x) is independent of the class i. p(x | ω_i): class-conditional probability density function; P(ω_i): a priori probability. 80 16.11.2011

Linear Discriminant Functions No assumption about distributions (non-parametric) Linear Decision Surfaces Begin by supervised training (given classes of training data) Discriminant function: g(x) gives distance from decision surface Two category case: 81 16.11.2011

Perceptron All vectors are labelled correctly if, for all j: wᵀy_j > 0 if y_j is labelled ω_1, and wᵀy_j < 0 if y_j is labelled ω_2. Now we set all samples belonging to ω_2 to their negative (y_j ← −y_j). Then all vectors are classified correctly if wᵀy_j > 0 for all j 82 16.11.2011

Perceptron Criterion Function J_p(w) = Σ_{y ∈ X} (−wᵀy), where X is the set of misclassified tokens. Since wᵀy is negative for misclassified tokens, J_p is positive; when J_p is zero, a solution vector has been found. J_p is proportional to the sum of distances of the misclassified samples to the decision boundary 83 16.11.2011

Linearly Separable Samples and the Solution Region in Weight Space 84 16.11.2011

The Perceptron Criterion Function 85 16.11.2011

Finding a Solution Region by Gradient Search 86 16.11.2011

Perceptron Learning Issues: How to set the Learning Rate How to set initial weights Problems: Non-Separable Data Separable Data, but Which Decision Surface Non-Linearly Separable 87 16.11.2011
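A minimal perceptron training loop that follows the gradient of J_p (a sketch under assumptions: labels as a NumPy array in {+1, -1}, a fixed learning rate, zero initial weights, and a simple epoch limit):

```python
import numpy as np

def train_perceptron(x, labels, lr=0.1, epochs=100):
    """Perceptron learning; labels in {+1, -1}. Negating class-2 samples makes 'correct' mean w.y > 0."""
    x = np.hstack([np.ones((len(x), 1)), x])       # augment with a constant 1 for the threshold weight
    y = x * labels[:, None]                        # y_j = x_j for omega_1, -x_j for omega_2
    w = np.zeros(x.shape[1])                       # initial weights (one simple choice)
    for _ in range(epochs):
        misclassified = y[y @ w <= 0]              # the set X of misclassified tokens
        if len(misclassified) == 0:                # J_p(w) = 0 -> solution vector found
            break
        w += lr * misclassified.sum(axis=0)        # gradient step: w <- w + lr * sum_{y in X} y
    return w
```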

Relaxation Procedure Variations: the gradient is continuous -> smoother criterion surface; introduce a margin b 88 16.11.2011

Effect of the Margin on the Solution Region 89 16.11.2011

Nonseparable behavior Minimize MSE 90 16.11.2011

The Perceptron: The good news: Simple computing units Learning algorithm The bad news: Single units generate linear decision surfaces (the infamous XOR problem). Multi-layered machines could not be trained. Local optimization. 91 16.11.2011

Networks of Neurons/ Multi-Layer Perceptron Many interconnected simple processing elements: Output patterns Internal Representation Units Input Patterns 92 16.11.2011

Connectionist Units (figure: a unit with input activations x_j and output y_j = f(net), where f is the non-linearity). Backpropagation of error: 93 16.11.2011

Training the MLP by Error Back-Propagation Choose random initial weights Apply input, get some output Compare output to desired output and compute error Back-propagate error through net and compute the contribution of each weight to the overall error Adjust weights slightly to reduce error 94 16.11.2011
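These steps might look like the following toy sketch for a one-hidden-layer MLP with sigmoid units and mean squared error (entirely illustrative; the layer sizes, learning rate, and XOR data are assumptions, not material from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(2, 4))   # choose random initial weights (input -> hidden)
w2 = rng.normal(scale=0.1, size=(4, 1))   # hidden -> output
lr = 0.5

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # e.g. the XOR problem
t = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs

for _ in range(5000):
    h = sigmoid(x @ w1)                        # apply input, get hidden activations
    y = sigmoid(h @ w2)                        # ... and the network output
    err_out = (y - t) * y * (1 - y)            # compare to desired output: error signal at output units
    err_hid = (err_out @ w2.T) * h * (1 - h)   # back-propagate the error to the hidden units
    w2 -= lr * h.T @ err_out                   # adjust weights slightly to reduce the error
    w1 -= lr * x.T @ err_hid
```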

Backpropagation of Error (figure: derivation in four steps for a unit with input x_j and output y_j) 95 16.11.2011

Derivative of the sigmoid: dy/dx = y (1 − y) (figure: sigmoid and its derivative) 96 16.11.2011

Statistical Interpretation of MLPs What is the output of an MLP? Output represents a posteriori probabilities P(ω | x) Assumptions: Training Targets: 1, 0 Output Error Function: Mean Squared Error Other Functions Possible 97 16.11.2011

Statistical Interpretation of MLPs What is the output of an MLP? It can be shown that the output units represent the a posteriori probabilities, provided certain objective functions are used (e.g. MSE, ...) and the training database is sufficiently large 98 16.11.2011

What does the sigmoid do? 99 16.11.2011

What does the sigmoid do? (figure: the weighted sum of the inputs plus the bias acts as a log likelihood ratio, or "odds") 100 16.11.2011
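In formulas (a standard result, stated here for clarity): if the net input a is the log odds, the sigmoid turns it into the posterior probability:

```latex
a = \ln \frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)}, \qquad
\sigma(a) = \frac{1}{1 + e^{-a}} = P(\omega_1 \mid x)
```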

Statistical Interpretation of MLPs What is the output of an MLP? Using MSE, for the two-class problem, if N is large and the training data reflects the prior distribution, the network output approximates the posterior probability 101 16.11.2011

Statistical Interpretation of MLPs: the network output approximates the posterior probability 102 16.11.2011