Alex Waibel, 16.11.2011
Organization
Literature: Introduction to the Theory of Neural Computation, Hertz, Krogh, Palmer, Santa Fe Institute; Neural Network Architectures: An Introduction, Judith Dayhoff, VNR Publishers; other publications TBD; papers on-line
Office hours: Tuesday, 12:00, or by appointment through the secretariat
Link in the Studienportal and at http://isl.anthropomatik.kit.edu, lectures -> neuronale netze
Neural Nets, Terminology: Artificial Neural Networks, Connectionist Models, Parallel Distributed Processing (PDP), Massively Parallel Processing, Multi-Layer Perceptron (MLP)
The brain is ~1000 supercomputers and, at the same time, ~1/10 of a pocket calculator: it is very good at some problems (vision, speech, language, motor control) and very poor at others (arithmetic).
Von Neumann Computer vs. Neural Computation: Processing: sequential vs. parallel; Processors: one vs. many; Interaction: none vs. a lot; Communication: poor vs. rich; Processors: fast, accurate vs. slow, sloppy; Knowledge: local vs. distributed; Hardware: general purpose vs. dedicated; Design: programmed vs. learned
Why Neural Networks: massive parallelism; massive constraint satisfaction for ill-defined input; simple computing units; many processing units, many interconnections; uniformity (-> sensor fusion); non-linear classifiers/mapping (-> good performance); learning/adapting; brain-like??
History: McCulloch & Pitts, 1943; Perceptrons, Rosenblatt ~1960; Minsky, 1967; PDP, neuro renaissance ~1985; since then: bio/neuro/brain science, machine learning, statistics, applications, part of the toolbox
Using Neural Nets: classification, prediction, function approximation, continuous mapping, pattern completion, coding
Design Criteria: recognition error rate, training time, recognition time, memory requirements, training complexity, ease of implementation, ease of adaptation
Neural Models: Back-Propagation, Boltzmann Machines, Decision Tree Classifiers, Feature Map Classifiers, Learning Vector Quantizer (LVQ, LVQ2), High-Order Networks, Modified Nearest Neighbor, Radial Basis Functions, Kernels, SVM, etc.
Network Specification. Parameters typically chosen by the network designer: net topology, node characteristics, learning rule, objective function, (initial) weights, learning parameters
Applications: space robot*, autonomous navigation*, speech recognition and understanding*, natural language processing*, music*, gesture recognition, lip reading, face recognition, household robots, signal processing, banking, bond rating, sonar, ...
Advanced Neural Models: Time-Delay Neural Networks (Waibel), recurrent nets (Elman, Jordan), higher-order nets, modular system construction, adaptive architectures, hybrid neural/non-neural architectures
Neural Nets - Design Problems: local minima, speed of learning, architecture must be selected, choice of feature representation, scaling systems / modularity, treatment of temporal features and sequences
Neural Nets - Modeling: neural nets are non-linear classifiers, approximate posterior probabilities, non-parametric training. The problem with time: how to consider context, how to operate shift-invariantly, how to process pattern sequences
Pattern Recognition Overview: static patterns, no dependence on time or sequential order. Important notions: supervised vs. unsupervised classifiers, parametric vs. non-parametric classifiers, linear vs. non-linear classifiers. Classical methods: Bayes classifier, k-nearest neighbor. Connectionist methods: perceptron, multilayer perceptrons
Pattern Recognition
Supervised - Unsupervised. Supervised training: the class to be recognized is known for each sample in the training data; requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!). Unsupervised training: the class is not known and the structure is to be discovered automatically; feature-space-reduction examples: clustering, autoassociative nets
Unsupervised Classification (figure: features F1, F2): classes not known, find structure
Unsupervised Classification (figure: features F1, F2): classes not known, find structure; clustering: how? how many?
Supervised Classification (figure: age vs. income, credit-worthy vs. non-credit-worthy): classes known; creditworthiness yes/no; features: income, age; find classifiers
Classification Problem (figure: age vs. income, credit-worthy vs. non-credit-worthy). Is Joe credit-worthy? Features: age, income. Classes: credit-worthy, non-credit-worthy. Problem: given Joe's income and age, should a loan be made? Other classification problems: fraud detection, customer selection, ...
Classification Problem (figure: income axis with credit-worthy and non-credit-worthy regions)
Parametric - Non-parametric. Parametric: assume an underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian classifier". Non-parametric: don't assume a distribution; estimate the probability of error or an error criterion directly from the training data. Examples: Parzen window, k-nearest neighbor, perceptron, ...
Bayes Rule: P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}, where p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j). Here P(\omega_i) is the a priori probability, P(\omega_i \mid x) the a posteriori probability after observation of x, and p(x \mid \omega_i) the class-conditional probability density.
P(error \mid x) = P(\omega_1 \mid x) if we decide \omega_2, else P(\omega_2 \mid x). The error is minimized if we: decide \omega_1 if P(\omega_1 \mid x) > P(\omega_2 \mid x), otherwise \omega_2; equivalently, decide \omega_1 if p(x \mid \omega_1) P(\omega_1) > p(x \mid \omega_2) P(\omega_2), otherwise \omega_2. For the multicategory case: decide \omega_i if P(\omega_i \mid x) > P(\omega_j \mid x) for all j \neq i.
Assign x to class \omega_i if p(x \mid \omega_i)\, P(\omega_i) > p(x \mid \omega_j)\, P(\omega_j) for all j \neq i (the evidence p(x) is independent of the class i); p(x \mid \omega_i) is the class-conditional probability density function and P(\omega_i) the a priori probability.
Need the a priori probabilities P(\omega_i) (not too bad); need the class-conditional PDFs p(x \mid \omega_i). Problems: limited training data, limited computation, class labelling is potentially costly and error-prone, classes may not be known, good features not known. Parametric solution: assume that p(x \mid \omega_i) has a particular parametric form; the most common representative is the multivariate normal density.
Univariate normal density: p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], i.e. x \sim N(\mu, \sigma^2)
Multivariate normal density: p(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x}-\vec{\mu})^t \Sigma^{-1} (\vec{x}-\vec{\mu})\right]
Discriminant function: g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)
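As an illustration (not part of the original slides), a minimal NumPy sketch of this Gaussian discriminant; the function name gaussian_discriminant and the two-class example values are invented:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - d/2 log(2*pi)
                - 1/2 log|Sigma| + log P(omega_i)"""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two illustrative classes (e.g. credit-worthy vs. not; all values made up)
x = np.array([35.0, 50.0])                       # age, income
classes = [
    (np.array([40.0, 60.0]), np.diag([100.0, 400.0]), 0.6),
    (np.array([25.0, 20.0]), np.diag([64.0, 100.0]), 0.4),
]
scores = [gaussian_discriminant(x, m, s, p) for m, s, p in classes]
print("assign x to class", int(np.argmax(scores)))   # class with largest g_i(x)
```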
For each class i, we need to estimate from the training data the mean vector \vec{\mu}_i and the covariance matrix \Sigma_i.
MLE, Maximum Likelihood Estimation. For the multivariate case: \hat{\vec{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \vec{x}_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (\vec{x}_k - \hat{\vec{\mu}})(\vec{x}_k - \hat{\vec{\mu}})^t
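A small NumPy sketch of these maximum-likelihood estimates (illustration only; the function mle_gaussian and the synthetic data are invented):

```python
import numpy as np

def mle_gaussian(samples):
    """Maximum-likelihood estimates for a multivariate normal:
    mu_hat    = (1/n) * sum_k x_k
    sigma_hat = (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T"""
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / n        # biased 1/n MLE, not 1/(n-1)
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
data = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, sigma_hat = mle_gaussian(data)
print(mu_hat, sigma_hat, sep="\n")       # close to the true mean and covariance
```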
Features: what and how many features should be selected? Any features? The more the better? If additional features are not useful (same mean and covariance in all classes), will the classifier automatically ignore them?
Generally, adding more features indiscriminately leads to worse performance! Reason: amount of training data vs. number of parameters; training data is limited. Solution: select features carefully, reduce the dimensionality, e.g. by Principal Component Analysis.
Assumption: the individual dimensions are correlated. Aim: reduce the number of dimensions with minimum loss of information.
Find the axis along which the variance is highest.
Rotate the space onto that axis.
The dimensions are now uncorrelated.
Remove the dimensions with low variance => reduction of dimensionality with minimum loss of information.
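A minimal NumPy sketch of the PCA steps above (illustrative only; pca_reduce and the toy data are made up, and eigendecomposition of the covariance matrix is used as the rotation):

```python
import numpy as np

def pca_reduce(data, n_components):
    """Find the axes of highest variance, rotate onto them, keep the top ones."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # highest variance first
    basis = eigvecs[:, order[:n_components]]
    return centered @ basis                     # projected, now uncorrelated dims

rng = np.random.default_rng(1)
# correlated 2-D data, reduced to its single highest-variance direction
x = rng.normal(size=300)
data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=300)])
print(pca_reduce(data, 1).shape)                # (300, 1)
```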
The normal distribution does not model this situation well, and other densities may be mathematically intractable -> non-parametric techniques.
No assumptions about the distribution are made; estimate p(x) directly from the data. Choose a window of volume V and count the number k of samples (out of n total) that fall inside the window: p(x) \approx \frac{k/n}{V}
Problem: if the volume is too large we lose resolution; if it is too small the estimate becomes erratic and poor.
Idea: let the volume be a function of the data and include the k nearest neighbors in the estimate. k-nearest-neighbor rule for classification: to classify a sample x, find the k nearest neighbors of x, determine the class most frequently represented among those k samples (take a vote), and assign x to that class. (Figure example: k = 9, 7 of the neighbors belong to class o and 2 do not -> classify x as o.)
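A minimal NumPy sketch of this k-nearest-neighbor rule (illustration; knn_classify and the two synthetic Gaussian classes are invented):

```python
import numpy as np

def knn_classify(x, samples, labels, k=9):
    """k-nearest-neighbor rule: find the k closest training samples
    and assign x to the class most frequently represented among them."""
    dists = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(labels[nearest])
    return int(np.argmax(votes))                 # majority vote

rng = np.random.default_rng(2)
class0 = rng.normal([0, 0], 1.0, size=(50, 2))
class1 = rng.normal([3, 3], 1.0, size=(50, 2))
samples = np.vstack([class0, class1])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.0]), samples, labels, k=9))
```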
For a finite number of samples n, we want k to be: large, for a reliable estimate; small, to guarantee that all k neighbors are reasonably close. This means the training database needs to be large: "There is no data like more data."
Pattern Recognition
Unsupervised Classification (figure: features F1, F2): classes not known, find structure
Unsupervised Classification (figure: features F1, F2): classes not known, find structure; clustering: how? how many?
Unsupervised Learning: data collection and labeling are costly and time consuming; the characteristics of patterns can change over time; we may not have insight into the nature or structure of the data; classes NOT known
Mixture Densities: the samples come from c classes with a priori probabilities P(\omega_j). Assume the forms of the class-conditional PDFs p(x \mid \omega_j, \theta_j) are known (usually normal), but the parameter vectors \theta_j (e.g. \mu_1, \sigma_1 and \mu_2, \sigma_2) are unknown.
Mixture Densities. Problem: hairy math. Simplification/approximation: look only for the means -> isodata: 1. Choose initial means. 2. Classify the n samples to the closest mean. 3. Recompute the means from the samples in each class. 4. If the means changed, go to step 2, else stop.
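A minimal NumPy sketch of this isodata simplification (essentially k-means; the function isodata, the random initialisation, and the toy data are assumptions for the example):

```python
import numpy as np

def isodata(samples, n_classes, n_iter=100):
    """Simplified isodata (k-means): assign each sample to the closest mean,
    recompute the means, repeat until the means stop changing.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(0)
    means = samples[rng.choice(len(samples), n_classes, replace=False)]  # step 1
    for _ in range(n_iter):
        dists = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)                                # step 2
        new_means = np.array([samples[assign == c].mean(axis=0)
                              for c in range(n_classes)])                # step 3
        if np.allclose(new_means, means):                                # step 4
            break
        means = new_means
    return means, assign

rng = np.random.default_rng(3)
data = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
                  rng.normal([4, 4], 0.5, size=(100, 2))])
means, assign = isodata(data, n_classes=2)
print(means)
```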
Isodata, problems: choosing the initial means; knowing the number of classes; assuming a distribution; what is "closest"?
Similarity / criterion function. Clustering: samples in the same cluster should extremize a criterion function that measures cluster quality. Example: sum-of-error criterion.
Hierarchical Clustering: need not determine c beforehand, need not guess initial means. 1. Initialize c := n (each sample is its own cluster). 2. Find the nearest pair of distinct clusters. 3. Merge them and decrement c. 4. If c has reached the desired number of clusters, stop; otherwise go to step 2.
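A minimal NumPy sketch of this agglomerative procedure, using a nearest-neighbor (single-link) cluster distance as one possible answer to the "nearest cluster" question below; hierarchical_cluster and the toy data are invented:

```python
import numpy as np

def hierarchical_cluster(samples, target_c):
    """Agglomerative clustering: start with c = n singleton clusters and
    repeatedly merge the nearest pair (nearest-neighbor / single link)."""
    clusters = [[i] for i in range(len(samples))]          # step 1: c := n
    while len(clusters) > target_c:                        # step 4
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(samples[i] - samples[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)                       # step 2: nearest pair
        _, a, b = best
        clusters[a] += clusters[b]                         # step 3: merge
        del clusters[b]
    return clusters

rng = np.random.default_rng(4)
data = np.vstack([rng.normal([0, 0], 0.3, size=(10, 2)),
                  rng.normal([3, 3], 0.3, size=(10, 2))])
print([sorted(c) for c in hierarchical_cluster(data, target_c=2)])
```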
Dendrogram for hierarchical clustering (figure)
Similarity: what constitutes the "nearest cluster"?
Three illustrative example data sets (figure: a, b, c)
Results of the nearest-neighbor algorithm on the three examples (figure: a, b, c)
Results of the furthest-neighbor algorithm on the three examples (figure: a, b, c)
Parametric - Non-parametric. Parametric: assume an underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian classifier". Non-parametric: don't assume a distribution; estimate the probability of error or an error criterion directly from the training data. Examples: Parzen window, k-nearest neighbor, perceptron, ...
Linear decision function (figure): g(\vec{x}) = \vec{w}^t \vec{x} + w_0; decide class A if g(\vec{x}) > 0, not class A if g(\vec{x}) < 0, no decision if g(\vec{x}) = 0. \vec{x}: feature vector, \vec{w}: weight vector, w_0: threshold weight.
No assumption about the distributions (non-parametric). Linear decision surfaces. Begin with supervised training (classes of the training data are given). Discriminant function: g(x) gives the distance from the decision surface. Two-category case: decide class A if g(\vec{x}) > 0 and not class A if g(\vec{x}) < 0.
Each class can be surrounded by a convex polygon; the maximum safety margin is half of the distance between the polygons.
Goal: find an orientation of the line where the projected samples are well separated.
Dimensionality reduction: project a set of multidimensional points onto a line. The Fisher discriminant is the projection direction \vec{w} that maximizes the criterion J(\vec{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, where \tilde{m}_i is the sample mean of the projected samples of class i and \tilde{s}_i^2 their scatter.
Fisher's linear discriminant: \vec{w} = S_W^{-1} (\vec{m}_1 - \vec{m}_2), where S_W = S_1 + S_2 is the within-class scatter matrix.
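A minimal NumPy sketch of Fisher's linear discriminant as given above (fisher_discriminant and the synthetic two-class data are invented for the example):

```python
import numpy as np

def fisher_discriminant(x1, x2):
    """Fisher's linear discriminant: w = S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix S_1 + S_2."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s1 = (x1 - m1).T @ (x1 - m1)
    s2 = (x2 - m2).T @ (x2 - m2)
    sw = s1 + s2                              # within-class scatter
    return np.linalg.solve(sw, m1 - m2)       # projection direction w

rng = np.random.default_rng(5)
x1 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)
x2 = rng.multivariate_normal([2, 1], [[1, 0.8], [0.8, 1]], size=100)
w = fisher_discriminant(x1, x2)
# means of the projected samples should be well separated along w
print((x1 @ w).mean(), (x2 @ w).mean())
```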
The Perceptron (figure: inputs 1, age, income with weights w_0, w_1, w_2 feeding a threshold unit)
Decision function g(\vec{x}) = \vec{w}^t \vec{x} + w_0, where \vec{w}^t \vec{x} denotes the scalar product; decide class A if g(\vec{x}) > 0, not class A if g(\vec{x}) < 0, no decision if g(\vec{x}) = 0. \vec{x}: feature vector, \vec{w}: weight vector, w_0: threshold weight.
Perceptrons - Nonlinearities (figure: inputs x_1, x_2, x_3). Three common non-linearities: hard limiter, threshold logic, sigmoid.
The Perceptron, a linear decision element (figure: inputs x_1, ..., x_n, weights, threshold unit)
Classifier Discriminant Functions: assign x to class \omega_i if p(x \mid \omega_i)\, P(\omega_i) > p(x \mid \omega_j)\, P(\omega_j) for all j \neq i (p(x) is independent of the class i); p(x \mid \omega_i) is the class-conditional probability density function and P(\omega_i) the a priori probability.
Linear Discriminant Functions: no assumption about the distributions (non-parametric); linear decision surfaces; begin with supervised training (classes of the training data are given). Discriminant function: g(x) gives the distance from the decision surface. Two-category case: decide class A if g(\vec{x}) > 0 and not class A if g(\vec{x}) < 0.
Perceptron: all vectors are labelled correctly if, for all j, \vec{a}^t \vec{y}_j > 0 when \vec{y}_j is labelled \omega_1 and \vec{a}^t \vec{y}_j < 0 when it is labelled \omega_2. Now we set all samples belonging to \omega_2 to their negatives (\vec{y}_j \leftarrow -\vec{y}_j). Then all vectors are classified correctly if \vec{a}^t \vec{y}_j > 0 for all j.
Perceptron Criterion Function: J_p(\vec{a}) = \sum_{\vec{y} \in X} (-\vec{a}^t \vec{y}), where X is the set of misclassified tokens. Since \vec{a}^t \vec{y} is negative for misclassified tokens, J_p is positive; when J_p(\vec{a}) is zero, a solution vector has been found. J_p is proportional to the sum of the distances of the misclassified samples to the decision boundary.
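A minimal NumPy sketch of perceptron training with this criterion, taking gradient steps on J_p over the misclassified samples (train_perceptron, the learning rate, and the toy data are assumptions):

```python
import numpy as np

def train_perceptron(x1, x2, eta=0.1, n_iter=1000):
    """Negate the class-2 samples, then look for a weight vector a with
    a^T y > 0 for every augmented sample y by stepping along the gradient
    of J_p(a) = sum over misclassified samples of (-a^T y)."""
    y1 = np.hstack([np.ones((len(x1), 1)), x1])       # augment with bias input 1
    y2 = -np.hstack([np.ones((len(x2), 1)), x2])      # class-2 samples negated
    ys = np.vstack([y1, y2])
    a = np.zeros(ys.shape[1])
    for _ in range(n_iter):
        misclassified = ys[ys @ a <= 0]
        if len(misclassified) == 0:                    # J_p(a) = 0: solution found
            break
        a = a + eta * misclassified.sum(axis=0)        # gradient step
    return a

rng = np.random.default_rng(6)
x1 = rng.normal([2, 2], 0.5, size=(50, 2))             # linearly separable toy data
x2 = rng.normal([-2, -2], 0.5, size=(50, 2))
print(train_perceptron(x1, x2))
```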
Linearly separable samples and the solution region in weight space (figure)
The perceptron criterion function (figure)
Finding a solution region by gradient search (figure)
Perceptron Learning. Issues: how to set the learning rate, how to set the initial weights. Problems: non-separable data; separable data, but which decision surface?; data that is not linearly separable.
Relaxation procedure variations: the gradient is continuous -> smoother criterion surface; introduce a margin b (require \vec{a}^t \vec{y} > b).
Effect of the margin on the solution region (figure)
Non-separable behavior: minimize the MSE (mean squared error) instead.
The Perceptron. The good news: simple computing units; a learning algorithm. The bad news: single units generate only linear decision surfaces (the infamous XOR problem); multi-layered machines could not be trained; local optimization.
Networks of Neurons / Multi-Layer Perceptron: many interconnected simple processing elements (figure: input patterns -> internal representation units -> output patterns)
Connectionist Units (figure): each unit j computes a weighted sum of its inputs, x_j = \sum_i w_{ij} y_i, and passes it through a nonlinearity, y_j = f(x_j); the error is backpropagated through these units.
Training the MLP by Error Back-Propagation: choose random initial weights; apply an input and get some output; compare the output to the desired output and compute the error; back-propagate the error through the net and compute \partial E / \partial w_{ij}, the contribution of each weight to the overall error; adjust the weights slightly to reduce the error.
Backpropagation of Error:
1) Forward pass: x_j = \sum_i w_{ij} y_i, \quad y_j = f(x_j) = \frac{1}{1 + e^{-x_j}}
2) Output units: \delta_j = (t_j - y_j)\, y_j (1 - y_j)
3) Hidden units: \delta_j = y_j (1 - y_j) \sum_k \delta_k w_{jk}
4) Weight update: \Delta w_{ij} = \eta\, \delta_j\, y_i
Derivative of the sigmoid (figure): for y = \frac{1}{1 + e^{-x}}, \frac{dy}{dx} = y (1 - y)
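A minimal NumPy sketch of this training procedure on the XOR problem mentioned earlier, using the sigmoid, its derivative y(1 - y), and the mean-squared error; the layer sizes, learning rate, and iteration count are arbitrary choices, and a different random seed may need more iterations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: the classic problem a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(7)
W1 = rng.normal(scale=1.0, size=(2, 4))     # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))     # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                                   # learning rate

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: delta = error * sigmoid derivative y * (1 - y)
    delta_out = (y - T) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # adjust weights slightly to reduce the mean-squared error
    W2 -= eta * h.T @ delta_out; b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid; b1 -= eta * delta_hid.sum(axis=0)

print(np.round(y, 2))   # should approach the XOR targets 0, 1, 1, 0
```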
Statistical Interpretation of MLPs. What is the output of an MLP? The outputs represent a posteriori probabilities P(\omega \mid x). Assumptions: training targets are 1 and 0; the output error function is the mean squared error (other objective functions are possible).
Statistical Interpretation of MLPs. What is the output of an MLP? It can be shown that the output units represent the a posteriori probabilities, provided certain objective functions are used (e.g. MSE, ...) and the training database is sufficiently large.
What does the sigmoid do?
What does the sigmoid do? Its input, a weighted sum of the inputs plus a bias, can be read as a log likelihood ratio, or "odds"; the sigmoid converts it into a posterior probability.
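A worked version of this statement in the notation of the Bayes-rule slide (added here for clarity):
P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_1)\, P(\omega_1) + p(x \mid \omega_2)\, P(\omega_2)} = \frac{1}{1 + e^{-z}}, \qquad z = \log \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \log \frac{P(\omega_1)}{P(\omega_2)}
So the sigmoid's input z is the log likelihood ratio plus a bias given by the log prior ratio, and its output is the a posteriori probability.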
Statistical Interpretation of MLPs. What is the output of an MLP? Using MSE for the two-class problem, if N is large and the training data reflects the prior distribution, the network output approximates P(\omega_1 \mid x).
Statistical Interpretation of MLPs: the net output approximates the posterior probability.