Alex Waibel, 16.11.2011
Organization
Literature: Introduction to the Theory of Neural Computation, Hertz, Krogh, Palmer, Santa Fe Institute; Neural Network Architectures: An Introduction, Judith Dayhoff, VNR Publishers; other publications TBD; papers on-line
Office hours: Tuesday, 12:00, or by appointment through the secretariat
Link in the Studienportal and at http://isl.anthropomatik.kit.edu, lectures -> neuronale netze
Neural Nets, Terminology: Artificial Neural Networks, Connectionist Models, Parallel Distributed Processing (PDP), Massively Parallel Processing, Multi-Layer Perceptron (MLP)
The brain is ~1000 supercomputers and, at the same time, ~1/10 of a pocket calculator: it is very good at some problems (vision, speech, language, motor control) and very poor at others (arithmetic).
Von Neumann Computer vs. Neural Computation: Processing: sequential vs. parallel; Processors: one vs. many; Interaction: none vs. a lot; Communication: poor vs. rich; Processors: fast, accurate vs. slow, sloppy; Knowledge: local vs. distributed; Hardware: general purpose vs. dedicated; Design: programmed vs. learned
Why Neural Networks: massive parallelism; massive constraint satisfaction for ill-defined input; simple computing units; many processing units, many interconnections; uniformity (-> sensor fusion); non-linear classifiers/mapping (-> good performance); learning/adapting; brain-like??
History: McCulloch & Pitts, 1943; Perceptrons, Rosenblatt ~1960; Minsky, 1967; PDP, neuro renaissance ~1985; since then: bio/neuro/brain science, machine learning, statistics, applications, part of the toolbox
Using Neural Nets: classification, prediction, function approximation, continuous mapping, pattern completion, coding
Design Criteria: recognition error rate, training time, recognition time, memory requirements, training complexity, ease of implementation, ease of adaptation
Neural Models: Back-Propagation, Boltzmann Machines, Decision Tree Classifiers, Feature Map Classifiers, Learning Vector Quantizer (LVQ, LVQ2), High-Order Networks, Modified Nearest Neighbor, Radial Basis Functions, Kernels, SVM, etc.
Network Specification. Parameters typically chosen by the network designer: net topology, node characteristics, learning rule, objective function, (initial) weights, learning parameters
Applications: space robot*, autonomous navigation*, speech recognition and understanding*, natural language processing*, music*, gesture recognition, lip reading, face recognition, household robots, signal processing, banking, bond rating, sonar, ...
Advanced Neural Models: Time-Delay Neural Networks (Waibel), recurrent nets (Elman, Jordan), higher-order nets, modular system construction, adaptive architectures, hybrid neural/non-neural architectures
Neural Nets - Design Problems: local minima, speed of learning, architecture must be selected, choice of feature representation, scaling systems / modularity, treatment of temporal features and sequences
Neural Nets - Modeling: neural nets are non-linear classifiers, approximate posterior probabilities, non-parametric training. The problem with time: how to consider context, how to operate shift-invariantly, how to process pattern sequences
Pattern Recognition Overview: static patterns, no dependence on time or sequential order. Important notions: supervised vs. unsupervised classifiers, parametric vs. non-parametric classifiers, linear vs. non-linear classifiers. Classical methods: Bayes classifier, k-nearest neighbor. Connectionist methods: perceptron, multilayer perceptrons
Pattern Recognition
Supervised - Unsupervised. Supervised training: the class to be recognized is known for each sample in the training data; requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!). Unsupervised training: the class is not known and the structure is to be discovered automatically; feature-space-reduction examples: clustering, autoassociative nets
Unsupervised Classification (figure: features F1, F2): classes not known, find structure
Unsupervised Classification (figure: features F1, F2): classes not known, find structure; clustering: how? how many?
Supervised Classification (figure: age vs. income, credit-worthy vs. non-credit-worthy): classes known; creditworthiness yes/no; features: income, age; find classifiers
Classification Problem (figure: age vs. income, credit-worthy vs. non-credit-worthy). Is Joe credit-worthy? Features: age, income. Classes: credit-worthy, non-credit-worthy. Problem: given Joe's income and age, should a loan be made? Other classification problems: fraud detection, customer selection, ...
Classification Problem (figure: income axis with credit-worthy and non-credit-worthy regions)
Parametric - Non-parametric. Parametric: assume an underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian classifier". Non-parametric: don't assume a distribution; estimate the probability of error or an error criterion directly from the training data. Examples: Parzen window, k-nearest neighbor, perceptron, ...
Bayes Rule: P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}, where p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j). Here P(\omega_i) is the a priori probability, P(\omega_i \mid x) the a posteriori probability after observation of x, and p(x \mid \omega_i) the class-conditional probability density.
P(error \mid x) = P(\omega_1 \mid x) if we decide \omega_2, else P(\omega_2 \mid x). The error is minimized if we: decide \omega_1 if P(\omega_1 \mid x) > P(\omega_2 \mid x), otherwise \omega_2; equivalently, decide \omega_1 if p(x \mid \omega_1) P(\omega_1) > p(x \mid \omega_2) P(\omega_2), otherwise \omega_2. For the multicategory case: decide \omega_i if P(\omega_i \mid x) > P(\omega_j \mid x) for all j \neq i.
Assign x to class \omega_i if p(x \mid \omega_i)\, P(\omega_i) > p(x \mid \omega_j)\, P(\omega_j) for all j \neq i (the evidence p(x) is independent of the class i); p(x \mid \omega_i) is the class-conditional probability density function and P(\omega_i) the a priori probability.
Need the a priori probabilities P(\omega_i) (not too bad); need the class-conditional PDFs p(x \mid \omega_i). Problems: limited training data, limited computation, class labelling is potentially costly and error-prone, classes may not be known, good features not known. Parametric solution: assume that p(x \mid \omega_i) has a particular parametric form; the most common representative is the multivariate normal density.
Univariate normal density: p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], i.e. x \sim N(\mu, \sigma^2)
Multivariate normal density: p(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x}-\vec{\mu})^t \Sigma^{-1} (\vec{x}-\vec{\mu})\right]
Discriminant function: g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)
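As an illustration (not part of the original slides), a minimal NumPy sketch of this Gaussian discriminant; the function name gaussian_discriminant and the two-class example values are invented:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - d/2 log(2*pi)
                - 1/2 log|Sigma| + log P(omega_i)"""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two illustrative classes (e.g. credit-worthy vs. not; all values made up)
x = np.array([35.0, 50.0])                       # age, income
classes = [
    (np.array([40.0, 60.0]), np.diag([100.0, 400.0]), 0.6),
    (np.array([25.0, 20.0]), np.diag([64.0, 100.0]), 0.4),
]
scores = [gaussian_discriminant(x, m, s, p) for m, s, p in classes]
print("assign x to class", int(np.argmax(scores)))   # class with largest g_i(x)
```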
For each class i, we need to estimate from the training data the mean vector \vec{\mu}_i and the covariance matrix \Sigma_i.
MLE, Maximum Likelihood Estimation. For the multivariate case: \hat{\vec{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \vec{x}_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (\vec{x}_k - \hat{\vec{\mu}})(\vec{x}_k - \hat{\vec{\mu}})^t
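A small NumPy sketch of these maximum-likelihood estimates (illustration only; the function mle_gaussian and the synthetic data are invented):

```python
import numpy as np

def mle_gaussian(samples):
    """Maximum-likelihood estimates for a multivariate normal:
    mu_hat    = (1/n) * sum_k x_k
    sigma_hat = (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T"""
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / n        # biased 1/n MLE, not 1/(n-1)
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
data = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, sigma_hat = mle_gaussian(data)
print(mu_hat, sigma_hat, sep="\n")       # close to the true mean and covariance
```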
Features: what and how many features should be selected? Any features? The more the better? If additional features are not useful (same mean and covariance in all classes), will the classifier automatically ignore them?
Generally, adding more features indiscriminately leads to worse performance! Reason: amount of training data vs. number of parameters; training data is limited. Solution: select features carefully, reduce the dimensionality, e.g. by Principal Component Analysis.
Assumption: the individual dimensions are correlated. Aim: reduce the number of dimensions with minimum loss of information.
Find the axis along which the variance is highest.
Rotate the space onto that axis.
The dimensions are now uncorrelated.
Remove the dimensions with low variance => reduction of dimensionality with minimum loss of information.
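A minimal NumPy sketch of the PCA steps above (illustrative only; pca_reduce and the toy data are made up, and eigendecomposition of the covariance matrix is used as the rotation):

```python
import numpy as np

def pca_reduce(data, n_components):
    """Find the axes of highest variance, rotate onto them, keep the top ones."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # highest variance first
    basis = eigvecs[:, order[:n_components]]
    return centered @ basis                     # projected, now uncorrelated dims

rng = np.random.default_rng(1)
# correlated 2-D data, reduced to its single highest-variance direction
x = rng.normal(size=300)
data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=300)])
print(pca_reduce(data, 1).shape)                # (300, 1)
```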
The normal distribution does not model this situation well, and other densities may be mathematically intractable -> non-parametric techniques.
No assumptions about the distribution are made; estimate p(x) directly from the data. Choose a window of volume V and count the number k of samples (out of n total) that fall inside the window: p(x) \approx \frac{k/n}{V}
Problem: if the volume is too large we lose resolution; if it is too small the estimate becomes erratic and poor.
Idea: let the volume be a function of the data and include the k nearest neighbors in the estimate. k-nearest-neighbor rule for classification: to classify a sample x, find the k nearest neighbors of x, determine the class most frequently represented among those k samples (take a vote), and assign x to that class. (Figure example: k = 9, 7 of the neighbors belong to class o and 2 do not -> classify x as o.)
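A minimal NumPy sketch of this k-nearest-neighbor rule (illustration; knn_classify and the two synthetic Gaussian classes are invented):

```python
import numpy as np

def knn_classify(x, samples, labels, k=9):
    """k-nearest-neighbor rule: find the k closest training samples
    and assign x to the class most frequently represented among them."""
    dists = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(labels[nearest])
    return int(np.argmax(votes))                 # majority vote

rng = np.random.default_rng(2)
class0 = rng.normal([0, 0], 1.0, size=(50, 2))
class1 = rng.normal([3, 3], 1.0, size=(50, 2))
samples = np.vstack([class0, class1])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.0]), samples, labels, k=9))
```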
For a finite number of samples n, we want k to be: large, for a reliable estimate; small, to guarantee that all k neighbors are reasonably close. This means the training database needs to be large: "There is no data like more data."
Pattern Recognition
Unsupervised Classification (figure: features F1, F2): classes not known, find structure
Unsupervised Classification (figure: features F1, F2): classes not known, find structure; clustering: how? how many?
Unsupervised Learning: data collection and labeling are costly and time consuming; the characteristics of patterns can change over time; we may not have insight into the nature or structure of the data; classes NOT known
Mixture Densities: the samples come from c classes with a priori probabilities P(\omega_j). Assume the forms of the class-conditional PDFs p(x \mid \omega_j, \theta_j) are known (usually normal), but the parameter vectors \theta_j (e.g. \mu_1, \sigma_1 and \mu_2, \sigma_2) are unknown.
Mixture Densities. Problem: hairy math. Simplification/approximation: look only for the means -> isodata: 1. Choose initial means. 2. Classify the n samples to the closest mean. 3. Recompute the means from the samples in each class. 4. If the means changed, go to step 2, else stop.
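A minimal NumPy sketch of this isodata simplification (essentially k-means; the function isodata, the random initialisation, and the toy data are assumptions for the example):

```python
import numpy as np

def isodata(samples, n_classes, n_iter=100):
    """Simplified isodata (k-means): assign each sample to the closest mean,
    recompute the means, repeat until the means stop changing.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(0)
    means = samples[rng.choice(len(samples), n_classes, replace=False)]  # step 1
    for _ in range(n_iter):
        dists = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)                                # step 2
        new_means = np.array([samples[assign == c].mean(axis=0)
                              for c in range(n_classes)])                # step 3
        if np.allclose(new_means, means):                                # step 4
            break
        means = new_means
    return means, assign

rng = np.random.default_rng(3)
data = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
                  rng.normal([4, 4], 0.5, size=(100, 2))])
means, assign = isodata(data, n_classes=2)
print(means)
```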
Isodata, problems: choosing the initial means; knowing the number of classes; assuming a distribution; what is "closest"?
Similarity / criterion function. Clustering: samples in the same cluster should extremize a criterion function that measures cluster quality. Example: sum-of-error criterion.
Hierarchical Clustering: need not determine c beforehand, need not guess initial means. 1. Initialize c := n (each sample is its own cluster). 2. Find the nearest pair of distinct clusters. 3. Merge them and decrement c. 4. If c has reached the desired number of clusters, stop; otherwise go to step 2.
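A minimal NumPy sketch of this agglomerative procedure, using a nearest-neighbor (single-link) cluster distance as one possible answer to the "nearest cluster" question below; hierarchical_cluster and the toy data are invented:

```python
import numpy as np

def hierarchical_cluster(samples, target_c):
    """Agglomerative clustering: start with c = n singleton clusters and
    repeatedly merge the nearest pair (nearest-neighbor / single link)."""
    clusters = [[i] for i in range(len(samples))]          # step 1: c := n
    while len(clusters) > target_c:                        # step 4
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(samples[i] - samples[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)                       # step 2: nearest pair
        _, a, b = best
        clusters[a] += clusters[b]                         # step 3: merge
        del clusters[b]
    return clusters

rng = np.random.default_rng(4)
data = np.vstack([rng.normal([0, 0], 0.3, size=(10, 2)),
                  rng.normal([3, 3], 0.3, size=(10, 2))])
print([sorted(c) for c in hierarchical_cluster(data, target_c=2)])
```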
Dendrogram for hierarchical clustering (figure)
Similarity: what constitutes the "nearest cluster"?
Three illustrative example data sets (figure: a, b, c)
Results of the nearest-neighbor algorithm on the three examples (figure: a, b, c)
Results of the furthest-neighbor algorithm on the three examples (figure: a, b, c)
Parametric - Non-parametric. Parametric: assume an underlying probability distribution; estimate the parameters of this distribution. Example: "Gaussian classifier". Non-parametric: don't assume a distribution; estimate the probability of error or an error criterion directly from the training data. Examples: Parzen window, k-nearest neighbor, perceptron, ...
Linear decision function (figure): g(\vec{x}) = \vec{w}^t \vec{x} + w_0; decide class A if g(\vec{x}) > 0, not class A if g(\vec{x}) < 0, no decision if g(\vec{x}) = 0. \vec{x}: feature vector, \vec{w}: weight vector, w_0: threshold weight.
No assumption about the distributions (non-parametric). Linear decision surfaces. Begin with supervised training (classes of the training data are given). Discriminant function: g(x) gives the distance from the decision surface. Two-category case: decide class A if g(\vec{x}) > 0 and not class A if g(\vec{x}) < 0.
Each class can be surrounded by a convex polygon; the maximum safety margin is half of the distance between the polygons.
Goal: find an orientation of the line where the projected samples are well separated.
Dimensionality reduction: project a set of multidimensional points onto a line. The Fisher discriminant is the projection direction \vec{w} that maximizes the criterion J(\vec{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, where \tilde{m}_i is the sample mean of the projected samples of class i and \tilde{s}_i^2 their scatter.
Fisher's linear discriminant: \vec{w} = S_W^{-1} (\vec{m}_1 - \vec{m}_2), where S_W = S_1 + S_2 is the within-class scatter matrix.
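A minimal NumPy sketch of Fisher's linear discriminant as given above (fisher_discriminant and the synthetic two-class data are invented for the example):

```python
import numpy as np

def fisher_discriminant(x1, x2):
    """Fisher's linear discriminant: w = S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix S_1 + S_2."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s1 = (x1 - m1).T @ (x1 - m1)
    s2 = (x2 - m2).T @ (x2 - m2)
    sw = s1 + s2                              # within-class scatter
    return np.linalg.solve(sw, m1 - m2)       # projection direction w

rng = np.random.default_rng(5)
x1 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)
x2 = rng.multivariate_normal([2, 1], [[1, 0.8], [0.8, 1]], size=100)
w = fisher_discriminant(x1, x2)
# means of the projected samples should be well separated along w
print((x1 @ w).mean(), (x2 @ w).mean())
```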
The Perceptron (figure: inputs 1, age, income with weights w_0, w_1, w_2 feeding a threshold unit)
Decision function g(\vec{x}) = \vec{w}^t \vec{x} + w_0, where \vec{w}^t \vec{x} denotes the scalar product; decide class A if g(\vec{x}) > 0, not class A if g(\vec{x}) < 0, no decision if g(\vec{x}) = 0. \vec{x}: feature vector, \vec{w}: weight vector, w_0: threshold weight.
Perceptrons - Nonlinearities (figure: inputs x_1, x_2, x_3). Three common non-linearities: hard limiter, threshold logic, sigmoid.
The Perceptron, a linear decision element (figure: inputs x_1, ..., x_n, weights, threshold unit)
Classifier Discriminant Functions: assign x to class \omega_i if p(x \mid \omega_i)\, P(\omega_i) > p(x \mid \omega_j)\, P(\omega_j) for all j \neq i (p(x) is independent of the class i); p(x \mid \omega_i) is the class-conditional probability density function and P(\omega_i) the a priori probability.
Linear Discriminant Functions: no assumption about the distributions (non-parametric); linear decision surfaces; begin with supervised training (classes of the training data are given). Discriminant function: g(x) gives the distance from the decision surface. Two-category case: decide class A if g(\vec{x}) > 0 and not class A if g(\vec{x}) < 0.
Perceptron: all vectors are labelled correctly if, for all j, \vec{a}^t \vec{y}_j > 0 when \vec{y}_j is labelled \omega_1 and \vec{a}^t \vec{y}_j < 0 when it is labelled \omega_2. Now we set all samples belonging to \omega_2 to their negatives (\vec{y}_j \leftarrow -\vec{y}_j). Then all vectors are classified correctly if \vec{a}^t \vec{y}_j > 0 for all j.
Perceptron Criterion Function: J_p(\vec{a}) = \sum_{\vec{y} \in X} (-\vec{a}^t \vec{y}), where X is the set of misclassified tokens. Since \vec{a}^t \vec{y} is negative for misclassified tokens, J_p is positive; when J_p(\vec{a}) is zero, a solution vector has been found. J_p is proportional to the sum of the distances of the misclassified samples to the decision boundary.
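A minimal NumPy sketch of perceptron training with this criterion, taking gradient steps on J_p over the misclassified samples (train_perceptron, the learning rate, and the toy data are assumptions):

```python
import numpy as np

def train_perceptron(x1, x2, eta=0.1, n_iter=1000):
    """Negate the class-2 samples, then look for a weight vector a with
    a^T y > 0 for every augmented sample y by stepping along the gradient
    of J_p(a) = sum over misclassified samples of (-a^T y)."""
    y1 = np.hstack([np.ones((len(x1), 1)), x1])       # augment with bias input 1
    y2 = -np.hstack([np.ones((len(x2), 1)), x2])      # class-2 samples negated
    ys = np.vstack([y1, y2])
    a = np.zeros(ys.shape[1])
    for _ in range(n_iter):
        misclassified = ys[ys @ a <= 0]
        if len(misclassified) == 0:                    # J_p(a) = 0: solution found
            break
        a = a + eta * misclassified.sum(axis=0)        # gradient step
    return a

rng = np.random.default_rng(6)
x1 = rng.normal([2, 2], 0.5, size=(50, 2))             # linearly separable toy data
x2 = rng.normal([-2, -2], 0.5, size=(50, 2))
print(train_perceptron(x1, x2))
```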
Linearly separable samples and the solution region in weight space (figure)
The perceptron criterion function (figure)
Finding a solution region by gradient search (figure)
Perceptron Learning. Issues: how to set the learning rate, how to set the initial weights. Problems: non-separable data; separable data, but which decision surface?; data that is not linearly separable.
Relaxation procedure variations: the gradient is continuous -> smoother criterion surface; introduce a margin b (require \vec{a}^t \vec{y} > b).
Effect of the margin on the solution region (figure)
Non-separable behavior: minimize the MSE (mean squared error) instead.
The Perceptron. The good news: simple computing units; a learning algorithm. The bad news: single units generate only linear decision surfaces (the infamous XOR problem); multi-layered machines could not be trained; local optimization.
Networks of Neurons / Multi-Layer Perceptron: many interconnected simple processing elements (figure: input patterns -> internal representation units -> output patterns)
Connectionist Units (figure): each unit j computes a weighted sum of its inputs, x_j = \sum_i w_{ij} y_i, and passes it through a nonlinearity, y_j = f(x_j); the error is backpropagated through these units.
Training the MLP by Error Back-Propagation: choose random initial weights; apply an input and get some output; compare the output to the desired output and compute the error; back-propagate the error through the net and compute \partial E / \partial w_{ij}, the contribution of each weight to the overall error; adjust the weights slightly to reduce the error.
Backpropagation of Error:
1) Forward pass: x_j = \sum_i w_{ij} y_i, \quad y_j = f(x_j) = \frac{1}{1 + e^{-x_j}}
2) Output units: \delta_j = (t_j - y_j)\, y_j (1 - y_j)
3) Hidden units: \delta_j = y_j (1 - y_j) \sum_k \delta_k w_{jk}
4) Weight update: \Delta w_{ij} = \eta\, \delta_j\, y_i
Derivative of the sigmoid (figure): for y = \frac{1}{1 + e^{-x}}, \frac{dy}{dx} = y (1 - y)
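A minimal NumPy sketch of this training procedure on the XOR problem mentioned earlier, using the sigmoid, its derivative y(1 - y), and the mean-squared error; the layer sizes, learning rate, and iteration count are arbitrary choices, and a different random seed may need more iterations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: the classic problem a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(7)
W1 = rng.normal(scale=1.0, size=(2, 4))     # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))     # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                                   # learning rate

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: delta = error * sigmoid derivative y * (1 - y)
    delta_out = (y - T) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # adjust weights slightly to reduce the mean-squared error
    W2 -= eta * h.T @ delta_out; b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid; b1 -= eta * delta_hid.sum(axis=0)

print(np.round(y, 2))   # should approach the XOR targets 0, 1, 1, 0
```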
Statistical Interpretation of MLPs. What is the output of an MLP? The outputs represent a posteriori probabilities P(\omega \mid x). Assumptions: training targets are 1 and 0; the output error function is the mean squared error (other objective functions are possible).
Statistical Interpretation of MLPs. What is the output of an MLP? It can be shown that the output units represent the a posteriori probabilities, provided certain objective functions are used (e.g. MSE, ...) and the training database is sufficiently large.
What does the sigmoid do?
What does the sigmoid do? Its input, a weighted sum of the inputs plus a bias, can be read as a log likelihood ratio, or "odds"; the sigmoid converts it into a posterior probability.
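A worked version of this statement in the notation of the Bayes-rule slide (added here for clarity):
P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_1)\, P(\omega_1) + p(x \mid \omega_2)\, P(\omega_2)} = \frac{1}{1 + e^{-z}}, \qquad z = \log \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \log \frac{P(\omega_1)}{P(\omega_2)}
So the sigmoid's input z is the log likelihood ratio plus a bias given by the log prior ratio, and its output is the a posteriori probability.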
Statistical Interpretation of MLPs. What is the output of an MLP? Using MSE for the two-class problem, if N is large and the training data reflects the prior distribution, the network output approximates P(\omega_1 \mid x).
Statistical Interpretation of MLPs: the net output approximates the posterior probability.