Introduction to Neural Networks


What are connectionist neural networks?
Connectionism refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain.
Many different models exist, but all include:
- Multiple, individual nodes or units that operate at the same time (in parallel)
- A network that connects the nodes together
- Information stored in a distributed fashion among the links that connect the nodes
- Learning that occurs through gradual changes in connection strength

History of Neural Networks (1)
Attempts to mimic the human brain date back to work in the 1930s, 1940s, and 1950s by Alan Turing, Warren McCulloch, Walter Pitts, Donald Hebb and John von Neumann
1943 McCulloch-Pitts: neuron as computing element
1948 Wiener: cybernetics
1949 Hebb: learning rule
1957 Rosenblatt at Cornell developed the Perceptron, a hardware neural net for character recognition
1959 Widrow and Hoff at Stanford developed Adaline for adaptive control of noise on telephone lines
1960 Widrow-Hoff: least mean square (LMS) algorithm

History of Neural Networks (2)
Recession
1969 Minsky-Papert: showed the limitations of the perceptron model (linear separability in perceptrons)

History of Neural Networks (3)
Revival: mathematically tied together many of the ideas from previous research
1982 Hopfield: recurrent network model
1982 Kohonen: self-organizing maps
1986 Rumelhart et al.: backpropagation, universal approximation
Since then, growth has exploded: over 80% of Fortune 500 companies have neural net R&D programs, thousands of research papers, commercial software applications

Applications of Neural Networks
Forecasting/market prediction: finance and banking
Manufacturing: quality control, fault diagnosis
Medicine: analysis of electrocardiogram data, RNA & DNA sequencing, drug development without animal testing
Pattern/image recognition: handwriting recognition, airport bomb detection
Optimization (without the Simplex method)
Control: process control, robotics

Comparison of Brains and Traditional Computers

Brain:
- Billions of neurons, trillions of synapses
- Element size: 10^-6 m
- Energy use: ~20 W
- Processing speed: ~100 Hz
- Parallel, distributed
- Fault tolerant
- Learns: yes
- Intelligent/conscious: usually

Traditional computer:
- Billions of bytes of RAM, trillions of bytes on disk
- Element size: 10^-9 m
- Energy use: 30-90 W (CPU)
- Processing speed: ~10^9 Hz
- Serial, centralized
- Generally not fault tolerant
- Learns: some
- Intelligent/conscious: generally no

Biological Inspiration
Idea: to make the computer more robust, intelligent, and able to learn, let's model our computer software (and/or hardware) after the brain.
"My brain: it's my second favorite organ." - Woody Allen, from the movie Sleeper

Neurons in the Brain
Although heterogeneous, at a low level the brain is composed of neurons
A neuron receives input from other neurons (generally thousands) through its synapses
Inputs are approximately summed
When the input exceeds a threshold, the neuron sends an electrical spike that travels from the body, down the axon, to the next neuron(s)

Biological Neuron
3 major functional units: dendrites, cell body, axon (plus the synapse, the junction between neurons)
The amount of signal passing through a neuron depends on:
- The intensity of the signal from the feeding neurons
- Their synaptic strengths
- The threshold of the receiving neuron
Hebb rule (plays a key part in learning): a synapse which repeatedly triggers the activation of a postsynaptic neuron will grow in strength; others will gradually weaken
Neurons learn by adjusting the magnitudes of their synaptic strengths
[Figure: neuron model with inputs x_1, ..., x_n, weights w_1, ..., w_n, summed input ξ, activation g(ξ), and output y]

Learning in the Brain
Brains learn by:
- Altering the strength of connections between neurons
- Creating/deleting connections
Hebb's Postulate (Hebbian learning): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Long-Term Potentiation (LTP)
- Cellular basis for learning and memory
- LTP is the long-lasting strengthening of the connection between two nerve cells in response to stimulation
- Discovered in many regions of the cortex

Artificial Neurons (basic computational entities of an ANN)
Analogy between artificial and biological concepts (connection weights represent synapses)
In 1958 Rosenblatt introduced the mechanics of the perceptron
Input to output: y = g(Σ_i w_i x_i)
Only when the weighted sum exceeds the threshold limit will the neuron fire
Weights can enhance or inhibit the inputs
The collective behaviour of neurons is what is interesting for intelligent data processing
[Figure: unit with inputs x_1, x_2, x_3, weights w_1, w_2, w_3, and output y = g(w·x)]

Model of a Neuron

Activation Function
[Figure: step function f(a) and sigmoid function f(a), each plotted against the activation a]
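
A minimal sketch of these two activation functions in Python with NumPy; the function names and the default threshold are illustrative choices, not values from the slides:

import numpy as np

def step(a, threshold=0.0):
    # Step (threshold) activation: fires 1 when the activation exceeds the threshold, else 0
    return np.where(a > threshold, 1.0, 0.0)

def sigmoid(a):
    # Smooth, differentiable alternative to the step function; output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

if __name__ == "__main__":
    a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(step(a))      # [0. 0. 0. 1. 1.]
    print(sigmoid(a))   # values between 0 and 1, equal to 0.5 at a = 0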

Perceptrons
Can be trained on a set of examples using a special learning rule (process)
Weights are changed in proportion to the difference (error) between the target output and the perceptron output for each example
Minimize the summed squared error function E = 1/2 Σ_p Σ_i (o_i(p) - t_i(p))^2 with respect to the weights
The error is a function of all the weights and forms an irregular, multidimensional, complex hypersurface with many peaks, saddle points and minima
The error is minimized by finding the set of weights that corresponds to the global minimum
This is done with the gradient descent method: weights are incrementally updated in proportion to ∂E/∂w_ij
The update reads: w_ij(t+1) = w_ij(t) + Δw_ij
The aim is to produce a true mapping for all patterns
[Figure: perceptron unit with inputs x_j, weights w_ij, summed input ξ, threshold, and outputs o_i = g(ξ)]

Perceptron Structure

Learning for Perceptron
1. Initialize w_ij with random values.
2. Repeat until w_ij(t+1) ≈ w_ij(t):
   - Pick a pattern p from the training set
   - Feed the input to the network and calculate the output
   - Update the weights according to w_ij(t+1) = w_ij(t) + Δw_ij, where Δw_ij = -η ∂E/∂w_ij
3. When no change (within some accuracy) occurs, the weights are frozen and the network is ready to use on data it has never seen
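
As a rough illustration of this procedure, here is a small Python/NumPy sketch of perceptron training with the delta rule; the learning rate, epoch count, bias convention and example data are illustrative assumptions, not taken from the slides:

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    # X: one input pattern per row; t: target outputs (0 or 1).
    # A constant 1 is prepended to each pattern so w[0] plays the role of the threshold (bias).
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.random.uniform(-0.5, 0.5, X.shape[1])     # step 1: random initial weights
    for _ in range(epochs):                          # step 2: repeat (here for a fixed number of epochs)
        for x, target in zip(X, t):
            o = 1.0 if np.dot(w, x) > 0 else 0.0     # threshold unit output
            w += eta * (target - o) * x              # weight change proportional to the error
    return w

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t_and = np.array([0, 0, 0, 1], dtype=float)
    print(train_perceptron(X, t_and))   # learned (bias, w1, w2) for the AND function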

Example: AND and OR

AND:              OR:
x1 x2 | t         x1 x2 | t
 0  0 | 0          0  0 | 0
 0  1 | 0          0  1 | 1
 1  0 | 0          1  0 | 1
 1  1 | 1          1  1 | 1

The perceptron learns these rules easily (i.e., sets appropriate weights and threshold), e.g.
w = (w0, w1, w2) = (-1.5, 1.0, 1.0) for AND and (-0.5, 1.0, 1.0) for OR, where w0 corresponds to the threshold term
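
A quick Python check, under the same bias convention assumed above (input augmented by a constant 1), that these weights reproduce the AND and OR truth tables:

import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # leading 1 multiplies the bias w0
w_and = np.array([-1.5, 1.0, 1.0])
w_or  = np.array([-0.5, 1.0, 1.0])

print((X @ w_and > 0).astype(int))  # [0 0 0 1]  -> AND
print((X @ w_or  > 0).astype(int))  # [0 1 1 1]  -> OR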

Problem & Solution
Perceptrons can only classify accurately when the classes are linearly separable: a linear hyperplane can place one class of objects on one side of the plane and the other class on the other side
Because of this limitation, ANN research was put on hold for about 20 years
Solution: additional (hidden) layers of neurons, the MLP architecture, which is able to solve non-linear classification problems
[Figure: a linearly separable case in the (x1, x2) plane vs. a non-linearly-separable case]

Multilayer Perceptrons (MLPs)
The learning procedure is an extension of the simple perceptron algorithm
Response function: o_i = g(Σ_j w_ij g(Σ_k w_jk x_k)), which is non-linear, so the network is able to perform non-linear mappings
Theory tells us that a neural network with at least 1 hidden layer can represent any function
A vast number of ANN types exist
[Figure: MLP with inputs x_k, hidden units h_j, weights w_jk and w_ij, and outputs o_i]
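
A minimal Python/NumPy sketch of this response function for one hidden layer with sigmoid activations; the layer sizes and random weight initialization are illustrative assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_hidden, W_out):
    # o = g(W_out @ g(W_hidden @ x)): one layer of non-linear hidden units
    h = sigmoid(W_hidden @ x)   # hidden activations h_j = g(sum_k w_jk x_k)
    return sigmoid(W_out @ h)   # outputs o_i = g(sum_j w_ij h_j)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_hidden = rng.normal(size=(4, 3))   # 3 inputs -> 4 hidden units
    W_out = rng.normal(size=(2, 4))      # 4 hidden units -> 2 outputs
    print(mlp_forward(np.array([0.5, -1.0, 2.0]), W_hidden, W_out))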

MLP Structure

Geometric Interpretation of Perceptron Learning

Backpropagation ANNs
The most widely used type of network
Feedforward
Supervised (learns a mapping from one data space to another using examples)
The error is propagated backwards
Versatile: used for data modelling, classification, forecasting, data and image compression, and pattern recognition

BP Learning Algorithm
Like the perceptron, it uses gradient descent to minimize the error (generalized to the case with hidden layers)
Each iteration constitutes two sweeps: a forward pass and a backward pass
To minimize the error we need ∂E/∂w_ij but also ∂E/∂w_jk (which we get using the chain rule)
Training an MLP with BP can be thought of as a walk in weight space along an energy surface, trying to find the global minimum and avoid local minima
Unlike for the perceptron, there is no guarantee that the global minimum will be reached, but in most cases the energy landscape is smooth

Backpropagation Net Structure

BP Learning Algorithm
1. Initialize w_ij and w_jk with random values.
2. Repeat until w_ij and w_jk have converged or the desired performance level is reached:
   - Pick a pattern p from the training set
   - Present the input and calculate the output
   - Update the weights according to:
     w_ij(t+1) = w_ij(t) + Δw_ij
     w_jk(t+1) = w_jk(t) + Δw_jk
     where Δw = -η ∂E/∂w (and similarly for extra hidden layers)
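
A compact Python/NumPy sketch of this loop for a single-hidden-layer MLP with sigmoid units and squared error; the network sizes, learning rate and the tiny XOR-style data set are illustrative assumptions, not from the slides:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_bp(X, T, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))  # input -> hidden weights w_jk
    W2 = rng.normal(scale=0.5, size=(T.shape[1], n_hidden))  # hidden -> output weights w_ij
    for _ in range(epochs):
        for x, t in zip(X, T):
            # forward sweep
            h = sigmoid(W1 @ x)
            o = sigmoid(W2 @ h)
            # backward sweep: deltas from the chain rule, with E = 1/2 * sum (o - t)^2
            delta_o = (o - t) * o * (1 - o)
            delta_h = (W2.T @ delta_o) * h * (1 - h)
            # gradient descent updates: w(t+1) = w(t) - eta * dE/dw
            W2 -= eta * np.outer(delta_o, h)
            W1 -= eta * np.outer(delta_h, x)
    return W1, W2

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR: not linearly separable
    W1, W2 = train_bp(X, T)
    for x in X:
        print(x, sigmoid(W2 @ sigmoid(W1 @ x)))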

Training
Generalization: the network's performance on a set of test patterns it has never seen before (lower than on the training set)
The training set is used to let the ANN capture the features in the data, or the mapping
The initial large drop in error is due to learning, but the subsequent slow reduction is due to:
1. Network memorization (too many training cycles used)
2. Overfitting (too many hidden nodes)
[Figure: training and testing error (e.g. SSE) vs. number of hidden nodes or training cycles; the optimum network lies where the testing error is lowest, beyond which the network learns individual training examples and loses its generalization ability]
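
One common way to pick the optimum point on such a curve is early stopping: monitor the error on a held-out test/validation set and keep the weights from the point where it was lowest. A hedged Python sketch of that idea; train_epoch and valid_error are caller-supplied callables (assumptions, not functions defined in these notes):

import copy

def train_with_early_stopping(net, train_epoch, valid_error, max_epochs=1000, patience=20):
    # train_epoch(net): runs one pass of training over the training set (supplied by the caller)
    # valid_error(net): returns the error (e.g. SSE) on a held-out test/validation set
    best_err, best_net, stall = float("inf"), copy.deepcopy(net), 0
    for _ in range(max_epochs):
        train_epoch(net)
        err = valid_error(net)
        if err < best_err:
            best_err, best_net, stall = err, copy.deepcopy(net), 0
        else:
            stall += 1
            if stall >= patience:
                break   # test error stopped improving: memorization/overfitting has set in
    return best_net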

Other Popular ANNs
Some problems can be solved using a variety of ANN types, some only via a specific one (it depends on the problem's logistics)
Hopfield networks: optimization; presented with an incomplete/noisy pattern, the network responds by retrieving the internally stored pattern it most closely resembles
Kohonen networks (self-organizing): trained in an unsupervised manner to form clusters in the data; used for pattern classification and data compression

Summary of ANN Learning
Artificial Neural Networks
- Feedforward
  - Unsupervised: Kohonen, Hebbian
  - Supervised: MLP, RBF
- Recurrent
  - Unsupervised: ART
  - Supervised: Elman, Jordan, Hopfield

Hopfield Network: Structure and Update Rule
Constraints: w_ij = w_ji, w_ii = 0, I_i, O_i ∈ {0, 1}
Update rule:
NET_j = Σ_i w_ij O_i + I_j
O_j(t+1) = 1 if NET_j > T_j
O_j(t+1) = O_j(t) if NET_j = T_j
O_j(t+1) = 0 if NET_j < T_j
[Figure: network structure]

Hopfield Network: Properties and Purpose
Properties: a recurrent network with feedback; a dynamic network
Purpose: output the stored pattern closest to the input
Application areas: associative memory, optimization

Example (1)
Problem: store two pattern vectors x1 and x2
Learning: the connection weights are obtained from the stored patterns (Hebbian outer-product rule), W = x1 x1^T + x2 x2^T
[Slide shows the two pattern vectors and the resulting weight matrix W]

Example (2)
Recall experiment 1: the ability to recover the training data
[Slide shows W x evaluated for one of the stored patterns]

Example (3)
The ability to recover incomplete data
[Slide shows W x evaluated for a corrupted input, followed by the hard-limiting activation f_h(x)]

Practical Examples and Problems
The patterns to be stored should have low similarity to one another
Network capacity: about 15% of the number of nodes
Example: for 10 patterns, at least 70 nodes are needed, requiring about 5,000 connections
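
A small Python/NumPy sketch of this kind of Hopfield-style associative recall, storing two bipolar (±1) patterns with the outer-product rule and recovering a corrupted input with a hard-limiting update; the specific patterns and the ±1 coding are illustrative assumptions:

import numpy as np

def store(patterns):
    # Hebbian outer-product rule: W = sum_p x_p x_p^T, with zero diagonal and symmetric weights
    W = sum(np.outer(p, p) for p in patterns)
    np.fill_diagonal(W, 0)
    return W

def recall(W, x, steps=10):
    # Synchronous hard-limiting update: O = sign(W O); repeat until the state settles
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1, -1)
    return x

if __name__ == "__main__":
    x1 = np.array([ 1, -1,  1, -1,  1, -1])
    x2 = np.array([ 1,  1, -1, -1,  1,  1])
    W = store(np.array([x1, x2]))
    noisy = x1.copy()
    noisy[0] = -noisy[0]                  # flip one element to corrupt the pattern
    print(recall(W, noisy))               # recovers x1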

Boltzmann Machine
Simulated annealing: at temperature T, the output value is determined stochastically by the Boltzmann distribution, with a carefully designed annealing schedule
Boltzmann distribution: P(E_i) = α e^(-E_i / (βT))
Properties:
- A neural network that operates statistically, via simulated annealing and related techniques
- A network capable of global optimization
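
A minimal Python sketch of how such a stochastic unit can be simulated: the probability of the unit turning on follows a Boltzmann/sigmoid form in its energy gap, and the temperature is lowered according to an annealing schedule. The energy-gap value and the geometric cooling schedule are illustrative assumptions:

import math
import random

def stochastic_output(energy_gap, T):
    # Probability that the unit outputs 1 at temperature T (Boltzmann form)
    p_on = 1.0 / (1.0 + math.exp(-energy_gap / T))
    return 1 if random.random() < p_on else 0

def annealing_schedule(T0=10.0, alpha=0.9, steps=50):
    # Geometric cooling: T decreases gradually so the network can escape local minima early on
    T = T0
    for _ in range(steps):
        yield T
        T *= alpha

if __name__ == "__main__":
    for T in annealing_schedule():
        print(T, stochastic_output(energy_gap=1.0, T=T))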

Energy Curve
[Figure: energy curve]

Self-Organizing Map
Self-organizing map (SOM)
- Unsupervised learning
- Preserves the topology of the data
- Widely used in data visualization and topology-preserving mapping
Selection of the winner: ||x - m_c|| = min_i { ||x - m_i|| }
Weight update: m_i(t+1) = m_i(t) + α(t) n_ci(t) [x(t) - m_i(t)]
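
A brief Python/NumPy sketch of one SOM training step built directly from these two formulas; the Gaussian neighborhood, the exponential decay constants and the map size are illustrative assumptions, not values from the slides:

import numpy as np

def som_step(x, m, t, alpha0=0.5, sigma0=2.0, tau=100.0, grid=None):
    # m: (n_nodes, dim) weight vectors; grid: (n_nodes, 2) node coordinates on the map
    if grid is None:
        side = int(np.sqrt(len(m)))
        grid = np.array([[i, j] for i in range(side) for j in range(side)], dtype=float)
    c = np.argmin(np.linalg.norm(x - m, axis=1))        # winner: ||x - m_c|| = min_i ||x - m_i||
    alpha = alpha0 * np.exp(-t / tau)                   # decaying learning rate alpha(t)
    sigma = sigma0 * np.exp(-t / tau)                   # shrinking neighborhood radius
    d2 = np.sum((grid - grid[c])**2, axis=1)
    n_ci = np.exp(-d2 / (2 * sigma**2))                 # Gaussian neighborhood function n_ci(t)
    return m + alpha * n_ci[:, None] * (x - m)          # m_i(t+1) = m_i(t) + alpha(t) n_ci(t) [x(t) - m_i(t)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = rng.random((16, 3))                             # 4x4 map, 3-dimensional inputs
    for t in range(200):
        m = som_step(rng.random(3), m, t)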

SOM Structure

SASOM
1. Start with a basic SOM (4x4 map)
2. Train the current network with Kohonen's algorithm
3. Calibrate the network using known I/O patterns to determine whether a node should be replaced with a submap of several nodes (2x2 map) or should be deleted
4. Unless every node represents a unique class, go to step 2

Learning Procedure
Input data → initialize the map as 4x4 → train with Kohonen's algorithm → structure adaptation:
- Find nodes whose hit_ratio is less than 95.0%
- Split those nodes into 2x2 submaps
- Train the split nodes with the LVQ algorithm
- Remove nodes that did not participate in learning
If the stop condition is satisfied, the map is generated; otherwise repeat the training and adaptation steps

Kohonen's Learning
Initialization: a 4x4 rectangular map, trained with Kohonen's learning algorithm
Learning:
- Winner node: ||x - m_c|| = min_i { ||x - m_i|| }
- Kohonen's learning rule: m_i(t+1) = m_i(t) + α(t) n_ci(t) [x(t) - m_i(t)]
  where n_ci(t) is the neighborhood function and α(t) is the learning rate

Dynamic Node-Splitting
Determining whether a node is to be split or not
Hit ratio: hit_ratio_i = max_j P(c_j | n_i), where i = 1, ..., M and j = 1, ..., N
Nodes with a hit ratio of less than 95.0% are split

Initial Weight of Split Nodes
C: child node
P: parent node
S: weights of the neighbors
N_c: total number of nodes that participate in weight initialization
Each child node's weight C is initialized from the parent's weight P and the neighboring weights S, averaged over the N_c participating nodes
[Figure: arrangement of parent (P) and child (C) nodes used in weight initialization]

LVQ Learning for the Modified Map
m_i(t+1) = m_i(t) + α(t) n_ci(t) h_ci(t) [x(t) - m_i(t)]
where h_ci(t) = 1 if x(t) and m_i(t) belong to the same class, and h_ci(t) = -1 if x(t) and m_i(t) belong to different classes
The neighborhood function is used to preserve the topological order
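
Relative to the SOM step sketched earlier, the only change is the class-dependent factor h_ci(t); a hedged Python fragment of just that modification (the same_class flag and the way the winner's class is determined are illustrative assumptions):

import numpy as np

def lvq_update(x, m_i, same_class, alpha, n_ci):
    # h_ci = +1 moves the node toward x (same class), -1 pushes it away (different class)
    h_ci = 1.0 if same_class else -1.0
    return m_i + alpha * n_ci * h_ci * (x - m_i)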

Homework #1
1. Explain the principle of MLP learning based on Information Geometry, and survey methodologies for improving its learning performance.
2. Survey practical tips for applying MLPs to real-world problems, organized into network structure, learning algorithm, and training-data preprocessing.