Training of Neural Networks. Q.J. Zhang, Carleton University

1 Training of Neural Networks

2 Notation: x: input of the original modeling problem or the neural network; y: output of the original modeling problem or the neural network; w: internal weights/parameters of the neural network; m: number of outputs of the model; y = f(x, w): neural network model; d: data for y (e.g., training data)

3 Define Model Input-Output: define the model input-output (x, y), for example, x: physical/geometrical parameters of the component; y: S-parameters of the component

4 Data Generation: (a) Generate (x, y) samples (x_k, y_k), k = 1, 2, ..., P, such that the finished NN best (accurately) represents the original x~y problem. (b) Data generator. Measurement: for each given x_k, measure the values of y_k, k = 1, 2, ..., P. Simulation: for each given x_k, use a simulator to calculate y_k, k = 1, 2, ..., P.

5 Comparison of Neural Network Based Microwave Model Development Using Data from Two Types of Data Generators

Availability of problem theory/equations:
- Measurement data: the model can be developed even if the theory/equations are not known, or are difficult to implement in CAD.
- Simulation data: the model can be developed only for problems whose theory is implemented in a simulator.

Assumptions:
- Measurement data: no assumptions are involved and the model could include all the effects, e.g., 3D full-wave effects, fringing effects, etc.
- Simulation data: often involves assumptions, and the model will be limited by the assumptions made by the simulator, e.g., 2.5D EM.

Input parameter sweep:
- Measurement data: data generation could be expensive or infeasible if a geometrical parameter, e.g., transistor gate length, needs to be sampled/changed.
- Simulation data: relatively easier to sweep any parameter in the simulator, because the changes are numerical and not physical/manual.

6 Comparison of Neural Network Based Microwave Model Development Using Data from Two Types of Data Generators (continued)

Sources of small and large/gross errors:
- Measurement data: equipment limitations and tolerances.
- Simulation data: accuracy limitations and non-convergence of simulations.

Feasibility of getting the desired output:
- Measurement data: development of models is possible for measurable responses only. For example, the drain charge of an FET may not be easy to measure.
- Simulation data: any response can be modeled as long as it can be computed by the simulator.

7 Data Generation: (c) Range of x to be sampled. For testing data and validation data: [x_min, x_max] should represent the user-intended range in which the NN is to be used. For training data: the default range of x samples should be equal to the user-intended range, or, if feasible, slightly beyond the user-intended range.

8 Data Generation (d) Distribution of x samples (where data should be sampled): uniform grid distribution; non-uniform grid distribution; Design of Experiments (DOE) methodology (central-composite design, 2^n factorial design); star distribution; random distribution.

9 Data Generation (continued): (e) Number of samples P. Theoretical factors: for the grid distribution case, Shannon's theorem; for the random distribution case, statistical confidence.

10 Input / Output Scaling. The orders of magnitude of the various x and d values in microwave applications can be very different from one another. Scaling of training data is desirable for efficient neural network training. The data can be scaled such that the various x (or d) values have a similar order of magnitude.

11 Input / Output Scaling. Notation: x and y -- original x and y; x̃ and ỹ -- scaled x and y; x_min, x_max -- obtained from data; x̃_min, x̃_max -- dictated by the NN trainer.

Linear scale:
Scale formula: $\tilde{x} = \tilde{x}_{\min} + \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}\,(\tilde{x}_{\max} - \tilde{x}_{\min})$
De-scale formula: $x = x_{\min} + \dfrac{\tilde{x} - \tilde{x}_{\min}}{\tilde{x}_{\max} - \tilde{x}_{\min}}\,(x_{\max} - x_{\min})$

Log scale:
Scale formula: $\tilde{x} = \tilde{x}_{\min} + \dfrac{\ln(x/x_{\min})}{\ln(x_{\max}/x_{\min})}\,(\tilde{x}_{\max} - \tilde{x}_{\min})$
De-scale formula: $x = x_{\min}\,\exp\!\left[\dfrac{\tilde{x} - \tilde{x}_{\min}}{\tilde{x}_{\max} - \tilde{x}_{\min}}\,\ln(x_{\max}/x_{\min})\right]$
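For concreteness, here is a minimal Python sketch of the scale/de-scale formulas above (the target range [-1, 1] and the function names are illustrative assumptions, not part of the slides):

```python
import numpy as np

def linear_scale(x, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Map x from [x_min, x_max] to the trainer-dictated range [xt_min, xt_max]."""
    return xt_min + (x - x_min) / (x_max - x_min) * (xt_max - xt_min)

def linear_descale(xt, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Inverse of linear_scale: map scaled values back to the original range."""
    return x_min + (xt - xt_min) / (xt_max - xt_min) * (x_max - x_min)

def log_scale(x, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Logarithmic scaling, useful when x spans several orders of magnitude."""
    return xt_min + np.log(x / x_min) / np.log(x_max / x_min) * (xt_max - xt_min)

def log_descale(xt, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Inverse of log_scale."""
    return x_min * np.exp((xt - xt_min) / (xt_max - xt_min) * np.log(x_max / x_min))

# Example: gate lengths of order 1e-7 m scaled to [-1, 1] for training.
x = np.array([1e-7, 2e-7, 5e-7])
xt = linear_scale(x, x.min(), x.max())
assert np.allclose(linear_descale(xt, x.min(), x.max()), x)
```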

12 Illustration of Data Scaling [flow diagram: data generation → data scaling → neural network training → finished model for user; the training data (x, d) are scaled before training, and the outputs y of the trained neural network model are de-scaled for the user]

13 Divide Data into Training Set, Validation Set and Testing Set. Notation: P -- total number of data samples generated; D -- set of all data, D = {1, 2, ..., P}; Tr -- training data set; V -- validation data set; Te -- test data set. Ideally, each data set (Tr, V, Te) should be an adequate representation of the original y = f(x) problem over the entire range x_min ~ x_max, and the three sets should have no overlap.

14 Divide Data into Training Set, Validation Set and Testing Set. Case 1: when the original data is quite sufficient, split D into non-overlapping sets. Case 2: when data is very limited, duplicate D, such that Tr = V = Te = D. Case 3: split data D into 2 sets.
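A minimal Python sketch of Case 1 (a random, non-overlapping split); the 60/20/20 proportions and the helper name are illustrative assumptions:

```python
import numpy as np

def split_data(P, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split sample indices {0, ..., P-1} into non-overlapping
    training, validation and test sets (Case 1: data is sufficient)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(P)
    n_tr = int(train_frac * P)
    n_v = int(val_frac * P)
    Tr, V, Te = idx[:n_tr], idx[n_tr:n_tr + n_v], idx[n_tr + n_v:]
    return Tr, V, Te

Tr, V, Te = split_data(P=100)
assert len(set(Tr) & set(V)) == 0 and len(set(V) & set(Te)) == 0
```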

15 Training / Validation and Testing

Training error:
$E_{Tr}(w) = \left[\dfrac{1}{\mathrm{size}(T_r)\,m}\sum_{k \in T_r}\sum_{j=1}^{m}\left|\dfrac{y_j(x_k, w) - d_{jk}}{y_{j,\max} - y_{j,\min}}\right|^{q}\right]^{1/q}$

Validation error:
$E_{V}(w) = \left[\dfrac{1}{\mathrm{size}(V)\,m}\sum_{k \in V}\sum_{j=1}^{m}\left|\dfrac{y_j(x_k, w) - d_{jk}}{y_{j,\max} - y_{j,\min}}\right|^{q}\right]^{1/q}$

Test error:
$E_{Te}(w) = \left[\dfrac{1}{\mathrm{size}(T_e)\,m}\sum_{k \in T_e}\sum_{j=1}^{m}\left|\dfrac{y_j(x_k, w) - d_{jk}}{y_{j,\max} - y_{j,\min}}\right|^{q}\right]^{1/q}$
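The same error measure written as a short Python sketch (following the reconstruction above; the array shapes and names are assumptions):

```python
import numpy as np

def q_norm_error(y_model, d, y_min, y_max, q=2):
    """Normalized q-norm error over a data set: y_model and d are
    (num_samples, m) arrays of model outputs and data; y_min and y_max
    are length-m vectors used for per-output normalization."""
    num_samples, m = d.shape
    resid = np.abs((y_model - d) / (y_max - y_min)) ** q
    return (resid.sum() / (num_samples * m)) ** (1.0 / q)
```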

16 Where to Use Each Error Criterion. Training error: the training error E_Tr(w) and its derivative ∂E_Tr(w)/∂w are used to determine how to update w during training. Validation error: the validation error E_V(w) is used as a stopping criterion during training, i.e., to determine whether training is sufficient. Test error: the test error E_Te(w) is used after training has finished to provide a final assessment of the quality of the trained neural network; the test error is not involved during training.

17 Flow-chart Showing Neural Network Training, Validation and Testing

START: select a neural network structure, e.g., MLP, and assign random initial values to all the weight parameters. Then repeat:
1. Perform feedforward computation for all samples in the training set and evaluate the training error.
2. Compute the derivatives of the training error w.r.t. the ANN internal weights.
3. Update the neural network weight parameters using a gradient-based algorithm (e.g., BP (backpropagation), quasi-Newton).
4. Perform feedforward computation for all samples in the validation set and evaluate the validation error.
5. If the desired accuracy is not achieved, go back to step 1; otherwise STOP training.
After training: perform feedforward computation for all samples in the test set and evaluate the test error as an independent quality measure of the trained neural network model.
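As a sketch (not the original course code), the loop above might look as follows in Python; train_grad, train_error and val_error are assumed user-supplied callables, and the plain gradient step stands in for whichever update algorithm is chosen:

```python
import numpy as np

def train(w0, train_grad, train_error, val_error,
          eta=0.01, eps=1e-3, max_epochs=1000):
    """Gradient-based training with the validation error as stopping criterion.
    w0: random initial weights; train_grad(w) returns dE_Tr/dw;
    train_error(w) and val_error(w) return E_Tr(w) and E_V(w)."""
    w = w0.copy()
    for epoch in range(max_epochs):
        g = train_grad(w)          # derivatives of training error w.r.t. w
        w = w - eta * g            # weight update (here: a simple BP-style step)
        if val_error(w) < eps:     # desired accuracy achieved?
            break
    return w, train_error(w), val_error(w)

# After training, the test error E_Te(w) would be evaluated once on the
# returned w as an independent quality measure.
```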

18 Initial Value of NN Weights Before Training. MLP: small random values. RBF/Wavelet: estimate the centers & widths of the RBFs, or the translations & dilations of the wavelets. Knowledge-Based NN: use physical/electrical experience.

19 Overlearning. Definition (strict math): E_Tr ≈ 0, E_V >> E_Tr. Observation: the NN has memorized the training data, but cannot generalize well. Possible reasons: (a) too many hidden neurons; (b) not enough training data. Actions: (a) add training data; (b) delete hidden neurons; (c) back up / retrieve a previous solution.

20 Neural Network Over-Learning [plot of output (y) vs. input (x): the neural network curve passes through the training data but deviates from the validation data]

21 Underlearning. Definition (strict math): E_Tr >> 0. Observation: the NN cannot even represent the problem at the training points. Possible reasons: (a) not enough hidden neurons; (b) training stuck at a local solution; (c) not enough training. Actions: (a) add hidden neurons; (b) train more; (c) perturb the solution, then train.

22 Neural Network Under-Learning [plot of output (y) vs. input (x): the neural network curve fails to follow both the training data and the validation data]

23 Perfect Learning. Definition (strict math): E_Tr ≈ E_V ≈ 0. Observation: the NN generalizes well.

24 Perfect Learning of Neural Networks [plot of output (y) vs. input (x): the neural network curve follows both the training data and the validation data]

25 Types of Training. Sample-by-sample (or online) training: ANN weights are updated every time a training sample is presented to the network, i.e., the weight update is based on the training error from that sample. Batch-mode (or offline) training: ANN weights are updated after each epoch, i.e., the weight update is based on the training error from all the samples in the training data set, where an epoch is defined as a stage of ANN training that involves presenting all the samples in the training data set to the neural network once for the purpose of learning. Supervised training: uses (x and y) data for training. Unsupervised training: uses only x data for training.
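The distinction between the two supervised update modes can be sketched as follows (illustrative Python; sample_grad(w, k) is an assumed callable returning the error gradient for training sample k, and w is assumed to be a NumPy array):

```python
def epoch_sample_by_sample(w, train_set, sample_grad, eta):
    """Online training: update w after each individual training sample."""
    for k in train_set:
        w = w - eta * sample_grad(w, k)
    return w

def epoch_batch_mode(w, train_set, sample_grad, eta):
    """Batch-mode training: accumulate the gradient over the whole epoch,
    then update w once."""
    g = sum(sample_grad(w, k) for k in train_set)
    return w - eta * g / len(train_set)
```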

26 Neural Network Training. The error between the training data d and the neural network outputs y is fed back to the neural network to guide the update of the internal weights w of the network. [block diagram: training data (x, d) → neural network with weights w → output y; the training error d − y is fed back to adjust w]

27 Training Problem Statement: Given training data (x_k, d_k), k ∈ Tr; validation data (x_k, d_k), k ∈ V; and the NN model y(x, w). Find values of w such that the validation error is minimized:

$\min_{\text{epoch}} \; E_V(\text{epoch})$

where

$E_V(\text{epoch}) = \left[\dfrac{1}{P_V\,m}\sum_{k \in V}\sum_{j=1}^{m}\left|\dfrac{y_j(x_k, w(\text{epoch})) - d_{jk}}{y_{\max,j} - y_{\min,j}}\right|^{q}\right]^{1/q}, \qquad P_V = \mathrm{size}(V)$

$w(\text{epoch}) = w(\text{epoch}-1) + \Delta w(\text{epoch}-1), \qquad w(\text{epoch}=0) = \text{user's / software initial guess}$

and Δw(epoch−1) is the update determined by the optimization algorithm (training algorithm), which minimizes the training error.

28 Steps of Gradient-Based Training Algorithms:
Step 1: w = initial guess, epoch = 0.
Step 2: If E_V(epoch) < ε (given accuracy criterion) or epoch > max_epoch (maximum number of epochs), stop.
Step 3: Calculate E_Tr(w) and ∂E_Tr(w)/∂w using part of or all the training data.
Step 4: Use the optimization algorithm to find Δw; set w ← w + Δw.
Step 5: If all training data have been used, then epoch = epoch + 1 and go to Step 2; else go to Step 3.

29 Update of w in Gradient-Based Methods. Δw = η h, where h is the direction of the update of w and η is the step size of the update. Gradient-based methods use information of E_Tr(w) and ∂E_Tr(w)/∂w to determine the update direction of w. The step size η is determined by: a small fixed constant set by the user; an adaptive constant during training; or a line minimization method to find the best value of η.

30 Line Minimization. Problem statement: let a scalar function of one variable be defined as φ(η) = E_Tr(w + η h). Given the present value of w and the direction h, find η such that φ(η) is minimized. Solution methods (1-dimensional optimization methods): sectioning methods (golden section method, Fibonacci method, bisection method) and interpolation methods (quadratic method, cubic method).
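A minimal Python sketch of the golden section method applied to φ(η) = E_Tr(w + ηh); the bracket [a, b], the tolerance, and the example function are assumptions:

```python
import math

def golden_section(phi, a, b, tol=1e-5):
    """Golden-section search for the minimizer of a unimodal phi on [a, b],
    e.g. phi(eta) = E_Tr(w + eta * h) for the current w and direction h."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0        # ~0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    while (b - a) > tol:
        if fc < fd:                # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:                      # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

# Example: a quadratic section with its minimum at eta = 0.3.
eta_opt = golden_section(lambda eta: (eta - 0.3) ** 2, 0.0, 1.0)
```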

31 Back-Propagation (BP) (Rumelhart, Hinton & Williams, 1986). We use the negative gradient direction, h = −∂E_Tr(w)/∂w, in Δw = η h. The neural network weights are updated during training as

$w \leftarrow w - \eta\,\dfrac{\partial E_{Tr}(w)}{\partial w}$

or, with momentum,

$w \leftarrow w - \eta\,\dfrac{\partial E_{Tr}(w)}{\partial w} + \alpha\,\Delta w_{\text{epoch}-1}$

where η is called the learning rate and α is called the momentum factor.
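The momentum form of the BP update as a short Python sketch (the function name and the default values of η and α are illustrative):

```python
import numpy as np

def bp_update(w, grad, prev_dw, eta=0.1, alpha=0.5):
    """One BP weight update: grad is dE_Tr/dw at the current w, and prev_dw
    is the weight change from the previous epoch (zero initially)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw
```

The returned dw is carried over to the next epoch as prev_dw.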

32 Determining η and α for BP: set η and α as fixed constants, or let η and α be adaptive, e.g., η = c / epoch, where c is a constant.

Search-then-converge (STC) schedule (Darken, 1992):

$\eta = \eta_0\,\dfrac{1 + \dfrac{c}{\eta_0}\dfrac{\text{epoch}}{\tau}}{1 + \dfrac{c}{\eta_0}\dfrac{\text{epoch}}{\tau} + \tau\left(\dfrac{\text{epoch}}{\tau}\right)^{2}}$

where η_0, τ and c are user-defined.

Delta-bar-delta (Jacobs, 1988): (a) a separate learning rate η_i for each weight w_i of the NN; (b) each η_i is adjusted during training using present and previous information of ∂E_Tr(w)/∂w_i, and the weight update is $\Delta w_i = -\eta_i\,\dfrac{\partial E_{Tr}(w)}{\partial w_i}$.
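A sketch of the search-then-converge schedule as reconstructed above (the parameter defaults are arbitrary assumptions):

```python
def stc_learning_rate(epoch, eta0=0.5, c=1.0, tau=100.0):
    """Search-then-converge schedule: roughly eta0 for epoch << tau,
    decaying like c/epoch for epoch >> tau."""
    r = epoch / tau
    a = (c / eta0) * r
    return eta0 * (1.0 + a) / (1.0 + a + tau * r * r)
```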

33 Conjugate Gradient Method. Δw = η h. Let $\nabla E = \dfrac{\partial E_{Tr}(w)}{\partial w}$. Then

$h(\text{epoch}) = -\nabla E + \gamma\,h(\text{epoch}-1)$

with the initial direction h = −∇E (γ = 0), and

$\gamma = \dfrac{\lVert \nabla E(\text{epoch}) \rVert^{2}}{\lVert \nabla E(\text{epoch}-1) \rVert^{2}}$ (Fletcher-Reeves), or $\gamma = \dfrac{\left(\nabla E(\text{epoch}) - \nabla E(\text{epoch}-1)\right)^{T}\nabla E(\text{epoch})}{\lVert \nabla E(\text{epoch}-1) \rVert^{2}}$ (Polak-Ribiere).

The step size η is determined by a line minimization method or a trust region method. Speed: generally faster than BP. Memory: a few vectors of length N_W, where N_W is the total number of NN weights/parameters in w.
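The direction update in a short Python sketch (the function name and the variant switch are assumptions):

```python
import numpy as np

def cg_direction(grad, prev_grad, prev_h, variant="fletcher-reeves"):
    """Conjugate-gradient search direction h = -grad + gamma * prev_h.
    Pass prev_h = None on the first epoch to get h = -grad."""
    if prev_h is None:
        return -grad
    if variant == "fletcher-reeves":
        gamma = (grad @ grad) / (prev_grad @ prev_grad)
    else:  # Polak-Ribiere
        gamma = ((grad - prev_grad) @ grad) / (prev_grad @ prev_grad)
    return -grad + gamma * prev_h
```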

34 Quasi-Newton Method. Let H be the Hessian matrix of E_Tr w.r.t. w, and B be the inverse of H. Weight update: Δw = η h, where $h = -B\,\dfrac{\partial E_{Tr}}{\partial w}$. Information from Δw and Δg is used to approximate B:

$B_{\text{epoch}=0} = I$

$B_{\text{epoch}} = B_{\text{epoch}-1} + \dfrac{\Delta w\,\Delta w^{T}}{\Delta w^{T}\Delta g} - \dfrac{B_{\text{epoch}-1}\,\Delta g\,\Delta g^{T}\,B_{\text{epoch}-1}}{\Delta g^{T}\,B_{\text{epoch}-1}\,\Delta g}$ (DFP formula)

where Δw = w(epoch) − w(epoch−1) and Δg = ∇E(epoch) − ∇E(epoch−1). Speed: fast. Main memory needed: $N_W^{2}$ (large).
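The DFP update of the approximate inverse Hessian in a short Python sketch (vector and matrix shapes are assumptions; B starts as the identity):

```python
import numpy as np

def dfp_update(B, dw, dg):
    """DFP update of the inverse-Hessian approximation B, given the weight
    change dw = w(epoch) - w(epoch-1) and the gradient change
    dg = grad(epoch) - grad(epoch-1)."""
    Bdg = B @ dg
    return (B + np.outer(dw, dw) / (dw @ dg)
              - np.outer(Bdg, Bdg) / (dg @ Bdg))

# Search direction for the next update: h = -dfp_update(B, dw, dg) @ grad,
# with the step size eta chosen by a line minimization.
```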

35 Levenberg-Marquardt Method. Obtain Δw by solving the linear equations

$\left(J^{T} J + \mu I\right)\Delta w = -J^{T} e$

where $e = [e_1, e_2, \ldots, e_{N_e}]^{T}$ with $e_{jk} = \dfrac{y_j(x_k, w) - d_{jk}}{y_{\max,j} - y_{\min,j}}$, and J is the Jacobian, $J = \dfrac{\partial e}{\partial w^{T}}$. Setting μ = δ > 0 gives the typical Levenberg-Marquardt method; μ = 0 gives Gauss-Newton. This method is good if e can be very small, e.g., small-residue problems. The computation needs an LU decomposition. Main memory needed: $N_W^{2}$ (large).
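The Levenberg-Marquardt step in a short Python sketch (a sketch of the linear solve above, not production code):

```python
import numpy as np

def lm_step(J, e, mu):
    """Solve (J^T J + mu * I) dw = -J^T e for the weight update dw.
    mu = 0 reduces to the Gauss-Newton step; larger mu moves the update
    toward a (scaled) steepest-descent step."""
    n_w = J.shape[1]
    A = J.T @ J + mu * np.eye(n_w)
    return np.linalg.solve(A, -J.T @ e)   # LU-based dense solve
```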

36 Other Training Methods. Huber-Quasi-Newton: similar to Quasi-Newton, except that the error function for training is based on the Huber function rather than the conventional least-squares error function. The Huber formulation allows the training algorithm to robustly handle both small random errors and accidental large errors in the training data.

37 Other Training Methods (continued). Simplex Method: uses information of E_Tr(w) only. The method starts with several initial guesses of w; these initial points form a simplex in the w space. The method then iteratively updates the simplex using basic moves such as reflection, expansion, and contraction, according to the values of E_Tr(w) at the vertices of the simplex. The error E_Tr(w) generally decreases as the simplex is updated.

38 Other Training Methods (continued). Genetic Algorithm: uses information of E_Tr(w) only, and searches for a global minimum. The algorithm starts with several initial points of w, called a population. A fitness value is defined for each w such that a w with lower error E_Tr(w) has a higher fitness. The w points with high fitness values are more likely to be selected as parents, from whom new points of w, called offspring, are produced. This process continues, and the fitness of the population improves.

39 Other Training Methods (continued). Simulated Annealing Algorithm: uses information of E_Tr(w) only, and also searches for a global minimum. The method uses a temperature parameter to control the training process. In the early stages of training the temperature is high and new w points with higher error values can be accepted, thus avoiding being trapped in a local minimum. In the late stages of training the temperature is low and only the w points leading to a reduction in error are accepted.
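The temperature-controlled acceptance step is commonly implemented with a Metropolis-style rule; the slide does not spell out the exact rule, so the following Python sketch is an assumption:

```python
import math
import random

def accept_move(delta_error, temperature):
    """Accept a candidate w: always if E_Tr decreases; otherwise with
    probability exp(-delta_error / temperature), so uphill moves are common
    at high temperature (early training) and rare at low temperature."""
    if delta_error <= 0.0:
        return True
    return random.random() < math.exp(-delta_error / temperature)
```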

40 Qualitative Comparison of Different Algorithms. Ranked from fast convergence rate (and more memory need and implementation effort) to slow convergence rate (and less memory need and implementation effort): Levenberg-Marquardt (for small-residue problems), Quasi-Newton, Conjugate Gradient, BP.

41 Comparison of Training Algorithms for a 3-Conductor Microstrip Line Example (5 input neurons, 28 hidden neurons, 5 output neurons). Training algorithms compared: Adaptive Backpropagation, Conjugate Gradient, Quasi-Newton, Levenberg-Marquardt; for each, the table reports the number of epochs, training error (%), average test error (%), and CPU time (in seconds). [table values not reproduced in this transcription]

42 Comparison of Training Algorithms for a MESFET Example (4 input neurons, 60 hidden neurons, 8 output neurons). Training algorithms compared: Adaptive Backpropagation, Conjugate Gradient, Quasi-Newton, Levenberg-Marquardt; for each, the table reports the number of epochs, training error (%), average test error (%), and CPU time (in seconds). [table values not reproduced in this transcription]
