Experimental Data and Training

1 Modeling and Control of Dynamic Systems
Experimental Data and Training
Mihkel Pajusalu, Alo Peets
Tartu,

2 Overview
- Experimental data
  - Designing input signal
  - Preparing data for modeling
- Training
  - Criterion
  - Training methods
  - Issues with training

3 Experimental data
The primary purpose of an experiment is to produce a set of examples of how the dynamic system to be identified responds to various controls.
- Sufficiency of a linear model
  - Superposition check
  - Homogeneity check
  - Frequency response check: the output signal must track the input signal over frequency and amplitude
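The superposition and homogeneity checks above can be sketched in a few lines. This is only an illustration: the two example systems (`linear_system`, `saturating_system`), the test inputs, and the scale factor `alpha` are assumptions made for the sketch, not taken from the slides.

```python
def linear_system(u):
    # Hypothetical first-order filter y(t) = 0.8*y(t-1) + 0.2*u(t), zero initial state
    y, out = 0.0, []
    for ut in u:
        y = 0.8 * y + 0.2 * ut
        out.append(y)
    return out


def saturating_system(u):
    # Same filter, but the input is clipped to [-1, 1] first (a nonlinearity)
    return linear_system([max(-1.0, min(1.0, ut)) for ut in u])


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def linearity_errors(system, u1, u2, alpha=3.0):
    """Return (superposition error, homogeneity error) for the two test inputs."""
    y1, y2 = system(u1), system(u2)
    y_sum = system([a + b for a, b in zip(u1, u2)])
    y_scaled = system([alpha * a for a in u1])
    # Superposition: response to u1 + u2 should equal y1 + y2
    superposition_err = max_abs_diff(y_sum, [a + b for a, b in zip(y1, y2)])
    # Homogeneity: response to alpha * u1 should equal alpha * y1
    homogeneity_err = max_abs_diff(y_scaled, [alpha * a for a in y1])
    return superposition_err, homogeneity_err


u1 = [0.8] * 20
u2 = [0.5 if t % 2 == 0 else -0.5 for t in range(20)]
lin_err = linearity_errors(linear_system, u1, u2)
sat_err = linearity_errors(saturating_system, u1, u2)
print("linear system errors:    ", lin_err)
print("saturating system errors:", sat_err)
```

Errors near zero suggest a linear model may suffice over the tested range; clearly non-zero errors, as for the saturating system, indicate that it does not.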

4 Experiment design
- Sampling frequency: a compromise between identification and controller design
- Curse of dimensionality: all combinations of frequencies and amplitudes
- Designing the input signal
  - N-samples constant
  - Level change at random instances
  - Chirp signal

5 N-samples constant

6 Level change at random instances

7 Chirp signal
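The three input signal types named on these slides could be generated roughly as follows. The function names, parameters (`hold`, `switch_prob`, `f0`, `f1`, `ts`), and the choice of a linear frequency sweep are assumptions for this sketch; the slides do not specify them.

```python
import math
import random


def n_samples_constant(n_total, hold, low, high, rng):
    """Draw a new random level every `hold` samples and keep it constant in between."""
    u = []
    level = rng.uniform(low, high)
    for t in range(n_total):
        if t % hold == 0:
            level = rng.uniform(low, high)
        u.append(level)
    return u


def random_instant_levels(n_total, switch_prob, low, high, rng):
    """At each sample, switch to a new random level with probability `switch_prob`."""
    u = []
    level = rng.uniform(low, high)
    for _ in range(n_total):
        if rng.random() < switch_prob:
            level = rng.uniform(low, high)
        u.append(level)
    return u


def linear_chirp(n_total, ts, f0, f1, amplitude=1.0):
    """Sine wave whose frequency sweeps linearly from f0 to f1 (Hz) over the record."""
    duration = n_total * ts
    k = (f1 - f0) / duration  # sweep rate in Hz per second
    return [amplitude * math.sin(2 * math.pi * (f0 * t + 0.5 * k * t * t))
            for t in (i * ts for i in range(n_total))]


rng = random.Random(0)
u_hold = n_samples_constant(100, 10, -1.0, 1.0, rng)
u_rand = random_instant_levels(200, 0.1, -1.0, 1.0, rng)
u_chirp = linear_chirp(500, 0.01, 0.5, 5.0)
```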

8 Experiment in closed loop
When a system is unstable or poorly damped, one of the following is needed to keep it inside its intended operating range:
1. a stabilizing controller
2. or, alternatively, a manually tuned PID controller
3. or a human operator controlling the system

9 Preparing data
- Filtering can be used to remove noise, periodic disturbances, offsets, and the effects of uninteresting dynamics from the measured signals.
- Remove redundancy and outliers from the data set.
- Scaling is highly recommended: remove the mean and scale all signals to the same variance.
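A minimal sketch of the scaling step, assuming each signal is scaled to zero mean and unit variance (the slide only says "the same variance"; unit variance is an assumption). The helper name `standardize` is hypothetical.

```python
import math


def standardize(signal):
    """Return (scaled signal, mean, std) with the scaled signal at zero mean, unit variance."""
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    if std == 0.0:
        raise ValueError("constant signal cannot be scaled to unit variance")
    scaled = [(x - mean) / std for x in signal]
    return scaled, mean, std


scaled, mean, std = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, std)
```

The returned `mean` and `std` need to be kept so that model predictions can be mapped back to the original units.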

10 Training
- Definition: a mapping from the data set to the set of candidate models, i.e. selecting the best model among the candidates
- Criterion: mean square error
- Searching for the minimum of the criterion
- Prediction Error Method
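The formula for the criterion did not survive on this slide. A common way to write a mean square error criterion and the resulting estimate is given below; the symbols (V_N for the criterion, θ for the weights, Z^N for the data set of N samples, ŷ for the one-step prediction) are notation assumed here, not recovered from the slide.

```latex
V_N(\theta, Z^N) = \frac{1}{2N} \sum_{t=1}^{N} \bigl[ y(t) - \hat{y}(t \mid \theta) \bigr]^2,
\qquad
\hat{\theta} = \arg\min_{\theta} V_N(\theta, Z^N)
```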

11 Taylor expansion
- Second-order Taylor expansion
- The gradient
- The Hessian
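The expansion itself is missing from the slide text. In assumed notation (criterion V_N, expansion point θ*, gradient G, Hessian H), a second-order Taylor expansion reads:

```latex
V_N(\theta) \approx V_N(\theta^*) + (\theta - \theta^*)^T G(\theta^*)
  + \tfrac{1}{2} (\theta - \theta^*)^T H(\theta^*) (\theta - \theta^*),
\qquad
G(\theta^*) = \left. \frac{dV_N}{d\theta} \right|_{\theta = \theta^*},
\quad
H(\theta^*) = \left. \frac{d^2 V_N}{d\theta^2} \right|_{\theta = \theta^*}
```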

12 Searching for minimum
- Minimum conditions
  - The gradient equals zero
  - The Hessian matrix is positive definite, i.e. v^T H v > 0 for all non-zero vectors v
- Search
  - Update rule
  - Converges to a local minimum
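Written out in assumed notation (gradient G, Hessian H, step size μ, search direction f, iteration index i), the minimum conditions and a generic iterative update rule are:

```latex
\begin{aligned}
&G(\theta^*) = 0, \qquad v^T H(\theta^*)\, v > 0 \quad \text{for all } v \neq 0 \\
&\theta^{(i+1)} = \theta^{(i)} + \mu^{(i)} f^{(i)}
\end{aligned}
```

The training methods differ mainly in how the search direction f and the step size μ are chosen.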

13 Training methods
- First order
  - Use only gradient information about the criterion
  - Gradient method
- Second order (Hessian)
  - Also use second-order derivatives (the Newton method, the Levenberg-Marquardt method)
- Recursive
  - On-line training, uses only the latest input-output pair
  - Recursive versions of the gradient and Gauss-Newton methods

14 First-order methods: Gradient
- A.k.a. the steepest descent method
- Search direction is opposite to the gradient
- Back-propagation algorithm
- Step size is very important
- Slow convergence
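A small sketch of why the step size matters. The quadratic test criterion V(θ) = 0.5 (10 θ0² + θ1²) and both step sizes are assumptions chosen for illustration: with step 0.15 the iteration converges, while with step 0.25 the steep direction overshoots and diverges.

```python
def gradient(theta):
    # Gradient of the assumed criterion V(theta) = 0.5 * (10*theta0^2 + theta1^2)
    return [10.0 * theta[0], theta[1]]


def gradient_descent(theta0, step, iterations):
    """Fixed-step steepest descent: move opposite to the gradient each iteration."""
    theta = list(theta0)
    for _ in range(iterations):
        g = gradient(theta)
        theta = [th - step * gi for th, gi in zip(theta, g)]
    return theta


good = gradient_descent([1.0, 1.0], step=0.15, iterations=200)
bad = gradient_descent([1.0, 1.0], step=0.25, iterations=20)
print("step 0.15:", good)
print("step 0.25:", bad)
```

The small step that keeps the steep direction stable also makes progress along the flat direction slow, which is the slow convergence mentioned above.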

15 Second order: Newton method
- Also uses the Hessian
- Using the criterion
- Derivation

16 Second order: Newton method
- Update rule
- Search direction
- Must be complemented with a line search
  - Damped Newton method
- Better convergence when the gradient method is used as a first stage
- Possibly ill-conditioned Hessian
  - Numerical problems when calculating the search direction
- Might not converge if criterion
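The need for a line search can be illustrated on a one-dimensional example. The criterion V(θ) = sqrt(1 + θ²), the starting point, and the backtracking (Armijo) rule are assumptions for this sketch; for this criterion the pure Newton step maps θ to -θ³, so it diverges from any start with |θ| > 1.

```python
import math


def V(x):
    return math.sqrt(1.0 + x * x)


def grad(x):
    return x / math.sqrt(1.0 + x * x)


def hess(x):
    return (1.0 + x * x) ** -1.5


def newton(x, iterations, damped):
    """Newton iteration; with damped=True the step is shortened by backtracking."""
    for _ in range(iterations):
        g = grad(x)
        if abs(g) < 1e-12:
            break
        direction = -g / hess(x)  # Newton search direction
        mu = 1.0
        if damped:
            # Halve the step until the criterion decreases sufficiently
            while V(x + mu * direction) > V(x) + 1e-4 * mu * g * direction:
                mu *= 0.5
        x = x + mu * direction
        if abs(x) > 1e6:  # treat as diverged
            break
    return x


x_damped = newton(2.0, 50, damped=True)
x_plain = newton(2.0, 50, damped=False)
print("damped Newton:", x_damped)
print("plain Newton: ", x_plain)
```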

17 Quasi-Newton methods
- The Newton method has quadratic convergence
  - Computationally expensive
- Computationally cheaper: the Hessian can be approximated -> quasi-Newton methods
- BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm
  - Approximates the Hessian from previous iterates and gradients

18 Gauss-Newton method
- Convergence of quasi-Newton methods is poor
- The Gauss-Newton method approximates the prediction error
- The resulting Hessian is different
  - Less expensive to compute (first-order derivatives only)
  - Positive semidefinite by definition
- Called the damped Gauss-Newton method when combined with a line search
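In assumed notation (prediction error ε, derivative of the prediction ψ, mean square error criterion V_N over N samples), the Gauss-Newton quantities can be written as below. The approximate Hessian R uses first-order derivatives only, and as a sum of outer products it is positive semidefinite.

```latex
\begin{aligned}
\varepsilon(t, \theta) &= y(t) - \hat{y}(t \mid \theta), \qquad
\psi(t, \theta) = \frac{d\hat{y}(t \mid \theta)}{d\theta} \\
G(\theta) &= -\frac{1}{N} \sum_{t=1}^{N} \psi(t, \theta)\, \varepsilon(t, \theta) \\
R(\theta) &= \frac{1}{N} \sum_{t=1}^{N} \psi(t, \theta)\, \psi^T(t, \theta) \\
f &= -R^{-1}(\theta)\, G(\theta)
\end{aligned}
```

The full Hessian contains an additional term with second derivatives of the prediction weighted by the prediction errors; Gauss-Newton drops that term.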

19 Pseudo-Newton method
- Neglects off-diagonal elements in the Gauss-Newton Hessian
- Search direction is cheaper to calculate
- Consumes less memory
- Overcomes ill-conditioning of the Hessian
- Convergence is slower

20 Levenberg-Marquardt method
- The search direction of the Gauss-Newton method is not optimal: it is valid only in the immediate neighborhood of the current iterate
- The minimum is searched for only within a given radius
- Can be scaled

21 Levenberg-Marquardt method
- The most obvious choice for neural network training
- Fast and robust convergence
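A minimal Levenberg-Marquardt-style sketch, fitting the hypothetical model ŷ = a·exp(b·t) to noise-free data. The model, the data, the starting point, and the simple "divide or multiply λ by 10" rule are all assumptions for illustration; this is not the exact variant from the slides, which would typically adjust λ from the ratio of actual to predicted decrease.

```python
import math


def criterion(theta, data):
    a, b = theta
    return sum((y - a * math.exp(b * t)) ** 2 for t, y in data) / (2 * len(data))


def gradient_and_gn_hessian(theta, data):
    """Gradient G and Gauss-Newton Hessian R of the mean square error criterion."""
    a, b = theta
    n = len(data)
    g = [0.0, 0.0]
    r = [[0.0, 0.0], [0.0, 0.0]]
    for t, y in data:
        e = math.exp(b * t)
        psi = [e, a * t * e]  # derivative of the prediction w.r.t. (a, b)
        eps = y - a * e       # prediction error
        for i in range(2):
            g[i] -= psi[i] * eps / n
            for j in range(2):
                r[i][j] += psi[i] * psi[j] / n
    return g, r


def solve2(m, rhs):
    # Solve a 2x2 linear system by Cramer's rule
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * rhs[0] - m[0][1] * rhs[1]) / det,
            (m[0][0] * rhs[1] - m[1][0] * rhs[0]) / det]


def levenberg_marquardt(theta, data, lam=1.0, iterations=100):
    v = criterion(theta, data)
    for _ in range(iterations):
        g, r = gradient_and_gn_hessian(theta, data)
        damped = [[r[0][0] + lam, r[0][1]], [r[1][0], r[1][1] + lam]]
        step = solve2(damped, [-g[0], -g[1]])
        candidate = [theta[0] + step[0], theta[1] + step[1]]
        v_new = criterion(candidate, data)
        if v_new < v:
            # Accept the step and trust the Gauss-Newton model more
            theta, v = candidate, v_new
            lam = max(lam / 10.0, 1e-12)
        else:
            # Reject the step and shrink the search region
            lam *= 10.0
    return theta, v


data = [(0.5 * k, 2.0 * math.exp(-0.5 * 0.5 * k)) for k in range(10)]
theta_hat, v_final = levenberg_marquardt([1.0, 0.0], data)
print(theta_hat, v_final)
```

A large λ gives a short, gradient-like step; a small λ gives a step close to the Gauss-Newton one.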

22 Recursive algorithms
- New data is added during training
- Sometimes on-line training is needed
  - Adaptive control -> time-varying systems
  - Batch methods take too much computation time, and old information becomes obsolete
- Also useful in off-line training
  - Simpler implementation, less memory needed, redundancy is used more effectively for convergence
- A new criterion is needed
- Schemes for discarding past information

23 Recursive Gauss-Newton method
- P is a covariance matrix, initially P(0) = cI, where c is a large number
- For an ARX network, it degenerates to the Recursive Least Squares (RLS) algorithm
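The RLS special case can be sketched as follows, with P(0) = cI as on the slide. The first-order ARX system y(t) = 0.7 y(t-1) + 0.5 u(t-1), the random binary input, and the optional `forgetting` argument (1.0 means no forgetting) are assumptions for this example.

```python
import random


def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive least squares step for the linear model y = phi^T theta."""
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = forgetting + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [p / denom for p in P_phi]
    error = y - sum(phi[i] * theta[i] for i in range(n))  # prediction error
    theta = [theta[i] + gain[i] * error for i in range(n)]
    # P is symmetric, so phi^T P equals (P phi)^T
    P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(n)]
         for i in range(n)]
    return theta, P


rng = random.Random(1)
c = 1e6                         # large initial covariance, P(0) = c * I
theta = [0.0, 0.0]
P = [[c, 0.0], [0.0, c]]
y_prev, u_prev = 0.0, 0.0
for _ in range(200):
    u = rng.choice([-1.0, 1.0])
    y = 0.7 * y_prev + 0.5 * u_prev  # noise-free ARX system to identify
    theta, P = rls_update(theta, P, [y_prev, u_prev], y)
    y_prev, u_prev = y, u
print(theta)
```

Only the latest regressor and output are used in each step, so the data set never has to be stored.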

24 Exponential forgetting
- An exponential forgetting factor can be used for discarding past information
- Must be high enough to avoid covariance blow-up

25 Exponential Forgetting and Resetting Algorithm

26 Recursive gradient method
- The recursive gradient method is obtained by setting
- A.k.a. incremental or on-line back-propagation

27 Generalization
- Training data vs reality
  - Errors -> overfitting
- Generalization error
  - Cannot be found exactly
  - Test and validation sets
  - Final Prediction Error estimate
- Generalization error contributions
  - The bias error -> insufficient model structure
  - The variance error -> specific data set
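The formula for the Final Prediction Error estimate is missing from the slide text. Akaike's FPE for an unregularized mean square error criterion is usually written as below, where N is the number of training samples, p the number of weights, and V_N the training criterion at the estimated weights (notation assumed here):

```latex
\hat{V}_{\mathrm{FPE}} = \frac{N + p}{N - p}\, V_N(\hat{\theta}, Z^N)
```

The factor grows with p, so adding weights must reduce the training error enough to pay for the added variance.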

28 Bias vs variance dilemma
- A neural network never describes a system completely
- Bias can be reduced by expanding the network architecture
  - This causes the variance of the weights to increase

29 Bias vs variance

30 Regularization
- A possible way to deal with the bias/variance dilemma
- The criterion can be augmented with a regularization (or complexity) term
- Simple weight decay term
  - D is a diagonal matrix
  - Denotes weight decay
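The regularized criterion itself did not survive extraction. One common form, in assumed notation (W_N for the regularized criterion, D for the diagonal matrix mentioned on the slide, α for a scalar weight decay), is:

```latex
W_N(\theta, Z^N) = \frac{1}{2N} \sum_{t=1}^{N} \bigl[ y(t) - \hat{y}(t \mid \theta) \bigr]^2
  + \frac{1}{2N}\, \theta^T D\, \theta,
\qquad
D = \alpha I \ \text{for simple weight decay}
```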

31 Effect of regularization

32 Effect of regularization

33 Effect of weight decay constant

34 Effects of regularization
- Improves generalization
- Local minima are gradually eliminated as the decay constant increases
  - Increases the smoothness of the criterion
  - The minimum is reached faster
- Selection of the weight decay is important
  - The trial-and-error method is too demanding for large networks

35 Implementation issues
- Computing gradients for different model structures
- How to decide when to terminate the training
- How to handle systems with multiple inputs and outputs
  - Multiple inputs can be used with practically no modifications
  - Multiple outputs can be handled with multiple models or better methods

36 Computing gradients
- Except for the full Newton method (which requires second-order derivative information), only the derivative of the prediction with respect to the weights is required
- The derivative can be calculated using the structure of the network

37 Examples of derivatives

38 Back propagation
- A.k.a. the generalized delta rule
- Formally, a way of calculating gradients
- Popular method

39 Back propagation
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat from step 3 on the neurons at the previous level, using each one's "blame" as its error.
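The steps above can be sketched for a very small network and checked against finite differences. The architecture (one input, two tanh hidden units, one linear output), the weight layout, and the sample values are assumptions made for this illustration.

```python
import math


def forward(w, x):
    # w = [w1, b1, w2, b2, v1, v2, c]: two tanh hidden units, one linear output
    h = [math.tanh(w[0] * x + w[1]), math.tanh(w[2] * x + w[3])]
    y_hat = w[4] * h[0] + w[5] * h[1] + w[6]
    return h, y_hat


def loss(w, x, y):
    return 0.5 * (y - forward(w, x)[1]) ** 2


def backprop(w, x, y):
    """Gradient of the loss w.r.t. all weights via back-propagation."""
    h, y_hat = forward(w, x)
    delta_out = y_hat - y  # error at the output neuron
    grad = [0.0] * 7
    grad[4], grad[5], grad[6] = delta_out * h[0], delta_out * h[1], delta_out
    for k in range(2):
        # "Blame" passed back through the output weight, times the tanh derivative
        delta_h = delta_out * w[4 + k] * (1.0 - h[k] ** 2)
        grad[2 * k] = delta_h * x
        grad[2 * k + 1] = delta_h
    return grad


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp, x, y) - loss(wm, x, y)) / (2 * eps))
    return g


w = [0.3, -0.1, -0.7, 0.2, 0.5, -0.4, 0.1]
g_bp = backprop(w, x=0.8, y=1.0)
g_fd = numeric_grad(w, x=0.8, y=1.0)
max_diff = max(abs(a - b) for a, b in zip(g_bp, g_fd))
print("largest difference between the two gradients:", max_diff)
```

A gradient check like this is a cheap way to catch mistakes when deriving derivatives for a new model structure.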

40 Stopping criteria
- Maximum number of iterations
- Upper bound for the gradient
- Upper bound for the weight change
  - Stop if the biggest weight change is below a certain value
- Upper bound for the criterion
  - Rarely known beforehand
- Lower bound for the trust region
  - Maximum value of λ as a stopping criterion
- Early stopping
  - An additional data set is used to find when the generalization error is smallest
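Early stopping could be implemented along these lines. The `patience` parameter (how many iterations without improvement to tolerate) and the example error sequence are assumptions for this sketch; the slides only state that an additional data set is used.

```python
def early_stopping_index(validation_errors, patience=3):
    """Return the iteration with the lowest validation error seen before stopping.

    Training is considered stopped once `patience` consecutive iterations
    have failed to improve on the best validation error so far.
    """
    best_error = float("inf")
    best_index = -1
    for i, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_index = err, i
        elif i - best_index >= patience:
            break
    return best_index


# Hypothetical validation errors recorded after each training iteration
errors = [1.0, 0.8, 0.6, 0.65, 0.7, 0.9, 0.5]
stop_at = early_stopping_index(errors, patience=3)
print("use the weights from iteration", stop_at)
```

In this example the later value 0.5 is never reached, because training stops after three iterations without improvement; that is the trade-off a patience rule makes.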

41 Multiple outputs
- Simplest: a separate model for each output
- Better strategy: model the system as a whole
  - The criterion changes
  - Γ is a covariance matrix
- For Levenberg-Marquardt
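The changed criterion is not preserved in the slide text. A weighted form that matches the description, with ε now a vector of prediction errors (one per output) and Γ the covariance matrix named on the slide (presumably of the noise on the outputs, which is an assumption here), would be:

```latex
V_N(\theta, Z^N) = \frac{1}{2N} \sum_{t=1}^{N}
  \varepsilon^T(t, \theta)\, \Gamma^{-1}\, \varepsilon(t, \theta)
```

With Γ equal to the identity matrix this reduces to the sum of the individual mean square errors.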

42 Multiple outputs
- Recursive Gauss-Newton algorithm

43 Iterated Generalized Least Squares (IGLS)
- The covariance matrix is unknown in practice

44 The End
