Knowledge Discovery and Data Mining

1 Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews twk@st-andrews.ac.uk Tom Kelsey ID NN 04 March / 30

2 Neural Nets
Ought-to-knows:
1. How a general NN can be displayed graphically
2. The NN terminology exemplified by such a diagram
3. How a relatively simple single-hidden-layer, two-input NN can produce a complex non-linear prediction surface
4. The form of given activation functions (both the equation and a sketch)
5. How NN weights and biases are derived

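3 A simple NN as a Mathematical Formula
$$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 z_1 + \hat{\beta}_2 z_2 + \hat{\beta}_3 z_3$$
where
$$z_1 = \tanh(\hat{\alpha}_4 + \hat{\alpha}_5 x_1 + \hat{\alpha}_6 x_2)$$
$$z_2 = \tanh(\hat{\alpha}_7 + \hat{\alpha}_8 x_1 + \hat{\alpha}_9 x_2)$$
$$z_3 = \tanh(\hat{\alpha}_{10} + \hat{\alpha}_{11} x_1 + \hat{\alpha}_{12} x_2)$$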

4 What did all that mean?
- The output is an optimal probability $\hat{p} = \frac{1}{1 + e^{-\theta}}$
- $\theta$ is a linear weighted sum of $z_i$ terms, with optimal weights $\hat{\beta}_i$
- There is an additional optimal weight $\hat{\beta}_0$ that is an intercept or bias term
- The $z_i$ are formed by
  1. weighting inputs $x_i$ with optimal $\hat{\alpha}_k$
  2. adding another $\hat{\alpha}$ bias term
  3. taking the hyperbolic tangent of the sum
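The formula can also be read as a short computation. The following is a minimal sketch only: the function name `simple_nn` and every parameter value are hypothetical, chosen for illustration rather than taken from a fitted model.

```python
import math

def simple_nn(x1, x2, alpha, beta):
    """Return p-hat for the two-input, three-hidden-node network.

    alpha: three (bias, weight_x1, weight_x2) triples, one per hidden node.
    beta:  (bias, w1, w2, w3) for the single output node.
    """
    # Hidden layer: tanh of a weighted sum of the inputs plus a bias.
    z = [math.tanh(a0 + a1 * x1 + a2 * x2) for (a0, a1, a2) in alpha]
    # Output layer: theta is the log-odds, a weighted sum of the z terms.
    theta = beta[0] + sum(b * zi for b, zi in zip(beta[1:], z))
    # The logistic function maps the log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-theta))

# Hypothetical parameter values, for illustration only.
alpha = [(0.1, 0.5, -0.3), (-0.2, 0.8, 0.4), (0.0, -0.6, 0.9)]
beta = (0.2, 1.0, -1.5, 0.7)
p_hat = simple_nn(0.5, -1.0, alpha, beta)
```

Whatever values are chosen, `p_hat` lies strictly between 0 and 1, and taking `log(p_hat / (1 - p_hat))` recovers the weighted sum of the hidden-node outputs.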

5 Conversion to a diagrammatic form
For ease of understanding for non-mathematicians:
1. We have two input sources, $x_1$ and $x_2$
2. We have an input bias source
3. There is an internal layer of three $z$ nodes, each taking in weighted inputs and outputting tanh of the summed inputs
4. There is an internal bias source for $\hat{\beta}_0$
5. There is an output layer with one node, producing the logistic function of the weighted sum of the internal layer outputs
6. The output is a single number: a probability between 0 and 1

6-15 Examples
[Ten slides of figures, not reproduced here. Source: Google Images]

16 NN components
- Weights and biases: from a statistical perspective the weights are simply parameters of a potentially non-linear function, and the biases are the intercept terms for the linear components.
- Combination functions: in our example equations above these are the linear combinations (expressible in matrix form) that combine the input variables or the hidden-node outputs.

17 NN components
Activation functions: these are the functions wrapping the combination functions. Several variants are commonly used:
- Identity function: does not alter the value of the argument. The resulting range may be all of $\mathbb{R}$.
- Sigmoid functions: S-shaped functions, the logistic and hyperbolic tangent functions being common. The resulting values are bounded, in $(0, 1)$ or $(-1, 1)$ respectively. The logistic is given by $\varphi(\theta) = \frac{1}{1 + e^{-\theta}}$ for some argument value $\theta$.
- tanh: the hyperbolic tangent gives real values within $(-1, 1)$.
- Others: Gaussian functions (bell-shaped); functions bounded below by zero but unbounded above, e.g. exponential and reciprocal functions.
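A minimal sketch of these activation functions, using only the Python standard library. The function names are illustrative, and the particular Gaussian form `exp(-theta**2)` is an assumption: the slides only say "bell-shaped".

```python
import math

def identity(theta):
    """Identity: returns the argument unchanged (range is all reals)."""
    return theta

def logistic(theta):
    """Logistic sigmoid: values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))

def tanh(theta):
    """Hyperbolic tangent sigmoid: values in (-1, 1)."""
    return math.tanh(theta)

def gaussian(theta):
    """Bell-shaped activation; exp(-theta^2) is an assumed form."""
    return math.exp(-theta ** 2)

# The two sigmoids are the same S-shape up to shifting and rescaling:
# tanh(theta) = 2 * logistic(2 * theta) - 1
```

The final comment explains why the choice between logistic and tanh in a hidden layer is largely one of convenience: either can be absorbed into the other by adjusting the weights and biases.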

18 NN components
Network layers:
- As the hidden layers are contrivances under the control of the analyst, the number of layers and of units within them can be large.
- The layering is partly for convenience, where all the nodes/units in a layer share similar characteristics such as their activation and combination functions.
- All the nodes in a layer are connected to all the nodes in the next.
- We can have feed-forward NNs in which layers are skipped for some combinations of inputs; this allows us to use activation functions that suit a subset of the covariates.

19 Softmax Output Function
- Choice of output function depends on the type of model and the response ranges
- Softmax is used in the common event that we wish to return probabilities that sum to one: a form of classification
- $t_k$ is the net value in a final-layer node, $K$ is the number of such nodes

$$y_k = f_k(T) = \frac{e^{t_k}}{\sum_{l=1}^{K} e^{t_l}}$$
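A minimal sketch of the softmax calculation. Subtracting the largest net value before exponentiating is an added implementation detail (not from the slides) that avoids overflow and leaves the result unchanged, because the common factor cancels in the ratio.

```python
import math

def softmax(t):
    """Map K net values t_1..t_K to K probabilities that sum to one."""
    m = max(t)                              # shift for numerical stability
    exps = [math.exp(tk - m) for tk in t]   # e^{t_k - m}
    total = sum(exps)                       # sum over l of e^{t_l - m}
    return [e / total for e in exps]

# Made-up net values for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
```

The largest net value always receives the largest probability, so taking the class with the highest `probs` entry gives the classification.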

20 Main components
- Layers: input, hidden, output.
- Connections and weights.
- Combination functions: linear.
- Activation functions: identity, tanh, exp, logistic.
- Output functions: back to the response scale; identity, (multiple) logistic.

21 Overview of our coverage
- NNs are an art.
- Jargon is very inconsistent.
- There is a huge number of decisions that can be made in their construction, and the results are sensitive to these.
- We'll look at the general ideas and very few specific implementations.

22 Fitting a Neural Net
- Start with arbitrary weights and biases.
- Define an error function.
- Search for update values that reduce the error.
- Iterate until convergence (hopefully).
This is numerical optimisation:
- A non-linear problem with large numbers of parameters.
- You will not find a general analytic solution for the weights.
- All methods implemented are iterative numerical approaches: trial-and-error searches.
- What we want to do is conceptually simple, once we define "best".
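The four fitting steps can be sketched on a deliberately tiny model. Everything here is an assumption made for illustration: the one-input logistic model, the made-up data, the finite-difference gradient, and the step size and tolerance. Real packages use analytic gradients (back-propagation) and more refined update rules.

```python
import math
import random

def predict(params, x):
    """Logistic output of a one-input net with no hidden layer (toy model)."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def rss(params, data):
    """Error function: residual sum of squares over (x, y) pairs."""
    return sum((y - predict(params, x)) ** 2 for x, y in data)

def numeric_gradient(f, params, h=1e-6):
    """Central finite-difference estimate of the gradient of f at params."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

def fit(data, step=0.2, tol=1e-10, max_iter=5000, seed=0):
    rng = random.Random(seed)
    params = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # arbitrary start
    history = [rss(params, data)]                      # error at each step
    for _ in range(max_iter):
        grads = numeric_gradient(lambda p: rss(p, data), params)
        params = [p - step * g for p, g in zip(params, grads)]  # update
        history.append(rss(params, data))
        if abs(history[-2] - history[-1]) < tol:       # converged (hopefully)
            break
    return params, history

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # made-up toy data
params, history = fit(data)
```

`history` records the error after each update, so plotting it shows whether the search is still improving or has stalled.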

23 Objective functions
- RSS: the residual sum of squares between the target and actual output values on each output unit; the standard for regression problems (can also be used for classification problems).
- City-block: the sum of the absolute differences between the target and actual output values on each output unit. Less sensitive to outlying points than RSS (why?), so may perform better on regression problems if there are a few outliers.
- Cross-entropy (single & multiple): the negative sum of the products of the target value and the logarithm of the output value on each output unit. Two versions: one for single-output (two-class) networks, the other for multiple-output networks. Used in combination with the logistic (single output) or softmax (multiple output) activation functions in the output layer of the network. Equivalent to maximum likelihood estimation of the network weights.
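A minimal sketch of these objective functions for a single observation's output units. The function names are illustrative, and the cross-entropy versions assume the usual negative log-likelihood forms for 0/1 targets.

```python
import math

def rss(targets, outputs):
    """Residual sum of squares across output units."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs))

def city_block(targets, outputs):
    """Sum of absolute differences across output units."""
    return sum(abs(t - y) for t, y in zip(targets, outputs))

def cross_entropy_single(t, y):
    """Two-class version: t is 0 or 1, y is a logistic output in (0, 1)."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def cross_entropy_multiple(targets, outputs):
    """Multi-class version: targets are 0/1 indicators, outputs are softmax."""
    return -sum(t * math.log(y) for t, y in zip(targets, outputs) if t > 0)

# Made-up residuals of 1, 1 and 10: the last point is an outlier.
targets, outputs = [0, 0, 0], [1, 1, 10]
outlier_share_rss = 10 ** 2 / rss(targets, outputs)        # 100 / 102
outlier_share_city = 10 / city_block(targets, outputs)     # 10 / 12
```

In this made-up example the outlier contributes 100 of the 102 units of RSS but only 10 of the 12 units of city-block error, because squaring magnifies large residuals.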

24 NNs as basis functions
- NNs can be viewed as a potentially complex combination of basis functions in X...
- Fitting can be thought of as a gradient search method, e.g. some variant on the Newton method that seeks to minimise the error...
- Problems: oscillation, local minima, slow convergence, starting values,...
- Learning rate parameter: controls step sizes in the gradient search
- Momentum parameter: allows over-shooting, to mitigate the effect of local minima
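One common way the learning rate and momentum parameters enter a gradient search is sketched below. This is the classical momentum update, shown as an assumption about what a typical implementation does; the function name, default values and toy error surface are all illustrative.

```python
def momentum_step(params, grads, velocity, learning_rate=0.1, momentum=0.9):
    """One gradient-search update with momentum.

    The velocity carries over a fraction of the previous step, so the
    search can over-shoot and roll through shallow dips in the error.
    """
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

def grad(params):
    """Gradient of the toy error surface sum(p^2), minimised at zero."""
    return [2.0 * p for p in params]

params, velocity = [5.0], [0.0]
for _ in range(300):
    params, velocity = momentum_step(params, grad(params), velocity)
```

On this toy surface the parameter oscillates around the minimum with shrinking amplitude, which illustrates both the over-shooting and the oscillation problem listed above.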

25 Overfitting
- A NN can be a very rich class of functions, even with just a single hidden layer with a few hidden units
- So we are likely to have a model with sufficient inherent complexity to model complex systems
- This presents a problem too: the model can easily overfit, i.e. learn the training dataset very well, giving a model with poor generality
- This is the standard problem that we have encountered throughout our consideration of automated model selection
- Two approaches are considered here

26 Validation
- Maintain an independent dataset which is not used to develop the model, but is used to measure the model's performance/generality
- Seek a model that predicts data we have not yet seen; the use of validation or cross-validation data simulates this scenario
- The simplest method is to use a single validation dataset, and stop fitting when the performance of the model against the validation dataset begins to deteriorate
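The stopping rule can be sketched as a small wrapper around any training loop. The names `train_step` and `validation_error` are hypothetical callbacks, and the `patience` parameter is an added assumption: `patience=1` stops as soon as validation error fails to improve, while larger values tolerate a few noisy epochs.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=1000, patience=3):
    """Run train_step repeatedly; keep the state with the best validation error.

    train_step():            performs one epoch and returns the model state.
    validation_error(state): error of that state on the held-out dataset.
    """
    best_error, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        state = train_step()
        err = validation_error(state)
        if err < best_error:
            best_error, best_state, bad_epochs = err, state, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation performance deteriorating
                break
    return best_state, best_error

# Toy stand-ins: the "state" is just the epoch number, and the made-up
# validation error falls until epoch 10 and rises afterwards.
epochs = iter(range(1, 100))
best_state, best_error = train_with_early_stopping(
    train_step=lambda: next(epochs),
    validation_error=lambda epoch: (epoch - 10) ** 2 + 1,
)
```

Returning the best state rather than the last one means the few extra epochs run while waiting out the patience window do not degrade the final model.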

27 Weight decay
- Similar to the approach in tree methods, we can balance our raw model fit against a measure of model complexity
- Using $R_\theta$ as our measure of resubstitution error with a given set of parameters $\theta$, we minimise $R_\theta + \lambda J(\theta)$
- $R$ and $J$ are effectively in competition and, as we are using a gradient search, you can think of $\lambda J$ as preventing us from reaching our global minimum for $R$
- We must estimate $\lambda$, and the usual approach would be via validation or cross-validation performance
- This reveals that we have in effect just considered a more explicit phrasing of the validation approach above
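A one-parameter toy example makes the competition between $R$ and $J$ concrete. Both functions here are assumptions chosen so the answer can be checked by hand: $R(w) = (w - 3)^2$ stands in for the resubstitution error and $J(w) = w^2$ is the sum-of-squared-weights penalty often used for weight decay.

```python
def resub_error(w):
    """Toy resubstitution error R, minimised at w = 3."""
    return (w - 3.0) ** 2

def penalty(w):
    """Toy complexity measure J: the squared weight."""
    return w ** 2

def objective(w, lam):
    """Penalised objective R + lambda * J."""
    return resub_error(w) + lam * penalty(w)

def penalised_minimiser(lam):
    """Setting the derivative 2(w - 3) + 2*lam*w to zero gives w = 3 / (1 + lam)."""
    return 3.0 / (1.0 + lam)

lam = 0.5
w_star = penalised_minimiser(lam)   # shrunk from 3 towards 0
```

With `lam = 0.5` the penalised minimum sits at `w = 2`, not at the unpenalised minimum `w = 3`; increasing `lam` shrinks the weight further towards zero.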

28 NN problems overview
- Lack of interpretability: these models are effectively black boxes
- Over-fitting: NNs are clearly prone to overfitting if proper controls are not put in place
- Specification decisions: there is a bewildering array of activation functions, combination functions, output functions, training methods, parameters (e.g. number of hidden units and layers), standardisations etc.
- Local minima: as for standard non-linear regression, we may require multiple fits to ensure we have not been trapped in a sub-optimal solution by local minima in the error function
- Long run-times: these models can take a very long time to fit (as hinted at in SAS EM by the default option of "Maximum run-time=4 hours")

29 Deep Learning
- A family of complicated data-mining models that are currently very competitive
- For us, multi-layer feed-forward neural nets
- More layers means more work to train and more chance of overfitting
- Can pre-train each layer, then fine-tune with back-propagation
  - uses unsupervised restricted Boltzmann machines
  - helps reduce the computational complexity of learning
  - as do modern HPC architectures & technologies
- Complex validation techniques exist for reducing overfitting
- Same advantages & disadvantages as for textbook perceptron NNs

30 Further information
This is a very brief overview of NNs (although the multitude of minor details makes detailed views difficult). For further information:
- Basheer & Hajmeer (2000) paper: quite a nice high-level overview (note that terminology is very loose within the NN literature)
- More on NNs in L14: data issues and more detail on fitting (learning) algorithms


Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN