Neural Network Application Design. Supervised Function Approximation


Supervised Function Approximation

There is a tradeoff between a network's ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate).

This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f: R → R for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine.

[Figure: the three data points and the fitted polynomials of degree 1, 2, and 9, plotted with x on the horizontal axis and f(x) on the vertical axis.]

Obviously, the polynomial of degree 2 provides the most plausible fit.

The same principle applies to ANNs: If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly.

Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

Reducing Overfitting with Dropout

During each training step, we turn off a randomly chosen subset of 50% of the hidden-layer neurons, i.e., we set their output to zero. During testing, we once again use all neurons but reduce their outputs by 50% to compensate for the increased number of inputs to each unit.

By doing this, we prevent each neuron from relying on the output of any particular other neuron in the network. It can be argued that in this way we train an astronomical number of decoupled sub-networks, whose expertise is combined when using all neurons again. Due to the changing composition of sub-networks, it is much more difficult to overfit any of them.

Now let us talk about Neural Network Application Design.
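The following is a minimal NumPy sketch of the dropout scheme described above (zero out a random 50% of the hidden activations during training, scale all activations by 50% at test time). The function and variable names are my own; only the 50% rate and the train/test behavior come from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def hidden_output_with_dropout(h, train, drop_prob=0.5):
        """Apply dropout to a vector of hidden-layer activations h."""
        if train:
            # Training: switch off a randomly chosen drop_prob fraction of the neurons.
            keep_mask = rng.random(h.shape) >= drop_prob
            return h * keep_mask
        # Testing: use all neurons, but reduce their outputs by drop_prob to
        # compensate for the larger number of active inputs to each unit.
        return h * (1.0 - drop_prob)

    h = np.array([0.8, 0.1, 0.6, 0.9])
    print(hidden_output_with_dropout(h, train=True))   # roughly half the activations are zeroed
    print(hidden_output_with_dropout(h, train=False))  # all activations, scaled by 0.5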

NN Application Design

Now that we have gained some insight into the theory of artificial neural networks, how can we design networks for particular applications?

Designing NNs is basically an engineering task. As we discussed before, for example, there is no formula that would allow you to determine the optimal number of hidden units in a BPN for a given task.

We need to address the following issues for a successful application design:
- Choosing an appropriate data representation
- Performing an exemplar analysis
- Training the network and evaluating its performance

We are now going to look into each of these topics.

Data Representation

Most networks process information in the form of input pattern vectors. These networks produce output pattern vectors that are interpreted by the embedding application.

All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals. In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value.

The main question is: How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network?

We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern. Relevant features are those that enable the network to generate the desired output pattern.

Similarly, we also need to define a set of desired outputs that the network can actually produce. Often, a natural representation of the output data turns out to be impossible for the network to produce.

We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations.

Internal Representation Issues

As we said before, in all network types, the amplitude of input signals and internal signals is limited:
- analog networks: values usually between 0 and 1
- binary networks: only values 0 and 1 allowed
- bipolar networks: only values -1 and 1 allowed

Without this limitation, patterns with large amplitudes would dominate the network's behavior. A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small.
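As a small illustration of keeping input amplitudes within the limits just mentioned, here is a hedged sketch (not from the lecture; names and data are invented) that linearly rescales each input feature to the range [0, 1] before the pattern vectors are fed to the network:

    import numpy as np

    def scale_to_unit_interval(X):
        """Rescale each column (feature) of X linearly to the range [0, 1]."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
        return (X - lo) / span

    # Raw exemplars whose features have very different amplitudes:
    X_raw = np.array([[18.0, 995.0],
                      [25.0, 1012.0],
                      [31.0, 1003.0]])
    print(scale_to_unit_interval(X_raw))  # every value now lies between 0 and 1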

The patterns that can be represented by an ANN most easily are binary patterns. Even analog networks like to receive and produce binary patterns; we can simply round values < 0.5 to 0 and values ≥ 0.5 to 1.

To create a binary input vector, we can simply list all features that are relevant to the current task. Each component of our binary vector indicates whether one particular feature is present (1) or absent (0).

With regard to output patterns, most binary-data applications perform classification of their inputs. The output of such a network indicates to which class of patterns the current input belongs. Usually, each output neuron is associated with one class of patterns. For any input, only one output neuron should be active (1) and the others inactive (0), indicating the class of the current input.

In other cases, classes are not mutually exclusive, and more than one output neuron can be active at the same time.

Another variant would be the use of binary input patterns and analog output patterns for classification. In that case, again, each output neuron corresponds to one particular class, and its activation indicates the probability (between 0 and 1) that the current input belongs to that class.

For non-binary (e.g., ternary) features, use multiple binary inputs to represent the possible states (e.g., 001 for red, 010 for green, 100 for blue to represent three possible colors): Treat each feature in the pattern as an individual subpattern. Represent each subpattern with as many positions (units) in the pattern vector as there are possible states for the feature. Then concatenate all subpatterns into one long pattern vector (see the sketch below).

Another way of representing n-ary data in a neural network is to use one neuron per feature, but scale the (analog) value to indicate the degree to which the feature is present.
- Good examples: the brightness of a pixel in an input image; the output of an edge filter
- Poor examples: the letter (1-26) of a word; the type (1-6) of a chess piece

This can be explained as follows: The way NNs work (both biological and artificial ones) is that each neuron represents the presence/absence of a particular feature. Activations 0 and 1 indicate absence or presence of that feature, respectively, and in analog networks, intermediate values indicate the extent to which a feature is present. Consequently, a small change in one input value leads to only a small change in the network's activation pattern.
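Here is a hedged sketch of the subpattern-and-concatenation scheme described above; the feature names and state lists are invented examples, and only the one-unit-per-state idea comes from the slides.

    # Encode each non-binary feature as a binary subpattern with one position per
    # possible state, then concatenate the subpatterns into one long pattern vector.
    FEATURE_STATES = {
        "color": ["red", "green", "blue"],            # 3 states -> 3 positions
        "size":  ["small", "medium", "large", "xl"],  # 4 states -> 4 positions
    }

    def encode(sample):
        """Turn a dict of feature values into one long binary pattern vector."""
        vector = []
        for feature, states in FEATURE_STATES.items():
            subpattern = [0] * len(states)
            subpattern[states.index(sample[feature])] = 1  # exactly one active unit per feature
            vector.extend(subpattern)                      # concatenate the subpatterns
        return vector

    print(encode({"color": "green", "size": "small"}))  # [0, 1, 0, 1, 0, 0, 0]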

Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., it represents the degree to which a feature is present. This is the case for the brightness of a pixel or the output of an edge detector; it is not the case for letters or chess pieces.

For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, ..., z = 1) implies that a and b are in some way more similar to each other than a and z are. Obviously, in most contexts, this is not a reasonable assumption.

It is also important to notice that, in artificial (not natural!) completely connected networks, the order of features that you specify for your input vectors does not influence the outcome. For the network's performance, it is not necessary to represent, for example, similar features in neighboring input units. All units are treated equally; the adjacency of two neurons does not suggest to the network that they represent similar features. Of course, once you have specified a particular order, you cannot change it anymore during training or testing.

Exemplar Analysis

When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data):
- The entire problem space must be covered.
- There must be no inconsistencies (contradictions) in the data.
We must be able to correct such problems without compromising the effectiveness of the network.

Ensuring Coverage

For many applications, we do not just want our network to classify any kind of possible input. Instead, we want our network to recognize whether an input belongs to any of the given classes or whether it is garbage that cannot be classified. To achieve this, we train our network with both classifiable and garbage data (null patterns). For the null patterns, the network is supposed to produce a zero output, or a designated null neuron is activated.

In many cases, we use a 1:1 ratio for this training, that is, we use as many null patterns as there are actual data samples. We have to make sure that all of these exemplars taken together cover the entire input space. If it is certain that the network will never be presented with garbage data, then we do not need to use null patterns for training.
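To make the 1:1 ratio concrete, here is a hedged sketch (function and variable names are my own) that pads a training set with as many null patterns as there are real exemplars, each mapped to an all-zero target. Random vectors stand in for the garbage data; in practice these would be actual unclassifiable samples.

    import numpy as np

    rng = np.random.default_rng(1)

    def add_null_patterns(inputs, targets):
        """Append one 'garbage' pattern per real exemplar; its target is the zero vector."""
        n = len(inputs)
        garbage = rng.random((n, inputs.shape[1]))       # stand-in for unclassifiable inputs
        null_targets = np.zeros((n, targets.shape[1]))   # network should answer 'none of the classes'
        return np.vstack([inputs, garbage]), np.vstack([targets, null_targets])

    X = np.array([[1, 0, 1, 1], [0, 1, 1, 0]], dtype=float)  # two real exemplars
    T = np.array([[1, 0], [0, 1]], dtype=float)              # one-hot class targets
    X_all, T_all = add_null_patterns(X, T)
    print(X_all.shape, T_all.shape)  # (4, 4) (4, 2): as many null patterns as real ones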

Sometimes there may be conflicting exemplars in our training set. A conflict occurs when two or more identical input patterns are associated with different outputs. Why is this problematic?

Assume a BPN with a training set that includes the exemplars (a, b) and (a, c). Whenever the exemplar (a, b) is chosen, the network adjusts its weights so that its output for a moves closer to b. Whenever (a, c) is chosen, the network changes its weights for an output closer to c, thereby unlearning the adaptation for (a, b). In the end, the network will associate input a with an output that lies between b and c but is neither exactly b nor c, so the network error caused by these exemplars will not decrease. For many applications, this is undesirable.

To identify such conflicts, we can apply a (binary) search algorithm to our set of exemplars.

How can we resolve an identified conflict? Of course, the easiest way is to eliminate the conflicting exemplars from the training set. However, this reduces the amount of training data that is given to the network. Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements. In general, however, other methods of conflict resolution are preferable.

Another method combines the conflicting patterns. For example, if we have the exemplars (0011, 0101) and (0011, 0010), we can replace them with the single exemplar (0011, 0111). The way we compute the output vector of the new exemplar from the two original output vectors depends on the current task. It should be the value that is most similar (in terms of the external interpretation) to the original two values.

Alternatively, we can alter the representation scheme. Let us assume that the conflicting measurements were taken at different times or places. In that case, we can just expand all the input vectors so that the additional values specify the time or place of measurement. For example, the exemplars (0011, 0101) and (0011, 0010) could be replaced by (100011, 0101) and (010011, 0010).

One advantage of altering the representation scheme is that this method cannot create any new conflicts: expanding the input vectors cannot make two or more of them identical if they were not identical before.
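The following is a hedged sketch (not the lecture's algorithm; names are invented) of one way to find conflicting exemplars and combine their outputs. It groups exemplars by input pattern and, matching the (0011, 0111) example above, merges conflicting binary outputs by OR-ing their bits; in practice the merge rule depends on the task.

    from collections import defaultdict

    def resolve_conflicts(exemplars):
        """Group (input, output) pairs by input; merge conflicting outputs by bitwise OR."""
        by_input = defaultdict(list)
        for x, y in exemplars:
            by_input[x].append(y)
        resolved = []
        for x, ys in by_input.items():
            if len(set(ys)) > 1:
                # Conflict: identical input pattern with different outputs.
                merged = "".join("1" if any(y[i] == "1" for y in ys) else "0"
                                 for i in range(len(ys[0])))
                resolved.append((x, merged))
            else:
                resolved.append((x, ys[0]))
        return resolved

    print(resolve_conflicts([("0011", "0101"), ("0011", "0010"), ("1100", "1000")]))
    # [('0011', '0111'), ('1100', '1000')]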

Example I: Predicting the Weather

Let us study an interesting neural network application. Its purpose is to predict the local weather based on a set of current weather data:
- temperature (degrees Celsius)
- atmospheric pressure (inches of mercury)
- relative humidity (percentage of saturation)
- wind speed (kilometers per hour)
- wind direction (N, NE, E, SE, S, SW, W, or NW)
- cloud cover (0 = clear, ..., 9 = total overcast)
- weather condition (rain, hail, thunderstorm, ...)

We assume that we have access to the same data from several surrounding weather stations. There are eight such stations that surround our position in the following way:

[Diagram: the eight surrounding weather stations, spaced 100 km apart, arranged around our position.]
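To connect this example to the data-representation guidelines above, here is a hedged sketch (my own illustration, not the lecture's design; the scaling ranges are assumptions) of how one station's readings could be encoded as an input pattern vector: scaled analog values for the graded quantities and a one-unit-per-state subpattern for the wind direction.

    import numpy as np

    WIND_DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

    # Assumed value ranges used to scale each analog reading to [0, 1].
    RANGES = {
        "temperature": (-30.0, 45.0),   # degrees Celsius
        "pressure":    (27.0, 32.0),    # inches of mercury
        "humidity":    (0.0, 100.0),    # percentage of saturation
        "wind_speed":  (0.0, 150.0),    # kilometers per hour
        "cloud_cover": (0.0, 9.0),      # 0 = clear ... 9 = total overcast
    }

    def encode_station(reading):
        """Encode one weather station's readings as an input pattern vector."""
        analog = [(reading[k] - lo) / (hi - lo) for k, (lo, hi) in RANGES.items()]
        direction = [1.0 if d == reading["wind_direction"] else 0.0 for d in WIND_DIRECTIONS]
        return np.array(analog + direction)

    sample = {"temperature": 21.0, "pressure": 29.9, "humidity": 65.0,
              "wind_speed": 12.0, "cloud_cover": 3.0, "wind_direction": "SW"}
    print(encode_station(sample))  # 5 scaled analog values followed by an 8-unit wind subpattern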