Neural Network Application Design. Supervised Function Approximation


Supervised Function Approximation

There is a tradeoff between a network's ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate).

This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f: R → R for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine.

[Figure: the three data points and the fitted polynomials of degree 1, 2, and 9, plotted with x on the horizontal axis and f(x) on the vertical axis.]

Obviously, the polynomial of degree 2 provides the most plausible fit.

The same principle applies to ANNs: If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly.

Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

Reducing Overfitting with Dropout

During each training step, we turn off a randomly chosen subset of 50% of the hidden-layer neurons, i.e., we set their output to zero. During testing, we once again use all neurons but reduce their outputs by 50% to compensate for the increased number of inputs to each unit.

By doing this, we prevent each neuron from relying on the output of any particular other neuron in the network. It can be argued that in this way we train an astronomical number of decoupled sub-networks, whose expertise is combined when using all neurons again. Due to the changing composition of sub-networks, it is much more difficult to overfit any of them.

Now let us talk about Neural Network Application Design.
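The following is a minimal NumPy sketch of the dropout scheme described above (zero out a random 50% of the hidden activations during training, scale all activations by 50% at test time). The function and variable names are my own; only the 50% rate and the train/test behavior come from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def hidden_output_with_dropout(h, train, drop_prob=0.5):
        """Apply dropout to a vector of hidden-layer activations h."""
        if train:
            # Training: switch off a randomly chosen drop_prob fraction of the neurons.
            keep_mask = rng.random(h.shape) >= drop_prob
            return h * keep_mask
        # Testing: use all neurons, but reduce their outputs by drop_prob to
        # compensate for the larger number of active inputs to each unit.
        return h * (1.0 - drop_prob)

    h = np.array([0.8, 0.1, 0.6, 0.9])
    print(hidden_output_with_dropout(h, train=True))   # roughly half the activations are zeroed
    print(hidden_output_with_dropout(h, train=False))  # all activations, scaled by 0.5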

NN Application Design

Now that we have gained some insight into the theory of artificial neural networks, how can we design networks for particular applications?

Designing NNs is basically an engineering task. As we discussed before, for example, there is no formula that would allow you to determine the optimal number of hidden units in a BPN for a given task.

We need to address the following issues for a successful application design:
- Choosing an appropriate data representation
- Performing an exemplar analysis
- Training the network and evaluating its performance

We are now going to look into each of these topics.

Data Representation

Most networks process information in the form of input pattern vectors. These networks produce output pattern vectors that are interpreted by the embedding application.

All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals. In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value.

The main question is: How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network?

We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern. Relevant features are those that enable the network to generate the desired output pattern.

Similarly, we also need to define a set of desired outputs that the network can actually produce. Often, a natural representation of the output data turns out to be impossible for the network to produce.

We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations.

Internal Representation Issues

As we said before, in all network types, the amplitude of input signals and internal signals is limited:
- analog networks: values usually between 0 and 1
- binary networks: only values 0 and 1 allowed
- bipolar networks: only values -1 and 1 allowed

Without this limitation, patterns with large amplitudes would dominate the network's behavior. A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small.
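As a small illustration of keeping input amplitudes within the limits just mentioned, here is a hedged sketch (not from the lecture; names and data are invented) that linearly rescales each input feature to the range [0, 1] before the pattern vectors are fed to the network:

    import numpy as np

    def scale_to_unit_interval(X):
        """Rescale each column (feature) of X linearly to the range [0, 1]."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
        return (X - lo) / span

    # Raw exemplars whose features have very different amplitudes:
    X_raw = np.array([[18.0, 995.0],
                      [25.0, 1012.0],
                      [31.0, 1003.0]])
    print(scale_to_unit_interval(X_raw))  # every value now lies between 0 and 1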

The patterns that can be represented by an ANN most easily are binary patterns. Even analog networks like to receive and produce binary patterns; we can simply round values < 0.5 to 0 and values ≥ 0.5 to 1.

To create a binary input vector, we can simply list all features that are relevant to the current task. Each component of our binary vector indicates whether one particular feature is present (1) or absent (0).

With regard to output patterns, most binary-data applications perform classification of their inputs. The output of such a network indicates to which class of patterns the current input belongs. Usually, each output neuron is associated with one class of patterns. For any input, only one output neuron should be active (1) and the others inactive (0), indicating the class of the current input.

In other cases, classes are not mutually exclusive, and more than one output neuron can be active at the same time.

Another variant would be the use of binary input patterns and analog output patterns for classification. In that case, again, each output neuron corresponds to one particular class, and its activation indicates the probability (between 0 and 1) that the current input belongs to that class.

For non-binary (e.g., ternary) features, use multiple binary inputs to represent the possible states (e.g., 001 for red, 010 for green, 100 for blue to represent three possible colors): Treat each feature in the pattern as an individual subpattern. Represent each subpattern with as many positions (units) in the pattern vector as there are possible states for the feature. Then concatenate all subpatterns into one long pattern vector (see the sketch below).

Another way of representing n-ary data in a neural network is to use one neuron per feature, but scale the (analog) value to indicate the degree to which the feature is present.
- Good examples: the brightness of a pixel in an input image; the output of an edge filter
- Poor examples: the letter (1-26) of a word; the type (1-6) of a chess piece

This can be explained as follows: The way NNs work (both biological and artificial ones) is that each neuron represents the presence/absence of a particular feature. Activations 0 and 1 indicate absence or presence of that feature, respectively, and in analog networks, intermediate values indicate the extent to which a feature is present. Consequently, a small change in one input value leads to only a small change in the network's activation pattern.
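Here is a hedged sketch of the subpattern-and-concatenation scheme described above; the feature names and state lists are invented examples, and only the one-unit-per-state idea comes from the slides.

    # Encode each non-binary feature as a binary subpattern with one position per
    # possible state, then concatenate the subpatterns into one long pattern vector.
    FEATURE_STATES = {
        "color": ["red", "green", "blue"],            # 3 states -> 3 positions
        "size":  ["small", "medium", "large", "xl"],  # 4 states -> 4 positions
    }

    def encode(sample):
        """Turn a dict of feature values into one long binary pattern vector."""
        vector = []
        for feature, states in FEATURE_STATES.items():
            subpattern = [0] * len(states)
            subpattern[states.index(sample[feature])] = 1  # exactly one active unit per feature
            vector.extend(subpattern)                      # concatenate the subpatterns
        return vector

    print(encode({"color": "green", "size": "small"}))  # [0, 1, 0, 1, 0, 0, 0]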

Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., it represents the degree to which a feature is present. This is the case for the brightness of a pixel or the output of an edge detector; it is not the case for letters or chess pieces.

For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, ..., z = 1) implies that a and b are in some way more similar to each other than a and z are. Obviously, in most contexts, this is not a reasonable assumption.

It is also important to notice that, in artificial (not natural!) completely connected networks, the order of features that you specify for your input vectors does not influence the outcome. For the network's performance, it is not necessary to represent, for example, similar features in neighboring input units. All units are treated equally; the adjacency of two neurons does not suggest to the network that they represent similar features. Of course, once you have specified a particular order, you cannot change it anymore during training or testing.

Exemplar Analysis

When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data):
- The entire problem space must be covered.
- There must be no inconsistencies (contradictions) in the data.
We must be able to correct such problems without compromising the effectiveness of the network.

Ensuring Coverage

For many applications, we do not just want our network to classify any kind of possible input. Instead, we want our network to recognize whether an input belongs to any of the given classes or whether it is garbage that cannot be classified. To achieve this, we train our network with both classifiable and garbage data (null patterns). For the null patterns, the network is supposed to produce a zero output, or a designated null neuron is activated.

In many cases, we use a 1:1 ratio for this training, that is, we use as many null patterns as there are actual data samples. We have to make sure that all of these exemplars taken together cover the entire input space. If it is certain that the network will never be presented with garbage data, then we do not need to use null patterns for training.
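To make the 1:1 ratio concrete, here is a hedged sketch (function and variable names are my own) that pads a training set with as many null patterns as there are real exemplars, each mapped to an all-zero target. Random vectors stand in for the garbage data; in practice these would be actual unclassifiable samples.

    import numpy as np

    rng = np.random.default_rng(1)

    def add_null_patterns(inputs, targets):
        """Append one 'garbage' pattern per real exemplar; its target is the zero vector."""
        n = len(inputs)
        garbage = rng.random((n, inputs.shape[1]))       # stand-in for unclassifiable inputs
        null_targets = np.zeros((n, targets.shape[1]))   # network should answer 'none of the classes'
        return np.vstack([inputs, garbage]), np.vstack([targets, null_targets])

    X = np.array([[1, 0, 1, 1], [0, 1, 1, 0]], dtype=float)  # two real exemplars
    T = np.array([[1, 0], [0, 1]], dtype=float)              # one-hot class targets
    X_all, T_all = add_null_patterns(X, T)
    print(X_all.shape, T_all.shape)  # (4, 4) (4, 2): as many null patterns as real ones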

Sometimes there may be conflicting exemplars in our training set. A conflict occurs when two or more identical input patterns are associated with different outputs. Why is this problematic?

Assume a BPN with a training set that includes the exemplars (a, b) and (a, c). Whenever the exemplar (a, b) is chosen, the network adjusts its weights so that its output for a moves closer to b. Whenever (a, c) is chosen, the network changes its weights for an output closer to c, thereby unlearning the adaptation for (a, b). In the end, the network will associate input a with an output that lies between b and c but is neither exactly b nor c, so the network error caused by these exemplars will not decrease. For many applications, this is undesirable.

To identify such conflicts, we can apply a (binary) search algorithm to our set of exemplars.

How can we resolve an identified conflict? Of course, the easiest way is to eliminate the conflicting exemplars from the training set. However, this reduces the amount of training data that is given to the network. Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements. In general, however, other methods of conflict resolution are preferable.

Another method combines the conflicting patterns. For example, if we have the exemplars (0011, 0101) and (0011, 0010), we can replace them with the single exemplar (0011, 0111). The way we compute the output vector of the new exemplar from the two original output vectors depends on the current task. It should be the value that is most similar (in terms of the external interpretation) to the original two values.

Alternatively, we can alter the representation scheme. Let us assume that the conflicting measurements were taken at different times or places. In that case, we can just expand all the input vectors so that the additional values specify the time or place of measurement. For example, the exemplars (0011, 0101) and (0011, 0010) could be replaced by (100011, 0101) and (010011, 0010).

One advantage of altering the representation scheme is that this method cannot create any new conflicts: expanding the input vectors cannot make two or more of them identical if they were not identical before.
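The following is a hedged sketch (not the lecture's algorithm; names are invented) of one way to find conflicting exemplars and combine their outputs. It groups exemplars by input pattern and, matching the (0011, 0111) example above, merges conflicting binary outputs by OR-ing their bits; in practice the merge rule depends on the task.

    from collections import defaultdict

    def resolve_conflicts(exemplars):
        """Group (input, output) pairs by input; merge conflicting outputs by bitwise OR."""
        by_input = defaultdict(list)
        for x, y in exemplars:
            by_input[x].append(y)
        resolved = []
        for x, ys in by_input.items():
            if len(set(ys)) > 1:
                # Conflict: identical input pattern with different outputs.
                merged = "".join("1" if any(y[i] == "1" for y in ys) else "0"
                                 for i in range(len(ys[0])))
                resolved.append((x, merged))
            else:
                resolved.append((x, ys[0]))
        return resolved

    print(resolve_conflicts([("0011", "0101"), ("0011", "0010"), ("1100", "1000")]))
    # [('0011', '0111'), ('1100', '1000')]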

Example I: Predicting the Weather

Let us study an interesting neural network application. Its purpose is to predict the local weather based on a set of current weather data:
- temperature (degrees Celsius)
- atmospheric pressure (inches of mercury)
- relative humidity (percentage of saturation)
- wind speed (kilometers per hour)
- wind direction (N, NE, E, SE, S, SW, W, or NW)
- cloud cover (0 = clear, ..., 9 = total overcast)
- weather condition (rain, hail, thunderstorm, ...)

We assume that we have access to the same data from several surrounding weather stations. There are eight such stations that surround our position in the following way:

[Diagram: the eight surrounding weather stations, spaced 100 km apart, arranged around our position.]
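To connect this example to the data-representation guidelines above, here is a hedged sketch (my own illustration, not the lecture's design; the scaling ranges are assumptions) of how one station's readings could be encoded as an input pattern vector: scaled analog values for the graded quantities and a one-unit-per-state subpattern for the wind direction.

    import numpy as np

    WIND_DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

    # Assumed value ranges used to scale each analog reading to [0, 1].
    RANGES = {
        "temperature": (-30.0, 45.0),   # degrees Celsius
        "pressure":    (27.0, 32.0),    # inches of mercury
        "humidity":    (0.0, 100.0),    # percentage of saturation
        "wind_speed":  (0.0, 150.0),    # kilometers per hour
        "cloud_cover": (0.0, 9.0),      # 0 = clear ... 9 = total overcast
    }

    def encode_station(reading):
        """Encode one weather station's readings as an input pattern vector."""
        analog = [(reading[k] - lo) / (hi - lo) for k, (lo, hi) in RANGES.items()]
        direction = [1.0 if d == reading["wind_direction"] else 0.0 for d in WIND_DIRECTIONS]
        return np.array(analog + direction)

    sample = {"temperature": 21.0, "pressure": 29.9, "humidity": 65.0,
              "wind_speed": 12.0, "cloud_cover": 3.0, "wind_direction": "SW"}
    print(encode_station(sample))  # 5 scaled analog values followed by an 8-unit wind subpattern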