Practical Tips for using Backpropagation


Keith L. Downing
August 31, 2017

1 Introduction

In practice, backpropagation is as much an art as a science. The user typically needs to try many combinations of parameter settings before finding the combination that best suits the current task. Few concrete rules exist that cover a wide range of application domains. There are many factors to consider, and we touch on a few of them below. A host of additional tips can be found in the neural network literature, particularly in the Deep Learning book (Goodfellow, Bengio and Courville, 2016). This document is far from a complete list; it is merely a collection of issues that I have encountered in my own experiences with supervised learning using neural networks.

2 Hidden Layers and Nodes

Although there are some theoretical results, there is no hard and fast rule as to how many hidden layers, and how many nodes per layer, are required for a given problem. There are a few considerations.

1. If all of your training and test cases are, in fact, linearly separable, then you can get by without a hidden layer. Unfortunately, most real-life data sets are not linearly separable.

2. If the data are not linearly separable, then, in theory, a single hidden layer is sufficient for learning the data. However, the exact number of nodes in that layer cannot be accurately calculated ahead of time. Also, the best practical solution may involve multiple hidden layers.

3. The more layers and nodes in an ANN, the more connections, and thus the more weights that need to be learned. Adding more connections than necessary can easily increase training time well beyond what is needed to solve the problem.

4. Excess connections increase the risk of overfitting the ANN weight vector to the training set. The ANN may then perform perfectly on the training set but poorly on the test cases, indicating that it has memorized the training cases but cannot generalize from them to handle new cases. This is highly undesirable. To prevent it (in cases where a perceived demand for many connections exists), it helps to include a validation set and to halt training when the error rate on the validation set begins to rise.
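As a concrete illustration of the last point, here is a minimal, framework-free sketch of validation-based early stopping. The functions train_one_epoch and validation_error are placeholders that the user must supply; they are not part of any particular library.

```python
import numpy as np

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=500, patience=10):
    """Halt training when validation error stops improving.

    train_one_epoch(model)  -> runs one backpropagation pass (placeholder).
    validation_error(model) -> error on the held-out validation set (placeholder).
    patience                -> epochs tolerated without improvement before stopping.
    """
    best_error = np.inf
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}: validation error is rising.")
                break
    return model
```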

3 Activation Functions

In working with shallow nets, it is often wise to choose between a sigmoid (a.k.a. the logistic function), a hyperbolic tangent (tanh), or a simple linear (i.e. identity) function. This choice often involves a simple analysis of the situation. If all nodes should be outputting positive values, then the sigmoid is preferable to tanh. If outputs should be bounded above and below, then the sigmoid (range (0,1)) or tanh (range (-1,1)) is preferable to a linear function, unless explicit upper and lower bounds will be imposed on the linear function.

Although the logistic function (a.k.a. sigmoid) has been a common activation function for many years, partly due to its biological relevance but mostly because it is continuous and differentiable yet quite similar to a fixed-threshold step function, it has serious shortcomings when used in deep networks. Essentially, the gradients (i.e. derivatives of error with respect to weights) deteriorate rather quickly when derived from long chains of sigmoids. One of the key breakthroughs in neural network research was the discovery that the Rectified Linear Unit (RELU) preserves significant gradients across these long chains, thus giving more informed weight modifications throughout deep networks. Hence, the RELU and its relatives, the Exponential Linear Unit (ELU) and the Leaky RELU, are quite common fixtures in deep networks.

4 Ranges for Desired Values

When converting desired outputs into target activations for the output nodes of an ANN, it is wise to use a restricted range of values. This pertains to ANNs that use any kind of sigmoidal activation function. For example, if the sigmoid's output range is (0, 1), then it pays to use a subrange such as [0.1, 0.9] for the desired output values. The problem lies in the nature of the sigmoid function, which takes values very close to its minimum or maximum over large sections of its domain, i.e., the areas far from the threshold point. In trying to attain such an asymptotic value, the backpropagation algorithm will often increase an output node's incoming weights indefinitely, and these weight increases can propagate backwards. Once weights become very large, backpropagation has a hard time reducing them quickly, or at all. The whole system can easily spiral out of control, with most weights approaching infinity and most sigmoids becoming saturated, i.e. having weighted-sum inputs that push them to the extreme ends of their range. By using target values well below the asymptotic limit of the sigmoid, one ensures that targets can be achieved exactly and weights will remain more stable.

5 Ranges for Activation Levels

For any application, it is important to decide whether low values of a particular factor, such as an input parameter, should have either (a) a low effect upon downstream nodes, or (b) the opposite effect of a high value. In the former case, your nodes should output in the range [0, 1], while the latter case probably calls for a [-1, 1] range.
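The following numpy sketch illustrates the activation functions discussed above and the gradient-deterioration argument: the derivative of the sigmoid never exceeds 0.25, so backpropagating through a chain of k sigmoid units can shrink a gradient by roughly a factor of 0.25^k (or worse away from the threshold point), whereas a RELU passes a gradient of 1 through its active units. The chain-of-identical-units setup is a deliberate simplification, not a full backpropagation pass.

```python
import numpy as np

def sigmoid(x):              # logistic function, range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                 # hyperbolic tangent, range (-1, 1)
    return np.tanh(x)

def relu(x):                 # rectified linear unit, range [0, inf)
    return np.maximum(0.0, x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)     # never exceeds 0.25

def relu_grad(x):
    return np.asarray(x > 0, dtype=float)   # 1 for active units, 0 otherwise

# Gradient surviving a chain of k identical units, evaluated at x = 1
# (a crude stand-in for backpropagating through k layers).
for k in (2, 5, 10):
    print(k, sigmoid_grad(1.0) ** k, relu_grad(1.0) ** k)
```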

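The target-range advice of Section 4 amounts to a one-line rescaling. A minimal sketch, using the [0.1, 0.9] subrange from the text (any margin away from the asymptotes serves the same purpose):

```python
import numpy as np

def soften_targets(targets, low=0.1, high=0.9):
    """Map 0/1 desired outputs into [low, high] so that a sigmoid
    output node can actually reach them without saturating."""
    targets = np.asarray(targets, dtype=float)
    return low + (high - low) * targets

print(soften_targets([0, 1, 1, 0]))   # -> [0.1 0.9 0.9 0.1]
```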
6 Ranges for Initial Weights

Although the initial settings of the weights may have little effect in many architectures, they can have significant effects in others. As Glorot and Bengio (2010) describe in "Understanding the difficulty of training deep feedforward neural networks", major performance increases can result from initializing all weights to random values within the range [-1/sqrt(M), 1/sqrt(M)], where M is the number of neurons in the upstream layer, which is normally the number of inputs to a neuron in the downstream layer.

7 Information Content of Nodes

One should exercise caution when using tricky encodings of information on the input and output ends of an ANN, even though these can yield more compact networks (i.e., those with fewer nodes and connections). For example, if the color of an automobile is an input parameter to an ANN for assessing insurance risk, then a compact encoding might involve a single input neuron, n, whose activation value would indicate everything from dark colors (low activation levels) to bright colors (high activation levels). Keep in mind that the goal of the ANN is to learn a proper, fixed set of weights emanating from n that will reduce total error and thus give good predictions of insurance risk based on color (and other input variables). Whatever these weights may be, they represent the total effect of color upon all downstream nodes. The activation level of n further modulates that effect, but the changes in effect that a change in activation can produce are severely constrained; in essence, they are constrained to linear behavior. If the activation level goes up, then the absolute value of the effect will also increase. So this encoding can work effectively only if there does indeed exist a linear relationship between the color and the net effect (on insurance risk).

However, it could easily be the case that both very bright colors (such as hot red) and dark colors (black) are typical colors of cars owned by drivers with a street-racer mentality. In fact, there is a strong correlation in the United States between these hot colors and speeding tickets; I am only speculating about the black-colored cars for the purpose of this example. Conversely, colors in the middle of the spectrum, such as pale blue, are more typical family-car colors. The problem is that if both black and red have effects that trigger a high insurance risk, then by tuning weights to realize that influence, one cannot avoid producing weights that will also give blue a similar effect. In this example, the weights would have to be very high to allow a low color input (black) to achieve an influence similar to that of a high color input (red). In this case, it is wiser to use k input nodes, where each represents a single color. Then, backpropagation is free to tune the individual effects of each color (by modifying its outgoing weights) without being hampered by any artificial connections between the colors (as imposed by the more compact representation).
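The last two recommendations can be made concrete in a short numpy sketch: one input node per color rather than a single graded color node, and initial weights drawn uniformly from [-1/sqrt(M), 1/sqrt(M)] with M the fan-in. The color palette and layer sizes here are invented purely for illustration.

```python
import numpy as np

COLORS = ["black", "red", "pale_blue"]   # illustrative palette only

def one_hot_color(color):
    """One input node per color, so backpropagation can tune the
    effect of each color independently."""
    vec = np.zeros(len(COLORS))
    vec[COLORS.index(color)] = 1.0
    return vec

def init_weights(fan_in, fan_out, rng=None):
    """Uniform initialization in [-1/sqrt(M), 1/sqrt(M)], M = fan_in,
    in the spirit of Glorot and Bengio (2010)."""
    if rng is None:
        rng = np.random.default_rng()
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

x = one_hot_color("red")                          # e.g. [0., 1., 0.]
W = init_weights(fan_in=len(COLORS), fan_out=4)   # weights into a 4-node layer
print(x @ W)                                      # net input to that layer
```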

Conversely, if the input factor truly does have a linear effect upon the output factors, for example, if body weight is an input condition and risk of heart attack is an output class, then compressing the range of inputs into a single input node makes more sense.

The same can be said about output nodes and the information encoded in them. In this case, the input arcs to an output node represent the net effect of the ANN's information upon the outputs. Once again, a linear relationship holds, such that the sum of the weighted inputs has a non-decreasing effect upon the output.[1] Once those weights are fixed by learning, they enforce a linear relationship between any given node n1 in the previous layer and a particular output node n2. Thus, the target concepts to which activation levels of n2 correspond should have a linear relationship to the information represented by n1. If not, then the ANN will not be able to determine an optimal mapping between inputs and outputs.

The above advice is heuristic in nature, not absolute. A tricky encoding here or there may not destroy ANN performance, but it is important to be aware of the gains (compact ANNs) and losses (inability to properly differentiate certain cases) associated with witty encoding schemes.

[1] Here, we ignore less conventional, non-monotonic activation schemes such as radial-basis functions.

8 Momentum

Figure 1: Illustration of the effect of momentum in breaking through local minima. Given the current location of the system on the error landscape (upper blue dot), the standard weight-updating algorithm would move it toward the lowest point of the local minimum, as shown by the upper dotted line. However, the previous move (the line labeled Δw(t-1)) provides momentum (long dotted line). The combination of the two vectors (dotted line) gives the resultant weight change, which breaks through the hump that encloses the local minimum.

Backpropagation can easily get stuck at a local minimum, such as that pictured in Figure 1. There, all gradients point upward, toward higher error, so all progress stagnates. To combat the problem, a momentum term is added to the weight-update function, so that the current weight change is a combination of the standard change and the change made during the previous learning pass of backpropagation (the momentum term). Equation 1 expresses this update formula, where η is the learning rate and α is the momentum factor. In many implementations, α decreases with the training epoch, so that momentum has a large effect in the early epochs but a minor role later.

    Δw_ij(t) = -η ∂E/∂w_ij + α Δw_ij(t-1)    (1)
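Equation 1 translates directly into code. A minimal sketch, assuming that the gradient of the error with respect to the weights has already been computed by backpropagation (here it is simply passed in as an argument); the toy quadratic error in the usage example is mine, not from the text.

```python
import numpy as np

def momentum_step(weights, grad, prev_delta, lr=0.1, alpha=0.9):
    """One weight update following Equation 1:
        delta_w(t) = -lr * dE/dw + alpha * delta_w(t-1)
    Returns the updated weights and the delta to reuse on the next step."""
    delta = -lr * grad + alpha * prev_delta
    return weights + delta, delta

# Toy usage on a 1-D quadratic error E(w) = w^2, so dE/dw = 2w.
w = np.array([2.0])
delta = np.zeros_like(w)
for step in range(100):
    grad = 2.0 * w
    w, delta = momentum_step(w, grad, delta)
print(w)   # settles near the minimum at 0
```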

As shown in Figure 1, this momentum can help the search process blast through the walls that surround a local minimum and continue its descent on the error landscape.

9 Optimizing Trial and Error

The lack of certainty as to the proper neural network configuration for a particular task often necessitates a trial-and-error process wherein a wide variety of nets must be built and tested. Tools that can speed up this process are extremely useful. Some of these tools, such as Tensorflow, Theano, Caffe and Keras, are off-the-shelf and appropriate for many needs, but in other cases you may need to develop some of your own aids (often on top of the aforementioned systems). In general, it is well worth the effort to build and/or tailor a general system for neural network experimentation. The savings are enormous, even in the short term, since a single neural network project can easily entail the building and running of hundreds of network prototypes before you find the right one.
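As one example of such an aid, the sketch below uses the Keras API mentioned above to loop over a handful of candidate hidden-layer sizes and record the validation loss of each. The dataset, the candidate sizes and the training settings are placeholders chosen only for illustration.

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 1000 cases, 20 input features, binary targets.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20))
y = (x.sum(axis=1) > 0).astype("float32")

results = {}
for hidden in (4, 16, 64):                       # candidate hidden-layer sizes
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(hidden, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    history = model.fit(x, y, epochs=20, validation_split=0.2, verbose=0)
    results[hidden] = history.history["val_loss"][-1]

print(results)   # pick the configuration with the lowest validation loss
```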