
Neural Networks

Neural Network Neurons A neuron receives n inputs (plus a bias term), multiplies each input by its weight, applies an activation function to the sum of the results, and outputs the result.
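
As a concrete illustration (not from the lecture), here is a minimal Python sketch of this computation; the function name neuron_output and the use of NumPy are my own choices.

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Compute a single neuron's output: activation(w . x + b)."""
    z = np.dot(w, x) + b      # weighted sum of the inputs plus the bias term
    return activation(z)      # pass the sum through the activation function
```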

Activation Functions Given the weighted sum as input, an activation function controls whether a neuron is active or inactive. Commonly used activation functions: o Threshold function: outputs 1 when the input is positive and 0 otherwise (similar to the perceptron) o Sigmoid function: sigma(z) = 1 / (1 + e^(-z)); differentiable, a good property for optimization
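
A small sketch of the two activation functions just mentioned, written so they can be passed to the neuron_output helper above; both function names are illustrative.

```python
import numpy as np

def threshold(z):
    """Threshold (step) activation: 1 when the input is positive, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z)); differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-z))

# The same weighted sum through both activations (using neuron_output from above):
# neuron_output([0.5, -1.0], [2.0, 1.0], 0.1, threshold)  # -> 1.0
# neuron_output([0.5, -1.0], [2.0, 1.0], 0.1, sigmoid)    # -> ~0.525
```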

Basic Multilayer Neural Network [Figure: an input layer x_1, x_2, x_3, x_4, a hidden layer, and an output layer.] Each layer receives its inputs from the previous layer and forwards its outputs to the next, a feed-forward structure. Output layer: sigmoid activation function for classification, linear activation function for regression. Referred to as a two-layer network (two layers of weights).
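
A minimal NumPy sketch of the forward pass through such a two-layer network; the layer sizes, variable names, and random weights are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, output_activation=sigmoid):
    """Forward pass of a two-layer (one hidden layer) feed-forward network.

    W1, b1 -- hidden-layer weights and biases
    W2, b2 -- output-layer weights and biases
    output_activation -- sigmoid for classification, identity for regression
    """
    a_hidden = sigmoid(W1 @ x + b1)              # hidden-layer activations
    return output_activation(W2 @ a_hidden + b2)

# Example with 4 inputs, 3 hidden units, and 1 output, as in the figure:
rng = np.random.default_rng(0)
x = np.array([0.2, 0.4, 0.6, 0.8])
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```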

Representational Power: Any Boolean Formula Consider a formula in disjunctive normal form, e.g., (x1 AND NOT x2) OR (x2 AND x3). [Figure: over inputs x_1, x_2, x_3, one AND unit with weights (1, -1) on (x_1, x_2) and bias -0.5, a second AND unit with weights (1, 1) on (x_2, x_3) and bias -1.5, and an OR unit combining them with weights (1, 1) and bias -0.5.] Each conjunction becomes an AND unit in the hidden layer and an OR unit combines their outputs, so any DNF formula can be represented by a two-layer network of threshold units.
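
A short check, assuming the weights read off the figure, that threshold units wired this way compute (x1 AND NOT x2) OR (x2 AND x3); the function names are my own.

```python
import itertools

def step(z):
    return 1 if z > 0 else 0

def dnf_net(x1, x2, x3):
    """Two AND units feeding an OR unit, all threshold units."""
    and1 = step(1 * x1 - 1 * x2 - 0.5)       # x1 AND NOT x2
    and2 = step(1 * x2 + 1 * x3 - 1.5)       # x2 AND x3
    return step(1 * and1 + 1 * and2 - 0.5)   # OR of the two conjunctions

# Verify against the Boolean formula on all 8 input combinations.
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    formula = int((x1 and not x2) or (x2 and x3))
    assert dnf_net(x1, x2, x3) == formula
```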

Training: Backpropagation It is traditional to train neural networks to minimize the squared error. This is appropriate for regression problems; for classification problems it is more appropriate to optimize the log-likelihood. We will use MSE as our example. Gradient descent: we must apply the chain rule many times to compute the gradient.

Terminology [Figure: inputs x_1, ..., x_4 feed hidden units a_6, a_7, a_8 (with weight vectors W_6, W_7, W_8), which feed the output unit a_9 (with weight vector W_9).]
x = (1, x_1, x_2, x_3, x_4) is the input vector with the bias term 1.
a = (1, a_6, a_7, a_8) is the output of the hidden layer with the bias term 1.
W_v represents the weight vector leading to node v; w_{v,u} represents the weight connecting the u-th node to the v-th node, e.g., w_{9,6} is the weight connecting a_6 to a_9.
We will use σ to represent the activation function, so a_v = σ(W_v · x) for a hidden unit and a_9 = σ(W_9 · a) for the output unit.

Mean Squared Error We adjust the weights of the neural network to minimize the mean squared error (MSE) on the training set; for a single example the squared error is E = ½ (y − ŷ)². Useful fact: the derivative of the sigmoid activation function is σ'(z) = σ(z)(1 − σ(z)).
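
A quick numerical sanity check of this identity (my own sketch, not part of the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.3
analytic = sigmoid(z) * (1.0 - sigmoid(z))                   # sigma'(z) = sigma(z)(1 - sigma(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central-difference estimate
print(analytic, numeric)   # the two values agree to many decimal places
```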

Gradient Descent: Output Unit [Figure: the same network, highlighting the weight w_{9,6}.] w_{9,6} is a component of W_9, connecting a_6 to the output node a_9. Applying the chain rule to E = ½ (y − ŷ)² with ŷ = a_9 = σ(W_9 · a) gives ∂E/∂w_{9,6} = −(y − ŷ) ŷ (1 − ŷ) a_6.

The Delta Rule Define δ_9 = (y − ŷ) ŷ (1 − ŷ); then ∂E/∂w_{9,6} = −δ_9 a_6, and the gradient-descent update is w_{9,6} ← w_{9,6} + η δ_9 a_6, where η is the learning rate.

Derivation: Hidden Units For an input-to-hidden weight such as w_{6,2}, the error depends on w_{6,2} only through a_6, so the chain rule gives ∂E/∂w_{6,2} = (∂E/∂a_6)(∂a_6/∂w_{6,2}) = −δ_9 w_{9,6} a_6 (1 − a_6) x_2.

Delta Rule for Hidden Units Define δ_6 = a_6 (1 − a_6) w_{9,6} δ_9 and rewrite the gradient as ∂E/∂w_{6,2} = −δ_6 x_2, so the update has the same form as for the output unit: w_{6,2} ← w_{6,2} + η δ_6 x_2.

Networks with Multiple Output Units [Figure: the same network with two output units a_9 and a_10.] We get a separate contribution to the gradient from each output unit. Hence, for input-to-hidden weights, we must sum up the contributions: δ_u = a_u (1 − a_u) Σ_v w_{v,u} δ_v, where the sum ranges over the output units v.

Backpropagation Training
Initialize all the weights with small random values.
Repeat
  Begin Epoch
  For each training example do:
    Compute the network output
    Compute the error
    Backpropagate this error from layer to layer and adjust the weights to decrease this error
  End Epoch

Backpropagation for a Single Example
Forward pass: given x, compute a_u for each hidden unit u and ŷ_v for each output unit v.
Compute errors: compute (y_v − ŷ_v) for each output unit v.
Compute output deltas: δ_v = (y_v − ŷ_v) ŷ_v (1 − ŷ_v).
Compute hidden deltas: δ_u = a_u (1 − a_u) Σ_v w_{v,u} δ_v (backpropagate the error from layer to layer).
Compute gradient: ∂E/∂w_{v,u} = −δ_v a_u for hidden-to-output weights, and ∂E/∂w_{u,i} = −δ_u x_i for input-to-hidden weights.
Take a gradient-descent step: w_{v,u} ← w_{v,u} + η δ_v a_u and w_{u,i} ← w_{u,i} + η δ_u x_i.
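
A NumPy sketch of these steps for a network with one hidden layer of sigmoid units, sigmoid outputs, and squared error; bias terms are omitted to keep it short, and the function name is my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_example(x, y, W1, W2, eta=0.1):
    """One gradient-descent step of backpropagation for a single example.

    W1 -- input-to-hidden weights, shape (n_hidden, n_inputs)
    W2 -- hidden-to-output weights, shape (n_outputs, n_hidden)
    """
    # Forward pass
    a = sigmoid(W1 @ x)                   # hidden activations a_u
    y_hat = sigmoid(W2 @ a)               # output activations y_hat_v

    # Output deltas: delta_v = (y_v - y_hat_v) * y_hat_v * (1 - y_hat_v)
    delta_out = (y - y_hat) * y_hat * (1.0 - y_hat)

    # Hidden deltas: delta_u = a_u (1 - a_u) * sum_v w_{v,u} delta_v
    delta_hidden = a * (1.0 - a) * (W2.T @ delta_out)

    # Gradient-descent step (the gradients are outer products of deltas and activations)
    W2 += eta * np.outer(delta_out, a)
    W1 += eta * np.outer(delta_hidden, x)
    return W1, W2
```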

Remarks on Training There is no guarantee of convergence; training may oscillate or reach a local minimum. However, in practice many large networks can be adequately trained on large amounts of data for realistic problems, e.g., driving a car, recognizing handwritten zip codes, or playing world-championship-level backgammon. Many epochs (thousands) may be needed for adequate training, and large data sets may require hours or days of CPU time. Termination criteria can be: a fixed number of epochs, a threshold on training-set error, or increased error on a validation set. To avoid local-minimum problems, run several trials starting from different initial random weights and either take the result with the best training or validation performance, or build a committee of networks that vote during testing, possibly weighting each vote by training or validation accuracy.

Notes on Proper Initialization Start in the linear regions: keep all weights near zero, so that all sigmoid units are in their linear regions. This makes the whole net the equivalent of one linear threshold unit, a relatively simple function, and also avoids very small gradients. Break symmetry: if we start with all the weights equal, the hidden units would all compute the same thing; ensure that each hidden unit has different input weights so that the hidden units move in different directions. Set each weight to a random number in a small range that shrinks with the fan-in, where the fan-in of weight w_{v,u} is the number of inputs to unit v.
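
One common way to realize this advice is to draw each weight uniformly from an interval that shrinks as 1/sqrt(fan-in); the exact range used here is an assumption of this sketch, not necessarily the one from the lecture.

```python
import numpy as np

def init_weights(n_out, n_in, rng):
    """Small random weights that break symmetry, scaled by the fan-in of each unit."""
    bound = 1.0 / np.sqrt(n_in)          # fan-in = number of inputs to the unit
    return rng.uniform(-bound, bound, size=(n_out, n_in))

rng = np.random.default_rng(0)
W1 = init_weights(3, 5, rng)   # e.g. 3 hidden units, 4 inputs + bias
W2 = init_weights(1, 4, rng)   # 1 output unit, 3 hidden units + bias
```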

Overtraining Prevention Running too many epochs may overtrain the network and result in overfitting. Keep a validation set and test accuracy after every epoch. Maintain the weights of the best-performing network on the validation set and return them when performance decreases significantly beyond this point. To avoid losing training data to validation: use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance, then train on the full data set for this many epochs to produce the final result.
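
A skeleton of the epoch loop with validation-based early stopping as described above; train_one_epoch and validation_error are caller-supplied placeholders, not real library calls.

```python
import copy

def train_with_early_stopping(net, train_one_epoch, validation_error,
                              max_epochs=1000, patience=20):
    """Early stopping: keep the weights that did best on the validation set.

    train_one_epoch(net)  -- caller-supplied: one pass of backpropagation over the training set
    validation_error(net) -- caller-supplied: error of `net` on the held-out validation set
    """
    best_net, best_err, since_best = copy.deepcopy(net), float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(net)
        err = validation_error(net)
        if err < best_err:                   # new best network: remember its weights
            best_net, best_err, since_best = copy.deepcopy(net), err, 0
        else:
            since_best += 1
            if since_best >= patience:       # no improvement for `patience` epochs
                return best_net
    return best_net
```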

Over-fitting Prevention Too few hidden units prevent the system from adequately fitting the data and learning the concept. Too many hidden units lead to over-fitting. A similar cross-validation method can be used to decide on an appropriate number of hidden units. Another approach to preventing over-fitting is weight decay, in which we multiply all weights by some fraction between 0 and 1 after each epoch. - Encourages smaller weights and less complex hypotheses. - Equivalent to including an additive penalty in the error function proportional to the sum of the squares of the weights of the network.
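
The multiplicative form of weight decay described above amounts to one line per epoch; a minimal sketch, with the decay fraction 0.999 chosen arbitrarily.

```python
def apply_weight_decay(weights, decay=0.999):
    """Multiply every weight matrix by a fraction just below 1 after each epoch."""
    return [decay * W for W in weights]

# e.g. after each training epoch:
# W1, W2 = apply_weight_decay([W1, W2], decay=0.999)
```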

Input/Output Coding Appropriate coding of inputs and outputs can make learning easier and improve generalization. It is best to encode a discrete multi-category feature using multiple input units, with one binary unit per value. Continuous inputs can be handled by a single input unit, scaled to the range between 0 and 1. For classification problems, it is best to have one output unit per class; continuous output values then represent certainty in the various classes, and test instances are assigned to the class with the highest output. Use target values of 0.9 and 0.1 for binary outputs rather than forcing the weights to grow large enough to closely approximate 0/1 outputs. Continuous outputs (regression) can also be handled by scaling to the range between 0 and 1.
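
A small sketch of the two input encodings suggested above, one-hot coding for a discrete feature and min-max scaling for a continuous one; the function names are my own.

```python
import numpy as np

def one_hot(value, categories):
    """Encode a discrete feature with one binary input unit per possible value."""
    return np.array([1.0 if value == c else 0.0 for c in categories])

def scale01(x, lo, hi):
    """Scale a continuous input into [0, 1] given its observed minimum and maximum."""
    return (x - lo) / (hi - lo)

print(one_hot("green", ["red", "green", "blue"]))   # -> [0. 1. 0.]
print(scale01(37.0, lo=0.0, hi=100.0))              # -> 0.37
```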

Softmax for Multi-Class Classification For K classes, we have K nodes in the output layer, one for each class. Let z_k = W_k · a be the net input of the class-k node, where a is the output of the hidden layer and W_k is the weight vector leading into the class-k node. We define the outputs by the softmax function: ŷ_k = exp(z_k) / Σ_j exp(z_j). [Figure: the same network with two output nodes ŷ_1 and ŷ_2 computed by a softmax over W_9 · a and W_10 · a.]
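
A minimal implementation of this definition; subtracting the maximum for numerical stability is an addition of mine, not from the slide.

```python
import numpy as np

def softmax(z):
    """Softmax over the K output nodes: y_hat_k = exp(z_k) / sum_j exp(z_j)."""
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])      # net inputs z_k = W_k . a of the K output nodes
print(softmax(z))                  # probabilities that sum to 1
```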

Likelihood Objective Given an instance x, we compute the outputs ŷ_k = P_W(y = k | x) for k = 1, ..., K. These are the probabilities computed using the softmax, and they are a function of the weights of the neural net (thus the subscript W). Log-likelihood function: L(W) = Σ_i log P_W(y_i | x_i), summed over the training examples. A similar gradient-descent algorithm can be derived to maximize it.
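
A sketch of the log-likelihood computed from softmax outputs; representing the data as a list of net-input vectors and true class indices is my own choice for the illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_likelihood(net_inputs, labels):
    """Sum over examples of log P_W(y_i | x_i), with P_W given by the softmax outputs.

    net_inputs -- list of z vectors, one per training example (K values each)
    labels     -- list of true class indices y_i
    """
    return sum(np.log(softmax(z)[y]) for z, y in zip(net_inputs, labels))

# Maximizing this (or minimizing its negation, the cross-entropy) by gradient
# descent gives an update analogous to the squared-error delta rule above.
print(log_likelihood([np.array([2.0, 1.0, 0.1])], [0]))
```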

Recent Development A recent trend in ML is deep learning, which learns feature hierarchies from large amounts of unlabeled data. The feature hierarchies are expected to capture the inherent structure in the data, and they often lead to better classification when the learned features are used to train with labeled data. Neural networks provide one approach to deep learning.

Shallow vs. Deep Architectures [Figure: a traditional shallow architecture maps the raw input (image, video, text) directly to a class label; a deep architecture first maps the input to a learned feature representation and then to the class label.]

Deep Learning with Auto-encoders The network is trained to output its own input. The hidden layer serves to extract features from the input that are essential for representing it; constraining the hidden layer (layer 2) to be sparse encourages a useful representation. Training is carried out with the same back-propagation procedure.
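
A toy sketch of this idea, reusing the same backpropagation updates as before to train a tiny auto-encoder whose targets are its inputs; the data, layer sizes, and learning rate are arbitrary, and the sparsity constraint is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny auto-encoder: 8 inputs -> 3 hidden units -> 8 outputs, targets = inputs.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 8)).astype(float)     # toy binary data
W1 = rng.uniform(-0.1, 0.1, size=(3, 8))
W2 = rng.uniform(-0.1, 0.1, size=(8, 3))

for epoch in range(2000):
    for x in X:
        a = sigmoid(W1 @ x)                            # hidden code (the learned features)
        x_hat = sigmoid(W2 @ a)                        # reconstruction of the input
        delta_out = (x - x_hat) * x_hat * (1 - x_hat)  # same deltas as in backprop above
        delta_hidden = a * (1 - a) * (W2.T @ delta_out)
        W2 += 0.5 * np.outer(delta_out, a)
        W1 += 0.5 * np.outer(delta_hidden, x)
```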

Deep Structure: Stacked Auto-encoders Stacking auto-encoders, each trained on the features produced by the previous one, continues this process and learns a feature hierarchy. The final feature layer can then be used as the features for supervised learning. One can also do supervised training on the entire network to fine-tune all the weights.

That's the basic idea. There are many types of deep learning: different kinds of autoencoders, variations on architectures and training algorithms, etc. Various packages exist: Theano, Caffe, Torch, Convnet. Deep learning has had a tremendous impact in vision, speech, and natural language processing, and it is a very fast-growing area. To learn more about deep learning, take Dr. Fuxin Li's CS519 next term.