COMP9444 Neural Networks and Deep Learning 5. Geometry of Hidden Units

Outline

- Geometry of Hidden Unit Activations
- Limitations of 2-layer networks
- Alternative transfer functions (Textbook, Sections 6.2.2, 6.3)
- Dropout (Textbook, Section 7.12)

Encoder Networks

[Figure: a feed-forward network mapping N inputs through a small hidden layer to N outputs]

- identity mapping through a bottleneck
- also called the N-M-N task
- used to investigate hidden unit representations

Encoder Hidden Unit (HU) Space

[Figure: hidden unit activations of a trained N-M-N encoder]
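The N-M-N encoder task is small enough to reproduce directly. Below is a minimal sketch of the 8-3-8 version, assuming PyTorch (the slides do not prescribe a framework); the sigmoid bottleneck, learning rate and epoch count are illustrative choices only.

```python
import torch
import torch.nn as nn

N, M = 8, 3
data = torch.eye(N)                    # the 8 one-hot training patterns

model = nn.Sequential(
    nn.Linear(N, M), nn.Sigmoid(),     # input-to-hidden (the bottleneck)
    nn.Linear(M, N),                   # hidden-to-output (logits)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()        # target = index of the active input unit

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(data), torch.arange(N))
    loss.backward()
    optimizer.step()

# Each column of the input-to-hidden weight matrix holds one input unit's
# weights, i.e. one point in the 3-dimensional hidden unit space.
print(model[0].weight.data.t())        # shape (8, 3): one point per input unit
```

Plotting these eight points, together with the hidden-to-output weights, is exactly the kind of inspection the exercise below asks for.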

8-3-8 Encoder

[Figure]

Hinton Diagrams

- used to visualize weights in higher dimensions
- white = positive, black = negative

[Figure: a network with output units Sharp Left / Straight Ahead / Sharp Right, a layer of hidden units, and a sensor input retina]

Exercise

- Draw the hidden unit space for 2-1-2, 3-2-3, 4-2-4 and 5-2-5 encoders. Represent the input-to-hidden weights for each input unit by a point, and the hidden-to-output weights for each output unit by a line.
- Now consider the 8-3-8 encoder with its 3-dimensional hidden unit space. What shape would be formed by the 8 points representing the input-to-hidden weights for the 8 input units? What shape would be formed by the planes representing the hidden-to-output weights for each output unit?
- Hint: think of two Platonic solids which are dual to each other.

Learning Face Direction

[Figures]
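The Hinton Diagrams slide above only shows the idea pictorially. A short sketch of how such a diagram can be drawn is given below, assuming NumPy and matplotlib (it follows the standard matplotlib Hinton-diagram recipe); the random 8x3 matrix stands in for a real weight matrix.

```python
import numpy as np
import matplotlib.pyplot as plt

def hinton(matrix, ax=None):
    """Draw a Hinton diagram: square area = |weight|, white = positive, black = negative."""
    ax = ax if ax is not None else plt.gca()
    max_weight = 2 ** np.ceil(np.log2(np.abs(matrix).max()))
    ax.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    for (row, col), w in np.ndenumerate(matrix):
        colour = 'white' if w > 0 else 'black'
        size = np.sqrt(abs(w) / max_weight)
        ax.add_patch(plt.Rectangle([col - size / 2, row - size / 2],
                                   size, size, facecolor=colour, edgecolor=colour))
    ax.autoscale_view()
    ax.invert_yaxis()

hinton(np.random.randn(8, 3))   # e.g. the input-to-hidden weights of an 8-3-8 encoder
plt.show()
```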

Symmetries

- swap any pair of hidden nodes, and the overall function will be the same
- on any hidden node, reverse the sign of all incoming and outgoing weights (assuming a symmetric transfer function), and the overall function will be the same
- hidden nodes with identical input-to-hidden weights would in theory never separate, so all weights must begin with different (small) random values
- in practice, all hidden nodes try to do a similar job at first, then gradually specialize

Controlled Nonlinearity

- for small weights, each layer implements an approximately linear function, so multiple layers also implement an approximately linear function
- for large weights, the transfer function approximates a step function, so computation becomes digital and learning becomes very slow
- with typical weight values, a two-layer neural network implements a function which is close to linear, but takes advantage of a limited degree of nonlinearity

Limitations of Two-Layer Neural Networks

Some functions cannot be learned with a 2-layer sigmoidal network. For example, the Twin Spirals problem cannot be learned with a 2-layer network, but it can be learned using a 3-layer network if we include shortcut connections between non-consecutive layers.

[Figure: the Twin Spirals dataset]

Adding Hidden Layers

- the Twin Spirals problem can be learned by a 3-layer network with shortcut connections (see the sketch below)
- the first hidden layer learns linearly separable features
- the second hidden layer learns convex features
- the output layer combines these to produce concave features
- training the 3-layer network is delicate: the learning rate and initial weight values must be very small, otherwise the network will converge to a local optimum
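A minimal sketch of such an architecture, assuming PyTorch; the hidden layer size, tanh transfer function and sigmoid output are illustrative choices, not the exact settings used in the lecture.

```python
import torch
import torch.nn as nn

class ShortcutNet(nn.Module):
    """Two hidden layers, with shortcut connections from every earlier layer."""
    def __init__(self, num_hid=10):
        super().__init__()
        self.hid1 = nn.Linear(2, num_hid)             # input -> hidden 1
        self.hid2 = nn.Linear(2 + num_hid, num_hid)   # input + hidden 1 -> hidden 2
        self.out  = nn.Linear(2 + 2 * num_hid, 1)     # input + both hidden layers -> output

    def forward(self, x):
        h1 = torch.tanh(self.hid1(x))
        h2 = torch.tanh(self.hid2(torch.cat([x, h1], dim=1)))
        return torch.sigmoid(self.out(torch.cat([x, h1, h2], dim=1)))

net = ShortcutNet()
print(net(torch.randn(4, 2)).shape)   # torch.Size([4, 1]): one class probability per point
```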

Vanishing / Exploding Gradients

Training by backpropagation in networks with many layers is difficult.

- When the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect.
- When the weights are large, the activations in the higher layers will saturate to extreme values. As a result, the gradients at those layers will become very small, and will not be propagated back to the earlier layers.
- When the weights have intermediate values, the differentials will sometimes get multiplied many times in places where the transfer function is steep, causing them to blow up to large values.

Solutions to Vanishing Gradients

- layerwise unsupervised pre-training
- long short-term memory (LSTM)
- new activation functions

LSTM is specific to recurrent neural networks. We will discuss unsupervised pre-training and LSTM later in the course.

Activation Functions (6.3)

[Figure: plots of the Sigmoid, Hyperbolic Tangent, Rectified Linear Unit (ReLU) and Scaled Exponential Linear Unit (SELU) activation functions]

- Sigmoid and hyperbolic tangent were traditionally used for 2-layer networks, but they suffer from the vanishing gradient problem in deeper networks.
- Rectified Linear Units (ReLUs) are popular for deep networks, including convolutional networks. Their gradients do not vanish, but their highly linear nature may cause other problems.
- Scaled Exponential Linear Units (SELUs) are a recent innovation which seems to work well for very deep networks.
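For reference, the four activation functions shown in the figure can be written down directly. A NumPy sketch is given below (the slides contain no code; the SELU constants are the standard published values, rounded).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def selu(x, alpha=1.6733, lam=1.0507):
    # lam * x for x > 0, and lam * alpha * (exp(x) - 1) for x <= 0
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```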

Dropout (7.12)

- During training, nodes are randomly chosen not to be used, with some fixed probability (usually one half).
- Dropout forces the network to achieve redundancy, because it must deal with situations where some features are missing.
- When training is finished and the network is deployed, all nodes are used, but their activations are multiplied by the same probability that was used for dropout. Thus, the activation received by each unit is the average value of what it would have received during training.
- Another way to view dropout is that it implicitly (and efficiently) simulates an ensemble of different architectures.

Ensembling

- Ensembling is a method where a number of different classifiers are trained on the same task, and the final class is decided by voting among them.
- In order to benefit from ensembling, we need diversity among the classifiers. For example, we could train three neural networks with different architectures, three Support Vector Machines with different dimensions and kernels, as well as two other classifiers, and ensemble all of them to produce a final result. (Kaggle competition entries are often built this way.)
- Diversity can also be achieved by training on different subsets of the data.

Bagging

- Suppose we are given N training items. Each time we train a new classifier, we choose N items from the training set with replacement (see the sketch below). This means that some items will not be chosen, while others are chosen two or three times.
- There will be diversity among the resulting classifiers because each has been trained on a different subset of the data. They can be ensembled to produce a more accurate result than a single classifier.
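A tiny sketch of the resampling step behind bagging, assuming NumPy; the 10-item toy training set and the three classifiers are purely illustrative.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw N items from an N-item training set with replacement."""
    N = len(X)
    idx = rng.integers(0, N, size=N)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)   # toy training set of N = 10 items
y = np.arange(10)                  # item labels, used here just to show which items were drawn

for k in range(3):                 # three classifiers, three different resampled training sets
    Xk, yk = bootstrap_sample(X, y, rng)
    print(f"classifier {k}: trained on items {sorted(yk.tolist())}")
```

Some item numbers appear twice or three times in each printed list while others are missing, which is exactly the source of diversity described above.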

Dropout as an Implicit Ensemble

In the case of dropout, the same data are used each time, but a different architecture is created by removing the nodes that are dropped out. The trick of multiplying the output of each node by the dropout probability implicitly averages the output over all of these different models.
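A minimal sketch of this train-time versus deploy-time behaviour, assuming NumPy; keep_prob here denotes the probability that a node is kept (with the usual value of one half, the keep and drop probabilities coincide, which is how the slides phrase it).

```python
import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    """Zero each activation with probability (1 - keep_prob) during training;
    scale all activations by keep_prob at deployment time."""
    if training:
        mask = np.random.rand(*activations.shape) < keep_prob
        return activations * mask       # a different "thinned" architecture each call
    return activations * keep_prob      # the average of what the unit saw during training

h = np.ones((2, 4))
print(dropout(h, training=True))    # roughly half of the activations are zeroed
print(dropout(h, training=False))   # every activation scaled to 0.5
```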