Opening the black box of Deep Neural Networks via Information (Ravid Shwartz-Ziv and Naftali Tishby) An overview by Philip Amortila and Nicolas Gagné

Problem: The usual "black box" story: "Despite their great success, there is still no comprehensive understanding of the internal organization of Deep Neural Networks." (The slide accompanies this with an abstracted line drawing and the caption "bull"???)

Solution: Opening the black box using mutual information!

Spoiler alert! As we train a deep neural network, its layers will:
1. Gain "information" about both the true label and the input.
2. Then gain "information" about the true label but lose "information" about the input!
During that second stage, the network forgets the details of the input that are irrelevant to the task. (Not unlike Picasso's deconstruction of the bull.)

Before delving in, we first need to define what we mean by "information". By information, we mean mutual information.

What is mutual information? For discrete random variables X and Y:
Entropy of X: $H(X) := -\sum_{i=1}^{n} p(x_i) \log p(x_i)$
Entropy of X given y: $H(X \mid y) := -\sum_{i=1}^{n} p(x_i \mid y) \log p(x_i \mid y)$
Entropy of X given Y: $H(X \mid Y) := \sum_{j=1}^{m} p(y_j)\, H(X \mid y_j)$
Mutual information between X and Y: $I(X; Y) := H(X) - H(X \mid Y)$
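As a concrete illustration (not part of the original slides), here is a minimal Python sketch of these definitions; the toy joint distribution and the helper names are illustrative assumptions.

```python
# Minimal sketch: H(X), H(X|Y), and I(X;Y) for discrete variables,
# given a joint distribution p(x, y) stored as a numpy array.
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), where joint[i, j] = p(x_i, y_j)."""
    p_x = joint.sum(axis=1)   # marginal p(x)
    p_y = joint.sum(axis=0)   # marginal p(y)
    h_x = entropy(p_x)
    # H(X|Y) = sum_j p(y_j) * H(X | y_j)
    h_x_given_y = sum(
        p_y[j] * entropy(joint[:, j] / p_y[j])
        for j in range(joint.shape[1]) if p_y[j] > 0
    )
    return h_x - h_x_given_y

# Toy example: two correlated binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))  # ~0.278 bits
```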

Setup: Given a deep neural network whose input X and label Y are discrete random variables with joint distribution $p(x, y)$, we identify the propagated values at layer i with the vector $T_i$. Doing so for every layer, we obtain the Markov chain $Y \to X \to T_1 \to \cdots \to T_k$. (Hint: "Markov chain" rhymes with "data processing inequality".) Hence $I(Y; T_i) \ge I(Y; T_{i+n})$ and $I(X; T_i) \ge I(X; T_{i+n})$.
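In practice, the paper estimates these quantities by discretizing each neuron's activation into bins and computing mutual information from the resulting empirical distributions. A minimal sketch of that idea follows; the bin count, variable names, and the treatment of each training sample as one value of X are assumptions, not the paper's exact code.

```python
# Sketch: estimate I(X;T) and I(Y;T) for one layer by binning its activations.
import numpy as np
from collections import Counter

def layer_mutual_information(activations, labels, n_bins=30):
    """activations: (n_samples, width) array of a layer's outputs (e.g. tanh values in [-1, 1]).
    labels: (n_samples,) array of discrete labels y."""
    n = len(labels)
    labels = np.asarray(labels)
    # Discretize each activation into one of n_bins bins, then hash each whole vector.
    bins = np.linspace(-1, 1, n_bins + 1)
    t_ids = [tuple(row) for row in np.digitize(activations, bins)]

    p_t = np.array(list(Counter(t_ids).values())) / n
    h_t = -np.sum(p_t * np.log2(p_t))
    # The (binned) representation is a deterministic function of the input, so I(X;T) = H(T).
    i_xt = h_t

    # I(Y;T) = H(T) - H(T|Y), with H(T|Y) from the per-label empirical distributions.
    h_t_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        counts = Counter(t for t, m in zip(t_ids, mask) if m)
        p_t_y = np.array(list(counts.values())) / mask.sum()
        h_t_given_y += mask.mean() * -np.sum(p_t_y * np.log2(p_t_y))
    i_yt = h_t - h_t_given_y
    return i_xt, i_yt
```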

Next, pick your favourite layer, say $T_i$. We will plot $T_i$'s current location on the "information plane", whose horizontal axis is $I(X;T)$ ("how much it knows about X") and whose vertical axis is $I(Y;T)$ ("how much it knows about Y").

Then, we train our deep neural network for a bit and plot the layer's new location. We train a bit more, plot again, and, as training continues, we trace the layer's trajectory in the information plane.
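A minimal matplotlib sketch of how one might trace such a trajectory, reusing the layer_mutual_information helper sketched above; the model, train_step, and get_layer_activations arguments are placeholders you would supply for your own framework.

```python
# Sketch: trace one layer's trajectory in the information plane over training.
import matplotlib.pyplot as plt

def trace_information_plane(model, X, y, layer_index, n_epochs,
                            train_step, get_layer_activations):
    points = []
    for epoch in range(n_epochs):
        train_step(model, X, y)                           # one epoch of SGD (placeholder)
        T = get_layer_activations(model, X, layer_index)  # (n_samples, width) activations
        points.append(layer_mutual_information(T, y))     # (I(X;T), I(Y;T))
    xs, ys = zip(*points)
    plt.scatter(xs, ys, c=range(n_epochs), cmap="viridis")  # colour encodes the epoch
    plt.xlabel('I(X;T)  ("how much it knows about X")')
    plt.ylabel('I(Y;T)  ("how much it knows about Y")')
    plt.colorbar(label="epoch")
    plt.show()
```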

Let's see what it looks like for a few fixed data points. (The slides illustrate this with example (image, label) pairs, such as images labelled "bull", "dog", and "goat", plotted in the information plane with "how much it knows about X" on one axis and "how much it knows about Y" on the other.)

Numerical Experiments and Results: Examining the dynamics of SGD in the mutual information plane

Experimental Setup: The authors explore fully connected feed-forward neural nets with no other architectural constraints: 7 fully connected hidden layers with widths 12-10-7-5-4-3-2, sigmoid activation on the final layer, and tanh activation on all other layers. The task is a binary decision problem on synthetic data D drawn from a known joint distribution P(X, Y). They run experiments with 50 different randomized weight initializations and 50 different datasets generated from the same distribution, training with SGD to minimize the cross-entropy loss and with no regularization.
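For concreteness, here is a minimal Keras sketch of a comparable network; the input dimension, single-sigmoid output, learning rate, batch size, and placeholder data are assumptions, and the paper's synthetic distribution P(X, Y) is not reproduced.

```python
# Sketch: tanh hidden layers of widths 12-10-7-5-4-3-2, sigmoid output,
# plain SGD on cross-entropy, no regularization. Hyperparameters are assumptions.
import numpy as np
import tensorflow as tf

def build_model(input_dim=12, hidden_widths=(12, 10, 7, 5, 4, 3, 2)):
    model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for width in hidden_widths:
        model.add(tf.keras.layers.Dense(width, activation="tanh"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # binary decision
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.04),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Placeholder data standing in for the paper's synthetic distribution P(X, Y).
X = np.random.randint(0, 2, size=(4096, 12)).astype("float32")
y = (X.sum(axis=1) > 6).astype("float32")

model = build_model()
model.fit(X, y, epochs=100, batch_size=256, verbose=0)
```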

Dynamics in the Information Plane

All 50 runs follow similar paths in the information plane. There are two distinct phases of training: a fast empirical error minimization (ERM) phase and a slower representation compression phase. In the first phase (roughly the first 400 epochs), the test error is rapidly reduced (increase in I(T;Y)). In the second phase (from about 400 to 9,000 epochs), the error is relatively unchanged but the layers lose input information (decrease in I(X;T)).

Representation Compression: The loss of input information occurs without any form of regularization. This prevents overfitting, since the layers lose irrelevant information ("generalizing by forgetting"). However, overfitting can still occur with less data. (The paper's figure compares runs trained on 5%, 45%, and 85% of the data.)

Phase transitions: The ERM phase is called a drift phase, in which the gradients are large and the weights change rapidly (high signal-to-noise ratio). The representation compression phase is called a diffusion phase, in which the gradient means are small compared to their variance (low signal-to-noise ratio). (In the paper's plots, the phase transition occurs at the dotted line.)
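A minimal sketch of the diagnostic behind this distinction: for one layer, compare the norm of the mean minibatch gradient to the norm of its standard deviation across minibatches. The names and exact statistic are assumptions, and collecting per-minibatch gradients is framework-specific and not shown.

```python
# Sketch: gradient signal-to-noise ratio for one layer.
# `grads` is a list of that layer's flattened gradient vectors, one per minibatch within an epoch.
import numpy as np

def gradient_snr(grads):
    g = np.stack(grads)                      # shape: (n_minibatches, n_weights)
    signal = np.linalg.norm(g.mean(axis=0))  # norm of the mean gradient
    noise = np.linalg.norm(g.std(axis=0))    # norm of the per-weight std across minibatches
    return signal / noise

# High SNR -> drift phase (weights move consistently in one direction).
# Low SNR  -> diffusion phase (updates look like noise around a small mean).
```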

Effectiveness of Deep Nets: Because of the low signal-to-noise ratio in the diffusion phase, the final weights obtained by the DNN are effectively random. Across different experiments, the correlations between the weights of different neurons in the same layer were very small. This indicates that there is a huge number of different networks with essentially optimal performance, and that attempts to interpret single weights or single neurons in such networks can be meaningless.
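A minimal sketch of one way to probe this kind of claim: measure the pairwise correlations between neurons' incoming weight vectors within a layer. This is an illustrative check, not the paper's exact analysis.

```python
# Sketch: mean absolute pairwise correlation between neurons in one layer.
# `W` is that layer's weight matrix of shape (fan_in, width), one column per neuron.
import numpy as np

def mean_abs_neuron_correlation(W):
    corr = np.corrcoef(W.T)                        # rows of W.T are neurons' incoming weights
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())          # a small value means the weights look ~uncorrelated
```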

Why go deep? Deeper networks give faster ERM minimization (in epochs) and faster representation compression (in epochs).

Discussions/Disclaimers: The claims being made are certainly very interesting, although the scope of the experiments is limited: only one specific distribution and one specific network are examined. The paper acknowledges that different setups need to be tested: do the results hold with different decision rules and network architectures? And is this observed in real-world problems?

Discussions/Disclaimers: As of now there is some controversy about whether or not these claims hold up; see "On the Information Bottleneck Theory of Deep Learning" (Saxe et al., 2018): "Here we show that none of these claims hold true in the general case. [...] we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed [...] Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent."

The End