Table of Contents: What Really is a Hidden Unit? · Visualizing Feed-Forward NNs · Visualizing Convolutional NNs · Visualizing Recurrent NNs · Visualizing Attention · Visualizing High Dimensional Data · What do visualizations get us now?
What Really is a Hidden Unit? Some operation that outputs a weighted value. Sometimes it's squashed between -1 and +1, or it's just weighted by 1, giving a linear value. Let's say between -1 and +1 for this example. This is really a projection into a weighted space.
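A minimal sketch of that idea: one hidden unit is just a weighted sum squashed into (-1, +1), here by tanh. The specific weights and bias below are made-up illustration values, not anything from the slides.

```python
import math

def hidden_unit(inputs, weights, bias):
    # Weighted sum of the inputs, then squash into (-1, +1) with tanh.
    # This is the "projection into a weighted space" from the slide.
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(pre_activation)

# Project a 2-D point through one hidden unit (hypothetical weights):
value = hidden_unit([0.5, -1.0], weights=[2.0, 0.3], bias=0.1)
```

Whatever the inputs, the output always lands strictly between -1 and +1; dropping the tanh would give the "just weighted by 1", linear case.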
Let's Look at a Distribution We Want to Model: a simple distribution of blue points totally surrounded by red points.
Let's Put This Data Into Random Hidden Units. Still not linearly separable... still not linearly separable... finally, linearly separable! We just visualized a simple hidden layer!
Let's Put This Data Into Random Hidden Units. This distribution is obviously not linearly separable in its original space. To separate this data with a NN, we need multiple layers.
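To see why a nonlinear projection helps, here is a toy version of the blue-surrounded-by-red data, with a hand-picked "hidden layer" (squared features standing in for learned tanh units). In the projected space a single straight line separates the classes; the data, radii, and threshold are all illustrative assumptions.

```python
import math
import random

def ring_data(n=200, seed=0):
    """Blue points inside radius 1, red points in a ring between radius 2 and 3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        blue = rng.random() < 0.5
        r = rng.uniform(0.0, 1.0) if blue else rng.uniform(2.0, 3.0)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        data.append(((r * math.cos(theta), r * math.sin(theta)), blue))
    return data

def project(point):
    # A hand-picked hidden-layer projection: squared features.
    # In (x^2, y^2) space, distance from the origin becomes linear.
    x, y = point
    return (x * x, y * y)

# After projection, the linear boundary u + v = 2.5 separates blue from red:
separable = all(
    (u + v < 2.5) == blue
    for (u, v), blue in ((project(p), b) for p, b in ring_data())
)
```

The same effect is what the slides show with learned hidden units: the layers keep re-projecting the data until one linear cut suffices.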
Let's See a Network of Layers Like These Learning... This is how you visualize a basic NN. Let's imagine a network structured like this: is it able to learn this distribution?
Visualize a basic NN... Let's look at some more networks: http://playground.tensorflow.org
Paper Here: https://cs.nyu.edu/~fergus/papers/zeilereccv2014.pdf Visualize a Convolutional NN...
Visualize a Convolutional NN... [Feature visualizations for Layer 4 and Layer 5] Paper Here: https://cs.nyu.edu/~fergus/papers/zeilereccv2014.pdf
Visualize a Convolutional NN... Let's look at a more interactive visualization: http://scs.ryerson.ca/~aharley/vis/conv/
What about an RNN? Turns out that it's similar, but different.
Visualize a Recurrent NN... Activations change depending on the input. Paper Here: https://arxiv.org/pdf/1506.02078.pdf
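The kind of visualization in that paper heat-maps a single hidden unit's activation over each character of the input. A toy, single-unit sketch of recording such a trace (the weights and the "inside-quote" input feature are hypothetical, hand-picked for illustration):

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    # One tanh-RNN step for a single hidden unit.
    return math.tanh(w_xh * x + w_hh * h + b)

def unit_trace(text):
    """Record one hidden unit's activation per character, the raw material
    for a Karpathy-style activation heat map over the text."""
    h = 0.0
    trace = []
    for ch in text:
        x = 1.0 if ch == '"' else 0.0  # toy input feature: quote detector
        h = rnn_step(x, h, w_xh=2.0, w_hh=0.9, b=-0.5)
        trace.append(h)
    return trace

trace = unit_trace('say "hi" now')  # one activation value per character
```

Coloring each character by its trace value is exactly the visualization the slides reference: the activations visibly change with the input.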
Visualize a Recurrent NN... Average LSTM gate positions, dependent on their inputs. Paper Here: https://arxiv.org/pdf/1506.02078.pdf
Visualize a Recurrent NN... Average GRU gate positions, dependent on their inputs. Paper Here: https://arxiv.org/pdf/1506.02078.pdf
Visualizing Attention... First, let's talk about attention; we have not covered it yet. This is going to be hand-wavy, but the professor will cover this in lecture shortly. Note that there is only one attention value calculated per input. [Diagram: Inputs → Attention Values (calculated by the attention model) → Outputs] Paper Here: https://arxiv.org/pdf/1409.0473.pdf
Visualizing Attention... a() is a learned neural network. Paper Here: https://arxiv.org/pdf/1409.0473.pdf
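A minimal sketch of that learned a(), in the additive (Bahdanau-style) form: a tiny network scores each input against the query, and a softmax turns the scores into one weight per input. All weights here are made-up scalars for illustration, not trained values.

```python
import math

def attention_weights(query, keys, w_q, w_k, v):
    """Additive attention: score each input with a small learned network
    a(q, k) = v * tanh(w_q*q + w_k*k), then softmax over the scores."""
    scores = [v * math.tanh(w_q * query + w_k * k) for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # one weight per input, summing to 1

weights = attention_weights(query=0.5, keys=[0.1, 0.9, -0.3],
                            w_q=1.0, w_k=2.0, v=1.5)
```

These per-input weights are exactly the values the upcoming slides visualize as heat maps over the inputs.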
Visualizing Attention... We can visualize these values!
Visualizing Attention... Soft Attention (Softmax Activation - Differentiable) Hard Attention (Step Function Activation - Not Differentiable) Paper Here: https://arxiv.org/pdf/1502.03044.pdf
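The soft/hard distinction above can be sketched in a few lines: hard attention replaces the smooth softmax weights with a step function that puts all the mass on one input, which is why it is not differentiable (and is typically trained with sampling methods instead of backprop). The example weights are illustrative.

```python
def harden(soft_weights):
    """Turn soft (softmax) attention weights into hard attention:
    all mass on the single highest-weighted input. The argmax step
    has zero gradient almost everywhere, hence 'not differentiable'."""
    best = max(range(len(soft_weights)), key=soft_weights.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(soft_weights))]

soft = [0.1, 0.7, 0.2]   # differentiable, spread over all inputs
hard = harden(soft)      # all attention on one input
```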
Visualizing Attention... Sometimes this can show us where our errors are... Paper Here: https://arxiv.org/pdf/1502.03044.pdf
Visualizing Attention... You can also do this with Seq2Seq models. Don't these look a lot like word alignment charts? Topic for another time. Paper Here: https://arxiv.org/pdf/1409.0473.pdf
Visualizing High Dimensional Data... How do you view data that is in more than 3 dimensions? Projections! Paper Here: https://arxiv.org/pdf/1409.0473.pdf
Visualizing High Dimensional Data... (Dimensionality Reduction). Principal Component Analysis: a mathematical projection onto the principal components; common for visualizing embedding spaces (deterministic). Great PCA reading here: https://www.cs.cmu.edu/~elaw/papers/pca.pdf. t-Distributed Stochastic Neighbor Embedding (t-SNE): a stochastic projection, trained using gradient descent on a KL divergence; common for visualizing training data (non-deterministic). Paper Here: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
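A minimal sketch of the PCA half, assuming NumPy: center the data, take the top principal directions from the SVD, and project onto them. The toy 5-D "embedding" data is fabricated for illustration; note the result is deterministic, which is the contrast with t-SNE drawn above.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Deterministic PCA via SVD: center, then project onto the
    top principal components (rows of Vt)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy 'embedding space': 100 points in 5-D that really vary along 2 directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
low_dim = pca_project(X, n_components=2)  # shape (100, 2), ready to scatter-plot
```

Running PCA twice on the same data gives the identical projection; a t-SNE run with a different random seed generally will not, which is why PCA suits reproducible embedding-space plots.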
Visualizing High Dimensional Data... (Dimensionality Reduction). Let's look at some MNIST examples: http://colah.github.io/posts/2014-10-visualizing-mnist/ Paper Here: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
Visualizing High Dimensional Data... (Dimensionality Reduction). Let's look at some of our own examples: https://colab.research.google.com/drive/1bjjxecml544xp3hcZwFPNxAhFlF6VMTy#scrollTo=ULqt5rdaQPoi Paper Here: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
What do visualizations get us now? Right now, nothing. NNs are still black boxes, and there is no way to really use these visualizations to help improve our models. They are currently just cool images to look at... maybe later it will be better.
Conclusion. Any Questions?