Neural Networks for unsupervised learning: From Principal Components Analysis to Autoencoders to semantic hashing

Similar documents
Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Machine Learning. MGS Lecture 3: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Machine Learning for Physicists Lecture 6. Summer 2017 University of Erlangen-Nuremberg Florian Marquardt

Deep Learning. Volker Tresp Summer 2014

Deep Learning with Tensorflow AlexNet

Deep Learning for Computer Vision II

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Data Compression. The Encoder and PCA

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Restricted Boltzmann Machines. Shallow vs. deep networks. Stacked RBMs. Boltzmann Machine learning: Unsupervised version

Energy Based Models, Restricted Boltzmann Machines and Deep Networks. Jesse Eickholt

ImageNet Classification with Deep Convolutional Neural Networks

CSC321: Neural Networks. Lecture 13: Learning without a teacher: Autoencoders and Principal Components Analysis. Geoffrey Hinton

Recognition: Face Recognition. Linda Shapiro EE/CSE 576

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

Dynamic Routing Between Capsules

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Fuzzy Set Theory in Computer Vision: Example 3, Part II

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

Using Machine Learning for Classification of Cancer Cells

Does the Brain do Inverse Graphics?

Does the Brain do Inverse Graphics?

Introduction to Deep Learning

Su et al. Shape Descriptors - III

CSC 411: Lecture 14: Principal Components Analysis & Autoencoders

Rotation Invariance Neural Network

Dimension Reduction CS534

Neural Network Neurons

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Handwritten Hindi Numerals Recognition System

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Backpropagation + Deep Learning

Advanced Introduction to Machine Learning, CMU-10715

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction

Artificial Neural Networks. Introduction to Computational Neuroscience Ardi Tampuu

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

CS 6501: Deep Learning for Computer Graphics. Training Neural Networks II. Connelly Barnes

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization

ROB 537: Learning-Based Control

Why equivariance is better than premature invariance

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

Perceptron: This is convolution!

Machine Learning 13. week

11. Neural Network Regularization

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

Neural Networks and Deep Learning

Grundlagen der Künstlichen Intelligenz

! References: ! Computer eyesight gets a lot more accurate, NY Times. ! Stanford CS 231n. ! Christopher Olah s blog. ! Take ECS 174!

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity

Lab 8 CSC 5930/ Computer Vision

Know your data - many types of networks

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Model Generalization and the Bias-Variance Trade-Off

INF 5860 Machine learning for image classification. Lecture 11: Visualization Anne Solberg April 4, 2018

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Understanding Faces. Detection, Recognition, and. Transformation of Faces 12/5/17

Depth Image Dimension Reduction Using Deep Belief Networks

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

Convolutional Neural Networks

Clustering algorithms and autoencoders for anomaly detection

Deep Learning Applications

Large scale object/scene recognition

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

Alternatives to Direct Supervision

Table of Contents. What Really is a Hidden Unit? Visualizing Feed-Forward NNs. Visualizing Convolutional NNs. Visualizing Recurrent NNs

COMPARATIVE DEEP LEARNING FOR CONTENT- BASED MEDICAL IMAGE RETRIEVAL

IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC.

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Unsupervised Learning

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why?

CSC 578 Neural Networks and Deep Learning

Large-scale visual recognition Efficient matching

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

Outlier detection using autoencoders

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Semantic Image Search. Alex Egg

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Return of the Devil in the Details: Delving Deep into Convolutional Nets

COMPUTATIONAL INTELLIGENCE

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Deep Learning for Computer Vision

Nonlinear projections. Motivation. High-dimensional. data are. Perceptron) ) or RBFN. Multi-Layer. Example: : MLP (Multi(

CSE 158. Web Mining and Recommender Systems. Midterm recap

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

MoonRiver: Deep Neural Network in C++

ConvolutionalNN's... ConvNet's... deep learnig

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Deep Generative Models Variational Autoencoders

Face detection and recognition. Detection Recognition Sally

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

Novel Lossy Compression Algorithms with Stacked Autoencoders

DEEP LEARNING TO DIVERSIFY BELIEF NETWORKS FOR REMOTE SENSING IMAGE CLASSIFICATION

HW Assignment 3 (Due by 9:00am on Mar 6)

FastText. Jon Koss, Abhishek Jindal

Transcription:

Neural Networks for unsupervised learning: From Principal Components Analysis to Autoencoders to semantic hashing. Beate Sick. Many slides are taken from Hinton's great lecture on NN: https://www.coursera.org/course/neuralnets

What is semantic hashing? Semantic hashing, in my understanding, is a procedure that allows us to efficiently find documents or images that are similar in their meaning.

Dimensionality Reduction
Input data may have thousands or millions of dimensions! Images have many thousands of pixels, text data has many letters or words, sensors collect thousands of signals. Are all those dimensions (letters, pixels) necessary?
Goal: Represent data with fewer dimensions!
- Work with fewer features: get rid of the p >> n situation
- Better visualization: our imagination does not go beyond 3D
- Discover the intrinsic dimensionality of the data: learn a low-dimensional representation of the information
- Get a compact hash tag for each high-dimensional data point
- Get a measure of abnormality by quantifying the reconstruction error

Pioneer work

Artificial Neural Networks: fully connected or CNN
[Figure: a fully connected network (image: http://neuralnetworksanddeeplearning.com/chap6.html) next to a CNN with layers input, conv1, pool1, conv2, pool2, layer3, output (image: Master Thesis Christopher Mitchell, http://www.cemetech.net/projects/ee/scouter/thesis.pdf)]

Dimension reduction via Autoencoder (AE). Train the AE using stochastic gradient backpropagation. Rather than picking a subset of the features, we can construct k < n new features y from the existing n features. The AE network is given the optimization task of reconstructing its input, which requires that the essential information be represented in only a few dimensions, corresponding to the number of neurons in the bottleneck. Image from: http://codingplayground.blogspot.ch/2015/10/what-are-autoencoders-and-stacked.html
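As an illustration (not from the slides), here is a minimal bottleneck autoencoder trained with gradient backpropagation, sketched in PyTorch; the layer sizes, the tanh activation, the optimizer settings, and the random placeholder data are assumptions of mine.

```python
# Minimal autoencoder sketch: compress n features into a k-neuron bottleneck.
import torch
import torch.nn as nn

n, k = 20, 3                 # n original features, k << n bottleneck neurons (assumed sizes)
X = torch.randn(500, n)      # placeholder data; replace with your own

encoder = nn.Sequential(nn.Linear(n, k), nn.Tanh())   # compress to the bottleneck
decoder = nn.Linear(k, n)                             # reconstruct the input

opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.05)
loss_fn = nn.MSELoss()       # reconstruction error

for epoch in range(200):     # (full-batch gradient descent here, for brevity)
    opt.zero_grad()
    X_hat = decoder(encoder(X))
    loss = loss_fn(X_hat, X) # forces the k bottleneck units to keep the essential information
    loss.backward()
    opt.step()

codes = encoder(X)           # the k-dimensional representation of each data point
```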

Dimension reduction by PCA
[Figure: 2D data in the original coordinates x^(1), x^(2); after centering (preparation) and the PCA rotation, the same data is expressed in the coordinates y^(1), y^(2).]
PCA is a rotation of the coordinate system -> no information is lost. The first principal component (PC 1 or Y_1) explains most of the variance, the second the second most, and so on; together, all PCs explain the whole variance. Dimension reduction is done by dropping the higher PCs. Each principal component can be represented as a linear combination of the original features, and the coefficients a_{jk} in the linear combination are called loadings:
Y_k = a_{1k} X_1 + a_{2k} X_2 + \dots

A picture of PCA, starting from a 2D -> 1D representation
[Figure: data points, the first PC, and the reconstruction error of one point.]
The red point is represented in 1D by the green point; its position orthogonal to the first PC (the low-dimensional hyperplane) cannot be reconstructed, so instead the mean (over all data) along this orthogonal direction is taken. The reconstruction of the red point has a reconstruction error equal to the squared distance between the red and green points.

Construction of PCs in PCA
The PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:
y_k = \sum_{j=1}^{p} a_{jk} x_j,   Y_{(n \times p)} = X_{(n \times p)} A_{(p \times p)},   X_{(n \times p)} = Y_{(n \times p)} A^T_{(p \times p)}
The columns a_k of A are given by the normalized eigenvectors of the covariance matrix S_x.
Reconstruct X with only k < p PCs:
\hat{X}_{(n \times p)} = [Y_{(n \times k)}, 0_{(n \times (p-k))}] A^T_{(p \times p)}
PCA minimizes the reconstruction error over all available m data points:
\sum_{i=1}^{m} ||\hat{x}^{(i)} - x^{(i)}||^2
The first k PCs define the k-dimensional hyperplane for which the sum of squared projection errors is minimal.
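A small numeric sketch of this construction (my own illustration, not part of the slides): build A from the eigenvectors of the covariance matrix, rotate, reconstruct with k < p components, and measure the reconstruction error. The data here is a random placeholder.

```python
# PCA by eigendecomposition of the covariance matrix (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # placeholder data, n=200, p=5
X = X - X.mean(axis=0)                                    # preparation: center the data

S_x = np.cov(X, rowvar=False)            # covariance matrix S_x
eigvals, A = np.linalg.eigh(S_x)         # columns of A = normalized eigenvectors
A = A[:, np.argsort(eigvals)[::-1]]      # sort PCs by explained variance

Y = X @ A                                # rotation: Y = X A
k = 2
Y_k = np.zeros_like(Y)
Y_k[:, :k] = Y[:, :k]                    # keep only the first k PCs, zero out the rest
X_hat = Y_k @ A.T                        # reconstruction: X_hat = [Y_k, 0] A^T

recon_error = np.sum((X_hat - X) ** 2)   # sum of squared projection errors
print(f"reconstruction error with k={k} PCs: {recon_error:.2f}")
```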

Dimensionality Reduction by an Autoencoder (AE)
Encoding: y_{(k)} = f(z) with z_{(k)} = b_{(k)} + W_{(k \times n)} x_{(n)}
Decoding: \hat{x}_{(n)} = b'_{(n)} + W^T_{(n \times k)} y_{(k)}
Reconstruction error: \sum_{i=1}^{m} ||\hat{x}^{(i)} - x^{(i)}||^2
(In the slide's example n = 5 and k = 3; each unit computes a weighted sum z = x_1 w_1 + x_2 w_2 + ... + b before applying f.)
The k linear units of an AE with 1 hidden layer span the same hyperplane as the first k PCs.
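To illustrate this equivalence (my own sketch, not from the slides): a purely linear autoencoder trained by gradient descent should approach the reconstruction error of keeping the first k PCs; how closely it matches depends on how long it is trained.

```python
# Compare a linear autoencoder with PCA on the same centered data (illustrative sketch).
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(1)
X_np = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))  # placeholder data
X_np = X_np - X_np.mean(axis=0)
k = 3

# PCA reconstruction error with the first k PCs
eigvals, A = np.linalg.eigh(np.cov(X_np, rowvar=False))
A_k = A[:, np.argsort(eigvals)[::-1]][:, :k]
pca_err = np.sum((X_np - X_np @ A_k @ A_k.T) ** 2)

# Linear AE (no non-linearity) with a k-dimensional bottleneck
X = torch.tensor(X_np, dtype=torch.float32)
enc, dec = nn.Linear(8, k), nn.Linear(k, 8)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    loss = ((dec(enc(X)) - X) ** 2).sum()
    loss.backward()
    opt.step()
ae_err = ((dec(enc(X)) - X) ** 2).sum().item()

print(f"PCA error (k={k}): {pca_err:.1f}   linear AE error: {ae_err:.1f}")
# With enough training the two errors become close: the k linear hidden units
# span (approximately) the same hyperplane as the first k PCs.
```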

Shallow and deep Autoencoder (AE)
AEs allow us to generalize PCA by using non-linear units or more hidden layers:
- Defining curved subspaces
- Defining non-linear, hierarchical, complex features
- Defining localized, distributed features (feature maps in convolutional architectures)

Deep AEs are often hard to train -> use greedy layer-wise training of deep AEs. A weight decay term acts as a penalty (regularization).
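A sketch of greedy layer-wise training (my own illustration; the layer sizes, sigmoid units, learning rate, and weight-decay value are assumptions): each layer is first trained as a shallow autoencoder on the codes produced by the previous, already-trained layer, and weight decay is added through the optimizer as the regularization penalty.

```python
# Greedy layer-wise pretraining of a deep autoencoder (illustrative sketch).
import torch
import torch.nn as nn

X = torch.randn(1000, 100)         # placeholder data
layer_sizes = [100, 50, 20, 5]     # assumed architecture: 100 -> 50 -> 20 -> 5

encoders = []
inputs = X
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()),
                           lr=1e-3, weight_decay=1e-4)   # weight decay = penalty / regularization
    for _ in range(500):                                  # train this single layer as a shallow AE
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(inputs)), inputs)
        loss.backward()
        opt.step()
    encoders.append(enc)
    inputs = enc(inputs).detach()   # the next layer is trained on this layer's codes

deep_encoder = nn.Sequential(*encoders)   # stack the pretrained layers; fine-tune end-to-end afterwards
```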

AEs can be used for feature construction. R package: autoencoder

Use the reconstruction error for abnormality detection: compare PCA-based and AE-based methods.
[Figure (Mayu Sakurada 2015): PCA vs. an AE with 1 non-linear hidden layer, evaluated on 25 components of a simulated Lorenz system and on 17 continuous sensor signals from a satellite.]
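A sketch of reconstruction-error-based abnormality detection (my own illustration; the number of kept components, the placeholder data, and the 99%-quantile threshold are assumptions): fit PCA on normal data and flag points whose reconstruction error exceeds the threshold. The same score can be computed with a trained autoencoder in place of the PCA projection.

```python
# Abnormality score = squared reconstruction error under a k-PC model (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # "normal" training data
mu = X_train.mean(axis=0)
Xc = X_train - mu

eigvals, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
A_k = A[:, np.argsort(eigvals)[::-1]][:, :3]      # keep the first 3 PCs (assumed)

def abnormality_score(x):
    """Squared reconstruction error of a point under the k-PC model."""
    xc = x - mu
    x_hat = A_k @ (A_k.T @ xc)
    return float(np.sum((xc - x_hat) ** 2))

threshold = np.quantile([abnormality_score(x) for x in X_train], 0.99)   # assumed cutoff
x_new = rng.normal(size=10) * 10                  # an unusually scaled new point
print("abnormal" if abnormality_score(x_new) > threshold else "normal")
```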

A comparison of methods for compressing digit images to 30 real numbers.
[Figure rows: real data, 30-D deep autoencoder reconstruction, 30-D PCA reconstruction.]

How to find documents that are similar to a query document
Convert each document into a bag of words, i.e. a word count vector that ignores word order. Ignore stop words (like "the" or "over"); the vector still has a length of >2000. We could compare the word count vector of the query document with those of millions of other documents, but this is too slow. Approach: reduce each word count vector to a much smaller vector that still contains most of the information about the content of the document.
[Example from the slide: a count vector 0 0 2 2 0 2 1 1 0 0 2 over the words fish, cheese, vector, count, school, query, reduce, bag, pulpit, iraq, word]
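A minimal bag-of-words sketch (my own illustration; the tiny stop word list, the toy documents, and the 11-word vocabulary are assumptions, standing in for the >2000-word vocabulary of the slides):

```python
# Turn documents into word count vectors, ignoring order and stop words (illustrative sketch).
from collections import Counter

STOP_WORDS = {"the", "over", "a", "of", "and", "is", "for"}   # assumed, tiny stop word list

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

vocabulary = ["fish", "cheese", "vector", "count", "school",
              "query", "reduce", "bag", "pulpit", "iraq", "word"]
docs = ["the word count bag for the query word",
        "a school of fish over the reef"]
vectors = [bag_of_words(d, vocabulary) for d in docs]
print(vectors)   # one count vector per document, same length as the vocabulary
```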

How to compress the count vector
[Architecture: 2000 word counts (input vector) -> 500 neurons -> 250 neurons -> 10 (central bottleneck) -> 250 neurons -> 500 neurons -> 2000 reconstructed counts (output vector)]
We train the neural network to reproduce its input vector as its output. This forces it to compress as much information as possible into the 10 numbers in the central bottleneck. These 10 numbers are then a good way to compare documents.

The non-linearity used for reconstructing bags of words
Divide the counts in a bag-of-words vector by N, where N is the total number of non-stop words in the document. The resulting probability vector gives the probability of getting a particular word if we pick a non-stop word at random from the document. At the output of the autoencoder, we use a softmax. We treat the word counts as probabilities, but we make the visible-to-hidden weights N times bigger than the hidden-to-visible weights, to take into account that we have N observations from the probability distribution. The probability vector defines the desired outputs of the softmax.
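A sketch of this document autoencoder (my own illustration of the described setup: the 2000-500-250-10 layer sizes follow the slides, while the sigmoid activations, the optimizer, and the placeholder counts are assumptions). The target of the softmax output is the count vector divided by N, and each document's loss is weighted by its N observations.

```python
# Document autoencoder with a softmax output over the word vocabulary (illustrative sketch).
import torch
import torch.nn as nn

V = 2000                                         # vocabulary size
counts = torch.randint(0, 5, (64, V)).float()    # placeholder batch of word count vectors
N = counts.sum(dim=1, keepdim=True).clamp(min=1)
target_probs = counts / N                        # counts divided by N -> probability vector

encoder = nn.Sequential(nn.Linear(V, 500), nn.Sigmoid(),
                        nn.Linear(500, 250), nn.Sigmoid(),
                        nn.Linear(250, 10))      # 10-number central bottleneck
decoder = nn.Sequential(nn.Linear(10, 250), nn.Sigmoid(),
                        nn.Linear(250, 500), nn.Sigmoid(),
                        nn.Linear(500, V))       # logits over the 2000 words

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    logits = decoder(encoder(target_probs))
    log_probs = torch.log_softmax(logits, dim=1)          # softmax at the output
    # Cross-entropy against the probability vector, scaled by the N observations per document:
    loss = -(N * target_probs * log_probs).sum(dim=1).mean()
    loss.backward()
    opt.step()

codes = encoder(target_probs)   # the 10 numbers used to compare documents
```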

First compress all documents to 2 numbers using PCA. Then use different colors for different categories. Train on bags of 2000 words for 400,000 training cases of business documents.

First compress all documents to 2 numbers using a deep autoencoder. Then use different colors for different document categories.

Finding binary codes for documents
Train an autoencoder using 30 logistic units for the code layer. During the fine-tuning stage, add noise to the inputs to the code units. The noise forces their activities to become bimodal in order to resist its effects. Then we simply threshold the activities of the 30 code units to get a binary code. Krizhevsky discovered later that it's easier to just use binary stochastic units in the code layer during training.
[Architecture: 2000 word counts -> 500 neurons -> 250 neurons -> 30 (code = hash) -> 250 neurons -> 500 neurons -> 2000 reconstructed counts]
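A sketch of the noisy code layer (my own illustration of the described trick; the noise scale and layer width are assumptions): Gaussian noise is added to the inputs of the 30 logistic code units during training, which pushes their activities towards 0 or 1, and at retrieval time the activities are simply thresholded to obtain the bits.

```python
# Logistic code layer with input noise, thresholded to a binary code (illustrative sketch).
import torch
import torch.nn as nn

class NoisyBinaryCode(nn.Module):
    def __init__(self, d_in, n_bits=30, noise_std=4.0):   # noise_std is an assumed value
        super().__init__()
        self.linear = nn.Linear(d_in, n_bits)
        self.noise_std = noise_std

    def forward(self, x):
        z = self.linear(x)
        if self.training:                        # add noise only while training / fine-tuning
            z = z + self.noise_std * torch.randn_like(z)
        return torch.sigmoid(z)                  # 30 logistic units; the noise makes them bimodal

    def binary_code(self, x):
        return (torch.sigmoid(self.linear(x)) > 0.5).int()   # threshold to get the hash bits

code_layer = NoisyBinaryCode(d_in=250)
h = torch.randn(4, 250)                          # placeholder activities from the 250-unit layer
print(code_layer.binary_code(h))
```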

Using a deep autoencoder as a hash function for finding approximate matches. [Figure: the "supermarket search" analogy for the hash function.]
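A sketch of how such a hash function can be used for retrieval (my own illustration, not from the slides): store each document under its binary code, and answer a query by probing every code within a small Hamming distance of the query code, just as you would check nearby shelves in a supermarket.

```python
# Semantic-hashing style lookup: probe nearby binary addresses (illustrative sketch).
from collections import defaultdict
from itertools import combinations

def neighbours(code, max_flips=2):
    """All codes within Hamming distance max_flips of the given code (including itself)."""
    yield code
    for r in range(1, max_flips + 1):
        for idx in combinations(range(len(code)), r):
            flipped = list(code)
            for i in idx:
                flipped[i] ^= 1
            yield tuple(flipped)

# Toy index: document id -> 8-bit code (in practice 30 or 256 bits from the autoencoder)
codes = {"doc_a": (0, 1, 1, 0, 1, 0, 0, 1),
         "doc_b": (0, 1, 1, 0, 1, 0, 1, 1),
         "doc_c": (1, 0, 0, 1, 0, 1, 1, 0)}
index = defaultdict(list)
for doc_id, code in codes.items():
    index[code].append(doc_id)

query = (0, 1, 1, 0, 1, 0, 0, 1)
matches = [d for c in neighbours(query) for d in index.get(c, [])]
print(matches)   # doc_a (distance 0) and doc_b (distance 1) are retrieved
```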

Binary codes for image retrieval Image retrieval is typically done by using the captions. Why not use the images too? Pixels are not like words: individual pixels do not tell us much about the content. Extracting object classes from images is hard (now it is possible!) Maybe we should extract a real-valued vector that has information about the content? Matching real-valued vectors in a big database is slow and requires a lot of storage. Short binary codes are very easy to store and match.

Krizhevsky's deep autoencoder
[Encoder architecture: 32x32 images (1024, 1024, 1024) -> 8192 -> 4096 -> 2048 -> 1024 -> 512 -> 256-bit binary code]
The encoder has about 67,000,000 parameters. It takes a few days on a GTX 285 GPU to train on two million images. There is no theory to justify this architecture. Use many neurons, since logistic units have lower capacity.

Reconstructions of 32x32 color images from 256-bit codes

[Figure: images retrieved using 256-bit codes vs. images retrieved using Euclidean distance in pixel intensity space]

[A second example: images retrieved using 256-bit codes vs. images retrieved using Euclidean distance in pixel intensity space]

How to make image retrieval more sensitive to objects and less sensitive to pixels
First train a big net to recognize lots of different types of objects in real images (use the net that won the ImageNet competition). Then use the activity vector in the last hidden layer as the representation of the image, and compare images using e.g. the Euclidean distance between these activity vectors. This should be a much better representation to match than the pixel intensities.
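A sketch of this idea with today's tooling (my own illustration; torchvision's pretrained AlexNet stands in for "the net that won the ImageNet competition", the preprocessing values are the standard ImageNet ones, and a recent torchvision with the weights API plus a weight download are assumed):

```python
# Compare images by Euclidean distance between last-hidden-layer activities (illustrative sketch).
import torch
import torch.nn as nn
from torchvision import models, transforms

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.eval()

# Everything up to (but not including) the final classification layer -> 4096-d activity vector
feature_net = nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def last_hidden_activity(pil_image):
    with torch.no_grad():
        return feature_net(preprocess(pil_image).unsqueeze(0)).squeeze(0)

# distance = torch.dist(last_hidden_activity(img_a), last_hidden_activity(img_b))  # Euclidean match
```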

Leftmost column is the search image. Other columns are the images that have the most similar feature activities in the last hidden layer.