CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

Similar documents
CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

1 Training/Validation/Testing

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

FastText. Jon Koss, Abhishek Jindal

Lecture 20: Neural Networks for NLP. Zubin Pahuja

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017

Machine Learning 13. week

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

A Neuro Probabilistic Language Model Bengio et. al. 2003

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

INTRODUCTION TO DEEP LEARNING

ECG782: Multidimensional Digital Signal Processing

Artificial Neural Networks (Feedforward Nets)

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Neural Networks and Deep Learning

Deep Learning. Volker Tresp Summer 2014

Data Mining. Neural Networks

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization

Model learning for robot control: a survey

CSC 578 Neural Networks and Deep Learning

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 6242 / CX October 9, Dimension Reduction. Guest Lecturer: Jaegul Choo

Lecture #11: The Perceptron

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Neural Network Neurons

CS 224N: Assignment #1

CS 224N: Assignment #1

Dimension reduction : PCA and Clustering

CPSC 340: Machine Learning and Data Mining

Machine Learning in Biology

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

The Boundary Graph Supervised Learning Algorithm for Regression and Classification

Facial Expression Classification with Random Filters Feature Extraction

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

Neural Networks for unsupervised learning From Principal Components Analysis to Autoencoders to semantic hashing

Lecture 19: Generative Adversarial Networks

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

CS 224d: Assignment #1

COMPUTATIONAL INTELLIGENCE

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Word2vec and beyond. presented by Eleni Triantafillou. March 1, 2016

Artificial Neural Networks. Introduction to Computational Neuroscience Ardi Tampuu

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

Variational Autoencoders. Sargur N. Srihari

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE GAN.

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Table of Contents. What Really is a Hidden Unit? Visualizing Feed-Forward NNs. Visualizing Convolutional NNs. Visualizing Recurrent NNs

Week 3: Perceptron and Multi-layer Perceptron

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

node2vec: Scalable Feature Learning for Networks

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Machine Learning Techniques

A Systematic Overview of Data Mining Algorithms

CS6220: DATA MINING TECHNIQUES

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CPSC Coding Project (due December 17)

Contents. Preface to the Second Edition

Multi-layer Perceptron Forward Pass Backpropagation. Lecture 11: Aykut Erdem November 2016 Hacettepe University

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

PTE : Predictive Text Embedding through Large-scale Heterogeneous Text Networks

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Neural Nets & Deep Learning

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

CS229 Final Project: Predicting Expected Response Times

scikit-learn (Machine Learning in Python)

CSC321: Neural Networks. Lecture 13: Learning without a teacher: Autoencoders and Principal Components Analysis. Geoffrey Hinton

Data mining with Support Vector Machine

Supervised classification of law area in the legal domain

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Louis Fourrier Fabien Gaie Thomas Rolf

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Machine Learning Classifiers and Boosting

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Courtesy of Prof. Shixia University

Homework. Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression Pod-cast lecture on-line. Next lectures:

Dimension Reduction CS534

Object Detection Lecture Introduction to deep learning (CNN) Idar Dyrdal

Perceptron: This is convolution!

Deep Learning with Tensorflow AlexNet

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Transcription:

CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2018

Last Time: Multi-Dimensional Scaling Multi-dimensional scaling (MDS): Non-parametric visualization: directly optimize the z_i locations. Traditional MDS methods lead to a crowding effect.

Sammon's Map vs. ISOMAP vs. t-SNE http://lvdmaaten.github.io/publications/papers/jmlr_2008.pdf

t-distributed Stochastic Neighbour Embedding One key idea in t-SNE: focus on the distances to neighbours (allow large variance in other distances).

t-distributed Stochastic Neighbour Embedding t-SNE is a special case of MDS (with specific d_1, d_2, and d_3 choices): d_1: for each x_i, compute the probability that each x_j is a neighbour. The computation is similar to k-means++, but puts most of the weight on close points (Gaussian). Doesn't require an explicit graph. d_2: for each z_i, compute the probability that each z_j is a neighbour. Similar to the above, but uses a Student's t distribution (which grows really slowly with distance). This avoids crowding, because there is a huge range that large distances can fill. d_3: compare the x_i and z_i using an entropy-like measure: how much randomness is in the probabilities of x_i if you know the z_i (and vice versa)? Interactive demo: https://distill.pub/2016/misread-tsne
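To make d_1, d_2, and d_3 concrete, here is a standard write-up of the (symmetrized) t-SNE objective; the notation is my reconstruction and may differ in details from the equations on the slide:

\[
p_{j\mid i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \quad (d_1:\ \text{Gaussian neighbour probabilities}),
\]
\[
q_{ij} = \frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|z_k - z_l\|^2)^{-1}} \quad (d_2:\ \text{heavy-tailed Student's } t),
\]
\[
f(Z) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad \left(d_3:\ \text{KL/entropy-like comparison, with } p_{ij} = \tfrac{p_{j\mid i} + p_{i\mid j}}{2n}\right).
\]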

t-SNE on Wikipedia Articles http://jasneetsabharwal.com/assets/files/wiki_tsne_report.pdf

t-SNE on Product Features http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/

t-SNE on Leukemia Heterogeneity http://www.ncbi.nlm.nih.gov/pmc/articles/pmc4076922/

(pause)

Latent-Factor Representation of Words For natural language, we often represent words by an index. E.g., cat is word 124056 among a bag of words. But this may be inefficient: should cat and kitten share parameters in some way? We want a latent-factor representation of individual words: closeness in latent space should indicate similarity, and distances could represent meaning. A recent alternative to PCA/NMF is word2vec (a toy illustration of the contrast follows below).
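As a toy illustration of why the index (one-hot) representation is inefficient, here is a minimal Python sketch; the dense vectors below are made up for illustration, not learned:

import numpy as np

vocab = {"cat": 0, "kitten": 1, "car": 2}

# Index / one-hot representation: every pair of distinct words is equally (dis)similar.
one_hot = np.eye(len(vocab))
print(one_hot[vocab["cat"]] @ one_hot[vocab["kitten"]])  # 0.0, exactly the same as cat vs. car

# Hypothetical latent-factor representation: similar words get nearby vectors.
z = {"cat": np.array([0.9, 0.1]),
     "kitten": np.array([0.85, 0.2]),
     "car": np.array([-0.7, 0.6])}
cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine(z["cat"], z["kitten"]))  # close to 1: similar meaning
print(cosine(z["cat"], z["car"]))     # much smaller: different meaning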

Using Context Consider these phrases: the cat purred, the kitten purred, black cat ran, black kitten ran. Words that occur in the same context likely have similar meanings. Word2vec uses this insight to design an MDS distance function.

Word2Vec Two common word2vec approaches: 1. Try to predict a word from its surrounding words (continuous bag of words, CBOW). 2. Try to predict the surrounding words from a word (skip-gram). Train latent factors to solve one of these supervised learning tasks (see the sketch below). https://arxiv.org/pdf/1301.3781.pdf
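A minimal sketch of training both variants, assuming the gensim library (4.x parameter names) and a toy corpus; the corpus and hyperparameters are placeholders rather than anything from the paper:

from gensim.models import Word2Vec

corpus = [["the", "cat", "purred"], ["the", "kitten", "purred"],
          ["black", "cat", "ran"], ["black", "kitten", "ran"]]

# sg=0: continuous bag of words (predict a word from its context).
cbow = Word2Vec(sentences=corpus, vector_size=10, window=2, min_count=1, sg=0, epochs=200)
# sg=1: skip-gram (predict the context from a word), with negative sampling.
skipgram = Word2Vec(sentences=corpus, vector_size=10, window=2, min_count=1, sg=1, negative=5, epochs=200)

# On a real corpus, similar words end up with similar vectors (results on this toy corpus are noisy).
print(skipgram.wv.similarity("cat", "kitten"))
print(skipgram.wv.most_similar("cat"))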

Word2Vec In both cases, each word i is represented by a vector z_i. In continuous bag of words (CBOW), we optimize a likelihood of each word given its context, applying gradient descent to its logarithm. This encourages z_i^T z_j to be big for words in the same context (making z_i close to z_j), and encourages z_i^T z_j to be small for words not appearing in the same context (making z_i and z_j far apart). For CBOW, the denominator sums over all words; for skip-gram, it sums over all possible surrounding words. A common trick to speed things up is to sample the terms in the denominator ("negative sampling").
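The likelihood appeared as an equation image on the slide; a standard softmax form consistent with the description above (my reconstruction, with z_i as the vector for word i) is:

\[
p(x_j \mid x_i) = \frac{\exp(z_i^T z_j)}{\sum_{c=1}^{|\text{vocab}|} \exp(z_i^T z_c)},
\qquad
f(Z) = -\sum_{(i,j)\ \text{co-occurring}} \log p(x_j \mid x_i).
\]

Negative sampling replaces the full sum in the denominator with a small random sample of words.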

Word2Vec Example MDS visualization of a set of related words: Distances between vectors might represent semantics. http://sebastianruder.com/secret-word2vec

Word2Vec Subtracting word vectors to find related vectors (see the sketch below). Pre-trained word vectors for 157 languages are available online. https://arxiv.org/pdf/1301.3781.pdf
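A sketch of the vector-arithmetic idea using pretrained vectors; it assumes gensim's downloader and its "glove-wiki-gigaword-50" model are available (any set of pretrained KeyedVectors would work the same way):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # downloads the pretrained vectors on first use

# Adding and subtracting word vectors to find related words:
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))    # typically "queen" near the top
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))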

End of Part 4: Key Concepts We discussed linear latent-factor models: Represent X as a linear combination of latent factors w_c. Latent features z_i give a lower-dimensional version of each x_i. When k=1, this finds the direction that minimizes the squared orthogonal distance. Applications: outlier detection, dimensionality reduction, data compression, features for linear models, visualization, factor discovery, filling in missing entries.

End of Part 4: Key Concepts We discussed linear latent-factor models: Principal component analysis (PCA): Often uses orthogonal factors and fits them sequentially (via SVD). Non-negative matrix factorization: Uses non-negative factors giving sparsity. Can be minimized with projected gradient. Many variations are possible: Different regularizers (sparse coding) or loss functions (robust/binary PCA). Missing values (recommender systems) or change of basis (kernel PCA).

End of Part 4: Key Concepts We discussed multi-dimensional scaling (MDS): A non-parametric method for high-dimensional data visualization. Tries to match distances/similarities in the high- and low-dimensional spaces. Gradient descent on the scatterplot points. The main challenge in MDS methods is the crowding effect: methods focus on large distances and lose local structure. Common solutions: Sammon mapping: use a weighted cost function. ISOMAP: approximate geodesic distance using shortest paths in a graph. t-SNE: give up on large distances and focus on neighbour distances. Word2vec is a recent MDS method giving better word features.

Supervised Learning Roadmap Part 1: Direct Supervised Learning. We learned parameters w based on the original features x_i and target y_i. Part 3: Change of Basis. We learned parameters v based on a change of basis z_i and target y_i. Part 4: Latent-Factor Models. We learned parameters W for the basis z_i based only on the features x_i. You can then learn v based on the change of basis z_i and target y_i. Part 5: Neural Networks. Jointly learn W and v based on x_i and y_i. Learn a basis z_i that is good for supervised learning.

A Graphical Summary of CPSC 340 Parts 1-5

Notation for Neural Networks

Linear-Linear Model Obvious choice: a linear latent-factor model with linear regression. We want to train W and v jointly, so we could minimize a squared error in both sets of parameters. But this is just a linear model (written out below).
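The objective and the resulting linear model were shown as equations on the slide; a standard way to write them, following the course's usual notation (my reconstruction), is:

\[
f(W, v) = \frac{1}{2} \sum_{i=1}^{n} \left( v^T W x_i - y_i \right)^2,
\qquad
\hat{y}_i = v^T W x_i = (W^T v)^T x_i = w^T x_i \ \text{with}\ w = W^T v,
\]

so the composition of two linear maps is still just linear regression with parameters w.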

Introducing Non-Linearity To increase flexibility, something needs to be non-linear. Typical choice: transform z_i by a non-linear function h. Here the function h transforms k inputs to k outputs. Common choice for h: apply the sigmoid function element-wise, which takes each z_ic in (-∞, ∞) and maps it to (0, 1). This is called a multi-layer perceptron or a neural network.
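Written out (the sigmoid itself is standard; the prediction form matches the description above):

\[
h(\alpha) = \frac{1}{1 + \exp(-\alpha)},
\qquad
\hat{y}_i = v^T h(W x_i),
\]

where h is applied element-wise to the k values in W x_i.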

Why Sigmoid? Consider setting h to define binary features z_i by thresholding: each h(z_ic) can be viewed as a binary feature; you either have this part or you don't have it. We can make 2^k objects out of all the possible part combinations. But this is hard to optimize (non-differentiable/discontinuous). The sigmoid is a smooth approximation to these binary features.
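A minimal numerical sketch (not from the lecture) contrasting the hard binary features with their sigmoid relaxation; all weights are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3                    # examples, input features, hidden units
X = rng.standard_normal((n, d))
W = rng.standard_normal((k, d))      # latent-factor / hidden-layer weights
v = rng.standard_normal(k)           # output weights

Z = X @ W.T                          # row i holds z_i = W x_i

hard = (Z > 0).astype(float)         # binary "has part c or not" (non-differentiable)
soft = 1.0 / (1.0 + np.exp(-Z))      # sigmoid: smooth approximation in (0, 1)

print(hard @ v)                      # predictions from the discontinuous features
print(soft @ v)                      # predictions we can differentiate and train by gradient descent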

Supervised Learning Roadmap

Why Neural Network? Cartoon of a typical neuron: the neuron has many dendrites, which take an input signal, and a single axon, which sends an output signal. With the right input to the dendrites, an action potential travels along the axon (like a binary signal). https://en.wikipedia.org/wiki/neuron https://en.wikipedia.org/wiki/action_potential

Why Neural Network? (neuron figures from https://en.wikipedia.org/wiki/neuron)

Artificial Neural Nets vs. Real Neural Nets Artificial neural network: x_i is a measurement of the world. z_i is an internal representation of the world. y_i is the output of a neuron for classification/regression. Real neural networks are more complicated: Timing of action potentials seems to be important ("rate coding": the frequency of action potentials simulates a continuous output). Neural networks don't reflect the sparsity of action potentials. How much computation is done inside a neuron? The brain is highly organized (e.g., substructures and cortical columns). Connection structure changes. There are different types of neurotransmitters.

Deep Learning

Hierarchies of Parts Motivation for Deep Learning Each neuron might recognize a part of a digit. Deeper neurons might recognize combinations of parts. Represent complex objects as hierarchical combinations of re-usable parts (a simple "grammar"). Watch the full video here: https://www.youtube.com/watch?v=aircaruvnkk

Summary Word2vec: a latent-factor (continuous) representation of words, based on predicting a word from its context. Neural networks learn features z_i for supervised learning. The sigmoid function avoids degeneracy by introducing non-linearity. Biological motivation for (deep) neural networks. Deep learning considers neural networks with many hidden layers. Next time: training deep networks.

Does t-SNE always outperform PCA? Consider 3D data living on a 2D hyper-plane: PCA can perfectly capture the low-dimensional structure. t-SNE can capture the local structure, but can twist the plane (it doesn't try to get long distances correct). A small comparison sketch follows below.
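A minimal sketch of this comparison, assuming scikit-learn is available; the planar data is synthetic:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
coords = rng.standard_normal((200, 2))       # true 2D coordinates
X = coords @ rng.standard_normal((2, 3))     # embed the plane in 3D

X_pca = PCA(n_components=2).fit_transform(X)                   # recovers the plane (up to rotation/scaling)
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)  # preserves neighbours, but may bend/twist the plane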

Multiple Word Prototypes What about homonyms and polysemy? The word vectors would need to account for all meanings. More recent approaches: Try to cluster the different contexts where words appear. Use different vectors for different contexts.

Multiple Word Prototypes http://www.socher.org/index.php/main/improvingwordrepresentationsviaglobalcontextandmultiplewordprototypes

Why z_i = Wx_i? In PCA we had that the optimal Z = XW^T(WW^T)^{-1}. If W has normalized and orthogonal rows, then Z = XW^T (since WW^T = I). So z_i = Wx_i in this normalized+orthogonal case. Why would we use z_i = Wx_i in neural networks, where we didn't enforce normalization or orthogonality? Well, the value W^T(WW^T)^{-1} is just some matrix; you can think of neural networks as directly learning this matrix.
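A quick numerical check of this claim (a sketch, with a randomly generated W whose rows are made orthonormal via QR):

import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 5, 10
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k with orthonormal columns
W = Q.T                                           # k x d with orthonormal rows, so W W^T = I
X = rng.standard_normal((n, d))

Z_general = X @ W.T @ np.linalg.inv(W @ W.T)      # optimal Z = X W^T (W W^T)^{-1} for fixed W
Z_simple = X @ W.T                                # z_i = W x_i
print(np.allclose(Z_general, Z_simple))           # True in the normalized+orthogonal case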

Cool Picture Motivation for Deep Learning Faces might be composed of different "parts": http://www.datarobot.com/blog/a-primer-on-deep-learning/

Cool Picture Motivation for Deep Learning First layer of z_i trained on 10 by 10 image patches. Attempt to visualize the second layer: corners, angles, surface boundaries? These models require many tricks to work; we'll discuss these next time. http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf

Cool Picture Motivation for Deep Learning First layer of z_i trained on 10 by 10 image patches. Visualization of the second and third layers trained on specific objects: http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf
