Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Neural Network and Deep Learning

Early history of deep learning Deep learning dates back to the 1940s: it was known as cybernetics in the 1940s-60s, as connectionism in the 1980s-90s, and under its current name starting in 2006. Deep learning has also gone by the name Artificial Neural Networks (ANNs), a family of machine learning methods originally aimed at understanding brain function (although it has little to do with actual biological function; Hinton and Shallice, 1991). The first wave consisted of simple linear models relating input to output (McCulloch and Pitts, 1943), motivated from a neuroscientific perspective.

Early history: continued Deep learning gradually deviated from neuroscience due to the lack of knowledge about brain function, but retained the common belief that much of the mammalian brain might solve different tasks using a single algorithm (True or False?), including language processing, vision, motion planning and speech recognition. The second wave of deep learning (neural networks) started in the 1980s and lasted until the mid-1990s, driven by the emerging movement called connectionism (parallel distributed processing). However, kernel machines (SVMs) and graphical models became dominant afterwards.

Recent development in deep learning The explosive revival of deep learning started with Hinton et al. (2006), when it began outperforming other machine learning methods. Why deep learning has become more exciting than ever: huge amounts of training data, especially data with repetitive structure (images, speech), have lessened concerns about statistical generalizability, the point most criticized by the statistical learning community; more powerful computers and better software infrastructure enable computation with highly complex models containing parallelizable components.

Deep learning applications Deep learning has been successful in recent applications: In the ImageNet Large Scale Visual Recognition Challenge of 2012, it improved the top-5 error rate from 26.1% to 15.3%, and further down to 3.6% in 2015. It also had a dramatic impact on speech recognition, resulting in a sudden drop in error rates, with some cut in half. It showed superhuman performance in traffic sign classification and image segmentation. It has been combined with reinforcement learning to reach human-level performance (e.g., DeepMind). Deep learning has also been used successfully in other applications, e.g., predicting how molecules interact, searching for subatomic particles, and automatically parsing microscope images to construct a 3-D brain map.

Neural Networks They go by other names: deep feedforward networks, feedforward neural networks, or multilayer perceptrons (MLPs). They are feedforward because information flows through the functions from input $X$ to output $Y$ and there is no feedback. A typical network is $X \to f^{(1)}(X) \to f^{(2)}(f^{(1)}(X)) \to Y$. $f^{(1)}$: first layer; $f^{(2)}$: second layer; the overall length of the chain is the depth of the model; all the layers between $X$ and $Y$ are hidden layers. The architecture of an ANN is closely related to a directed acyclic graph or a structural equation model.
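As a toy illustration of this chain structure, the sketch below composes two layer functions in Python; the weight shapes and the tanh/linear layer choices are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# A toy depth-2 feedforward chain f(X) = f2(f1(X)); sizes are arbitrary.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # first (hidden) layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # second (output) layer: 4 units -> 1 output

def f1(x):
    # first layer: affine map followed by a squashing activation
    return np.tanh(W1 @ x + b1)

def f2(z):
    # second layer: linear output layer
    return W2 @ z + b2

x = np.array([0.5, -1.0, 2.0])
y_hat = f2(f1(x))   # information flows forward only: X -> f1 -> f2 -> Y
print(y_hat)
```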

ANN example

Single-layer neural network Let $Z_1, \ldots, Z_m$ be the variables in the hidden layer (hidden units), with $Z_k = h_k(X)$. Furthermore, we use $g(Z_1, \ldots, Z_m)$ to predict $Y$. Question: how do we specify the link functions (also called activation functions) $h_1, \ldots, h_m$ and $g$? The key motivation is that we want these functions to be simple and conveniently computed, since by increasing the number of $Z$'s, simple functions are sufficient to approximate any possibly nonlinear relationship between $X$ and $Y$. Universal approximation theorem: A feedforward network with a linear output layer and at least one hidden layer with any squashing activation function can approximate any Borel measurable function, provided that the network is given enough hidden units.

Activation functions Commonly used functions for the $h$'s:
linear function: $h(x) = \beta_0 + \beta^T x$
sigmoid function: $h(x) = \max\{0, \min\{1, \beta_0 + \beta^T x\}\}$
logit sigmoid function: $h(x) = \exp\{\beta_0 + \beta^T x\}/[1 + \exp\{\beta_0 + \beta^T x\}]$
rectified linear function: $h(x) = \max(0, \beta_0 + \beta^T x)$
hyperbolic tangent function: $h(x) = \tanh(\beta_0 + \beta^T x)$
The choice of $g$ can also be one of these link functions, depending on the type of output $Y$.
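A small sketch of these activation functions applied to the linear predictor $u = \beta_0 + \beta^T x$; the coefficient values and the input are arbitrary illustrative choices.

```python
import numpy as np

# The activation functions listed above, applied to u = beta0 + beta^T x.
def linear(u):          return u
def clipped_sigmoid(u): return np.clip(u, 0.0, 1.0)          # max{0, min{1, u}}
def logit_sigmoid(u):   return np.exp(u) / (1.0 + np.exp(u)) # logistic function
def relu(u):            return np.maximum(0.0, u)            # rectified linear
def tanh_act(u):        return np.tanh(u)                    # hyperbolic tangent

beta0, beta = 0.1, np.array([1.0, -2.0])
x = np.array([0.3, 0.4])
u = beta0 + beta @ x
for h in (linear, clipped_sigmoid, logit_sigmoid, relu, tanh_act):
    print(h.__name__, h(u))
```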

Network architecture One main feature of the architecture is characterized by depth (the number of hidden layers) and width (the number of hidden units in each layer). Empirically, greater depth results in better performance than a shallow network with large width. Another feature of the architecture is how to connect two neighboring layers: the input layer can be connected to only a subset of the units in the output layer; connection coefficients can be shared across some units in the input layer (convolutional networks); the advantage over a fully connected network is the reduced number of parameters and less computation. Remark: the architecture of the network is task-specific!

Computation algorithm for ANN Despite the complexity of the network architecture, estimation of the parameters can be carried out effectively, thanks to the simplicity of the activation functions and the so-called forward- and backward-propagation algorithm. Suppose $Z_k = \sigma_k(X^T \alpha_k)$, $k = 1, \ldots, m$, and $E[Y|X] = g(\beta_1 Z_1 + \cdots + \beta_m Z_m + \beta_0) \equiv f(X)$. Based on $n$ observations, we wish to minimize $\sum_{i=1}^n \{Y_i - g(\beta_1 \sigma_1(X_i^T \alpha_1) + \cdots + \beta_m \sigma_m(X_i^T \alpha_m))\}^2$ if $Y$ is continuous, or $-\sum_{i=1}^n Y_i \log g(\beta_1 \sigma_1(X_i^T \alpha_1) + \cdots + \beta_m \sigma_m(X_i^T \alpha_m))$ if $Y$ is binary.
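A minimal sketch of the least-squares objective above for a single-hidden-layer network, assuming a logistic $\sigma$ and an identity $g$ (continuous $Y$); the data and parameter values are made up for illustration.

```python
import numpy as np

# Least-squares objective for a single-hidden-layer network (sketch):
#   Z_k = sigma(X^T alpha_k),  f(X) = g(beta0 + sum_k beta_k Z_k),
# with sigma logistic and g the identity.
def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def f(X, alpha, beta, beta0):
    Z = sigma(X @ alpha)        # n x m matrix of hidden units Z_ik
    return beta0 + Z @ beta     # g = identity for a continuous outcome

def squared_error(Y, X, alpha, beta, beta0):
    return np.sum((Y - f(X, alpha, beta, beta0)) ** 2)

rng = np.random.default_rng(1)
n, p, m = 50, 3, 5              # n observations, p inputs, m hidden units
X = rng.normal(size=(n, p))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
alpha = rng.normal(size=(p, m))
beta, beta0 = rng.normal(size=m), 0.0
print(squared_error(Y, X, alpha, beta, beta0))
```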

Gradient-descent algorithm The key is to compute the gradient with respect to all parameters. At the $(r+1)$st iteration, from the chain rule,
$\beta_k^{(r+1)} = \beta_k^{(r)} - \gamma_r \sum_{i=1}^n \delta_i Z_{ik}$,
$\alpha_{kl}^{(r+1)} = \alpha_{kl}^{(r)} - \gamma_r \sum_{i=1}^n s_{ik} X_{il}$,
where $\gamma_r$ is the step size in the descent algorithm (called the learning rate), $s_{ik} = \sigma_k'(X_i^T \alpha_k)\,\beta_k\,\delta_i$, and
$\delta_i = -2(Y_i - f(X_i))\, g'(\beta_1 Z_{i1} + \cdots + \beta_m Z_{im} + \beta_0)$ for continuous $Y$, and
$\delta_i = -Y_i / f(X_i)\; g'(\beta_1 Z_{i1} + \cdots + \beta_m Z_{im} + \beta_0)$ for binary $Y$.
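A sketch of these updates for continuous $Y$, taking each $\sigma_k$ to be the logistic function and $g$ the identity (so $g' = 1$); the data, learning rate, iteration count, and fixing $\beta_0 = 0$ are illustrative assumptions for brevity.

```python
import numpy as np

def sigma(u):  return 1.0 / (1.0 + np.exp(-u))
def dsigma(u): return sigma(u) * (1.0 - sigma(u))

rng = np.random.default_rng(2)
n, p, m = 100, 3, 5
X = rng.normal(size=(n, p))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
alpha = 0.1 * rng.normal(size=(p, m))    # alpha[l, k] corresponds to alpha_kl
beta = 0.1 * rng.normal(size=m)
gamma = 0.001                            # learning rate gamma_r (kept constant here)

for r in range(2000):
    U = X @ alpha                        # U[i, k] = X_i^T alpha_k
    Z = sigma(U)                         # hidden units Z_ik
    f = Z @ beta                         # f(X_i), with g = identity and beta0 = 0
    delta = -2.0 * (Y - f)               # delta_i = -2 (Y_i - f(X_i)) g'(.)
    s = dsigma(U) * (beta * delta[:, None])   # s_ik = sigma_k'(X_i^T alpha_k) beta_k delta_i
    beta = beta - gamma * (Z.T @ delta)  # beta_k   <- beta_k   - gamma sum_i delta_i Z_ik
    alpha = alpha - gamma * (X.T @ s)    # alpha_kl <- alpha_kl - gamma sum_i s_ik X_il

print("training SSE:", np.sum((Y - sigma(X @ alpha) @ beta) ** 2))
```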

Remarks on computation The updates for the parameters can be carried out in a two-pass algorithm. In the forward pass, we use the current parameters to compute $f(\cdot)$; in the backward pass, we compute $\delta_i$ and then $s_{ik}$. Because each hidden unit passes and receives information only to and from units that share a connection, this algorithm can be implemented efficiently on a parallel-architecture computer. Using the chain rule, we can also develop a recursive computation for the Hessian matrix.

Improve estimation in ANN Parameter regularization: $L_2$ or $L_1$ penalization. Data augmentation: create fake data and add it to the training sample (bootstrap samples, added noise?). One particularly effective technique for structured data is to introduce perturbations in the data augmentation: for example, for object recognition, we translate images by a few pixels in each direction to improve generalization; rotation or scaling are also effective. Multitask learning: it assumes that some factors are shared across two or more tasks. Early stopping, bagging, or other ensemble methods.
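Two of these techniques sketched with made-up values: an $L_2$ (weight-decay) penalty modifying a gradient step, and a one-pixel image shift for data augmentation; the penalty weight, learning rate, and toy image are illustrative assumptions.

```python
import numpy as np

# (1) L2 penalization: adding lam * ||theta||^2 to the loss contributes
#     2 * lam * theta to the gradient, shrinking weights toward zero.
# (2) Image translation: np.roll gives a cyclic one-pixel shift, a crude
#     stand-in for true translation of an image.
lam, gamma = 0.01, 0.001

def l2_update(theta, grad):
    # one gradient step on (loss + lam * ||theta||^2)
    return theta - gamma * (grad + 2.0 * lam * theta)

beta = np.array([0.5, -1.2, 0.3])
grad = np.array([0.1, 0.0, -0.2])
print(l2_update(beta, grad))

image = np.arange(16).reshape(4, 4)
shifted = np.roll(image, shift=1, axis=1)   # a new "fake" training image, shifted right
print(shifted)
```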

Alternative optimization algorithms
stochastic gradient descent
adaptive learning rates
conjugate gradient method
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
coordinate descent
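A minimal sketch of minibatch stochastic gradient descent, assuming a user-supplied grad_fn that returns the backpropagation gradients for a minibatch; grad_fn, the batch size, and the constant learning rate are all illustrative assumptions.

```python
import numpy as np

# Minibatch stochastic gradient descent (sketch): each step uses the gradient
# computed on a random subset of observations rather than the full sum.
def sgd(params, grad_fn, X, Y, gamma=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle observations each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_fn(params, X[idx], Y[idx])
            params = [p - gamma * g for p, g in zip(params, grads)]
    return params
```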

Convolutional Networks They are also known as convolutional neural networks (CNNs). They involve convolution and pooling operations in a neural network. They have been quite successful for processing data with a grid-like topology, such as time-series data and image data.

Convolution Recall the convolution operation $s(t) = \int x(a)\, w(t-a)\, da$, where $w$ is called the convolution kernel. In a discrete version, it is $s(t) = \sum_a x(a)\, w(t-a)$; for a 2-D image, $s(i,j) = \sum_{m,n} x(m,n)\, w(i-m, j-n)$. Discrete convolution can be viewed as multiplication by a kernel matrix with constrained entries (a Toeplitz matrix, or a doubly block circulant matrix).
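A direct implementation of the 2-D discrete convolution formula above for a small toy image and kernel (a teaching sketch, not an efficient implementation).

```python
import numpy as np

# s(i, j) = sum_{m, n} x(m, n) w(i - m, j - n), computed for the "full" output.
def conv2d_full(x, w):
    H, W = x.shape
    kH, kW = w.shape
    s = np.zeros((H + kH - 1, W + kW - 1))
    for i in range(s.shape[0]):
        for j in range(s.shape[1]):
            for m in range(H):
                for n in range(W):
                    if 0 <= i - m < kH and 0 <= j - n < kW:
                        s[i, j] += x[m, n] * w[i - m, j - n]
    return s

x = np.arange(9.0).reshape(3, 3)          # toy "image"
w = np.array([[1.0, 0.0], [0.0, -1.0]])   # toy kernel
print(conv2d_full(x, w))
```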

Features in CNN Sparse connectivity: convolution can be treated as another layer of the network, from the input data $x$ to the transformed data $s$. Since the kernel matrix is much smaller than the input matrix, the connections are sparse. Parameter sharing: the same kernel weights are reused for every unit in the layer of $s$, rather than a separate weight for each input-output pair. Equivariance: if the input changes, the output changes in the same way. For example, if every image pixel is shifted by one unit to the right, the output of the ANN shifts by one pixel too (one can use this invariance property to improve prediction).
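A back-of-envelope comparison of parameter counts illustrating sparse connectivity and parameter sharing; the 32 x 32 image size and 3 x 3 kernel are arbitrary illustrative choices.

```python
# One weight per input-output pair versus one small shared kernel.
n_in = n_out = 32 * 32
fully_connected = n_in * n_out   # dense layer mapping the image to an equal-size output
shared_kernel = 3 * 3            # one kernel reused at every output location
print(fully_connected, shared_kernel)   # 1048576 vs 9
```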

Pooling operation It adds a pooling layer, which replaces the data at a location with a summary statistic of the nearby outputs (for example, the maximum value in the neighborhood, or some weighted average). Pooling increases robustness to small translations of the input. Pooling also reduces the size of the data and so improves computational efficiency.
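A sketch of max pooling over non-overlapping 2 x 2 neighborhoods; the input size is assumed divisible by the pool size, and the toy array is for illustration only.

```python
import numpy as np

# Max pooling: replace each 2 x 2 block with its maximum value.
def max_pool(x, pool=2):
    H, W = x.shape
    return x.reshape(H // pool, pool, W // pool, pool).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))   # the 4 x 4 input is summarized by a 2 x 2 array of maxima
```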

Convolution diagram

Pooling diagram

Additional deep learning topics Recurrent neural networks: the input data are in a sequence $X^{(1)}, Y^{(1)}, X^{(2)}, \ldots$, and learning shares parameters across different parts of the model. Applications of recurrent neural networks include computer vision, speech recognition, natural language processing, and contextual bandits in reinforcement learning. Autoencoders: an autoencoder is a neural network that is trained to copy its input to its output ($X \to X$), so it is useful for dimension reduction or feature learning. Some variants include sparse autoencoders and denoising autoencoders. Representation learning. Graphical models.
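A minimal (untrained) autoencoder sketch showing the encode/decode structure; the dimensions and the tanh/linear layer choices are illustrative assumptions.

```python
import numpy as np

# An encoder maps X to a lower-dimensional code; a decoder maps the code back
# to a reconstruction of X. Training would make the reconstruction close to X.
rng = np.random.default_rng(3)
p, q = 10, 3                                  # input dimension, code dimension
W_enc = rng.normal(size=(q, p))
W_dec = rng.normal(size=(p, q))

def encode(x):  return np.tanh(W_enc @ x)     # dimension reduction / feature learning
def decode(z):  return W_dec @ z              # reconstruction of the input

x = rng.normal(size=p)
x_hat = decode(encode(x))
print(np.linalg.norm(x - x_hat))              # reconstruction error (large here: untrained)
```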