Weiguang Guan Code & data: guanw.sharcnet.ca/ss2017-deeplearning.tar.gz

Similar documents
SGD: Stochastic Gradient Descent

Lecture 20: Neural Networks for NLP. Zubin Pahuja

CS6220: DATA MINING TECHNIQUES

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Machine Learning 13. week

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Perceptron: This is convolution!

Deep Learning and Its Applications

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Implement NN using NumPy

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

DECISION TREES & RANDOM FORESTS X CONVOLUTIONAL NEURAL NETWORKS

INTRODUCTION TO DEEP LEARNING

Neural Network Neurons

Deep Learning Applications

Evaluating Classifiers

Keras: Handwritten Digit Recognition using MNIST Dataset

Using neural nets to recognize hand-written digits. Srikumar Ramalingam School of Computing University of Utah

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Keras: Handwritten Digit Recognition using MNIST Dataset

Deep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why?

Como funciona o Deep Learning

Fuzzy Set Theory in Computer Vision: Example 3, Part II

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

All You Want To Know About CNNs. Yukun Zhu

Week 3: Perceptron and Multi-layer Perceptron

Machine Learning. MGS Lecture 3: Deep Learning

Evaluating Classifiers

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Deep Learning with Tensorflow AlexNet

Machine Learning Classifiers and Boosting

Neural Networks (Overview) Prof. Richard Zanibbi

Data Mining. Neural Networks

ECE 5470 Classification, Machine Learning, and Neural Network Review

CS 6501: Deep Learning for Computer Graphics. Training Neural Networks II. Connelly Barnes

Network Traffic Measurements and Analysis

Advanced Introduction to Machine Learning, CMU-10715

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

CIS581: Computer Vision and Computational Photography Project 4, Part B: Convolutional Neural Networks (CNNs) Due: Dec.11, 2017 at 11:59 pm

Tutorial on Machine Learning Tools

The exam is closed book, closed notes except your one-page cheat sheet.

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

CS229 Final Project: Predicting Expected Response Times

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

Deep (1) Matthieu Cord LIP6 / UPMC Paris 6

Deep Neural Networks with Flexible Activation Function

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow

Deep Neural Networks Optimization

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

ImageNet Classification with Deep Convolutional Neural Networks

A Quick Guide on Training a neural network using Keras.

M. Sc. (Artificial Intelligence and Machine Learning)

Deep Learning for Computer Vision II

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

Report: Privacy-Preserving Classification on Deep Neural Network

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Advanced Video Analysis & Imaging

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

CNN Basics. Chongruo Wu

Frameworks in Python for Numeric Computation / ML

Know your data - many types of networks

Convolutional Neural Networks

Learning. Learning agents Inductive learning. Neural Networks. Different Learning Scenarios Evaluation

CP365 Artificial Intelligence

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers

Machine Learning in Biology

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Neural Networks (pp )

Perceptrons and Backpropagation. Fabio Zachert Cognitive Modelling WiSe 2014/15

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2017

Neural Bag-of-Features Learning

CSC 578 Neural Networks and Deep Learning

Neural Networks: What can a network represent. Deep Learning, Fall 2018

Logical Rhythm - Class 3. August 27, 2018

Machine Learning in Telecommunications

Deep Learning. Architecture Design for. Sargur N. Srihari

Logistic Regression: Probabilistic Interpretation

CS145: INTRODUCTION TO DATA MINING

Back Propagation and Other Differentiation Algorithms. Sargur N. Srihari

Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class

Programming Exercise 4: Neural Networks Learning

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Artificial Neural Networks Lecture Notes Part 5. Stephen Lucci, PhD. Part 5

Supervised Learning in Neural Networks (Part 2)

A Systematic Overview of Data Mining Algorithms

Classification using Weka (Brain, Computation, and Neural Learning)

Introduction to Neural Networks

Probabilistic Siamese Network for Learning Representations. Chen Liu

Transcription:

Weiguang Guan guanw@sharcnet.ca Code & data: guanw.sharcnet.ca/ss2017-deeplearning.tar.gz

Outline Part I: Introduction Overview of machine learning and AI Introduction to neural network and deep learning (DL) Part II: Case study Recognition of handwritten digits Write our own DL code Use a DL library

Reference Deep Learning Tutorial by Yann LeCun, http://www.cs.nyu.edu/~yann/talks/lecun-ranzatoicml2013.pdf Deep Learning Tutorial by Yoshua Bengio, http://deeplearning.net/tutorial/deeplearning.pdf Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com

Part I Introduction to AI, Machine learning, and neural network

Overview What is AI? What/how can a machine learn? Machine learning methods with focus on deep learning Caveats and pitfalls of machine learning

Artificial Intelligence (AI) What is AI Def 1: Computer systems able to perform tasks that normally require human intelligence. Def 2: intelligent machines that work and react like humans Def 3 : more on the internet

Artificial Intelligence (AI) Are these in the domain of AI? Computing Database Logical operations

Artificial Intelligence (AI) Intelligent robot made by Boston Dynamics https://youtu.be/rvlhmgqgdky AI system (my definition) Is able to perform an intelligent task by learning from examples We humans don t know the explicit rules/instructions to perform the task

Artificial Intelligence (AI) Speech Recognition Natural language understanding Robotics Machine Learning Visual Perception Knowledge representation Expert Systems

Artificial Intelligence (AI) History of AI Brain vs computer Logic gates Bio-chemical operations??? Memory Information processing (Computing vs thinking) Sensing (camera and microphone vs eyes and ears) Responding

Artificial Intelligence (AI) What are minds? What is thinking? To what extent can computers have intelligence? Strong AI vs weak AI

Artificial Intelligence (AI) Computers don t and won t have Passion, feeling, consciousness Inherent common sense Motivation Computers can be trained to do particular tasks (as good as humans or even better) Thinking is computing

How does Machine Learning work What is learned by computer? A parameterized model used to perform a particular task. Task: to predict diabetes based on sugar intake x and hours of exercise y per day. x, y y<b yes x>a no no ND D Input: x, y Output: either D (Diabetes) or ND (not Diabetes) Parameters: a, b yes ND

Machine Learning How does a computer learn? Learns from (many) samples y D ND x, y b y<b yes x>a no no ND D yes ND a x

Machine Learning Learning becomes an optimization problem: Determine parameters (a, b) so that a pre-defined cost function (e.g., misclassification error) is minimized. Training or learning is usually an iterative process where parameters are gradually changed to make the cost function smaller and smaller.

Machine Learning Basic concepts (cont d) Feature space Decision boundary y D ND b a x

Two categories of learning Supervised learning Learn from annotated samples Faces Non-faces Unsupervised learning Learn from samples without annotation

Machine learning methods Deep learning Boosting Support Vector Machine (SVM) Naïve Bayes Decision tree Linear or logistic regression K-nearest neighbours (KNN)

Sample data Basic concepts Sample Sample is defined as a vector of attributes (or features), each of which can be Numerical Categorical Ordered No order Binary or Boolean Label can be Categorical (most often, binary) --- classification Numerical --- regression

Sample Example of data sample Name Occupa tion Smoking Sugar intake Hours of exercise Glucose level Peter Driver Yes 100.0 5.5 80 No Nancy Teacher No 50.0 3 120 Yes Diabetes Feature vector Annotation

Performance measures Basic concepts (cont d) Classification error Binary classification TP, TN, FP, FN ROC curve TPR = TP/(TP+FN), FPR =FP/(FP+TN)

Performance measures Basic concepts (cont d) Classification error Multiclass classification Confusion matrix

Neural network and deep learning Directed graph of neurons

Neurons Neurons Linear weighted sum Perceptron Sigmoid Rectified Linear Units (ReLu) Layers: Pooling Convolution Loss

Linear weighted sum z = j w j x j = w t x

Perceptron z = 0, if j w jx j T 1, otherwise z = 0, if wt x + b 0 1, otherwise weights: w = (w 1, w 2,, w n ) bias: b = T

Sigmoid neuron 1 σ z = 1 + e z where z = w t x + b

Rectified Linear Units σ z = max (0, z) where z = w t x + b

Convolution layer A set of learnable filters (or kernels) --- 2D array of weights Each neuron (blue) of a filter has its receptive field (red) in the input image Dot product between receptive field and a filter

Pooling layer Sub-sampling Example of max pooling

Architecture of neural network

Train a neural network Purpose: To determine the parameters of a given network so that it will produce desired outputs.

How to achieve the training goal Define a cost (objective) function to measure the performance of a NN, such as n MSE = 1 y n i g(x i ) 2 i=1 where x i and y i are the feature vector and label of the i-th sample. g(x i ) is the output of the NN. g(x i ) depends on the parameters of the NN. Gradient descent is used to find the values of these parameters that minimize the cost function.

Gradient descent (1D function) y = f x = x 2 6x + 10 Starting x = 6, we get y = 10 x = x ± x, where x = 0.01, we get y = 9.9401 for x = 5.99 and y = 10.0601 for x = 6.01 y = lim = 6.01 0.01 where y = f x + x f(x) Analytic y = 2x 6 y 0.0601 x 0 x y 10 y 6 y x x

Gradient descent (2D function) z z = f x, y = x 2 + y 2 6x + 10 Starting x = 6, y = 2, we get z = 14 Analytic: z x = 2x 6, z y = 2y Gradient z = ( z, z x y ) is the fastest ascending direction Iteratively approaching with learning rate η = 0.01 Round #1: (x, y) = (6, 2), we get z = (6, 4) Round #2: x, y = x, y η z = (5.94, 1.96), we get z = (5.88, 3.92) Round #3: y x

Train a neural network Initialize the state of a neural network (by randomizing all the parameters) Iterative process Feed forward Backpropagation (chain rule) x y z f(x,y) q g(q,z) p p x = p q q x p y = p q q y p z = p z

Chain rule of derivatives

Wait a minute Questions We want to minimize a cost function, not the output of NN We want to tweak the parameters of NN, not the input data (x, y, z ) to do the minimization Let change the roles Consider a, b,, f as variables in z = f x, y = ax 2 + by 2 + cxy + dx + ey + f The cost function C is a function of the output and ground truth so that we can compute C, C a b Apply gradient descent on a, b,, f the same way,, C f,

Training a neural network Cost function C(w, b) = 1 C n i=1 i = y i g(x i ) 2, where w, b are the weights and biases of NN. C = 1 n n i=1 m i=1 n C i, n is the number of training samples C 1 C m i, randomly divide training set into small subsets (mini-batches), each of which contains m n samples. An epoch is one complete pass going through all the mini-batches. Training with mini-batches is called stochastic gradient descent (SGD)

Caveats and pitfalls Feature selection Relevance Redundancy Sample data Mislabeling Outliers Overfitting Small training set Low-quality training data Over-training Confidence level of classification Re-tune a trained model to operate on a different position on the ROC curve

Part II Case Study: Recognition of hand-written digits

Tasks Write our own NN code Use DL libraries (Tensorflow)

Four elements in deep learning Datasets (for training, testing ) Design of a neural network Cost function that training tries to minimize Training method(solver or optimizer, such as SGD, AdaDelta, Adaptive Gradient, etc)

Dataset of handwritten digits Dataset (http://yann.lecun.com/exdb/mnist/) 60,000 training samples 10,000 testing samples Each sample is 28x28 gray scale image with a label telling what digit it is. The label (0, 1,, 9) is vectorized as a one-hot vector, e.g., [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] represents 3. checkdata.py

Dataset of handwritten digits One-hot vectorized label: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Architecture of neural network Multilayer Perceptron (MLP) or fully connected NN Input: 28x28=784 One hidden layer Output: a vector of 10 elements

Architecture of neural network Multilayer perceptron (MLP) argmax

Cost function Some notations (l) w j,k denotes the weight of the connection between the k th neuron in the (l 1) th layer to the j th neuron in the l th layer. Similarly, we define bias b j (l), weighted sum+bias zj (l), and activation a j (l) = σ layer. zj (l) at the j th neuron in the l th

Cost function Given a input (x, y) Feedforward calculation z j (l) = k (l) wj,k a j (l) = σ(zj (l) ) or a (l) = σ z (l) When l = 1, a j (1) = xj a k (l 1) + bj (l) or z (l) = w (l) a (l 1) + b (l) Cost function (quadratic function, MSE) C = 1 m x C x, where C x = a L x y 2 = a j (L) yj 2 10 j=1, L is the number of layers of NN (L = 3 in this case).

As if there is an extra node a (3) y 2 y

Backpropagation (l) (l) C depends (indirectly) on w j,k, bj j, k C l 2, 3 1: 30, [1: 784], if l = 2 1: 10, [1: 30], if l = 3 C (l) (l) w j,k + w j,k C b j (l) b j (l) where If we can calculate C C (l), (l), then we will w j,k b j know how to change each of these parameters (l) (l), bj to make C smaller. w j,k

Backpropagation Loop over m samples, then calculate the average: C (l) = 1 m w j,k C x x, (l) w j,k C b j (l) = 1 m C x x (l) b j Let s define δ j (l) C x z j (l) be the error of neuron j on layer l, or δ (l) = δ 1 (l), δ2 (l), t be the error of layer l. Since z (l+1) = w (l+1) a (l) + b l+1 = w (l+1) σ z (l) + b (l+1), we can get δ (l) = w (l+1) t δ (l+1) σ z (l) where is Hadamard product operator (element-wise multiplication) This means we can pass the error backward δ (L) δ (L 1) δ 2

Backpropagation Then what if we know δ (l) or (l) C x (l) b = δ (l) z j j (l) j b = δ j j C x (l) w = δ j j,k (l) z (l) j w j,k (l) = δ j C x = δ(l) b (l) (l) (l) ak (l 1) C x w (l) = δ(l) a (l 1) t

Backpropagation Backpropagation starts at layer L = 3 where Or δ j (L) = C x a j (L) σ C x a j (L) = 2 a j z j L (L) yj δ (L) = a j (L) yj σ z (L)

Two passes Feed forward a y 2 y Backpropagation δ (2) δ 3 C x w (2), C x b (2) C x w (3), C x b (3)

Write our own NN code Assuming 30 neurons in the hidden layer Architecture: self.num_layers = 3 self.sizes = [784, 30, 10] Parameters: self.weights = [[30x784], [10x30]] self.biases = [30, 10] We need to determine the values of these 30x784 + 10x30 + 30 + 10 = 23860 parameters through a training process. How? In each iteration of the training process Feedforward Backpropagation

Training train.py import mnist_loader import network # load reformatted data. training_data, validation_data, test_data = mnist_loader.load_data_wrapper() # define the NN net = network.network([784, 30, 10]) # train with Stochastic Gradient Descent net.sgd(training_data, 30, 10, 3.0, test_data=test_data) NOTE: training_data is a list of 50000 tuples, each of which is ([784], [10])

Class Network (network.py) init (self, sizes) initialize parameters with random numbers feedforward(self, a) takes a as input and return the output of the NN. evaluate(self, test_data) evaluates the performance of the NN. SGD(self, training_data, epochs, mini_batch_size, eta, test_data=none) update_mini_batch(self, mini_batch, eta) twopasses(self, x, y)

Two helper functions def sigmoid(z): """The sigmoid function.""" return 1.0/(1.0+np.exp(-z)) σ z = 1 1 + e z def sigmoid_prime(z): """Derivative of the sigmoid function.""" return sigmoid(z)*(1-sigmoid(z)) σ z = σ z (1 σ z )

Functions of Network def feedforward(self, x): for b, w in zip(self.biases, self.weights): x = sigmoid(np.dot(w, x)+b) return x def evaluate(self, test_data): test_results = [(np.argmax(self.feedforward(x)), y) \ for (x, y) in test_data] return sum(int(x == y) for (x, y) in test_results) NOTE: self.weights = [[30,784], [10, 30]] self.biases = [30, 10] x=[784] is passed in as argument In loop #1: w t = [30, 784], b = [30] x = σ(w t x + b) x=[30] In loop #2: w t = [10, 30], b = [10] x = σ(w t x + b) x=[10]

Functions of Network def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=none): if test_data: n_test = len(test_data) n = len(training_data) for j in xrange(epochs): random.shuffle(training_data) mini_batches = [ training_data[k:k+mini_batch_size] for k in xrange(0, n, mini_batch_size)] for mini_batch in mini_batches: self.update_mini_batch(mini_batch, eta) if test_data: print "Epoch {0}: {1} / {2}".format( j, self.evaluate(test_data), n_test) else: print "Epoch {0} complete".format(j) NOTE: epochs=30, mini_batch_size = 10 eta (or η)=3.0 learning rate Generate 5000 randomized mini-batches Update weights and biases by learning from each batch If test data is provided, then evaluate the performance of the current NN

Functions of Network def update_mini_batch(self, mini_batch, eta): nabla_b = [np.zeros(b.shape) for b in self.biases] nabla_w = [np.zeros(w.shape) for w in self.weights] for x, y in mini_batch: delta_nabla_b, delta_nabla_w = self.twopasses(x, y) C i nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)] self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)] self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)] m i=1 m w, b = w, b + η 1 m C i i=1 C i nabla_w= C (2) w 1,1 C (2) w 1,784 C (2) w 30,1 C (2) w 30,784, C (3) w 1,1 C (3) w 1,30 C (3) w 10,1 C (3) w 10,30 nabla_b= C b 1 (2),, C (2), b 30 C b 1 (3),, C (3) b 10

Functions of Network def twopasses(self, x, y): nabla_b = [np.zeros(b.shape) for b in self.biases] nabla_w = [np.zeros(w.shape) for w in self.weights] # feedforward activation = x activations = [x] # list to store all the activations, layer by layer zs = [] # list to store all the z vectors, layer by layer for b, w in zip(self.biases, self.weights): z = np.dot(w, activation)+b zs.append(z) activation = sigmoid(z) activations.append(activation) # backward pass delta = (activations[-1] - y) * sigmoid_prime(zs[-1]) nabla_b[-1] = delta nabla_w[-1] = np.dot(delta, activations[-2].transpose()) for l in xrange(2, self.num_layers): z = zs[-l] sp = sigmoid_prime(z) delta = np.dot(self.weights[-l+1].transpose(), delta) * sp nabla_b[-l] = delta nabla_w[-l] = np.dot(delta, activations[-l-1].transpose()) return (nabla_b, nabla_w) C i that depends on a single sample (x, y) Feedforward pass: z = w t x + b zs[] 1 σ z = z activations[] 1+e Backpropagation pass: delta= δ (L) = C x a (L) σ z(l) Nabla_b[-1] = C x b (L) = δ(l) Nabla_w[-1]= C x = w (L) δ(l) (L 1) t a delta = δ (l) = w (l+1) t δ (l+1) σ z (l) Calculate nabla_w, nabla_b for layer 2 from δ (l)

Tweak around Learning rate: 0.001, 1.0, 100.0, Size of mini-batches: 10, 50, 100, Number of neurons in the hidden layer: 15, 30, 100, Number of hidden layers: 1, 2, 5,

Use Tensorflow in recognition of handwritten digits Introduction to Tensorflow A warm-up A 2-layer NN (~92% recognition rate) A 3-layer NN (~94% recognition rate) A much better NN (~99% recognition rate)

Introduction to Tensorflow Tensorflow APIs Low-level APIs --- Tensorflow Core that gives you a fine control High-level APIs --- built upon the Core, which are more convenient and efficient to program with Tensor --- a multi-dimensional array which is the central unit of data structure, e.g., [batch, height, width, channel] for image data. Session --- encapsulation of the control and the state of Tensorflow runtime.

Introduction to Tensorflow To perform a specific computational task, one needs to Build a computational graph Run the computational graph Computational graph is a directed graph with edges connecting nodes specifying the data flow. A node could be A constant (no input) A variable A placeholder (reserved for input data)

A warm-up exercise (tf_warmup.py) Problem: There is a linear model y = w x + b, where w, b are parameters of the model Given a set of training data x i, y i i = 1, 0, 2, 1, 3, 2,, we need to find the values of w, b so that the model best fits into the data The criteria of best fit is minimization of a loss function C = 1 wx n i + b y 2 i i

A simple method (tf_mnist_2layers.py) A two-layer fully-connected network Input: 28x28=784 Output: a vector of 10 elements y i = w ij x j + b i j

A simple method (cont.) Training x Softmax cross entropy cost y

Softmax cross entropy Softmax(y 1, y 2,, y n )= i 1 e y i e y 1, e y 2,, e y n turns a vector of outputs into a probability distribution Cross entropy between a prediction probability distribution y and a true distribution y is defined as H y y = y i log y i i Since 0 y i, y i 1, so H y y 0 Tf.nn.softmax_cross_entropy_with_logits y_ =Softmax(y) H y y_

A simple method (cont.) Use trained model x argmax prediction

What s learned

Add a hidden layer (tf_mnist_3layers_*.py)

Fully connected ReLU Dropout Sum + bias Conv ReLU Pooling Conv ReLU Pooling A much better NN Multilayer convolutional network

height Some functions in TF tf.nn.conv2d(input, filter, strides, padding, ) input: 4D tensor [batch, height, width, channels] filter: 4D tensor [f_height, f_width, in_channels, out_channels] strides: 4D tensor padding: SAME or VALID width input filter

Discussions Generalization, overfitting, regularization Difference between deep NN and traditional NN Why deep NN works much better

Generalization, overfitting, regularization Goal of machine learning: Generalization, i.e., learning general rules/patterns from training data) Pitfall: overfitting, i.e., a learned model performs much worse on unseen data Mechanism to prevent overfitting: regularization Dropout layer Monitoring performance with evaluation data during training process

Difference between deep NN and traditional NN Deeper: more hidden layers Combination of unsupervised and supervised learning (autoencoder, a generative NN) Better generalization and regularization mechanisms More advanced layers/neurons: convolutional, pooling,

Why is deep NN so successful? It can approximate arbitrary functions well Features are extracted in a hierarchical way Features extracted in lower layers are more concrete and local Features extracted in higher layers are more abstract and global Deep NN and cheap learning