SEMANTIC COMPUTING
Lecture 8: Introduction to Deep Learning
Dagmar Gromann, International Center for Computational Logic
TU Dresden, 7 December 2018

Overview
- Introduction: Deep Learning
- General Neural Networks
- Feedforward Neural Networks

Introduction: Deep Learning

Machine Learning Refresher
- Statistical machine learning algorithms rely on human-crafted representations of raw data and hand-designed features (e.g. POS tag, previous word, next word, TF-IDF, character n-gram, etc.)
- The main goal is to discover a mapping from the representation to the output
- Weights in the target function are optimized to improve the final prediction
- Most of the time has to be invested in finding the optimal features for your task and tuning the parameter settings of your algorithm

Deep Learning
- A subfield of machine learning
- Representation learning: attempts to automatically learn good features from raw data
- Multiple levels of representations (here: 3 layers)
- Builds complex representations from simpler ones
(Figure: representation learning; image source: Goodfellow et al. (2016) Deep Learning, MIT Press. http://www.deeplearningbook.org/)

Brief History
- Speech recognition breakthrough in 2010: Hinton, G. et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), pp. 82-97.
- Computer vision breakthrough in 2012 (halving the error rate to 16%): Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105.
- Today: a vast number of papers published on a daily basis; some help: http://www.arxiv-sanity.com/

Application Examples of Deep Learning
- Neural machine translation
- Text summarization
- Text generation preserving the style
- Linguistic analysis (syntactic parsing, semantic parsing, etc.)
- ...

Image Captioning
Automatically generating image descriptions.
Image source: Fang, H. et al. (2015). From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473-1482.

Lip Sync from Audio
Paper: Suwajanakorn, S., Seitz, S. M., & Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), 95.

Winning Atari Breakout
Google's DeepMind learns to play the Atari game Breakout without any prior knowledge, based on deep reinforcement learning.

General Neural Network Example

Architecture: 2-Layer Feedforward NN

Why neural?
- Inspired by neuroscience: neurons are highly connected and perform computations by combining signals from other neurons
- Deep learning: a densely interconnected set of simple units (e.g. sigmoid units, depending on the activation function), each computing its value based on inputs from other units
- Vector-based representation: X feature space: (vector of) continuous or discrete variables; Y output space: (vector of) continuous or discrete variables
- Many layers of vector-valued representations; each unit is a vector-to-scalar function
- More a function-approximation machine than a model of the brain

Why network?
- Composes many different activation functions across different layers in a chain format (directed acyclic graph)
- Connection of layers: input - hidden (number of layers specified) - output
- The training algorithm decides itself how to use the neurons of each layer between input and output, which is always called a hidden layer
- The depth of a network is specified by the number of its layers
- The width of a layer is specified by the predefined number of units (can only be defined for a hidden layer)

Neural Network: Main Design Decisions
You have to decide on the selection or number of the following when building a neural network (see the sketch below):
- form of units
- cost or loss function
- optimizer
- architecture design: number of units in each layer, number of layers, type of connection between layers
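To make these decisions concrete, here is a minimal sketch of where each one shows up in code, assuming TensorFlow's Keras API; the layer sizes (128, 64), the ReLU/softmax choices and the SGD optimizer are purely illustrative, not the exact network used on the following slides.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # form of units + width of layer 1
    tf.keras.layers.Dense(64, activation="relu"),                       # number of layers
    tf.keras.layers.Dense(10, activation="softmax"),                    # output layer (10 classes)
])
model.compile(
    optimizer="sgd",                          # optimizer
    loss="sparse_categorical_crossentropy",   # cost/loss function
    metrics=["accuracy"],
)
model.summary()
```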

Input: Handwritten Digits (MNIST dataset)
Input images: 28x28 pixel grayscale images of handwritten digits (0-9).
Dataset: http://yann.lecun.com/exdb/mnist/
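As a side note, the dataset can also be loaded directly in Python; a minimal sketch, assuming TensorFlow/Keras is installed, which flattens each image into a 784-dimensional vector as used on the later slides:

```python
import tensorflow as tf

# load the MNIST images and labels (training and test split)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)        # (60000, 28, 28) (60000,)

# flatten each 28x28 image into a 784-dimensional vector and scale pixels to [0, 1]
x_train = x_train.reshape(-1, 784) / 255.0
```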

Form of Hidden Units
The type of unit is defined by its activation function. Activation functions decide whether a neuron should be activated ("fire") or not, that is, whether the received information is relevant or should be ignored (= a nonlinear transformation over the input signal).
- Sigmoid
- Tanh
- Rectified Linear Unit (ReLU)
- Leaky ReLU
- Softmax
- ...
Hidden units: frequently ReLU

Sigmoid Unit
- Sigmoid function: σ(x) = 1 / (1 + e^(-x))
- Non-linear activation function (S-shaped curve from 0 to 1)
- Derivative: g(x) = dσ(x)/dx = σ(x)(1 - σ(x)) => high between values of -3 and 3; in this range, small changes of x bring about large changes of y (a highly desirable property)
- With non-linear functions we can backpropagate (update weights) and stack several layers (with a linear function, additional layers would make no difference)
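A small NumPy sketch of the sigmoid function and its derivative; the sample values are only for illustration:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sigma(x) * (1 - sigma(x)): largest near x = 0, small outside [-3, 3]
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-3.0, 0.0, 3.0])))       # approx [0.047, 0.5, 0.953]
print(sigmoid_grad(np.array([-3.0, 0.0, 3.0])))  # approx [0.045, 0.25, 0.045]
```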

Rectified Linear Unit (ReLU)
- Function: f(x) = max(0, x)
- Not all neurons are activated at the same time: if the input is negative, it is converted to zero and the neuron stays inactive
- Advantage: more efficient computation
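The corresponding NumPy sketch for ReLU:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs become 0 (unit inactive), positive inputs pass through
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]
```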

Overview: Output Layer

Output Type | Output Distribution         | Output Layer | Cost Function
Continuous  | Gaussian                    | Linear       | Gaussian cross-entropy (MSE)
Binary      | Bernoulli (binary variable) | Sigmoid      | Binary cross-entropy
Discrete    | Multinoulli                 | Softmax      | Discrete cross-entropy
...
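In code, each row of this table corresponds to a choice of output layer plus loss; a sketch in Keras notation, where the layer sizes (1 output for regression/binary, 10 classes for something like MNIST) are placeholder assumptions:

```python
import tensorflow as tf

# Continuous output (Gaussian): linear output layer + mean squared error
tf.keras.layers.Dense(1, activation="linear")    # loss="mse"

# Binary output (Bernoulli): sigmoid output layer + binary cross-entropy
tf.keras.layers.Dense(1, activation="sigmoid")   # loss="binary_crossentropy"

# Discrete output (Multinoulli): softmax output layer + categorical cross-entropy
tf.keras.layers.Dense(10, activation="softmax")  # loss="categorical_crossentropy"
```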

Details: Softmax Output Units
- As output unit: squashes a general vector z of arbitrary real values to a vector σ(z) of real values where each entry is in the range [0, 1] and all entries add up to 1
- Objective: all neurons representing important features for a specific input (e.g. loop and line for a "9") should be activated, i.e. fire, when that input is passed through the network
- Notation: σ(z)_j = exp(z_j) / Σ_k exp(z_k)
- Used for: multiclass classification, e.g. multinomial logistic regression and neural networks
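A NumPy sketch of the softmax function; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    # exponentiate (shifted by the max for numerical stability) and normalize to sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # approx [0.659, 0.242, 0.099]
print(softmax(z).sum())  # 1.0
```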

Forward Propagation
- Input to hidden layer 1: H_1 = W_1 x + b_1
- Hidden 1 to hidden 2: H_2 = W_2 H_1 + b_2
- In other words: a weighted sum of the activations of the previous layer and the weights of the current layer: w_1 h_1^(1) + w_2 h_2^(1) + w_3 h_3^(1) + ... + w_n h_n^(1), where h_1^(1), ..., h_n^(1) ∈ H_1 and w_1, ..., w_n ∈ W_2
- All bright pixels are set to a positive value, direct neighbors to a negative value, and the rest to zero
- Maybe we only want to consider weighted sums that are greater than 10; such constraints are introduced by means of the bias (b_2): w_1 h_1^(1) + w_2 h_2^(1) + w_3 h_3^(1) + ... + w_n h_n^(1) >= 10
- Weights: what pixel pattern the layer is picking up on
- Bias: how high the weighted sum needs to be before the neuron starts getting active
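A minimal NumPy sketch of this forward pass for a single input image; the layer sizes (128 and 64 hidden units), the random weights, and the use of sigmoid activations are illustrative assumptions, not the exact network from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(784)                                   # flattened 28x28 input image
W1, b1 = rng.standard_normal((128, 784)) * 0.01, np.zeros(128)
W2, b2 = rng.standard_normal((64, 128)) * 0.01, np.zeros(64)

h1 = sigmoid(W1 @ x + b1)    # input -> hidden layer 1: H_1 = W_1 x + b_1 passed through the activation
h2 = sigmoid(W2 @ h1 + b2)   # hidden 1 -> hidden 2: weighted sum of previous activations plus bias
print(h1.shape, h2.shape)    # (128,) (64,)
```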

Cost or loss function
- Similar to linear models: our model defines a distribution p(y | x; θ) and we use the principle of maximum likelihood (negative log-likelihood)
- The cost function is frequently the cross-entropy between training data and predictions
- Binary: -(y log(ŷ) + (1 - y) log(1 - ŷ))
- Multiclass: -Σ_{i=1}^{M} y_i log(ŷ_i)
- M ... number of classes; y ... correct label; ŷ ... prediction
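Both cost functions as a short NumPy sketch (single-example versions; in practice the loss is averaged over a batch):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # -(y log(y_hat) + (1 - y) log(1 - y_hat)) for a single binary label
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat):
    # -sum_i y_i log(y_hat_i), with y one-hot over the M classes
    return -np.sum(y * np.log(y_hat))

print(binary_cross_entropy(1, 0.9))                        # approx 0.105
print(categorical_cross_entropy(np.array([0, 0, 1]),
                                np.array([0.1, 0.2, 0.7])))  # approx 0.357
```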

Cross-entropy cost function
Example: for this neuron, the output is a = σ(z), where z = Σ_j w_j x_j + b is the weighted sum of the inputs. The cross-entropy cost function for this neuron is then C = -(y log(a) + (1 - y) log(1 - a)).

Backpropagation
Backpropagation, also called backprop, allows information from the cost to flow backward through the network to compute the gradient. It is NOT a learning method, but a very efficient method to compute the gradients that are used in learning methods (optimizers).
It is just the chain rule of derivatives. With z^l = W^l h^(l-1) + b^l, h^l = σ(z^l) and J = (h^l - ŷ)^2:
∂J/∂w^l = ∂z^l/∂w^l · ∂h^l/∂z^l · ∂J/∂h^l = h^(l-1) · σ'(z^l) · 2(h^l - ŷ)
Averaged over all n training examples: ∂J/∂w^l = (1/n) Σ_{k=0}^{n-1} ∂J_k/∂w^l = one value of the vector that represents the gradient of the cost function (repeat for all weights and biases).
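A tiny numerical sketch of this chain rule for a single weight of a single sigmoid unit, with a finite-difference check at the end; all values are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one unit, one example, squared loss J = (h - y)^2 as on the slide
h_prev, w, b, y = 0.6, 0.8, 0.1, 1.0   # previous activation, weight, bias, target

z = w * h_prev + b                     # z = w * h_prev + b
h = sigmoid(z)                         # h = sigma(z)
J = (h - y) ** 2

# chain rule: dJ/dw = dz/dw * dh/dz * dJ/dh = h_prev * sigma'(z) * 2(h - y)
dJ_dw = h_prev * (h * (1 - h)) * 2 * (h - y)
print(J, dJ_dw)                        # approx 0.129 and -0.099

# finite-difference check of the gradient
eps = 1e-6
J_eps = (sigmoid((w + eps) * h_prev + b) - y) ** 2
print((J_eps - J) / eps)               # approx the same value as dJ_dw
```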

What is optimization?
- If we knew the distribution of our data, training would be an optimization problem. However, in machine learning we only have a sample of training data and do not know the full distribution of the data.
- Task: minimize the expected loss on the training set
- Usual algorithm: (mini-batch) gradient descent

Optimization
Minimizing or maximizing some objective function f(x) by altering x; when minimizing, the function is also called cost function, loss function, or error function; the value that minimizes or maximizes a function is usually denoted with *, e.g. x*.
How to optimize? There are many different options. Most important: take the derivative f'(x), because it gives you the slope of f(x) at the point x; it specifies which small change in the input is needed to obtain a small improvement in the prediction.

Gradient Descent
Image source: Goodfellow et al. (2016) Deep Learning, MIT Press. http://www.deeplearningbook.org/

Gradient
Functions with multiple inputs require partial derivatives. The gradient of f is a vector containing all the partial derivatives, denoted ∇_x f(x).
1: Set initial parameters θ^0_1, ..., θ^0_k
2: ɛ = learning rate (step size)
3: while not converged do: calculate ∇f
4:   θ_1 = θ_1 - ɛ ∂f/∂θ_1
5:   ...
6:   θ_k = θ_k - ɛ ∂f/∂θ_k
7: end while
8: a small enough ɛ ensures that f(θ^i_1, ..., θ^i_k) <= f(θ^(i-1)_1, ..., θ^(i-1)_k)
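The same loop written in NumPy, applied to the toy function f(θ_1, θ_2) = θ_1² + θ_2², which is chosen here only for illustration:

```python
import numpy as np

def f(theta):
    # simple convex example: f(theta_1, theta_2) = theta_1^2 + theta_2^2
    return np.sum(theta ** 2)

def grad_f(theta):
    # gradient vector of partial derivatives: (2*theta_1, 2*theta_2)
    return 2 * theta

theta = np.array([3.0, -2.0])   # initial parameters theta^0
eps = 0.1                       # learning rate (step size)

for step in range(50):
    theta = theta - eps * grad_f(theta)   # theta_k <- theta_k - eps * df/dtheta_k

print(theta, f(theta))          # very close to the minimum at (0, 0)
```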

Learning Rate
Also called step size, which is more intuitive:
- 0.06: 138 steps to reach the minimum
- 0.60: 12 steps needed
- 1.60: 1 step
- 1.70: 2 steps
Image produced by https://developers.google.com/machine-learning/crash-course/fitter/graph
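The qualitative effect can be reproduced on a toy one-dimensional quadratic; the step counts produced below are specific to this toy setup and will not match the numbers from the linked visualisation exactly, but they show the same pattern (too small: many steps; too large: divergence):

```python
import numpy as np

def steps_to_minimum(lr, x0=8.0, tol=1e-3, max_steps=10_000):
    # gradient descent on f(x) = x^2 (gradient 2x); count steps until |x| < tol
    x, steps = x0, 0
    while abs(x) > tol and steps < max_steps:
        x -= lr * 2 * x
        steps += 1
    return steps

for lr in (0.01, 0.1, 0.5, 1.01):
    # lr = 0.5 hits the minimum in one step; lr = 1.01 overshoots and never converges
    print(lr, steps_to_minimum(lr))
```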

Overview: Gradient Descent
Definition: Gradient Descent in NN
A method to (step-wise) converge to a local minimum of a cost function by iteratively calculating the gradient and adapting the weights of the neural network accordingly in order to minimize the loss.
- (Batch) gradient descent: update weights after having passed through the whole dataset (use all examples at once), i.e. ∂J/∂W over the full training set
- Stochastic gradient descent: update weights incrementally, i.e. ∂J/∂W for each training input (one example at a time)
- Mini-batch gradient descent: 10-1000 training examples in a batch; loss and gradient are averaged over the batch (equivalent to one step)
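A sketch of how mini-batches are typically drawn for one epoch; compute_gradients, model, weights and learning_rate in the usage comment are hypothetical placeholders, not an actual API:

```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    # shuffle the indices once per epoch, then yield uniformly drawn mini-batches
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# usage sketch for one epoch of mini-batch gradient descent:
# for X_batch, y_batch in minibatches(X_train, y_train):
#     grads = compute_gradients(model, X_batch, y_batch)   # loss/gradient averaged over the batch
#     weights -= learning_rate * grads
```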

SGD
- Needs more steps than GD, but each step is cheaper to compute
- There is not always a decrease of the loss with each step
- Mini-batch: a common way is to uniformly draw examples from the training data; usually very small, between one and a few hundred examples
- The batch size is held fixed even with an increasing training set size
- SGD outside of deep learning: e.g. the main way to train linear models on large datasets

Architecture Design
- Number of units: dependent on the presumed number of characteristics for each feature (e.g. how many different loops in the MNIST task?) that the layer should represent; deeper networks (more layers) = potentially fewer units
- Number of layers: dependent on the presumed different types of features (e.g. contours, edges, etc. in image classification); start out with the simplest solution and then slowly increase
- Interconnections: fully connected means each input is connected to every output; reducing the number of connections reduces the number of parameters that have to be computed (more later)

Feedforward Neural Networks

Feedforward Neural Networks
- Approximate a function f*: they define a mapping y = f(x, θ) and learn the value of θ that represents the best function approximation
- Feedforward means that information flows from the input x, through the intermediate computations that define f, to the output ŷ
- No feedback connections, i.e. no outputs are fed back into the model: no cycles, no loops; with feedback connections = recurrent neural network
- A special kind of feedforward network: Convolutional Neural Networks (CNN)

Review of Lecture 8
- How does deep learning differ from statistical machine learning?
- What are some typical application scenarios of deep learning?
- Why is it called a network? And why neural?
- Which optimizers do you know and how do they work?
- What other design decisions are central in deep learning?
- How do you choose the correct number of layers?