Intro to Deep Learning for Computer Vision


High Level Computer Vision Intro to Deep Learning for Computer Vision Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de https://www.mpi-inf.mpg.de/hlcv

Overview Today Recent successes of Deep Learning (necessarily incomplete and biased :-)) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Computer Vision: Image Classification 1000 classes; 1 ms per image; 92.5% recognition rate http://demo.caffe.berkeleyvision.org

Architectures get deeper [charts: top-5 error across caffe-alex, vgg-f, vgg-m, googlenet-dag, vgg-verydeep-16, resnet-50-dag, resnet-101-dag, resnet-152-dag, 2012-2015; more accurate toward the bottom] ~2.6x improvement in 3 years

Architectures get deeper

Architectures get deeper

Computer Vision: Semantic Segmentation Badrinarayanan et al. 2015, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling

Computer Vision: Semantic Segmentation Input / Groundtruth / Output Yang He; Wei-Chen Chiu; Margret Keuper; Mario Fritz: STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Deep(er)Cut: Joint Subset Partition & Labeling for Multi Person Pose Estimation DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, CVPR 16 DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, ECCV 16

Analysis - overall performance Best methods now: deep learning takes over (PCKh total, MPII Single Person); best method as of ICCV 13; since CVPR 14 the dataset has become the de-facto standard benchmark; the large training set facilitated the development of deep learning methods. Progress in Person Detection and Human Pose Estimation in the last Decade, Bernt Schiele

Predicting the Future Apratim Bhattacharyya; Mateusz Malinowski; Bernt Schiele; Mario Fritz: Long-Term Image Boundary Extrapolation, arXiv:1611.08841 [cs.CV] & AAAI 18

Computer Vision & NLP: Image Captioning Ours: a person on skis jumping over a ramp / Ours: a skier is making a turn on a course / Ours: a cross country skier makes his way through the snow / Ours: a skier is headed down a steep slope / Baseline: a man riding skis down a snow covered slope. Rakshith Shetty; Marcus Rohrbach; Lisa Anne Hendricks; Mario Fritz; Bernt Schiele: Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training, arXiv:1703.10476 [cs.CV], ICCV 2017

Computer Vision + NLP: Visual Turing Test What is on the refrigerator? magnet, paper. What is the color of the comforter? blue, white. How many drawers are there? 3. Mateusz Malinowski; Mario Fritz: A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, Neural Information Processing Systems (NIPS), 2014. Mateusz Malinowski; Marcus Rohrbach; Mario Fritz: Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, IEEE International Conference on Computer Vision (ICCV), 2015

AI and Planning

Robotics

Creative Aspects deepart.io

Imitating famous painters (slide credit: Efstratios Gavves, UvA Deep Learning Course, "Deeper into Deep Learning and Optimizations")

Privacy & Safety Seong Joon Oh; Mario Fritz; Bernt Schiele: Adversarial Image Perturbation for Privacy Protection - A Game Theory Perspective, arXiv:1703.09471 [cs.CV], ICCV 2017. Tribhuvanesh Orekondy; Bernt Schiele; Mario Fritz: Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images, arXiv:1703.10660 [cs.CV], ICCV 2017.

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Deep Learning Deep Learning is based on the availability of large datasets, massive parallel compute power, and advances in machine learning over the years. Strong improvements due to: the Internet (availability of large-scale data), GPUs (availability of parallel compute power), and deep / hierarchical models with end-to-end learning.

Overview of Deep Learning

Traditional Approach x → feature extraction (hand crafted) → classification → y ("Car"). Feature extraction is often hand crafted and fixed: it might be too general (not task-specific enough) or too specific (does not generalize to other tasks). How to achieve the best classification performance: a more complex classifier (e.g. multi-feature, non-linear)? how specialized for the task?

Features Alone Can Explain (Almost) 10 Years of Progress [Benenson, Omran, Hosang, Schiele @ ECCV workshop 14] Progress in Person Detection and Human Pose Estimation in the last Decade, Bernt Schiele

Motivation Features are key to recent progress in recognition Multitude of hand-designed features currently in use SIFT, HOG, LBP, MSER, Color-SIFT. Where next? Better classifiers? Or keep building more features? Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2007 Yan & Huang (Winner of PASCAL 2010 classification competition) slide credit: Rob Fergus

Hand-Crafted Features LP-β Multiple Kernel Learning (MKL) Gehler and Nowozin, On Feature Combination for Multiclass Object Classification, ICCV 09: 39 different kernels (PHOG, SIFT, V1S+, Region Cov., etc.). MKL only gets a few % gain over averaging features → features are doing the work slide credit: Rob Fergus

Deep Learning: Trainable Features x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". Parameterized feature extraction; features should be efficient to compute and efficient to train (differentiable).

Deep Learning: Joint Training of all Parameters x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". Parameterized feature extraction; features should be efficient to compute and efficient to train (differentiable). End-to-End System: joint training of feature extraction and classification; feature extraction and classification merge into one pipeline.

Deep Learning: Joint Training of all Parameters x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". End-to-End System: all parts are adaptive; no differentiation between feature extraction and classification; non-linear transformation from input to desired output.

Deep Learning: Complex Functions by Composition x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". End-to-End System. How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain).

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car". How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain); each block has trainable parameters θi.

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car", with intermediate representations between the blocks. How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain); each block has trainable parameters θi (see the sketch below).
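
To make the composition idea concrete, here is a minimal sketch (not from the slides) of chaining four simple parameterized blocks, each with its own trainable parameters θi, into one function from input x to output scores y; the affine-plus-tanh block is an illustrative choice.

```python
import numpy as np

def block(x, theta):
    """One simple building block: affine map followed by a non-linearity."""
    W, b = theta
    return np.tanh(W @ x + b)

def compose(x, thetas):
    """Chain blocks: the output of block i is the input of block i+1."""
    h = x
    for theta in thetas:          # h holds the intermediate representations
        h = block(h, theta)
    return h

# toy example: 4 blocks mapping a 16-d input to 2 class scores (e.g. "Car" / "not Car")
rng = np.random.default_rng(0)
sizes = [16, 32, 32, 8, 2]
thetas = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(16)
y = compose(x, thetas)
print(y.shape)                    # (2,)
```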

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car" Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Summary of Main Ideas in Deep Learning 1. Learning of feature extraction (across many layers) 2. Efficient and trainable systems by differentiable building blocks 3. Composition of deep architectures via non-linear modules 4. End-to-End training: no differentiation between feature extraction and classification

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Some Inspiration from the (Human) Brain Motivation from the brain: an effective universal learning machine? Highly simplified neural model: combination of inputs is linear, no spikes, no temporal model, fixed topology. Properties we want: differentiable, efficient.

[Reading: chapters 5.1-5.3, Bishop 2006] Neural Network Basics The core component of a neural network is the processing unit (~ neuron of the human brain). A processing unit maps multiple input values onto one output value y: x_0, ..., x_D are the input units; w_0 is an external input called bias; the propagation rule maps all input values onto the actual input z; the activation function is then applied to obtain the output y := f(z). slide adapted from David Stutz
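
A minimal sketch of one such processing unit; the logistic sigmoid used as the activation function f is an assumption, since the slide leaves f unspecified.

```python
import numpy as np

def unit(x, w, w0):
    """One processing unit: propagation rule z = w·x + w0, output y = f(z)."""
    z = np.dot(w, x) + w0              # actual input: weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))    # activation function f (assumed: sigmoid)

x = np.array([0.5, -1.0, 2.0])         # input values x_1 ... x_D
w = np.array([0.1, 0.4, -0.2])         # weights
print(unit(x, w, w0=0.3))
```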

Neural Network Basics: Perceptron [Rosenblatt 1958] A single-layer perceptron consists of D input units and C output units. Input unit j: x_j with j in {0, ..., D} (x_0 := 1). Propagation rule: weighted sum over the inputs with weights w_ij plus an additional bias w_i0: z_i = sum_{j=1..D} w_ij x_j + w_i0. Activation function: next slide. Output unit i (i in {1, ..., C}) calculates the output for class i: y_i(x, w) = f(z_i) = f( sum_{j=1..D} w_ij x_j + w_i0 ) = f( sum_{j=0..D} w_ij x_j ). slide adapted from David Stutz
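
The same perceptron in vectorized form, a hedged sketch that folds the biases w_i0 into the weight matrix via x_0 := 1; the sigmoid f is again an assumed choice.

```python
import numpy as np

def single_layer_perceptron(x, W):
    """x: input of length D; W: weights of shape (C, D+1), column 0 holds the biases w_i0."""
    x_aug = np.concatenate(([1.0], x))   # x_0 := 1 absorbs the bias
    z = W @ x_aug                        # z_i = sum_j w_ij x_j + w_i0
    return 1.0 / (1.0 + np.exp(-z))      # y_i = f(z_i), here f = sigmoid

D, C = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((C, D + 1)) * 0.1
y = single_layer_perceptron(rng.standard_normal(D), W)
print(y)                                 # one output per class, i = 1..C
```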

Short Intro: Perceptron - Activation Functions slide taken from David Stutz
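
The slide itself is a figure; as a hedged illustration (the exact functions shown there are not reproduced here), these are activation functions commonly used with perceptrons and, later in this lecture, with convnets:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes the input to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes the input to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # rectified linear: max(0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```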

Single Layer Perceptron slide taken from David Stutz

Two-Layer Perceptron slide taken from David Stutz

Multi-Layer Perceptron (MLP) Add L > 0 additional hidden layers in between the input and output layers. Layer l has m^(l) units, with m^(0) = D and m^(L+1) = C. The output of unit i in layer l is then given by: y^(l)_i = f( sum_{j=0..m^(l-1)} w^(l)_ij y^(l-1)_j ). A multi-layer perceptron thus models a function y(·, w): R^D → R^C with y(x, w) = (y_1(x, w), ..., y_C(x, w))^T = (y^(L+1)_1(x, w), ..., y^(L+1)_C(x, w))^T. slide adapted from David Stutz
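
A minimal sketch of the forward pass of such an MLP, with biases folded into the weight matrices as before and tanh assumed as the activation f:

```python
import numpy as np

def mlp_forward(x, weights, f=np.tanh):
    """weights[l] has shape (m^(l+1), m^(l) + 1); column 0 holds the biases."""
    y = x
    for W in weights:
        y_aug = np.concatenate(([1.0], y))  # y_0^(l-1) := 1
        y = f(W @ y_aug)                    # y_i^(l) = f( sum_j w_ij^(l) y_j^(l-1) )
    return y                                # final output y^(L+1) in R^C

# D = 3 inputs, one hidden layer with 5 units, C = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3 + 1)) * 0.1,
           rng.standard_normal((2, 5 + 1)) * 0.1]
print(mlp_forward(rng.standard_normal(3), weights))
```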

Training: Overview x → θ1 → θ2 → θ3 → θ4 → "Car"? Setting: generate output y for input x (forward pass).

Training: Overview x → θ1 → θ2 → θ3 → θ4 → Error? Setting: generate output y for input x (forward pass); when there is an error, propagate the error backwards to update the weights (error backpropagation).

Network Training slide taken from David Stutz

Network Training - Error Measures Sum-of-squared error function: E(w) = sum_{n=1..N} E_n(w) = 1/2 sum_{n=1..N} sum_{i=1..C} ( y_i(x_n, w) - t_ni )^2, where w is the weight vector, y_i(x_n, w) the i-th component of the network output for input x_n, and t_ni the i-th entry of the target t_n. Cross-entropy error function: E(w) = sum_{n=1..N} E_n(w) = - sum_{n=1..N} sum_{i=1..C} t_ni log y_i(x_n, w). slide adapted from David Stutz
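
Both error measures written out directly; the helper names and the small toy arrays are illustrative, not from the slides:

```python
import numpy as np

def sum_of_squared_error(y, t):
    """E(w) = 1/2 * sum_n sum_i ( y_i(x_n, w) - t_ni )^2"""
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = - sum_n sum_i t_ni * log y_i(x_n, w)"""
    return -np.sum(t * np.log(y + eps))

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # network outputs for N = 2 samples, C = 3 classes
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # one-hot targets t_n
print(sum_of_squared_error(y, t), cross_entropy_error(y, t))
```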

Learning by Optimization Highly non-convex error landscape E(w); E(w) = sum_{n=1..N} E_n(w) = 1/2 sum_{n=1..N} sum_{i=1..C} ( y_i(x_n, w) - t_ni )^2

Learning by Optimization E(w)

Learning by Optimization Gradient Descent E(w)

Network Training Parameter Optimization by Gradient Descent Goal: minimize E(w); this cannot be solved analytically and has many local minima. Iterative optimization procedure: let w[0] be the starting vector for the weights (initialization); w[t] is the weight vector in the t-th iteration; gradient descent step: in iteration [t+1] update the weight vector w[t+1] = w[t] - γ ∂E(w)/∂w[t], with learning rate (step size) γ. slide adapted from David Stutz
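
The gradient descent step as a short sketch; the learning rate value and the one-dimensional toy error function E(w) = (w - 3)^2 are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_E, w0, learning_rate=0.1, iterations=100):
    """w[t+1] = w[t] - learning_rate * dE(w)/dw[t]"""
    w = np.asarray(w0, dtype=float)
    for t in range(iterations):
        w = w - learning_rate * grad_E(w)
    return w

# toy example: E(w) = (w - 3)^2, so dE/dw = 2 (w - 3), minimum at w = 3
grad_E = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad_E, w0=0.0))   # converges towards 3.0
```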

Network Training - Approaches slide taken from David Stutz

Backpropagation = Parameter Optimization by Gradient Descent General gradient descent step: w[t+1] = w[t] - γ ∂E(w)/∂w[t]. More specifically, for weight w^(l)_ij: w[t+1]^(l)_ij = w[t]^(l)_ij - γ ∂E_n(w)/∂w[t]^(l)_ij. How to calculate ∂E_n(w)/∂w[t]^(l)_ij? slide taken from David Stutz

Backpropagation = Parameter Optimization by Gradient Descent Remember: y^(l)_i = f(z^(l)_i) = f( sum_{j=0..m^(l-1)} w^(l)_ij y^(l-1)_j ). Using the chain rule (dropping [t] for readability): ∂E_n(w)/∂w^(l)_ij = ∂E_n(w)/∂z^(l)_i · ∂z^(l)_i/∂w^(l)_ij =: δ^(l)_i · ∂z^(l)_i/∂w^(l)_ij. Second term: ∂z^(l)_i/∂w^(l)_ij = y^(l-1)_j, therefore ∂E_n(w)/∂w^(l)_ij = δ^(l)_i y^(l-1)_j.

Backpropagation = Parameter Optimization by Gradient Descent First term for the output layer, with δ^(L+1)_i := ∂E_n(w)/∂z^(L+1)_i and the sum-of-squared error function E_n(w) = 1/2 sum_{i=1..C} ( y^(L+1)_i(x_n, w) - t_ni )^2: ∂E_n(w)/∂y^(L+1)_i = y^(L+1)_i(x_n, w) - t_ni and ∂y^(L+1)_i/∂z^(L+1)_i = f'(z^(L+1)_i), so δ^(L+1)_i = ∂E_n(w)/∂y^(L+1)_i · ∂y^(L+1)_i/∂z^(L+1)_i = ( y^(L+1)_i(x_n, w) - t_ni ) f'(z^(L+1)_i).

Backpropagation = Parameter Optimization by Gradient Descent First term for all other layers: δ^(l)_i := ∂E_n(w)/∂z^(l)_i = f'(z^(l)_i) sum_{j=1..m^(l+1)} w^(l+1)_ji δ^(l+1)_j; this can be evaluated recursively, starting with the output layer L+1.

Backpropagation = Parameter Optimization by Gradient Descent Determine the required derivatives using ∂E_n(w)/∂w^(l)_ij = δ^(l)_i y^(l-1)_j and use the derivatives to update all weights in iteration [t+1]: w[t+1]^(l)_ij = w[t]^(l)_ij - γ ∂E_n(w)/∂w[t]^(l)_ij.
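
Putting the pieces together, a minimal backpropagation sketch for a network with one hidden layer and the sum-of-squared error; the sigmoid activation (so f'(z) = f(z)(1 - f(z))) and the omission of biases are simplifying assumptions:

```python
import numpy as np

def f(z):                        # activation function (assumed: sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                  # its derivative f'(z)
    s = f(z)
    return s * (1.0 - s)

def backprop_step(x, t, W1, W2, lr=0.1):
    """One gradient-descent update for a 2-layer MLP with sum-of-squared error."""
    # forward pass
    z1 = W1 @ x;  y1 = f(z1)                   # hidden layer
    z2 = W2 @ y1; y2 = f(z2)                   # output layer y^(L+1)
    # backward pass: deltas
    delta2 = (y2 - t) * f_prime(z2)            # delta^(L+1)_i for the output layer
    delta1 = f_prime(z1) * (W2.T @ delta2)     # delta^(l)_i via the recursive rule
    # gradients dE_n/dw^(l)_ij = delta^(l)_i * y^(l-1)_j, then gradient descent update
    W2 -= lr * np.outer(delta2, y1)
    W1 -= lr * np.outer(delta1, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3)) * 0.5         # hidden layer weights
W2 = rng.standard_normal((2, 5)) * 0.5         # output layer weights
x, t = rng.standard_normal(3), np.array([1.0, 0.0])
W1, W2 = backprop_step(x, t, W1, W2)
```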

Gradient & Training Only two things need to be implemented to define a new layer: the forward pass and the error backpropagation. Watch overfitting [plot: loss/energy over training epochs for train, validation and test]
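
A sketch of how one might watch for overfitting in practice: track the loss on a held-out validation set each epoch and keep the weights with the best validation loss. The callables train_step and evaluate are hypothetical placeholders, not part of the lecture material.

```python
import copy

def train_with_validation(model, train_step, evaluate, train_data, val_data, epochs=50):
    """Early-stopping sketch: training loss usually keeps falling, while validation
    loss starts rising once the model overfits; keep the best validation checkpoint."""
    best_val, best_model = float("inf"), copy.deepcopy(model)
    for epoch in range(epochs):
        train_loss = train_step(model, train_data)   # forward pass + backprop updates
        val_loss = evaluate(model, val_data)         # no weight updates here
        print(f"epoch {epoch}: train {train_loss:.4f}  val {val_loss:.4f}")
        if val_loss < best_val:
            best_val, best_model = val_loss, copy.deepcopy(model)
    return best_model
```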

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Overview of Deep Learning

Multistage Hubel & Wiesel Architecture Slide: Y. LeCun [Hubel & Wiesel 1962] simple cells detect local features; complex cells pool the outputs of simple cells within a retinotopic neighborhood. Cognitron / Neocognitron [Fukushima 1971-1982], also HMAX [Poggio 2002-2006], Convolutional Networks [LeCun 1988-present]

Convolutional Neural Networks LeCun et al. 1989 Neural network with specialized connectivity structure

Convnet Successes Handwritten text/digits MNIST (0.17% error [Ciresan et al. 2011]) Arabic & Chinese [Ciresan et al. 2012] Simpler recognition benchmarks CIFAR-10 (9.3% error [Wan et al. 2013]) Traffic sign recognition 0.56% error vs 1.16% for humans [Ciresan et al. 2011] But (until recently) less good at more complex datasets E.g. Caltech-101/256 (few training examples)

Characteristics of Convnets Feed-forward: convolve input, non-linearity (rectified linear), pooling (local max / subsampling). Supervised: train convolutional filters by back-propagating the classification error. [diagram: Input Image → Convolution (learned) → Non-linearity → Pooling → Feature maps] [LeCun et al. 1989]
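
A naive numpy sketch of the three feed-forward stages named above, for a single-channel image and one filter; written for clarity, not efficiency:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution (implemented as cross-correlation, as is common in convnets)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def relu(x):                       # non-linearity (rectified linear)
    return np.maximum(0.0, x)

def max_pool(x, size=2):           # pooling (local max) / subsampling
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.default_rng(0).standard_normal((8, 8))
kernel = np.random.default_rng(1).standard_normal((3, 3))   # one learned filter
feature_map = max_pool(relu(convolve2d(image, kernel)))
print(feature_map.shape)           # (3, 3)
```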

Application to ImageNet [Deng et al. CVPR 2009] used 1.2 million images of 1000 labeled classes (overall ~14 million labeled images, 20k classes). Images gathered from the Internet; human labels via Amazon Mechanical Turk. [NIPS 2012]

Krizhevsky et al. [NIPS 2012] Same model as LeCun '98 but: - Bigger model (8 layers) - More data (10^6 vs 10^3 images) - GPU implementation (50x speedup over CPU) - Better regularization (DropOut) 7 hidden layers, 650,000 neurons, 60,000,000 parameters. Trained on 2 GPUs for a week.
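
A hedged PyTorch sketch of an AlexNet-style network (5 convolutional + 3 fully connected layers, ReLU, max pooling, dropout); the layer sizes follow the commonly cited configuration and are not guaranteed to match Krizhevsky et al.'s exact implementation (e.g. it ignores the two-GPU split and local response normalization).

```python
import torch
import torch.nn as nn

alexnet_style = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                       # 1000 ImageNet classes
)

n_params = sum(p.numel() for p in alexnet_style.parameters())
print(n_params)                                  # on the order of 60 million parameters
scores = alexnet_style(torch.zeros(1, 3, 224, 224))   # forward-pass sanity check
```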

ImageNet Classification 2012 Krizhevsky et al.: 16.4% error (top-5); next best (non-convnet): 26.2% error. [bar chart: top-5 error rate (%) for SuperVision, ISI, Oxford, INRIA, Amsterdam]

Commercial Deployment Google & Baidu, Spring 2013 for personal image search