Intro to Deep Learning for Computer Vision


High Level Computer Vision Intro to Deep Learning for Computer Vision Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de https://www.mpi-inf.mpg.de/hlcv

Overview Today Recent successes of Deep Learning (necessarily incomplete and biased :-)) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Computer Vision: Image Classification 1000 classes; 1 ms per image; 92.5% recognition rate http://demo.caffe.berkeleyvision.org

Architectures get deeper [charts: top-5 error across caffe-alex, vgg-f, vgg-m, googlenet-dag, vgg-verydeep-16, resnet-50-dag, resnet-101-dag, resnet-152-dag, 2012-2015; more accurate toward the bottom] ~2.6x improvement in 3 years

Architectures get deeper

Architectures get deeper

Computer Vision: Semantic Segmentation Badrinarayanan et al. 2015, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling

Computer Vision: Semantic Segmentation Input / Groundtruth / Output Yang He; Wei-Chen Chiu; Margret Keuper; Mario Fritz: STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Deep(er)Cut: Joint Subset Partition & Labeling for Multi Person Pose Estimation DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, CVPR 16 DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, ECCV 16

Analysis - overall performance Best methods now: deep learning takes over (PCKh total, MPII Single Person); best method as of ICCV 13; since CVPR 14 the dataset has become the de-facto standard benchmark; the large training set facilitated the development of deep learning methods. Progress in Person Detection and Human Pose Estimation in the last Decade, Bernt Schiele

Predicting the Future Apratim Bhattacharyya; Mateusz Malinowski; Bernt Schiele; Mario Fritz: Long-Term Image Boundary Extrapolation, arXiv:1611.08841 [cs.CV] & AAAI 18

Computer Vision & NLP: Image Captioning Ours: a person on skis jumping over a ramp / Ours: a skier is making a turn on a course / Ours: a cross country skier makes his way through the snow / Ours: a skier is headed down a steep slope / Baseline: a man riding skis down a snow covered slope. Rakshith Shetty; Marcus Rohrbach; Lisa Anne Hendricks; Mario Fritz; Bernt Schiele: Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training, arXiv:1703.10476 [cs.CV], ICCV 2017

Computer Vision + NLP: Visual Turing Test What is on the refrigerator? magnet, paper. What is the color of the comforter? blue, white. How many drawers are there? 3. Mateusz Malinowski; Mario Fritz: A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, Neural Information Processing Systems (NIPS), 2014. Mateusz Malinowski; Marcus Rohrbach; Mario Fritz: Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, IEEE International Conference on Computer Vision (ICCV), 2015

AI and Planning

Robotics

Creative Aspects deepart.io

Imitating famous painters (slide credit: Efstratios Gavves, UvA Deep Learning Course, "Deeper into Deep Learning and Optimizations")

Privacy & Safety Seong Joon Oh; Mario Fritz; Bernt Schiele: Adversarial Image Perturbation for Privacy Protection - A Game Theory Perspective, arXiv:1703.09471 [cs.CV], ICCV 2017. Tribhuvanesh Orekondy; Bernt Schiele; Mario Fritz: Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images, arXiv:1703.10660 [cs.CV], ICCV 2017.

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Deep Learning Deep Learning is based on the availability of large datasets, massive parallel compute power, and advances in machine learning over the years. Strong improvements due to: the Internet (availability of large-scale data), GPUs (availability of parallel compute power), and deep / hierarchical models with end-to-end learning.

Overview of Deep Learning

Traditional Approach x → feature extraction (hand crafted) → classification → y ("Car"). Feature extraction is often hand crafted and fixed: it might be too general (not task-specific enough) or too specific (does not generalize to other tasks). How to achieve the best classification performance: a more complex classifier (e.g. multi-feature, non-linear)? how specialized for the task?

Features Alone Can Explain (Almost) 10 Years of Progress [Benenson, Omran, Hosang, Schiele @ ECCV workshop 14] Progress in Person Detection and Human Pose Estimation in the last Decade, Bernt Schiele

Motivation Features are key to recent progress in recognition Multitude of hand-designed features currently in use SIFT, HOG, LBP, MSER, Color-SIFT. Where next? Better classifiers? Or keep building more features? Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2007 Yan & Huang (Winner of PASCAL 2010 classification competition) slide credit: Rob Fergus

Hand-Crafted Features LP-β Multiple Kernel Learning (MKL) Gehler and Nowozin, On Feature Combination for Multiclass Object Classification, ICCV 09: 39 different kernels (PHOG, SIFT, V1S+, Region Cov., etc.). MKL only gets a few % gain over averaging features → features are doing the work slide credit: Rob Fergus

Deep Learning: Trainable Features x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". Parameterized feature extraction; features should be efficient to compute and efficient to train (differentiable).

Deep Learning: Joint Training of all Parameters x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". Parameterized feature extraction; features should be efficient to compute and efficient to train (differentiable). End-to-End System: joint training of feature extraction and classification; feature extraction and classification merge into one pipeline.

Deep Learning: Joint Training of all Parameters x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". End-to-End System: all parts are adaptive; no differentiation between feature extraction and classification; non-linear transformation from input to desired output.

Deep Learning: Complex Functions by Composition x → Features (features = g(x, θ)) → Classifier (y = f(features, w)) → "Car". End-to-End System. How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain).

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car". How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain); each block has trainable parameters θi.

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car", with intermediate representations between the blocks. How can we build such systems? What is the parameterization (hypothesis)? Composition of simple building blocks can lead to complex systems (e.g. neurons → brain); each block has trainable parameters θi (see the sketch below).
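
To make the composition idea concrete, here is a minimal sketch (not from the slides) of chaining four simple parameterized blocks, each with its own trainable parameters θi, into one function from input x to output scores y; the affine-plus-tanh block is an illustrative choice.

```python
import numpy as np

def block(x, theta):
    """One simple building block: affine map followed by a non-linearity."""
    W, b = theta
    return np.tanh(W @ x + b)

def compose(x, thetas):
    """Chain blocks: the output of block i is the input of block i+1."""
    h = x
    for theta in thetas:          # h holds the intermediate representations
        h = block(h, theta)
    return h

# toy example: 4 blocks mapping a 16-d input to 2 class scores (e.g. "Car" / "not Car")
rng = np.random.default_rng(0)
sizes = [16, 32, 32, 8, 2]
thetas = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(16)
y = compose(x, thetas)
print(y.shape)                    # (2,)
```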

Deep Learning: Complex Functions by Composition x → θ1 → θ2 → θ3 → θ4 → "Car" Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Summary of Main Ideas in Deep Learning 1. Learning of feature extraction (across many layers) 2. Efficient and trainable systems by differentiable building blocks 3. Composition of deep architectures via non-linear modules 4. End-to-End training: no differentiation between feature extraction and classification

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Some Inspiration from the (Human) Brain Motivation from the brain: an effective universal learning machine? Highly simplified neural model: combination of inputs is linear, no spikes, no temporal model, fixed topology. Properties we want: differentiable, efficient.

[Reading: chapters 5.1-5.3, Bishop 2006] Neural Network Basics The core component of a neural network is the processing unit (~ neuron of the human brain). A processing unit maps multiple input values onto one output value y: x_0, ..., x_D are the input units; w_0 is an external input called bias; the propagation rule maps all input values onto the actual input z; the activation function is then applied to obtain the output y := f(z). slide adapted from David Stutz
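
A minimal sketch of one such processing unit; the logistic sigmoid used as the activation function f is an assumption, since the slide leaves f unspecified.

```python
import numpy as np

def unit(x, w, w0):
    """One processing unit: propagation rule z = w·x + w0, output y = f(z)."""
    z = np.dot(w, x) + w0              # actual input: weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))    # activation function f (assumed: sigmoid)

x = np.array([0.5, -1.0, 2.0])         # input values x_1 ... x_D
w = np.array([0.1, 0.4, -0.2])         # weights
print(unit(x, w, w0=0.3))
```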

Neural Network Basics: Perceptron [Rosenblatt 1958] A single-layer perceptron consists of D input units and C output units. Input unit j: x_j with j in {0, ..., D} (x_0 := 1). Propagation rule: weighted sum over the inputs with weights w_ij plus an additional bias w_i0: z_i = sum_{j=1..D} w_ij x_j + w_i0. Activation function: next slide. Output unit i (i in {1, ..., C}) calculates the output for class i: y_i(x, w) = f(z_i) = f( sum_{j=1..D} w_ij x_j + w_i0 ) = f( sum_{j=0..D} w_ij x_j ). slide adapted from David Stutz
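
The same perceptron in vectorized form, a hedged sketch that folds the biases w_i0 into the weight matrix via x_0 := 1; the sigmoid f is again an assumed choice.

```python
import numpy as np

def single_layer_perceptron(x, W):
    """x: input of length D; W: weights of shape (C, D+1), column 0 holds the biases w_i0."""
    x_aug = np.concatenate(([1.0], x))   # x_0 := 1 absorbs the bias
    z = W @ x_aug                        # z_i = sum_j w_ij x_j + w_i0
    return 1.0 / (1.0 + np.exp(-z))      # y_i = f(z_i), here f = sigmoid

D, C = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((C, D + 1)) * 0.1
y = single_layer_perceptron(rng.standard_normal(D), W)
print(y)                                 # one output per class, i = 1..C
```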

Short Intro: Perceptron - Activation Functions slide taken from David Stutz
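
The slide itself is a figure; as a hedged illustration (the exact functions shown there are not reproduced here), these are activation functions commonly used with perceptrons and, later in this lecture, with convnets:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes the input to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes the input to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # rectified linear: max(0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```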

Single Layer Perceptron slide taken from David Stutz

Two-Layer Perceptron slide taken from David Stutz

Multi-Layer Perceptron (MLP) Add L > 0 additional hidden layers in between the input and output layers. Layer l has m^(l) units, with m^(0) = D and m^(L+1) = C. The output of unit i in layer l is then given by: y^(l)_i = f( sum_{j=0..m^(l-1)} w^(l)_ij y^(l-1)_j ). A multi-layer perceptron thus models a function y(·, w): R^D → R^C with y(x, w) = (y_1(x, w), ..., y_C(x, w))^T = (y^(L+1)_1(x, w), ..., y^(L+1)_C(x, w))^T. slide adapted from David Stutz
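
A minimal sketch of the forward pass of such an MLP, with biases folded into the weight matrices as before and tanh assumed as the activation f:

```python
import numpy as np

def mlp_forward(x, weights, f=np.tanh):
    """weights[l] has shape (m^(l+1), m^(l) + 1); column 0 holds the biases."""
    y = x
    for W in weights:
        y_aug = np.concatenate(([1.0], y))  # y_0^(l-1) := 1
        y = f(W @ y_aug)                    # y_i^(l) = f( sum_j w_ij^(l) y_j^(l-1) )
    return y                                # final output y^(L+1) in R^C

# D = 3 inputs, one hidden layer with 5 units, C = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3 + 1)) * 0.1,
           rng.standard_normal((2, 5 + 1)) * 0.1]
print(mlp_forward(rng.standard_normal(3), weights))
```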

Training: Overview x → θ1 → θ2 → θ3 → θ4 → "Car"? Setting: generate output y for input x (forward pass).

Training: Overview x → θ1 → θ2 → θ3 → θ4 → Error? Setting: generate output y for input x (forward pass); when there is an error, propagate the error backwards to update the weights (error backpropagation).

Network Training slide taken from David Stutz

Network Training - Error Measures Sum-of-squared error function: E(w) = sum_{n=1..N} E_n(w) = 1/2 sum_{n=1..N} sum_{i=1..C} ( y_i(x_n, w) - t_ni )^2, where w is the weight vector, y_i(x_n, w) the i-th component of the network output for input x_n, and t_ni the i-th entry of the target t_n. Cross-entropy error function: E(w) = sum_{n=1..N} E_n(w) = - sum_{n=1..N} sum_{i=1..C} t_ni log y_i(x_n, w). slide adapted from David Stutz
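
Both error measures written out directly; the helper names and the small toy arrays are illustrative, not from the slides:

```python
import numpy as np

def sum_of_squared_error(y, t):
    """E(w) = 1/2 * sum_n sum_i ( y_i(x_n, w) - t_ni )^2"""
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = - sum_n sum_i t_ni * log y_i(x_n, w)"""
    return -np.sum(t * np.log(y + eps))

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # network outputs for N = 2 samples, C = 3 classes
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # one-hot targets t_n
print(sum_of_squared_error(y, t), cross_entropy_error(y, t))
```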

Learning by Optimization Highly non-convex error landscape E(w); E(w) = sum_{n=1..N} E_n(w) = 1/2 sum_{n=1..N} sum_{i=1..C} ( y_i(x_n, w) - t_ni )^2

Learning by Optimization E(w)

Learning by Optimization Gradient Descent E(w)

Network Training Parameter Optimization by Gradient Descent Goal: minimize E(w); this cannot be solved analytically and has many local minima. Iterative optimization procedure: let w[0] be the starting vector for the weights (initialization); w[t] is the weight vector in the t-th iteration; gradient descent step: in iteration [t+1] update the weight vector w[t+1] = w[t] - γ ∂E(w)/∂w[t], with learning rate (step size) γ. slide adapted from David Stutz
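
The gradient descent step as a short sketch; the learning rate value and the one-dimensional toy error function E(w) = (w - 3)^2 are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_E, w0, learning_rate=0.1, iterations=100):
    """w[t+1] = w[t] - learning_rate * dE(w)/dw[t]"""
    w = np.asarray(w0, dtype=float)
    for t in range(iterations):
        w = w - learning_rate * grad_E(w)
    return w

# toy example: E(w) = (w - 3)^2, so dE/dw = 2 (w - 3), minimum at w = 3
grad_E = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad_E, w0=0.0))   # converges towards 3.0
```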

Network Training - Approaches slide taken from David Stutz

Backpropagation = Parameter Optimization by Gradient Descent General gradient descent step: w[t+1] = w[t] - γ ∂E(w)/∂w[t]. More specifically, for weight w^(l)_ij: w[t+1]^(l)_ij = w[t]^(l)_ij - γ ∂E_n(w)/∂w[t]^(l)_ij. How to calculate ∂E_n(w)/∂w[t]^(l)_ij? slide taken from David Stutz

Backpropagation = Parameter Optimization by Gradient Descent Remember: y^(l)_i = f(z^(l)_i) = f( sum_{j=0..m^(l-1)} w^(l)_ij y^(l-1)_j ). Using the chain rule (dropping [t] for readability): ∂E_n(w)/∂w^(l)_ij = ∂E_n(w)/∂z^(l)_i · ∂z^(l)_i/∂w^(l)_ij =: δ^(l)_i · ∂z^(l)_i/∂w^(l)_ij. Second term: ∂z^(l)_i/∂w^(l)_ij = y^(l-1)_j, therefore ∂E_n(w)/∂w^(l)_ij = δ^(l)_i y^(l-1)_j.

Backpropagation = Parameter Optimization by Gradient Descent First term for the output layer, with δ^(L+1)_i := ∂E_n(w)/∂z^(L+1)_i and the sum-of-squared error function E_n(w) = 1/2 sum_{i=1..C} ( y^(L+1)_i(x_n, w) - t_ni )^2: ∂E_n(w)/∂y^(L+1)_i = y^(L+1)_i(x_n, w) - t_ni and ∂y^(L+1)_i/∂z^(L+1)_i = f'(z^(L+1)_i), so δ^(L+1)_i = ∂E_n(w)/∂y^(L+1)_i · ∂y^(L+1)_i/∂z^(L+1)_i = ( y^(L+1)_i(x_n, w) - t_ni ) f'(z^(L+1)_i).

Backpropagation = Parameter Optimization by Gradient Descent First term for all other layers: δ^(l)_i := ∂E_n(w)/∂z^(l)_i = f'(z^(l)_i) sum_{j=1..m^(l+1)} w^(l+1)_ji δ^(l+1)_j; this can be evaluated recursively, starting with the output layer L+1.

Backpropagation = Parameter Optimization by Gradient Descent Determine the required derivatives using ∂E_n(w)/∂w^(l)_ij = δ^(l)_i y^(l-1)_j and use the derivatives to update all weights in iteration [t+1]: w[t+1]^(l)_ij = w[t]^(l)_ij - γ ∂E_n(w)/∂w[t]^(l)_ij.
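
Putting the pieces together, a minimal backpropagation sketch for a network with one hidden layer and the sum-of-squared error; the sigmoid activation (so f'(z) = f(z)(1 - f(z))) and the omission of biases are simplifying assumptions:

```python
import numpy as np

def f(z):                        # activation function (assumed: sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                  # its derivative f'(z)
    s = f(z)
    return s * (1.0 - s)

def backprop_step(x, t, W1, W2, lr=0.1):
    """One gradient-descent update for a 2-layer MLP with sum-of-squared error."""
    # forward pass
    z1 = W1 @ x;  y1 = f(z1)                   # hidden layer
    z2 = W2 @ y1; y2 = f(z2)                   # output layer y^(L+1)
    # backward pass: deltas
    delta2 = (y2 - t) * f_prime(z2)            # delta^(L+1)_i for the output layer
    delta1 = f_prime(z1) * (W2.T @ delta2)     # delta^(l)_i via the recursive rule
    # gradients dE_n/dw^(l)_ij = delta^(l)_i * y^(l-1)_j, then gradient descent update
    W2 -= lr * np.outer(delta2, y1)
    W1 -= lr * np.outer(delta1, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3)) * 0.5         # hidden layer weights
W2 = rng.standard_normal((2, 5)) * 0.5         # output layer weights
x, t = rng.standard_normal(3), np.array([1.0, 0.0])
W1, W2 = backprop_step(x, t, W1, W2)
```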

Gradient & Training Only two things need to be implemented to define a new layer: the forward pass and the error backpropagation. Watch overfitting [plot: loss/energy over training epochs for train, validation and test]
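
A sketch of how one might watch for overfitting in practice: track the loss on a held-out validation set each epoch and keep the weights with the best validation loss. The callables train_step and evaluate are hypothetical placeholders, not part of the lecture material.

```python
import copy

def train_with_validation(model, train_step, evaluate, train_data, val_data, epochs=50):
    """Early-stopping sketch: training loss usually keeps falling, while validation
    loss starts rising once the model overfits; keep the best validation checkpoint."""
    best_val, best_model = float("inf"), copy.deepcopy(model)
    for epoch in range(epochs):
        train_loss = train_step(model, train_data)   # forward pass + backprop updates
        val_loss = evaluate(model, val_data)         # no weight updates here
        print(f"epoch {epoch}: train {train_loss:.4f}  val {val_loss:.4f}")
        if val_loss < best_val:
            best_val, best_model = val_loss, copy.deepcopy(model)
    return best_model
```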

Overview Today Recent successes of Deep Learning (necessarily incomplete) What is Deep Learning? Neural network basics Supervised learning with error backpropagation [reading: Bishop, chapters 5.1-5.3] Convolutional Neural Networks

Overview of Deep Learning

Multistage Hubel & Wiesel Architecture Slide: Y. LeCun [Hubel & Wiesel 1962] simple cells detect local features; complex cells pool the outputs of simple cells within a retinotopic neighborhood. Cognitron / Neocognitron [Fukushima 1971-1982], also HMAX [Poggio 2002-2006], Convolutional Networks [LeCun 1988-present]

Convolutional Neural Networks LeCun et al. 1989 Neural network with specialized connectivity structure

Convnet Successes Handwritten text/digits MNIST (0.17% error [Ciresan et al. 2011]) Arabic & Chinese [Ciresan et al. 2012] Simpler recognition benchmarks CIFAR-10 (9.3% error [Wan et al. 2013]) Traffic sign recognition 0.56% error vs 1.16% for humans [Ciresan et al. 2011] But (until recently) less good at more complex datasets E.g. Caltech-101/256 (few training examples)

Characteristics of Convnets Feed-forward: convolve input, non-linearity (rectified linear), pooling (local max / subsampling). Supervised: train convolutional filters by back-propagating the classification error. [diagram: Input Image → Convolution (learned) → Non-linearity → Pooling → Feature maps] [LeCun et al. 1989]
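
A naive numpy sketch of the three feed-forward stages named above, for a single-channel image and one filter; written for clarity, not efficiency:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution (implemented as cross-correlation, as is common in convnets)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def relu(x):                       # non-linearity (rectified linear)
    return np.maximum(0.0, x)

def max_pool(x, size=2):           # pooling (local max) / subsampling
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.default_rng(0).standard_normal((8, 8))
kernel = np.random.default_rng(1).standard_normal((3, 3))   # one learned filter
feature_map = max_pool(relu(convolve2d(image, kernel)))
print(feature_map.shape)           # (3, 3)
```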

Application to ImageNet [Deng et al. CVPR 2009] used 1.2 million images of 1000 labeled classes (overall ~14 million labeled images, 20k classes). Images gathered from the Internet; human labels via Amazon Mechanical Turk. [NIPS 2012]

Krizhevsky et al. [NIPS 2012] Same model as LeCun '98 but: - Bigger model (8 layers) - More data (10^6 vs 10^3 images) - GPU implementation (50x speedup over CPU) - Better regularization (DropOut) 7 hidden layers, 650,000 neurons, 60,000,000 parameters. Trained on 2 GPUs for a week.
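
A hedged PyTorch sketch of an AlexNet-style network (5 convolutional + 3 fully connected layers, ReLU, max pooling, dropout); the layer sizes follow the commonly cited configuration and are not guaranteed to match Krizhevsky et al.'s exact implementation (e.g. it ignores the two-GPU split and local response normalization).

```python
import torch
import torch.nn as nn

alexnet_style = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                       # 1000 ImageNet classes
)

n_params = sum(p.numel() for p in alexnet_style.parameters())
print(n_params)                                  # on the order of 60 million parameters
scores = alexnet_style(torch.zeros(1, 3, 224, 224))   # forward-pass sanity check
```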

ImageNet Classification 2012 Krizhevsky et al.: 16.4% error (top-5); next best (non-convnet): 26.2% error. [bar chart: top-5 error rate (%) for SuperVision, ISI, Oxford, INRIA, Amsterdam]

Commercial Deployment Google & Baidu, Spring 2013 for personal image search