深度学习 : 机器学习的新浪潮. Kai Yu Deputy Managing Director, Baidu IDL

Size: px

Start display at page:

Download "深度学习 : 机器学习的新浪潮. Kai Yu Deputy Managing Director, Baidu IDL"

Bernard Melton
5 years ago
Views:

1 深度学习 : 机器学习的新浪潮 Kai Yu Deputy Managing Director, Baidu IDL

2 Machine Learning the design and development of algorithms that allow computers to evolve behaviors based on empirical data, 7/23/2013 2

3 All Machine Learning Models in One Page Shallow Models Deep Models 7/23/2013 3

4 Shallow Models Since Late 80 s Neural Networks Boosting Support Vector Machines Maximum Entropy Given good features, how to do classification? 7/23/2013 4

5 Since 2000 Learning with Structures Kernel Learning Transfer Learning Semi-supervised Learning Manifold Learning Sparse Learning Matrix Factorization? Given good features, how to do classification? 7/23/2013 5

6 Mining$for$Structure$ Mission Yet Accomplished Massive$increase$in$both$computa: onal$power$and$the$amount$of$ data$available$from$web,$video$cameras,$laboratory$measurements.$ Images$&$Video$ Text$&$Language$$ Speech$&$Audio$ Gene$Expression$ Product$$ Recommenda: on$ Rela: onal$data/$$ Social$Network$ Climate$Change$ Geological$Data$ Mostly$Unlabeled$ $Develop$sta: s: cal$models$that$can$discover$underlying$structure,$cause,$or$ sta: s: cal$correla: on$from$data$in$unsupervised*or$semi,supervised*way.$$ Slide Courtesy: Russ Salakhutdinov $Mul: ple$applica: on$domains.$ 7/23/2013 6

7 The pipeline of machine visual perception Most Efforts in Machine Learning Low-level sensing Preprocessing Feature extract. Feature selection Inference: prediction, recognition Most critical for accuracy Account for most of the computation for testing Most time-consuming in development cycle Often hand-craft in practice 7/23/2013 7

8 Computer vision features SIFT Spin image HoG RIFT GLOH Slide Courtesy: Andrew Ng

9 Deep Learning: learning features from data Machine Learning Low-level sensing Preprocessing Feature extract. Feature selection Inference: prediction, recognition Feature Learning 7/23/2013 9

10 Convolution Neural Networks Coding Pooling Coding Pooling Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, /23/

11 Winter of Neural Networks Since 90 s Non-convex Need a lot of tricks to play with Hard to do theoretical analysis 7/23/

12 The Paradigm of Deep Learning

13 Deep Learning Since 2006 Neural networks 焕发第二春! 7/23/

14 Race on ImageNet (Top 5 Hit Rate) 72%, %, /23/

The Best system on ImageNet (by 2012.10) Local Gradients e.

15 The Best system on ImageNet (by ) Local Gradients e.g, SIFT, HOG Pooling Variants of Sparse Coding: LLC, Super-vector Pooling Linear Classifier - This is a moderate deep model - The first two layers are hand-designed 7/23/

16 Challenge to Deep Learners Key questions: What if no hand-craft features at all? What if use much deeper neural networks? -- By Geoff Hinton 7/23/

17 Answer from Geoff Hinton, %, %, %, /23/

18 The Architecture Slide Courtesy: Geoff Hinton 7/23/

19 The Architecture 7 hidden layers not counting max pooling. Early layers are conv., last two layers globally connected. Uses rectified linear units in every layer. Uses competitive normalization to suppress hidden activities. Slide Courtesy: Geoff HInton 7/23/

20 Revolution on Speech Recognition Word error rates from MSR, IBM, and the Google speech group Slide Courtesy: Geoff Hinton 7/23/

21 Deep Learning for NLP Natural Language Processing (Almost) from Scratch, Collobert et al, JMLR /23/

22 Biological & Theoretical Justification

23 [Thorpe] Why Hierarchy? Theoretical: well-known depth-breadth tradeoff in circuits design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures [Bengio & LeCun 2007] Biological: Visual cortex is hierarchical (Hubel-Wiesel Model, 1981 Nobel Prize)

24 Sparse DBN: Training on face images object models Deep Architecture in the Brain object parts (combination of edges) Area V4 Area V2 Area V1 Higher level visual abstractions Primitive shape detectors Edge detectors edges Retina pixels pixels Note: Sparsity important for these results [Lee, Grosse, Ranganath & Ng, 2009]

25 Sensor representation in the brain Auditory cortex learns to see. Auditory Cortex (Same rewiring process also works for touch/ somatosensory cortex.) Seeing with your tongue [Roe et al., 1992; BrainPort; Welsh & Blasch, 1997]

26 Deep Learning in Industry

27 Microsoft First successful deep learning models for speech recognition, by MSR in /23/

28 Google Brain Project Ley by Google fellow Jeff Dean Published two papers, ICML2012, NIPS2012 Company-wise large-scale deep learning infrastructure Significant success on images, speech, 7/23/

Deep Learning @ Baidu Started working on deep learning in 2012 summer Big success in speech recognition and image recognition, 6 products

29 Deep Baidu Started working on deep learning in 2012 summer Big success in speech recognition and image recognition, 6 products have been launched so far. Recently, very encouraging results on CTR Meanwhile, efforts are being tried on in areas like LTR, NLP 7/23/

30 Deep Learning Ingredients Building blocks RBMs, autoencoder neural nets, sparse coding, Go deeper: Layerwise feature learning Layer-by-layer unsupervised training Layer-by-layer supervised training Fine tuning via Backpropogation If data are big enough, direct fine tuning is enough Sparsity on hidden layers are often helpful. 7/23/

31 Building Blocks of Deep Learning

32 CVPR 2012 Tutorial: Deep Learning for Vision 09.00am: Introduction Rob Fergus (NYU) 10.00am: Coffee Break 10.30am: Sparse Coding Kai Yu (Baidu) 11.30am: Neural Networks Marc Aurelio Ranzato (Google) 12.30pm: Lunch 01.30pm: Restricted Boltzmann Honglak Lee (Michigan) Machines 02.30pm: Deep Boltzmann Ruslan Salakhutdinov (Toronto) Machines 03.00pm: Coffee Break 03.30pm: Transfer Learning Ruslan Salakhutdinov (Toronto) 04.00pm: Motion & Video Graham Taylor (Guelph) 05.00pm: Summary / Q & A All 05.30pm: End

33 Building Block 1 - RBM

34 Restricted Boltzmann Machine 7/23/2013 Slide Courtesy: Russ Salakhutdinov 34

35 Restricted Boltzmann Machine Restricted: Slide Courtesy: Russ Salakhutdinov 7/23/

36 Model Parameter Learning i.i.d. Approximate*maximum*likelihood*learning:** Slide Courtesy: Russ Salakhutdinov 7/23/

37 Contrastive Divergence 7/23/2013 Slide Courtesy: Russ Salakhutdinov 37

38 RBMs for Images 7/23/2013 Slide Courtesy: Russ Salakhutdinov 38

39 Layerwise Pre-training 7/23/2013 Slide Courtesy: Russ Salakhutdinov 39

40 DBNs for MNIST Classification 7/23/2013 Slide Courtesy: Russ Salakhutdinov 40

41 Deep Autoencoders for Unsupervised Feature Learning 7/23/2013 Slide Courtesy: Russ Salakhutdinov 41

42 Searching$Large$Image$Database$ using$binary$codes$ Image Retrieval using Binary Codes $Map$images$into$binary$codes$for$fast$retrieval.* $Small$Codes,$Torralba,$Fergus,$Weiss,$CVPR$2008$ *Spectral$Hashing,$Y.$Weiss,$A.$Torralba,$R.$Fergus,$NIPS$2008* *Kulis$and$Darrell,$NIPS$2009,$Gong$and$Lazebnik,$CVPR$20111$ *Norouzi$and$Fleet,$ICML$2011,$* Slide Courtesy: Russ Salakhutdinov 7/23/

43 Building Block 2 Autoencoder Neural Net

44 AUTO-ENCODERS NEURAL NETS Autoencoder Neural Network input encoder code decoder prediction Error input higher dimensional than code - error: prediction - input - training: back-propagation Ranzato Slide Courtesy: Marc'Aurelio Ranzato 7/23/

45 Sparse AutoencoderSparsity SPARSE AUTO-ENCODERS Penalty input input encoder Sparsity Penalty code encoder decoder code decoder prediction prediction Error Err sparsity penalty: code sparsity penalty: code error: prediction - input - error: prediction - input - loss: sum of square reconstruction error and sparsity loss: - training: sum back-propagation of square reconstruction error and sparsity - training: back-propagation 1 2 Ranzato Slide Courtesy: Marc'Aurelio Ranzato 7/23/ Ran

46 Sparse Autoencoder SPARSE AUTO-ENCODERS input input SPARSE AUTO-ENCODERS encoder Sparsity Penalty encoder code Sparsity Penalty decoder prediction decoder prediction Error Er sparsity penalty: code input: code: - error: prediction - input - training: back-propagation loss: sum of square reconstruction error and sparsity - loss: X h= W T X 117 L X ;W = W h X 2 j h j Ranzato Le et al. ICA with reconstruction cost.. NIPS 2011 Ran Slide Courtesy: Marc'Aurelio Ranzato 7/23/

47 Building Block 3 Sparse Coding

48 Sparse coding Sparse coding (Olshausen & Field,1996). Originally developed to explain early visual processing in the brain (edge detection). Training: given a set of random patches x, learning a dictionary of bases [Φ 1, Φ 2, ] Coding: for data vector x, solve LASSO to find the sparse coefficient vector a 7/23/

49 Sparse coding: training time Input: Images x 1, x 2,, x m (each in R d ) Learn: Dictionary of bases f 1, f 2,, f k (also R d ). Alternating optimization: 1.Fix dictionary f 1, f 2,, f k, optimize a (a standard LASSO problem) 2.Fix activations a, optimize dictionary f 1, f 2,, f k (a convex QP problem)

50 Sparse coding: testing time Input: Novel image patch x (in R d ) and previously learned f i s Output: Representation [a i,1, a i,2,, a i,k ] of image patch x i. 0.8 * * * Represent x i as: a i = [0, 0,, 0, 0.8, 0,, 0, 0.3, 0,, 0, 0.5, ]

Sparse coding illustration 50 100 Natural Images Learned bases (f

50 500 300 100 50 100 150 200 250 300 350 400 450 500 150 350 400

400 450 500 50 100 150 200 250 300 350 400 450 500 Test example 0.

5 * [a 1,, a 64 ] = [0, 0,, 0, 0.8, 0,, 0, 0.3, 0,, 0, 0.

51 Sparse coding illustration Natural Images Learned bases (f 1,, f 64 ): Edges Test example 0.8 * * * f 63 x 0.8 * f * f * [a 1,, a 64 ] = [0, 0,, 0, 0.8, 0,, 0, 0.3, 0,, 0, 0.5, 0] (feature representation) Slide credit: Andrew Ng Compact & easily interpretable

52 RBM & autoencoders - also involve activation and reconstruction - but have explicit f(x) - not necessarily enforce sparsity on a - but if put sparsity on a, often get improved results [e.g. sparse RBM, Lee et al. NIPS08] a x f(x) encoding a x g(a) decoding 7/23/

53 Sparse coding: A broader view Any feature mapping from x to a, i.e. a = f(x), where -a is sparse (and often higher dim. than x) -f(x) is nonlinear -reconstruction x =g(a), such that x x a f(x) a g(a) x x Therefore, sparse RBMs, sparse auto-encoder, even VQ can be viewed as a form of sparse coding. 7/23/

54 Example of sparse activations (sparse coding) x 3 x 1 x 2 x m a 1 [ ] a 2 [ ] a 3 [ ] a m [ ] different x has different dimensions activated locally-shared sparse representation: similar x s tend to have similar non-zero dimensions 7/23/

55 Example of sparse activations (sparse coding) x 1 x 2 x 3 x m a 1 [ ] a 2 [ ] a 3 [ ] a m [ ] another example: preserving manifold structure more informative in highlighting richer data structures, i.e. clusters, manifolds, 7/23/

56 Sparsity vs. Locality sparse coding local sparse coding Intuition: similar data should get similar activated features Local sparse coding: data in the same neighborhood tend to have shared activated features; data in different neighborhoods tend to have different features activated. 7/23/

Sparse coding is not always local: example Case 1 independent subspaces Each basis is a direction Sparsity: each datum is a linear combination of only several bases.

57 Sparse coding is not always local: example Case 1 independent subspaces Each basis is a direction Sparsity: each datum is a linear combination of only several bases. 7/23/2013 Case 2 data manifold (or clusters) Each basis an anchor point Sparsity: each datum is a linear combination of neighbor anchors. Sparsity is caused by locality. 57

58 Two approaches to local sparse coding Approach 1 Coding via local anchor points Local coordinate coding Learning locality-constrained linear coding for image classification, Jingjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. In CVPR Nonlinear learning using local coordinate coding, Kai Yu, Tong Zhang, and Yihong Gong. In NIPS Approach 2 Coding via local subspaces Super-vector coding Image Classification using Super-Vector Coding of Local Image Descriptors, Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. In ECCV Large-scale Image Classification: Fast Feature Extraction and SVM Training, Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, LiangLiang Cao, Thomas Huang. In CVPR /23/

- Sparsity achieved by explicitly ensuring locality - Sound theoretical

59 Two approaches to local sparse coding Approach 1 Coding via local anchor points Local coordinate coding Approach 2 Coding via local subspaces Super-vector coding - Sparsity achieved by explicitly ensuring locality - Sound theoretical justifications - Much simpler to implement and compute - Strong empirical success 7/23/

$Hierarchical Sparse Coding n x n ] R d n B R d p q \alpha w x (W,α)= L (W,α)+ λ 1$ w i i Bw i + λ2 w i Ω(α)w i i= 1 Σ(α W = (w 1 w 2 w n W) R p n S (W α R q W Ω(α) S

60 Hierarchical Sparse Coding n x n ] R d n B R d p q \alpha w x (W,α)= L (W,α)+ λ 1 W,α n W 1 + γ L (W,α) α 0, L (W,α) L (W,α)= 1 2n X BW 2 F + λ 2 S n n 1 1 n 2 x 2 w i i Bw i + λ2 w i Ω(α)w i i= 1 Σ(α W = (w 1 w 2 w n W) R p n S (W α R q W Ω(α) S (W ) 1 n Yu, Lin, & Lafferty, i= 1CVPR 11 7/23/ q n α k (φ k ) w i w T i 1. R

61 Hierarchical Sparse Coding on MNIST Yu, Lin, & Lafferty, CVPR 11 HSC vs. CNN: HSC provide even better performance than CNN more amazingly, HSC learns features in unsupervised manner! 61

62 Second-layer dictionary Yu, Lin, & Lafferty, CVPR 11 A hidden unit in the second layer is connected to a unit group in the 1 st layer: invariance to translation, rotation, and deformation 62

63 Adaptive Deconvolutional Networks for Mid and High Level Feature Learning Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011 Hierarchical Convolutional Sparse Coding. Trained with respect to image from all layers (L1-L4). Pooling both spatially and amongst features. Learns invariant midlevel features. L4 Feature Maps L3 Feature Maps L2 Feature Maps L1 Feature Maps Image Select L4 Features Select L3 Feature Groups Select L2 Feature Groups L1 Features

64 Recap: Deep Learning Ingredients Building blocks RBMs, autoencoder neural nets, sparse coding, Go deeper: Layerwise feature learning Layer-by-layer unsupervised training Layer-by-layer supervised training Fine tuning via Backpropogation If data are big enough, direct fine tuning is enough Sparsity on hidden layers are often helpful. 7/23/

65 Layer-by-layer unsupervised training + fine tuning Loss = supervised_error + unsupervised_error input prediction label Weston et al. Deep learning via semi-supervised embedding ICML 2008 Ra Slide Courtesy: Marc'Aurelio Ranzato 7/23/

66 Layer-by-Layer Supervised Training 7/23/

67 Geoff Hinton, ICASSP 2013

68 Large-scale training: Stochastic vs. batch

69 The challenge The Challenge A Large Scale problem has: lots of training samples (>10M) lots of classes (>10K) and lots of input dimensions (>10K). best optimizer in practice is on-line SGD which is naturally sequential, hard to parallelize. layers cannot be trained independently and in parallel, hard to distribute model can have lots of parameters that may clog the network, hard to distribute across machines Slide Courtesy: Marc'Aurelio Ranzato 7/23/

70 A solution by model parallelism Our Solution 1 st machine 2 nd machine 3 rd machine MODEL PARALLELISM + input #3 input #2 input #1 DATA PARALLELISM MODEL PARALLELISM 7/23/ Ranzato Le et al. Building high-level features using large-scale unsupervised learning ICML 2012 Slide Courtesy: Marc'Aurelio Ranzato Ra

71 input #3 input #2 input #1 input #3 input #2 input #1 MODEL PARALLELISM MODEL PARALLELISM + + DATA PARALLELISM DATA PARALLELISM Le et al. Building high-level features using large-scale unsupervised learning ICML 2012 Le 7/23/2013 et al. Building high-level features using large-scale unsupervised learning ICML Slide Courtesy: Marc'Aurelio Ranzato Ra Ran

72 Asynchronous SGD Asynchronous SGD L 1 PARAMETER SERVER 1 st replica 2 nd replica 3 rd replica 7/23/ Ranzato Slide Courtesy: Marc'Aurelio Ranzato

73 Asynchronous SGD Asynchronous SGD PARAMETER SERVER 1 1 st replica 2 nd replica 3 rd replica 7/23/ Ranzato Slide Courtesy: Marc'Aurelio Ranzato

74 PARAMETER SERVER L 2 1 st replica 2 nd replica 3 rd replica 7/23/ Ranzato Slide Courtesy: Marc'Aurelio Ranzato

75 PARAMETER SERVER 2 1 st replica 2 nd replica 3 rd replica 7/23/ Ranzato Slide Courtesy: Marc'Aurelio Ranzato

76 Unsupervised Learning With 1B Paramete Training A Model with 1 B Parameters Deep Net: 3 stages each stage consists of local filtering, L2 pooling, LCN - 18x18 filters - 8 filters at each location - L2 pooling and LCN over 5x5 neighborhoods training jointly the three layers by: - reconstructing the input of each layer - sparsity on the code Slide Courtesy: Marc'Aurelio Ranzato 7/23/

77 Conclusion Remark We see the promise of deep learning Big success on speech, vision, and many other Large-scale training is the key challenge More successful stories are coming 7/23/

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 CS 2750: Machine Learning Neural Networks Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 Plan for today Neural network definition and examples Training neural networks (backprop) Convolutional