Introduction to Neural Networks
1 ECE 5775 (Fall 17): High-Level Digital Design Automation
Introduction to Neural Networks
Ritchie Zhao, Zhiru Zhang
School of Electrical and Computer Engineering
2 Rise of the Machines
Neural networks have revolutionized the world:
- Self-driving vehicles
- Advanced image recognition
- Game AI

3 Rise of the Machines
Neural networks have revolutionized research.
4 A Brief History
1950s: First artificial neural networks created, based on biological structures in the human visual cortex
1980s-2000s: NNs considered inferior to other, simpler algorithms (e.g. SVM, logistic regression)
Mid-2000s: NN research considered dead; machine learning conferences outright reject most NN papers
Late 2000s-2012: NNs begin winning large-scale image and document classification contests, beating other methods
2012-Now: NNs prove themselves in many industrial applications (web search, translation, image analysis)
5 Classification Problems
We'll discuss neural networks for solving supervised classification problems.
- Given a training set consisting of labeled inputs (e.g. images labeled Cat or Bird)
- Learn a function which maps inputs to labels
- This function is used to predict the labels of new inputs (the test set)
6 A Simple Classification Problem
Inputs: pairs of numbers (x1, x2)
Labels: 0 or 1
This is a binary decision problem.
Real-life analogue: Label = Raining or Not Raining, x1 = relative humidity, x2 = cloud coverage
(Table: training set of (x1, x2, Label) rows)
7 Visualizing the Data
(Figure: plot of the training-set data points, x1 vs. x2, colored by label)

8 Decision Function
In this case, the data points can be classified with a linear decision boundary.
(Figure: the same plot with a linear decision boundary separating the two classes)
9 The Perceptron
The perceptron is the simplest possible neural network, containing only one neuron.
It is described by the following equation:
y = σ(w · x + b)
where w = weights, b = bias, σ = activation function
10 Visualizing the Perceptron
A 2-input, 1-output perceptron:
(Figure: inputs x1 and x2, scaled by weights w1 and w2, are summed with bias b and passed through σ to produce y)
Output y: probability that the label should be 1
The linear function Wx + b of the input x = (x1, x2) is followed by a non-linear function that squashes the output to the interval (0, 1).
11 Activation Function
The activation function σ is non-linear. Three common choices:
- Unit Step: 0 for x < 0, 1 for x ≥ 0. Discontinuous at the origin, zero derivative everywhere else. The first perceptrons used it.
- Sigmoid: 1 / (1 + e^(-x)). Continuous and differentiable everywhere. Early neural networks used it.
- ReLU: max(0, x). Continuous but non-differentiable at the origin. Modern deep networks use it.
12 The Perceptron Decision Boundary
Let's plot the 2-input perceptron (sigmoid activation): y = σ(w1·x1 + w2·x2 + b)
(Plots for (w1, w2) = (10, 10), (5, 20), (20, 5))
The ratio of the weights changes the direction of the decision boundary.

13 The Perceptron Decision Boundary
y = σ(w1·x1 + w2·x2 + b)
(Plots for (w1, w2) = (10, 10), (5, 5), (2, 2))
The magnitude of the weights changes the steepness of the decision boundary.

14 The Perceptron Decision Boundary
y = σ(w1·x1 + w2·x2 + b)
(Plots for b = 0, b = 5, b = -5)
The bias moves the boundary away from the origin.
15 Finding the Weights and Bias
With the right weights and bias we can create any (linear) decision boundary we want!
But how do we compute the weights and bias?
Solution: backpropagation training
16 Backpropagation
Let's take the 2-input perceptron and initialize the weights and bias randomly.
(Figure: perceptron with inputs x1, x2, randomly initialized weights w1, w2, and bias b)
Inspired by course slides from L. Fei-Fei, J. Johnson, S. Yeung, CS231n at Stanford University

17 Backpropagation
Send one input pair through and calculate the output.
Input: (x1, x2) with x1 = 0.7   Output: y = 0.95   Target: 0

18 Backpropagation Training
How do we change {w1, w2, b} to move y towards the target?
Answer: partial derivatives ∂y/∂w1, ∂y/∂w2, ∂y/∂b
19 Backpropagation
We can calculate the derivatives function-by-function, starting from the output.
y = σ(s), so
∂y/∂y = 1.0
∂y/∂s = e^(-s) / (1 + e^(-s))² = y(1 - y)

20 Backpropagation
Propagate partial derivatives backwards.
s = w1·x1 + w2·x2 + b, so
∂s/∂b = 1   ∂s/∂w1 = x1   ∂s/∂w2 = x2

21 Backpropagation
What we've done is apply the chain rule of derivatives:
∂y/∂w1 = (∂y/∂s) · (∂s/∂w1), with ∂s/∂w1 = x1 = 0.7
22 Backpropagation
In general, any operator f in a neural network defines a forward pass and a backward pass.
Forward: compute f(x, y)
Backward: ∂E/∂x = (∂E/∂f) · (∂f/∂x) and ∂E/∂y = (∂E/∂f) · (∂f/∂y)
Backpropagation allows us to build complex networks and still be able to update the weights.
23 Weight Update
We now know how to compute ∂y/∂w1, ∂y/∂w2, ∂y/∂b.
Parameter update rule (for minimizing y): w ← w - η · ∂y/∂w, where η is the learning rate or step size.
This is optimization by gradient descent: we move in the direction of steepest descent, i.e. against the gradient.
24 Practical Differences
For simplicity we minimized y directly, since the label t = 0. In practice we use a loss function L(y, t):
- Square loss: L(y, t) = (y - t)²
- Cross entropy: L(y, t) = -t·ln(y) - (1 - t)·ln(1 - y)
For simplicity we dealt with one training sample. In practice we process multiple training samples (a minibatch) and sum the gradients before updating the weights.
25 Deep Neural Network
The perceptron's linear decision boundary is not very useful for complex problems.
A deep neural network (DNN) consists of many layers, each containing multiple neurons. This network is fully connected.
(Figure: input layer, hidden layers, output layer)
26 What Happens in a DNN?
Each hidden unit learns a hidden feature. Later features depend on earlier features in the network, so they get more complex.
(Figure: simple features in early layers, more complex features in later layers)

27 Deep Neural Network
A deep neural network creates highly non-linear decision boundaries.
(Figure: decision boundaries learned with 2, 3, and 4 hidden units)
28 Convolutional Neural Network
A convolutional neural network (CNN) is a neural network specifically built to analyze images, and is the state of the art in image classification.
Key difference: the layers at the front are convolutional layers; fully-connected layers follow at the end.

29 The Convolutional Layer
Fully-connected layer:
1. An output neuron is connected to each input neuron.
Convolutional layer:
1. Inputs and outputs are vectors of 2D feature maps.
2. An output pixel is connected to its neighboring region on each input feature map.
3. All pixels on an output feature map use the same weights.

30 The Convolutional Layer
The conv layer moves a weight filter across the input feature map, computing a dot product at each location.
The weight filter detects local features in the input map.
31 Intuition on CNNs
Convolutional layers detect visual features. This shows features in the input image which cause high activity, across 3 different conv layers in a CNN.
Image credit: H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks," CACM, Oct. 2011
32 Accelerating CNNs on FPGA
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang 1, Peng Li 2, Guangyu Sun 1,3, Yijin Guan 1, Bingjun Xiao 2, Jason Cong 2,3,1
1 Center for Energy-Efficient Computing and Applications, Peking University
2 Computer Science Department, University of California, Los Angeles
3 PKU/UCLA Joint Research Institute in Science and Engineering
FPGA'15, Feb. 2015
33 Convolutional Layer Code
(Figure: N input feature maps, M×N K×K filters, M output feature maps of size R×C)

1 for (row = 0; row < R; row++) {
2   for (col = 0; col < C; col++) {          // loops over output pixels
3     for (to = 0; to < M; to++) {           // loop over output feature maps
4       for (ti = 0; ti < N; ti++) {         // loop over input feature maps
5         for (i = 0; i < K; i++) {
6           for (j = 0; j < K; j++) {        // loops over filter pixels
              output_fm[to][row][col] +=
                  weights[to][ti][i][j] * input_fm[ti][s*row+i][s*col+j];
            }}}}}}
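The six-loop nest above can be made self-contained by fixing small, hypothetical layer dimensions. This sketch uses unit stride (S = 1) and zero-initializes the output maps, which the slide leaves implicit; the sizes are toy values chosen only for illustration:

```c
#include <string.h>

/* Minimal instance of the six-loop convolutional layer:
 * M output maps, N input maps, R x C output pixels, K x K filters, stride S. */
enum { M = 2, N = 3, R = 4, C = 4, K = 3, S = 1 };

void conv_layer(float output_fm[M][R][C],
                float input_fm[N][R + K - 1][C + K - 1],
                float weights[M][N][K][K]) {
    memset(output_fm, 0, sizeof(float) * M * R * C);
    for (int row = 0; row < R; row++)          /* loops over output pixels */
      for (int col = 0; col < C; col++)
        for (int to = 0; to < M; to++)         /* loop over output maps    */
          for (int ti = 0; ti < N; ti++)       /* loop over input maps     */
            for (int i = 0; i < K; i++)        /* loops over filter pixels */
              for (int j = 0; j < K; j++)
                output_fm[to][row][col] +=
                    weights[to][ti][i][j] *
                    input_fm[ti][S * row + i][S * col + j];
}
```

With all-ones inputs and weights, every output pixel accumulates N·K·K = 27 products, which makes the nest easy to sanity-check.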
34 Challenges for FPGA
We can't just unroll all the loops due to limited FPGA resources; we must choose the right pragmas and code transformations.
- Limited compute and routing resources (the PEs). Solution: loop pipelining
- Limited on-chip storage (the buffers). Solution: loop tiling
- Limited off-chip bandwidth (the external memory bus). Solution: data reuse, compute-memory balance
(Figure: computation resources PE-1 .. PE-n, interconnect, on-chip memory buffers, off-chip bus to external memory)
35 Loop Tiling
(The untiled six-loop nest from slide 33 again, for reference: loops 1-2 iterate over the pixels in an output map, loops 3-4 over the feature maps, and loops 5-6 over the filter pixels.)
36 Loop Tiling
(Figure: the output feature maps are partitioned into tiles of Tr × Tc pixels)

1 for (row = 0; row < R; row += Tr) {
2   for (col = 0; col < C; col += Tc) {                  // loops over tiles in the output map
3     for (to = 0; to < M; to++) {
4       for (ti = 0; ti < N; ti++) {
5         for (trr = row; trr < min(row+Tr, R); trr++) {   // loops over pixels
6           for (tcc = col; tcc < min(col+Tc, C); tcc++) { // in each tile
7             for (i = 0; i < K; i++) {
8               for (j = 0; j < K; j++) {
                  output_fm[to][trr][tcc] +=
                      weights[to][ti][i][j] * input_fm[ti][s*trr+i][s*tcc+j];
                }}}}}}}}
37 Code with Loop Tiling
Software portion (maximize memory reuse):
1 for (row = 0; row < R; row += Tr) {
2   for (col = 0; col < C; col += Tc) {
3     for (to = 0; to < M; to += Tm) {
4       for (ti = 0; ti < N; ti += Tn) {
          // software: write output feature map
          // software: write input feature map + filters
FPGA portion (maximize computational performance):
5         for (trr = row; trr < min(row+Tr, R); trr++) {
6           for (tcc = col; tcc < min(col+Tc, C); tcc++) {
7             for (too = to; too < min(to+Tm, M); too++) {
8               for (tii = ti; tii < min(ti+Tn, N); tii++) {
9                 for (i = 0; i < K; i++) {
10                  for (j = 0; j < K; j++) {
                      output_fm[too][trr][tcc] +=
                          weights[too][tii][i][j] * input_fm[tii][s*trr+i][s*tcc+j];
                    }}}}}}
          // software: read output feature map
        }}}}
38 Optimizing for On-Chip Performance
Reorder the two filter loops (i, j) to the top level of the FPGA portion, pipeline the loop over tile pixels, and fully unroll the too and tii loops:

for (i = 0; i < K; i++) {
  for (j = 0; j < K; j++) {
    for (trr = row; trr < min(row+Tr, R); trr++) {
      for (tcc = col; tcc < min(col+Tc, C); tcc++) {
        #pragma HLS pipeline
        for (too = to; too < min(to+Tm, M); too++) {
          #pragma HLS unroll
          for (tii = ti; tii < min(ti+Tn, N); tii++) {
            #pragma HLS unroll
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] * input_fm[tii][s*trr+i][s*tcc+j];
          }}}}}}

Generated hardware: a Tm × Tn array of multipliers with accumulation.
Performance is determined by the tile factors Tm and Tn; the amount of data transfer is determined by Tr and Tc.
39 Design Space Complexity
Challenge: the available optimizations present a huge space of possible designs.
- Optimal loop order? Tile size for each loop?
- Implementing and testing each design by hand will be slow and error-prone
- Some designs exceed the on-chip resources
Solution: performance modeling + automated design space exploration
40 Performance Modeling
We calculate the following design metrics:
- Total number of operations (FLOP): depends on the CNN model parameters
- Total external memory access (Byte): depends on the CNN weight and activation size
- Total execution time (s): depends on the hardware architecture (e.g. tile factors Tm and Tn)
Ignore resource constraints for now.
Execution Time = Number of Cycles × Clock Period
Number of Cycles ≈ ⌈M/Tm⌉ × ⌈N/Tn⌉ × R × C × K²
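The cycle-count estimate is easy to turn into a helper. A sketch, assuming one multiply-accumulate per PE per cycle and ceiling division over the tile factors (the function name is illustrative):

```c
/* Estimated cycle count for the tiled accelerator: with a Tm x Tn array of
 * parallel MACs, each (Tm, Tn) pair of feature-map tiles needs R*C*K*K
 * cycles, and there are ceil(M/Tm) * ceil(N/Tn) such pairs. */
long estimate_cycles(long M, long N, long R, long C, long K,
                     long Tm, long Tn) {
    long m_tiles = (M + Tm - 1) / Tm;   /* ceil(M / Tm) */
    long n_tiles = (N + Tn - 1) / Tn;   /* ceil(N / Tn) */
    return m_tiles * n_tiles * R * C * K * K;
}
```

For example, a layer with M = N = 64, R = C = 32, K = 3 fully tiled at Tm = Tn = 64 needs one pass of 32·32·9 = 9216 cycles; halving both tile factors quadruples the count.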
41 Performance Modeling
Computational Throughput (GFLOP/s) = Total number of operations / Total execution time
Computation-to-Communication (CTC) Ratio (FLOP/Byte) = Total number of operations / Total external memory access
Required Bandwidth (GByte/s) = Computational Throughput / CTC Ratio
(Figure: same architecture diagram of PEs, on-chip buffers, and external memory)
42 Roofline Method
(Figure: a design plotted with Computational Throughput (GFLOP/s) on the y-axis and CTC Ratio (FLOP/Byte) on the x-axis; the slope of the line from the origin to the design is its required bandwidth in GByte/s)

43 Roofline Method
(Figure: the roofline. The computational roof is the maximum throughput the on-chip resources allow; the bandwidth roof is the device memory bandwidth. A design whose theoretical throughput lies above the bandwidth roof exceeds the device bandwidth: it is bandwidth-bottlenecked, and its real throughput falls down onto the roofline.)
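The attainable throughput under the roofline is the smaller of the two roofs. A small illustrative helper (the units follow the slides; the function itself is an assumption, not from the source):

```c
/* Roofline model: attainable throughput (GFLOP/s) is capped both by the
 * computational roof and by (memory bandwidth * CTC ratio), the
 * bandwidth-bound throughput for this design's data-reuse pattern. */
double attainable_gflops(double comp_roof_gflops,
                         double bandwidth_gbytes,
                         double ctc_flop_per_byte) {
    double bw_bound = bandwidth_gbytes * ctc_flop_per_byte;
    return bw_bound < comp_roof_gflops ? bw_bound : comp_roof_gflops;
}
```

A design with CTC = 10 FLOP/Byte on a 4 GByte/s device is bandwidth-bound at 40 GFLOP/s even if the compute roof is 100 GFLOP/s; raising the CTC ratio past 25 makes it compute-bound instead.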
44 Design Space Exploration
(Figure: candidate designs plotted against the bandwidth roof and the computational roof, with two designs A and B highlighted on the roofline)
45 Experimental Evaluation
The accelerator design is implemented with Vivado HLS on the VC707 board containing a Virtex-7 FPGA.

Hardware resource utilization:
Resource:    DSP   BRAM   LUT   FF
Utilization: 80%   50%    61%   34%

Comparison of theoretical and real performance (Layer 2):
Theoretical 5.11 ms, Measured 5.25 ms, Difference ~5%
46 Experimental Results
(Figure: speedup and energy bar charts comparing 1 thread -O3, 16 threads -O3, and the FPGA accelerator)
CPU: Xeon E (32nm), 16 cores, 2.2 GHz, gcc -O3 + OpenMP 3.0
FPGA: Virtex7-485t (28nm), 448 PEs, 100 MHz, Vivado + Vivado HLS
47 FIN
More informationCaffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. X, NO. X, DEC 2016 1 Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationConvolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech
Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationFaster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 7. Recurrent Neural Networks (Some figures adapted from NNDL book) 1 Recurrent Neural Networks 1. Recurrent Neural Networks (RNNs) 2. RNN Training
More informationDeep Learning. Architecture Design for. Sargur N. Srihari
Architecture Design for Deep Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation
More informationEnergy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster
Energy-Efficient CNN Implementation on a Deeply Pipelined Cluster Chen Zhang 1, Di Wu, Jiayu Sun 1, Guangyu Sun 1,3, Guojie Luo 1,3, and Jason Cong 1,,3 1 Center for Energy-Efficient Computing and Applications,
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationSupervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)
Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationKnowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey.
Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN
More informationPerceptron: This is convolution!
Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image
More informationCOMP 551 Applied Machine Learning Lecture 16: Deep Learning
COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all
More informationDeep Learning. Deep Learning provided breakthrough results in speech recognition and image classification. Why?
Data Mining Deep Learning Deep Learning provided breakthrough results in speech recognition and image classification. Why? Because Speech recognition and image classification are two basic examples of
More informationObject Detection Lecture Introduction to deep learning (CNN) Idar Dyrdal
Object Detection Lecture 10.3 - Introduction to deep learning (CNN) Idar Dyrdal Deep Learning Labels Computational models composed of multiple processing layers (non-linear transformations) Used to learn
More informationSu et al. Shape Descriptors - III
Su et al. Shape Descriptors - III Siddhartha Chaudhuri http://www.cse.iitb.ac.in/~cs749 Funkhouser; Feng, Liu, Gong Recap Global A shape descriptor is a set of numbers that describes a shape in a way that
More informationNeural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Example Learning Problem Example Learning Problem Celebrity Faces in the Wild Machine Learning Pipeline Raw data Feature extract. Feature computation Inference: prediction,
More informationDEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA
DEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA J.Jayalakshmi 1, S.Ali Asgar 2, V.Thrimurthulu 3 1 M.tech Student, Department of ECE, Chadalawada Ramanamma Engineering College, Tirupati Email
More informationM.Tech Student, Department of ECE, S.V. College of Engineering, Tirupati, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 High Performance Scalable Deep Learning Accelerator
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationConvolutional Neural Networks
Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
More informationDEEP NEURAL NETWORKS FOR OBJECT DETECTION
DEEP NEURAL NETWORKS FOR OBJECT DETECTION Sergey Nikolenko Steklov Institute of Mathematics at St. Petersburg October 21, 2017, St. Petersburg, Russia Outline Bird s eye overview of deep learning Convolutional
More informationAttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)
AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015) Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon. State-of-the-art frameworks for object
More informationDeep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group
Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies
More informationCombined Weak Classifiers
Combined Weak Classifiers Chuanyi Ji and Sheng Ma Department of Electrical, Computer and System Engineering Rensselaer Polytechnic Institute, Troy, NY 12180 chuanyi@ecse.rpi.edu, shengm@ecse.rpi.edu Abstract
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationCOMPUTATIONAL INTELLIGENCE
COMPUTATIONAL INTELLIGENCE Fundamentals Adrian Horzyk Preface Before we can proceed to discuss specific complex methods we have to introduce basic concepts, principles, and models of computational intelligence
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More information
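The perceptron equation above, y = σ(Σᵢ wᵢxᵢ + b), can be sketched in a few lines of code. This is a minimal illustration, not part of the lecture materials; the weights, bias, and input values below are hypothetical placeholders chosen only to show the computation.

```python
import math

def perceptron(x, w, b):
    """Single neuron: linear combination w·x + b followed by a sigmoid.

    Returns a value in (0, 1), interpreted as the probability
    that the label should be 1.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part: Wx + b
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid squashes to (0, 1)

# Hypothetical 2-input example, e.g. x = (relative humidity, cloud coverage)
y = perceptron([0.8, 0.3], w=[2.0, -1.0], b=0.5)
```

With zero weights and zero bias the linear part is 0, so the sigmoid outputs exactly 0.5 — the neuron is maximally uncertain, which matches the probabilistic reading of the output.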