Neural Computer Architectures
1 Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date:
2 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
3 Biological Neural Networks 2
4 Biological Neural Networks: presynaptic neuron, postsynaptic neuron, cell body, synapses
5 Perceptron Model (1957): feed-forward processing; the weights are tuned by learning; it cannot solve non-linearly separable problems (shown in 1969). Computation: p = b + Σ_{k=1}^{K-1} x[k]·w[k], y = φ(p), where b is the bias and φ is a step or sigmoid activation.
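A minimal sketch (not from the slides) of the perceptron computation above; the AND weights are an illustrative choice:

```python
import math

def perceptron(x, w, b, activation="step"):
    """Weighted sum of the inputs plus a bias, followed by a non-linearity."""
    p = b + sum(xk * wk for xk, wk in zip(x, w))
    if activation == "step":
        return 1.0 if p >= 0 else 0.0
    return 1.0 / (1.0 + math.exp(-p))  # sigmoid alternative

# A single perceptron can realize linearly separable functions such as AND,
# but no choice of w and b realizes XOR (the 1969 critique).
w_and, b_and = [1.0, 1.0], -1.5
outputs = [perceptron([a, c], w_and, b_and) for a, c in [(0, 0), (0, 1), (1, 0), (1, 1)]]
assert outputs == [0.0, 0.0, 0.0, 1.0]
```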
6 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
7 Multi-Layer Perceptron (1979): an input layer, a hidden layer, and an output layer; training is done by error back-propagation against a target output.
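A minimal sketch (not from the slides) of error back-propagation for a one-hidden-layer perceptron, trained on XOR, the classic non-linearly-separable case; the hidden size, learning rate, and seed are illustrative choices:

```python
import math, random

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

random.seed(0)
H = 4                                                   # hidden units (illustrative)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]   # XOR

def forward(x):
    h = [sigmoid(b1[j] + sum(W1[j][k] * x[k] for k in range(2))) for j in range(H)]
    y = sigmoid(b2 + sum(W2[j] * h[j] for j in range(H)))
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

before = loss()
lr = 1.0
for _ in range(2000):
    for x, t in data:
        h, y = forward(x)
        dy = 2 * (y - t) * y * (1 - y)                  # output-layer error term
        for j in range(H):
            dh = dy * W2[j] * h[j] * (1 - h[j])         # error propagated back to hidden unit j
            W2[j] -= lr * dy * h[j]
            for k in range(2):
                W1[j][k] -= lr * dh * x[k]
            b1[j] -= lr * dh
        b2 -= lr * dy

assert loss() < before                                  # training reduced the error
```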
8 The Hype Curve of Neural Networks: level of interest over time: the Perceptron (1957), the non-linear separability critique (1969), the Multi-Layer Perceptron revival, the SVM (1998), and today's resurgence.
9 Deep Big Neural Networks: deep, big neural networks outperform SVMs; ANNs are now state-of-the-art classifiers again. 5 layers, 1000s of nodes, connection constraints. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," ICML 2007.
10 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
11 Classification: Face detection 10
12 Intelligent Vision Applications: an emerging field of research with applications in many domains, e.g. security, industrial, medical, automotive.
14 Intelligent Vision Applications (medical example): detected states such as "old man", "breathing", "heart beat", "no action".
17 Classical recognition systems are stupid: the design is based on knowledge of the task, a carefully tuned pipeline of algorithms that gets really complex for real-world problems, and the design must be redone if the task changes. Typical pipeline stages: light correction, histogram stretch, colour thresholding, edge detection, corner detection, shape recognition, Hough transform, matching, neural networks.
18 Train a Neural Network for the task: focus on data instead of algorithm complexity; pre-process the data to generate more examples; use a test set to verify generalization. Classes: 30, 50, 60, 70, 80, 90, and 100 km/h, plus background. Background images are hard to suppress, so random background image patches are added as training examples.
19 Biologically inspired object recognition: the Convolutional Neural Network, a deep and big neural network. Input 32x32 → C1 feature maps 28x28 (5x5 convolution) → S1 feature maps 14x14 (2x2 subsampling) → C2 feature maps 10x10 (5x5 convolution) → S2 feature maps 5x5 (2x2 subsampling) → neuron layers n1 (5x5 convolution, 100 neurons) and n2 (1x1 convolution) → output sign. The convolution and subsampling stages perform feature extraction; the final layers perform classification.
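The feature-map sizes in this pipeline follow directly from valid convolution and non-overlapping subsampling; a short sketch checking the arithmetic:

```python
def conv_out(n, k):
    """Valid k x k convolution: the output shrinks by k - 1."""
    return n - k + 1

def subsample_out(n, s):
    """Non-overlapping s x s subsampling halves (for s=2) each dimension."""
    return n // s

n = 32                                    # input 32x32
n = conv_out(n, 5);      assert n == 28   # C1 feature maps
n = subsample_out(n, 2); assert n == 14   # S1 feature maps
n = conv_out(n, 5);      assert n == 10   # C2 feature maps
n = subsample_out(n, 2); assert n == 5    # S2 feature maps
n = conv_out(n, 5);      assert n == 1    # n1: a 5x5 convolution reduces S2 to 1x1
```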
20 Detection and Recognition Application 19
24 Speed Sign Detection and Recognition 23
25 Advantage of flexibility: extend an existing trained network; add new road signs and restart the training; a new weight file is new functionality; send the new weight file to users (~100 KB).
27 Major road detection 26
28 What can these NNs further do: classification, approximation, optimization, clustering.
29 Function Approximation: stock market prediction (Black-Scholes).
30 Placement Optimization: chip routing (the Canneal workload); minimize wire length with a Hopfield Neural Network.
31 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
32 Technology Constraints 31 Dark Silicon Defect tolerance
33 Dark Silicon: what to do with chips that are too hot? Reduce the clock frequency; go multi-core. If the chip is still too hot? Turn parts of the chip off! This generates dark silicon.
34 Energy Efficiency: supercomputer (K computer, Fujitsu): 8.2 billion Megaflops at 9.9 million watts, ~800 Megaflops/watt. iPad: 2.5 watts, ~68 Megaflops/watt. Human brain: 2.2 billion Megaops at 20 watts, ~110 Teraops/watt.
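The efficiency ratios on this slide follow from simple division; a sketch checking the arithmetic (treating ops and flops as comparable, as the slide implicitly does):

```python
# K computer (Fujitsu): 8.2 billion Megaflops at 9.9 million watts
k_mflops_per_watt = 8.2e9 / 9.9e6
assert 800 <= k_mflops_per_watt <= 850        # ~828, quoted on the slide as ~800

# Human brain estimate: 2.2 billion Megaops at 20 watts
brain_teraops_per_watt = (2.2e9 * 1e6) / 20 / 1e12   # Megaops -> ops, then per watt
assert abs(brain_teraops_per_watt - 110) < 1e-6

# The brain comes out roughly 130,000x more efficient per operation
ratio = (brain_teraops_per_watt * 1e6) / k_mflops_per_watt
assert 1.0e5 < ratio < 1.5e5
```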
35 Toward Heterogeneous Systems: efficient accelerators, multi-purpose ASICs. The ANN is a candidate: flexible functionality, state-of-the-art results, parallelism.
36 Developing ANN Accelerators: the kernel is y_i = Sigmoid(b_i + Σ_k x_k · w_ik), or in pseudocode:

    for i = 1:N
        Y[i] = Bias[i]
        for k = 1:K
            Y[i] += X[k] * W[i][k]
        Y[i] = Sigmoid(Y[i])
37 Time-Multiplexed Accelerator: the same loop nest mapped onto a single time-shared datapath. Per neuron: load the bias, perform the MACCs, approximate the sigmoid φ(x) = 1/(1 + exp(-a·x)). The bias load reuses the MACC path by feeding a constant input X = 1 whose weight is the bias, W[i][1] = Bias[i].
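A behavioural sketch (not the slide's RTL) of the time-multiplexed datapath, including the bias-as-weight trick: a constant input 1 is prepended so the pipeline needs only MACCs plus an activation:

```python
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def layer_direct(X, W, Bias):
    """Reference: y_i = sigmoid(b_i + sum_k x_k * w_ik)."""
    return [sigmoid(Bias[i] + sum(X[k] * W[i][k] for k in range(len(X))))
            for i in range(len(W))]

def layer_time_multiplexed(X, W, Bias):
    """Fold the bias into the MACC stream: prepend constant input 1 whose
    weight is the bias, so the datapath is only MACCs + sigmoid."""
    Xe = [1.0] + X
    We = [[Bias[i]] + W[i] for i in range(len(W))]
    Y = []
    for i in range(len(We)):          # one MACC unit, time-multiplexed over neurons
        acc = 0.0
        for k in range(len(Xe)):
            acc += Xe[k] * We[i][k]   # multiply-accumulate
        Y.append(sigmoid(acc))
    return Y

X = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]
Bias = [0.05, -0.1]
a = layer_direct(X, W, Bias)
b = layer_time_multiplexed(X, W, Bias)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```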
38 Analog: the Intel ETANN, Electrically Trainable Analog Neural Network. Analog Gilbert-multiplier circuits; sum differential currents from the synapses and convert to a voltage; weights stored as electrical charge on floating gates; analog sigmoid activation function.
39 Digital Implementation: multiply-accumulate datapath plus a sigmoid-function look-up table; within each table segment, use the linear approximation φ(x) ≈ a_i·x + b_i.
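A sketch of such a piecewise-linear sigmoid LUT; the range, segment count, and slope a = 1 are illustrative choices, not the slide's parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Build a small table of (a_i, b_i) line segments over [-8, 8].
LO, HI, SEGS = -8.0, 8.0, 32
step = (HI - LO) / SEGS
table = []
for i in range(SEGS):
    x0, x1 = LO + i * step, LO + (i + 1) * step
    a = (sigmoid(x1) - sigmoid(x0)) / step   # segment slope a_i
    b = sigmoid(x0) - a * x0                 # segment intercept b_i
    table.append((a, b))

def sigmoid_pwl(x):
    """Look up the segment and evaluate a_i * x + b_i; clamp outside range."""
    if x <= LO:
        return 0.0
    if x >= HI:
        return 1.0
    a, b = table[min(int((x - LO) / step), SEGS - 1)]
    return a * x + b

# With 32 segments, the worst-case error over [-10, 10] is already tiny.
worst = max(abs(sigmoid_pwl(x / 100) - sigmoid(x / 100)) for x in range(-1000, 1001))
assert worst < 0.01
```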
40 SIMD design Adaptive Solutions N
41 Conversion to vector operations: y_i[n] = b_i + Σ_k x_k[n]·w_ik, which vectorizes per sample to y[n] = b + x[n]·W and over a whole batch to Y = B + X·W.
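A sketch of the equivalence: the scalar triple loop and the batch vector-matrix form compute the same Y (data values are illustrative):

```python
def loop_form(X, W, B):
    """y_i[n] = b_i + sum_k x_k[n] * w_ik, computed scalar by scalar."""
    N, I, K = len(X), len(B), len(W[0])
    Y = [[0.0] * I for _ in range(N)]
    for n in range(N):
        for i in range(I):
            acc = B[i]
            for k in range(K):
                acc += X[n][k] * W[i][k]
            Y[n][i] = acc
    return Y

def batch_form(X, W, B):
    """Y = B + X * W^T: every row y[n] is one vector-matrix product."""
    return [[b + sum(xk * wk for xk, wk in zip(x, w)) for w, b in zip(W, B)]
            for x in X]

X = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]    # 3 input samples
W = [[0.1, -0.2], [0.3, 0.4]]                # weights of 2 neurons
B = [0.5, -0.5]
Y1, Y2 = loop_form(X, W, B), batch_form(X, W, B)
assert all(abs(p - q) < 1e-12 for r1, r2 in zip(Y1, Y2) for p, q in zip(r1, r2))
```

Once the layer is one matrix product, the accelerator only has to be a fast, regular matrix multiplier, which is exactly what the systolic designs on the next slides implement.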
42 Systolic Matrix Multiplication 41 Siemens MA16 High efficiency Low flexibility
43 An example state-of-the-art accelerator 42
44 Systolic 2D Convolution 43
45 Convolutional Neural Network: data reuse. Input 32x32 → C1 feature maps 28x28 (5x5 convolution) → S1 feature maps 14x14 (2x2 subsampling) → C2 feature maps 10x10 (5x5 convolution) → S2 feature maps 5x5 (2x2 subsampling) → neuron layers n1 and n2 (5x5 and 1x1 convolutions) → output sign.
46 Reduce Memory Accesses: configurable number of input maps, configurable number of output maps.
47 Is it worth the effort? Even more important is the energy efficiency.
48 More Flexibility and Better Memory Behaviour? 47
49 The performance bottleneck: huge data transfer requirements (3.4 billion per layer); exploit data reuse with local memories. [Figure: energy for data transfer [J] versus total on-chip cache size [words], with separate DRAM, cache, and total curves.]
50 Accelerator Template: FPGA prototyping platform Xilinx Virtex 6; designed with Vivado High-Level Synthesis (HLS). [Datapath: in/out controllers move in_img, weight, bias, and out_img over FSLs to/from DDR; an array of MACC units (multiply, accumulate, add bias) feeds a saturating select and an activation LUT.]
51 Programmable Buffers: image and coefficient storage built from BRAMs (line buffers X0..X3, a weight BRAM, and an out-img BRAM behind a sigmoid LUT), connected to the input/output FSLs through a demux, a rotate mux, and programmable read/write address-select logic.
53 Programmable Buffers: the address-select logic reads overlapping sliding windows out of the line buffers, e.g. (x00, x01, x02, x03), (x01, x02, x03, x04), (x02, x03, x04, x05), ..., (x04, x05, x06, x07), then continues with the next row (x10, x11, x12, x13); because neighbouring windows share pixels, each pixel is written into the buffer only once but consumed many times.
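A behavioural sketch of the sliding-window read-out (the buffer hardware itself is not modelled): each pixel enters the line buffer once but is consumed by up to K windows:

```python
def sliding_windows(image, K):
    """Each K-wide window per image row, in the order the programmable
    address-select logic reads them out of the line buffers."""
    wins = []
    for row in image:
        for c in range(len(row) - K + 1):
            wins.append(tuple(row[c:c + K]))
    return wins

# Label pixels like the slide: x<row><col>
image = [[f"x{r}{c}" for c in range(8)] for r in range(2)]
wins = sliding_windows(image, 4)
assert wins[0] == ("x00", "x01", "x02", "x03")
assert wins[1] == ("x01", "x02", "x03", "x04")
assert wins[5] == ("x10", "x11", "x12", "x13")   # read-out wraps to the next row

# 16 pixels fetched once from off-chip, but 40 values fed to the MACCs:
assert len(image) * len(image[0]) == 16
assert sum(len(w) for w in wins) == 40
```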
54 Flexible Reuse Buffers: input 720x1280; layer 1 computes 6x358x feature maps with a 6x6 convolution and 2x2 subsample. [Buffer diagram: input FSLs demuxed into line-buffer BRAMs X0..X3 and weight BRAMs, with programmable address-select and rotate logic.]
57 Flexible Reuse Buffers: for vertical windows the buffers deliver, e.g., (x04, x14, x24, x34), (x14, x24, x34, x44), (x24, x34, x44, x54), (x34, x44, x54, x64), (x44, x54, x64, x74), producing outputs y00, y10, y20, y30; the rotate mux realigns the line buffers so rows already on chip are reused.
58 Support for Subsampling: with subsampling, the address-select logic delivers strided windows, e.g. (x00, x20, x40, x60), (x10, x30, x50, x70), (x20, x40, x60, x80), (x30, x50, x70, x90), (x40, x60, x80, xa0), (x01, x21, x41, x61), skipping rows and columns according to the subsample factor.
60 What would be the best compute order? Small memories have low energy per access, plus an area and latency advantage; big memories can exploit more data reuse.
61 Improve by locality-driven synthesis: loop transformations (interchange, tiling) reduce the reuse distance, but open a huge design space. Use a framework with reuse detection and cost models that model the utilized reuse and the required buffer size, and optimize for buffer size.
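A sketch of why tiling helps: for a small matrix multiplication we generate the access trace of the untiled and a tiled loop nest and measure the average reuse distance (distinct addresses between consecutive uses of the same element); the sizes and tile order are illustrative choices, not the framework's actual models:

```python
def mm_trace_untiled(N):
    trace = []
    for i in range(N):
        for j in range(N):
            for k in range(N):
                trace += [("A", i, k), ("B", k, j), ("C", i, j)]
    return trace

def mm_trace_tiled(N, T):
    trace = []
    for jj in range(0, N, T):              # loop tiling: T x T x T blocks
        for kk in range(0, N, T):
            for ii in range(0, N, T):
                for i in range(ii, ii + T):
                    for j in range(jj, jj + T):
                        for k in range(kk, kk + T):
                            trace += [("A", i, k), ("B", k, j), ("C", i, j)]
    return trace

def avg_reuse_distance(trace, array):
    """Average number of distinct addresses between consecutive uses of the
    same element of `array` (smaller = better locality = smaller buffer)."""
    last, dists = {}, []
    for pos, addr in enumerate(trace):
        if addr[0] == array:
            if addr in last:
                dists.append(len(set(trace[last[addr] + 1:pos])))
            last[addr] = pos
    return sum(dists) / len(dists)

d_untiled = avg_reuse_distance(mm_trace_untiled(8), "B")
d_tiled = avg_reuse_distance(mm_trace_tiled(8, 4), "B")
assert d_tiled < d_untiled     # tiling shortens the reuse distance of B
```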
62 Compared to manually optimized order 61 Up to 13x resource reduction Up to 11x performance increase
63 Memory bandwidth requirements? 62 Data layout transformation Bandwidth up to 150 MB/s Better than an optimized Intel implementation
64 What do we achieve? A flexible architecture template, HLS vision cores, and iteration-reordering models to minimize data transfer; small but flexible accelerators, up to 13x smaller and up to 11x faster; 4.5 watt (XPower Analyzer) plus 0.5 watt for external RAM.
65 Beyond Energy: Defect-Tolerant Accelerators? A growing number of defects affects the design of micro-architectures. Homogeneous architectures use core redundancy and switch off the defective cores; how about heterogeneous designs? A little story: defect-tolerant accelerators.
66 Defect-Tolerant ANNs: memory decoder; spatially unfolding a network; power reduction and memory bandwidth; time-multiplexing.
67 Hardware ANN Robustness: an ANN with 90 inputs and 10 outputs. Olivier Temam, "A Defect-Tolerant Accelerator for Emerging High-Performance Applications," ACM/IEEE International Symposium on Computer Architecture (ISCA), June 2012.
68 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
69 Beyond ANNs: Biological NNs. Understand the mind by simulating the brain: model perception, model memory, etc. Understand brain diseases: Parkinson's, Alzheimer's, etc. Software simulators such as Emergent and NEURON, compared by neuron count, synapse count, and update rate (Hz).
70 Can computers do the same? Blue Brain Project (IBM/EPFL): molecular level, 10^4 neurons on 10^3 cores. SpiNNaker: integrate-and-fire, 10^9 neurons on 10^4 ARM9 cores.
71 SpiNNaker Chip Architecture 70
72 SpiNNaker interconnect: a connection hierarchy; group neurons to reduce inter-chip communication; 128 MB SDRAM; small packets; routing tables.
73 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
74 Size: digital CMOS technology is available and implements useful accelerators, but is not dense enough for the largest bio-inspired networks; analog gives a much denser implementation. Recall the biological neuron.
75 Analog Spiking Neurons: Kirchhoff's law, capacitive integration, leakage; ~14 transistors.
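A behavioural sketch of what such a circuit computes, as a leaky integrate-and-fire neuron (the time constants and thresholds are illustrative, not the circuit's values):

```python
def simulate_lif(I, steps, dt=0.1, tau=10.0, v_th=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: capacitive integration with leakage.
    dv/dt = (-v + I) / tau; emit a spike and reset when v crosses v_th."""
    v, spikes = v_reset, 0
    for _ in range(steps):
        v += dt * (-v + I) / tau     # leak toward 0 while integrating input current
        if v >= v_th:
            spikes += 1
            v = v_reset
    return spikes

# A strong input drives the membrane over threshold and produces spikes;
# a weak input leaks away and never reaches threshold.
assert simulate_lif(I=2.0, steps=1000) > 0
assert simulate_lif(I=0.5, steps=1000) == 0
```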
76 Architecture: the FACETS project: integrate-and-fire neurons on a wafer, 60 million synapses; most area is used for synapses, storing the connection strengths; 2-D interconnect.
77 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
78 Synapses as Memristors, Intel (2012): the memristor can be used as a switch, and also for analog storage of the memristance.
79 Beyond Silicon: the Infineon NeuroChip (2003) directly uses biological networks, but is difficult to connect to other devices.
80 Convergence of different domains: Neurobiology, Machine Learning, Applications, Technology Constraints, Neuromorphic Innovations
Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationCOMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017
COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization
More informationAdvanced Synthesis Techniques
Advanced Synthesis Techniques Reminder From Last Year Use UltraFast Design Methodology for Vivado www.xilinx.com/ultrafast Recommendations for Rapid Closure HDL: use HDL Language Templates & DRC Constraints:
More informationOUTLINE Introduction Power Components Dynamic Power Optimization Conclusions
OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism
More informationINTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)
INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationSpiral 2-8. Cell Layout
2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric
More informationA VARIETY OF ICS ARE POSSIBLE DESIGNING FPGAS & ASICS. APPLICATIONS MAY USE STANDARD ICs or FPGAs/ASICs FAB FOUNDRIES COST BILLIONS
architecture behavior of control is if left_paddle then n_state
More informationStacked Silicon Interconnect Technology (SSIT)
Stacked Silicon Interconnect Technology (SSIT) Suresh Ramalingam Xilinx Inc. MEPTEC, January 12, 2011 Agenda Background and Motivation Stacked Silicon Interconnect Technology Summary Background and Motivation
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationVLSI Design Automation
VLSI Design Automation IC Products Processors CPU, DSP, Controllers Memory chips RAM, ROM, EEPROM Analog Mobile communication, audio/video processing Programmable PLA, FPGA Embedded systems Used in cars,
More informationDeep (1) Matthieu Cord LIP6 / UPMC Paris 6
Deep (1) Matthieu Cord LIP6 / UPMC Paris 6 Syllabus 1. Whole traditional (old) visual recognition pipeline 2. Introduction to Neural Nets 3. Deep Nets for image classification To do : Voir la leçon inaugurale
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationYuki Osada Andrew Cannon
Yuki Osada Andrew Cannon 1 Humans are an intelligent species One feature is the ability to learn The ability to learn comes down to the brain The brain learns from experience Research shows that the brain
More informationMulti-Core Microprocessor Chips: Motivation & Challenges
Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005
More informationCOMPUTATIONAL INTELLIGENCE
COMPUTATIONAL INTELLIGENCE Fundamentals Adrian Horzyk Preface Before we can proceed to discuss specific complex methods we have to introduce basic concepts, principles, and models of computational intelligence
More information11/14/2010 Intelligent Systems and Soft Computing 1
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More informationReduce Your System Power Consumption with Altera FPGAs Altera Corporation Public
Reduce Your System Power Consumption with Altera FPGAs Agenda Benefits of lower power in systems Stratix III power technology Cyclone III power Quartus II power optimization and estimation tools Summary
More informationPower Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas
Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic
More informationIntroduction to Neural Networks
Introduction to Neural Networks Jakob Verbeek 2017-2018 Biological motivation Neuron is basic computational unit of the brain about 10^11 neurons in human brain Simplified neuron model as linear threshold
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationCAD for VLSI. Debdeep Mukhopadhyay IIT Madras
CAD for VLSI Debdeep Mukhopadhyay IIT Madras Tentative Syllabus Overall perspective of VLSI Design MOS switch and CMOS, MOS based logic design, the CMOS logic styles, Pass Transistors Introduction to Verilog
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationBridging Analog Neuromorphic and Digital von Neumann Computing
Bridging Analog Neuromorphic and Digital von Neumann Computing Amir Yazdanbakhsh, Bradley Thwaites Advisors: Hadi Esmaeilzadeh and Doug Burger Qualcomm Mentors: Manu Rastogiand Girish Varatkar Alternative
More informationCluster-based approach eases clock tree synthesis
Page 1 of 5 EE Times: Design News Cluster-based approach eases clock tree synthesis Udhaya Kumar (11/14/2005 9:00 AM EST) URL: http://www.eetimes.com/showarticle.jhtml?articleid=173601961 Clock network
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More information