Scale-Out Acceleration for Machine Learning
1 Scale-Out Acceleration for Machine Learning. Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, Hadi Esmaeilzadeh. Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology and University of California, San Diego. MICRO-50.
2 Data grows at an unprecedented rate. In one minute over the internet: 900,000 logins, 452,000 tweets sent, 40,000 hours listened, 156 million emails sent, 1.8 million Snaps created, 46,200 posts uploaded, $751,522 spent online, 70,107 hours watched, 3.5 million search queries, 342,000 apps downloaded.
3 Machines learn to extract insights from data. Training phase: training data feeds an ML algorithm to produce a trained model. Inference phase: the trained model makes predictions on unseen data.
5 Two disjoint solutions for ML training: distributed computing on one side, and single-node FPGA/ASIC accelerators on the other (SBCCI'03, FCCM'09, FCCM'10, AHS'11, CVPRW'11, DAC'12, ASPLOS'14, ASPLOS'15, HPCA'15, HPCA'16, MICRO'16). CoSMIC bridges the two.
7 Challenges: (1) How to distribute ML training? (2) How to design customizable accelerators? (3) How to reduce the overhead of distributed coordination? We need a full stack.
9 CoSMIC: Computing Stack for ML Acceleration In the Cloud. Programming layer: a high-level mathematical language. Compilation layer: accelerator operation scheduling and data mapping. System layer: system software orchestrating distributed accelerators. Architecture layer: a multi-threaded template architecture. Circuit layer: constructing the RTL Verilog.
10 CoSMIC workflow. The translator turns the model specification into a dataflow graph (multiply, add, subtract, compare, and mux nodes between a source and a sink). The compiler's planner produces an operation schedule and memory map for that graph. The constructor emits RTL Verilog for the accelerator from the multi-threaded template architecture, under the given circuit constraints: a thread index table, memory schedule, shifter, and prefetch buffer in front of an M-by-N grid of PEs (PE 1,1 through PE M,N) connected by a pipelined memory bus, shared row buses, and tree buses, serving worker threads #1 through #T. The system director generates the system software from the system specification.
11 Challenges, revisited. First up: challenge 1, how to distribute ML training?
12 Challenge 1: how to distribute? Understanding machine learning. Training data pairs inputs X with ground-truth outputs Y* (e.g., the input row 25,1,76,0 maps to output 1). An intermediate model with weights w_i produces a predicted output Y for each input, and the loss measures the gap between prediction and ground truth: l(w) = Y − Y*.
13 Algorithmic commonalities. Learning is solving an iterative optimization problem: given the training data (inputs X with ground-truth outputs Y*), find the weights w_i such that the loss l(w) = Y − Y* between predicted and true outputs is minimized. The result is the trained model.
14 Stochastic gradient descent solver. The loss is a function of all the model weights, l(w) = f(w_1^(t), w_2^(t), ..., w_n^(t)). Starting from an initial point, each iteration moves the weights against the gradient with learning rate μ: w_i^(t+1) = w_i^(t) − μ · ∂f(w_1^(t), ..., w_n^(t)) / ∂w_i, descending the loss surface (over W1, W2, ...) toward the optimal point.
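The update rule above can be sketched in plain NumPy. This is a hypothetical linear-regression instance, not code from the talk; the learning rate mu and the toy data are illustrative:

```python
import numpy as np

def sgd_step(w, X, y, mu=0.1):
    """One gradient-descent step: w <- w - mu * dl/dw for a linear model."""
    pred = X @ w                       # predicted outputs Y
    grad = X.T @ (pred - y) / len(y)   # gradient of the squared loss
    return w - mu * grad

# Toy data generated as y = 2*x0 + 3*x1; iterating the rule recovers the weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, 3.0])
w = np.zeros(2)
for _ in range(500):
    w = sgd_step(w, X, y)
print(np.round(w, 2))  # approaches [2. 3.]
```

Each step descends the loss surface exactly as the slide's w_i^(t+1) update describes, here applied to all weights at once as a vector.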
15 Leverage linearity of differentiation for distributed learning. The training data is split into M partitions. Each delta node i, with its own accelerator, computes a partial update on its partition: W_i^(t+1) = f(W^(t)) over data partition i. The sigma node then combines the partial results by averaging: W^(t+1) = (1/M) · Σ_i W_i^(t+1).
16 Abstraction between the algorithm and the scale-out acceleration system. Machine learning algorithms (support vector machines, regression analysis, back propagation, collaborative filtering, Kalman filters) each expose the gradient of their loss function. The parallelized stochastic gradient descent solver is the abstraction between hardware and software, and below it sit the target platforms: Xeon, Xeon Phi, GPU, FPGA, CGRA, ASIC.
17 CoSMIC programming. Math formulation (linear-regression gradient): ∂l(w)/∂w_i = (Σ_j w_j·x_j − y)·x_i, with the aggregation step w^(t+1) = (Σ_n w_n^(t+1)) / n. CoSMIC program: h = sum[i](w[i] * x[i]); d = h - y; g[i] = d * x[i]; and the aggregation: aggregator(n) { w[i] = (sum[j](w[i])) / n; }
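The CoSMIC statements map line-for-line onto plain NumPy. This is a sketch for illustration only; the Python function names are mine, not part of the CoSMIC language:

```python
import numpy as np

def gradient(w, x, y):
    """Gradient body for one training example (linear regression)."""
    h = np.sum(w * x)   # h = sum[i](w[i] * x[i])
    d = h - y           # d = h - y
    g = d * x           # g[i] = d * x[i]
    return g

def aggregator(w_replicas):
    """aggregator(n) { w[i] = (sum[j](w[i])) / n; } -- elementwise average."""
    n = len(w_replicas)
    return np.sum(w_replicas, axis=0) / n

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
g = gradient(w, x, y=4.0)                       # per-example gradient
w_avg = aggregator([w, w + 0.2, w - 0.2])       # averages back to w
```

The programmer writes only these few mathematical lines; the rest of the stack derives the dataflow graph, the hardware, and the distributed coordination from them.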
18 CoSMIC compilation. The compiler lowers the gradient body (h = sum[i](w[i] * x[i]); d = h - y; g[i] = d * x[i];) into a dataflow graph for the accelerator, and lowers aggregator(n) { w[i] = (sum[j](w[i])) / n; } into the software aggregator.
19 Challenges, revisited. Next: challenge 2, how to design customizable accelerators?
20 Challenge 2: how to design customizable accelerators? Template architecture: the chip is an M-row by N-column grid of processing engines (PE 1,1 through PE M,N) behind a memory interface.
21 Multi-threaded acceleration. The PE grid is partitioned among worker threads #1 through #T, each thread owning a band of PE rows behind the shared memory interface.
22 Connectivity and bussing. PEs are connected by a pipelined memory bus, shared row buses, tree buses, and bi-directional links between neighbors; refer to the paper for details.
23 PE architecture. Each PE is a five-stage pipeline with a data buffer, a model buffer, and an interim buffer; an ALU providing multiply, add, and compare; a non-linear unit; and shared-bus connections to the left and right PEs.
24 Challenges, revisited. Next: challenge 3, how to reduce the overhead of distributed coordination?
25 Challenge 3: how to reduce the overhead of distributed coordination? CoSMIC uses specialized system software. The sigma node software keeps internally managed thread pools for networking and aggregation: a network handler fills circular buffers that the aggregator drains, with block-by-block producer-consumer semantics between the delta node software and the sigma node.
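A toy sketch of that block-by-block producer-consumer coordination, using Python's stdlib bounded queue in place of the circular buffers (thread counts, block contents, and names are illustrative, not CoSMIC's actual implementation):

```python
import queue
import threading

buf = queue.Queue(maxsize=8)   # bounded buffer, standing in for a circular buffer
totals = []

def network_handler(blocks):
    """Networking thread: enqueue incoming weight blocks from delta nodes."""
    for block in blocks:
        buf.put(block)         # blocks the producer when the buffer is full
    buf.put(None)              # end-of-stream marker

def aggregator():
    """Aggregation thread: drain the buffer and reduce block by block."""
    while (block := buf.get()) is not None:
        totals.append(sum(block))

blocks = [[1, 2], [3, 4], [5, 6]]
threads = [threading.Thread(target=network_handler, args=(blocks,)),
           threading.Thread(target=aggregator)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals)  # [3, 7, 11]
```

The point of the design is overlap: aggregation starts as soon as the first block lands, instead of waiting for whole model replicas to arrive.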
26 Benchmarks for distributed learning:

Name       Description                    Algorithm                Model Size  Training Data
mnist      Handwritten digit recognition  Backpropagation          2,432 KB    2.9 GB
acoustic   Speech recognition modeling    Backpropagation          1,527 KB    5.6 GB
stock      Stock price prediction         Linear Regression        31 KB       14.7 GB
texture    Image texture recognition      Linear Regression        64 KB       17.9 GB
tumor      Tumor classification           Logistic Regression      8 KB        10.4 GB
cancer1    Prostate cancer diagnosis      Logistic Regression      24 KB       13.5 GB
movielens  MovieLens recommender system   Collaborative Filtering  1,176 KB    0.6 GB
netflix    Netflix recommender system     Collaborative Filtering  2,854 KB    2.0 GB
face       Human face detection           Support Vector Machine   7 KB        15.9 GB
cancer2    Cancer diagnosis               Support Vector Machine   28 KB       20.0 GB
27 List of results: 1. Performance comparison with Spark. 2. Breakdown of performance between computation and system coordination. 3. Performance comparison of FPGA-CoSMIC with programmable ASICs and GPUs. 4. Power-efficiency comparison between FPGAs, programmable ASICs, and GPUs. 5. Sensitivity to mini-batch size, compute units, and memory bandwidth. 6. Design space exploration for multithreading. 7. Comparison of the template architecture with prior accelerator designs.
29 Speedup in comparison with Spark (speedup over 4-CPU Spark, log scale, across mnist, acoustic, stock, texture, tumor, cancer1, movielens, netflix, face, cancer2, and the geomean). 16-node CoSMIC with UltraScale+ FPGAs offers 18.8× speedup over 16-node Spark with Xeon E3 Skylake CPUs. Scaling from 4 to 16 nodes with CoSMIC yields a 2.7× improvement, while Spark offers 1.8×.
30 Speedup breakdown between computation and system coordination (speedup over Spark, log scale, per benchmark and geomean, split into FPGA and system-software contributions). The CoSMIC system software makes system coordination (34% of the time) 28.4× faster, and the CoSMIC FPGA hardware makes computation (66% of the time) 20.7× faster. The overall speedup is 22.8×.
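The overall figure is consistent with combining the two component speedups weighted by their share of execution time, Amdahl-style. A quick check using the fractions and factors from the slide:

```python
# Fraction of baseline time and speedup factor for each component (from the slide).
coord_frac, coord_speedup = 0.34, 28.4   # system coordination
comp_frac, comp_speedup = 0.66, 20.7     # computation

# Amdahl-style combination: new time is the sum of each fraction divided
# by its speedup; overall speedup is the reciprocal.
overall = 1.0 / (coord_frac / coord_speedup + comp_frac / comp_speedup)
print(round(overall, 1))  # about 22.8
```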
31 Hardware is not enough; we need a novel system stack (speedup over FPGA-CoSMIC, per benchmark and geomean). P-ASIC and GPU backends provide 2.3× and 1.5× extra speedup over FPGA-CoSMIC, which is itself 22.8× faster than Spark.
32 Performance-per-watt comparison (improvement in performance-per-watt, per benchmark and geomean). With CoSMIC, FPGAs and P-ASICs provide 4.2× and 8.2× higher performance-per-watt than GPUs.
33 Conclusion. CoSMIC is a full-stack solution with an algorithmic angle. What is next: reduce communication, specialize the networking stack, and design more template architectures.
More informationMatrix Computations and " Neural Networks in Spark
Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com Training Neural Networks Datasets
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationPrimavera Compression Server 5.0 Service Pack 1 Concept and Performance Results
- 1 - Primavera Compression Server 5.0 Service Pack 1 Concept and Performance Results 1. Business Problem The current Project Management application is a fat client. By fat client we mean that most of
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationParallel and Distributed Computing with MATLAB Gerardo Hernández Manager, Application Engineer
Parallel and Distributed Computing with MATLAB Gerardo Hernández Manager, Application Engineer 2018 The MathWorks, Inc. 1 Practical Application of Parallel Computing Why parallel computing? Need faster
More informationMachine Learning on FPGAs
Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications
More informationOPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS
OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS 1 Why GPUs? A Tale of Numbers 100x Performance Increase Infrastructure Cost Savings Performance 100x gains over traditional
More informationS8873 GBM INFERENCING ON GPU. Shankara Rao Thejaswi Nanditale, Vinay Deshpande
S8873 GBM INFERENCING ON GPU Shankara Rao Thejaswi Nanditale, Vinay Deshpande Introduction AGENDA Objective Experimental Results Implementation Details Conclusion 2 INTRODUCTION 3 BOOSTING What is it?
More informationCache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time
Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +
More informationTemplates. for scalable data analysis. 2 Synchronous Templates. Amr Ahmed, Alexander J Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU
Templates for scalable data analysis 2 Synchronous Templates Amr Ahmed, Alexander J Smola, Markus Weimer Yahoo! Research & UC Berkeley & ANU Running Example Inbox Spam Running Example Inbox Spam Spam Filter
More informationHPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads. Natalia Vassilieva, Sergey Serebryakov
HPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads Natalia Vassilieva, Sergey Serebryakov Deep learning ecosystem today Software Hardware 2 HPE s portfolio for deep learning Government,
More informationAccelerator Spectrum
Active/HardBD Panel Mohammad Sadoghi, Purdue University Sebastian Breß, German Research Center for Artificial Intelligence Vassilis J. Tsotras, University of California Riverside Accelerator Spectrum Commodity
More informationIntroduction to Neural Networks
ECE 5775 (Fall 17) High-Level Digital Design Automation Introduction to Neural Networks Ritchie Zhao, Zhiru Zhang School of Electrical and Computer Engineering Rise of the Machines Neural networks have
More informationHardware Accelerators
Hardware Accelerators José Costa Software for Embedded Systems Departamento de Engenharia Informática (DEI) Instituto Superior Técnico 2014-04-08 José Costa (DEI/IST) Hardware Accelerators 1 Outline Hardware
More information