Near-Data Processing for Differentiable Machine Learning Models

Size: px

Start display at page:

Download "Near-Data Processing for Differentiable Machine Learning Models"

Sibyl Ross
6 years ago
Views:

1 Near-Data Processing for Differentiable Machine Learning Models Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2 and Sungroh Yoon 1,3 1 Electrical and Computer Engineering, Seoul National University 2 Electrical and Electronic Engineering, Yonsei University 3 Neurology and Neurological Sciences, Stanford University Correspondence: sryoon@snu.ac.kr Homepage: May 19th, 2017 Hyeokjun Choe et al. MSST 2017 May 19th, / 33

2 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

3 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

4 Machine Learning s Success Big data Powerful parallel processors Sophisticated models Hyeokjun Choe et al. MSST 2017 May 19th, / 33

5 Issues on Conventional Memory Hierachy Data movement in memory hierarchy Computational efficiency Power consumption Hyeokjun Choe et al. MSST 2017 May 19th, / 33

6 Near-data Processing (NDP) Memory or storage with intelligence (i.e., computing power) Process the data stored in memory or storage Reduce the data movements, CPU offloading Hyeokjun Choe et al. MSST 2017 May 19th, / 33

7 ISP-ML ISP-ML: a full-fledged ISP-supporting SSD platform Easy to implement machine learning algorithm in C/C++ For validation, three SGD algorithms were implemented and experimented with ISP-ML User Application SSD controller SSD OS CPU Embedded Processor(CPU) SRAM ISP SW Host I/F Main Memory DRAM SSD controller SRAM ARM Processor ISP SW SSD Channel Controller ISP HW Cache Controller Channel ISP HW Controller ISP HW NAND Flash NAND Flash NAND Flash NAND Flash Host I/F DRAM Cache Controller ISP HW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash NAND Flash NAND Flash Hyeokjun Choe et al. MSST 2017 May 19th, / 33

8 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

9 Machine Learning as an Optimization Problem Machine learning categories Supervised learning, unsupervised learning, reinforcement learning The main purpose of supervised machine learning Find the optimal θ that minimizes F (D; θ) F (D, θ) = L(D, θ) + r(θ) (1) D : input data θ : model parameters L : loss function r : regularization term F : objective function Input layer Output layer Hyeokjun Choe et al. MSST 2017 May 19th, / 33

10 Gradient Descent θ t+1 = θ t η F (D, θ t ) (2) = θ t η i F (D i, θ t ) (3) η : learning rate t : iteration index i : data sample index 1st-order iterative optimization algorithm Use all samples per iteration Stochastic gradient descent (SGD) Use only one sample per iteration. Minibatch stochastic gradient descent Between gradient descent and SGD Use multiple samples per iteration Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Downpour SGD Workers communicate with parameter server a Elastic Average SGD

11 Parallel and Distributed SGD Synchornous SGD Parameter server aggregates θ slave synchronously. Downpour SGD Workers communicate with parameter server asynchronously. Elastic Average SGD (EASGD) Each worker has own parameters Workers transfer (θ slave θ master ), not θ slave Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Cache controller Channel controller DRAM Cache and Buffer 512MB - 2GB NAND flash

12 Fundamentals of Solid-State Drives (SSDs) SSD Controller Embedded processor for FTL HDD emulation Wear Leveling, Garbage collection, etc. Cache controller Channel controller DRAM Cache and Buffer 512MB - 2GB NAND flash arrays Simultaneously accessible Host interface logic SATA, PCIe Hyeokjun Choe et al. MSST 2017 May 19th, / 33

HMC) is used for PIM recently Implement processing unit in Logic Layer

13 Previous Work on Near-Data Processing:PIM Perform computation inside the main memory 3D stacked memory (e.g. HMC) is used for PIM recently Implement processing unit in Logic Layer Applications: sorting, string matching, CNN, matrix multiplication etc. Hyeokjun Choe et al. MSST 2017 May 19th, / 33

14 Previous Work on Near-Data Processing:ISP Perform computation inside the storage ISP with embedded processor Pros: easy to implement, flexible Cons: no parallelism ISP with dedicated hardware logic Pros: channel parallelism, hardware acceleration Cons: hard to implement and change Applications: DB query (scan, join), linear regression, k-means, string match etc. Hyeokjun Choe et al. MSST 2017 May 19th, / 33

15 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Host I/F SSD controller ARM Processor Cache Controller ISP HW DRAM SRAM ISP SW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash NAND Flash NAND Flash ISP-ML: ISP Platform for

16 Host I/F SSD controller ARM Processor Cache Controller ISP HW DRAM SRAM ISP SW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash NAND Flash NAND Flash ISP-ML: ISP Platform for Machine Learning on SSDs ISP-supporting SSD simulator Implemented in SystemC on the Synopsys Platform Architect Software/Hardware co-simulation Easily executes various machine learning algorithms in C/C++ Transaction level simulator For reasonable simulation speed ISP components SRAM ISP SW, ISP HW User Application OS CPU Main Memory SSD Host I/F DRAM SSD controller Embedded Processor(CPU) Cache Controller ISP HW SRAM ISP SW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash SSD NAND Flash NAND Flash Embedded Processor clk/rst Host I/F Cache Controller DRAM Channel Controller NAND Flash Hyeokjun Choe et al. MSST 2017 May 19th, / 33

17 Host I/F ARM Processor Cache Controller ISP HW DRAM SRAM ISP SW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash NAND Flash NAND Flash ISP-ML: ISP Platform for Machine Learning on SSDs We implemented two types of ISP hardware components. Channel controller: perform primitive operations on the stored data. Cache controller: collect the results from each of the channel controller. Master-slave architecture They communicate with each other. User Application OS CPU Main Memory SSD controller SSD Host I/F DRAM SSD controller Embedded Processor(CPU) Cache Controller ISP HW SRAM ISP SW Channel Controller ISP HW Channel Controller ISP HW NAND Flash NAND Flash SSD NAND Flash NAND Flash Hyeokjun Choe et al. MSST 2017 May 19th, / 33

18 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

19 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

20 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

21 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

22 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

23 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

24 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

25 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

26 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

27 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

28 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

29 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

30 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

31 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

32 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

33 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

34 Parallel SGD Implementation on ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

35 Methodology for IHP-ISP Performance Comparison Ideal Ways to Fairly Compare ISP and IHP 1 Implementing ISP-ML in a real semiconductor chip High chip manufacturing costs 2 Simulating IHP in the ISP-ML framework. High simulation time to simulate IHP 3 Implementing both ISP and IHP using FPGAs. Require another significant development efforts. Hard to fairly compare the performances of ISP and IHP We propose a practical comparison methodology Hyeokjun Choe et al. MSST 2017 May 19th, / 33

36 Methodology for IHP-ISP Performance Comparison (a) Host Real System Simulator ISP Cmd (b) In Host Measure observed IHP execution time(t total ) Measure data IO time(t IO ) IO Trace Extract IO trace while executing application Storage ISP-ML (baseline) ISP-ML (ISP implemented) In SSD (Sim) Measure baseline simulation time with IO trace(t IOsim ) Observed IHP execution time = T total = T nonio + T IO. Expected IHP simulation time = T nonio + T IOsim = T total - T IO + T IOsim. T IO T nonio T IOsim : Data IO latency time of the storage : Non-data IO time : Data IO time of the baseline SSD in ISP-ML Hyeokjun Choe et al. MSST 2017 May 19th, / 33

37 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

38 Setup and Implementation Host specifications CPU 8-core Intel(R) Core i7-3770k (3.50GHz) Main memory DDR3 32GB RAM Storage Samsung SSD 840 Pro OS Ubuntu LTS ISP-ML specifications Embedded processor ARM 926EJ-S (400MHz) FTL DFTL Page size 8KB t prog / t read / t blockerase 300us / 75us / 5ms FPU 0.5 instruction/cycle(pipelined) Dataset x10 amplified MNIST(handwritten digits) Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Performance Comparison:ISP-Based Optimization EASGD showed best performance in this

was faster than Downpour SGD 0.94 (a) 4-Channel 0.94 (b) 8-Channel 0.

82 Synchronous SGD Downpour SGD EASGD Test accuracy 0 2 4 6 8 10 12 Time(sec) 0.

39 Performance Comparison:ISP-Based Optimization EASGD showed best performance in this experiment. x2.96 against synchornous SGD on average. x1.41 against Downpour SGD on average. For 4,8 Ch, synchronous SGD was slower than Downpour SGD For 16 Ch, synchronous SGD was faster than Downpour SGD 0.94 (a) 4-Channel 0.94 (b) 8-Channel 0.94 (c) 16-Channel Test accuracy Synchronous SGD Downpour SGD EASGD Test accuracy Time(sec) Synchronous SGD Downpour SGD EASGD Test accuracy Time(sec) Synchronous SGD Downpour SGD EASGD Time(sec) Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Performance Comparison:IHP versus ISP Compared IHP in memory shortage situation with ISP In large-scale machine learning, the computing systems used may suffer from memory shortages.

40 Performance Comparison:IHP versus ISP Compared IHP in memory shortage situation with ISP In large-scale machine learning, the computing systems used may suffer from memory shortages. Assumption: The host had already loaded all the data to main memory for IHP. ISP-based EASGD with 16 channels obtained the best performance in our experiments Test accuracy ISP(EASGD, 4CH) IHP(2GB-memory) IHP(16GB-memory) ISP(EASGD, 8CH) IHP(4GB-memory) IHP(32GB-memory) ISP(EASGD, 16CH) IHP(8GB-memory) Time(sec) Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Channel Parallelism The speed-up tends to be proportional to the number of channels.

In distributed computing systems, communication bottleneck commonly occurs. 0.94 (a) Synchronous SGD 0.

41 Channel Parallelism The speed-up tends to be proportional to the number of channels. Because the communication overhead in ISP is negligible. In distributed computing systems, communication bottleneck commonly occurs (a) Synchronous SGD 0.94 (b) Downpour SGD Test accuracy Test accuracy (c) EASGD 4-Channel 8-Channel 16-Channel Time(sec) 4-Channel 8-Channel 16-Channel Time(sec) Test accuracy Speed up Channel 8-Channel 16-Channel Time(sec) Synchronous SGD Downpour SGD EASGD (d) Speed up Channel Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Effects of Communication Period in Async.

communication period [τ =16; 64] EASGD Communication period, convergence speed In contrast

90 (a) Downpour SGD 0.92 (b) EASGD Test accuracy 0.80 0.70 0.60 τ = 1 τ = 4 τ = 16 τ = 64 0.

42 Effects of Communication Period in Async. SGD Downpour SGD High speed for a low communication period [τ=1; 4] Unstable for a high communication period [τ =16; 64] EASGD Communication period, convergence speed In contrast to the distributed computing system Because of the low communication overhead 0.90 (a) Downpour SGD 0.92 (b) EASGD Test accuracy τ = 1 τ = 4 τ = 16 τ = Time(sec) Test accuracy τ = 1 τ = 4 τ = 16 τ = Time(sec) Hyeokjun Choe et al. MSST 2017 May 19th, / 33

43 Experimental Results Summary 1 EASGD shows the best performance in our ISP-ML environment. 2 ISP is more efficient than IHP while host suffers from insufficient main memory. ISP may be useful in large scale machine learning. 3 The speed-up by parallelizing is proportional to the number of channels. Because of the ultra fast on-chip communication. 4 The performance of EASGD decreases while the communication period increases unlike conventional distributed system. Hyeokjun Choe et al. MSST 2017 May 19th, / 33

44 Outline 1 Introduction 2 Background 3 Proposed Methodology 4 Experimental Results 5 Discussion and Conclusion Hyeokjun Choe et al. MSST 2017 May 19th, / 33

45 Parallelism in ISP ISP can provide various advantages for data processing involved in machine learning. E.g. ultra-fast on-chip communication Increase energy efficiency, security, and reliability High degree of parallelism could be achieved. By increasing the number of channels inside an SSD. Exploiting a hierarchy of parallelism Distributed systems + ISP-based SSDs Hyeokjun Choe et al. MSST 2017 May 19th, / 33

46 Opportunities for Future Research 1 Implementing deep neural networks in ISP-ML framework 2 Implementing adaptive optimization algorithms E.g. Adagrad and Adadelta 3 Pre-computing metadata during data writes 4 Implementing data shuffling functionality 5 Investigate the effect of NAND flash design on performance Hyeokjun Choe et al. MSST 2017 May 19th, / 33

47 Conclusion Create full-fledged ISP-supporting SSD simulator supporting ML Implement and compare multiple versions of parallel SGD Propose fair comparison methodology between IHP and ISP Intrigue future research opportunities in terms of exploiting the channel parallelism Hyeokjun Choe et al. MSST 2017 May 19th, / 33

48 Acknowledgments Hyeokjun Choe et al. MSST 2017 May 19th, / 33

49 Q/A Hyeokjun Choe et al. MSST 2017 May 19th, / 33

Near-Data Processing for Differentiable Machine Learning Models

Near-Data Processing for Differentiable Machine Learning Models Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2, Sungroh Yoon 1,3 1 Electrical and Computer