A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)


1 A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)

2 Outline: Background and Motivations (Data Partitioning on FPGA; OpenCL on FPGA), Design, Experiment, Conclusion

3 What is Data Partitioning? Data partitioning divides the input table (of tuples) into a number of partitions according to an input partitioning function. It splits the big input table into many small subtables in a divide-and-conquer manner. It is a building block in many database applications (e.g., hash join and aggregation).
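The definition above can be illustrated with a minimal software sketch. It is not the FPGA implementation from the slides; the function name `partition`, the sample table, and the choice of "key mod number of partitions" as the partitioning function are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # (key, payload), matching the <key, payload> tuple format


def partition(table: List[Row], num_partitions: int,
              part_fn: Callable[[int], int]) -> List[List[Row]]:
    """Split `table` into `num_partitions` sub-tables using `part_fn`."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, payload in table:                  # sequential read of the input
        p = part_fn(key) % num_partitions       # partition index of this tuple
        partitions[p].append((key, payload))    # scattered write to partition p
    return partitions


# Assumed example input and partitioning function (identity, then mod 4).
table = [(7, 70), (2, 20), (5, 50), (8, 80), (1, 10)]
parts = partition(table, 4, lambda k: k)
```

Every input tuple lands in exactly one partition, and tuples that share a partition keep their input order.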

4 What is Data Partitioning? [Figure: input tuples are read with sequential memory reads, passed through the partitioning function P, and written to the partitions with random memory writes.] It is a memory-intensive operation.

5 Benchmarking Memory Subsystem [Figure: two panels, sequential memory access and random memory access; y-axis: bandwidth (GB/s); x-axis: data type (Byte, Short, Int, Long, Long2, Long4, Long8).] 1. Sequential bandwidth > random bandwidth.

6 Benchmarking Memory Subsystem [Figure: two panels, sequential memory access (annotated "Sub-linear") and random memory access (annotated "Linear"); y-axis: bandwidth (GB/s); x-axis: data type (Byte, Short, Int, Long, Long2, Long4, Long8).] 1. Sequential bandwidth > random bandwidth. 2. Random memory access is more sensitive to the data access type: use Long8, not Byte.

7 Outline: Background and Motivations (Data Partitioning on FPGA; OpenCL on FPGA), Design, Experiment, Conclusion

8 What is OpenCL? OpenCL has been developed for heterogeneous computing environments, e.g. CPU+GPU/FPGA, with a host-accelerator model of program execution.

9 OpenCL on FPGA. Global memory: external DDR. Local memory: on-chip memory blocks. Pipeline: DSP blocks, memory blocks and logic blocks. [Figure: the OpenCL SDK maps Kernel-1 ... Kernel-N onto the FPGA fabric (BRAM, DSP blocks, memory blocks, logic blocks); each kernel has a pipeline and local memory, and the kernels reach DDR through the global memory interconnect.]

10 OmniDB on FPGA. OmniDB [1]: state-of-the-art OpenCL-based query processor on CPU/GPU. Mature. Good performance. How does OmniDB perform on FPGA? Lock overhead. [1] Shuhao Zhang et al. OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures, VLDB

11 Why is a Lock Required? [Figure: 4 work items apply the partitioning function P to the input tuples; a conflict arises when work items write to the same partition.] Consistency: one lock for each partition.

12 Existing Approaches [Figure: Kernel-1 ... Kernel-N, each with a pipeline and local memory, on top of global memory.] Global lock: high latency, multiple kernels. Local lock: low latency, one kernel.

13 Lock Overhead [Figure: elapsed time (ms) per lock configuration.] Local is better. Legend: global = global lock; local = local lock; xcu = x compute units (kernels); dummy = just get the lock and release the lock.

14 Lock Overhead [Figure: elapsed time (ms) per lock configuration.] Big overhead. Legend: global = global lock; local = local lock; xcu = x compute units (kernels); dummy = just get the lock and release the lock. Neither approach is good enough.

15 Optimal Approach. Global lock: high latency, multiple kernels. Local lock: low latency, one kernel. Optimal: low latency, multiple kernels. We need help from a new OpenCL feature (channel).

16 Impact of Channel [Figure: Kernel 1 and Kernel 2 connected through DDR, compared with Kernel 1 connected to Kernel 2 through a channel.] Kernel: Verilog module. Channel: FIFO.

17 Outline: Background and Motivations, Design, Experiment, Conclusion

18 Our Proposal. Multi-kernel partitioning with channels is presented to attack the lock overhead. On-chip buffers are used to utilize the memory subsystem on FPGAs efficiently.


19 Multi-kernel Partitioning [Figure: in the producer stage, the Data_in kernel reads tuples from DDR and dispatches them over channels. In the consumer stage, the Skewed_handling kernel handles one partition, and Data_out kernels 1 ... DO each handle part of the partitions and write to DDR.] Multiple kernels execute concurrently in a producer-consumer manner.

20 Data_in Kernel
1. Load W tuples from DDR to tuples[W].
2. For (i from 0 to W) do: compute the index j of the consumer kernel for tuple[i]; write tuple[i] to consumer kernel j via channel.
Consumer kernel: Data_out or Skewed_handling kernel.
Dispatch rate: one cycle for one tuple. 1/W memory read transactions.
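The Data_in steps can be sketched in software as follows. This is only a behavioral model under stated assumptions, not the OpenCL kernel from the slides: channels are modeled as `collections.deque` FIFOs, the partitioning function is assumed to be "key mod number of partitions", and the routing rule (the skewed partition goes to the Skewed_handling kernel, every other partition p goes to Data_out kernel p mod DO) is hypothetical.

```python
from collections import deque


def data_in(ddr, W, num_partitions, DO, skewed_partition):
    """Dispatch tuples from `ddr` to DO Data_out channels plus one skew channel."""
    # channels[0..DO-1] feed the Data_out kernels; channels[DO] feeds Skewed_handling.
    channels = [deque() for _ in range(DO + 1)]
    read_transactions = 0
    for base in range(0, len(ddr), W):
        tuples = ddr[base:base + W]        # step 1: one load of up to W tuples
        read_transactions += 1
        for key, payload in tuples:        # step 2: dispatch each tuple
            p = key % num_partitions       # assumed partitioning function
            # Hypothetical routing rule for the consumer index j.
            j = DO if p == skewed_partition else p % DO
            channels[j].append((key, payload))
    return channels, read_transactions


# Assumed example: 16 tuples, W = 4, 8 partitions, 2 Data_out kernels, partition 3 skewed.
ddr = [(k, k * 10) for k in range(16)]
channels, reads = data_in(ddr, W=4, num_partitions=8, DO=2, skewed_partition=3)
```

With W = 4 and 16 tuples, the model issues 4 read transactions, which matches the 1/W transactions per tuple stated on the slide.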

21 Data_out Kernel
1. Read a tuple from the Data_in kernel via channel.
2. Compute the partition index of the tuple.
3. Update the counter (local) of the partition.
4. Store the tuple to the on-chip buffer.
5. If (buffer has S tuples) then store the whole buffer to global memory.
Lock handling rate: seven cycles for one tuple. 1/S memory write transactions.
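A behavioral sketch of the five Data_out steps, again as a software model rather than the actual kernel. The names `data_out`, `buffers`, and `global_memory` are illustrative, the partitioning function is assumed to be "key mod number of partitions", and the final flush of partially filled buffers is an assumption that the slide does not show.

```python
from collections import deque


def data_out(channel, num_partitions, S):
    """Drain `channel`, buffering S tuples per partition before each write."""
    counters = [0] * num_partitions
    buffers = [[] for _ in range(num_partitions)]        # on-chip buffers, S slots each
    global_memory = [[] for _ in range(num_partitions)]  # stand-in for DDR
    write_transactions = 0
    while channel:
        key, payload = channel.popleft()      # step 1: read tuple from the channel
        p = key % num_partitions              # step 2: assumed partitioning function
        counters[p] += 1                      # step 3: update the partition counter
        buffers[p].append((key, payload))     # step 4: store to the on-chip buffer
        if len(buffers[p]) == S:              # step 5: buffer full, write it out
            global_memory[p].extend(buffers[p])
            buffers[p].clear()
            write_transactions += 1
    # Assumption (not on the slide): flush partially filled buffers at the end.
    for p in range(num_partitions):
        if buffers[p]:
            global_memory[p].extend(buffers[p])
            buffers[p].clear()
            write_transactions += 1
    return global_memory, counters, write_transactions


# Assumed example: 16 tuples, 2 partitions, S = 4 slots per buffer.
mem, counters, writes = data_out(deque((k, k) for k in range(16)), 2, 4)
```

Sixteen tuples with S = 4 produce 4 write transactions, i.e. 1/S transactions per tuple when every buffer fills completely.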

22 Skewed_handling Kernel
1. Read a tuple from the Data_in kernel via channel.
2. Update the counter (private) of the skewed partition.
3. Store the tuple to the on-chip buffer.
4. If (buffer has S tuples) then store the whole buffer to global memory.
Lock handling rate: one cycle for one tuple. 1/S memory write transactions.

23 Cost Model. Given the limited FPGA resources, choosing the optimal configuration for two parameters is challenging. DO: number of Data_out kernels at the consumer stage, [1, 2, 4, 8, 16]. S: number of slots in the on-chip buffer for each partition, [1, 2, 4, 8, 16, 32]. The ranges of S and DO are small, so we consider all the possible combinations. The cost model is required to predict the performance for each combination.
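Because the two ranges are small, the search is a plain enumeration of 5 x 6 = 30 combinations. The sketch below shows that enumeration only. The slides do not give the cost model itself, so `toy_cost` and `fits` are hypothetical stand-ins with made-up constants (100 cycles per write transaction, a budget of 128 buffer slots); only the one-cycle dispatch rate and the seven-cycle Data_out rate come from the slides.

```python
from itertools import product

DO_CHOICES = [1, 2, 4, 8, 16]       # number of Data_out kernels (from the slide)
S_CHOICES = [1, 2, 4, 8, 16, 32]    # buffer slots per partition (from the slide)


def toy_cost(do, s):
    """Hypothetical per-tuple cost in cycles. NOT the cost model from the slides."""
    dispatch = 1.0        # producer: one cycle per tuple
    consume = 7.0 / do    # consumers: seven cycles per tuple, spread over DO kernels
    memory = 100.0 / s    # assumed 100 cycles per write, 1/S writes per tuple
    return max(dispatch, consume, memory)   # the slowest stage bounds throughput


def fits(do, s, budget=128):
    """Hypothetical resource check: buffer demand grows with DO * S."""
    return do * s <= budget


def best_configuration(cost=toy_cost, feasible=fits):
    """Evaluate every feasible (DO, S) pair and return the cheapest one."""
    candidates = [(do, s) for do, s in product(DO_CHOICES, S_CHOICES)
                  if feasible(do, s)]
    return min(candidates, key=lambda c: cost(*c))


best = best_configuration()
```

With these made-up constants the search returns (4, 32), which differs from the measured optimum reported in the experiments. A real cost model would be passed in through the `cost` and `feasible` parameters.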

24 Outline: Background and Motivations, Design, Experiment, Conclusion

25 Experimental Setup
Platform: Terasic's DE5-Net board: Altera Stratix V A7 and 4 GB 2-bank DDR3. Altera OpenCL SDK version
Data sets: Tuple format: <key, payload>. Both keys and payloads are 4 bytes. The probability of individual keys follows a Zipf distribution, with the Zipf factor in [0, 1.75].
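A data set of this shape could be generated roughly as follows. This is a sketch, not the authors' generator: it assumes a bounded Zipf distribution over a finite key domain (weight 1/rank^z for the key of a given rank), which stays well defined for the whole range [0, 1.75], including z = 0 (uniform). The function name `zipf_tuples` and its parameters are illustrative.

```python
import random


def zipf_tuples(n, num_keys, z, seed=0):
    """Return n <key, payload> tuples whose keys follow a bounded Zipf(z) law."""
    rng = random.Random(seed)
    # Key 0 has rank 1 and is the most frequent; z = 0 gives uniform weights.
    weights = [1.0 / (rank ** z) for rank in range(1, num_keys + 1)]
    keys = rng.choices(range(num_keys), weights=weights, k=n)
    # Payloads are arbitrary 32-bit values, so key and payload both fit in 4 bytes.
    return [(key, rng.getrandbits(32)) for key in keys]


uniform = zipf_tuples(10_000, 16, 0.0)    # Zipf factor 0: no skew
skewed = zipf_tuples(10_000, 16, 1.75)    # Zipf factor 1.75: heavy skew
hot_uniform = sum(1 for k, _ in uniform if k == 0)
hot_skewed = sum(1 for k, _ in skewed if k == 0)
```

At a Zipf factor of 1.75 the most frequent key accounts for a large share of all tuples, which is the situation the Skewed_handling kernel targets.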

26 Evaluation of Cost Model [Figure: elapsed time (ms) versus S for DO = 8; series: Memory, Lock, Measured, Estimated.] We omit the cases (DO = 1, 2, 4, 16). Our cost model can roughly predict the performance for each combination. Optimal combination: (DO = 8, S = 16).

27 Impact of Skewed_handling Kernel [Figure: elapsed time (ms) versus Zipf factor; series: Original, Skewed_handling; annotation: 3.1X.] Significant speedup for the skewed data set. Optimal combination: (DO = 8, S = 16).

28 Impact of Number of Partitions [Figure: elapsed time (ms) versus number of partitions, up to 16K; series: local_1cu, multi-kernel.] More stable. Optimal combination: (DO = 8, S = 16).

29 Impact of Number of Tuples [Figure: elapsed time (ms) versus number of tuples (8M, 16M, 32M, 64M, 128M, 196M); series: local_1cu, multi-kernel; annotation: 10.7X.] Good scalability. Optimal combination: (DO = 8, S = 16).

30 Outline: Background and Motivations, Design, Experiment, Conclusion

31 Conclusion. We demonstrate the significant overheads of data partitioning on FPGAs. We develop a new multi-kernel partitioning approach with on-chip buffers. Our proposed approach can achieve a 10.7X speedup over the existing implementation. Future work: we want to accelerate all the database operators on OpenCL-based FPGAs.

32 Q & A. Our Terasic DE5-Net FPGA board was donated by the Altera University Program. Our research group: Xtra Computing Group
